Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello! I am using DRBD v0.7.16 on top of set of software RAIDs: md8 = raid1 (sdc1, (md6 = raid1 (sdb1, sda1))); // DRBD-meta md9 = raid1 (sdc2, (md7 = raid0 (sda2, sdb2))); // DRBD-data The size of DRBD-data is about 400 GB. Drives sda and sdb are plugged into onboard SATA controller and sdc is plugged in external PCI SATA controller. This is done to be able to hot remove drive from the system to process data on another server. Since modules handling SATA controllers are different for internal and external controllers in our case (sata_promise vs sata_sil), it is possible to do a hot swap of the drive by unloading sata_sil module. And after processing data we just plug drive back and make full RAID synchronization, so we don't afraid to mount removable drive on another server read/write and make it out of sync. If it is important filesystem on drbd device was recently formatted and last synchronization was relatively long ago, so there are must be much data out of sync. That was a preface. I just don't know whether these details are significant or not. The story is that (at least when DRBD on second node is shut down) DRBD starts in about of 1,5 minutes. This is suspiciously long. DRBD's init-script just displays Starting DRBD resources: [ d0 and you must to wait about 1.5 minutes before it continues. After that time everything is ok. Looking in /var/log/messages I see following (output of start process): Feb 26 05:56:48 fileserv1 kernel: drbd: initialised. Version: 0.7.16 (api:77/proto:74) Feb 26 05:56:48 fileserv1 kernel: drbd: SVN Revision: 2066 build by root at fileserv1, 2006-02-15 02:04:23 Feb 26 05:56:48 fileserv1 kernel: drbd: registered as block device major 147 Feb 26 05:56:48 fileserv1 kernel: klogd 1.4.1, ---------- state change ---------- Feb 26 05:56:49 fileserv1 kernel: drbd0: resync bitmap: bits=95217088 words=2975534 Feb 26 05:56:49 fileserv1 kernel: drbd0: size = 363 GB (380868352 KB) ######### ^^^^^ A LOOOONG DELAY ############################################## Feb 26 05:58:22 fileserv1 kernel: drbd0: 363 GB marked out-of-sync by on disk bit-map. Feb 26 05:58:22 fileserv1 kernel: drbd0: Found 6 transactions (324 active extents) in activity log. Feb 26 05:58:22 fileserv1 kernel: drbd0: drbdsetup [13313]: cstate Unconfigured --> StandAlone Feb 26 05:58:23 fileserv1 kernel: drbd0: drbdsetup [13327]: cstate StandAlone --> Unconnected Feb 26 05:58:23 fileserv1 kernel: drbd0: drbd0_receiver [13328]: cstate Unconnected --> WFConnection Than I have compiled DRBD with maximum debug options enabled as I could found in drbd_config.h. Here is output of how it starts from /var/log/messages: Feb 26 06:01:27 fileserv1 kernel: drbd: initialised. Version: 0.7.16 (api:77/proto:74) Feb 26 06:01:27 fileserv1 kernel: drbd: SVN Revision: 2066 build by root at fileserv1, 2006-02-26 06:00:38 Feb 26 06:01:27 fileserv1 kernel: drbd: registered as block device major 147 Feb 26 06:01:27 fileserv1 kernel: klogd 1.4.1, ---------- state change ---------- Feb 26 06:01:28 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,0,<NULL>) Feb 26 06:01:28 fileserv1 kernel: drbd0: drbd_bm_resize: bm_set=95217088 Feb 26 06:01:28 fileserv1 kernel: drbd0: bm[2975533]=0xFFFFFFFF Feb 26 06:01:28 fileserv1 kernel: drbd0: resync bitmap: bits=95217088 words=2975534 Feb 26 06:01:28 fileserv1 kernel: drbd0: size = 363 GB (380868352 KB) Feb 26 06:01:28 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,72,<NULL>) Feb 26 06:01:28 fileserv1 kernel: drbd0: read_sect: sector=0 offset=0 num_words=128 Feb 26 06:01:28 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,73,<NULL>) Feb 26 06:01:28 fileserv1 kernel: drbd0: read_sect: sector=1 offset=128 num_words=128 Feb 26 06:01:28 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,74,<NULL>) Feb 26 06:01:28 fileserv1 kernel: drbd0: read_sect: sector=2 offset=256 num_words=128 ... SKIPPED ... SKIPPED ... SKIPPED ... SKIPPED ... SKIPPED ... SKIPPED ... Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,23317,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: read_sect: sector=23245 offset=2975360 num_words=128 Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,23318,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: read_sect: sector=23246 offset=2975488 num_words=46 Feb 26 06:03:01 fileserv1 kernel: drbd0: drbd_bm_set_lel: bm_set=95217088 Feb 26 06:03:01 fileserv1 kernel: drbd0: bm[2975533]=0xFFFFFFFF Feb 26 06:03:01 fileserv1 kernel: drbd0: 363 GB marked out-of-sync by on disk bit-map. Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,0,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,8,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,9,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,10,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,11,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,12,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,13,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,12,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,13,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,8,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,9,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,10,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]:drbd_md_sync_page_io(,11,<NULL>) Feb 26 06:03:01 fileserv1 kernel: drbd0: Found 6 transactions (324 active extents) in activity log. Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13769]: cstate Unconfigured --> StandAlone Feb 26 06:03:01 fileserv1 kernel: drbd0: drbdsetup [13785]: cstate StandAlone --> Unconnected Feb 26 06:03:01 fileserv1 kernel: drbd0: drbd0_receiver [13786]: cstate Unconnected --> WFConnection So, I don't know what drbd_md_sync_page_io and read_sect means but 23246 sectors is about of 11 MB (provided the sector size of 512 bytes). I think, it is too slow to read or write 11 MB for 1,5 minutes. I hope that here happens something else. And the question is what is it and how to fix? :-) Is it because of RAID or because of unavailable second node and much data out of sync? Thanks.