Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello. My DRBD settings are shown below. Can anyone suggest how to investigate this, which settings to review, or how to isolate the problem? Please also let me know if any additional information is needed. (A sketch of survey commands is appended at the end of this message.)

-----
# cd /etc
# cat drbd.conf
include "drbd.d/global_common.conf";
include "drbd.d/*.res";

resource r0 {
  protocol C;
  on ruth1 {
    device    /dev/drbd0;
    disk      /dev/mapper/drbd-drbdlv;
    address   XXX.XXX.XXX.XXX:7780;
    meta-disk internal;
  }
  on ruth2 {
    device    /dev/drbd0;
    disk      /dev/mapper/drbd-drbdlv;
    address   XXX.XXX.XXX.XXX:7780;
    meta-disk internal;
  }
}

# cat drbd.d/global_common.conf
# DRBD is the result of over a decade of development by LINBIT.
# In case you need professional services for DRBD or have
# feature requests visit http://www.linbit.com

global {
  usage-count yes;
  # minor-count dialog-refresh disable-ip-verification
  # cmd-timeout-short 5; cmd-timeout-medium 121; cmd-timeout-long 600;
}

common {
  handlers {
    # These are EXAMPLE handlers only.
    # They may have severe implications,
    # like hard resetting the node under certain circumstances.
    # Be careful when chosing your poison.

    # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
    # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
    # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
  }
  startup {
    # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
  }
  options {
    # cpu-mask on-no-data-accessible
  }
  disk {
    # size on-io-error fencing disk-barrier disk-flushes
    # disk-drain md-flushes resync-rate resync-after al-extents
    # c-plan-ahead c-delay-target c-fill-target c-max-rate
    # c-min-rate disk-timeout
  }
  net {
    # protocol timeout max-epoch-size max-buffers unplug-watermark
    # connect-int ping-int sndbuf-size rcvbuf-size ko-count
    # allow-two-primaries cram-hmac-alg shared-secret after-sb-0pri
    # after-sb-1pri after-sb-2pri always-asbp rr-conflict
    # ping-timeout data-integrity-alg tcp-cork on-congestion
    # congestion-fill congestion-extents csums-alg verify-alg
    # use-rle
  }
}
-----

Best regards!

2017-02-03 15:32 GMT+09:00 Seiichirou Hiraoka <seiichirou.hiraoka at gmail.com>:
> Hello.
>
> I use DRBD in the following environment.
>
> OS: Red Hat Enterprise Linux 7.1
> Pacemaker: 1.1.12 (CentOS Repository)
> Corosync: 2.3.4 (CentOS Repository)
> DRBD: 8.4.9 (ELRepo)
> # rpm -qi drbd84-utils
> Name        : drbd84-utils
> Version     : 8.9.2
> Release     : 2.el7.elrepo
> Architecture: x86_64
> Vendor      : The ELRepo Project (http://elrepo.org)
> # rpm -qi kmod-drbd84
> Name        : kmod-drbd84
> Version     : 8.4.9
> Release     : 1.el7.elrepo
> Architecture: x86_64
>
> DRBD is running on two servers (server1 and server2), but the following
> error messages suddenly appeared and writing to the DRBD area could no
> longer be performed.
>
> . server1 (master)
> Jan 20 10:41:16 server1 kernel: block drbd0: local WRITE IO error sector 118616936+40 on dm-0
> Jan 20 10:41:16 server1 kernel: block drbd0: disk( UpToDate -> Failed )
> Jan 20 10:41:16 server1 kernel: block drbd0: Local IO failed in __req_mod. Detaching...
> Jan 20 10:41:16 server1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> Jan 20 10:41:16 server1 kernel: block drbd0: disk( Failed -> Diskless )
> Jan 20 10:41:16 server1 kernel: block drbd0: Got NegDReply; Sector 117512416s, len 4096.
> Jan 20 10:41:16 server1 kernel: drbd0: WRITE SAME failed. Manually zeroing.
> Jan 20 10:41:16 server1 kernel: block drbd0: Should have called drbd_al_complete_io(, 118616936, 20480), but my Disk seems to have failed :(
> Jan 20 10:41:16 server1 kernel: block drbd0: pdsk( UpToDate -> Failed )
> Jan 20 10:41:16 server1 kernel: block drbd0: pdsk( Failed -> Diskless )
> Jan 20 10:41:16 server1 kernel: block drbd0: drbd_req_destroy: Logic BUG rq_state = 0x90040, completion_ref = 0
> Jan 20 10:41:16 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117512416+8
> Jan 20 10:41:16 server1 kernel: EXT4-fs warning (device drbd0): __ext4_read_dirblock:1375: error reading directory block (ino 3690583, block 10)
> Jan 20 10:41:16 server1 kernel: block drbd0: helper command: /sbin/drbdadm pri-on-incon-degr minor-0
> Jan 20 10:41:16 server1 kernel: EXT4-fs (drbd0): Delayed block allocation failed for inode 4326138 at logical offset 1 with max blocks 1 with error 5
> Jan 20 10:41:16 server1 kernel: EXT4-fs (drbd0): This should not happen!! Data will be lost
> Jan 20 10:41:16 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117512384+8
> Jan 20 10:41:16 server1 kernel: EXT4-fs warning (device drbd0): __ext4_read_dirblock:1375: error reading directory block (ino 3690583, block 6)
> Jan 20 10:41:16 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117512296+8
> Jan 20 10:41:16 server1 kernel: EXT4-fs warning (device drbd0): __ext4_read_dirblock:1375: error reading directory block (ino 3670162, block 1)
> Jan 20 10:41:16 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117512296+8
> Jan 20 10:41:16 server1 kernel: EXT4-fs warning (device drbd0): __ext4_read_dirblock:1375: error reading directory block (ino 3670162, block 1)
> Jan 20 10:41:16 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117512296+8
> Jan 20 10:41:16 server1 kernel: EXT4-fs warning (device drbd0): __ext4_read_dirblock:1375: error reading directory block (ino 3670162, block 1)
> Jan 20 10:41:16 server1 kernel: Aborting journal on device drbd0-8.
> Jan 20 10:41:16 server1 kernel: Buffer I/O error on dev drbd0, logical block 25722880, lost sync page write
> Jan 20 10:41:16 server1 kernel: JBD2: Error -5 detected when updating journal superblock for drbd0-8.
> Jan 20 10:41:16 server1 kernel: Buffer I/O error on dev drbd0, logical block 0, lost sync page write
> Jan 20 10:41:16 server1 kernel: EXT4-fs error (device drbd0): ext4_journal_check_start:56: Detected aborted journal
> Jan 20 10:41:16 server1 kernel: EXT4-fs (drbd0): Remounting filesystem read-only
> Jan 20 10:41:16 server1 kernel: EXT4-fs (drbd0): previous I/O error to superblock detected
> Jan 20 10:41:16 server1 kernel: Buffer I/O error on dev drbd0, logical block 0, lost sync page write
> Jan 20 10:41:16 server1 kernel: block drbd0: helper command: /sbin/drbdadm pri-on-incon-degr minor-0 exit code 0 (0x0)
> Jan 20 10:41:25 server1 kernel: block drbd0: 122 messages suppressed in /home/build/akemi/drbd84/el7/BUILD/drbd-8.4.9-1/drbd/drbd_req.c:1447.
> Jan 20 10:41:25 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 134218048+8
> Jan 20 10:41:25 server1 kernel: Buffer I/O error on dev drbd0, logical block 16777256, lost async page write
> Jan 20 10:41:25 server1 kernel: Buffer I/O error on dev drbd0, logical block 14680354, lost async page write
> Jan 20 10:41:25 server1 kernel: Buffer I/O error on dev drbd0, logical block 14680421, lost async page write
> Jan 20 10:41:25 server1 kernel: Buffer I/O error on dev drbd0, logical block 14680422, lost async page write
> Jan 20 10:41:25 server1 kernel: Buffer I/O error on dev drbd0, logical block 14680437, lost async page write
> Jan 20 10:41:25 server1 kernel: Buffer I/O error on dev drbd0, logical block 14680491, lost async page write
> Jan 20 10:41:25 server1 kernel: Buffer I/O error on dev drbd0, logical block 14680492, lost async page write
> Jan 20 10:41:25 server1 kernel: Buffer I/O error on dev drbd0, logical block 14680493, lost async page write
> Jan 20 10:41:25 server1 kernel: Buffer I/O error on dev drbd0, logical block 14680494, lost async page write
> Jan 20 10:41:25 server1 kernel: Buffer I/O error on dev drbd0, logical block 14680495, lost async page write
> Jan 20 10:46:50 server1 kernel: block drbd0: 168 messages suppressed in /home/build/akemi/drbd84/el7/BUILD/drbd-8.4.9-1/drbd/drbd_req.c:1447.
> Jan 20 10:46:50 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117512416+8
> Jan 20 10:46:50 server1 kernel: EXT4-fs warning (device drbd0): __ext4_read_dirblock:1375: error reading directory block (ino 3690583, block 10)
> Jan 20 10:47:03 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117512440+8
> Jan 20 10:47:03 server1 kernel: EXT4-fs warning (device drbd0): __ext4_read_dirblock:1375: error reading directory block (ino 3690583, block 13)
> Jan 20 10:47:03 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117512440+8
> Jan 20 10:47:03 server1 kernel: EXT4-fs warning (device drbd0): __ext4_read_dirblock:1375: error reading directory block (ino 3690583, block 13)
> Jan 20 10:47:03 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117512440+8
> Jan 20 10:47:03 server1 kernel: EXT4-fs warning (device drbd0): __ext4_read_dirblock:1375: error reading directory block (ino 3690583, block 13)
> Jan 20 10:47:03 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117450240+8
> Jan 20 10:47:03 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117450248+8
> Jan 20 10:47:03 server1 kernel: EXT4-fs error (device drbd0): __ext4_get_inode_loc:3979: inode #3688997: block 14681282: comm xymond_history: unable to read itable block
> Jan 20 10:47:03 server1 kernel: EXT4-fs (drbd0): previous I/O error to superblock detected
> Jan 20 10:47:03 server1 kernel: buffer_io_error: 159 callbacks suppressed
> Jan 20 10:47:03 server1 kernel: Buffer I/O error on dev drbd0, logical block 0, lost sync page write
> Jan 20 10:47:03 server1 kernel: EXT4-fs error (device drbd0): __ext4_get_inode_loc:3979: inode #3688997: block 14681282: comm xymond_history: unable to read itable block
> Jan 20 10:47:03 server1 kernel: EXT4-fs (drbd0): previous I/O error to superblock detected
> Jan 20 10:47:03 server1 kernel: Buffer I/O error on dev drbd0, logical block 0, lost sync page write
> Jan 20 10:47:03 server1 kernel: EXT4-fs error (device drbd0): __ext4_get_inode_loc:3979: inode #3688997: block 14681282: comm xymond_history: unable to read itable block
> Jan 20 10:47:03 server1 kernel: EXT4-fs (drbd0): previous I/O error to superblock detected
> Jan 20 10:47:03 server1 kernel: Buffer I/O error on dev drbd0, logical block 0, lost sync page write
> Jan 20 10:51:59 server1 kernel: block drbd0: 100 messages suppressed in /home/build/akemi/drbd84/el7/BUILD/drbd-8.4.9-1/drbd/drbd_req.c:1447.
> Jan 20 10:51:59 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 4194312+8
> Jan 20 10:51:59 server1 kernel: EXT4-fs error (device drbd0): ext4_wait_block_bitmap:494: comm xymond_history: Cannot read block bitmap - block_group = 17, block_bitmap = 524289
> Jan 20 10:51:59 server1 kernel: EXT4-fs (drbd0): previous I/O error to superblock detected
> Jan 20 10:51:59 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 0+8
> Jan 20 10:51:59 server1 kernel: Buffer I/O error on dev drbd0, logical block 0, lost sync page write
> Jan 20 10:51:59 server1 kernel: EXT4-fs error (device drbd0): ext4_discard_preallocations:3995: comm xymond_history: Error loading buddy information for 17
> Jan 20 10:51:59 server1 kernel: EXT4-fs (drbd0): previous I/O error to superblock detected
> Jan 20 10:51:59 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 0+8
> Jan 20 10:51:59 server1 kernel: Buffer I/O error on dev drbd0, logical block 0, lost sync page write
> Jan 20 10:54:42 server1 kernel: xymond_history[14970]: segfault at 0 ip 00007fb37d6b5b3d sp 00007ffc0a184090 error 4 in libc-2.17.so[7fb37d648000+1b7000]
> Jan 20 10:55:42 server1 kernel: xymond_history[16324]: segfault at 0 ip 00007f18816dab3d sp 00007ffd91bd2b40 error 4 in libc-2.17.so[7f188166d000+1b7000]
> Jan 20 10:55:58 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 118411408+8
> Jan 20 10:55:58 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 118411408+8
> Jan 20 10:55:58 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117506384+8
> Jan 20 10:55:58 server1 kernel: EXT4-fs error (device drbd0): ext4_find_entry:1312: inode #3670213: comm httpd: reading directory lblock 0
> Jan 20 10:55:58 server1 kernel: EXT4-fs (drbd0): previous I/O error to superblock detected
> Jan 20 10:55:58 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 0+8
> Jan 20 10:55:58 server1 kernel: Buffer I/O error on dev drbd0, logical block 0, lost sync page write
> Jan 20 10:55:58 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 118411408+8
> Jan 20 10:56:10 server1 kernel: block drbd0: 13 messages suppressed in /home/build/akemi/drbd84/el7/BUILD/drbd-8.4.9-1/drbd/drbd_req.c:1447.
> Jan 20 10:56:10 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117854864+8
> Jan 20 10:56:10 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117854864+8
> Jan 20 10:56:14 server1 kernel: block drbd0: 2 messages suppressed in /home/build/akemi/drbd84/el7/BUILD/drbd-8.4.9-1/drbd/drbd_req.c:1447.
> Jan 20 10:56:14 server1 kernel: block drbd0: IO ERROR: neither local nor remote data, sector 117854864+8
>
> ...
>
> I run xymon on the DRBD area.
>
> . server2 (slave)
> Jan 20 10:41:16 server2 kernel: block drbd0: disk( UpToDate -> Failed )
> Jan 20 10:41:16 server2 kernel: block drbd0: Local IO failed in drbd_endio_write_sec_final. Detaching...
> Jan 20 10:41:16 server2 kernel: block drbd0: pdsk( UpToDate -> Failed )
> Jan 20 10:41:16 server2 kernel: block drbd0: Can not satisfy peer's read request, no local data.
> Jan 20 10:41:16 server2 kernel: block drbd0: pdsk( Failed -> Diskless )
> Jan 20 10:41:16 server2 kernel: block drbd0: 20 KB (5 bits) marked out-of-sync by on disk bit-map.
> Jan 20 10:41:16 server2 kernel: block drbd0: disk( Failed -> Diskless )
> Jan 20 11:29:25 server2 kernel: drbd r0: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( Diskless -> DUnknown )
> Jan 20 11:29:25 server2 kernel: drbd r0: ack_receiver terminated
> Jan 20 11:29:25 server2 kernel: drbd r0: Terminating drbd_a_r0
> Jan 20 11:29:25 server2 kernel: drbd r0: Connection closed
> Jan 20 11:29:25 server2 kernel: drbd r0: conn( Disconnecting -> StandAlone )
> Jan 20 11:29:25 server2 kernel: drbd r0: receiver terminated
> Jan 20 11:29:25 server2 kernel: drbd r0: Terminating drbd_r_r0
> Jan 20 11:29:25 server2 kernel: drbd r0: Terminating drbd_w_r0
>
> ...
>
> After stopping Pacemaker on server2, I restarted the OS of server1.
> After that, I stopped Pacemaker on server2 and restarted the OS of
> server2. As a result, it recovered.
>
> Since the disks of both servers are located on SAN storage and the
> servers run on virtual infrastructure, we believe there is no physical
> disk failure.
>
> Please give advice on how to isolate the problem and prevent recurrence.
>
> Best regards!
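
For reference, the recovery sequence described in the quoted message corresponds roughly to the commands below. This is only a sketch of that sequence, not a recommended procedure; it assumes Pacemaker/Corosync are managed with pcs (the message does not say which tool was used).

# on server2 (assumption: Pacemaker/Corosync are managed with pcs)
pcs cluster stop            # keep server2 out of the cluster during recovery

# on server1
systemctl reboot            # restart the OS of server1

# on server2, once server1 is back up
systemctl reboot            # restart the OS of server2

# on either node, after both servers are up again
cat /proc/drbd              # expect cs:Connected and ds:UpToDate/UpToDate
drbdadm dstate r0           # disk state of the local and peer backing devices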
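
Regarding the "survey method" requested at the top of this message, here is a minimal sketch of commands for inspecting the r0 resource defined in drbd.conf above. The drbdadm and /proc/drbd commands are standard for DRBD 8.4; the device-mapper and dmesg checks are assumptions based on the backing disk /dev/mapper/drbd-drbdlv and the dm-0 device named in the kernel log.

# DRBD state as reported by the kernel module
cat /proc/drbd                  # cs: connection state, ro: roles, ds: disk states

# per-resource state and effective configuration
drbdadm role r0                 # Primary/Secondary
drbdadm cstate r0               # Connected, StandAlone, ...
drbdadm dstate r0               # UpToDate/UpToDate, Diskless/..., ...
drbdadm dump r0                 # configuration actually in use

# health of the backing device referenced by the 'disk' keyword
dmsetup status drbd-drbdlv      # device-mapper status of the backing volume
dmesg | grep -E 'dm-0|drbd0'    # kernel messages for the backing device and drbd0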