[DRBD-user] Error after update to 9.0.8+linbit-1

Frank Rust f.rust at tu-braunschweig.de
Mon Aug 14 16:31:57 CEST 2017



Thanks for the quick reply!

I think it might have started at this point in the kernel log:

Aug 14 12:54:04 virt2 kernel: [228497.005014] drbd vm-109-disk-1 virt5: BAD! BarrierAck #1488589 received, expected #1547462!
Aug 14 12:54:04 virt2 kernel: [228497.005122] drbd vm-109-disk-1 virt5: conn( Connected -> ProtocolError ) peer( Secondary -> Unknown )
Aug 14 12:54:04 virt2 kernel: [228497.005124] drbd vm-109-disk-1/0 drbd176 virt5: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Aug 14 12:54:04 virt2 kernel: [228497.005172] drbd vm-109-disk-1 virt5: ack_receiver terminated
Aug 14 12:54:04 virt2 kernel: [228497.005173] drbd vm-109-disk-1 virt5: Terminating ack_recv thread
Aug 14 12:54:04 virt2 kernel: [228497.005202] drbd vm-109-disk-1/0 drbd176 virt5: helper command: /sbin/drbdadm pri-on-incon-degr
Aug 14 12:54:04 virt2 kernel: [228497.006237] drbd vm-109-disk-1/0 drbd176 virt5: helper command: /sbin/drbdadm pri-on-incon-degr exit code 0 (0x0)
Aug 14 12:54:04 virt2 kernel: [228497.131384] drbd .drbdctrl/1 drbd1 virt5: pdsk( Outdated -> UpToDate )
Aug 14 12:54:04 virt2 kernel: [228497.318075] drbd vm-109-disk-1 virt5: Connection closed
Aug 14 12:54:04 virt2 kernel: [228497.409379] drbd vm-109-disk-1 virt5: conn( ProtocolError -> Unconnected )
Aug 14 12:54:04 virt2 kernel: [228497.409395] drbd vm-109-disk-1 virt5: Restarting receiver thread
Aug 14 12:54:04 virt2 kernel: [228497.409413] drbd vm-109-disk-1 virt5: conn( Unconnected -> Connecting )
Aug 14 12:54:06 virt2 kernel: [228498.703332] drbd vm-109-disk-1 virt5: Handshake to peer 0 successful: Agreed network protocol version 112
Aug 14 12:54:06 virt2 kernel: [228498.703337] drbd vm-109-disk-1 virt5: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
Aug 14 12:54:06 virt2 kernel: [228498.703527] drbd vm-109-disk-1 virt5: Peer authenticated using 20 bytes HMAC
Aug 14 12:54:06 virt2 kernel: [228498.703535] drbd vm-109-disk-1 virt5: Starting ack_recv thread (from drbd_r_vm-109-d [65200])
Aug 14 12:54:07 virt2 kernel: [228499.887977] drbd_send_and_submit: 342135 callbacks suppressed
Aug 14 12:54:07 virt2 kernel: [228499.887981] drbd vm-109-disk-1/0 drbd176: IO ERROR: neither local nor remote data, sector 28975336+8
Aug 14 12:54:07 virt2 kernel: [228499.888098] drbd vm-109-disk-1/0 drbd176: IO ERROR: neither local nor remote data, sector 31496168+8
Aug 14 12:54:07 virt2 kernel: [228499.888101] buffer_io_error: 353007 callbacks suppressed
Aug 14 12:54:07 virt2 kernel: [228499.888103] Buffer I/O error on dev drbd176, logical block 3621917, lost async page write
Aug 14 12:54:07 virt2 kernel: [228499.888463] Buffer I/O error on dev drbd176, logical block 3937021, lost async page write


After that, thousands of similar lines follow (Buffer I/O error, drbd vm-109-disk-1/0 …, and suppressed callbacks).
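Since the full log is far too long to post, here is a hedged sketch of how the relevant context could be extracted (file path is an example; on systemd hosts `journalctl -k` can be piped in instead). A tiny sample log stands in for the real one:

```shell
# On the real host you would grep the actual kernel log, e.g.:
#   grep -C 50 'BAD! BarrierAck' /var/log/kern.log > barrierack-context.txt
# Illustration with a small stand-in file and 1 line of context:
printf '%s\n' \
  'Aug 14 12:54:03 virt2 kernel: drbd vm-109-disk-1 virt5: some earlier line' \
  'Aug 14 12:54:04 virt2 kernel: drbd vm-109-disk-1 virt5: BAD! BarrierAck #1488589 received, expected #1547462!' \
  'Aug 14 12:54:04 virt2 kernel: drbd vm-109-disk-1 virt5: conn( Connected -> ProtocolError )' \
  > /tmp/kern.sample
grep -C 1 'BAD! BarrierAck' /tmp/kern.sample
```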


For a while it looked as if the whole cluster would come down: pvestatd hangs, machines not responding, disks marked Outdated.
So I stopped the disk migration, removed the resource, and restarted the drbdmanage daemon on the involved nodes.
After a few minutes it mysteriously healed itself.

In a new attempt I changed the redundancy default in Proxmox from 1 to 2 (which never worked on the earlier versions). After that it seems possible to migrate machine images onto the DRBD storage, even though it initially looked very bad (half of the nodes had .drbdctrl/1 Outdated and the like). In the end the transfer worked.
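For reference, the redundancy setting lives in the Proxmox storage definition. A sketch of what the drbdmanage-backed entry might look like in /etc/pve/storage.cfg (the storage name "drbdstorage" is an example):

```
drbd: drbdstorage
        content images,rootdir
        redundancy 2
```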


The question now is: why does this happen, and how can it be avoided?

Aug 14 12:54:04 virt2 kernel: [228497.005014] drbd vm-109-disk-1 virt5: BAD! BarrierAck #1488589 received, expected #1547462!
Aug 14 12:54:04 virt2 kernel: [228497.005122] drbd vm-109-disk-1 virt5: conn( Connected -> ProtocolError ) peer( Secondary -> Unknown )
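For context on what the message means: DRBD groups writes into epochs separated by barriers, and the sender expects the peer to acknowledge barrier numbers in order. A simplified illustration of that consistency check (an explanatory sketch, not the actual DRBD source):

```python
def check_barrier_ack(expected: int, received: int) -> str:
    """Sketch of the state transition a barrier ack triggers."""
    if received == expected:
        return "Connected"       # ack matches the pending epoch; all good
    # Peers disagree about which write epoch completed: the replication
    # stream can no longer be trusted, so the connection is torn down.
    return "ProtocolError"

# The numbers from the log above:
print(check_barrier_ack(1547462, 1488589))  # -> ProtocolError
```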



> Am 14.08.2017 um 16:12 schrieb Roland Kammerer <roland.kammerer at linbit.com>:
> 
> On Mon, Aug 14, 2017 at 02:41:56PM +0200, Frank Rust wrote:
>> Hi all,
>> after upgrading DRBD-Dkms to 9.0.8+linbit-1 (on kernel 4.4.67-1-pve) I get those errors:
>> 
>> Software is: root at virt5:~# dpkg -l |grep drbd
>> ii  drbd-dkms                         9.0.8+linbit-1                     all          RAID 1 over TCP/IP for Linux module source
>> ii  drbd-utils                        9.0.0+linbit-1                     amd64        RAID 1 over TCP/IP for Linux (user utilities)
>> ii  drbdmanage-proxmox                1.0-1                              all          DRBD distributed resource management utility
>> ii  python-drbdmanage                 0.99.8-1                           all          DRBD distributed resource management utility
>> 
>> 
>> 
>> Aug 14 14:21:49 virt2 kernel: [233761.855451] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [drbd_s_vm-109-d:65197]
>> Aug 14 14:21:49 virt2 kernel: [233761.855544] Modules linked in: ipt_REJECT nf_reject_ipv4 drbd_transport_tcp(O) drbd(O) ip_set ip6table_filter ip6_tables xt_multiport binfmt_misc iptable_filter ip_tables x_tables softdog rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfnetlink_log nfnetlink zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c ipmi_ssif intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel sb_edac aes_x86_64 snd_pcm lrw snd_timer edac_core gf128mul snd glue_helper shpchp soundcore joydev pcspkr input_leds ablk_helper cryptd i2c_i801 lpc_ich ipmi_si 8250_fintek mei_me ipmi_msghandler mei mac_hid acpi_power_meter wmi vhost_net vhost macvtap macvlan coretemp autofs4 hid_generic ixgbe(O) usbkbd usbmouse dca be2net usbhid vxlan ahci ip6_udp_tunnel ptp hid libahci udp_tunnel pps_core megaraid_sas fjes
>> Aug 14 14:21:49 virt2 kernel: [233761.855592] CPU: 1 PID: 65197 Comm: drbd_s_vm-109-d Tainted: P           O L  4.4.67-1-pve #1
>> Aug 14 14:21:49 virt2 kernel: [233761.855593] Hardware name: FUJITSU PRIMERGY RX2530 M1/D3279-A1, BIOS V5.0.0.9 R1.28.0 for D3279-A1x                     12/09/2015
>> Aug 14 14:21:49 virt2 kernel: [233761.855594] task: ffff883003707000 ti: ffff881dc5100000 task.ti: ffff881dc5100000
>> Aug 14 14:21:49 virt2 kernel: [233761.855595] RIP: 0010:[<ffffffffc0936ae2>]  [<ffffffffc0936ae2>] wait_for_sender_todo+0x142/0x270 [drbd]
>> Aug 14 14:21:49 virt2 kernel: [233761.855606] RSP: 0018:ffff881dc5103db0  EFLAGS: 00000282
>> Aug 14 14:21:49 virt2 kernel: [233761.855607] RAX: 00000000002aef32 RBX: ffff88300dd94000 RCX: 8000000000000000
>> Aug 14 14:21:49 virt2 kernel: [233761.855608] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff881c71ab1998
>> Aug 14 14:21:49 virt2 kernel: [233761.855609] RBP: ffff881dc5103e10 R08: ffff88300dd944d8 R09: 0000000000000001
>> Aug 14 14:21:49 virt2 kernel: [233761.855610] R10: 000000002b300018 R11: 00000000fffffffb R12: ffff88300dd940b8
>> Aug 14 14:21:49 virt2 kernel: [233761.855610] R13: ffff88300dd944c0 R14: ffff883003707000 R15: 0000000000000001
>> Aug 14 14:21:49 virt2 kernel: [233761.855611] FS:  0000000000000000(0000) GS:ffff88181f640000(0000) knlGS:0000000000000000
>> Aug 14 14:21:49 virt2 kernel: [233761.855612] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Aug 14 14:21:49 virt2 kernel: [233761.855613] CR2: 0000000000000030 CR3: 0000000001e0b000 CR4: 00000000001426e0
>> Aug 14 14:21:49 virt2 kernel: [233761.855614] Stack:
>> Aug 14 14:21:49 virt2 kernel: [233761.855614]  ffff88300dd94520 0000000000000000 ffff883003707000 ffffffff810c46e0
>> Aug 14 14:21:49 virt2 kernel: [233761.855616]  ffff881dc5103dd0 ffff881dc5103dd0 00000000e2e71a4a ffff88300dd94000
>> Aug 14 14:21:49 virt2 kernel: [233761.855617]  ffff88300dd944d8 ffff88300dd94528 ffff88300dd94520 ffff88300dd944d8
>> Aug 14 14:21:49 virt2 kernel: [233761.855618] Call Trace:
>> Aug 14 14:21:49 virt2 kernel: [233761.855623]  [<ffffffff810c46e0>] ? wait_woken+0x90/0x90
>> Aug 14 14:21:49 virt2 kernel: [233761.855627]  [<ffffffffc093c1fe>] drbd_sender+0x34e/0x400 [drbd]
>> Aug 14 14:21:49 virt2 kernel: [233761.855633]  [<ffffffffc095ad60>] ? w_complete+0x20/0x20 [drbd]
>> Aug 14 14:21:49 virt2 kernel: [233761.855637]  [<ffffffffc095adc0>] drbd_thread_setup+0x60/0x110 [drbd]
>> Aug 14 14:21:49 virt2 kernel: [233761.855641]  [<ffffffffc095ad60>] ? w_complete+0x20/0x20 [drbd]
>> Aug 14 14:21:49 virt2 kernel: [233761.855643]  [<ffffffff810a134a>] kthread+0xfa/0x110
>> Aug 14 14:21:49 virt2 kernel: [233761.855644]  [<ffffffff810a1250>] ? kthread_park+0x60/0x60
>> Aug 14 14:21:49 virt2 kernel: [233761.855646]  [<ffffffff81864a8f>] ret_from_fork+0x3f/0x70
>> Aug 14 14:21:49 virt2 kernel: [233761.855647]  [<ffffffff810a1250>] ? kthread_park+0x60/0x60
>> Aug 14 14:21:49 virt2 kernel: [233761.855648] Code: 8b 7b 10 8b 87 a0 03 00 00 84 d2 74 09 3b 83 38 09 00 00 0f 95 c2 48 81 c7 98 01 00 00 c6 07 00 0f 1f 40 00 fb 66 0f 1f 44 00 00 <84> d2 0f 85 99 00 00 00 f0 0f ba b3 f8 00 00 00 0f 72 43 83 bb 
>> Aug 14 14:21:51 virt2 kernel: [233763.964203] drbd_send_and_submit: 546 callbacks suppressed
>> Aug 14 14:21:51 virt2 kernel: [233763.964206] drbd vm-109-disk-1/0 drbd176: IO ERROR: neither local nor remote data, sector 38107320+8
>> Aug 14 14:21:51 virt2 kernel: [233763.964327] buffer_io_error: 546 callbacks suppressed
>> 
>> 
>> top shows processes using all of one cpu core
>>  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                
>> 65197 root      20   0       0      0      0 R 100.0  0.0  92:01.19 drbd_s_vm-109-d                                                                                                                                                        
>> 65229 root      20   0       0      0      0 R 100.0  0.0  91:54.35 kworker/u145:2                                                                                                                                                        
> 
> Any logs you can share before that happened? +/- 50 lines around the time
> this happened would help.
> 
>>> drbdadm status
>> root at virt5:~# drbdadm status
>> .drbdctrl role:Secondary
>>  volume:0 disk:UpToDate
>>  volume:1 disk:UpToDate
>>  fs1 role:Secondary
>>    volume:0 peer-disk:UpToDate
>>    volume:1 peer-disk:UpToDate
>>  fs2 role:Secondary
>>    volume:0 peer-disk:UpToDate
>>    volume:1 peer-disk:UpToDate
>>  virt1 role:Secondary
>>    volume:0 peer-disk:UpToDate
>>    volume:1 peer-disk:UpToDate
>>  virt2 role:Primary
>>    volume:0 peer-disk:UpToDate
>>    volume:1 peer-disk:UpToDate
>>  virt3 role:Secondary
>>    volume:0 peer-disk:UpToDate
>>    volume:1 peer-disk:UpToDate
>>  virt4 role:Secondary
>>    volume:0 peer-disk:UpToDate
>>    volume:1 peer-disk:UpToDate
> 
> Maybe a bit unrelated, but why would you spawn the control volume over
> that many nodes? Especially virtual machines. That does not make sense
> to me. I guess these should be satellite nodes.
> 
>> Is there anything I can do about it? (reboot is no option on production machines…)
> 
> You saw the trace in the kernel? This is what it is going to be.
> 
> Regards, rck
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
