Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

We are currently experiencing a strange problem with DRBD.

A few days ago, we got a DRBD crash with:

Jan 21 12:13:40 dedie58 ntpd[926790]: ntp engine ready
Jan 21 12:13:41 dedie58 kernel: block drbd1: drbd1_receiver[2910] Concurrent local write detected! new: 2253364088s +4096; pending: 2253364088s +4096
Jan 21 12:13:41 dedie58 kernel: block drbd1: Concurrent write! [W AFTERWARDS] sec=2253364088s

It happened on node1 just after I synced the time on both nodes (yes, I won't do that again). At that time, we had had VMs running spread across both nodes for a few days. No VM was ever running on both nodes at the same time.

So we rebooted node1, launched all the VMs on node2 and requested a full resync of the DRBD device, which took 5 days (the disks are 4 TB, 7.2k rpm). We thought everything was back to normal, so we moved a non-important VM back to node1. It ran as expected for about 8 hours and finally crashed around 06:30 AM with the call trace below. I checked for fencing: nothing in the logs on either node.

For the background: two Proxmox 3 hypervisors running Linux KVM VMs, with LVM disks on top of DRBD on top of a software RAID device.

For the versions:

# cat /proc/drbd
version: 8.3.13 (api:88/proto:86-96)
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by root at sighted, 2012-10-09 12:47:51
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:286670783 dw:286670763 dr:8447304 al:0 bm:31 lo:2 pe:0 ua:3 ap:0 ep:2 wo:b oos:0
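For completeness, the resource is a plain dual-primary setup on top of the md device; the definition looks roughly like this (resource name, hostnames, backing devices and addresses below are illustrative, not copy-pasted from the live config):

resource r1 {
        protocol C;
        net {
                allow-two-primaries;            # both nodes Primary, as shown in /proc/drbd
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }
        syncer {
                verify-alg md5;                 # needed for online verify
        }
        on node1 {
                device    /dev/drbd1;
                disk      /dev/md3;             # software RAID backing device (illustrative)
                address   10.0.0.1:7788;
                meta-disk internal;
        }
        on node2 {
                device    /dev/drbd1;
                disk      /dev/md3;
                address   10.0.0.2:7788;
                meta-disk internal;
        }
}

LVM then sits directly on /dev/drbd1 and the VM disks are logical volumes on it.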
It seems we lost the VM a few minutes earlier:

> Feb 13 06:36:23 dedie58 pvestatd[3883]: WARNING: unable to connect to VM 106 socket - timeout after 31 retries

Nothing else in the logs looks relevant, only cron tasks, and those run every hour so I don't think they have anything to do with it. The hour does give me a hint: the VM is a Debian one and its daily cron jobs run around 06:30 AM. That's the only clue I have, and it can't really explain this crash by itself.

Does someone have any idea about what could cause this? Of course, I can run some tests by moving the VM back to node1 with more logging enabled.

Thanks for reading and for your help,
Julien Escario

And finally, the call trace:

> Feb 13 06:38:27 dedie58 kernel: INFO: task kvm:820630 blocked for more than 120 seconds.
> Feb 13 06:38:27 dedie58 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 13 06:38:27 dedie58 kernel: kvm D ffff88186ad84740 0 820630 1 0 0x00000000
> Feb 13 06:38:27 dedie58 kernel: ffff8817ebe255f8 0000000000000086 0000000000000000 0005120000000000
> Feb 13 06:38:27 dedie58 kernel: 0000000000011200 0000000000011210 ffff8817ebe25578 0000000000000010
> Feb 13 06:38:27 dedie58 kernel: ffff881873256a80 000000013232ac6f ffff8817ebe25fd8 ffff8817ebe25fd8
> Feb 13 06:38:27 dedie58 kernel: Call Trace:
> Feb 13 06:38:27 dedie58 kernel: [<ffffffffa0490a85>] drbd_al_begin_io+0x195/0x220 [drbd]
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff8109b440>] ? autoremove_wake_function+0x0/0x40
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811db486>] ? __bio_clone+0x26/0x70
> Feb 13 06:38:27 dedie58 kernel: [<ffffffffa048d168>] drbd_make_request_common+0x1298/0x1870 [drbd]
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff8112e115>] ? mempool_alloc_slab+0x15/0x20
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff8112e423>] ? mempool_alloc+0x73/0x180
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811dc45a>] ? bvec_alloc_bs+0x6a/0x120
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811dc5c2>] ? bio_alloc_bioset+0xb2/0xf0
> Feb 13 06:38:27 dedie58 kernel: [<ffffffffa048dbfa>] drbd_make_request+0x4ba/0x12a0 [drbd]
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff81428faf>] ? __split_and_process_bio+0x47f/0x600
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff8141712d>] ? md_make_request+0xcd/0x1f0
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff8126da04>] ? throtl_find_tg+0x44/0x60
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff8126ebce>] ? blk_throtl_bio+0x3ce/0x5c0
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff8125d835>] generic_make_request+0x265/0x3b0
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff8125da0d>] submit_bio+0x8d/0x1a0
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811dfa98>] dio_bio_submit+0xa8/0xc0
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811e0c9a>] __blockdev_direct_IO_newtrunc+0x98a/0xce0
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811e104c>] __blockdev_direct_IO+0x5c/0xd0
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811ddda0>] ? blkdev_get_blocks+0x0/0xd0
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811ddf37>] blkdev_direct_IO+0x57/0x60
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811ddda0>] ? blkdev_get_blocks+0x0/0xd0
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff8112aa38>] mapping_direct_IO.isra.25+0x48/0x70
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff8112bb6f>] generic_file_direct_write_iter+0xef/0x170
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff8112bf10>] __generic_file_write_iter+0x320/0x3e0
> Feb 13 06:38:27 dedie58 kernel: [<ffffffffa052050b>] ? kvm_arch_vcpu_put+0x4b/0x60 [kvm]
> Feb 13 06:38:27 dedie58 kernel: [<ffffffffa0512118>] ? vcpu_put+0x28/0x30 [kvm]
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff8112c053>] __generic_file_aio_write+0x83/0xa0
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811ddcc1>] blkdev_aio_write+0x71/0x100
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811ede07>] aio_rw_vect_retry+0xa7/0x240
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811eef81>] aio_run_iocb+0x61/0x150
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811f03b0>] do_io_submit+0x2a0/0x630
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff811f0750>] sys_io_submit+0x10/0x20
> Feb 13 06:38:27 dedie58 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
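PS: before moving anything important back to node1, I'm thinking of running an online verify to double-check that the full resync really left the two copies identical; roughly this (assuming verify-alg is set in the syncer section, and "r1" stands in for the real resource name):

# start the verify from either node
drbdadm verify r1
# progress shows up in /proc/drbd; mismatches are logged as "Out of sync" by the kernel
cat /proc/drbd
grep "Out of sync" /var/log/kern.log
# if blocks are reported out of sync, a disconnect/connect cycle resynchronizes them
drbdadm disconnect r1 && drbdadm connect r1

Would that be a reasonable test, or is there something better to enable before reproducing the hang?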