[DRBD-user] Pacemaker - DRBD fails on node every couple hours

Thu Mar 1 11:48:53 CET 2012

First, thank you very much for your time, Lars.

>> #cat /proc/drbd
>> #grep . /sys/module/drbd/*version
-------------------------
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by root@drbdnodeA, 2012-02-13 16:06:27
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate B r-----
    ns:0 nr:8496 dw:8496 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

/sys/module/drbd/srcversion:4A4FDD6F2ECF22BD2AD5970
/sys/module/drbd/version:8.4.1
-------------------------

The initrd file also looks good; it contains only the 8.4.1 module.

>> So you *do* have a working DRBD,
>> and only the monitor operation fails "occasionally" (much too often,
>> still), with the below error log.

Yes, this *may* be the case.
I'm not sure whether the drbd module really crashes (and is started again by
pacemaker afterwards) or whether it never failed at all.
So far, all I really see is that pacemaker detects "a problem" and
initiates the failover.
After that, all services continue to run on the other node and drbd has
switched its primary/secondary roles (and I see all these errors/messages in
the log).
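
One way to tell these two cases apart might be to record the DRBD state more
often than the monitor operation runs and compare the timestamps with the
pacemaker log afterwards. The following is only a rough sketch: the function
name snapshot_drbd and the log path /tmp/drbd-state.log are made up for
illustration and are not part of DRBD or pacemaker.

```shell
#!/bin/sh
# Append a timestamped copy of a status file (default: /proc/drbd) to a log.
# snapshot_drbd and the default log path are illustrative assumptions.
snapshot_drbd() {
    status_file="${1:-/proc/drbd}"
    log_file="${2:-/tmp/drbd-state.log}"
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        if [ -r "$status_file" ]; then
            cat "$status_file"
        else
            # /proc/drbd is absent while the drbd module is not loaded.
            printf 'status file %s not readable\n' "$status_file"
        fi
    } >> "$log_file"
}

# Example: sample every 5 seconds, below the 15 s monitor interval.
# while :; do snapshot_drbd; sleep 5; done
```

If the log shows cs:Connected and UpToDate/UpToDate right around the time of a
reported monitor failure, that would suggest the resource itself never went
down.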

If it helps, the crm_mon output changes from:

--------------------------------------
Online: [ drbdnodeA drbdnodeB ]

 Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
     Masters: [ drbdnodeA ]
     Slaves: [ drbdnodeB ]
 Resource Group: g_haservices
     p_ipv4     (ocf::heartbeat:IPaddr2):       Started drbdnodeA
     p_fsmount_cgpro    (ocf::heartbeat:Filesystem):    Started drbdnodeA
     p_exportnfs_cgpro  (ocf::heartbeat:exportfs):      Started drbdnodeA
 Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
     Started: [ drbdnodeA drbdnodeB ]
 Clone Set: cl_exportnfs_root [p_exportnfs_root]
     Started: [ drbdnodeA drbdnodeB ]
--------------------------------------

to:

--------------------------------------
Online: [ drbdnodeA drbdnodeB ]

 Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
     Masters: [ drbdnodeB ]
     Slaves: [ drbdnodeA ]
 Resource Group: g_haservices
     p_ipv4     (ocf::heartbeat:IPaddr2):       Started drbdnodeB
     p_fsmount_cgpro    (ocf::heartbeat:Filesystem):    Started drbdnodeB
     p_exportnfs_cgpro  (ocf::heartbeat:exportfs):      Started drbdnodeB
 Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
     Started: [ drbdnodeA drbdnodeB ]
 Clone Set: cl_exportnfs_root [p_exportnfs_root]
     Started: [ drbdnodeA drbdnodeB ]

Failed actions:
    p_drbd_r0:0_monitor_15000 (node=drbdnodeA, call=26, rc=7, status=complete): not running
--------------------------------------
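
For reference, the rc=7 in the failed action above is the OCF return code
OCF_NOT_RUNNING, which matches the "not running" text. A small lookup for the
standard OCF codes could look like the sketch below; the helper name
ocf_rc_name is hypothetical and is not something the resource agents ship.

```shell
#!/bin/sh
# Map a numeric OCF resource agent return code to its symbolic name.
# ocf_rc_name is an illustrative helper, not part of any package.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;
        1) echo "OCF_ERR_GENERIC" ;;
        2) echo "OCF_ERR_ARGS" ;;
        3) echo "OCF_ERR_UNIMPLEMENTED" ;;
        4) echo "OCF_ERR_PERM" ;;
        5) echo "OCF_ERR_INSTALLED" ;;
        6) echo "OCF_ERR_CONFIGURED" ;;
        7) echo "OCF_NOT_RUNNING" ;;
        8) echo "OCF_RUNNING_MASTER" ;;
        9) echo "OCF_FAILED_MASTER" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Example: ocf_rc_name 7
```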

Christoph Roethlisberger