[DRBD-user] drbd(?) fail

Tue Jan 29 09:50:14 CET 2008

Hello everybody!

I have a strange problem with my cluster.

Yesterday I saw that node3 of my lustre cluster (it is the pair of node4
in the heartbeat+drbd cluster) had frozen, and node4 did not take over
the OST.

After a reboot it always wrote 'System halted.' on the console, but it
cannot be down. I disconnected node3, rebooted node4, and everything
worked fine.

Today I tried to get it working as before on a fresh system with CentOS
4.4, drbd 0.7.25 and lustre 1.6.4.1. The array drbd1, which was
originally primary on node4, came up fine.

node4:

 0: cs:StandAlone st:Primary/Unknown ld:Consistent
    ns:0 nr:0 dw:15404660 dr:88550854 al:11773 bm:11773 lo:0 pe:0 ua:0 ap:0

node3:

 0: cs:WFConnection st:Secondary/Unknown ld:Consistent
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
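
For anyone who wants to compare the two states side by side, here is a
minimal Python sketch that splits the status lines above into fields. It
assumes the lines have the key:value layout shown (they look like
/proc/drbd output); the function name parse_status is just something made
up for this illustration.

```python
# Status lines copied from the two nodes above.
STATUS = {
    "node4": " 0: cs:StandAlone st:Primary/Unknown ld:Consistent",
    "node3": " 0: cs:WFConnection st:Secondary/Unknown ld:Consistent",
}


def parse_status(line):
    """Split one device status line into its key:value fields.

    Assumes the layout '<minor>: cs:<state> st:<local>/<peer> ld:<state>'.
    """
    minor, rest = line.strip().split(":", 1)
    fields = dict(item.split(":", 1) for item in rest.split())
    local, peer = fields["st"].split("/")
    return {
        "minor": int(minor),
        "cs": fields["cs"],      # connection state
        "local": local,          # role of this node
        "peer": peer,            # role of the other node, as seen from here
        "ld": fields["ld"],      # local data state
    }


parsed = {node: parse_status(line) for node, line in STATUS.items()}
for node, info in parsed.items():
    print(node, info)
```

Read this way, node4 is Primary but StandAlone (not trying to connect),
while node3 is Secondary and waiting for a connection.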

# drbdadm --dry-run wait_connect ost-3
drbdsetup /dev/drbd0 wait_connect --wfc-timeout=120 --degr-wfc-timeout=120

It said: Aborting.

After running drbdadm connect ost-3, I saw this in the messages log:

Jan 24 09:31:37 node4 kernel: drbd0: drbdsetup [8135]: cstate StandAlone --> Unconnected
Jan 24 09:31:37 node4 kernel: drbd0: drbd0_receiver [8136]: cstate Unconnected --> WFConnection
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate WFConnection --> WFReportParams
Jan 24 09:31:39 node4 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Jan 24 09:31:39 node4 kernel: drbd0: Connection established.
Jan 24 09:31:39 node4 kernel: drbd0: I am(P): 1:00000003:00000003:00000053:00000003:10
Jan 24 09:31:39 node4 kernel: drbd0: Peer(S): 1:00000007:00000003:0000004a:00000004:00
Jan 24 09:31:39 node4 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate WFReportParams --> StandAlone
Jan 24 09:31:39 node4 kernel: drbd0: error receiving ReportParams, l: 72!
Jan 24 09:31:39 node4 kernel: drbd0: asender terminated
Jan 24 09:31:39 node4 kernel: drbd0: worker terminated
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate StandAlone --> StandAlone
Jan 24 09:31:39 node4 kernel: drbd0: Connection lost.
Jan 24 09:31:39 node4 kernel: drbd0: receiver terminated

Why didn't it work? I wanted node4 to be the SyncSource; node3
behaved fine and was listening on the right port with cstate WFConnection.
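
One possible reading of the "I am(P)" and "Peer(S)" lines in the log,
sketched in Python. It rests on an assumption the log itself does not
state: that the colon-separated counters are compared field by field from
left to right and the first difference decides which side holds the newer
data. The helper names counters and compare are made up for this
illustration and are not DRBD functions.

```python
# Counter strings copied from the kernel log above.
LOCAL = "1:00000003:00000003:00000053:00000003:10"  # "I am(P)" on node4
PEER = "1:00000007:00000003:0000004a:00000004:00"   # "Peer(S)" on node3


def counters(text):
    """Return the first five colon-separated fields as integers.

    The first field is read as decimal and the next four as hex;
    the trailing flag field is ignored in this sketch.
    """
    parts = text.split(":")
    return [int(parts[0])] + [int(p, 16) for p in parts[1:5]]


def compare(local, peer):
    """Return 1 if local ranks higher, -1 if the peer does, 0 if equal.

    ASSUMPTION: fields are compared left to right and the first
    difference wins. This is a guess at the rule, not DRBD's own code.
    """
    a, b = counters(local), counters(peer)
    return (a > b) - (a < b)  # Python compares lists lexicographically


result = compare(LOCAL, PEER)
print(result)
```

Under that assumption the peer's second field (00000007) is higher than
the local one (00000003), so node3 would rank as having the newer data.
That would at least be consistent with the "Current Primary shall become
sync TARGET" message, although I cannot confirm that this is the rule.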

Then I made a mistake: I disabled heartbeat and rebooted node4. Both
nodes came up as Secondary and started to sync, with node3 as the
SyncSource. Why? What would have been the right command?

So they got synced. After that, I don't know exactly when, node4
started to behave like node3 did yesterday: it wrote 'System halted' and
everything stopped working. I stopped heartbeat, reset the machine,
mounted the OST by hand, and now it looks fine, but who knows; now I'm a
bit paranoid.

I should also mention that node3's kernel was 1.6.0.1 with drbd 0.7.22
(but 0.7.25 userland) until the last reboot above. I don't know whether
that could have caused a problem.

Does anybody have an idea what happened, or what could explain any part
of this story?

--------
The above mail was sent to the lustre mailing list, but they sent me here.
Since then I have realized that there is a LAN card driven by sky2.ko,
and that was probably the problem. There were other data losses without
inconsistent drbd arrays (as far as I can see) with
2.6.9-55.0.9.EL_lustre.1.6.4.1smp, where sk98lin.ko was not working at
all. I went back to 2.6.9-42.0.10.EL_lustre-1.6.0.1 with sk98lin.ko, and
so far it still looks fine.

I have two main problems:

1. How can I force which node becomes the sync source?
2. If both nodes are Secondary, shouldn't they skip synchronization? If not, how do they decide which one is the source?

Thank you very much, this would be very important to me,

tamas