Hello everybody!

I have a strange problem with my cluster. Yesterday I saw that node3 of my lustre cluster (it is the pair of node4 in the heartbeat+drbd cluster) had frozen, and node4 did not take over the OST. After each reboot it printed 'System halted.' on the console, although it should not have been down. I disconnected node3, rebooted node4, and everything worked fine.

Today I tried to make it work as before, with a fresh system running CentOS 4.4, drbd 0.7.25 and lustre 1.6.4.1. The array drbd1, which is originally primary on node4, went fine. The status of drbd0 on the two nodes:

node4:
0: cs:StandAlone st:Primary/Unknown ld:Consistent ns:0 nr:0 dw:15404660 dr:88550854 al:11773 bm:11773 lo:0 pe:0 ua:0 ap:0

node3:
0: cs:WFConnection st:Secondary/Unknown ld:Consistent ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0

# drbdadm --dry-run wait_connect ost-3
drbdsetup /dev/drbd0 wait_connect --wfc-timeout=120 --degr-wfc-timeout=120

It said: Aborting.

After "drbdadm connect ost-3" I saw this in the messages log:

Jan 24 09:31:37 node4 kernel: drbd0: drbdsetup [8135]: cstate StandAlone --> Unconnected
Jan 24 09:31:37 node4 kernel: drbd0: drbd0_receiver [8136]: cstate Unconnected --> WFConnection
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate WFConnection --> WFReportParams
Jan 24 09:31:39 node4 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Jan 24 09:31:39 node4 kernel: drbd0: Connection established.
Jan 24 09:31:39 node4 kernel: drbd0: I am(P): 1:00000003:00000003:00000053:00000003:10
Jan 24 09:31:39 node4 kernel: drbd0: Peer(S): 1:00000007:00000003:0000004a:00000004:00
Jan 24 09:31:39 node4 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate WFReportParams --> StandAlone
Jan 24 09:31:39 node4 kernel: drbd0: error receiving ReportParams, l: 72!
Jan 24 09:31:39 node4 kernel: drbd0: asender terminated
Jan 24 09:31:39 node4 kernel: drbd0: worker terminated
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate StandAlone --> StandAlone
Jan 24 09:31:39 node4 kernel: drbd0: Connection lost.
Jan 24 09:31:39 node4 kernel: drbd0: receiver terminated

Why didn't it work? I wanted node4 to be the SyncSource; node3 behaved fine and was listening on the right port with cstate WFConnection.

Then I made a mistake: I disabled heartbeat and rebooted node4. Now both nodes were Secondary, and they started to sync with node3 as the SyncSource. Why? What would have been the right command to get them synced?

Some time after that (I don't know exactly when), node4 started to behave like node3 did yesterday: it wrote 'System halted' and everything stopped working. I stopped heartbeat, reset the machine, mounted the OST by hand, and now it looks fine, but who knows, now I'm a bit paranoid.

I also have to say that node3's kernel was 1.6.0.1 with drbd 0.7.22 (but 0.7.25 userland) until the last reboot above. I don't know whether that could have caused a problem or not.

Does anybody have an idea what happened, or what could be related to any part of this story?

--------

The above mail was sent to the lustre mailing list, but they sent me here. Since then I have realized that there is a LAN card using sky2.ko, and that was probably the problem. There were other data losses without inconsistent drbd arrays (as far as I can see) with 2.6.9-55.0.9.EL_lustre.1.6.4.1smp, where sk98lin.ko was not working at all. I went back to 2.6.9-42.0.10.EL_lustre-1.6.0.1, and with sk98lin.ko it still looks fine.

I have two main questions:

1. How can I force which node becomes the sync source?
2. If both nodes are Secondary, shouldn't they refrain from synchronizing? If not, how do they decide which one is the source?

Thank you very much, this is very important to me,
tamas
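
The two status lines quoted in the message can be compared field by field with a small helper. This is only a sketch: the function names `parse_drbd_status` and `summarize` are hypothetical, and reading `cs` as the connection state, `st` as the local/peer role and `ld` as the local disk state is an assumption based on the DRBD 0.7 output shown in the message. The node name prefix ("node4:") is not part of the status line and is left out.

```python
def parse_drbd_status(line):
    """Split one device line such as '0: cs:... st:...' into a dict of fields."""
    minor, _, rest = line.partition(":")
    fields = {"minor": int(minor.strip())}
    for token in rest.split():
        key, _, value = token.partition(":")
        # Counters (ns, nr, dw, dr, ...) become ints, states stay strings.
        fields[key] = int(value) if value.isdigit() else value
    return fields


def summarize(fields):
    """Pick out the fields assumed to describe connection, roles and disk state."""
    local_role, _, peer_role = fields["st"].partition("/")
    return {
        "connection": fields["cs"],
        "local_role": local_role,
        "peer_role": peer_role,
        "local_disk": fields["ld"],
    }


# Status lines copied from the message.
node4 = parse_drbd_status(
    "0: cs:StandAlone st:Primary/Unknown ld:Consistent ns:0 nr:0 "
    "dw:15404660 dr:88550854 al:11773 bm:11773 lo:0 pe:0 ua:0 ap:0"
)
node3 = parse_drbd_status(
    "0: cs:WFConnection st:Secondary/Unknown ld:Consistent ns:0 nr:0 "
    "dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0"
)

for name, fields in (("node4", node4), ("node3", node3)):
    print(name, summarize(fields))
```

Read this way, node4 is a StandAlone Primary that has written data locally (`dw` is non-zero) while node3 is a Secondary waiting for a connection with all counters at zero, which is the situation the questions in the message are about.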