[DRBD-user] File data corrupt when I switch Nodes

Lars Ellenberg Lars.Ellenberg at linbit.com
Tue Aug 31 13:33:52 CEST 2004


/ 2004-08-31 09:34:04 +0100
\ Kevin Izzet:
> Hi All,
> 
> Firstly can I thank Philipp, Lars and everyone else who has contributed to 
> DRBD, we have been running a production cluster for over a year now
> supporting :-
> Unix/Linux Home accounts
> Unix/Linux Application serving
> Apache
> Samba
> 
> My issue is I am in the process of building a second cluster with the 
> following spec
> 
> HP DL380 G3's
> 2 GB RAM
> Gigabit Ethernet for Replication
> Serial for Heartbeat
> Suse 9.0 prof ed, stripped down for server use only....
> Kernel Suse 2.4.21-99-smp4G.
> 1 x 72GB 15k rpm drive for root
> 2 x 72GB 15k rpm drives striped to 135.6GB for drbd replication.
> Heartbeat 1.2.2
> The server will be running Samba3 and Postgresql 7.4.3 (and 8.0beta for 
> testing)
> 
> I originally installed drbd 0.6.12 which worked perfectly and has been
> running for a couple of weeks without any issues, but after the
> release of 0.7.3 I decided to bite the bullet and give it a go.
> 
> The compile and install went fine, built as a module, not patched into
> the kernel. I repartitioned the 135GB drive from one slice to two, the
> second slice for the metadata, and made that partition 500MB to be on
> the safe side.
>
> Originally I used Reiserfs for the file system but switched to EXT3 to
> see if this would fix my issue !!!
>
> I created a new drbd.conf file (see below) and started up the primary
> side, all fine. I rebuilt the Samba and Postgresql file structures and
> restarted the Heartbeat services. All ran great.
>
> Next I configured the secondary side and started up drbd. It began to
> sync as expected and completed successfully after 2.5 hours. I then
> started up heartbeat on the secondary side and forced a switch from
> the primary to the secondary. All appeared to be going ok until I
> noticed that Postgresql didn't restart on the now new primary.
>
> I checked the logfiles and the startup was complaining that the
> postgresql.conf file was invalid. On checking the file I found that it
> appeared to be corrupt: the header was complete gibberish but the rest
> of the file contained normal text. I switched the services back over
> and then, to my surprise, Postgresql started fine. I checked the conf
> file and it was just a normal text file !!!!!!
> 
> I have tried rebuilding the drbd devices three times now and have also
> forced a full resync of the data twice, but I still get the same issue.
> 
> If I can't get 0.7.3 working in the next month I'm going to have to revert 
> back to using 0.6.12
> so that I can get this cluster into production....
> 
> Any help would be much appreciated - and thanks in advance.........

your ! key is broken... :)

together with other reports which I think are related,
it looks like drbd does not like the Redhat kernel.
maybe it is only a 2.4 kernel issue,
maybe it is a generic Redhat issue,
probably weird timing is involved,
most likely sendpage is involved in some way or the other.
it could be "only" some smp or highmem issue.

maybe it is even a hidden serious generic bug, but I doubt that,
since we did many test iterations with random IO load and
many disconnect/reconnect/sync and/or failover cycles,
at linbit with kernel.org/debian 2.6 kernels,
and in the SuSE labs with SuSE's 2.6 kernels.
we then compared the md5sums of the lower level devices,
and they compared just fine
(at least since we left -pre status).
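
for anyone who wants to repeat that kind of check, here is a minimal
sketch in Python. it is an illustration only, not the test harness used
at linbit or SuSE. the function names and any device paths you pass in
are hypothetical. it assumes the drbd devices are stopped (or at least
idle and fully synced), so the lower level devices are not changing
while they are read. run device_md5() on each node against its own
lower level device and compare the two hex strings by hand.
first_difference() only works when both images are readable from one
host, for example after copying one over.

```python
import hashlib


def device_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of everything readable from path."""
    digest = hashlib.md5()
    with open(path, "rb") as dev:
        while True:
            chunk = dev.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b, block_size=4096):
    """Return the start offset of the first differing block, or None.

    Both paths must be readable on the same host. A length mismatch
    is reported as a difference at the block where one side ends.
    """
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if block_a != block_b:
                return offset
            if not block_a:
                # Both sides hit end of data with no mismatch.
                return None
            offset += len(block_a)
```

if the digests differ, the offset from first_difference() gives a
rough idea of which part of the device to look at.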

please disable tcp_sendpage altogether, and verify whether this helps.
i.e., get a fresh svn checkout,
and define DRBD_DISABLE_SENDPAGE in drbd_config.h ...

	Lars Ellenberg


