[DRBD-user] Invalidate - me too!

Wed Aug 10 01:14:42 CEST 2005

On Tuesday 09 August 2005 23:51, Dan Cunningham wrote:
> So last night my server started doing the same thing!  We have a 1.5 TB
> array (80% utilized) and noticed some users, including the root user,
> were getting permission denied messages accessing certain directories.
> We have the same reiser error in the message log each time someone tries
> to access one of the corrupt folders.  I actually disconnected the
> secondary this weekend, expecting to do some work on it.  I don't think
> that's the problem, but what I am wondering is whether the corruption
> will sync over if I reconnect the servers.  The hardware is a Dell SCSI
> storage vault connected to a Dell 2650 running 2.6.8-2/Debian with drbd
> 7.  Any ideas or suggestions???  Extended downtime for fsck is my last
> option :-(  Also I checked my partitions and the LVM volume has 1G more
> space than the reiserfs on top of it (i.e. drbd has 1GB for meta info)
>

Sure, the filesystem corruption will sync over when the second box becomes 
connected. 
Well, before our server went into production, I had already thought about the 
problem of how long reiserfsck can sometimes take. Actually, drbd is the 
ideal solution:

1.) Tell the users that from now on everything they save on the drbd-backed 
storage will be lost, until you tell them the problem is fixed.

2.) Disconnect the drbd device on the failover node and stop heartbeat, etc. 
there.

3.) Do the fsck on the failover node and fix everything there. The main server 
stays as it is during this time and keeps serving the clients.

4.) When point 3 is finished, put the failover node into drbd primary 
state, put the main server into secondary state, invalidate the data on 
the main node, and reconnect. The data should now sync from the failover node 
to the main server. You lose everything that was written to the main server 
after you disconnected the failover node.
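
The four points above can be sketched as a dry-run command plan. This is only 
an illustration, not a tested procedure: the resource name "r0", the device 
path "/dev/drbd0" and the heartbeat init script path are assumptions, and 
whether drbdadm accepts "invalidate" while the nodes are still disconnected 
may depend on the drbd version. The script only prints the commands; it does 
not run anything.

```python
from typing import List, Tuple


def build_plan(resource: str = "r0",
               fsck_device: str = "/dev/drbd0") -> List[Tuple[str, str]]:
    """Return (node, command) pairs for points 2 to 4, in order.

    "r0" and "/dev/drbd0" are hypothetical names; substitute the values
    from your own drbd.conf. Nothing is executed here.
    """
    return [
        # Point 2: cut replication and stop the cluster manager.
        ("failover", f"drbdadm disconnect {resource}"),
        ("failover", "/etc/init.d/heartbeat stop"),  # assumed init script path
        # Point 3: read-only check first; reiserfsck itself reports which
        # repair option is needed if it finds problems.
        ("failover", f"reiserfsck --check {fsck_device}"),
        # Point 4: swap roles, discard the main node's data, resync.
        ("failover", f"drbdadm primary {resource}"),
        # The filesystem on the main node must be unmounted before this.
        ("main", f"drbdadm secondary {resource}"),
        ("main", f"drbdadm invalidate {resource}"),
        ("main", f"drbdadm connect {resource}"),
        ("failover", f"drbdadm connect {resource}"),
    ]


if __name__ == "__main__":
    for node, command in build_plan():
        print(f"[{node}] {command}")
```

Printing the plan first makes it easy to review the order and the node each 
command belongs to before typing anything on a production box.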

By doing the fsck on the disconnected failover node, you will also see how 
much time the fsck takes, so it is up to you to decide whether you prefer the 
downtime or the data loss from point 4 (all data the users have written in 
the meantime).
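
To get that number, the fsck run on the failover node can simply be timed. 
Below is a minimal sketch of a wall-clock timer around an arbitrary command; 
the helper name timed_run is made up for this example, and the command to 
time is whatever you pass on the command line.

```python
import subprocess
import sys
import time
from typing import List, Tuple


def timed_run(argv: List[str]) -> Tuple[int, float]:
    """Run a command and return (exit code, elapsed seconds).

    stdin/stdout are inherited, so interactive prompts from the command
    still work; time spent answering them is included in the result.
    """
    start = time.monotonic()
    result = subprocess.run(argv)
    return result.returncode, time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical usage: python timed_run.py reiserfsck --check /dev/drbd0
    if len(sys.argv) < 2:
        sys.exit("usage: timed_run.py COMMAND [ARGS...]")
    code, seconds = timed_run(sys.argv[1:])
    print(f"exit code {code}, took {seconds / 60:.1f} minutes")
```

The measured time is only a rough guide for the main server, since the load 
and the state of the filesystem there will differ.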

I have to admit that this data loss is probably not acceptable for very 
important data (e.g. databases of online shops, etc.), but it would surely 
work with our users. I would probably even remount our home directory 
read-only.

Hope it helps,
	Bernd

PS: I hope you noticed the announcement about possible corruption with 
some recent kernel versions?

-- 
Bernd Schubert
PCI / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg