On 06/19/2012 11:55 AM, Phil Frost wrote:
> I want to guarantee that fsync() doesn't return until writes have made
> it to physical storage. In particular, I care about PostgreSQL
> database integrity.

Well, this is proving very frustrating. I still don't know if I'm chasing behavior that simply isn't implemented, or isn't working in my environment. However, I'm very sure something is wrong here.

I tried digging around in the source code (3.2.0 kernel from Debian squeeze-backports) a bit, and I'm CCing drbd-dev since I don't imagine too many users read the code. I have pretty much no experience with block device programming, but I did find some good documentation in the kernel [1] that provided some good grep victims, specifically REQ_FLUSH and REQ_FUA. I found evidence that these are supported by DRBD, in drbd_main.c:

    static u32 bio_flags_to_wire(struct drbd_conf *mdev, unsigned long bi_rw)
    {
        if (mdev->agreed_pro_version >= 95)
            return (bi_rw & REQ_SYNC ? DP_RW_SYNC : 0) |
                   (bi_rw & REQ_FUA ? DP_FUA : 0) |
                   (bi_rw & REQ_FLUSH ? DP_FLUSH : 0) |
                   (bi_rw & REQ_DISCARD ? DP_DISCARD : 0);
        else
            return bi_rw & REQ_SYNC ? DP_RW_SYNC : 0;
    }

This appears to be responsible for encoding the block request flags into a network format for the peer, and there is an inverse function in drbd_receiver.c.

However, [1] also says block device drivers must call blk_queue_flush to advertise support for REQ_FUA and REQ_FLUSH. (Strictly, it says this of "request_fn based" drivers; I don't know what that means, but I think it applies here.) grep tells me DRBD doesn't do this anywhere, but I do see it in other drivers I recognize: MD, loop, xen-blkfront, etc.

So, my hypothesis is that DRBD has the code to pass REQ_FUA and REQ_FLUSH through to the underlying device, but it never sees those flags because it doesn't claim to support them. So, they get stripped off by the block IO system, which figures the best it can do is drain the queue, which is clearly the Wrong Thing.
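
To make the hypothesis concrete, here is a minimal userspace sketch of it. This is not kernel or DRBD code: every MODEL_* constant and model_* function is a hypothetical stand-in invented for illustration, and the bit values are arbitrary. It assumes the block layer clears the flush and FUA bits that a queue did not advertise before the driver gets to encode the remaining flags for its peer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values for illustration only; the real kernel and
 * DRBD headers define their own constants. */
#define MODEL_REQ_SYNC   (1UL << 0)
#define MODEL_REQ_FUA    (1UL << 1)
#define MODEL_REQ_FLUSH  (1UL << 2)

#define MODEL_DP_RW_SYNC (1U << 0)
#define MODEL_DP_FUA     (1U << 1)
#define MODEL_DP_FLUSH   (1U << 2)

/* Simplified stand-in for the flag encoding quoted from drbd_main.c. */
uint32_t model_flags_to_wire(int agreed_pro_version, unsigned long bi_rw)
{
	if (agreed_pro_version >= 95)
		return ((bi_rw & MODEL_REQ_SYNC) ? MODEL_DP_RW_SYNC : 0) |
		       ((bi_rw & MODEL_REQ_FUA) ? MODEL_DP_FUA : 0) |
		       ((bi_rw & MODEL_REQ_FLUSH) ? MODEL_DP_FLUSH : 0);
	return (bi_rw & MODEL_REQ_SYNC) ? MODEL_DP_RW_SYNC : 0;
}

/* Assumed block-layer behaviour: drop any flush/FUA bit that the
 * queue did not advertise; leave every other bit alone. */
unsigned long model_strip_unadvertised(unsigned long bi_rw,
				       unsigned long advertised)
{
	unsigned long cache_bits = MODEL_REQ_FLUSH | MODEL_REQ_FUA;

	return bi_rw & ~(cache_bits & ~advertised);
}

/* What the peer would be told, given what the queue advertised. */
uint32_t model_wire_flags_seen_by_peer(unsigned long bi_rw,
				       unsigned long advertised)
{
	return model_flags_to_wire(95,
			model_strip_unadvertised(bi_rw, advertised));
}

/* Sanity check of the model: with nothing advertised, only the sync
 * bit survives a flush+FUA+sync request. */
void model_self_check(void)
{
	unsigned long rw = MODEL_REQ_SYNC | MODEL_REQ_FUA | MODEL_REQ_FLUSH;

	assert(model_wire_flags_seen_by_peer(rw, 0) == MODEL_DP_RW_SYNC);
}
```

In this model, a request carrying flush, FUA and sync bits reaches the peer with only the sync bit when nothing is advertised, even though the encoding function itself handles all three. Whether the real block layer behaves this way for DRBD is exactly the open question.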

Unfortunately, I don't feel very qualified in this area, so can anyone tell me if I'm totally off base here? Any suggestions on how I might test this?

[1] http://www.mjmwired.net/kernel/Documentation/block/writeback_cache_control.txt
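
One rough way to probe this from userspace, offered as a sketch rather than a definitive test, is to time a loop of small writes each followed by fsync(). The function name avg_fsync_seconds and the probe file path are hypothetical. The reasoning is only a heuristic: a single rotating disk without a non-volatile write cache can complete roughly one flush per rotation, so an average far below a few milliseconds per fsync() would suggest the flush is not reaching the media. It cannot prove durability either way.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Overwrite the first 4 KiB of `path` `iterations` times, calling
 * fsync() after every write. Returns the average seconds per
 * write+fsync pair, or -1.0 on any error. */
double avg_fsync_seconds(const char *path, int iterations)
{
	char buf[4096];
	struct timespec start, end;
	double elapsed;
	int fd, i;

	assert(path != NULL);
	if (iterations <= 0)
		return -1.0;

	memset(buf, 'x', sizeof(buf));
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iterations; i++) {
		/* Rewrite the same block so every fsync() has dirty data. */
		if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf) ||
		    fsync(fd) != 0) {
			close(fd);
			return -1.0;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	close(fd);

	elapsed = (double)(end.tv_sec - start.tv_sec) +
		  (double)(end.tv_nsec - start.tv_nsec) / 1e9;
	return elapsed / iterations;
}
```

Calling it with a path on the DRBD-backed filesystem and again with a path on the backing disk alone, with a few hundred iterations each, gives two averages to compare. Volatile caches in the disk or controller can make both numbers look fast, so the comparison is only suggestive.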