[DRBD-user] drbd & multicast?

Volodymyr Litovka doka.ua at gmx.com
Mon Apr 6 18:44:26 CEST 2020


Hi Robert,

please see below

On 06.04.2020 17:23, Robert Altnoeder wrote:
>> On 06 Apr 2020, at 10:17, Volodymyr Litovka <doka.ua at gmx.com> wrote:
>>
>> To avoid this, I'd propose to add an additional layer, like a proxy, which would:
>>
>> - reside on every satellite
>> - receive data over unicast
>> ** thus, the drbd code would need only minimal changes (now it sends
>> unicast data to 1+ neighbors; after the change it would send the same
>> unicast to a single neighbor)
>> ** to minimize delay, use local sockets
>> - resend it over multicast
>> - but manage control traffic (e.g. acknowledgments from remote peers)
>> over unicast
> This would probably still require many changes in the DRBD kernel module, add another layer of complexity and another component that can fail independently, and make the system as a whole harder to maintain and troubleshoot.
>
> Delay would probably also be rather unpredictable, because different threads in kernel and user space must be activated and paused frequently for the IPC to work, and Linux, as a monolithic kernel, does not offer specialized mechanisms for direct low-latency context switches/thread activation in a chain of I/O servers, like the mechanisms found in most microkernels, or at least something in that general direction, e.g. "door calls" in the SunOS kernel (the kernel of the Solaris OS).

Well, I fully believe what you're saying, and my attempt to find "a better 
solution" doesn't look too convincing :-)

> Multicast in DRBD would certainly make sense in various scenarios, but it would probably have to be implemented directly in DRBD.

Nice to hear this ;-)

> Anyway, I don’t see that much difference between diskless nodes and nodes with storage. Any of these nodes always sends write requests to all connected storage nodes; the only difference with diskless nodes is that they also use the replication link for reading data, which storage nodes mostly do locally (load balancing may cause read requests over the network too). So the only thing that would make write performance on a diskless node worse than write performance on a node with local storage would be network saturation due to lots of read requests putting load on the network.

In fact, I see a somewhat different picture. I'm using a VM (its disk is 
/dev/drbd/by-res/m1/0) to produce the load. I launch the test VM (using 
virsh) on different nodes and get the corresponding resource usage, like this:

# linstor resource list
╭─────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node  ┊ Port ┊ Usage  ┊ Conns ┊    State ┊
╞═════════════════════════════════════════════════════════╡
┊ m1           ┊ stor1 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊
┊ m1           ┊ stor2 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊
┊ m1           ┊ stor3 ┊ 7000 ┊ InUse  ┊ Ok    ┊ Diskless ┊
╰─────────────────────────────────────────────────────────╯
# linstor resource list-volumes
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node  ┊ Resource ┊ StoragePool          ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊    State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ stor1 ┊ m1       ┊ drbdpool             ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊ 50.01 GiB ┊ Unused ┊ UpToDate ┊
┊ stor2 ┊ m1       ┊ drbdpool             ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊ 50.01 GiB ┊ Unused ┊ UpToDate ┊
┊ stor3 ┊ m1       ┊ DfltDisklessStorPool ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊           ┊ InUse  ┊ Diskless ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
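
The load inside the VM is plain sequential writes with dd, something along
the lines of the command below (size and flags here are illustrative, not
necessarily the exact ones I used):

# dd if=/dev/zero of=/root/testfile bs=1M count=20000 oflag=direct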

When m1 is InUse on stor2 (a "disk" node) and I run 'dd' there, I see the 
following tcpdump output on the host (stor2):

# tcpdump -i eno4 'src host stor2 and dst port 7000'
[ ... lots of similar packets to stor1 and no packets to stor3 ... ]
18:59:22.640966 IP stor2.39897 > stor1.afs3-fileserver: Flags [.], seq 1073298560:1073363720, ack 1, win 18449, options [nop,nop,TS val 2730290186 ecr 2958297760], length 65160
18:59:22.641495 IP stor2.39897 > stor1.afs3-fileserver: Flags [P.], seq 1073363720:1073426344, ack 1, win 18449, options [nop,nop,TS val 2730290186 ecr 2958297761], length 62624
18:59:22.642053 IP stor2.39897 > stor1.afs3-fileserver: Flags [.], seq 1073426344:1073491504, ack 1, win 18449, options [nop,nop,TS val 2730290187 ecr 2958297761], length 65160
18:59:22.642606 IP stor2.39897 > stor1.afs3-fileserver: Flags [.], seq 1073491504:1073556664, ack 1, win 18449, options [nop,nop,TS val 2730290187 ecr 2958297761], length 65160

When m1 is InUse on stor3 (the "diskless" node) and I run 'dd' there, I see 
the following tcpdump output on the host (stor3):

# tcpdump -i eno4 'src host stor3 and dst port 7000'
[ ... lots of similar packets to both stor1 and stor2 ... ]
19:05:56.451425 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069351304:1069416464, ack 16425, win 11444, options [nop,nop,TS val 3958888538 ecr 1765301734], length 65160
19:05:56.452077 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069416464:1069481624, ack 16425, win 11444, options [nop,nop,TS val 3958888539 ecr 1765301735], length 65160
19:05:56.452664 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069481624:1069546784, ack 16425, win 11444, options [nop,nop,TS val 3958888540 ecr 1765301736], length 65160
19:05:56.453324 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1071808472:1071873632, ack 61481, win 6365, options [nop,nop,TS val 1547141177 ecr 2878616029], length 65160
19:05:56.454142 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1071873632:1071938792, ack 61481, win 6365, options [nop,nop,TS val 1547141177 ecr 2878616029], length 65160
19:05:56.454926 IP stor3.40171 > stor2.afs3-fileserver: Flags [P.], seq 1071938792:1072002920, ack 61481, win 6365, options [nop,nop,TS val 1547141178 ecr 2878616030], length 64128
19:05:56.455700 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072002920:1072068080, ack 61481, win 6365, options [nop,nop,TS val 1547141179 ecr 2878616031], length 65160
19:05:56.456490 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069546784:1069611944, ack 16425, win 11444, options [nop,nop,TS val 3958888543 ecr 1765301739], length 65160
19:05:56.457121 IP stor3.59577 > stor1.afs3-fileserver: Flags [P.], seq 1069611944:1069676072, ack 16425, win 11444, options [nop,nop,TS val 3958888544 ecr 1765301740], length 64128
19:05:56.457730 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069676072:1069741232, ack 16425, win 11444, options [nop,nop,TS val 3958888545 ecr 1765301741], length 65160
19:05:56.458292 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069741232:1069806392, ack 16425, win 11444, options [nop,nop,TS val 3958888546 ecr 1765301741], length 65160
19:05:56.458939 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072068080:1072133240, ack 61481, win 6365, options [nop,nop,TS val 1547141182 ecr 2878616034], length 65160
19:05:56.459735 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072133240:1072198400, ack 61481, win 6365, options [nop,nop,TS val 1547141182 ecr 2878616034], length 65160
19:05:56.460598 IP stor3.40171 > stor2.afs3-fileserver: Flags [P.], seq 1072198400:1072261048, ack 61481, win 6365, options [nop,nop,TS val 1547141183 ecr 2878616035], length 62648
19:05:56.461097 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072261048:1072326208, ack 61481, win 6365, options [nop,nop,TS val 1547141184 ecr 2878616035], length 65160
19:05:56.461633 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072326208:1072391368, ack 61481, win 6365, options [nop,nop,TS val 1547141184 ecr 2878616036], length 65160

And, using bmon (a real-time traffic rate monitor for Linux), I always see 
about 1 Gbps on the originating host and:

- in the 1st case (originator is the "disk" node): about 1 Gbps on the 
receiving host
- in the 2nd case (originator is the "diskless" node): about 500 Mbps on 
each of the receiving hosts

From what I see, I conclude that in the 1st case (originator is the "disk" 
node) a single copy of the replicated data travels over the network, while 
in the 2nd case (originator is the "diskless" node) two copies travel over 
the network.
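
To cross-check this beyond bmon, the per-destination byte counts can be
summed up straight from a saved tcpdump run, e.g. with a few lines of Python
(a rough sketch - it assumes the text output above was saved to a file and
simply adds up the reported "length" values per destination host):

import re, sys
from collections import defaultdict

# matches lines like:
# 19:05:56.451425 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], ..., length 65160
pat = re.compile(r'IP \S+ > (\S+?)\.\S+: .* length (\d+)')

totals = defaultdict(int)
with open(sys.argv[1]) as f:        # e.g. a file like stor3-capture.txt
    for line in f:
        m = pat.search(line)
        if m:
            totals[m.group(1)] += int(m.group(2))

for dst, nbytes in totals.items():
    print(f"{dst}: {nbytes / 1e6:.1f} MB")

On the stor3 capture this should show roughly equal totals towards stor1 and
stor2, i.e. the same write data really is sent twice from the diskless node.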

Thank you.


-- 
Volodymyr Litovka
   "Vision without Execution is Hallucination." -- Thomas Edison
