[DRBD-user] drbd & multicast?

Adam Goryachev mailinglists at websitemanagers.com.au
Tue Apr 7 01:52:22 CEST 2020


On 7/4/20 02:44, Volodymyr Litovka wrote:
>
> Hi Robert,
>
> please see below
>
> On 06.04.2020 17:23, Robert Altnoeder wrote:
>>> On 06 Apr 2020, at 10:17, Volodymyr Litovka <doka.ua at gmx.com> wrote:
>>>
>>> To avoid this, I'd propose adding an additional layer, like a proxy, which would:
>>>
>>> - reside on every satellite
>>> - receive data over unicast
>>> ** thus, the drbd code would need only minimal changes (now it sends
>>> unicast data to 1+ neighbors; after the change it would send the same
>>> unicast to a single neighbor)
>>> ** to minimize delay - use local sockets
>>> - resend it over multicast
>>> - but manage control traffic (e.g. acknowledgments from remote peers)
>>> over unicast
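
For what it's worth, a toy user-space version of that relay idea might look
something like the sketch below, just to make it concrete. Every address, port
and the use of plain datagrams here is made up for illustration; real DRBD
replication traffic is a TCP stream with its own framing and acknowledgments,
which a real relay would have to handle.

#!/usr/bin/env python3
# Toy "unicast in, multicast out" relay: illustration only, not DRBD code.
import socket

LOCAL_ADDR = ("127.0.0.1", 7789)    # hypothetical local socket the sender writes to
MCAST_GROUP = ("239.1.1.1", 7790)   # hypothetical multicast group the peers join
MCAST_TTL = 1                       # keep the traffic on the local segment

def main():
    # Unicast receive socket: the "minimal change" side, the sender keeps
    # talking plain unicast to a single local endpoint.
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(LOCAL_ADDR)

    # Multicast transmit socket: the fan-out happens in the network,
    # not in the sender.
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, MCAST_TTL)

    while True:
        data, _ = rx.recvfrom(65535)
        tx.sendto(data, MCAST_GROUP)  # one transmission, any number of receivers

if __name__ == "__main__":
    main()

Acknowledgments from the peers would still have to come back over ordinary
unicast connections, which is exactly the control-path complexity discussed below.
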
>> This would probably still require many changes in the DRBD kernel module, add another layer of complexity and another component that can fail independently, and make the system as a whole harder to maintain and troubleshoot.
>>
>> Delay would probably also be rather unpredictable, because different threads in kernel and user space must be activated and paused frequently for the IPC to work. Linux, as a monolithic kernel, does not offer any specialized mechanism for direct low-latency context switches/thread activation in a chain of I/O servers, like the mechanisms found in most microkernels, or at least something in that general direction such as “door calls” in the SunOS kernel (the kernel of the Solaris OS).
>
> Well, I fully believe what you're saying, and my attempt to find "a 
> better solution" doesn't look too convincing :-)
>
>> Multicast in DRBD would certainly make sense in various scenarios, but it would probably have to be implemented directly in DRBD.
>
> Nice to hear this ;-)
>
>> Anyway, I don’t see that much difference between diskless nodes and nodes with storage. Any one of these nodes always sends write requests to all connected storage nodes; the only difference with diskless nodes is that they also use the replication link for reading data, whereas storage nodes usually read locally (load-balancing may cause read requests over the network too). So the only thing that would make write performance on a diskless node worse than write performance on a node with local storage would be network saturation due to lots of read requests putting load on the network.
>
> I see a somewhat different picture, in fact. I'm using a VM (its disk is 
> /dev/drbd/by-res/m1/0) to produce load. I launch the test VM (using virsh) 
> on different nodes and get the corresponding resource usage like this:
>
> # linstor resource list
> ╭─────────────────────────────────────────────────────────╮
> ┊ ResourceName ┊ Node  ┊ Port ┊ Usage  ┊ Conns ┊    State ┊
> ╞═════════════════════════════════════════════════════════╡
> ┊ m1           ┊ stor1 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊
> ┊ m1           ┊ stor2 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊
> ┊ m1           ┊ stor3 ┊ 7000 ┊ InUse  ┊ Ok    ┊ Diskless ┊
> ╰─────────────────────────────────────────────────────────╯
> # linstor resource list-volumes
> ╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
> ┊ Node  ┊ Resource ┊ StoragePool          ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊    State ┊
> ╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
> ┊ stor1 ┊ m1       ┊ drbdpool             ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊ 50.01 GiB ┊ Unused ┊ UpToDate ┊
> ┊ stor2 ┊ m1       ┊ drbdpool             ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊ 50.01 GiB ┊ Unused ┊ UpToDate ┊
> ┊ stor3 ┊ m1       ┊ DfltDisklessStorPool ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊           ┊ InUse  ┊ Diskless ┊
> ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
>
> When m1 is InUse on stor2 (a "disk" node) and I launch 'dd' there, I see 
> the following tcpdump output on the host (stor2):
>
> # tcpdump -i eno4 'src host stor2 and dst port 7000'
> [ ... lot of similar packets to stor1 and no packets to stor3 ... ]
> 18:59:22.640966 IP stor2.39897 > stor1.afs3-fileserver: Flags [.], seq 1073298560:1073363720, ack 1, win 18449, options [nop,nop,TS val 2730290186 ecr 2958297760], length 65160
> 18:59:22.641495 IP stor2.39897 > stor1.afs3-fileserver: Flags [P.], seq 1073363720:1073426344, ack 1, win 18449, options [nop,nop,TS val 2730290186 ecr 2958297761], length 62624
> 18:59:22.642053 IP stor2.39897 > stor1.afs3-fileserver: Flags [.], seq 1073426344:1073491504, ack 1, win 18449, options [nop,nop,TS val 2730290187 ecr 2958297761], length 65160
> 18:59:22.642606 IP stor2.39897 > stor1.afs3-fileserver: Flags [.], seq 1073491504:1073556664, ack 1, win 18449, options [nop,nop,TS val 2730290187 ecr 2958297761], length 65160
>
> When m1 is InUse on stor3 (the "diskless" node) and I launch 'dd' there, I 
> see the following tcpdump output on the host (stor3):
>
> # tcpdump -i eno4 'src host stor3 and dst port 7000'
> [ ... lot of similar packets to both stor1 and stor2 ... ]
> 19:05:56.451425 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069351304:1069416464, ack 16425, win 11444, options [nop,nop,TS val 3958888538 ecr 1765301734], length 65160
> 19:05:56.452077 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069416464:1069481624, ack 16425, win 11444, options [nop,nop,TS val 3958888539 ecr 1765301735], length 65160
> 19:05:56.452664 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069481624:1069546784, ack 16425, win 11444, options [nop,nop,TS val 3958888540 ecr 1765301736], length 65160
> 19:05:56.453324 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1071808472:1071873632, ack 61481, win 6365, options [nop,nop,TS val 1547141177 ecr 2878616029], length 65160
> 19:05:56.454142 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1071873632:1071938792, ack 61481, win 6365, options [nop,nop,TS val 1547141177 ecr 2878616029], length 65160
> 19:05:56.454926 IP stor3.40171 > stor2.afs3-fileserver: Flags [P.], seq 1071938792:1072002920, ack 61481, win 6365, options [nop,nop,TS val 1547141178 ecr 2878616030], length 64128
> 19:05:56.455700 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072002920:1072068080, ack 61481, win 6365, options [nop,nop,TS val 1547141179 ecr 2878616031], length 65160
> 19:05:56.456490 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069546784:1069611944, ack 16425, win 11444, options [nop,nop,TS val 3958888543 ecr 1765301739], length 65160
> 19:05:56.457121 IP stor3.59577 > stor1.afs3-fileserver: Flags [P.], seq 1069611944:1069676072, ack 16425, win 11444, options [nop,nop,TS val 3958888544 ecr 1765301740], length 64128
> 19:05:56.457730 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069676072:1069741232, ack 16425, win 11444, options [nop,nop,TS val 3958888545 ecr 1765301741], length 65160
> 19:05:56.458292 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069741232:1069806392, ack 16425, win 11444, options [nop,nop,TS val 3958888546 ecr 1765301741], length 65160
> 19:05:56.458939 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072068080:1072133240, ack 61481, win 6365, options [nop,nop,TS val 1547141182 ecr 2878616034], length 65160
> 19:05:56.459735 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072133240:1072198400, ack 61481, win 6365, options [nop,nop,TS val 1547141182 ecr 2878616034], length 65160
> 19:05:56.460598 IP stor3.40171 > stor2.afs3-fileserver: Flags [P.], seq 1072198400:1072261048, ack 61481, win 6365, options [nop,nop,TS val 1547141183 ecr 2878616035], length 62648
> 19:05:56.461097 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072261048:1072326208, ack 61481, win 6365, options [nop,nop,TS val 1547141184 ecr 2878616035], length 65160
> 19:05:56.461633 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072326208:1072391368, ack 61481, win 6365, options [nop,nop,TS val 1547141184 ecr 2878616036], length 65160
>
> And, using bmon (a realtime traffic rate monitor for Linux), I always see 
> about 1Gbps on the originating host and:
>
> - in the 1st case (originator is a "disk" node): about 1Gbps on the receiving host
> - in the 2nd case (originator is the "diskless" node): about 500Mbps on each 
> of the receiving hosts
>
> From what I see, I conclude that in the 1st case (originator is a "disk" 
> node) a single copy of the replicated data travels over the network, 
> while in the 2nd case (originator is the "diskless" node) two copies 
> travel over the network.
>
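
Your numbers add up. Here is a quick back-of-envelope check in Python (the
~1Gbps figure is just the NIC saturation you observed, nothing I measured):

# How many copies of each written block leave the originating host, and what
# that implies for the rate each receiving peer sees on a ~1 Gbps link.
NIC_GBPS = 1.0  # observed saturation on the originating host

def copies_on_the_wire(storage_replicas, primary_is_diskless):
    # A primary with local storage keeps one copy on its own disk and ships
    # the block to the other storage peers; a diskless primary has to ship
    # it to all of them.
    return storage_replicas if primary_is_diskless else storage_replicas - 1

for diskless in (False, True):
    copies = copies_on_the_wire(2, diskless)
    print(f"diskless primary: {diskless}, copies on the wire: {copies}, "
          f"~{NIC_GBPS / copies:.1f} Gbps per receiving peer")
# -> with the disked primary (stor2):   1 copy on the wire,  ~1.0 Gbps to stor1
# -> with the diskless primary (stor3): 2 copies on the wire, ~0.5 Gbps each
#    to stor1 and stor2

That matches the ~1Gbps vs ~500Mbps you see with bmon.
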

I guess the "solution" is to increase the available bandwidth between 
the diskless node and your "SAN". One option is using multiple ethernet 
connections, either bonded or dedicated to each destination (eg, one 
ethernet link to stor1, another to stor2, and a third for the "user" network).

Or use faster ethernet, such as 10G connections, to improve the 
performance.

All of these are simply workarounds for the underlying problem: if you had 5 
copies, then you would lose 80% of your bandwidth to the additional 
redundant copies, or need to add 5x the number of ethernet connections, 
or increase bandwidth by a factor of 5. So using multicast would improve 
performance, especially as you increase the number of replicas.
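
To put rough numbers on that, here is the same back-of-envelope model extended
to more replicas, assuming an idealised multicast where the sender transmits
each block exactly once (real multicast replication would still need per-peer
acknowledgments and retransmission handling):

# Fraction of a saturated sender NIC that is "useful" write bandwidth when a
# diskless primary replicates every block to N storage peers.
def useful_fraction(replicas, multicast):
    # unicast: the sender pushes one full copy per storage peer;
    # multicast: it transmits each block once, whatever the peer count.
    streams = 1 if multicast else replicas
    return 1.0 / streams

for n in (2, 3, 5):
    print(f"{n} replicas: unicast {useful_fraction(n, False):.0%}, "
          f"multicast {useful_fraction(n, True):.0%}")
# 2 replicas: unicast 50%, multicast 100%
# 3 replicas: unicast 33%, multicast 100%
# 5 replicas: unicast 20%, multicast 100%   (the "lose 80%" case)
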

I guess the other question is whether this is "resolved" by using DRBD 
Proxy, and therefore there isn't much interest in adding this type of 
feature to DRBD itself? If it isn't handled by DRBD Proxy, then hopefully 
it is something that could be considered for the DRBD roadmap "in the 
future"...

Regards,
Adam
