From rene.peinthor at linbit.com Thu Jul 2 07:10:10 2020 From: rene.peinthor at linbit.com (Rene Peinthor) Date: Thu, 2 Jul 2020 07:10:10 +0200 Subject: [DRBD-user] satellite daemon must non-3377 port for SSL? In-Reply-To: <5e446ccc-1086-65d1-beef-09556d85fd4f@physics.wisc.edu> References: <5e446ccc-1086-65d1-beef-09556d85fd4f@physics.wisc.edu> Message-ID: Hi! I don't know why you think that the satellite listens on port 3377? Here are the default port bindings: /* * Default ports */ public static final int DFLT_CTRL_PORT_SSL = 3377; public static final int DFLT_CTRL_PORT_PLAIN = 3376; public static final int DFLT_STLT_PORT_SSL = 3367; public static final int DFLT_STLT_PORT_PLAIN = 3366; It wouldn't make much sense to have a combined node, where you only can run either a controller or satellite... We have multiple setups where Controller and Satellite run on the same node. Best regards Rene On Thu, Jul 2, 2020 at 7:02 AM Chad William Seys wrote: > Hi, > I have a "Combined" controller/satellite node which I'm trying to set > up SSL on. > It appears that the controller binds port 3377 by default. This is > also the port the satellite listens to by default. When the node is > Combined, this causes problems connecting to the satellite daemon. > My hope was to have the controller bind to a non-3377 port so that > one would not have to specify a non-default port when creating a node. > However, I haven't been able to get the controller daemon to bind to > anything but 3377. > E.g. This does not work: > # cat /etc/linstor/linstor_controller.toml > [netcom] > type="ssl" > port=3388 > server_certificate="/etc/linstor/ssl/keystore.jks" > trusted_certificates="/etc/linstor/ssl/certificates.jks" > key_password="linstor" > keystore_password="linstor" > truststore_password="linstor" > ssl_protocol="TLSv1.2" > > > Thanks! > Chad. > _______________________________________________ > Star us on GITHUB: https://github.com/LINBIT > drbd-user mailing list > drbd-user at lists.linbit.com > https://lists.linbit.com/mailman/listinfo/drbd-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.altnoeder at linbit.com Thu Jul 2 22:29:00 2020 From: robert.altnoeder at linbit.com (Robert Altnoeder) Date: Thu, 2 Jul 2020 22:29:00 +0200 Subject: [DRBD-user] satellite daemon must non-3377 port for SSL? In-Reply-To: <5e446ccc-1086-65d1-beef-09556d85fd4f@physics.wisc.edu> References: <5e446ccc-1086-65d1-beef-09556d85fd4f@physics.wisc.edu> Message-ID: > On 25 Jun 2020, at 23:03, Chad William Seys wrote: > > Hi, > I have a "Combined" controller/satellite node which I'm trying to set > up SSL on. > It appears that the controller binds port 3377 by default. This is also the port the satellite listens to by default. When the node is Combined, this causes problems connecting to the satellite daemon. As Rene already quoted from the source code, by default, the controller listens on 3376 (plain) and 3377 (ssl), while the satellite listens on 3366 (plain) and 3367 (ssl). The controller?s so-called connectors are configured in its database as property values, in netcom/. Each connector has a port property that can be changed. Either use the LINSTOR client to change that property, e.g. netcom/SslConnector/port for the SSL connector that is configured by default, or if that does not work for whatever reason, start the controller interactively with the debug console (-D) and enter: SetCfgVal namespace(netcom) key(SslConnector/port) value(xxx) where xxx is the port number you want to set. 
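For example, with the port 3388 from the TOML file quoted above, the debug console command would look like this (just a sketch; substitute whichever port you actually want):

    SetCfgVal namespace(netcom) key(SslConnector/port) value(3388)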
Then restart the controller (enter ShtDwn and restart as usual with Pacemaker/systemctl/start script/etc.) Anyhow, that should not be necessary, because the controller and satellite modules were designed to run on the same node without causing TCP port collisions. br, Robert From cwseys at physics.wisc.edu Thu Jul 2 16:49:59 2020 From: cwseys at physics.wisc.edu (Chad William Seys) Date: Thu, 2 Jul 2020 09:49:59 -0500 Subject: [DRBD-user] satellite daemon must non-3377 port for SSL? In-Reply-To: References: Message-ID: Hmm, OK. After looking at things again, it looks like the actual problem is that 'linstor create' without -p sets the port to 3377 for the Satellite: # linstor node create --communication-type SSL vms20 --node-type Combined SUCCESS: Description: New node 'vms20' registered. Details: Node 'vms20' UUID is: af9198db-512a-4cf3-ad31-a1e79d416596 ERROR: Description: (Node: 'vms20') The requested function call cannot be executed. Cause: Common causes of this error are: - The function call name specified by the caller (client side) is incorrect - The requested function call was not loaded into the system (server side) Details: The requested function call name was 'Auth'. Node: vms20 Show reports: linstor error-reports show 5EFDF10B-00000-000001 root at vms20:~# linstor n l ??????????????????????????????????????????????????????????? ? Node ? NodeType ? Addresses ? State ? ??????????????????????????????????????????????????????????? ? vms20 ? COMBINED ? 128.104.164.119:3377 (SSL) ? OFFLINE ? ??????????????????????????????????????????????????????????? # with -p 3367: # linstor node create -p 3367 --communication-type SSL vms20 --node-type Combined SUCCESS: Description: New node 'vms20' registered. Details: Node 'vms20' UUID is: f5887821-3415-48bc-8d33-1cc4ac19efe3 SUCCESS: Description: Node 'vms20' authenticated Details: Supported storage providers: [diskless, lvm, lvm_thin, file, file_thin, openflex_target] Supported resource layers : [writecache, cache, openflex, storage] Unsupported storage providers: ZFS: 'cat /sys/module/zfs/version' returned with exit code 1 ZFS_THIN: 'cat /sys/module/zfs/version' returned with exit code 1 SPDK: IO exception occured when running 'rpc.py get_spdk_version': Cannot run program "rpc.py": error=2, No such file or directory Unsupported resource layers: DRBD: DRBD version has to be >= 9. Current DRBD version: 8.4.10 LUKS: IO exception occured when running 'cryptsetup --version': Cannot run program "cryptsetup": error=2, No such file or directory NVME: IO exception occured when running 'nvme version': Cannot run program "nvme": error=2, No such file or directory INFO: Linstor node name 'vms20' and hostname 'vms20.physics.wisc.edu' doesn't match. root at vms20:~# systemctl start linstor-satellite.service ^C root at vms20:~# linstor n l ?????????????????????????????????????????????????????????? ? Node ? NodeType ? Addresses ? State ? ?????????????????????????????????????????????????????????? ? vms20 ? COMBINED ? 128.104.164.119:3367 (SSL) ? Online ? ?????????????????????????????????????????????????????????? Thanks! Chad. 
From rene.peinthor at linbit.com Mon Jul 13 10:35:23 2020 From: rene.peinthor at linbit.com (Rene Peinthor) Date: Mon, 13 Jul 2020 10:35:23 +0200 Subject: [DRBD-user] linstor-server 1.7.2 release Message-ID: Hi All!
Before a big new feature release 1.8.0 will come, we decided to do another bug fix 1.7.2 release with cherry-picked fixes from master branch, anyway the 1.8.0 release SHOULD be ready in 1 or 2 weeks. linstor-server 1.7.2 -------------------- * Fixed dm-cache, losetup module dependencies * Prevent storage pool deletion if it has snapshots * Fix NPE on controller changes * Improve SSLTcpConnector exception handling * resume-io even if a snapshots fails, to prevent blocked resources * LVM also activates it LV's now * try to fix another TCP-connector dead-lock * add missing etcd storage-pool auto-select migration * allow mixing of external meta-data and data volume driver kinds * skip initial sync on VDO storage https://www.linbit.com/downloads/linstor/linstor-server-1.7.2.tar.gz Linstor PPA: https://launchpad.net/~linbit/+archive/ubuntu/linbit-drbd9-stack Cheers, Rene -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailinglists at websitemanagers.com.au Tue Jul 14 08:43:37 2020 From: mailinglists at websitemanagers.com.au (Adam Goryachev) Date: Tue, 14 Jul 2020 16:43:37 +1000 Subject: [DRBD-user] linstor issues In-Reply-To: <5c4f368d-5d93-65f6-d70b-88d44d3636ec@websitemanagers.com.au> References: <61f0f003-dd9c-1c36-4e09-e86516b0024a@websitemanagers.com.au> <5c4f368d-5d93-65f6-d70b-88d44d3636ec@websitemanagers.com.au> Message-ID: I'm having another crack at this, I think it will be worth it once it works. Firstly, another documentation error: https://www.linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-using_the_linstor_client > In case anything goes wrong with the storage pool?s VG/zPool, e.g. the > VG having been renamed or somehow became invalid you can delete the > storage pool in LINSTOR with the following command, given that only > resources with all their volumes in the so-called ?lost? storage pool > are attached. This feature is available since LINSTOR v0.9.13. > > # linstor storage-pool lost alpha pool_ssd linstor storage-pool lost castle vg_hdd usage: linstor storage-pool [-h] ??????????????????????????? {create, delete, list, list-properties, ??????????????????????????? set-property} ... linstor storage-pool: error: argument {create, delete, list, list-properties, set-property}: invalid choice: 'lost' (choose from 'create', 'c', 'delete', 'd', 'list', 'l', 'list-properties', 'lp', 'set-property', 'sp') Changing to use delete instead of lost: castle:~# linstor storage-pool delete castle vg_hdd ERROR: Description: ??? Storage pool definition 'vg_hdd' not found. Cause: ??? The specified storage pool definition 'vg_hdd' could not be found in the database Correction: ??? Create a storage pool definition 'vg_hdd' first. Details: ??? Node: castle, Storage pool name: vg_hdd Show reports: ??? linstor error-reports show 5F0D500C-00000-000000 castle:~# linstor storage-pool list ????????????????????????????????????????????????????????????????????????????????????????????????????????????? ? StoragePool????????? ? Node?? ? Driver?? ? PoolName ? FreeCapacity ? TotalCapacity ? CanSnapshots ? State ? ????????????????????????????????????????????????????????????????????????????????????????????????????????????? ? DfltDisklessStorPool ? castle ? DISKLESS ? ?????????????? ??????????????? ? False??????? ? Ok??? ? ? DfltDisklessStorPool ? san5?? ? DISKLESS ? ?????????????? ??????????????? ? False??????? ? Ok??? ? ? DfltDisklessStorPool ? san6?? ? DISKLESS ? ?????????????? ??????????????? ? False??????? ? Ok??? ? ? pool???????????????? ? castle ? LVM????? ? vg_hdd?? 
????? 2.95 TiB ?????? 3.44 TiB ? False??????? ? Ok??? ? ? pool???????????????? ? san5?? ? LVM????? ? vg_hdd?? ????? 3.87 TiB ?????? 4.36 TiB ? False??????? ? Ok??? ? ? pool???????????????? ? san6?? ? LVM????? ? vg_ssd?? ????? 1.26 TiB ?????? 1.75 TiB ? False??????? ? Ok??? ? ????????????????????????????????????????????????????????????????????????????????????????????????????????????? I was hoping I could just remove the storage pool from castle (since it doesn't seem to be working properly), and then destroy it, re-create it, and then re-add it and see if that solves the problem. However, while it seems to exist, it also doesn't (can't delete it). Possibly part of the cause of my original problem is that I have a script that automatically creates a snapshot for each LV, and this created a snapshot of testvm1_00000 named backup_testvm1_00000_blahblah.... I've now manually deleted that, and fixed my script to avoid messing with the VG allocated to linstor, but so far, there is no change in the current status (as per below). Would appreciate any suggestions on what might be going wrong, and/or how to fix it? Regards, Adam On 24/6/20 11:46, Adam Goryachev wrote: > > > On 23/6/20 21:53, G?bor Hern?di wrote: >> Hi, >> >> apparently something is quite broken... maybe it's somehow your setup >> or environment, I am not sure... >> >> linstor resource list >> ?????????????????????????????????????????????????????????????????????????????? >> ? ResourceName ? Node?? ? Port ? Usage? ? Conns?????????????????? >> ???? State ? >> ?????????????????????????????????????????????????????????????????????????????? >> ? testvm1????? ? castle ? 7000 ? ????????????????????????? ?? >> Unknown ? >> ? testvm1????? ? san5?? ? 7000 ? ????????????????????????? ?? >> Unknown ? >> ? testvm1????? ? san6?? ? 7000 ? Unused ? Connecting(san5,castle) >> ? UpToDate ? >> ?????????????????????????????????????????????????????????????????????????????? >> >> This looks like some kind of network issues. >> >> # linstor storage-pool list --groupby Size >> >> However, the second command produces a usage error (documentation >> bug perhaps). >> >> >> Thanks for reporting, we will look into this. >> >> WARNING: >> Description: >> ??? No active connection to satellite 'san5' >> Details: >> ??? The controller is trying to (re-) establish a connection to >> the satellite. The controller stored the changes and as soon the >> satellite is connected, it will receive this update. >> >> >> So Linstor has obviously no connection to satellite 'san5'. >> >> [95078.599813] drbd testvm1 castle: conn( Unconnected -> Connecting ) >> [95078.604454] drbd testvm1 san5: conn( Unconnected -> Connecting ) >> >> >> ... and DRBD apparently also has troubles connecting... >> >> linstor n l >> ????????????????????????????????????????????????????????????? >> ? Node?? ? NodeType? ? Addresses????????????????? ? State?? ? >> ????????????????????????????????????????????????????????????? >> ? castle ? SATELLITE ? 192.168.5.204:3366 >> (PLAIN) ? Unknown ? >> ? san5?? ? SATELLITE ? 192.168.5.205:3366 >> (PLAIN) ? Unknown ? >> ? san6?? ? SATELLITE ? 192.168.5.206:3366 >> (PLAIN) ? Unknown ? >> ????????????????????????????????????????????????????????????? >> >> >> Now? this is really strange. I will spare you with some details, but >> I assume you have triggered some bad exception in Linstor which >> somehow killed a necessary thread. >> You should check >> ?? linstor err list >> and see if you can find some related error reports. 
>> Also, restarting the controller might help you here. >> > Thank you! > > linstor err list showed a list of errors, but the contents didn't make > a lot of sense to me. Let me know if you are interested in them, and I > can send them. > > I did a systemctl restart linstor-controller.service on san6, and > things started looking much better. > > linstor n l > ???????????????????????????????????????????????????????????? > ? Node?? ? NodeType? ? Addresses????????????????? ? State? ? > ???????????????????????????????????????????????????????????? > ? castle ? SATELLITE ? 192.168.5.204:3366 (PLAIN) ? Online ? > ? san5?? ? SATELLITE ? 192.168.5.205:3366 (PLAIN) ? Online ? > ? san6?? ? SATELLITE ? 192.168.5.206:3366 (PLAIN) ? Online ? > ???????????????????????????????????????????????????????????? > > So, all nodes agree that they are now online and talking to each > other. I assume this proves there is no network issues. > > linstor resource list > ??????????????????????????????????????????????????????????????????????????????????? > ? ResourceName ? Node?? ? Port ? Usage? ? Conns ?????????????? State ? > ??????????????????????????????????????????????????????????????????????????????????? > ? testvm1????? ? castle ? 7000 ???????? ? ???????????? Unknown ? > ? testvm1????? ? san5?? ? 7000 ? Unused ? Connecting(castle) ? > SyncTarget(12.67%) ? > ? testvm1????? ? san6?? ? 7000 ? Unused ? Connecting(castle) > ??????????? UpToDate ? > ??????????????????????????????????????????????????????????????????????????????????? > > From this, it looks like san6 (the controller) thinks it has the up to > date data, probably based on the fact it was created there first or > something. The data is syncing to san5 (in progress, and progressing > steadily), so that is good also. However, castle doesn't seem to be > syncing/connecting. > > On castle, I see this: > > Jun 24 11:01:55 castle Satellite[7499]: 11:01:55.177 [DeviceManager] > ERROR LINSTOR/Satellite - SYSTEM - Failed to create meta-data for DRBD > volume testvm1/0 [Report number 5EF2A316-31431-000002] > > linstor err show give this: > > ERROR REPORT 5EF2A316-31431-000002 > > ============================================================ > > Application:??????????????????????? LINBIT? LINSTOR > Module:???????????????????????????? Satellite > Version:??????????????????????????? 1.7.1 > Build ID: 6760637d6fae7a5862103ced4ea0ab0a758861f9 > Build time:???????????????????????? 2020-05-14T13:14:11+00:00 > Error time:???????????????????????? 2020-06-24 11:01:55 > Node:?????????????????????????????? castle > > ============================================================ > > Reported error: > =============== > > Description: > ??? Failed to create meta-data for DRBD volume testvm1/0 > > Category:?????????????????????????? LinStorException > Class name:???????????????????????? VolumeException > Class canonical name: > com.linbit.linstor.storage.layer.exceptions.VolumeException > Generated at:?????????????????????? Method 'createMetaData', Source > file 'DrbdLayer.java', Line #995 > > Error message:????????????????????? Failed to create meta-data for > DRBD volume testvm1/0 > > Error context: > ??? An error occurred while processing resource 'Node: 'castle', Rsc: > 'testvm1'' > > Call backtrace: > > ??? Method?????????????????????????????????? Native Class:Line number > ??? createMetaData?????????????????????????? N > com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:995 > ??? adjustDrbd?????????????????????????????? 
N > com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:575 > ??? process????????????????????????????????? N > com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:373 > ??? process????????????????????????????????? N > com.linbit.linstor.core.devmgr.DeviceHandlerImpl:731 > ??? processResourcesAndSnapshots???????????? N > com.linbit.linstor.core.devmgr.DeviceHandlerImpl:300 > ??? dispatchResources??????????????????????? N > com.linbit.linstor.core.devmgr.DeviceHandlerImpl:138 > ??? dispatchResources??????????????????????? N > com.linbit.linstor.core.devmgr.DeviceManagerImpl:258 > ??? phaseDispatchDeviceHandlers????????????? N > com.linbit.linstor.core.devmgr.DeviceManagerImpl:896 > ??? devMgrLoop?????????????????????????????? N > com.linbit.linstor.core.devmgr.DeviceManagerImpl:618 > ??? run????????????????????????????????????? N > com.linbit.linstor.core.devmgr.DeviceManagerImpl:535 > ??? run????????????????????????????????????? N java.lang.Thread:834 > > Caused by: > ========== > > Description: > ??? Execution of the external command 'drbdadm' failed. > Cause: > ??? The external command exited with error code 1. > Correction: > ??? - Check whether the external program is operating properly. > ??? - Check whether the command line is correct. > ????? Contact a system administrator or a developer if the command > line is no longer valid > ????? for the installed version of the external program. > Additional information: > ??? The full command line executed was: > ??? drbdadm -vvv --max-peers 7 -- --force create-md testvm1/0 > > ??? The external command sent the following output data: > > > ??? The external command sent the following error information: > ??? no resources defined! > > > Category:?????????????????????????? LinStorException > Class name:???????????????????????? ExtCmdFailedException > Class canonical name: com.linbit.extproc.ExtCmdFailedException > Generated at:?????????????????????? Method 'execute', Source file > 'DrbdAdm.java', Line #550 > > Error message:????????????????????? The external command 'drbdadm' > exited with error code 1 > > > Call backtrace: > > ??? Method?????????????????????????????????? Native Class:Line number > ??? execute????????????????????????????????? N > com.linbit.linstor.storage.layer.adapter.drbd.utils.DrbdAdm:550 > ??? simpleAdmCommand???????????????????????? N > com.linbit.linstor.storage.layer.adapter.drbd.utils.DrbdAdm:495 > ??? createMd???????????????????????????????? N > com.linbit.linstor.storage.layer.adapter.drbd.utils.DrbdAdm:262 > ??? createMetaData?????????????????????????? N > com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:923 > ??? adjustDrbd?????????????????????????????? N > com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:575 > ??? process????????????????????????????????? N > com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:373 > ??? process????????????????????????????????? N > com.linbit.linstor.core.devmgr.DeviceHandlerImpl:731 > ??? processResourcesAndSnapshots???????????? N > com.linbit.linstor.core.devmgr.DeviceHandlerImpl:300 > ??? dispatchResources??????????????????????? N > com.linbit.linstor.core.devmgr.DeviceHandlerImpl:138 > ??? dispatchResources??????????????????????? N > com.linbit.linstor.core.devmgr.DeviceManagerImpl:258 > ??? phaseDispatchDeviceHandlers????????????? N > com.linbit.linstor.core.devmgr.DeviceManagerImpl:896 > ??? devMgrLoop?????????????????????????????? N > com.linbit.linstor.core.devmgr.DeviceManagerImpl:618 > ??? run????????????????????????????????????? 
N > com.linbit.linstor.core.devmgr.DeviceManagerImpl:535 > ??? run????????????????????????????????????? N java.lang.Thread:834 > > > END OF ERROR REPORT. > > Indeed, re-running the same command from the CLI provides the shown > error message: > > drbdadm -vvv --max-peers 7 -- --force create-md testvm1/0 > no resources defined! > > Some other random status information which may or may not be relevant... > > linstor storage-pool list > ????????????????????????????????????????????????????????????????????????????????????????????????????????????? > ? StoragePool????????? ? Node?? ? Driver?? ? PoolName ? FreeCapacity ? > TotalCapacity ? CanSnapshots ? State ? > ????????????????????????????????????????????????????????????????????????????????????????????????????????????? > ? DfltDisklessStorPool ? castle ? DISKLESS ? ?????????????? > ??????????????? ? False??????? ? Ok??? ? > ? DfltDisklessStorPool ? san5?? ? DISKLESS ? ?????????????? > ??????????????? ? False??????? ? Ok??? ? > ? DfltDisklessStorPool ? san6?? ? DISKLESS ? ?????????????? > ??????????????? ? False??????? ? Ok??? ? > ? pool???????????????? ? castle ? LVM????? ? vg_hdd?? ????? 2.95 TiB > ?????? 3.44 TiB ? False??????? ? Ok??? ? > ? pool???????????????? ? san5?? ? LVM????? ? vg_hdd?? ????? 3.87 TiB > ?????? 4.36 TiB ? False??????? ? Ok??? ? > ? pool???????????????? ? san6?? ? LVM????? ? vg_ssd?? ????? 1.26 TiB > ?????? 1.75 TiB ? False??????? ? Ok??? ? > ????????????????????????????????????????????????????????????????????????????????????????????????????????????? > > I've tried to restart linstor-satellite service on castle, but it > didn't make any difference. > > After a reboot of castle, and now I get this: > > linstor resource list > ?????????????????????????????????????????????????????????????????????? > ? ResourceName ? Node?? ? Port ? Usage? ? Conns ? State ? > ?????????????????????????????????????????????????????????????????????? > ? testvm1????? ? castle ? 7000 ? Unused ? Ok??? ? Diskless ? > ? testvm1????? ? san5?? ? 7000 ? Unused ? Ok??? ? SyncTarget(55.99%) ? > ? testvm1????? ? san6?? ? 7000 ? Unused ? Ok??? ? UpToDate ? > ?????????????????????????????????????????????????????????????????????? > > However, looking at the err reports, and I see the exactl same error > about creating the metadata on castle. > > One interesting thing is that the LV seems to have been created: > > lvs > ? /dev/drbd0: open failed: Wrong medium type > ? /dev/drbd1: open failed: Wrong medium type > ? LV??????????????????????????? VG????? Attr?????? LSize??? Pool > Origin Data%? Meta%? Move Log Cpy%Sync Convert > ? backup_system_20200624_062513 storage swi-a-s---??? 4.00g system 3.06 > ? system??????????????????????? storage owi-aos--- 5.00g > ? testvm1_00000???????????????? vg_hdd? -wi-a----- <500.11g > > Any suggestions on where to look next? Or what I might have done wrong > now? > > Regards, > Adam > > > > > > > _______________________________________________ > Star us on GITHUB: https://github.com/LINBIT > drbd-user mailing list > drbd-user at lists.linbit.com > https://lists.linbit.com/mailman/listinfo/drbd-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Wed Jul 15 17:32:32 2020 From: lists at alteeve.ca (Digimer) Date: Wed, 15 Jul 2020 11:32:32 -0400 Subject: [DRBD-user] Weird systemd / pacemaker stop crash issue Message-ID: DRBD 9.0.23, utils 9.13, RHEL 8. I've found a really odd issue where stopping pacemaker causes 'systemctl stop drbd' to hang. 
This is with drbd configured as a systemd resource and with no DRBD resources yet configured (in DRBD or pacemaker). The global-common.conf is standard (as created when installed). In short; While pacemaker is running, the DRBD resource can be stopped and started fine. However, if the DRBD resource is running when you stop the cluster, it hangs. Once this hang happens, even outside pacemaker, you can't stop drbd ('systemctl stop drbd' hangs when called in another terminal). Now here is where it gets really weird... If you stop the DRBD resource in pacemaker, then stop the cluster, it's fine. More over, the crash never happens again. You can start the cluster back up, then stop it with DRBD running and the daemon stops cleanly. To recreate the crash, I destroy the pacemaker cluster and reconfigure systemd:drbd and the crash returns until one stop of the cluster with DRBD already stopped, then the crash doesn't happen again. Below are links to a series of pacemaker.log files (with debugging on). The first pair starts with the initial config of pacemaker up to the crash on shutdown. The second was fixing a stonith issue and then repeating the crash. The third is a clean start to crash. The fourth shows that drbd can be stopped and started with pacemaker, and still crash. The last is where drbd is stopped when pacemaker stops, which doesn't crash. After this, everything works fine from then on. Initial pacemaker config to DRBD crash; node 1 - https://pastebin.com/raw/7QQGHW5g node 2 - https://pastebin.com/raw/a1BYVzM7 Fix fence issue, repeat test, DRBD crash; node 1 - https://pastebin.com/raw/1fZH1SSP nod2 2 - https://pastebin.com/raw/p7i3absC Fresh cluster start, crash on stop; node 1 - https://pastebin.com/raw/gYNiMHDC node 2 - https://pastebin.com/raw/VzkaPyEb DRBD resource stopped, started, stop pacemaker, crash; node 1 - https://pastebin.com/raw/LDTpmncY node 2 - https://pastebin.com/raw/ryj2J6Qt DRBD stopped resource, stop cluster, start cluster, stop cluster OK node 1 - https://pastebin.com/raw/haECJz8y node 2 - https://pastebin.com/raw/tBSD0ZyJ -- Digimer Papers and Projects: https://alteeve.com/w/ "I am, somehow, less interested in the weight and convolutions of Einstein?s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould From kvapss at gmail.com Tue Jul 21 17:12:37 2020 From: kvapss at gmail.com (kvaps) Date: Tue, 21 Jul 2020 17:12:37 +0200 Subject: [DRBD-user] VMs randomly become to read-only mode Message-ID: Hi, time to time we're facing with the weird situations when VMs are coming to readonly mode: Eg. /dev/vda is allowed to read but not for write: (initramfs) dd if=/dev/vda of=/tmp/1 bs=1 count=1 1+0 records in 1+0 records out (initramfs) dd if=/tmp/1 if=/dev/vda bs=1 count=1 [ 1335.967438] print_req_error: I/O error, dev vda, sector 0 [ 1335.969843] Buffer I/O error on dev vda, logical block 0, lost async page write 1+0 records in 1+0 records out (initramfs) Today about 8 VMs in different clusters become to this state, so I had a time to experiment with them: - dmesg messages are clean - drbd state is seems ok: one-vm-604-disk-2 role:Primary disk:UpToDate m1c10 role:Secondary peer-disk:UpToDate m1c7 role:Secondary peer-disk:Diskless - from the host side the disk is accessible and writable - if VM has many disks all of them will become readonly - simple reboot does not solves the problem, poweroff then resume does. 
- virsh domfsthaw returns: error: Unable to thaw filesystems error: Guest agent is not responding: QEMU guest agent is not connected Has anyone the similar cases? Best Regards, Andrei Kvapil From kvapss at gmail.com Tue Jul 21 17:50:43 2020 From: kvapss at gmail.com (kvaps) Date: Tue, 21 Jul 2020 17:50:43 +0200 Subject: [DRBD-user] VMs randomly become to read-only mode In-Reply-To: References: Message-ID: I was wrong about few facts: - if VM has many disks all of them will become readonly I just found the VM with the same symptoms but only one disk were in read-only mode - virsh domfsthaw returns error for this vm it is working correctly, but no change and disk is still not writable afterwards I can also append that usually this is happening after the snapshot operations on drive (we are using linstor with lvmthin backend) drbd version: version: 9.0.23-1 (api:2/proto:86-116) GIT-hash: d16bfab7a4033024fed2d99d3b179aa6bb6eb300 build by @runner-cpuncbtu-project-69-concurrent-15tjkf, 2020-06-09 23:46:40 Transports (api:16): tcp (9.0.23-1) Best Regards, Andrei Kvapil On Tue, Jul 21, 2020 at 5:12 PM kvaps wrote: > > Hi, time to time we're facing with the weird situations when VMs are > coming to readonly mode: > > Eg. /dev/vda is allowed to read but not for write: > > (initramfs) dd if=/dev/vda of=/tmp/1 bs=1 count=1 > 1+0 records in > 1+0 records out > (initramfs) dd if=/tmp/1 if=/dev/vda bs=1 count=1 > [ 1335.967438] print_req_error: I/O error, dev vda, sector 0 > [ 1335.969843] Buffer I/O error on dev vda, logical block 0, lost > async page write > 1+0 records in > 1+0 records out > (initramfs) > > Today about 8 VMs in different clusters become to this state, so I had > a time to experiment with them: > > - dmesg messages are clean > - drbd state is seems ok: > > one-vm-604-disk-2 role:Primary > disk:UpToDate > m1c10 role:Secondary > peer-disk:UpToDate > m1c7 role:Secondary > peer-disk:Diskless > > - from the host side the disk is accessible and writable > - if VM has many disks all of them will become readonly > - simple reboot does not solves the problem, poweroff then resume does. > - virsh domfsthaw returns: > > error: Unable to thaw filesystems > error: Guest agent is not responding: QEMU guest agent is not connected > > Has anyone the similar cases? > > Best Regards, > Andrei Kvapil From roland.kammerer at linbit.com Wed Jul 22 10:06:11 2020 From: roland.kammerer at linbit.com (Roland Kammerer) Date: Wed, 22 Jul 2020 10:06:11 +0200 Subject: [DRBD-user] VMs randomly become to read-only mode In-Reply-To: References: Message-ID: <20200722080611.GN1888@rck.sh> On Tue, Jul 21, 2020 at 05:12:37PM +0200, kvaps wrote: > Hi, time to time we're facing with the weird situations when VMs are > coming to readonly mode: > > - dmesg messages are clean Can you double check the dmesg? It sounds a bit like they might have lost quorum. Should be easily grep-able. Best, rck From rene.peinthor at linbit.com Wed Jul 22 16:05:56 2020 From: rene.peinthor at linbit.com (Rene Peinthor) Date: Wed, 22 Jul 2020 16:05:56 +0200 Subject: [DRBD-user] linstor-server 1.7.3 release Message-ID: Hi All! We are still busy fixing the last bits for 1.8.0 release and will just do another 1.7.3 bugfix release before. 
linstor-server 1.7.3 -------------------- * Added storagepool property to change lvcreate behavior: 'StorDriver/LvcreateOptions' * Fixed losetup version parsing without PATCH level * Fixed node reconnect deadlock https://www.linbit.com/downloads/linstor/linstor-server-1.7.3.tar.gz Linstor PPA: https://launchpad.net/~linbit/+archive/ubuntu/linbit-drbd9-stack Cheers, Rene -------------- next part -------------- An HTML attachment was scrubbed... URL: From mike at angani.co Wed Jul 22 15:16:38 2020 From: mike at angani.co (Manuthu M) Date: Wed, 22 Jul 2020 16:16:38 +0300 Subject: [DRBD-user] Recover Data from inconsistent nodes Message-ID: Hi all, Please assist on recovering data from a DRBD MySQL node that failed to come up after a maintenance window. Currently we have a setup of 2 mysql nodes (DB01 and DB02) using pacemaker and DRBD. Earlier today, we needed to put on standby the primary node - DB02 in this case and use DB01. Pacemaker succeeded in promoting the secondary node (DB01) to primary and all other services under it. After about ~ 45 minutes of usage, we noticed inconsistencies in the data. Further investigations revealed that the data was missing from about 2 weeks ago (07th July). Which means that the sync had somehow stopped between the two nodes. Worth noting is that during this time the nodes had started syncing before we noticed. [root at db01 ~]# cat /proc/drbd version: 8.4.11-1 (api:1/proto:86-101) GIT-hash: 66145a308421e9c124ec391a7848ac20203bb03c build by mockbuild@, 2020-04-05 02:58:18 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----- ns:29240140 nr:0 dw:21665384 dr:10156909 al:1027 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:6974196 [=========>..........] sync'ed: 52.5% (6808/14316)M finish: 5:19:38 speed: 360 (1,480) K/sec We attempted to revert to the previous setup by putting the current node on standby and promoting the secondary to primary but failed to come back online. We now have a old copy of the database from 2 weeks ago running on the primary. While we have nightly backups we have lost a significant amount of data from between last night backup to the time the issue arose. Is it possible to recover this and what is the best approach to doing this. Here is the DRBD config file [root at db02 ~]# cat /etc/drbd.d/mysql01.res resource mysql01 { protocol C; meta-disk internal; device /dev/drbd0; disk /dev/centos_db-02/db02; handlers { split-brain "/usr/lib/drbd/notify-split-brain.sh root"; } net { allow-two-primaries no; after-sb-0pri discard-zero-changes; after-sb-2pri disconnect; rr-conflict disconnect; } disk { on-io-error detach; } syncer { verify-alg sha1; } on db01 { address 172.19.5.8:7789; } on db02 { address 172.19.5.9:7789; } } Thanks in advance, Mike(null) From johannes at johannesthoma.com Thu Jul 23 14:24:21 2020 From: johannes at johannesthoma.com (Johannes Thoma) Date: Thu, 23 Jul 2020 14:24:21 +0200 Subject: [DRBD-user] WinDRBD 1.0.0-rc1 released Message-ID: Dear DRBD and WinDRBD users, This is the first release canidate of the upcoming 1.0.0 release of the WinDRBD driver and userland utilities. Please download a signed driver (signed by Linbit) from the Linbit software download page: https://www.linbit.com/linbit-software-download-page-for-linstor-and-drbd-linux-driver/ This version is feature complete except the boot-via-WinDRBD feature, which is still experimental. It will be part of the 1.1.0 development branch. Please help with testing so the 1.0.0 release becomes as stable as possible. 
Thanks for your contribution, - Johannes -------------- next part -------------- An HTML attachment was scrubbed... URL: From juan.sevilla.11 at gmail.com Thu Jul 23 09:19:14 2020 From: juan.sevilla.11 at gmail.com (Juan Sevilla) Date: Thu, 23 Jul 2020 09:19:14 +0200 Subject: [DRBD-user] Down sync Message-ID: Hi, My configuration is this: A) Node drbd01: primary all B) Node drbd02: primary all C) Node drbd03: secondary all, diskless, for quorum proposal. Initially all run correctly, but after various hours the sync between drbd nodes is lost, in spite of the connections (ping) on the networks is ok. Some times, the witness (node drbd03) appears "connecting" to drbd01, another times is the node drbd02, etc. My OS is RHEL 7, and firewalld is stopped and disabled, also SELinux is disabled... What could be happening? [root at drbd01 drbd.d]# uname -a > Linux drbd01 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 > x86_64 x86_64 x86_64 GNU/Linux > [root at drbd01 drbd.d]# > [root at drbd01 drbd.d]# cat global_common.conf > global { > usage-count no; > udev-always-use-vnr; > } > common { > handlers { > } > startup { > } > options { > quorum majority; > # on-no-quorum io-error; > # quorum-minimum-redundancy 1; > } > disk { > } > net { > verify-alg crc32c; > } > } > [root at drbd01 drbd.d]# cat *.res |more > resource DATA01 { > volume 1 { > disk /dev/sdf; > device /dev/drbd4; > meta-disk internal; > } > on drbd01 { > address 10.10.10.1:7791; > node-id 0; > } > on drbd02 { > address 10.10.10.2:7791; > node-id 1; > } > on drbd03 { > address 10.10.10.3:7791; > node-id 2; > volume 1 { > disk none; > } > > > } > connection-mesh { > hosts drbd01 drbd02 drbd03; > net { > protocol C; > allow-two-primaries yes; > } > } > > } > > resource DATA02 { > volume 1 { > disk /dev/sdg; > device /dev/drbd5; > meta-disk internal; > } > on drbd01 { > address 10.10.10.1:7792; > node-id 0; > } > on drbd02 { > address 10.10.10.2:7792; > node-id 1; > } > on drbd03 { > address 10.10.10.3:7792; > node-id 2; > volume 1 { > disk none; > } > > > } > connection-mesh { > hosts drbd01 drbd02 drbd03; > net { > protocol C; > allow-two-primaries yes; > } > } > > } > > resource DATA03 { > volume 1 { > disk /dev/sdh; > device /dev/drbd6; > meta-disk internal; > } > on drbd01 { > address 10.10.10.1:7793; > node-id 0; > } > on drbd02 { > address 10.10.10.2:7793; > node-id 1; > } > on drbd03 { > address 10.10.10.3:7793; > node-id 2; > volume 1 { > disk none; > } > > > } > connection-mesh { > hosts drbd01 drbd02 drbd03; > net { > protocol C; > allow-two-primaries yes; > } > } > > } > > resource GIMR01 { > volume 1 { > disk /dev/sde; > device /dev/drbd3; > meta-disk internal; > } > on drbd01 { > address 10.10.10.1:7790; > node-id 0; > } > on drbd02 { > address 10.10.10.2:7790; > node-id 1; > } > on drbd03 { > address 10.10.10.3:7790; > node-id 2; > volume 1 { > disk none; > } > > > } > connection-mesh { > hosts drbd01 drbd02 drbd03; > net { > protocol C; > allow-two-primaries yes; > } > } > > } > resource MIGRA01 { > volume 1 { > disk /dev/sdi; > device /dev/drbd7; > meta-disk internal; > } > on drbd01 { > address 10.10.10.1:7794; > node-id 0; > } > on drbd02 { > address 10.10.10.2:7794; > node-id 1; > } > on drbd03 { > address 10.10.10.3:7794; > node-id 2; > volume 1 { > disk none; > } > > > } > connection-mesh { > hosts drbd01 drbd02 drbd03; > net { > protocol C; > allow-two-primaries yes; > } > } > > } > > resource MIGRA02 { > volume 1 { > disk /dev/sdj; > device /dev/drbd8; > meta-disk internal; > } > on drbd01 { > address 
10.10.10.1:7795; > node-id 0; > } > on drbd02 { > address 10.10.10.2:7795; > node-id 1; > } > on drbd03 { > address 10.10.10.3:7795; > node-id 2; > volume 1 { > disk none; > } > > > } > connection-mesh { > hosts drbd01 drbd02 drbd03; > net { > protocol C; > allow-two-primaries yes; > } > } > > } > > resource MIGRA03 { > volume 1 { > disk /dev/sdk; > device /dev/drbd9; > meta-disk internal; > } > on drbd01 { > address 10.10.10.1:7796; > node-id 0; > } > on drbd02 { > address 10.10.10.2:7796; > node-id 1; > } > on drbd03 { > address 10.10.10.3:7796; > node-id 2; > volume 1 { > disk none; > } > > > } > connection-mesh { > hosts drbd01 drbd02 drbd03; > net { > protocol C; > allow-two-primaries yes; > } > } > > } > > resource MIGRA04 { > volume 1 { > disk /dev/sdl; > device /dev/drbd10; > meta-disk internal; > } > on drbd01 { > address 10.10.10.1:7797; > node-id 0; > } > on drbd02 { > address 10.10.10.2:7797; > node-id 1; > } > on drbd03 { > address 10.10.10.3:7797; > node-id 2; > volume 1 { > disk none; > } > > > } > connection-mesh { > hosts drbd01 drbd02 drbd03; > net { > protocol C; > allow-two-primaries yes; > } > } > > } > > resource OCR01 { > volume 1 { > disk /dev/sdb; > device /dev/drbd0; > meta-disk internal; > } > on drbd01 { > address 10.10.10.1:7787; > node-id 0; > } > on drbd02 { > address 10.10.10.2:7787; > node-id 1; > } > on drbd03 { > address 10.10.10.3:7787; > node-id 2; > volume 1 { > disk none; > } > > > } > connection-mesh { > hosts drbd01 drbd02 drbd03; > net { > protocol C; > allow-two-primaries yes; > } > } > > } > resource OCR02 { > volume 1 { > disk /dev/sdc; > device /dev/drbd1; > meta-disk internal; > } > on drbd01 { > address 10.10.10.1:7788; > node-id 0; > } > on drbd02 { > address 10.10.10.2:7788; > node-id 1; > } > on drbd03 { > address 10.10.10.3:7788; > node-id 2; > volume 1 { > disk none; > } > > > } > connection-mesh { > hosts drbd01 drbd02 drbd03; > net { > protocol C; > allow-two-primaries yes; > } > } > > } > > resource OCR03 { > volume 1 { > disk /dev/sdd; > device /dev/drbd2; > meta-disk internal; > } > on drbd01 { > address 10.10.10.1:7789; > node-id 0; > } > on drbd02 { > address 10.10.10.2:7789; > node-id 1; > } > on drbd03 { > address 10.10.10.3:7789; > node-id 2; > volume 1 { > disk none; > } > > > } > connection-mesh { > hosts drbd01 drbd02 drbd03; > net { > protocol C; > allow-two-primaries yes; > } > } > > } > Best regards. Juan. -------------- next part -------------- An HTML attachment was scrubbed... URL: From juan.sevilla.11 at gmail.com Thu Jul 23 10:31:42 2020 From: juan.sevilla.11 at gmail.com (Juan Sevilla) Date: Thu, 23 Jul 2020 10:31:42 +0200 Subject: [DRBD-user] Down sync In-Reply-To: References: Message-ID: The resource drbd02 is just now down between drbd02 and drbd03. Where can i review the more logs?? Thanks in advance Jul 23 10:05:57 drbd02 kernel: drbd MIGRA02 drbd03: meta connection shut > down by peer. 
> Jul 23 10:05:57 drbd02 kernel: drbd MIGRA02 drbd03: conn( Connected -> > NetworkFailure ) peer( Secondary -> Unknown ) > Jul 23 10:05:57 drbd02 kernel: drbd MIGRA02/1 drbd8 drbd03: pdsk( Diskless > -> DUnknown ) repl( Established -> Off ) > Jul 23 10:05:57 drbd02 kernel: drbd MIGRA02 drbd03: ack_receiver terminated > Jul 23 10:05:57 drbd02 kernel: drbd MIGRA02 drbd03: Terminating ack_recv > thread > Jul 23 10:05:57 drbd02 kernel: drbd MIGRA02 drbd03: sock was shut down by > peer > Jul 23 10:05:57 drbd02 kernel: drbd MIGRA02 drbd03: Restarting sender > thread > Jul 23 10:05:57 drbd02 kernel: drbd MIGRA02 drbd03: Connection closed > Jul 23 10:05:57 drbd02 kernel: drbd MIGRA02 drbd03: conn( NetworkFailure > -> Unconnected ) > Jul 23 10:05:57 drbd02 kernel: drbd MIGRA02 drbd03: Restarting receiver > thread > Jul 23 10:05:57 drbd02 kernel: drbd MIGRA02 drbd03: conn( Unconnected -> > Connecting ) > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: Handshake to peer 2 > successful: Agreed network protocol version 117 > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: Feature flags enabled > on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES. > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: Starting ack_recv > thread (from drbd_r_MIGRA02 [2695]) > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02: Preparing cluster-wide state > change 1863242544 (1->2 499/145) > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02: Declined by peer drbd01 (id: > 0), see the kernel log there > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02: Aborting cluster-wide state > change 1863242544 (19ms) rv = -10 > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: Failure to connect; > retrying > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: conn( Connecting -> > NetworkFailure ) > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: ack_receiver terminated > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: Terminating ack_recv > thread > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: Restarting sender > thread > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: Connection closed > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: conn( NetworkFailure > -> Unconnected ) > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: Restarting receiver > thread > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: conn( Unconnected -> > Connecting ) > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: Handshake to peer 2 > successful: Agreed network protocol version 117 > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: Feature flags enabled > on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES. > Jul 23 10:05:58 drbd02 kernel: drbd MIGRA02 drbd03: Starting ack_recv > thread (from drbd_r_MIGRA02 [2695]) > Jul 23 10:05:59 drbd02 kernel: drbd MIGRA02: Preparing cluster-wide state > change 1892110034 (1->2 499/145) > Jul 23 10:05:59 drbd02 kernel: drbd MIGRA02: Declined by peer drbd01 (id: > 0), see the kernel log there > Jul 23 10:05:59 drbd02 kernel: drbd MIGRA02: Aborting cluster-wide state > change 1892110034 (0ms) rv = -10 > Jul 23 10:05:59 drbd02 kernel: drbd MIGRA02 drbd03: Failure to connect; > retrying > Jul 23 10:05:59 drbd02 kernel: drbd MIGRA02 drbd03: conn( Connecting -> > NetworkFailure ) .......... El jue., 23 jul. 2020 a las 9:19, Juan Sevilla () escribi?: > Hi, > > My configuration is this: > > A) Node drbd01: primary all > > B) Node drbd02: primary all > > C) Node drbd03: secondary all, diskless, for quorum proposal. 
> > Initially all run correctly, but after various hours the sync between drbd > nodes is lost, in spite of the connections (ping) on the networks is ok. > > Some times, the witness (node drbd03) appears "connecting" to drbd01, > another times is the node drbd02, etc. My OS is RHEL 7, and firewalld is > stopped and disabled, also SELinux is disabled... > > What could be happening? > > > [root at drbd01 drbd.d]# uname -a >> Linux drbd01 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 >> x86_64 x86_64 x86_64 GNU/Linux >> [root at drbd01 drbd.d]# >> [root at drbd01 drbd.d]# cat global_common.conf >> global { >> usage-count no; >> udev-always-use-vnr; >> } >> common { >> handlers { >> } >> startup { >> } >> options { >> quorum majority; >> # on-no-quorum io-error; >> # quorum-minimum-redundancy 1; >> } >> disk { >> } >> net { >> verify-alg crc32c; >> } >> } >> [root at drbd01 drbd.d]# cat *.res |more >> resource DATA01 { >> volume 1 { >> disk /dev/sdf; >> device /dev/drbd4; >> meta-disk internal; >> } >> on drbd01 { >> address 10.10.10.1:7791; >> node-id 0; >> } >> on drbd02 { >> address 10.10.10.2:7791; >> node-id 1; >> } >> on drbd03 { >> address 10.10.10.3:7791; >> node-id 2; >> volume 1 { >> disk none; >> } >> >> >> } >> connection-mesh { >> hosts drbd01 drbd02 drbd03; >> net { >> protocol C; >> allow-two-primaries yes; >> } >> } >> >> } >> >> resource DATA02 { >> volume 1 { >> disk /dev/sdg; >> device /dev/drbd5; >> meta-disk internal; >> } >> on drbd01 { >> address 10.10.10.1:7792; >> node-id 0; >> } >> on drbd02 { >> address 10.10.10.2:7792; >> node-id 1; >> } >> on drbd03 { >> address 10.10.10.3:7792; >> node-id 2; >> volume 1 { >> disk none; >> } >> >> >> } >> connection-mesh { >> hosts drbd01 drbd02 drbd03; >> net { >> protocol C; >> allow-two-primaries yes; >> } >> } >> >> } >> >> resource DATA03 { >> volume 1 { >> disk /dev/sdh; >> device /dev/drbd6; >> meta-disk internal; >> } >> on drbd01 { >> address 10.10.10.1:7793; >> node-id 0; >> } >> on drbd02 { >> address 10.10.10.2:7793; >> node-id 1; >> } >> on drbd03 { >> address 10.10.10.3:7793; >> node-id 2; >> volume 1 { >> disk none; >> } >> >> >> } >> connection-mesh { >> hosts drbd01 drbd02 drbd03; >> net { >> protocol C; >> allow-two-primaries yes; >> } >> } >> >> } >> >> resource GIMR01 { >> volume 1 { >> disk /dev/sde; >> device /dev/drbd3; >> meta-disk internal; >> } >> on drbd01 { >> address 10.10.10.1:7790; >> node-id 0; >> } >> on drbd02 { >> address 10.10.10.2:7790; >> node-id 1; >> } >> on drbd03 { >> address 10.10.10.3:7790; >> node-id 2; >> volume 1 { >> disk none; >> } >> >> >> } >> connection-mesh { >> hosts drbd01 drbd02 drbd03; >> net { >> protocol C; >> allow-two-primaries yes; >> } >> } >> >> } >> resource MIGRA01 { >> volume 1 { >> disk /dev/sdi; >> device /dev/drbd7; >> meta-disk internal; >> } >> on drbd01 { >> address 10.10.10.1:7794; >> node-id 0; >> } >> on drbd02 { >> address 10.10.10.2:7794; >> node-id 1; >> } >> on drbd03 { >> address 10.10.10.3:7794; >> node-id 2; >> volume 1 { >> disk none; >> } >> >> >> } >> connection-mesh { >> hosts drbd01 drbd02 drbd03; >> net { >> protocol C; >> allow-two-primaries yes; >> } >> } >> >> } >> >> resource MIGRA02 { >> volume 1 { >> disk /dev/sdj; >> device /dev/drbd8; >> meta-disk internal; >> } >> on drbd01 { >> address 10.10.10.1:7795; >> node-id 0; >> } >> on drbd02 { >> address 10.10.10.2:7795; >> node-id 1; >> } >> on drbd03 { >> address 10.10.10.3:7795; >> node-id 2; >> volume 1 { >> disk none; >> } >> >> >> } >> connection-mesh { >> hosts drbd01 
drbd02 drbd03; >> net { >> protocol C; >> allow-two-primaries yes; >> } >> } >> >> } >> >> resource MIGRA03 { >> volume 1 { >> disk /dev/sdk; >> device /dev/drbd9; >> meta-disk internal; >> } >> on drbd01 { >> address 10.10.10.1:7796; >> node-id 0; >> } >> on drbd02 { >> address 10.10.10.2:7796; >> node-id 1; >> } >> on drbd03 { >> address 10.10.10.3:7796; >> node-id 2; >> volume 1 { >> disk none; >> } >> >> >> } >> connection-mesh { >> hosts drbd01 drbd02 drbd03; >> net { >> protocol C; >> allow-two-primaries yes; >> } >> } >> >> } >> >> resource MIGRA04 { >> volume 1 { >> disk /dev/sdl; >> device /dev/drbd10; >> meta-disk internal; >> } >> on drbd01 { >> address 10.10.10.1:7797; >> node-id 0; >> } >> on drbd02 { >> address 10.10.10.2:7797; >> node-id 1; >> } >> on drbd03 { >> address 10.10.10.3:7797; >> node-id 2; >> volume 1 { >> disk none; >> } >> >> >> } >> connection-mesh { >> hosts drbd01 drbd02 drbd03; >> net { >> protocol C; >> allow-two-primaries yes; >> } >> } >> >> } >> >> resource OCR01 { >> volume 1 { >> disk /dev/sdb; >> device /dev/drbd0; >> meta-disk internal; >> } >> on drbd01 { >> address 10.10.10.1:7787; >> node-id 0; >> } >> on drbd02 { >> address 10.10.10.2:7787; >> node-id 1; >> } >> on drbd03 { >> address 10.10.10.3:7787; >> node-id 2; >> volume 1 { >> disk none; >> } >> >> >> } >> connection-mesh { >> hosts drbd01 drbd02 drbd03; >> net { >> protocol C; >> allow-two-primaries yes; >> } >> } >> >> } >> resource OCR02 { >> volume 1 { >> disk /dev/sdc; >> device /dev/drbd1; >> meta-disk internal; >> } >> on drbd01 { >> address 10.10.10.1:7788; >> node-id 0; >> } >> on drbd02 { >> address 10.10.10.2:7788; >> node-id 1; >> } >> on drbd03 { >> address 10.10.10.3:7788; >> node-id 2; >> volume 1 { >> disk none; >> } >> >> >> } >> connection-mesh { >> hosts drbd01 drbd02 drbd03; >> net { >> protocol C; >> allow-two-primaries yes; >> } >> } >> >> } >> >> resource OCR03 { >> volume 1 { >> disk /dev/sdd; >> device /dev/drbd2; >> meta-disk internal; >> } >> on drbd01 { >> address 10.10.10.1:7789; >> node-id 0; >> } >> on drbd02 { >> address 10.10.10.2:7789; >> node-id 1; >> } >> on drbd03 { >> address 10.10.10.3:7789; >> node-id 2; >> volume 1 { >> disk none; >> } >> >> >> } >> connection-mesh { >> hosts drbd01 drbd02 drbd03; >> net { >> protocol C; >> allow-two-primaries yes; >> } >> } >> >> } >> > > > Best regards. > Juan. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rob.vanderwal at surf.nl Thu Jul 23 15:01:23 2020 From: rob.vanderwal at surf.nl (Rob van der Wal) Date: Thu, 23 Jul 2020 15:01:23 +0200 Subject: [DRBD-user] drbd-9.0.23-1: make kmp-rpm fails for SLES 15 SP2 Message-ID: Hi, After updating from SLES15 SP1 to SLES15 SP2 the "make kmp-rpm" fails with: ..... calling /usr/lib/rpm/brp-suse.d/brp-99-compress-vmlinux calling /usr/lib/rpm/brp-suse.d/brp-99-pesign No buildservice signing certificate Creating /home/drbdusr/rpmbuild/OTHER/drbd-kernel.cpio.rsasign 53043 blocks Processing files: drbd-kernel-9.0.23-1.x86_64 warning: File listed twice: /lib/modules/5.3.18-22-default/updates/drbd.ko warning: File listed twice: /lib/modules/5.3.18-22-default/updates/drbd_transport_tcp.ko awk: cmd. line:4: (FILENAME=- FNR=1) warning: gensub: third argument `' treated as 1 awk: cmd. line:4: (FILENAME=- FNR=2) warning: gensub: third argument `' treated as 1 awk: cmd. line:4: (FILENAME=- FNR=3) warning: gensub: third argument `' treated as 1 awk: cmd. line:4: (FILENAME=- FNR=4) warning: gensub: third argument `' treated as 1 awk: cmd. 
line:4: (FILENAME=- FNR=5) warning: gensub: third argument `' treated as 1 ..... Processing files: drbd-kmp-default-9.0.23_k5.3.18_22-1.x86_64 Processing files: drbd-kmp-preempt-9.0.23_k5.3.18_22-1.x86_64 Checking for unpackaged file(s): /usr/lib/rpm/check-files /home/drbdusr/rpmbuild/BUILDROOT/drbd-kernel-9.0.23-1.x86_64 error: Installed (but unpackaged) file(s) found: ?? /lib/modules/5.3.18-22-preempt/updates/drbd.ko ?? /lib/modules/5.3.18-22-preempt/updates/drbd_transport_tcp.ko RPM build errors: ??? File listed twice: /lib/modules/5.3.18-22-default/updates/drbd.ko ??? File listed twice: /lib/modules/5.3.18-22-default/updates/drbd_transport_tcp.ko ??? Installed (but unpackaged) file(s) found: ?? /lib/modules/5.3.18-22-preempt/updates/drbd.ko ?? /lib/modules/5.3.18-22-preempt/updates/drbd_transport_tcp.ko make: *** [Makefile:273: kmp-rpm] Error 1 ..... Any ideas? The kernel is 5.3.18-22-default. Rob From roland.kammerer at linbit.com Thu Jul 23 16:06:22 2020 From: roland.kammerer at linbit.com (Roland Kammerer) Date: Thu, 23 Jul 2020 16:06:22 +0200 Subject: [DRBD-user] Down sync In-Reply-To: References: Message-ID: <20200723140622.GP1888@rck.sh> On Thu, Jul 23, 2020 at 09:19:14AM +0200, Juan Sevilla wrote: > Hi, > > My configuration is this: > > A) Node drbd01: primary all > > B) Node drbd02: primary all > > C) Node drbd03: secondary all, diskless, for quorum proposal. I don't know your write pattern, but this sounds like a bad idea. Multiple primaries are allowed during live migration where you can guarantee exactly one writer. Otherwise: don't do it. Regards, rck From robert.altnoeder at linbit.com Fri Jul 24 11:43:27 2020 From: robert.altnoeder at linbit.com (Robert Altnoeder) Date: Fri, 24 Jul 2020 11:43:27 +0200 Subject: [DRBD-user] Down sync In-Reply-To: <20200723140622.GP1888@rck.sh> References: <20200723140622.GP1888@rck.sh> Message-ID: <756a2d3b-26f8-fff7-1bb1-b75ef63e74c4@linbit.com> On 7/23/20 4:06 PM, Roland Kammerer wrote: > I don't know your write pattern, but this sounds like a bad idea. > Multiple primaries are allowed during live migration where you can > guarantee exactly one writer. Otherwise: don't do it. I'll add some details: Dual Primary is supported for two kinds of situations: 1. Temporary dual-primary for live-migration of VMs using a multi-node replicated DRBD resource 2. Permanent dual-primary with specialized readers/writers (cluster file systems, cluster-synchronized applications, etc.) on a two-node replicated DRBD resource You have a three-node replicated DRBD resource. Noone knows whether that will replicate and resynchronize correctly even if it is connected. In general, dual primary configurations are a lot less robust than normal multi-resource active/active configurations and should be avoided. In most cases, there are other configurations that do not require dual primary mode for the same use-case, so in most cases, a dual primary configuration is a misconfiguration in the first place. Tell us more about what your actual use-case and applications are, and we may be able to help you with the setup. br, Robert From juan.sevilla.11 at gmail.com Fri Jul 24 12:22:43 2020 From: juan.sevilla.11 at gmail.com (Juan Sevilla) Date: Fri, 24 Jul 2020 12:22:43 +0200 Subject: [DRBD-user] Down sync In-Reply-To: References: Message-ID: Hi, Thanks for your response. I need use primary/primary for using the storage blocks by clustered filesystem. In general this configuration is running ok, also when i do a intensive using of the replicated local disks. 
From juan.sevilla.11 at gmail.com Fri Jul 24 12:22:43 2020
From: juan.sevilla.11 at gmail.com (Juan Sevilla)
Date: Fri, 24 Jul 2020 12:22:43 +0200
Subject: [DRBD-user] Down sync
In-Reply-To: 
References: 
Message-ID: 

Hi,

Thanks for your response. I need primary/primary because the storage
blocks are used by a clustered filesystem. In general this configuration
runs fine, even under intensive use of the replicated local disks.

My problem is that, occasionally, a disconnect appears between the nodes.

I don't know what you mean by multi-resource active/active. Is it an
alternative to dual primary/primary?

Thanks in advance,
Juan.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From robert.altnoeder at linbit.com Fri Jul 24 14:07:43 2020
From: robert.altnoeder at linbit.com (Robert Altnoeder)
Date: Fri, 24 Jul 2020 14:07:43 +0200
Subject: [DRBD-user] Down sync
In-Reply-To: 
References: 
Message-ID: <5e4187a1-a297-feb9-ac4e-ff4f6cae2fcf@linbit.com>

On 7/24/20 12:22 PM, Juan Sevilla wrote:
> Hi,
>
> Thanks for your response. I need primary/primary because the storage
> blocks are used by a clustered filesystem.

The question is whether you actually need a clustered filesystem in the
first place. That is why I asked about the use case and the applications
running on those systems.

> In general this configuration runs fine, even under intensive use of
> the replicated local disks.

With a resource that is replicated between more than two nodes while two
nodes are in the Primary role, data could be corrupted, maybe not during
replication, but possibly if a node has an outage, or if data is resynced
later. This is not a supported configuration.

> My problem is that, occasionally, a disconnect appears between the nodes.

That should normally lead to an immediate power-off of the node that was
lost. If it doesn't, then it's misconfigured, and the result is at least a
split-brain situation, apart from the potential data corruption due to
what I wrote above.

E.g., let's assume node A is Primary, node B is Primary, node C is
Secondary. Now A disconnects from B, but both are connected to C, and
applications still read and write data on A and B. A is missing the data
that is being written on B, and read requests on A return old data after
an update of that same data on node B. The same is true the other way
around. So the result is that you have two different, unrelated data sets
on A and B, and the state of any applications that rely on the cluster
filesystem may be corrupted.

But node C is even more interesting, because that one gets updates from
both node A and node B, which have diverged. So the data on node C could
be a completely corrupted mix of unrelated updates from A and B, which may
even corrupt the filesystem's data structures, thereby making the
filesystem unreadable.

Upon reconnect, what is supposed to happen? Node A and node B are
split-brained and cannot sync. Even if you sync those two, node C's data
cannot be recovered, so you would have to full-sync it for the data on it
to make any sense again.

And that's just the tip of the iceberg with regard to why dual-primary
multi-node clusters open Pandora's box in many interesting ways...

> I don't know what you mean by multi-resource active/active. Is it an
> alternative to dual primary/primary?

Multiple resources/volumes. Some Primary on node A, others Primary on
node B. Normally grouped with applications that are independent, e.g. two
different database instances that can run on different nodes. Instead of
keeping those on the same filesystem, each DB instance gets a separate
mountpoint for its data, and each mountpoint is backed by a separate DRBD
resource. Resource db1 is Primary with database instance 1 running on
node A, Secondary on nodes B and C. Resource db2 is Primary with database
instance 2 running on node B, Secondary on nodes A and C.

That's a multi-resource active/active setup.

br,
Robert
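
As an illustration of that layout, a minimal sketch of two such resources
(node names, backing devices and ports are made up; where each resource is
promoted and mounted is decided by the cluster manager, e.g. Pacemaker, not
by drbd.conf itself):

    # db1 is normally Primary (mounted) on nodeA, db2 on nodeB;
    # all three nodes keep a replica of both resources.
    resource db1 {
      device    /dev/drbd1;
      disk      /dev/vg0/db1;
      meta-disk internal;
      on nodeA { address 10.0.0.1:7790; node-id 0; }
      on nodeB { address 10.0.0.2:7790; node-id 1; }
      on nodeC { address 10.0.0.3:7790; node-id 2; }
      connection-mesh { hosts nodeA nodeB nodeC; }
    }

    resource db2 {
      device    /dev/drbd2;
      disk      /dev/vg0/db2;
      meta-disk internal;
      on nodeA { address 10.0.0.1:7791; node-id 0; }
      on nodeB { address 10.0.0.2:7791; node-id 1; }
      on nodeC { address 10.0.0.3:7791; node-id 2; }
      connection-mesh { hosts nodeA nodeB nodeC; }
    }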
From juan.sevilla.11 at gmail.com Mon Jul 27 08:28:01 2020
From: juan.sevilla.11 at gmail.com (Juan Sevilla)
Date: Mon, 27 Jul 2020 08:28:01 +0200
Subject: [DRBD-user] Down sync
In-Reply-To: 
References: 
Message-ID: 

Hi Robert,

> Permanent dual-primary with specialized readers/writers (cluster file
> systems, cluster-synchronized applications, etc.) on a two-node
> replicated DRBD resource

This is exactly my case. I need to build a system with Oracle ASM on top.
ASM is a clustered filesystem with its own heartbeat, etc., like OCFS2. I
need dual-primary to build a virtual SAN.

This system is running correctly under a high I/O load. I have restored
and recovered a 0.7 TB database with dual-primary active.

I can't use the multi-resource active/active model because there is only
one database, and it is accessed by two nodes simultaneously.

I think it's possible to eliminate the witness based on the diskless
(third) node and replace it with a fencing handler.

I appreciate your comments.

Best regards,
Juan.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tbskyd at gmail.com Wed Jul 29 12:26:32 2020
From: tbskyd at gmail.com (d tbsky)
Date: Wed, 29 Jul 2020 18:26:32 +0800
Subject: [DRBD-user] drbd proxy
Message-ID: 

Hi:

I want to run DRBD over a WAN. As far as I know, I need DRBD Proxy for the
best results. Is this product still available for sale, or will it be
replaced by something else?

I have asked for a quote on the LINBIT web site several times, but got no
reply. I don't know if something went wrong. Does anybody know how to
reach the right contact?

Thanks a lot for the help!!

From roland.kammerer at linbit.com Wed Jul 29 12:38:32 2020
From: roland.kammerer at linbit.com (Roland Kammerer)
Date: Wed, 29 Jul 2020 12:38:32 +0200
Subject: [DRBD-user] drbd proxy
In-Reply-To: 
References: 
Message-ID: <20200729103832.GU1888@rck.sh>

On Wed, Jul 29, 2020 at 06:26:32PM +0800, d tbsky wrote:
> Hi:
> I want to run DRBD over a WAN. As far as I know, I need DRBD Proxy
> for the best results. Is this product still available for sale?

It is.

> I have asked for a quote on the LINBIT web site several times, but got
> no reply. I don't know if something went wrong. Does anybody know how
> to reach the right contact?

I will make sure sales gets in contact with you.

Regards, rck

From gabor.hernadi at linbit.com Fri Jul 31 10:11:45 2020
From: gabor.hernadi at linbit.com (Gábor Hernádi)
Date: Fri, 31 Jul 2020 10:11:45 +0200
Subject: [DRBD-user] linstor-server 1.8.0-rc1
Message-ID: 

Hi!

This release candidate contains the following new features:

* SOS-report: using "linstor sos-report create" and afterwards the
  "download" subcommand, Linstor can now generate a tar.gz containing
  machine- and Linstor-specific information, log files, error reports and
  such.

* Snapshot-shipping[1]: Linstor can now ship newly created snapshots from
  one Linstor resource to another, using 'zfs send' / 'zfs receive' for
  ZFS or 'thin_send' / 'thin_recv' for LVM. Both tools support incremental
  shipment. Additionally, Linstor can be configured with
  auto-snapshot-shipping as well as auto-snapshots (without shipping).

  NOTE: LVM-based snapshot shipping requires the 'thin_send_recv' tool to
  be installed, which is available from the PPA as well as from
  https://packages.linbit.com/public/
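
As a quick usage sketch for the new SOS report (only the two subcommands
named above are shown; additional options may exist, see
'linstor sos-report --help'):

    # generate a new SOS report on the controller
    linstor sos-report create

    # fetch the resulting tar.gz to the local machine
    linstor sos-report download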
linstor-server 1.8.0-rc1
------------------------
 * Logging to rest-access.log is now disabled by default
 * Only update satellites if properties really changed
 * Allow setting controller/satellite settings via REST API
 * Skip initial DRBD sync on VDOs
 * No ErrorReports for typos and other "user mistakes"
 * LvmProvider also activates LVs (similar to what LvmThinProvider does)
 * Add Prometheus `/metrics` URL with reporting
 * Add Sentry integration to capture error info
 * Allow modifying the resource-group linking of a resource definition
 * Error reports are now additionally stored in a local H2 SQL DB, with
   more info
 * Shrinking of volumes (ATTENTION: make sure to first shrink anything
   above Linstor, i.e. the filesystem, or you risk LOSING DATA)
 * Fixed parsing version numbers of 'losetup' and 'lvm'
 * Simplify storage pool create property to a single one
 * Added 'StorDriver/LvcreateOptions' property
 * Do not bind controller protobuf Plain/SSL-Connector
 * SysFsHandler now throttles only data devices instead of meta devices
 * REST-API v1.2.0

https://www.linbit.com/downloads/linstor/linstor-server-1.8.0.rc1.tar.gz

Linstor PPA:
https://launchpad.net/~linbit/+archive/ubuntu/linbit-drbd9-stack

Best regards,
Gabor

[1] https://www.linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-shipping_snapshots-linstor
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
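
Regarding the Prometheus '/metrics' entry in the changelog above, the
endpoint can be checked with any HTTP client; a quick sketch, assuming the
controller's default REST/HTTP port 3370 (adjust host and port if
configured differently):

    curl -s http://localhost:3370/metrics | head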
From bogus@does.not.exist.com Mon Jul 20 09:52:07 2020
From: bogus@does.not.exist.com ()
Date: Mon, 20 Jul 2020 07:52:07 -0000
Subject: No subject
Message-ID: 

Hi,

drbd90 kernel module version:9.0.22-2
drbd90-utils:9.12.2-1
kernel:3.10.0-1127.18.2.el7.x86_64
pacemaker:1.1.21-4
corosync-2.4.5-4
system is centos:7.6

I have a 4-node test system (only ever 1 active primary) which is going
split-brain unexpectedly. n1 is the primary, n2/n3/n4 secondary.

The system is being shut down every night, and sometimes on restart
(particularly after a weekend shutdown) some of the nodes are split-brain
and require a full resync to fix. The logs seem to indicate a problem with
uuid_compare.

From the system log on n1:-

Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: drbd_sync_handshake:
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: self 30D0D1B4BD67BAEE:CE02E3A41E743EDA:1AB2F8FC4793AC46:95EE6B42F9156BF6 bits:786 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: peer 554921683EF7CC82:0000000000000000:272E3DE9D9C74A66:04B370F60768109E bits:0 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: uuid_compare()=split-brain-disconnect by rule 100
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n3: helper command: /sbin/drbdadm initial-split-brain

Then for n2:-

Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: drbd_sync_handshake:
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: self 30D0D1B4BD67BAEE:CE02E3A41E743EDA:1AB2F8FC4793AC46:95EE6B42F9156BF6 bits:786 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: peer BC13E2E36CA8B2C6:CE02E3A41E743EDA:272E3DE9D9C74A66:001E2864952E2E96 bits:416 flags:20
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: uuid_compare()=split-brain-auto-recover by rule 90
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: /sbin/drbdadm initial-split-brain
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: meta connection shut down by peer.
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: ack_receiver terminated
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Terminating ack_recv thread
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: /sbin/drbdadm initial-split-brain exit code 0
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0: Split-Brain detected but unresolved, dropping connection!
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: /sbin/drbdadm split-brain
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0/0 drbd0 cdc0-n2: helper command: /sbin/drbdadm split-brain exit code 0
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: conn( NetworkFailure -> Disconnecting )
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: error receiving P_STATE, e: -5 l: 0!
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Restarting sender thread
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Connection closed
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: conn( Disconnecting -> StandAlone )
Oct 12 07:58:56 cdc0-n1 kernel: drbd r0 cdc0-n2: Terminating receiver thread

The logs also have FIXME messages (which may be unrelated), e.g.:

Oct 13 12:50:42 cdc0-n1 kernel: drbd r0/0 drbd0: FIXME drbd_a_r0[97260] op clear, bitmap locked for 'send_bitmap (WFBitMapS)' by drbd_w_r0[1659]
Oct 13 12:50:42 cdc0-n1 kernel: drbd r0/0 drbd0: FIXME drbd_a_r0[97260] op clear, bitmap locked for 'receive bitmap' by drbd_r_r0[95684]
Sep 23 12:41:34 cdc0-n1 kernel: drbd r0/0 drbd0: FIXME drbd_a_r0[19978] op clear, bitmap locked for 'set_n_write from sync_handshake' by drbd_r_r0[17628]

Regards,
Jeremy Faith
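
For reference, once the node whose changes can be thrown away has been
identified, the usual manual recovery is the documented drbdadm split-brain
procedure; a sketch using the resource name r0 from the logs above (which
node is treated as the victim is a judgement call, and on DRBD 9 the
connection may also be addressed per peer):

    # on the split-brain victim (the node whose changes will be discarded)
    drbdadm disconnect r0
    drbdadm secondary r0
    drbdadm connect --discard-my-data r0

    # on the surviving node (only needed if its connection is StandAlone)
    drbdadm connect r0

This only repairs the symptom; why the nodes diverge across the nightly
shutdowns in the first place (for example the shutdown ordering of DRBD
relative to Pacemaker) is a separate question.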