[DRBD-user] linstor-proxmox: Intentionally removing diskless assignment

Wed Dec 18 13:16:02 CET 2019

On 18.12.19 at 12:55, Roland Kammerer wrote:
> On Wed, Dec 18, 2019 at 12:19:45PM +0100, Andreas Pflug wrote:
>> Could not delete diskless resource vm-105-disk-3 on <migrationSource>,
>> because:
>> [{ "ret_code":53739522,
>>    "message":"Node: <migrationSource>, Resource: vm-105-disk-3 marked
>> for deletion.",
>>    "details":"Node: <migrationSource>, Resource: vm-105-disk-3 UUID is:
>> <uuid>",
>>    "obj_refs":{"RscDfn":"vm-105-disk-3","Node":"<migrationSource>"}
>>  },
>>  { "ret_code":53739523,
>>    "message":"Deleted 'vm-105-disk-3' on '<migrationSource>'",
>>    "obj_refs":{"RscDfn":"vm-105-disk-3","Node":"<migrationSource>"}
>>  },
>>  { "ret_code":53739523,
>>    "message":"Notified '<storageHost2>' that 'vm-105-disk-3' is being
>> deleted on Node(s): [<migrationSource>]",
>>    "obj_refs":{"RscDfn":"vm-105-disk-3","Node":"<migrationSource>"}
>>  },
>>  { "ret_code":-4611686018373647386,
>>    "message":"(Node: '<migrationTarget>') Failed to adjust DRBD resource
>> vm-105-disk-3","error_report_ids":["<reportid>"],
> 
> Here obviously the error report would be interesting:
> linstor err s <ID>.
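
To find the report ID to pass to `linstor err s`, the response list can be scanned for entries that carry `error_report_ids`. The following is only a minimal sketch: the sample is trimmed to two of the entries quoted above, the placeholders are kept exactly as posted, and the helper name `error_report_ids` is made up for illustration, not part of LINSTOR.

```python
import json

# Sample trimmed to fields visible in the response quoted above;
# placeholders such as <reportid> are kept exactly as posted.
response_text = """
[
  {"ret_code": 53739523,
   "message": "Deleted 'vm-105-disk-3' on '<migrationSource>'"},
  {"ret_code": -4611686018373647386,
   "message": "(Node: '<migrationTarget>') Failed to adjust DRBD resource vm-105-disk-3",
   "error_report_ids": ["<reportid>"]}
]
"""

def error_report_ids(entries):
    """Return (message, report_id) pairs for entries that carry error reports."""
    pairs = []
    for entry in entries:
        # Entries without the key (the successful ones here) are skipped.
        for report_id in entry.get("error_report_ids", []):
            pairs.append((entry.get("message", ""), report_id))
    return pairs

reports = error_report_ids(json.loads(response_text))
for message, report_id in reports:
    print(f"{report_id}: {message}")
```

Each ID printed this way is what `linstor err s <ID>` expects.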

The error report is:

Description:
    Execution of the external command 'drbdadm' failed.
...
Additional information:
    The full command line executed was:
    drbdadm -vvv adjust vm-105-disk-3

    The external command sent the following output data:
    drbdsetup del-peer vm-105-disk-3 3

> 
> As well as everything one would do to diagnose a DRBD problem, like
> 'drbdsetup status vm-105-disk-3', syslogs/journal grepped for the
> resource,...

The migration appears fully successful to me: drbdadm status shows
UpToDate on all nodes and Primary on the MigrationTargetHost only, as
expected.

The syslog on MigrationSourceHost contains the same error message "Could
not delete diskless resource" as above; kern.log shows a normal teardown
of the resource.

syslog on MigrationTargetHost when finishing the migration:

13:41:27 kernel: [1470266.652602] drbd vm-105-disk-3: Preparing
cluster-wide state change 3965084326 (2->-1 3/1)
13:41:27 kernel: [1470266.652710] drbd vm-105-disk-3: State change
3965084326: primary_nodes=C, weak_nodes=FFFFFFFFFFFFFFF0
13:41:27 kernel: [1470266.652711] drbd vm-105-disk-3: Committing
cluster-wide state change 3965084326 (0ms)
13:41:27 kernel: [1470266.652714] drbd vm-105-disk-3: role( Secondary ->
Primary )
13:41:28 kernel: [1470267.884733] drbd vm-105-disk-3
<MigrationSourceHost>: peer( Primary -> Secondary )
13:41:32 Satellite[102668]: 13:41:32.399 [MainWorkerPool-8] INFO
LINSTOR/Satellite - SYSTEM - Resource 'vm-105-disk-3' updated for node
'<StorageHost2>'.
13:41:32 Satellite[102668]: 13:41:32.400 [MainWorkerPool-8] INFO
LINSTOR/Satellite - SYSTEM - Resource 'vm-105-disk-3' updated for node
'<StorageHost3>'.
13:41:32 Satellite[102668]: 13:41:32.400 [MainWorkerPool-8] INFO
LINSTOR/Satellite - SYSTEM - Resource 'vm-105-disk-3' updated for node
'<MigrationTargetHost>'.
13:41:32 Satellite[102668]: 13:41:32.400 [MainWorkerPool-8] INFO
LINSTOR/Satellite - SYSTEM - Resource 'vm-105-disk-3' updated for node
'<MigrationSourceHost>'.
13:41:32 kernel: [1470271.471854] drbd vm-105-disk-3
<MigrationSourceHost>: Preparing remote state change 1415392230
13:41:32 kernel: [1470271.471958] drbd vm-105-disk-3
<MigrationSourceHost>: Committing remote state change 1415392230
(primary_nodes=4)
13:41:32 kernel: [1470271.539535] drbd vm-105-disk-3
<MigrationSourceHost>: Preparing remote state change 3243233882
13:41:32 kernel: [1470271.539774] drbd vm-105-disk-3
<MigrationSourceHost>: Committing remote state change 3243233882
(primary_nodes=4)
13:41:32 Satellite[102668]: 13:41:32.711 [DeviceManager] ERROR
LINSTOR/Satellite - SYSTEM - Failed to adjust DRBD resource
vm-105-disk-3 [Report number <reportAbove>]
13:41:33 kernel: [1470272.124519] drbd vm-105-disk-3
<MigrationSourceHost>: sock was shut down by peer
13:41:33 kernel: [1470272.124530] drbd vm-105-disk-3
<MigrationSourceHost>: conn( Connected -> BrokenPipe ) peer( Secondary
-> Unknown )
13:41:33 kernel: [1470272.124532] drbd vm-105-disk-3/0 drbd1009
<MigrationSourceHost>: pdsk( Diskless -> DUnknown ) repl( Established ->
Off )
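
Since the mail client wrapped the kernel lines, the state transitions are easier to follow once the records are rejoined. This is a rough sketch that works on three records copied from the log above; the helper names `unwrap` and `transitions` and the regular expressions are my own assumptions about the line layout, not anything provided by DRBD.

```python
import re

# Three wrapped records copied from the syslog excerpt above.
log = """\
13:41:27 kernel: [1470266.652714] drbd vm-105-disk-3: role( Secondary ->
Primary )
13:41:28 kernel: [1470267.884733] drbd vm-105-disk-3
<MigrationSourceHost>: peer( Primary -> Secondary )
13:41:33 kernel: [1470272.124530] drbd vm-105-disk-3
<MigrationSourceHost>: conn( Connected -> BrokenPipe ) peer( Secondary
-> Unknown )
"""

TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2} ")
TRANSITION = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")

def unwrap(text):
    """Rejoin wrapped lines: a new record starts at each HH:MM:SS prefix."""
    records = []
    for line in text.splitlines():
        if TIMESTAMP.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records

def transitions(text):
    """Return (time, field, old_state, new_state) for each 'x( A -> B )'."""
    result = []
    for record in unwrap(text):
        time = record.split(" ", 1)[0]
        for field, old, new in TRANSITION.findall(record):
            result.append((time, field, old, new))
    return result

for row in transitions(log):
    print(*row)
```

On these records it lists the role change on the target at 13:41:27, the source host dropping to Secondary at 13:41:28, and the connection going to BrokenPipe at 13:41:33, which is after the failed adjust was logged at 13:41:32.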

To me this looks like a use-after-free style error: an attempt to delete
the diskless resource twice. The test case above involves three DRBD
resources, but the error also occurs on VMs with a single disk.

Regards,
Andreas