-
I have some strange issues when enabling TLS for DRBD replication. Without it, everything works fine. First, the most annoying and non-informative excerpt, from the kernel log:

[ +0.002960] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knstorageo1-REDACTED-6def: conn( Disconnecting -> StandAlone ) [disconnected]
[ +0.566138] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knwkx-REDACTED-a3f9: conn( Unconnected -> Connecting ) [connecting]
[ +0.448006] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8: State change failed: State change requires a connection (-15)
[ +0.003535] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knstorageo1-REDACTED-6def: Failed: conn( StandAlone -> Connecting ) [connecting]
[ +0.003841] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knstorageo1-REDACTED-6def: Terminating receiver thread
[ +0.003267] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knstorageo1-REDACTED-6def: Terminating sender thread
[ +0.033623] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knwkx-REDACTED-a3f9: conn( Connecting -> Disconnecting ) [down]
[ +0.012246] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knwkx-REDACTED-a3f9: Terminating sender thread
[ +0.003478] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knwkx-REDACTED-a3f9: Starting sender thread (from drbd_r_pvc-f854 [110790])
[ +0.024050] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knwkx-REDACTED-a3f9: meta connection shut down by peer.
[ +0.003147] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 tcp:knwkx-REDACTED-a3f9: unhandled state 5
[ +0.002934] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knwkx-REDACTED-a3f9: meta connection shut down by peer.
[ +0.003160] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knwkx-REDACTED-a3f9: meta connection shut down by peer.
[ +0.003500] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knwkx-REDACTED-a3f9: Connection closed
[ +0.002748] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knwkx-REDACTED-a3f9: conn( Disconnecting -> StandAlone ) [disconnected]
[ +1.020472] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8: State change failed: State change requires a connection (-15)
[ +0.003422] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knwkx-REDACTED-a3f9: Failed: conn( StandAlone -> Connecting ) [connecting]
[ +0.003609] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knwkx-REDACTED-a3f9: Terminating receiver thread
[ +0.003096] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8 knwkx-REDACTED-a3f9: Terminating sender thread
[ +0.061875] drbd pvc-f8545fef-1120-4db6-868a-a2f06c8be1f8: Terminating worker thread

The TLS connections between the nodes look OK. All nodes connect through public IPs in a /24 range; the first two bytes of the IP addresses below are therefore replaced with 1.2.

> k logs -l app.kubernetes.io/component=linstor-satellite -c ktls-utils
tlshd[845]: add_key: Bad message
tlshd[844]: add_key: Bad message
tlshd[844]: Handshake with knstorageo1-REDACTED-6def (1.2.0.158) was successful
tlshd[845]: Handshake with knstorageo1-REDACTED-6def (1.2.0.158) was successful
tlshd[846]: File /etc/tlshd.d/tls.key: expected mode 600
tlshd[847]: File /etc/tlshd.d/tls.key: expected mode 600
tlshd[847]: add_key: Bad message
tlshd[847]: Handshake with knwhx4-REDACTED-1952 (1.2.8.145) was successful
tlshd[846]: add_key: Bad message
tlshd[846]: Handshake with knwhx4-REDACTED-1952 (1.2.8.145) was successful
tlshd[8]: Built from ktls-utils 0.10 on Oct 4 2023 07:26:06
tlshd[7]: Built from ktls-utils 0.10 on Oct 4 2023 07:26:06
tlshd[7]: Built from ktls-utils 0.10 on Oct 4 2023 07:26:06
tlshd[8]: Built from ktls-utils 0.10 on Oct 4 2023 07:26:06
tlshd[848]: add_key: Bad message
tlshd[847]: add_key: Bad message
tlshd[847]: Handshake with knwhx4-REDACTED-1952 (1.2.8.145) was successful
tlshd[848]: Handshake with knwhx4-REDACTED-1952 (1.2.8.145) was successful
tlshd[849]: File /etc/tlshd.d/tls.key: expected mode 600
tlshd[850]: File /etc/tlshd.d/tls.key: expected mode 600
tlshd[849]: add_key: Bad message
tlshd[849]: Handshake with knwkx-REDACTED-a3f9 (1.2.0.168) was successful
tlshd[850]: add_key: Bad message
tlshd[850]: Handshake with knwkx-REDACTED-a3f9 (1.2.0.168) was successful
tlshd[1218]: add_key: Bad message
tlshd[1218]: Handshake with knstorageo1-REDACTED-6def (1.2.0.158) was successful
tlshd[1219]: add_key: Bad message
tlshd[1219]: Handshake with knstorageo1-REDACTED-6def (1.2.0.158) was successful
tlshd[1220]: File /etc/tlshd.d/tls.key: expected mode 600
tlshd[1221]: File /etc/tlshd.d/tls.key: expected mode 600
tlshd[1220]: add_key: Bad message
tlshd[1220]: Handshake with knwkx-REDACTED-a3f9 (1.2.0.168) was successful
tlshd[1221]: add_key: Bad message
tlshd[1221]: Handshake with knwkx-REDACTED-a3f9 (1.2.0.168) was successful
tlshd[7]: Built from ktls-utils 0.10 on Oct 4 2023 07:26:06

The underlying infrastructure is Cilium as CNI on OpenStack; I'm unsure if that matters. The three nodes involved above are:

- knstorageo1-REDACTED-6def (1.2.0.158)
- knwhx4-REDACTED-1952 (1.2.8.145)
- knwkx-REDACTED-a3f9 (1.2.0.168)
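(Side note: the "expected mode 600" messages above suggest tlshd expects the key file to be 0600. A minimal sketch of checking this, with <satellite-pod> as a placeholder name:)

# <satellite-pod> is a placeholder; pick any pod from the satellite DaemonSet
# prints the octal mode and path of the key inside the ktls-utils container
> k exec <satellite-pod> -c ktls-utils -- stat -c '%a %n' /etc/tlshd.d/tls.key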
The LinstorCluster resource:

> k get linstorclusters.piraeus.io linstorcluster -o yaml
apiVersion: piraeus.io/v1
kind: LinstorCluster
metadata:
  annotations:
    # ...
  name: linstorcluster
  # ...
spec:
  internalTLS:
    certManager:
      kind: Issuer
      name: linstor-internal-ca
  linstorPassphraseSecret: linstor-passphrase
  nodeAffinity:
    nodeSelectorTerms:
    - matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-12
        - zone-18
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
      - key: node-role.kubernetes.io/ingress
        operator: DoesNotExist

And the relevant LinstorSatelliteConfiguration:

> k get linstorsatelliteconfigurations.piraeus.io internal-tls -o yaml
apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  annotations:
    # ...
  name: internal-tls
spec:
  internalTLS:
    certManager:
      kind: Issuer
      name: linstor-internal-ca
    tlsHandshakeDaemon: true

This is probably too little information, but maybe someone knows where else to look for more debugging details. Thankful for any hints! Happy Easter holidays!
-
Some more debugging info (from a different PVC, but still the same issue):

> k describe pvc cluster-example-1
Name: cluster-example-1
Namespace: piraeus-datastore
StorageClass: local
Status: Bound
Volume: pvc-ac1edc5f-8b3a-489e-81f4-2430780d76df
Labels: cnpg.io/cluster=cluster-example
cnpg.io/instanceName=cluster-example-1
cnpg.io/pvcRole=PG_DATA
Annotations: cnpg.io/nodeSerial: 1
cnpg.io/operatorVersion: 1.22.2
cnpg.io/pvcStatus: initializing
pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: linstor.csi.linbit.com
volume.kubernetes.io/selected-node: knwhx4-REDACTED-1952
volume.kubernetes.io/storage-provisioner: linstor.csi.linbit.com
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 1Gi
Access Modes: RWO
VolumeMode: Filesystem
Used By: cluster-example-1-initdb-7xbzj
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal WaitForFirstConsumer 4m39s persistentvolume-controller waiting for first consumer to be created before binding
Normal Provisioning 4m39s linstor.csi.linbit.com_linstor-csi-controller-549fbc4f96-bl5k6_327f26d4-f1b2-42c5-a542-1d24ab62d2a0 External provisioner is provisioning volume for claim "piraeus-datastore/cluster-example-1"
Normal ExternalProvisioning 4m25s (x2 over 4m39s) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'linstor.csi.linbit.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
Normal ProvisioningSucceeded 4m21s linstor.csi.linbit.com_linstor-csi-controller-549fbc4f96-bl5k6_327f26d4-f1b2-42c5-a542-1d24ab62d2a0 Successfully provisioned volume pvc-ac1edc5f-8b3a-489e-81f4-2430780d76df
Warning VolumeConditionAbnormal 35s csi-pv-monitor-controller-linstor.csi.linbit.com Resource with issues on node(s): knstorageo1-REDACTED-6def,knwhx4-REDACTED-1952,knwkx-REDACTED-a3f9

Logs from the selected node (knwhx4-REDACTED-1952):
The part which sounds like an error is:

error="rpc error: code = Internal desc = NodePublishVolume failed for pvc-ac1edc5f-8b3a-489e-81f4-2430780d76df: failed to set source device readwrite: exit status 1"

I would expect to find some error about TLS, since enabling it causes this issue. The StorageClass definition:

> k get sc local -o yaml
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    # ...
  name: local
  # ...
parameters:
  csi.storage.k8s.io/fstype: ext4
  linstor.csi.linbit.com/allowRemoteVolumeAccess: |
    - fromSame:
      - topology.kubernetes.io/zone
  linstor.csi.linbit.com/autoPlace: "3"
  linstor.csi.linbit.com/replicasOnSame: topology.kubernetes.io/zone
  linstor.csi.linbit.com/storagePool: local
  property.linstor.csi.linbit.com/DrbdOptions/Net/tls: "yes"
provisioner: linstor.csi.linbit.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Some LINSTOR CLI output:

> lins r l -p
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State | CreatedOn |
|==============================================================================================================================================================================================|
| pvc-7e4076a9-3687-4668-953b-16f2fa4bffc1 | knstorageo1-REDACTED-6def | 7000 | Unused | Unconnected(knwhx4-REDACTED-1952,knwkx-REDACTED-a3f9) | UpToDate | 2024-03-30 10:21:58 |
| pvc-7e4076a9-3687-4668-953b-16f2fa4bffc1 | knwhx4-REDACTED-1952 | 7000 | Unused | Connecting(knwkx-REDACTED-a3f9),Unconnected(knstorageo1-REDACTED-6def) | UpToDate | 2024-03-30 10:21:56 |
| pvc-7e4076a9-3687-4668-953b-16f2fa4bffc1 | knwkx-REDACTED-a3f9 | 7000 | Unused | Connecting(knwhx4-REDACTED-1952),Unconnected(knstorageo1-REDACTED-6def) | UpToDate | 2024-03-30 10:21:58 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

root@knwhx4-REDACTED-1952:~# drbdadm status
pvc-7e4076a9-3687-4668-953b-16f2fa4bffc1 role:Secondary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  knstorageo1-REDACTED-6def connection:Unconnected
  knwkx-REDACTED-a3f9 connection:NetworkFailure

root@knstorageo1-REDACTED-6def:~# drbdadm status
pvc-7e4076a9-3687-4668-953b-16f2fa4bffc1 role:Secondary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  knwhx4-REDACTED-1952 connection:Unconnected
  knwkx-REDACTED-a3f9 connection:Unconnected

root@knwkx-REDACTED-a3f9:~# drbdadm status
pvc-7e4076a9-3687-4668-953b-16f2fa4bffc1 role:Secondary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  knstorageo1-REDACTED-6def connection:Unconnected
  knwhx4-REDACTED-1952 connection:NetworkFailure

As soon as I recreate the StorageClass without the DrbdOptions/Net/tls property, everything connects again:

> lins r l -p
+------------------------------------------------------------------------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State | CreatedOn |
|==============================================================================================================================|
| pvc-bf7e2808-d072-47be-9d20-4ca3065539cb | knstorageo1-REDACTED-6def | 7001 | Unused | Ok | UpToDate | 2024-03-30 10:42:31 |
| pvc-bf7e2808-d072-47be-9d20-4ca3065539cb | knwhx4-REDACTED-1952 | 7001 | InUse | Ok | UpToDate | 2024-03-30 10:42:30 |
| pvc-bf7e2808-d072-47be-9d20-4ca3065539cb | knwkx-REDACTED-a3f9 | 7001 | Unused | Ok | UpToDate | 2024-03-30 10:42:31 |
+------------------------------------------------------------------------------------------------------------------------------+

root@knwhx4-REDACTED-1952:~# drbdadm status
pvc-bf7e2808-d072-47be-9d20-4ca3065539cb role:Primary
  disk:UpToDate
  knstorageo1-REDACTED-6def role:Secondary
    peer-disk:UpToDate
  knwkx-REDACTED-a3f9 role:Secondary
    peer-disk:UpToDate

Where else can I look for debugging info?
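(Besides dmesg and the ktls-utils logs above, the other spot I know of is LINSTOR's own error reports; the controller deployment name below is an assumption:)

# assumption: the controller Deployment is named linstor-controller
> k exec deploy/linstor-controller -- linstor error-reports list
# print a single report in full, by its id from the list
> k exec deploy/linstor-controller -- linstor error-reports show <report-id>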
-
Probably relevant: when TLS is disabled in the StorageClass, the TCP connections from DRBD show up.

root@knwkx-REDACTED-a3f9:~# ss -tpa | rg '700.'
ESTAB 0 0 1.2.0.168:58929 1.2.0.61:7001
ESTAB 0 0 1.2.0.168:7001 1.2.0.61:56581
ESTAB 0 0 1.2.0.168:59597 1.2.0.61:7002
ESTAB 0 0 1.2.0.168:7002 1.2.0.158:33055
ESTAB 0 0 1.2.0.168:47537 1.2.0.158:7002
ESTAB 0 0 1.2.0.168:7002 1.2.0.61:59779
ESTAB 0 0 1.2.0.168:50395 1.2.0.158:7001
ESTAB 0 0 1.2.0.168:7001 1.2.0.158:34061

When enabling TLS, no TCP connections in the '700.' range show up.
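One way to see whether any kTLS sessions get established at all is the kernel's TLS statistics, assuming the tls module is loaded on the node:

# non-zero TlsCurrTxSw/TlsCurrRxSw counters indicate active kTLS sockets
root@knwkx-REDACTED-a3f9:~# cat /proc/net/tls_stat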
-
What version of DRBD are you using (check, e.g., /proc/drbd on a node)? From the short excerpt, I can see a "unhandled state 5" message, which means a DRBD version < 9.2.8, as I believe this bug has been fixed since.
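For reference, a minimal way to check the DRBD version on each node:

# the loaded DRBD module reports its version here
cat /proc/drbd
# or via the module metadata, without touching /proc
modinfo drbd | grep '^version'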
MSG_SPLICE_PAGES was added in the 6.5 kernel release, so everything newer will have the needed commits.
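And a quick check for the kernel side:

# MSG_SPLICE_PAGES is only available on Linux >= 6.5
uname -r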