Longhorn: [BUG] IPV6 errors in connectivity - too many colons in address

Created on 26 Aug 2020 · 2 Comments · Source: longhorn/longhorn

Describe the bug
When running Longhorn on an IPv6 cluster, the longhorn-manager pods report connectivity errors because the IPv6 addresses of the instance-manager pods are not wrapped in square brackets.

1 instance_manager_controller.go:248] fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address"
time="2020-08-26T07:47:41Z" level=warning msg="Dropping Longhorn instance manager longhorn-system/instance-manager-r-801495b4 out of the queue: fail to sync instance manager for longhorn-system/instance-manager-r-801495b4: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a384:8500: too many colons in address\""
time="2020-08-26T07:47:41Z" level=warning msg="Dropping Longhorn instance manager longhorn-system/instance-manager-e-a6a4186d out of the queue: fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address\""

As a result, storage does not work.

To Reproduce
Steps to reproduce the behavior:

1. Deploy the Longhorn Helm chart on an IPv6 Kubernetes cluster.
2. Watch the logs of one of the longhorn-manager pods.

Expected behavior
The longhorn-manager pods connect successfully to the instance-manager pods.

Log
Logs from one of the longhorn-manager pods:

time="2020-08-26T06:36:11Z" level=info msg="Start overwriting built-in settings with customized values"
time="2020-08-26T06:36:11Z" level=info msg="cannot list the content of the src directory /var/lib/rancher/longhorn/engine-binaries for the copy, will do nothing: Failed to execute: nsenter [--mount=/host/proc/5919/ns/mnt --net=/host/proc/5919/ns/net bash -c ls /var/lib/rancher/longhorn/engine-binaries/*], output , stderr, ls: cannot access /var/lib/rancher/longhorn/engine-binaries/*: No such file or directory\n, error exit status 2"
time="2020-08-26T06:36:11Z" level=info msg="Start upgrading"
time="2020-08-26T06:36:11Z" level=info msg="No API version upgrade is needed"
time="2020-08-26T06:36:11Z" level=info msg="Finish upgrading"
time="2020-08-26T06:36:11Z" level=info msg="Upgrade leader lost: k8s-worker01"
E0826 06:36:11.722198       1 kubernetes_node_controller.go:244] Couldn't get nodes k8s-worker01: node "k8s-worker01" not found
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn websocket controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn Setting controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn node controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn Kubernetes node controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn replica controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn engine controller"
time="2020-08-26T06:36:11Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:11Z" level=info msg="Starting Longhorn instance manager controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn Engine Image controller"
time="2020-08-26T06:36:11Z" level=info msg="Start kubernetes controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn volume controller"
time="2020-08-26T06:36:11Z" level=debug msg="Failed to check for the latest upgrade: Post \"https://longhorn-upgrade-responder.rancher.io/v1/checkupgrade\": dial tcp 34.208.213.149:443: connect: network is unreachable"
time="2020-08-26T06:36:11Z" level=debug msg="Instance Manager Controller k8s-worker01 picked up instance-manager-e-a6a4186d"
time="2020-08-26T06:36:11Z" level=warning msg="Starts to clean up then recreates pod for instance manager instance-manager-e-a6a4186d with state stopped"
time="2020-08-26T06:36:11Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"k8s-worker01\", UID:\"64416b13-01e8-4d43-94bb-e7bb4eeace87\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"5260\", FieldPath:\"\"}): type: 'Normal' reason: 'Ready' Disk default-disk-80300000000(/opt/longhorn/) on node k8s-worker01 is ready"
time="2020-08-26T06:36:11Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"k8s-worker01\", UID:\"64416b13-01e8-4d43-94bb-e7bb4eeace87\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"5260\", FieldPath:\"\"}): type: 'Normal' reason: 'Schedulable' Disk default-disk-80300000000(/opt/longhorn/) on node k8s-worker01 is schedulable"
time="2020-08-26T06:36:11Z" level=debug msg="Prepare to create default instance manager instance-manager-r-801495b4, node: k8s-worker01, default instance manager image: longhornio/longhorn-instance-manager:v1_20200514, type: replica"
time="2020-08-26T06:36:11Z" level=info msg="Created instance manager pod instance-manager-e-a6a4186d for instance manager instance-manager-e-a6a4186d"
time="2020-08-26T06:36:12Z" level=debug msg="Instance Manager Controller k8s-worker01 picked up instance-manager-r-801495b4"
time="2020-08-26T06:36:12Z" level=warning msg="Starts to clean up then recreates pod for instance manager instance-manager-r-801495b4 with state stopped"
time="2020-08-26T06:36:12Z" level=info msg="Created instance manager pod instance-manager-r-801495b4 for instance manager instance-manager-r-801495b4"
time="2020-08-26T06:36:13Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"k8s-worker02\", UID:\"a992f506-22da-4833-95ef-694b86c77348\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"5431\", FieldPath:\"\"}): type: 'Warning' reason: 'Ready' Node k8s-worker02 is down: the manager pod longhorn-manager-wgk2b is not running"
time="2020-08-26T06:36:13Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"k8s-worker02\", UID:\"a992f506-22da-4833-95ef-694b86c77348\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"5431\", FieldPath:\"\"}): type: 'Normal' reason: 'Schedulable' "
time="2020-08-26T06:36:13Z" level=debug msg="Requeue longhorn-system/k8s-worker02 due to conflict: Operation cannot be fulfilled on nodes.longhorn.io \"k8s-worker02\": the object has been modified; please apply your changes to the latest version and try again"
time="2020-08-26T06:36:14Z" level=debug msg="Instance Manager Controller k8s-worker01 picked up instance-manager-e-778e02bb"
time="2020-08-26T06:36:14Z" level=debug msg="Requeue longhorn-system/instance-manager-e-778e02bb due to conflict: Operation cannot be fulfilled on instancemanagers.longhorn.io \"instance-manager-e-778e02bb\": the object has been modified; please apply your changes to the latest version and try again"
time="2020-08-26T06:36:14Z" level=debug msg="Instance Manager Controller k8s-worker01 picked up instance-manager-r-837f58af"
time="2020-08-26T06:36:17Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:23Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:29Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:35Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:41Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:47Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:53Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:59Z" level=debug msg="Engine image longhornio/longhorn-engine:v1.0.2 is ready"
time="2020-08-26T06:36:59Z" level=info msg="Listening on fd00:10:d102:1111::a381:9500"
time="2020-08-26T06:37:06Z" level=warning msg="Error syncing Longhorn instance manager longhorn-system/instance-manager-e-a6a4186d: fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address\""
time="2020-08-26T06:37:06Z" level=warning msg="Error syncing Longhorn instance manager longhorn-system/instance-manager-e-a6a4186d: fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address\""
E0826 06:37:06.948745       1 instance_manager_controller.go:248] fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address"
time="2020-08-26T06:37:06Z" level=warning msg="Dropping Longhorn instance manager longhorn-system/instance-manager-e-a6a4186d out of the queue: fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address\""
time="2020-08-26T06:37:08Z" level=warning msg="Error syncing Longhorn instance manager longhorn-system/instance-manager-r-801495b4: fail to sync instance manager for longhorn-system/instance-manager-r-801495b4: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a384:8500: too many colons in address\""
time="2020-08-26T06:37:08Z" level=warning msg="Error syncing Longhorn instance manager longhorn-system/instance-manager-r-801495b4: fail to sync instance manager for longhorn-system/instance-manager-r-801495b4: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a384:8500: too many colons in address\""
E0826 06:37:08.951681       1 instance_manager_controller.go:248] fail to sync instance manager for longhorn-system/instance-manager-r-801495b4: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address fd00:10:d102:1111::a384:8500: too many colons in address"  
time="2020-08-26T06:37:08Z" level=warning msg="Dropping Longhorn instance manager longhorn-system/instance-manager-r-801495b4 out of the queue: fail to sync instance manager for longhorn-system/instance-manager-r-801495b4: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a384:8500: too many colons in address\""
time="2020-08-26T06:37:11Z" level=warning msg="Error syncing Longhorn instance manager longhorn-system/instance-manager-r-801495b4: fail to sync instance manager for longhorn-system/instance-manager-r-801495b4: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a384:8500: too many colons in address\""
time="2020-08-26T06:37:11Z" level=warning msg="Error syncing Longhorn instance manager longhorn-system/instance-manager-e-a6a4186d: fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address\""
E0826 06:37:11.741187       1 instance_manager_controller.go:248] fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address"

These logs keep repeating themselves.

Environment:

Longhorn version: 1.0.2
Kubernetes version: 1.18.5
Node OS type and version: CentOS 7

Additional context
This is the state of all of the longhorn pods:

NAME                                        READY   STATUS     RESTARTS   AGE
engine-image-ei-ee18f965-dxmkp              1/1     Running    0          65m
engine-image-ei-ee18f965-kvbsg              1/1     Running    0          65m
engine-image-ei-ee18f965-tvgln              1/1     Running    0          65m
instance-manager-e-778e02bb                 1/1     Running    0          65m
instance-manager-e-9a0b05b6                 1/1     Running    0          65m
instance-manager-e-a6a4186d                 1/1     Running    0          65m
instance-manager-r-801495b4                 1/1     Running    0          65m
instance-manager-r-837f58af                 1/1     Running    0          65m
instance-manager-r-909a3d7d                 1/1     Running    0          65m
longhorn-driver-deployer-6756bb8fd6-gbvvs   0/1     Init:0/1   0          66m
longhorn-manager-7bjmz                      0/1     Running    2          66m
longhorn-manager-8tf5f                      0/1     Running    1          66m
longhorn-manager-wgk2b                      0/1     Running    2          66m
longhorn-ui-6fb889895f-xm6j2                1/1     Running    0          66m



tomikonio

👍1

Most helpful comment

What's missing as of now:

longhorn-instance-manager : https://github.com/longhorn/longhorn-instance-manager/pull/80
go-iscsi-helper : https://github.com/longhorn/go-iscsi-helper/pull/35
longhorn-engine : https://github.com/longhorn/longhorn-engine/pull/530
longhorn-manager : https://github.com/Frankkkkk/longhorn-manager/commit/aaec3d48717247e8b28e1e2a51f9aa313f5122ee (listen everywhere in v6; in the worst case this is IPv4-mapped IPv6, so it should also work on v4 clusters). MR in progress.
... and maybe more :-)

Frankkkkk on 28 Aug 2020

👍2
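The remark above about listening everywhere in v6 relies on IPv4-mapped IPv6 addresses (`::ffff:a.b.c.d`). The following is a small sketch of what that means in Go, under stated assumptions: the helpers `wildcardListenAddr` and `isIPv4Mapped` are hypothetical and do not come from the linked commit, and port 9500 is taken from the "Listening on" log line only as an example. Whether a socket bound to `[::]` actually accepts IPv4 clients depends on the operating system's dual-stack settings, so the sketch only builds and inspects addresses and opens no sockets.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// wildcardListenAddr (hypothetical helper) builds a listen address on the
// unspecified IPv6 address "::" for the given port.
func wildcardListenAddr(port int) string {
	return net.JoinHostPort("::", strconv.Itoa(port))
}

// isIPv4Mapped (hypothetical helper) reports whether s is written as an
// IPv6 literal that embeds an IPv4 address, such as "::ffff:10.0.0.1".
func isIPv4Mapped(s string) bool {
	ip := net.ParseIP(s)
	// To4 returns non-nil when the address can be represented as IPv4;
	// the colon check excludes plain dotted-quad input.
	return ip != nil && strings.Contains(s, ":") && ip.To4() != nil
}

func main() {
	fmt.Println("listen address:", wildcardListenAddr(9500))

	for _, s := range []string{"::ffff:10.0.0.1", "fd00:10:d102:1111::a381", "10.0.0.1"} {
		fmt.Printf("%-26s IPv4-mapped: %v\n", s, isIPv4Mapped(s))
	}
}
```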

All 2 comments

I opened some MRs yesterday (https://github.com/longhorn/longhorn-ui/pull/320, https://github.com/longhorn/longhorn-manager/pull/663) that make the UI work and communicate with the manager.

However, some IPv6 fixes are still missing, notably in the csi-* components and instance-manager. As of now, volumes can't be provisioned and are stuck in a "Not ready - Detaching" state.

If I've got time I'll try to do an MR today, but I'm afraid that I'm missing some information on the inner workings.

@yasker We could provide you with a test IPv6 cluster if you would like to test things.

Frankkkkk on 28 Aug 2020

👍1

What's missing as of now: