Describe the bug
When running Longhorn on an IPv6 cluster, the longhorn-manager pods report connectivity errors because the IPv6 addresses of the instance-manager pods are not wrapped in square brackets.
1 instance_manager_controller.go:248] fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address"
time="2020-08-26T07:47:41Z" level=warning msg="Dropping Longhorn instance manager longhorn-system/instance-manager-r-801495b4 out of the queue: fail to sync instance manager for longhorn-system/instance-manager-r-801495b4: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a384:8500: too many colons in address\""
time="2020-08-26T07:47:41Z" level=warning msg="Dropping Longhorn instance manager longhorn-system/instance-manager-e-a6a4186d out of the queue: fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address\""
As a result, storage does not work.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The longhorn-manager pods should connect to the instance-manager pods successfully.
Log
Logs from one of the longhorn-manager pods:
time="2020-08-26T06:36:11Z" level=info msg="Start overwriting built-in settings with customized values"
time="2020-08-26T06:36:11Z" level=info msg="cannot list the content of the src directory /var/lib/rancher/longhorn/engine-binaries for the copy, will do nothing: Failed to execute: nsenter [--mount=/host/proc/5919/ns/mnt --net=/host/proc/5919/ns/net bash -c ls /var/lib/rancher/longhorn/engine-binaries/*], output , stderr, ls: cannot access /var/lib/rancher/longhorn/engine-binaries/*: No such file or directory\n, error exit status 2"
time="2020-08-26T06:36:11Z" level=info msg="Start upgrading"
time="2020-08-26T06:36:11Z" level=info msg="No API version upgrade is needed"
time="2020-08-26T06:36:11Z" level=info msg="Finish upgrading"
time="2020-08-26T06:36:11Z" level=info msg="Upgrade leader lost: k8s-worker01"
E0826 06:36:11.722198 1 kubernetes_node_controller.go:244] Couldn't get nodes k8s-worker01: node "k8s-worker01" not found
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn websocket controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn Setting controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn node controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn Kubernetes node controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn replica controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn engine controller"
time="2020-08-26T06:36:11Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:11Z" level=info msg="Starting Longhorn instance manager controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn Engine Image controller"
time="2020-08-26T06:36:11Z" level=info msg="Start kubernetes controller"
time="2020-08-26T06:36:11Z" level=info msg="Start Longhorn volume controller"
time="2020-08-26T06:36:11Z" level=debug msg="Failed to check for the latest upgrade: Post \"https://longhorn-upgrade-responder.rancher.io/v1/checkupgrade\": dial tcp 34.208.213.149:443: connect: network is unreachable"
time="2020-08-26T06:36:11Z" level=debug msg="Instance Manager Controller k8s-worker01 picked up instance-manager-e-a6a4186d"
time="2020-08-26T06:36:11Z" level=warning msg="Starts to clean up then recreates pod for instance manager instance-manager-e-a6a4186d with state stopped"
time="2020-08-26T06:36:11Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"k8s-worker01\", UID:\"64416b13-01e8-4d43-94bb-e7bb4eeace87\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"5260\", FieldPath:\"\"}): type: 'Normal' reason: 'Ready' Disk default-disk-80300000000(/opt/longhorn/) on node k8s-worker01 is ready"
time="2020-08-26T06:36:11Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"k8s-worker01\", UID:\"64416b13-01e8-4d43-94bb-e7bb4eeace87\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"5260\", FieldPath:\"\"}): type: 'Normal' reason: 'Schedulable' Disk default-disk-80300000000(/opt/longhorn/) on node k8s-worker01 is schedulable"
time="2020-08-26T06:36:11Z" level=debug msg="Prepare to create default instance manager instance-manager-r-801495b4, node: k8s-worker01, default instance manager image: longhornio/longhorn-instance-manager:v1_20200514, type: replica"
time="2020-08-26T06:36:11Z" level=info msg="Created instance manager pod instance-manager-e-a6a4186d for instance manager instance-manager-e-a6a4186d"
time="2020-08-26T06:36:12Z" level=debug msg="Instance Manager Controller k8s-worker01 picked up instance-manager-r-801495b4"
time="2020-08-26T06:36:12Z" level=warning msg="Starts to clean up then recreates pod for instance manager instance-manager-r-801495b4 with state stopped"
time="2020-08-26T06:36:12Z" level=info msg="Created instance manager pod instance-manager-r-801495b4 for instance manager instance-manager-r-801495b4"
time="2020-08-26T06:36:13Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"k8s-worker02\", UID:\"a992f506-22da-4833-95ef-694b86c77348\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"5431\", FieldPath:\"\"}): type: 'Warning' reason: 'Ready' Node k8s-worker02 is down: the manager pod longhorn-manager-wgk2b is not running"
time="2020-08-26T06:36:13Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"k8s-worker02\", UID:\"a992f506-22da-4833-95ef-694b86c77348\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"5431\", FieldPath:\"\"}): type: 'Normal' reason: 'Schedulable' "
time="2020-08-26T06:36:13Z" level=debug msg="Requeue longhorn-system/k8s-worker02 due to conflict: Operation cannot be fulfilled on nodes.longhorn.io \"k8s-worker02\": the object has been modified; please apply your changes to the latest version and try again"
time="2020-08-26T06:36:14Z" level=debug msg="Instance Manager Controller k8s-worker01 picked up instance-manager-e-778e02bb"
time="2020-08-26T06:36:14Z" level=debug msg="Requeue longhorn-system/instance-manager-e-778e02bb due to conflict: Operation cannot be fulfilled on instancemanagers.longhorn.io \"instance-manager-e-778e02bb\": the object has been modified; please apply your changes to the latest version and try again"
time="2020-08-26T06:36:14Z" level=debug msg="Instance Manager Controller k8s-worker01 picked up instance-manager-r-837f58af"
time="2020-08-26T06:36:17Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:23Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:29Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:35Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:41Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:47Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:53Z" level=debug msg="Waiting for engine image longhornio/longhorn-engine:v1.0.2 to be ready"
time="2020-08-26T06:36:59Z" level=debug msg="Engine image longhornio/longhorn-engine:v1.0.2 is ready"
time="2020-08-26T06:36:59Z" level=info msg="Listening on fd00:10:d102:1111::a381:9500"
time="2020-08-26T06:37:06Z" level=warning msg="Error syncing Longhorn instance manager longhorn-system/instance-manager-e-a6a4186d: fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address\""
time="2020-08-26T06:37:06Z" level=warning msg="Error syncing Longhorn instance manager longhorn-system/instance-manager-e-a6a4186d: fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address\""
E0826 06:37:06.948745 1 instance_manager_controller.go:248] fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address"
time="2020-08-26T06:37:06Z" level=warning msg="Dropping Longhorn instance manager longhorn-system/instance-manager-e-a6a4186d out of the queue: fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address\""
time="2020-08-26T06:37:08Z" level=warning msg="Error syncing Longhorn instance manager longhorn-system/instance-manager-r-801495b4: fail to sync instance manager for longhorn-system/instance-manager-r-801495b4: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a384:8500: too many colons in address\""
time="2020-08-26T06:37:08Z" level=warning msg="Error syncing Longhorn instance manager longhorn-system/instance-manager-r-801495b4: fail to sync instance manager for longhorn-system/instance-manager-r-801495b4: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a384:8500: too many colons in address\""
E0826 06:37:08.951681 1 instance_manager_controller.go:248] fail to sync instance manager for longhorn-system/instance-manager-r-801495b4: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address fd00:10:d102:1111::a384:8500: too many colons in address"
time="2020-08-26T06:37:08Z" level=warning msg="Dropping Longhorn instance manager longhorn-system/instance-manager-r-801495b4 out of the queue: fail to sync instance manager for longhorn-system/instance-manager-r-801495b4: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a384:8500: too many colons in address\""
time="2020-08-26T06:37:11Z" level=warning msg="Error syncing Longhorn instance manager longhorn-system/instance-manager-r-801495b4: fail to sync instance manager for longhorn-system/instance-manager-r-801495b4: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a384:8500: too many colons in address\""
time="2020-08-26T06:37:11Z" level=warning msg="Error syncing Longhorn instance manager longhorn-system/instance-manager-e-a6a4186d: fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address\""
E0826 06:37:11.741187 1 instance_manager_controller.go:248] fail to sync instance manager for longhorn-system/instance-manager-e-a6a4186d: failed to get version: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address fd00:10:d102:1111::a383:8500: too many colons in address"
These logs keep repeating themselves.
Environment:
Additional context
This is the state of all of the longhorn pods:
NAME                                        READY   STATUS     RESTARTS   AGE
engine-image-ei-ee18f965-dxmkp              1/1     Running    0          65m
engine-image-ei-ee18f965-kvbsg              1/1     Running    0          65m
engine-image-ei-ee18f965-tvgln              1/1     Running    0          65m
instance-manager-e-778e02bb                 1/1     Running    0          65m
instance-manager-e-9a0b05b6                 1/1     Running    0          65m
instance-manager-e-a6a4186d                 1/1     Running    0          65m
instance-manager-r-801495b4                 1/1     Running    0          65m
instance-manager-r-837f58af                 1/1     Running    0          65m
instance-manager-r-909a3d7d                 1/1     Running    0          65m
longhorn-driver-deployer-6756bb8fd6-gbvvs   0/1     Init:0/1   0          66m
longhorn-manager-7bjmz                      0/1     Running    2          66m
longhorn-manager-8tf5f                      0/1     Running    1          66m
longhorn-manager-wgk2b                      0/1     Running    2          66m
longhorn-ui-6fb889895f-xm6j2                1/1     Running    0          66m
I opened two MRs yesterday (https://github.com/longhorn/longhorn-ui/pull/320, https://github.com/longhorn/longhorn-manager/pull/663) that make the UI work and communicate with the manager.
However, some IPv6 fixes are still missing, notably in csi-* and instance-manager. As of now, volumes can't be provisioned and are stuck in a "Not ready - Detaching" state.
If I've got time I'll try to do an MR today, but I'm afraid that I'm missing some information on the inner workings.
@yasker We could provide you with a test IPv6 cluster if you desire to test things.
What's missing as of now:
... and maybe more :-) :tada: