Rancher: [v2.2.4] Upgrade fails for setups with clusters having OpenStack CloudProvider with a LoadBalancer config: `cannot unmarshal number into Go value of type string`

Created on 6 Jun 2019 · 43 comments · Source: rancher/rancher

What kind of request is this (question/bug/enhancement/feature request): Bug

Steps to reproduce (least amount of steps as possible):

Upgrade from v2.2.3 to v2.2.4

Result:

Rancher Server fails to start with the following error message:

E0606 07:39:20.296926       8 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v3.Cluster: json: cannot unmarshal number into Go value of type string

Other details that may be helpful:
After a downgrade to v2.2.3, Rancher runs fine again.

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): v2.2.4
  • Installation option (single install/HA): Single Installation

Most helpful comment

@alena1108 Thank you very much! You are absolutely right. Changing the values of monitor-delay and monitor-timeout manually by changing the cluster specs (kubectl edit cluster c-xxx) has fixed the problem (at least for me).

All 43 comments

Ran into the same issue myself trying to upgrade, although mine is an HA setup with my own certs.

@ptrunk do you also have your own certs as @DJAyth mentioned in their setup?
I am not able to reproduce the issue. I am using self-signed.

@davidnuzik yes, I use Traefik in front of Rancher with a Let's Encrypt certificate, so my (single-node) Rancher runs with --no-cacerts.

Thanks @ptrunk I or someone on my team will investigate further.

@ptrunk can you share the output of kubectl get cluster --all-namespaces -o json (with the sensitive info obfuscated)

@ptrunk @DJAyth Do you have an OpenStack LB configured?

Yes, as a matter of fact, I do. The Rancher setup sits behind an OpenStack LB, but the local cluster that gets created during Rancher setup knows nothing of OpenStack, to my knowledge.

Also, my SSL setup terminates at Rancher itself on the nginx-ingress, which is different from @ptrunk's setup, I believe.

@DJAyth our theory is that the bug can be caused by https://github.com/rancher/types/commit/17ad82ec47246e37d734cf9e87ee6590ea60315d - a fix for https://github.com/rancher/rancher/issues/14577. If your user cluster objects have OpenStack with LB provider configured, and monitorDelay/monitorTimeout parameters set, they'd have to be changed from int to string. We are working on reproducing the issue and figuring out steps to mitigate; will update the bug shortly.

@mrajashree Yes my cluster which is managed by Rancher is using the openstack cloud provider with the openstack lb

@alena1108

root@18d92dabdd1a:/var/lib/rancher# kubectl get cluster --all-namespaces -o json
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "management.cattle.io/v3",
            "kind": "Cluster",
            "metadata": {
                "annotations": {
                    "authz.management.cattle.io/creator-role-bindings": "{\"created\":[\"cluster-owner\"],\"required\":[\"cluster-owner\"]}",
                    "field.cattle.io/creatorId": "user-bkxx5",
                    "field.cattle.io/overwriteAppAnswers": "{\"answers\":{\"exporter-node.enabled\":\"true\",\"exporter-node.ports.metrics.port\":\"9796\",\"exporter-kubelets.https\":\"true\",\"exporter-node.resources.limits.cpu\":\"200m\",\"exporter-node.resources.limits.memory\":\"200Mi\",\"prometheus.retention\":\"24h\",\"grafana.persistence.enabled\":\"true\",\"prometheus.persistence.enabled\":\"true\",\"prometheus.persistence.storageClass\":\"default\",\"grafana.persistence.storageClass\":\"default\",\"grafana.persistence.size\":\"1Gi\",\"prometheus.persistence.size\":\"10Gi\",\"prometheus.resources.core.requests.cpu\":\"750m\",\"prometheus.resources.core.limits.cpu\":\"1000m\",\"prometheus.resources.core.requests.memory\":\"750Mi\",\"prometheus.resources.core.limits.memory\":\"1000Mi\"}}",
                    "lifecycle.cattle.io/create.cluster-agent-controller-cleanup": "true",
                    "lifecycle.cattle.io/create.cluster-provisioner-controller": "true",
                    "lifecycle.cattle.io/create.cluster-scoped-gc": "true",
                    "lifecycle.cattle.io/create.mgmt-cluster-rbac-remove": "true",
                    "networking.management.cattle.io/enable-network-policy": "false",
                    "provisioner.cattle.io/ke-driver-update": "updated"
                },
                "creationTimestamp": "2019-05-16T13:48:41Z",
                "finalizers": [
                    "controller.cattle.io/cluster-agent-controller-cleanup",
                    "controller.cattle.io/cluster-scoped-gc",
                    "controller.cattle.io/cluster-provisioner-controller",
                    "controller.cattle.io/mgmt-cluster-rbac-remove"
                ],
                "generation": 1,
                "labels": {
                    "cattle.io/creator": "norman"
                },
                "name": "c-jp966",
                "resourceVersion": "2054723",
                "selfLink": "/apis/management.cattle.io/v3/clusters/c-jp966",
                "uid": "504dbedc-77e1-11e9-a9ac-02420a000140"
            },
            "spec": {
                "description": "Dev Cluster",
                "desiredAgentImage": "",
                "desiredAuthImage": "",
                "displayName": "dev",
                "dockerRootDir": "/var/lib/docker",
                "enableClusterAlerting": false,
                "enableClusterMonitoring": true,
                "enableNetworkPolicy": false,
                "internal": false,
                "localClusterAuthEndpoint": {
                    "enabled": false
                },
                "rancherKubernetesEngineConfig": {
                    "addonJobTimeout": 30,
                    "authentication": {
                        "strategy": "x509"
                    },
                    "authorization": {},
                    "bastionHost": {},
                    "cloudProvider": {
                        "name": "openstack",
                        "openstackCloudProvider": {
                            "blockStorage": {
                                "bs-version": "",
                                "ignore-volume-az": false,
                                "trust-device-path": false
                            },
                            "global": {
                                "auth-url": "https://keystone.url",
                                "ca-file": "",
                                "domain-id": "xxx",
                                "domain-name": "",
                                "password": "xxx",
                                "region": "",
                                "tenant-id": "xxx",
                                "tenant-name": "",
                                "trust-id": "",
                                "user-id": "",
                                "username": "xxx"
                            },
                            "loadBalancer": {
                                "create-monitor": false,
                                "floating-network-id": "",
                                "lb-method": "",
                                "lb-provider": "",
                                "lb-version": "",
                                "manage-security-groups": false,
                                "monitor-delay": 0,
                                "monitor-max-retries": 0,
                                "monitor-timeout": 0,
                                "subnet-id": "64cf703f-9504-4b32-baa3-741905e8b43a",
                                "use-octavia": true
                            },
                            "metadata": {
                                "request-timeout": 0,
                                "search-order": ""
                            },
                            "route": {
                                "router-id": ""
                            }
                        }
                    },
                    "dns": {
                        "provider": "coredns",
                        "upstreamnameservers": [
                            "1.1.1.1",
                            "9.9.9.9"
                        ]
                    },
                    "ignoreDockerVersion": true,
                    "ingress": {
                        "provider": "nginx"
                    },
                    "kubernetesVersion": "v1.14.1-rancher1-1",
                    "monitoring": {
                        "provider": "metrics-server"
                    },
                    "network": {
                        "plugin": "canal"
                    },
                    "restore": {},
                    "services": {
                        "etcd": {
                            "backupConfig": {
                                "enabled": true,
                                "intervalHours": 12,
                                "retention": 6,
                                "s3BackupConfig": null
                            },
                            "creation": "12h",
                            "extraArgs": {
                                "election-timeout": "5000",
                                "heartbeat-interval": "500"
                            },
                            "retention": "72h",
                            "snapshot": false
                        },
                        "kubeApi": {
                            "serviceNodePortRange": "30000-32767"
                        },
                        "kubeController": {},
                        "kubelet": {},
                        "kubeproxy": {},
                        "scheduler": {}
                    },
                    "sshAgentAuth": false,
                    "systemImages": {}
                }
            },
            "status": {
                "agentImage": "rancher/rancher-agent:v2.2.3",
                "allocatable": {
                    "cpu": "8",
                    "memory": "16121976Ki",
                    "pods": "220"
                },
                "apiEndpoint": "https://192.168.101.4:6443",
                "appliedEnableNetworkPolicy": false,
                "appliedPodSecurityPolicyTemplateId": "",
                "appliedSpec": {
                    "description": "Dev Cluster",
                    "desiredAgentImage": "",
                    "desiredAuthImage": "",
                    "displayName": "dev",
                    "dockerRootDir": "/var/lib/docker",
                    "enableClusterAlerting": false,
                    "enableClusterMonitoring": true,
                    "enableNetworkPolicy": false,
                    "internal": false,
                    "localClusterAuthEndpoint": {
                        "enabled": false
                    },
                    "rancherKubernetesEngineConfig": {
                        "addonJobTimeout": 30,
                        "authentication": {
                            "strategy": "x509"
                        },
                        "authorization": {},
                        "bastionHost": {},
                        "cloudProvider": {
                            "name": "openstack",
                            "openstackCloudProvider": {
                                "blockStorage": {
                                    "bs-version": "",
                                    "ignore-volume-az": false,
                                    "trust-device-path": false
                                },
                                "global": {
                                    "auth-url": "https://keystone.url",
                                    "ca-file": "",
                                    "domain-id": "xxx",
                                    "domain-name": "",
                                    "password": "xxx",
                                    "region": "",
                                    "tenant-id": "xxx",
                                    "tenant-name": "",
                                    "trust-id": "",
                                    "user-id": "",
                                    "username": "xxx"
                                },
                                "loadBalancer": {
                                    "create-monitor": false,
                                    "floating-network-id": "",
                                    "lb-method": "",
                                    "lb-provider": "",
                                    "lb-version": "",
                                    "manage-security-groups": false,
                                    "monitor-delay": 0,
                                    "monitor-max-retries": 0,
                                    "monitor-timeout": 0,
                                    "subnet-id": "64cf703f-9504-4b32-baa3-741905e8b43a",
                                    "use-octavia": true
                                },
                                "metadata": {
                                    "request-timeout": 0,
                                    "search-order": ""
                                },
                                "route": {
                                    "router-id": ""
                                }
                            }
                        },
                        "dns": {
                            "provider": "coredns",
                            "upstreamnameservers": [
                                "1.1.1.1",
                                "9.9.9.9"
                            ]
                        },
                        "ignoreDockerVersion": true,
                        "ingress": {
                            "provider": "nginx"
                        },
                        "kubernetesVersion": "v1.14.1-rancher1-1",
                        "monitoring": {
                            "provider": "metrics-server"
                        },
                        "network": {
                            "plugin": "canal"
                        },
                        "nodes": [
                            {
                                "address": "192.168.101.13",
                                "hostnameOverride": "dkw02",
                                "nodeName": "c-jp966:m-0f8b84fb7c0c",
                                "port": "22",
                                "role": [
                                    "worker"
                                ],
                                "user": "root"
                            },
                            {
                                "address": "192.168.101.4",
                                "hostnameOverride": "dkm01",
                                "nodeName": "c-jp966:m-449da24cb51f",
                                "port": "22",
                                "role": [
                                    "etcd",
                                    "controlplane"
                                ],
                                "user": "root"
                            },
                            {
                                "address": "192.168.101.7",
                                "hostnameOverride": "dkw01",
                                "nodeName": "c-jp966:m-9be6efc52982",
                                "port": "22",
                                "role": [
                                    "worker"
                                ],
                                "user": "root"
                            }
                        ],
                        "restore": {},
                        "services": {
                            "etcd": {
                                "backupConfig": {
                                    "enabled": true,
                                    "intervalHours": 12,
                                    "retention": 6,
                                    "s3BackupConfig": null
                                },
                                "creation": "12h",
                                "extraArgs": {
                                    "election-timeout": "5000",
                                    "heartbeat-interval": "500"
                                },
                                "retention": "72h",
                                "snapshot": false
                            },
                            "kubeApi": {
                                "serviceNodePortRange": "30000-32767"
                            },
                            "kubeController": {},
                            "kubelet": {},
                            "kubeproxy": {},
                            "scheduler": {}
                        },
                        "sshAgentAuth": false,
                        "systemImages": {
                            "alpine": "rancher/rke-tools:v0.1.28",
                            "calicoCni": "rancher/calico-cni:v3.4.0",
                            "calicoCtl": "rancher/calico-ctl:v2.0.0",
                            "calicoNode": "rancher/calico-node:v3.4.0",
                            "canalCni": "rancher/calico-cni:v3.4.0",
                            "canalFlannel": "rancher/coreos-flannel:v0.10.0",
                            "canalNode": "rancher/calico-node:v3.4.0",
                            "certDownloader": "rancher/rke-tools:v0.1.28",
                            "coredns": "coredns/coredns:1.3.1",
                            "corednsAutoscaler": "rancher/cluster-proportional-autoscaler:1.3.0",
                            "dnsmasq": "rancher/k8s-dns-dnsmasq-nanny:1.15.0",
                            "etcd": "rancher/coreos-etcd:v3.3.10-rancher1",
                            "flannel": "rancher/coreos-flannel:v0.10.0-rancher1",
                            "flannelCni": "rancher/flannel-cni:v0.3.0-rancher1",
                            "ingress": "rancher/nginx-ingress-controller:0.21.0-rancher3",
                            "ingressBackend": "rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1",
                            "kubedns": "rancher/k8s-dns-kube-dns:1.15.0",
                            "kubednsAutoscaler": "rancher/cluster-proportional-autoscaler:1.3.0",
                            "kubednsSidecar": "rancher/k8s-dns-sidecar:1.15.0",
                            "kubernetes": "rancher/hyperkube:v1.14.1-rancher1",
                            "kubernetesServicesSidecar": "rancher/rke-tools:v0.1.28",
                            "metricsServer": "rancher/metrics-server:v0.3.1",
                            "nginxProxy": "rancher/rke-tools:v0.1.28",
                            "podInfraContainer": "rancher/pause:3.1",
                            "weaveCni": "weaveworks/weave-npc:2.5.0",
                            "weaveNode": "weaveworks/weave-kube:2.5.0"
                        }
                    }
                },
                "authImage": "",
                "caCert": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN3akNDQWFxZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFTTVJBd0RnWURWUVFERXdkcmRXSmwKTFdOaE1CNFhEVEU1TURVeE5qRXpORGcxTWxvWERUSTVNRFV4TXpFek5EZzFNbG93RWpFUU1BNEdBMVVFQXhNSAphM1ZpWlMxallUQ0NBU0l3RFFZSktvWklodmNOQVFFQkJRQURnZ0VQQURDQ0FRb0NnZ0VCQU95c1BRdE9RcUhnCldhSGZJalo3b2NUSFZ2NzFoVGFmUmVkelhRS212WkJhaGM5N2lVYXdWT3dQN1NOQmFqUlNiRlhFSmpIanZ0cUUKdkk5THcyZFE0YUh3dmpqTzY4REVaSVlob0ozS2Nxb3VjYUZBQnd1ZDVCNGFRRi83dThkbTFRZlpBM1BPNmhhawo1UTRUMkE2eDNkZFB6eXlVaTBjUGoxUk1zTW15Z3gyUklnakRjZUZmVGY5Z2JtTG5Ec2Y1bVZGdjE4b1JpMmdjCjZrbk1tYm1lZnAwbGovc29mNTAxVHo1ckYrRUpZZUVJUHRGQm1ZRUg2cDkySjVzK2FhQWRYS29rKzVkaCswQlMKa3gzc2FKZWJud3RMd25kcUVMdmd1K3NRS051LzM1akluKzFQVzZ4dll2RGRiMHpudms1cTVhZkZJWmNDZnhaawpVZzV5bzFYWlg5RUNBd0VBQWFNak1DRXdEZ1lEVlIwUEFRSC9CQVFEQWdLa01BOEdBMVVkRXdFQi93UUZNQU1CCkFmOHdEUVlKS29aSWh2Y05BUUVMQlFBRGdnRUJBR1dhTFdaRVd3b2pHMGw1KzMwUmNlUWlPZ0d0MStvcUx3ZHoKazlGUkdvZW5Gc0VpNWY0Q1RVWVVodkg5Zm1mTDR3eHBTTkNhY1Z3alY3a3NJTitHK2ErTzM4anhWT0dyKzdpQgoydmd6VnU0bFR4ZTk0djliZVk2Y0JhN0l2MWdWdFk2NTBySkwxMlFMS29OYUgrT3ltVmptMVIxNWxTNmhYOEQ0Cms1NWZSQk1vaCtrR0tmbXBDSkxsUzVtbEo4VVgwVzVtWThWNndSM2JLWmtwVEJQcm9walZGeWRHTEEvQXF0RWYKWjVuU2J6US9UazRXQS9BMjErZUFtZDNHcTdZQkIyRkZFV2pDREcyYk1IVStVVm9xallyN1BKL3QyaGhaU3M4MAp1azdHbTdqTXBWRy8zOWtpWWpvV3dqRHdoWnQ2eVc5eTdsNXA5Tzk3N09zZCs4OVFub289Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K",
                "capabilities": {
                    "ingressCapabilities": [
                        {
                            "ingressProvider": "Nginx"
                        }
                    ],
                    "loadBalancerCapabilities": {
                        "enabled": false
                    },
                    "nodePortRange": "30000-32767"
                },
                "capacity": {
                    "cpu": "8",
                    "memory": "16326776Ki",
                    "pods": "220"
                },
                "componentStatuses": [
                    {
                        "conditions": [
                            {
                                "message": "ok",
                                "status": "True",
                                "type": "Healthy"
                            }
                        ],
                        "name": "controller-manager"
                    },
                    {
                        "conditions": [
                            {
                                "message": "{\"health\":\"true\"}",
                                "status": "True",
                                "type": "Healthy"
                            }
                        ],
                        "name": "etcd-0"
                    },
                    {
                        "conditions": [
                            {
                                "message": "ok",
                                "status": "True",
                                "type": "Healthy"
                            }
                        ],
                        "name": "scheduler"
                    }
                ],
                "conditions": [
                    {
                        "status": "True",
                        "type": "Pending"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:50:47Z",
                        "status": "True",
                        "type": "Provisioned"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:51:02Z",
                        "status": "True",
                        "type": "Waiting"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:48:41Z",
                        "status": "True",
                        "type": "BackingNamespaceCreated"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:48:41Z",
                        "status": "True",
                        "type": "DefaultProjectCreated"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:48:41Z",
                        "status": "True",
                        "type": "SystemProjectCreated"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:48:41Z",
                        "status": "True",
                        "type": "InitialRolesPopulated"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:48:41Z",
                        "status": "True",
                        "type": "CreatorMadeOwner"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:48:41Z",
                        "status": "True",
                        "type": "NoDiskPressure"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:48:41Z",
                        "status": "True",
                        "type": "NoMemoryPressure"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:50:47Z",
                        "status": "True",
                        "type": "GlobalAdminsSynced"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:50:48Z",
                        "status": "False",
                        "type": "AlertingEnabled"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:50:49Z",
                        "status": "True",
                        "type": "SystemAccountCreated"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:50:49Z",
                        "status": "True",
                        "type": "AgentDeployed"
                    },
                    {
                        "lastUpdateTime": "2019-05-17T22:40:45Z",
                        "status": "True",
                        "type": "PrometheusOperatorDeployed"
                    },
                    {
                        "lastUpdateTime": "2019-05-20T15:58:07Z",
                        "status": "True",
                        "type": "Updated"
                    },
                    {
                        "lastUpdateTime": "2019-05-16T13:51:00Z",
                        "status": "True",
                        "type": "ServiceAccountMigrated"
                    },
                    {
                        "lastUpdateTime": "2019-06-06T08:17:40Z",
                        "status": "True",
                        "type": "Ready"
                    },
                    {
                        "lastUpdateTime": "2019-05-31T15:56:21Z",
                        "status": "True",
                        "type": "MonitoringEnabled"
                    }
                ],
                "driver": "rancherKubernetesEngine",
                "limits": {
                    "cpu": "2760m",
                    "memory": "2565Mi",
                    "pods": "0"
                },
                "monitoringStatus": {
                    "conditions": [
                        {
                            "lastUpdateTime": "2019-05-17T22:41:21Z",
                            "status": "True",
                            "type": "MetricExpressionDeployed"
                        }
                    ],
                    "grafanaEndpoint": "/k8s/clusters/c-jp966/api/v1/namespaces/cattle-prometheus/services/http:access-grafana:80/proxy/"
                },
                "requested": {
                    "cpu": "2590m",
                    "memory": "1817Mi",
                    "pods": "25"
                },
                "serviceAccountToken": "xxx",
                "version": {
                    "buildDate": "2019-04-08T17:02:58Z",
                    "compiler": "gc",
                    "gitCommit": "b7394102d6ef778017f2ca4046abbaa23b88c290",
                    "gitTreeState": "clean",
                    "gitVersion": "v1.14.1",
                    "goVersion": "go1.12.1",
                    "major": "1",
                    "minor": "14",
                    "platform": "linux/amd64"
                }
            }
        }
    ],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}

@alena1108 Thank you very much! You are absolutely right. Changing the values of monitor-delay and monitor-timeout manually by changing the cluster specs (kubectl edit cluster c-xxx) has fixed the problem (at least for me).

Steps to mitigate:

* run kubectl on a rancher management plane, find the affected cluster
* kubectl edit cluster clustername --namespace clusternamespace
* Find the params (monitor-delay and monitor-timeout) in the cluster spec and status, quote each value and add a unit to it. Example: if it was 30, change it to "30s"; if it was 0, change it to ""
* Save and exit the editor
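The per-field edit in the third step can also be sketched programmatically. The helper below is hypothetical, not an official tool: it quotes the two affected values in JSON text dumped via kubectl get cluster -o json, assuming non-zero values are in seconds.

```go
package main

import (
	"fmt"
	"regexp"
)

// quoteMonitorFields quotes the numeric monitor-delay / monitor-timeout
// values in a dumped cluster JSON: 0 becomes "" and a non-zero value such
// as 30 becomes "30s" (seconds assumed). Hypothetical helper for this
// workaround only; monitor-max-retries stays numeric and is untouched.
func quoteMonitorFields(spec string) string {
	zero := regexp.MustCompile(`("monitor-(?:delay|timeout)":\s*)0\b`)
	spec = zero.ReplaceAllString(spec, `$1""`)
	nonzero := regexp.MustCompile(`("monitor-(?:delay|timeout)":\s*)(\d+)\b`)
	return nonzero.ReplaceAllString(spec, `$1"${2}s"`)
}

func main() {
	fmt.Println(quoteMonitorFields(`"monitor-delay": 0, "monitor-timeout": 30,`))
	// "monitor-delay": "", "monitor-timeout": "30s",
}
```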

When making the fix that caused this issue, we assumed MonitorDelay and MonitorTimeout would never be set on user clusters, since with them set the kubelet never starts. But the Rancher API framework defaults an unset int value to 0, and that causes issues on upgrade.

I'll give that a shot tomorrow morning. Editing the cluster via kubectl will work, but a problem still exists if you modify the cluster in the Rancher UI: I can set the values to "" with kubectl, but if I then edit and save the cluster in the UI, they get set back to 0.

As mentioned, I will test this out tomorrow.

@DJAyth

but a problem still exists if you try to modify the cluster in the Rancher UI. I can set it to "" with kubectl, but if I edit the cluster in the UI it interprets those as 0 still.

The Rancher 2.2.4 UI will no longer set it to 0 once you fix your existing clusters via kubectl. On every subsequent cluster upgrade in 2.2.4, and for newly created clusters, the value will default to "", as the type in the schema was changed to string.

Update: the steps of resetting the value from 0 to "" should be performed after the upgrade is attempted, not prior to it. 2.2.3 understands int, not string, and the change would break it.

Perfect. I'll give this a shot. Many thanks all.

Hi,

we have the same problem with the upgrade to 2.2.4 and OpenStack LBaaS enabled in 2 clusters. I tried the workaround: edited the clusters and changed the values for monitor-delay & monitor-timeout.

After that I did a helm upgrade rancher rancher-latest/rancher --reuse-values

But the upgrade fails again:

kubectl --kubeconfig=rancherserver/kube_config_rancher-cluster.yml describe pod -n cattle-system rancher-b95ffcd98-zggn8 
Name:               rancher-b95ffcd98-zggn8
Namespace:          cattle-system
Priority:           0
PriorityClassName:  <none>
Node:               rancherserver2.XXXXX/10.250.0.116
Start Time:         Thu, 06 Jun 2019 21:26:29 +0200
Labels:             app=rancher
                    pod-template-hash=b95ffcd98
                    release=rancher
Annotations:        cni.projectcalico.org/podIP: 192.168.2.28/32
Status:             Running
IP:                 192.168.2.28
Controlled By:      ReplicaSet/rancher-b95ffcd98
Containers:
  rancher:
    Container ID:  docker://9612e698932db42a29468fccc37f987541b00c7a9023ef5c293cfc89bd8197b0
    Image:         rancher/rancher:v2.2.4
    Image ID:      docker-pullable://rancher/rancher@sha256:b22d5e239b8750edb639d0fe181c528fccacea67fb23412dff3c5f66e57e44a2
    Port:          80/TCP
    Host Port:     0/TCP
    Args:
      --no-cacerts
      --http-listen-port=80
      --https-listen-port=443
      --add-local=auto
    State:          Running
      Started:      Thu, 06 Jun 2019 21:28:33 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 06 Jun 2019 21:26:31 +0200
      Finished:     Thu, 06 Jun 2019 21:28:32 +0200
    Ready:          False
    Restart Count:  1
    Liveness:       http-get http://:80/healthz delay=60s timeout=1s period=30s #success=1 #failure=3
    Readiness:      http-get http://:80/healthz delay=5s timeout=1s period=30s #success=1 #failure=3
    Environment:
      CATTLE_NAMESPACE:     cattle-system
      CATTLE_PEER_SERVICE:  rancher
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from rancher-token-m7mqf (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  rancher-token-m7mqf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rancher-token-m7mqf
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                    From                                             Message
  ----     ------     ----                   ----                                             -------
  Normal   Scheduled  4m33s                  default-scheduler                                Successfully assigned cattle-system/rancher-b95ffcd98-zggn8 to rancherserver2.XXXXX
  Normal   Pulled     2m30s (x2 over 4m32s)  kubelet, rancherserver2.XXXXX  Container image "rancher/rancher:v2.2.4" already present on machine
  Normal   Killing    2m30s                  kubelet, rancherserver2.XXXXX  Killing container with id docker://rancher:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Created    2m29s (x2 over 4m31s)  kubelet, rancherserver2.XXXXX  Created container
  Normal   Started    2m29s (x2 over 4m30s)  kubelet, rancherserver2.XXXXX  Started container
  Warning  Unhealthy  12s (x9 over 4m12s)    kubelet, rancherserver2.XXXXX  Readiness probe failed: Get http://192.168.2.28:80/healthz: dial tcp 192.168.2.28:80: connect: connection refused
  Warning  Unhealthy  0s (x6 over 3m30s)     kubelet, rancherserver2.XXXXXX  Liveness probe failed: Get http://192.168.2.28:80/healthz: dial tcp 192.168.2.28:80: connect: connection refused

@Redbierbaron are you seeing this same error?

Failed to list *v3.Cluster: json: cannot unmarshal number into Go value of type string

Also, you shouldn't need to upgrade Rancher again after editing the clusters, since you already seem to be upgraded to v2.2.4?

@Redbierbaron what did you edit the values of monitor-delay & monitor-timeout to?

I just tested it out on an HA setup and it works for me:

  • In 2.2.3, those values were 0
  • I upgraded to 2.2.4, upgrade fails
  • I edited the clusters to set the values to ""
  • pods start running and I can access the login page

@mrajashree I followed the same steps:

  • In 2.2.3, those values were 0
  • I upgraded to 2.2.4 and the upgrade failed with: Failed to list *v3.Cluster: json: cannot unmarshal number into Go value of type string
  • did a helm rollback to 2.2.3
  • edited the clusters to set the values to ""
  • upgraded to 2.2.4 again
  • the pods did not start, failing with the same error as above

Log output from the rancher:v2.2.4 container:

2019/06/07 06:13:58 [INFO] Adding peer wss://192.168.0.22/v3/connect, 192.168.0.22
2019/06/07 06:13:58 [INFO] Adding peer wss://192.168.1.28/v3/connect, 192.168.1.28
2019/06/07 06:13:58 [INFO] Adding peer wss://192.168.2.23/v3/connect, 192.168.2.23
2019/06/07 06:13:58 [INFO] Listening on :443
2019/06/07 06:13:58 [INFO] Listening on :80
I0607 06:13:58.194517      10 http.go:110] HTTP2 has been explicitly disabled
2019/06/07 06:13:58 [ERROR] Failed to connect to peer wss://192.168.2.23/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:13:58 [ERROR] Failed to connect to peer wss://192.168.0.22/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:13:58 [ERROR] Failed to connect to peer wss://192.168.1.28/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:13:58 [INFO] Stopping cluster agent for c-7cbb2
I0607 06:13:58.693483      10 http.go:110] HTTP2 has been explicitly disabled
2019/06/07 06:13:59 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:13:59 [INFO] Starting cluster controllers for local
I0607 06:13:59.359300      10 leaderelection.go:187] attempting to acquire leader lease  kube-system/cattle-controllers...
2019/06/07 06:13:59 [INFO] Starting cluster agent for local [owner=false]
2019/06/07 06:13:59 [ERROR] ClusterController c-7cbb2 failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:13:59 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:13:59 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:13:59 [ERROR] ClusterController c-7cbb2 failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:13:59 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:13:59 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:13:59 [ERROR] Failed syncing peers [{192.168.2.30 [192.168.0.22 192.168.1.28 192.168.2.23] false false}]: failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:13:59 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:13:59 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:13:59 [ERROR] ClusterController c-c8g4g failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:00 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:00 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:00 [ERROR] ClusterController local failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:00 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:00 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:00 [ERROR] ClusterController c-c8g4g failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:00 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:00 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:00 [ERROR] ClusterController c-7cbb2 failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:00 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:00 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:00 [ERROR] ClusterController c-c8g4g failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:00 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:00 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:00 [ERROR] ClusterController local failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:01 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:02 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:02 [ERROR] ClusterController local failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:02 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:02 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:02 [ERROR] ClusterController c-7cbb2 failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:02 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:02 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:02 [ERROR] ClusterController c-c8g4g failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:03 [ERROR] Failed to connect to peer wss://192.168.0.22/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:14:03 [ERROR] Failed to connect to peer wss://192.168.2.23/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:14:03 [ERROR] Failed to connect to peer wss://192.168.1.28/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:14:04 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:04 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:04 [ERROR] ClusterController local failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:06 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:06 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:06 [ERROR] ClusterController c-7cbb2 failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:06 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:06 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:06 [ERROR] ClusterController c-c8g4g failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:08 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:08 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:08 [ERROR] ClusterController local failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:08 [ERROR] Failed to connect to peer wss://192.168.2.23/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:14:08 [ERROR] Failed to connect to peer wss://192.168.0.22/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:14:08 [ERROR] Failed to connect to peer wss://192.168.1.28/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:14:13 [ERROR] Failed to connect to peer wss://192.168.2.23/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:14:13 [ERROR] Failed to connect to peer wss://192.168.0.22/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:14:13 [ERROR] Failed to connect to peer wss://192.168.1.28/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:14:14 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:14 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:14 [ERROR] ClusterController c-7cbb2 failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:14 [INFO] Stopping cluster agent for c-7cbb2
2019/06/07 06:14:15 [INFO] Handling backend connection request [192.168.0.22]
2019/06/07 06:14:15 [INFO] Handling backend connection request [192.168.2.23]
2019/06/07 06:14:15 [INFO] Handling backend connection request [192.168.1.28]
2019/06/07 06:14:15 [INFO] Removing peer 192.168.0.22
2019/06/07 06:14:15 [INFO] Stopping cluster agent for c-c8g4g
2019/06/07 06:14:15 [ERROR] ClusterController c-c8g4g failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect, failed to start user controllers for cluster c-c8g4g: failed to contact server: Get https://10.250.2.116:6443/version?timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:15 [INFO] Starting cluster controllers for c-7cbb2
2019/06/07 06:14:15 [INFO] error in remotedialer server [400]: websocket: close 1006 unexpected EOF
2019/06/07 06:14:15 [INFO] Starting cluster agent for c-7cbb2 [owner=false]
2019/06/07 06:14:15 [INFO] Starting cluster controllers for c-c8g4g
2019/06/07 06:14:15 [INFO] Starting cluster agent for c-c8g4g [owner=false]
E0607 06:14:25.173896      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.174632      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.174706      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.174853      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.175074      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.175233      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.175298      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.175366      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.178985      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.179154      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.179203      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.179236      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.179298      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.179347      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:25.179551      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.ClusterRoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/watch/clusterrolebindings?resourceVersion=13896140&timeout=30m0s&timeoutSeconds=420: waiting for cluster agent to connect
E0607 06:14:25.179708      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Namespace: Get https://10.250.1.123:6443/api/v1/watch/namespaces?resourceVersion=13896167&timeout=30m0s&timeoutSeconds=489: waiting for cluster agent to connect
E0607 06:14:25.179833      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.RoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/watch/rolebindings?resourceVersion=13896141&timeout=30m0s&timeoutSeconds=391: waiting for cluster agent to connect
E0607 06:14:25.179990      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Service: Get https://10.250.1.123:6443/api/v1/watch/services?resourceVersion=13967675&timeout=30m0s&timeoutSeconds=335: waiting for cluster agent to connect
E0607 06:14:25.180059      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Endpoints: Get https://10.250.1.123:6443/api/v1/watch/endpoints?resourceVersion=14103482&timeout=30m0s&timeoutSeconds=431: waiting for cluster agent to connect
E0607 06:14:25.180392      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.ClusterRole: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/watch/clusterroles?resourceVersion=13898514&timeout=30m0s&timeoutSeconds=439: waiting for cluster agent to connect
E0607 06:14:25.180720      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Role: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/watch/roles?resourceVersion=11190268&timeout=30m0s&timeoutSeconds=524: waiting for cluster agent to connect
2019/06/07 06:14:25 [INFO] Handling backend connection request [c-c8g4g:m-frrh5]
2019/06/07 06:14:25 [INFO] Handling backend connection request [c-c8g4g:m-6z5r7]
E0607 06:14:25.792229      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Endpoints: Get https://10.250.2.116:6443/api/v1/watch/endpoints?resourceVersion=10020221&timeout=30m0s&timeoutSeconds=314: tunnel disconnect
E0607 06:14:26.180090      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:26.181013      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Namespace: Get https://10.250.1.123:6443/api/v1/namespaces?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:26.182038      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.RoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:26.184885      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Service: Get https://10.250.1.123:6443/api/v1/services?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:26.186042      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Endpoints: Get https://10.250.1.123:6443/api/v1/endpoints?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:26.187371      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRole: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:26.188219      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Role: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:26 [ERROR] Failed syncing peers [{192.168.2.30 [192.168.2.23 192.168.1.28] true false}]: failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: waiting for cluster agent to connect
E0607 06:14:26.980279      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Namespace: Get https://10.250.2.116:6443/api/v1/watch/namespaces?resourceVersion=8232163&timeout=30m0s&timeoutSeconds=513: tunnel disconnect
E0607 06:14:26.980651      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.ClusterRole: Get https://10.250.2.116:6443/apis/rbac.authorization.k8s.io/v1/watch/clusterroles?resourceVersion=7152695&timeout=30m0s&timeoutSeconds=325: tunnel disconnect
E0607 06:14:26.980885      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.ClusterRoleBinding: Get https://10.250.2.116:6443/apis/rbac.authorization.k8s.io/v1/watch/clusterrolebindings?resourceVersion=9652537&timeout=30m0s&timeoutSeconds=583: tunnel disconnect
E0607 06:14:26.980964      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Role: Get https://10.250.2.116:6443/apis/rbac.authorization.k8s.io/v1/watch/roles?resourceVersion=478963&timeout=30m0s&timeoutSeconds=484: tunnel disconnect
E0607 06:14:27.180609      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:27.181530      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Namespace: Get https://10.250.1.123:6443/api/v1/namespaces?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:27.182508      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.RoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:27.187077      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Endpoints: Get https://10.250.1.123:6443/api/v1/endpoints?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:27.188270      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Service: Get https://10.250.1.123:6443/api/v1/services?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:27.188641      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRole: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:27.189620      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Role: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:28.181354      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:28.182074      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Namespace: Get https://10.250.1.123:6443/api/v1/namespaces?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:28.183638      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.RoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:28.187594      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Endpoints: Get https://10.250.1.123:6443/api/v1/endpoints?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:28.188606      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Service: Get https://10.250.1.123:6443/api/v1/services?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:28.189748      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRole: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:28.191001      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Role: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
2019/06/07 06:14:35 [INFO] Handling backend connection request [c-7cbb2:m-flcgl]
2019/06/07 06:14:35 [INFO] Handling backend connection request [local]
2019/06/07 06:14:36 [INFO] Handling backend connection request [c-7cbb2:m-t7bb9]
2019/06/07 06:14:38 [ERROR] ClusterController c-7cbb2 [user-controllers-controller] failed with : failed to start user controllers for cluster c-7cbb2: failed to contact server: Get https://10.250.1.123:6443/version?timeout=30s: tunnel disconnect
E0607 06:14:38.505823      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRole: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=30s: tunnel disconnect
E0607 06:14:38.505891      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Namespace: Get https://10.250.1.123:6443/api/v1/namespaces?limit=500&resourceVersion=0&timeout=30s: tunnel disconnect
E0607 06:14:38.505953      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Service: Get https://10.250.1.123:6443/api/v1/services?limit=500&resourceVersion=0&timeout=30s: tunnel disconnect
E0607 06:14:38.506027      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.RoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0&timeout=30s: tunnel disconnect
E0607 06:14:38.506079      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Role: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0&timeout=30s: tunnel disconnect
E0607 06:14:38.506174      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Endpoints: Get https://10.250.1.123:6443/api/v1/endpoints?limit=500&resourceVersion=0&timeout=30s: tunnel disconnect
E0607 06:14:38.506265      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0&timeout=30s: tunnel disconnect
2019/06/07 06:14:42 [INFO] Handling backend connection request [c-c8g4g:m-wg8fx]
2019/06/07 06:14:45 [INFO] error in remotedialer server [400]: read tcp 192.168.2.30:80->192.168.0.0:53314: i/o timeout
2019/06/07 06:14:46 [INFO] error in remotedialer server [400]: read tcp 192.168.2.30:80->10.250.0.116:47922: i/o timeout
2019/06/07 06:14:47 [INFO] Adding peer wss://192.168.0.24/v3/connect, 192.168.0.24
2019/06/07 06:14:48 [INFO] error in remotedialer server [400]: websocket: close 1006 unexpected EOF
2019/06/07 06:14:48 [ERROR] Failed to serve peer connection 192.168.1.28: websocket: close 1006 unexpected EOF
2019/06/07 06:14:48 [INFO] Handling backend connection request [192.168.0.24]
2019/06/07 06:14:48 [INFO] Removing peer 192.168.1.28
2019/06/07 06:14:52 [INFO] error in remotedialer server [400]: read tcp 192.168.2.30:80->192.168.0.0:53344: i/o timeout
2019/06/07 06:14:58 [INFO] error in remotedialer server [400]: websocket: close 1006 unexpected EOF
E0607 06:14:58.195928      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.196023      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.205583      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.ClusterRoleBinding: Get https://10.250.2.116:6443/apis/rbac.authorization.k8s.io/v1/watch/clusterrolebindings?resourceVersion=9652537&timeout=30m0s&timeoutSeconds=555: tunnel disconnect
2019/06/07 06:14:58 [INFO] error in remotedialer server [400]: websocket: close 1006 unexpected EOF
2019/06/07 06:14:58 [INFO] error in remotedialer server [400]: websocket: close 1006 unexpected EOF
E0607 06:14:58.208680      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.209320      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.209366      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.209392      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
W0607 06:14:58.209866      10 reflector.go:270] github.com/rancher/norman/controller/generic_controller.go:175: watch of *v1.ClusterRole ended with: very short watch: github.com/rancher/norman/controller/generic_controller.go:175: Unexpected watch close - watch lasted less than a second and no items received
E0607 06:14:58.519873      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.520117      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.520230      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.520150      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.520183      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
W0607 06:14:58.520516      10 reflector.go:270] github.com/rancher/norman/controller/generic_controller.go:175: watch of *v1.RoleBinding ended with: very short watch: github.com/rancher/norman/controller/generic_controller.go:175: Unexpected watch close - watch lasted less than a second and no items received
W0607 06:14:58.520570      10 reflector.go:270] github.com/rancher/norman/controller/generic_controller.go:175: watch of *v1.Service ended with: very short watch: github.com/rancher/norman/controller/generic_controller.go:175: Unexpected watch close - watch lasted less than a second and no items received
W0607 06:14:58.520635      10 reflector.go:270] github.com/rancher/norman/controller/generic_controller.go:175: watch of *v1.Endpoints ended with: very short watch: github.com/rancher/norman/controller/generic_controller.go:175: Unexpected watch close - watch lasted less than a second and no items received
E0607 06:14:58.521904      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.522233      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.522385      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.523129      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Service: Get https://10.250.1.123:6443/api/v1/watch/services?resourceVersion=13967675&timeout=30m0s&timeoutSeconds=571: waiting for cluster agent to connect
E0607 06:14:58.523337      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Namespace: Get https://10.250.1.123:6443/api/v1/watch/namespaces?resourceVersion=13896167&timeout=30m0s&timeoutSeconds=503: waiting for cluster agent to connect
E0607 06:14:58.526138      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.526534      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.527062      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.ClusterRoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/watch/clusterrolebindings?resourceVersion=13896140&timeout=30m0s&timeoutSeconds=407: waiting for cluster agent to connect
E0607 06:14:58.526590      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.527442      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.RoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/watch/rolebindings?resourceVersion=13896141&timeout=30m0s&timeoutSeconds=512: waiting for cluster agent to connect
E0607 06:14:58.526633      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:14:58.527903      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Endpoints: Get https://10.250.1.123:6443/api/v1/watch/endpoints?resourceVersion=14103594&timeout=30m0s&timeoutSeconds=339: waiting for cluster agent to connect
E0607 06:14:58.528008      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Role: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/watch/roles?resourceVersion=12424649&timeout=30m0s&timeoutSeconds=394: waiting for cluster agent to connect
E0607 06:14:58.529388      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.ClusterRole: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/watch/clusterroles?resourceVersion=13898514&timeout=30m0s&timeoutSeconds=480: waiting for cluster agent to connect
E0607 06:14:59.523617      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Service: Get https://10.250.1.123:6443/api/v1/services?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:59.524723      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Namespace: Get https://10.250.1.123:6443/api/v1/namespaces?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:59.527991      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:59.528519      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.RoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:59.529667      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Endpoints: Get https://10.250.1.123:6443/api/v1/endpoints?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:59.533158      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRole: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:14:59.534011      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Role: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:00.524110      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Service: Get https://10.250.1.123:6443/api/v1/services?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:00.525285      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Namespace: Get https://10.250.1.123:6443/api/v1/namespaces?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:00.528585      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:00.529406      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.RoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:00.530517      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Endpoints: Get https://10.250.1.123:6443/api/v1/endpoints?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:00.533672      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRole: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:00.534630      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Role: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:01.524610      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Service: Get https://10.250.1.123:6443/api/v1/services?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:01.525750      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Namespace: Get https://10.250.1.123:6443/api/v1/namespaces?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:01.528951      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:01.530376      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.RoleBinding: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:01.531547      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Endpoints: Get https://10.250.1.123:6443/api/v1/endpoints?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:01.534060      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRole: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
E0607 06:15:01.535079      10 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Role: Get https://10.250.1.123:6443/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0&timeout=30s: waiting for cluster agent to connect
2019/06/07 06:15:02 [INFO] Handling backend connection request [c-7cbb2:m-jxc89]
2019/06/07 06:15:03 [INFO] Handling backend connection request [c-c8g4g:m-jhfl2]
2019/06/07 06:15:08 [INFO] Handling backend connection request [c-7cbb2:m-ktxln]
2019/06/07 06:15:08 [INFO] Handling backend connection request [c-c8g4g:m-6z5r7]
2019/06/07 06:15:08 [INFO] Handling backend connection request [local]
2019/06/07 06:15:08 [INFO] Handling backend connection request [c-7cbb2:m-jxc89]
2019/06/07 06:15:13 [INFO] Handling backend connection request [c-7cbb2:m-t7bb9]
2019/06/07 06:15:13 [INFO] error in remotedialer server [400]: read tcp 192.168.2.30:80->192.168.0.0:53798: i/o timeout
2019/06/07 06:15:16 [INFO] Handling backend connection request [c-7cbb2:m-4tlrs]
2019/06/07 06:15:18 [INFO] error in remotedialer server [400]: read tcp 192.168.2.30:80->192.168.1.0:47712: i/o timeout
2019/06/07 06:15:18 [INFO] error in remotedialer server [400]: read tcp 192.168.2.30:80->192.168.1.0:47718: i/o timeout
2019/06/07 06:15:26 [INFO] error in remotedialer server [400]: websocket: close 1006 unexpected EOF
E0607 06:15:26.082295      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:26.082345      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:26.082448      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:26.082471      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:26.082780      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:26.082802      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:26.082831      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
W0607 06:15:26.087924      10 reflector.go:270] github.com/rancher/norman/controller/generic_controller.go:175: watch of *v1.Role ended with: too old resource version: 11190268 (12424649)
2019/06/07 06:15:26 [INFO] error in remotedialer server [400]: read tcp 192.168.2.30:80->192.168.0.0:53898: i/o timeout
W0607 06:15:31.028242      10 reflector.go:270] github.com/rancher/norman/controller/generic_controller.go:175: watch of *v1.Role ended with: too old resource version: 477283 (478963)
2019/06/07 06:15:34 [INFO] Handling backend connection request [c-c8g4g:m-sswhn]
E0607 06:15:37.079819      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
2019/06/07 06:15:40 [INFO] Handling backend connection request [c-c8g4g:m-wg8fx]
2019/06/07 06:15:41 [INFO] Handling backend connection request [c-c8g4g:m-6z5r7]
2019/06/07 06:15:46 [INFO] Adding peer wss://192.168.1.30/v3/connect, 192.168.1.30
2019/06/07 06:15:47 [ERROR] Failed to serve peer connection 192.168.2.23: websocket: close 1006 unexpected EOF
2019/06/07 06:15:47 [INFO] error in remotedialer server [400]: websocket: close 1006 unexpected EOF
E0607 06:15:47.078038      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:47.078047      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:47.078131      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:47.078545      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:47.078609      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:47.079481      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:47.079575      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
2019/06/07 06:15:47 [INFO] Removing peer 192.168.2.23
2019/06/07 06:15:49 [INFO] Handling backend connection request [192.168.1.30]
2019/06/07 06:15:50 [INFO] error in remotedialer server [400]: read tcp 192.168.2.30:80->192.168.0.0:53968: i/o timeout
2019/06/07 06:15:55 [INFO] Handling backend connection request [c-7cbb2:m-4tlrs]
E0607 06:15:56.372190      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:56.372436      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:56.372474      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:56.372454      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:56.372519      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:56.372944      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:56.372972      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:56.393681      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Role: Get https://10.250.2.116:6443/apis/rbac.authorization.k8s.io/v1/watch/roles?resourceVersion=478963&timeout=30m0s&timeoutSeconds=476: tunnel disconnect
2019/06/07 06:15:56 [INFO] error in remotedialer server [400]: websocket: close 1006 unexpected EOF
2019/06/07 06:15:56 [INFO] error in remotedialer server [400]: read tcp 192.168.2.30:80->10.250.0.116:48140: read: connection reset by peer
E0607 06:15:56.403368      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:56.403475      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:56.404067      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:56.404323      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:56.405687      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Namespace: Get https://10.250.2.116:6443/api/v1/watch/namespaces?resourceVersion=8232163&timeout=30m0s&timeoutSeconds=376: tunnel disconnect
E0607 06:15:56.405843      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Service: Get https://10.250.2.116:6443/api/v1/watch/services?resourceVersion=9657754&timeout=30m0s&timeoutSeconds=568: tunnel disconnect
E0607 06:15:56.406710      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.RoleBinding: Get https://10.250.2.116:6443/apis/rbac.authorization.k8s.io/v1/watch/rolebindings?resourceVersion=8193185&timeout=30m0s&timeoutSeconds=581: tunnel disconnect
E0607 06:15:56.406866      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.ClusterRole: Get https://10.250.2.116:6443/apis/rbac.authorization.k8s.io/v1/watch/clusterroles?resourceVersion=7152695&timeout=30m0s&timeoutSeconds=437: tunnel disconnect
2019/06/07 06:15:56 [INFO] error in remotedialer server [400]: websocket: close 1006 unexpected EOF
2019/06/07 06:15:56 [INFO] error in remotedialer server [400]: websocket: close 1006 unexpected EOF
E0607 06:15:56.924283      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
W0607 06:15:56.924341      10 reflector.go:270] github.com/rancher/norman/controller/generic_controller.go:175: watch of *v1.Endpoints ended with: very short watch: github.com/rancher/norman/controller/generic_controller.go:175: Unexpected watch close - watch lasted less than a second and no items received
2019/06/07 06:15:56 [INFO] error in remotedialer server [400]: websocket: close 1006 unexpected EOF
E0607 06:15:56.924657      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
W0607 06:15:56.924677      10 reflector.go:270] github.com/rancher/norman/controller/generic_controller.go:175: watch of *v1.ClusterRoleBinding ended with: very short watch: github.com/rancher/norman/controller/generic_controller.go:175: Unexpected watch close - watch lasted less than a second and no items received
E0607 06:15:56.924715      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:56.924732      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
W0607 06:15:56.924736      10 reflector.go:270] github.com/rancher/norman/controller/generic_controller.go:175: watch of *v1.RoleBinding ended with: very short watch: github.com/rancher/norman/controller/generic_controller.go:175: Unexpected watch close - watch lasted less than a second and no items received
E0607 06:15:56.929528      10 reflector.go:251] github.com/rancher/norman/controller/generic_controller.go:175: Failed to watch *v1.Service: Get https://10.250.1.123:6443/api/v1/watch/services?resourceVersion=13967675&timeout=30m0s&timeoutSeconds=495: failed to find Session for client c-7cbb2:m-z7nq8
E0607 06:15:56.929605      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0607 06:15:56.929685      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
W0607 06:15:56.929781      10 reflector.go:270] github.com/rancher/norman/controller/generic_controller.go:175: watch of *v1.Role ended with: very short watch: github.com/rancher/norman/controller/generic_controller.go:175: Unexpected watch close - watch lasted less than a second and no items received
E0607 06:15:56.929941      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
2019/06/07 06:15:57 [INFO] Handling backend connection request [local:machine-wzbv6]
2019/06/07 06:15:57 [INFO] Handling backend connection request [c-7cbb2]
2019/06/07 06:15:57 [INFO] Handling backend connection request [local:machine-4hpdd]
E0607 06:15:59.735036      10 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
W0607 06:15:59.740022      10 reflector.go:270] github.com/rancher/norman/controller/generic_controller.go:175: watch of *v1.Role ended with: too old resource version: 477283 (478963)

After 10 minutes of waiting, hundreds of errors, and restarts of the rancher containers, the upgrade finishes. The errors in the log correspond to issue https://github.com/rancher/rancher/issues/20265 (which apparently nobody on the rancher team is interested in...)

@Redbierbaron looks like a different issue, as the logs don't show cannot unmarshal number into Go value of type string anymore, and instead you see:

2019/06/07 06:13:58 [ERROR] Failed to connect to peer wss://192.168.2.23/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:13:58 [ERROR] Failed to connect to peer wss://192.168.0.22/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:13:58 [ERROR] Failed to connect to peer wss://192.168.1.28/v3/connect [local ID=192.168.2.30]: websocket: bad handshake
2019/06/07 06:13:58 [INFO] Stopping cluster agent for c-7cbb2

So yeah, if this could just work that would be great. ;)

This morning I edited the cluster via kubectl, changing the values for the OpenStack LB to empty quoted strings instead of the 0 they were. However, I still see the exact same behaviour: the update fails with the same error out of the pod. To top it off, this time my second cluster is missing from the Rancher UI! It still shows with kubectl, so it's there, but the UI doesn't see it at all.

Went back through the entire cluster and reset the values back to what they were and the cluster is back in the UI now.

@DJAyth can you try the steps mentioned by @kinarashah - she specifically tested it for HA use case:

In 2.2.3, those values were 0
I upgraded to 2.2.4, upgrade fails
I edited clusters to make them ""
pods start running and I can access the login page

If you change the value while still on 2.2.3, the 2.2.3 UI will fail to load as it doesn't understand the string value. I've updated my initial comment to reflect this info. Please let us know if that works for you

This happens on OpenStack even when not using the load balancer. Those fields were still populated even while I was not specifying the loadBalancer block at all when creating the cluster.

The relevant subsection of "kubectl get cluster --all-namespaces -o json" looks like:

                       "loadBalancer": {
                            "create-monitor": false,
                            "floating-network-id": "",
                            "lb-method": "",
                            "lb-provider": "",
                            "lb-version": "",
                            "manage-security-groups": false,
                            "monitor-delay": 0,
                            "monitor-max-retries": 0,
                            "monitor-timeout": 0,
                            "subnet-id": "",
                            "use-octavia": false
                        },

I also found the workaround steps somewhat confusing, in that I was trying to find and edit those settings (monitor-delay and monitor-timeout) on the "local" cluster, because that is the one to which I am re-deploying rancher server. And obviously I wasn't finding them, because the "local" cluster is "imported" and won't show you these values. In fact, it was the subordinate Custom cluster that I needed to edit, and its values were what prevented the rancher container from starting in the main "local" cluster. Just FYI for anyone struggling.

To summarize:

  • You are re-deploying Rancher to the "local" cluster. (start the helm upgrade now)
  • Rancher container keeps failing attempts to deploy 2.2.4
  • You launch kubectl session for the "local" cluster.
  • The kubectl command is: kubectl edit cluster <subordinate-cluster-id>
  • Rancher containers succeed and everything is smoothly upgraded to 2.2.4

We had the same issue trying to update from 2.2.2 to 2.2.4; we followed the instructions and finally had rancher running with 2.2.4.

BUT, with every change to a client cluster (editing the cluster.yaml), the cluster is seen as unhealthy by rancher.

This cluster is currently Updating.
[etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy

Enabling DEBUG shows the following error on the rancher application container:

etcd in the client clusters is healthy, with 3 members

That looks like known issues with rke (e.g. https://github.com/rancher/rke/issues/1244), where a missing rke cluster state file lets rke create new certificates without recognizing that there is an existing cluster.

I never thought something like this could happen with rancher, given that the rke state files are stored as a configmap in every client cluster. I verified that the full-cluster-state configmap still contains the original certificates, but the state files under /var/lib/rancher/management-state/rke in the rancher application container contain new certificates generated after the update.

Now it seems that all our client clusters are unusable: either they are already in a faulty state, or they will be with the next change to cluster.yaml.

Anybody with the same experience? Any suggestion on how to solve this? How can we force rancher to use the full-cluster-state configmap? How can we avoid having new certificates created, which makes (production) client clusters unusable?

@galal-hussein has this been seen before? after upgrade to 2.2.4

 I verified that the full-cluster-state configmap still contains the original certificates, but the state files under /var/lib/rancher/management-state/rke in the rancher application container contain new certificates generated after the update.

@galal-hussein, @kinarashah
I opened a separate issue for the above problem: https://github.com/rancher/rancher/issues/20909

Some more analysis showed that the rancher master is constantly creating new certificates for the faulty client cluster, but with these certificates the (unchanged, but working) etcds of the client cluster cannot be reached.

@surco1971 the issue you are seeing, is unrelated to the original issue described in the first comment. Can you file it as a separate bug?

@alena1108, that is what I did - see the comment above - https://github.com/rancher/rancher/issues/20909. And I mentioned it here because the origin of it all is the attempt to update to 2.2.4. So it is related in a way, at least for users who can potentially end up with faulty client clusters.

I am seeing the same issue. My steps:

  1. Try to upgrade to 2.2.4
  2. See that there's the error about unmarshalling a number
  3. edit the clusters in kubectl and change the monitor- stuff from 0 to ""
  4. The error persists
  5. Figure I will try to use helm to apply the "same" upgrade

???

I got nothing. Help?

@tibers ideally, there's no need to do step 5.
Could you confirm step 3 was applied correctly by doing kubectl get cluster --all-namespaces -o json and checking values of monitor-delay & monitor-timeout?

Since we have a mix of openstack clusters and all sorts of other fun stuff in rancher, what I did was:

for h in `kubectl get clusters | grep -v NAME | cut -f 1 -d ' '`
do
kubectl edit cluster $h
done

Which edits EVERYTHING, and then while I was in vi I changed monitor-delay and monitor-timeout to just monitor-delay: "" and monitor-timeout: "". Then trying to get 2.2.4 installed still throws the error. Is there possibly another value which has to be set to ""?

@tibers you can set it to any acceptable value for delay and timeout based on your use case, just needs to be in string format

I went back and changed both those to "1s" and have the same problem:
reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v3.Cluster: json: cannot unmarshal number into Go value of type string

EDIT: I win the "learn to read" prize today: there were multiple loadBalancer: fields in there, so there were multiple instances of each monitor line. Changing all of them fixed the issue.

I saw the problem in client v2.2.0:

cannot unmarshal string into Go struct field LoadBalancerOpenstackOpts.monitor-delay of type int64

But downloading the latest RC of the v2.3 client instead solved it for me.

Rancher upgrade from 2.2.3 to 2.2.9 went well with the workaround (manually editing those cluster values from 0 to "" with kubectl).
However, now that I try to upgrade k8s from v1.13.5-rancher1-2 to v1.14.9-rancher1-1 in one of those clusters, I'm getting:

[ERROR] ClusterController c-xxxxx [certificate-expiration] failed with : json: cannot unmarshal number into Go struct field LoadBalancerOpenstackOpts.monitor-delay of type string

I'm not really sure where rancher could be getting the int-type value, since the values for monitor-delay and monitor-timeout are strings in the cluster configuration.
I tried including both parameters as string in the yaml file when editing the cluster from rancher web interface but I get the same results.
As a result, the cluster is permanently stuck in an Updating state showing CA Certificate or Key is empty

Extra info: our clusters don't use the OpenStack LoadBalancer and they were provisioned with a rancher version older than 2.2.0

Has anyone tried this? @kinarashah do you maybe have an idea?
Thanks in advance for any help!

@pameladelgado could you post your cluster yaml?

@kinarashah sure
cluster.txt
I can also create a separate issue.

For anyone else seeing the same issue as @pameladelgado, we fixed it using: https://github.com/rancher/rancher/issues/20909#issuecomment-521352246

Cert rotate helped to fix this issue

rke cert rotate --config cluster.yml
