Kubespray: Deploy fails into environment with proxy

Created on 10 Apr 2020 · 21 comments · Source: kubernetes-sigs/kubespray

Environment:

  • Cloud provider or hardware configuration:
    External

Kubespray version (commit) (git rev-parse --short HEAD):
8f3d8206

Network plugin used:
Calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

Command used to invoke ansible:
ansible-playbook -i inventory/demo2node/hosts.yml -vvvv -b -u ubuntu --private-key=~/.ssh/k8s_rsa cluster.yml --flush-cache

Output of ansible run:
TASK [kubernetes-apps/ansible : Kubernetes Apps | Wait for kube-apiserver] ********************
task path: /home/jtaylor/work/ansible/kubespray/roles/kubernetes-apps/ansible/tasks/main.yml:2 fatal: [node1]: FAILED! => { "attempts": 20, "changed": false, "content": "", "elapsed": 0, "invocation": { "module_args": { "attributes": null, "backup": null, "body": null, "body_format": "raw", "client_cert": "/etc/kubernetes/ssl/ca.crt", "client_key": "/etc/kubernetes/ssl/ca.key", "content": null, "creates": null, "delimiter": null, "dest": null, "directory_mode": null, "follow": false, "follow_redirects": "safe", "force": false, "force_basic_auth": false, "group": null, "headers": {}, "http_agent": "ansible-httpget", "method": "GET", "mode": null, "owner": null, "regexp": null, "remote_src": null, "removes": null, "return_content": false, "selevel": null, "serole": null, "setype": null, "seuser": null, "src": null, "status_code": [ 200 ], "timeout": 30, "unix_socket": null, "unsafe_writes": null, "url": "https://127.0.0.1:6443/healthz", "url_password": null, "url_username": null, "use_proxy": true, "validate_certs": false } }, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error Tunnel connection failed: 403 Access violation>", "redirected": false, "status": -1, "url": "https://127.0.0.1:6443/healthz" }

Anything else we need to know:
healthz is fine via curl on the node:
curl --cacert /etc/kubernetes/ssl/ca.crt https://127.0.0.1:6443/healthz ok

From the command being run, we can see that the proxy is set but no_proxy is not:
<10.74.23.96> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=30m -o ConnectionAttempts=100 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o 'IdentityFile="/home/jtaylor/.ssh/k8s_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="ubuntu"' -o ConnectTimeout=10 -o ControlPath=/home/jtaylor/.ansible/cp/83ecd4c0ff -tt 10.74.23.96 '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-jobamzrcjodcuxkifinyfxnbjspmyhea ; http_proxy=http://16.100.210.81:8888 HTTP_PROXY=http://16.100.210.81:8888 https_proxy=http://16.100.210.81:8888 HTTPS_PROXY=http://16.100.210.81:8888 no_proxy='"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"''"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"' NO_PROXY='"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"''"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"' /usr/bin/python /home/ubuntu/.ansible/tmp/ansible-tmp-1586521304.0805278-210095447508559/AnsiballZ_uri.py'"'"'"'"'"'"'"'"' && sleep 0'"'"''

kinbug

All 21 comments

Explicitly setting no_proxy in inventory/demo2node/group_vars/all/all.yml will override the bad value.

## Refer to roles/kubespray-defaults/defaults/main.yml before modifying no_proxy
no_proxy: "127.0.0.1,localhost"

Looks like roles/kubespray-defaults/defaults/main.yaml would normally set no_proxy to a reasonable value for deployment when http_proxy/https_proxy are set.

It's not because of the proxy. I've tried curl -I -k --cert /etc/kubernetes/ssl/ca.crt --key /etc/kubernetes/ssl/ca.key https://127.0.0.1:6443/healthz on the same master node and got the same result (403), but if I run curl -I -k --cert /etc/kubernetes/ssl/apiserver-kubelet-client.crt --key /etc/kubernetes/ssl/apiserver-kubelet-client.key https://localhost:6443/healthz it returns 200.

I'm having the same issue with the Calico | Wait for etcd task too.

Duplicate of #5891.

I'm hitting exactly the same issue as @jasonltaylor, no_proxy variable is empty, playbook breaks on kubernetes-apps/ansible : Kubernetes Apps | Wait for kube-apiserver for me.

The problem disappears when no_proxy is set manually beforehand as described in https://github.com/kubernetes-sigs/kubespray/issues/5935#issuecomment-614287455; however, the autogenerated sane default configuration and the entries from additional_no_proxy are then missing.

Reverted #5896 locally and the problem seems to be gone.

this is my no_proxy

no_proxy: >- 
  10.0.0.0/8,
  127.0.0.0/8,
  172.16.0.0/12,
  192.168.0.0/16,
  localhost

and I'm still having issues

Maybe @alexkross has some input on this?

Happy to see a PR if a patch is required.

@Miouge1 Was just getting ready to revert #5896 and retest. Is that worth trying, or would I have to just reset to the commit prior instead? Just retried latest mainline and I'm seeing the same issue.

As @przemeklal mentioned, reverting #5896 is a workaround.

What is the minimum vars to set to reproduce this problem?

On latest master vagrant up with all defaults works just fine, so I suppose one needs to set some vars to see this problem?


An array of exclusions for proxy is defined here: https://github.com/kubernetes-sigs/kubespray/blob/910a821d0bd5c29dd227a38a91e82546ca70116b/roles/kubespray-defaults/defaults/main.yaml#L419-L438

Prettified and shortened for readability's sake, this Jinja2 template looks like:

  if http[s]_proxy is defined
      if loadbalancer_apiserver is defined
          apiserver_loadbalancer_domain_name, loadbalancer_apiserver.address,
      endif
      for item in (groups['k8s-cluster'] + groups['etcd'] + groups['calico-rr'])
          hostvars[item]['access_ip'] | default(hostvars[item]['ip'] | default(fallback_ips[item])),
          if item != hostvars[item].get('ansible_hostname', '')
              hostvars[item]['ansible_hostname'],
              hostvars[item]['ansible_hostname'].dns_domain,
          endif
          item,item.dns_domain,
      endfor
      ...
      127.0.0.1,localhost,kube_service_addresses,kube_pods_subnet
  endif

For some reason the "for" loop was executed over the list of hosts, but none of the variables were expanded into values, so the result is an empty no_proxy buried in layers of nested shell quoting: no_proxy='"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"''"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'

I hope that #5957 has fixed the issue.

this is my no_proxy

no_proxy: >- 
  10.0.0.0/8,
  127.0.0.0/8,
  172.16.0.0/12,
  192.168.0.0/16,
  localhost

and I'm still having issues

no_proxy entries in CIDR notation are known to be ignored by Python modules: https://github.com/ansible/ansible/issues/52705
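To illustrate why the CIDR entries above do nothing, here is a minimal sketch using Python's standard library (which is what Ansible's uri module ultimately relies on for proxy decisions). urllib compares each no_proxy entry as a plain hostname/suffix string, so a CIDR like 10.0.0.0/8 can never match a literal IP; the dict values below are made-up examples:

```python
# Sketch: Python's urllib treats no_proxy entries as exact-host or
# domain-suffix matches, not as CIDR ranges.
from urllib.request import proxy_bypass_environment

# Simulated proxy settings; the 'no' key corresponds to the no_proxy
# environment variable (example values, not from the issue).
cidr_env = {"no": "10.0.0.0/8,localhost"}
literal_env = {"no": "10.1.2.3,localhost"}

# "10.0.0.0/8" is compared as a literal string, so it never matches
# the IP 10.1.2.3 and the proxy IS still used:
print(proxy_bypass_environment("10.1.2.3", cidr_env))     # False

# A literal IP entry matches exactly, so the proxy is bypassed:
print(proxy_bypass_environment("10.1.2.3", literal_env))  # True

# Plain hostnames work as expected in both cases:
print(proxy_bypass_environment("localhost", cidr_env))    # True
```

This is why the generated kubespray no_proxy lists individual node IPs and hostnames in addition to subnets.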

In the VM that I am provisioning with kubespray, the Ansible python process has no_proxy='' passed explicitly to its environment, which explains why the TASK [kubernetes-apps/ansible : Kubernetes Apps | Wait for kube-apiserver] fails with the proxy error (the kube apiserver URL is queried through the proxy while it should be directly queried).

TASK [kubernetes-apps/ansible : Kubernetes Apps | Wait for kube-apiserver] ****************************************************************************************************************************************
fatal: [kapitan]: FAILED! => {"attempts": 20, "changed": false, "content": "", "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error Tunnel connection failed: 503 Service Unavailable>", "redirected": false, "status": -1, "url": "https://127.0.0.1:6443/healthz"}

Here is the Ansible python process cmdline, captured on the target while Ansible is running:

[root@kapitan ~]# ps aux | grep python | grep no_proxy
root      3978  0.0  0.0 113184  1216 ?        Ss   15:54   0:00 /bin/sh -c no_proxy='' https_proxy=http://10.0.3.195:3128 NO_PROXY='' http_proxy=http://10.0.3.195:3128 HTTPS_PROXY=http://10.0.3.195:3128 HTTP_PROXY=http://10.0.3.195:3128 /usr/bin/python && sleep 0

It shows clearly that no_proxy=''.

As a workaround, I altered cluster.yml to ensure that { role: kubespray-defaults } runs before the proxy_env fact is set, like this:

diff --git a/cluster.yml b/cluster.yml
index ca828206..ace8e0f3 100644
--- a/cluster.yml
+++ b/cluster.yml
@@ -4,6 +4,8 @@

 - hosts: all
   gather_facts: false
+  roles:
+    - { role: kubespray-defaults }
   tasks:
     - name: "Set up proxy environment"
       set_fact:
diff --git a/scale.yml b/scale.yml
index 65fecae0..5c310bfb 100644
--- a/scale.yml
+++ b/scale.yml
@@ -4,6 +4,8 @@

 - hosts: all
   gather_facts: false
+  roles:
+    - { role: kubespray-defaults }
   tasks:
     - name: "Set up proxy environment"
       set_fact:
diff --git a/upgrade-cluster.yml b/upgrade-cluster.yml
index 70c3943f..39af72e8 100644
--- a/upgrade-cluster.yml
+++ b/upgrade-cluster.yml
@@ -4,6 +4,8 @@

 - hosts: all
   gather_facts: false
+  roles:
+    - { role: kubespray-defaults }
   tasks:
     - name: "Set up proxy environment"
       set_fact:

From my limited understanding, this ensures that the no_proxy value is calculated by roles/kubespray-defaults/tasks/no_proxy.yml before the value is copied into the proxy_env fact.

After applying the above hack, the target successfully provisions.
Here is the Ansible python process cmdline post hack, captured on the target while Ansible is running:

[root@kapitan ~]# ps aux | grep -y python | grep no_proxy
root     16437  0.0  0.0 113184  1216 ?        Ss   16:46   0:00 /bin/sh -c no_proxy=192.168.220.100,kapitan,kapitan.cluster.local,127.0.0.1,localhost,10.233.0.0/18,10.233.64.0/18 https_proxy=http://10.0.3.195:3128 NO_PROXY=192.168.220.100,kapitan,kapitan.cluster.local,127.0.0.1,localhost,10.233.0.0/18,10.233.64.0/18 http_proxy=http://10.0.3.195:3128 HTTPS_PROXY=http://10.0.3.195:3128 HTTP_PROXY=http://10.0.3.195:3128 /usr/bin/python && sleep 0

Notice that the no_proxy environment variable is correct this time.

OK, indeed: if I understand correctly, you don't get no_proxy from your env variables but are using the one generated by no_proxy.yml.
That would explain it.

@Miouge1 I think the minimum requirement is to just set the http_proxy and https_proxy. From the documentation, it looks like in general you are supposed to avoid setting no_proxy (which gets generated for you) and instead set additional_no_proxy.
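To make that concrete, here is a sketch of the minimal inventory configuration along those lines (the proxy address and hostnames are placeholders, not values from this issue):

```yaml
# inventory/<cluster>/group_vars/all/all.yml (sketch; addresses are placeholders)
http_proxy: "http://proxy.example.com:8888"
https_proxy: "http://proxy.example.com:8888"

# Leave no_proxy unset so kubespray-defaults generates it (node IPs and
# hostnames, 127.0.0.1, localhost, service/pod subnets, ...), and put any
# extra exclusions in additional_no_proxy instead:
additional_no_proxy: "registry.internal,.example.com"
```

Setting no_proxy directly works as a workaround but, as noted above, discards the generated defaults.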

@jasonltaylor I'm with you on this. A topology with nodes separated by an HTTP[S] proxy is weird.

FYI I managed to reproduce that in CI with PR #6039 (commit d234ee0)

It fails at the task TASK [kubernetes-apps/ansible : Kubernetes Apps | Wait for kube-apiserver] *****
With

 fatal: [instance-1]: FAILED! => {"attempts": 20, "changed": false, "content": "", "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error Tunnel connection failed: 500 Unable to connect>", "redirected": false, "status": -1, "url": "https://127.0.0.1:6443/healthz"}

See full Ansible logs here

Proxy logs show:

CONNECT   Apr 28 18:43:46 [15]: Request (file descriptor 9): CONNECT 127.0.0.1:6443 HTTP/1.0
INFO      Apr 28 18:43:46 [15]: No upstream proxy for 127.0.0.1
INFO      Apr 28 18:43:46 [15]: opensock: opening connection to 127.0.0.1:6443
INFO      Apr 28 18:43:46 [15]: opensock: getaddrinfo returned for 127.0.0.1:6443
ERROR     Apr 28 18:43:46 [15]: opensock: Could not establish a connection to 127.0.0.1

So exactly the same situation as @jperville

Is there an agreement on the best way to resolve this?

@jasonltaylor did you try to include 16.100.210.81 in the no_proxy values?

On the other hand, the place where I set the no_proxy value is the inventory/group_vars/all.yml file, and I don't use CIDR notation.

@electrocucaracha I did not put the proxy IP itself in no_proxy. I simply set no_proxy similar to how it would normally be set by no_proxy.yml (localhost, etc.) to see whether the no_proxy variable would get set during the deployment if made explicit.

Thanks everyone for the info. Further investigation showed that proxy_env.no_proxy defaults to an empty string; later on the no_proxy fact gets generated, but proxy_env.no_proxy is not updated, which means it was always empty.
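In other words (a simplified sketch of the ordering problem, not the literal kubespray tasks):

```yaml
# Sketch: proxy_env is captured while no_proxy still has its empty default.
- name: "Set up proxy environment"
  set_fact:
    proxy_env:
      http_proxy: "{{ http_proxy | default('') }}"
      https_proxy: "{{ https_proxy | default('') }}"
      no_proxy: "{{ no_proxy | default('') }}"   # captured as '' here

# Later, kubespray-defaults generates the real no_proxy fact, but the
# proxy_env dict above already holds the empty value and is never rebuilt.
```

Running kubespray-defaults before the set_fact (as in the diff above) or regenerating proxy_env after no_proxy is computed both avoid the empty value.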

This is fixed by PR #6039
