Origin: Origin 3.6 internal registry resolution failure

Created on 1 Sep 2017 · 44 Comments · Source: openshift/origin

We installed an Origin 3.6 instance in our development environment and pushed some images into the internal registry successfully. If we try to start a new deployment with these images, the pod fails to pull from the registry because it can't resolve the address docker-registry.default.svc.

Failed to pull image "docker-registry.default.svc:5000/test-shared/haproxy@sha256:424a91dde92e2db9b8b9135bcb06e6b1c53645ee7c0ce274287c570e15f1a4b3": rpc error: code = 2 desc = Get https://docker-registry.default.svc:5000/v2/: dial tcp: lookup docker-registry.default.svc on 10.224.20.20:53: no such host
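
A quick way to see the failure at the node level is to run the lookup with and without the cluster search domain (a sketch; dig is assumed to be available on the node):

dig +showsearch docker-registry.default.svc
dig docker-registry.default.svc.cluster.local @127.0.0.1   # assumes the node's local dnsmasq/SkyDNS listens on 127.0.0.1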

Version
oc v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dev-openshift.test.it:8443
openshift v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
Steps To Reproduce
  1. from an external server, oc login and docker push the image to the internal registry
  2. from the web UI, add a new item to the project and run a new image with one replica
Current Result

The pod creation fails with the error reported above.

Expected Result

The pod pulls the image from the registry.

Additional Information
component/imageregistry component/networking kind/bug lifecycle/stale priority/P1

Most helpful comment

In my case (OCP 3.9) I had to add this to '/etc/dnsmasq.d/node-dnsmasq.conf' and restart dnsmasq.

server=/default.svc/172.30.0.1

All 44 comments

We're seeing this on 3.7.0-alpha.1 as well in our CI cluster.

/cc @smarterclayton

@sdodson for similar DNS issues

I tried configuring the master subnet VLAN 10.228.0.0/14 in the Ansible installation, but the results are the same.

openshift_portal_net=172.30.0.0/16
osm_cluster_network_cidr=10.128.0.0/14

If we add external images from the OpenShift catalog or another internal repository, everything works fine.

Marcello

Facing the same issue here. The pod is not able to pull images from the internal registry.

We also tried configuring an external registry to pull the images, but OpenShift uses the image streams to pull images through the internal registry, so the resolution problem is the same.

@knobunc looks more networking than registry related.

@cello86 does the docker-registry service exist in the default namespace? (oc get svc -n default)

The registry exists and it's reachable externally via the route.

[root@dev-openshift01 ~]# oc get svc -n default
NAME               CLUSTER-IP      EXTERNAL-IP   PORT(S)                   AGE
docker-registry    172.30.85.91    <none>        5000/TCP                  10d
kubernetes         172.30.0.1      <none>        443/TCP,53/UDP,53/TCP     10d
registry-console   172.30.83.228   <none>        9000/TCP                  10d
router             172.30.111.19   <none>        80/TCP,443/TCP,1936/TCP   10d
[root@dev-openshift01 ~]# oc get route -n default
NAME               HOST/PORT                                    PATH      SERVICES           PORT      TERMINATION   WILDCARD
docker-registry    docker-registry-default.devapps.test.it              docker-registry    <all>     passthrough   None
registry-console   registry-console-default.devapps.test.it             registry-console   <all>     passthrough   None

Definitely a networking/dns issue, then, thanks.

@cello86 The dispatcher script should've added cluster.local to the search line in /etc/resolv.conf, can you post the contents of that file?

This is the content of /etc/resolv.conf on one master node:

search test.it 

options timeout:1 rotate 

nameserver dns1
nameserver dns2
nameserver dns3

I have faced the same problem in my v3.6 environment, installed with the openshift-ansible project via RPM.

I found the cause to be the NetworkManager dispatcher script 99-origin-dns.sh.

I fixed the problem and verified it in my environment. The solution is adding one echo line to 99-origin-dns.sh:

  sed -e '/^nameserver.*$/d' /etc/resolv.conf >> ${NEW_RESOLV_CONF}
  echo "search svc.cluster.local cluster.local" >> ${NEW_RESOLV_CONF}
  echo "nameserver "${def_route_ip}"" >> ${NEW_RESOLV_CONF}

I will make a PR in the openshift_ansible project

@linzhaoming we have tested the change, but our /etc/resolv.conf is configured with the internal domain and DNS servers needed to resolve the proxy, and after the change the installation fails because it can't resolve our HTTP proxy. Did you notice this in your environment?

We put the HTTP proxy address into /etc/hosts to temporarily work around the problem, but the internal addresses still aren't resolvable. Could we replace the echo with a sed command like this to keep the internal domain:

sed -e '/^search/ s/$/ svc.cluster.local cluster.local/' /etc/resolv.conf >> ${NEW_RESOLV_CONF}

Also with this configuration the pull fails and the /etc/resolv.conf file is:

search test.it  svc.cluster.local cluster.local

options timeout:1 rotate 

nameserver dns1
nameserver dns2
nameserver dns2
nameserver 192.168.1.141 #internal address of the local server

The issue could be related to the dnsmasq configuration and the DNS servers already present in the /etc/resolv.conf file. In fact the OpenShift node tries to reach an external endpoint (our first DNS).

Marcello

@linzhaoming I think yours is a different root cause; you don't have a search line in your /etc/resolv.conf to start with, correct? I'll work with you on the PR to fix that variant.

@cello86 We recently fixed an issue where the script would fail if nameservers are only defined in /etc/resolv.conf. Can you try running the latest installer and see if the situation improves? See https://github.com/openshift/openshift-ansible/pull/5145

@sdodson I can test the installer, but we have actually fixed the issue with some changes, and the problem seems related to the nameservers present in /etc/resolv.conf. The current configuration is:

# cat /etc/resolv.conf 
search svc.cluster.local cluster.local
nameserver 192.168.1.144
# cat /etc/dnsmasq.d/*
server=/in-addr.arpa/127.0.0.1
server=/cluster.local/127.0.0.1
no-resolv
domain-needed
no-negcache
max-cache-ttl=1
enable-dbus
bind-interfaces
listen-address=192.168.1.144
server=<dns1>
server=<dns2>

@sdodson I tested the latest Ansible script and everything works fine. The /etc/resolv.conf is correct and I can pull images from the internal registry. I added the property reported below because the installation searched for the OpenShift images on the docker.io registry and failed.

openshift_disable_check=docker_image_availability
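
For reference, a minimal sketch of where that variable sits in the openshift-ansible inventory (the rest of the inventory is omitted; the check name is the one from above):

[OSEv3:vars]
openshift_disable_check=docker_image_availability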

Thanks for the followup, I'm going to say this was addressed by https://github.com/openshift/openshift-ansible/pull/5145 then

This happened again:

++ git --work-tree /data/src/github.com/openshift/origin describe --long --tags --abbrev=7 --match 'v[0-9]*' '5c49449^{commit}'
+ OS_GIT_VERSION=v3.9.0-alpha.4-367-g5c49449
...
2018-02-16T04:57:32.868379745Z Pushing image docker-registry.default.svc:5000/extended-test-dancer-repo-test-l6shc-lz94v/dancer-example:latest ...
2018-02-16T04:57:33.777538638Z Registry server Address: 
2018-02-16T04:57:33.777569802Z Registry server User Name: serviceaccount
2018-02-16T04:57:33.777574483Z Registry server Email: [email protected]
2018-02-16T04:57:33.777578126Z Registry server Password: <<non-empty>>
2018-02-16T04:57:33.797309512Z error: build error: Failed to push image: Get https://docker-registry.default.svc:5000/v1/_ping: dial tcp: lookup docker-registry.default.svc on 127.0.0.1:53: no such host

https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_image_ecosystem/388/

/cc @danwinship @dcbw

I was able to pull a new image until yesterday. Today I needed to fix a bug and when I tried to make a new build, It didn't work.

I am hitting this only on some of the nodes after upgrading from 3.6 to 3.7.42-1. I potentially have the same issue on one 3.6 node too, as its resolv.conf looks different from the others.

In my case it was fixed by just restarting NetworkManager. I rebooted the node afterwards to check, and it was still OK. Apparently some race condition is still going on.

I hit this exact issue in Origin 3.9 on CentOS 7.5 and added a lot of debug printing (to a file) in the dispatcher.d/99-origin-dns.sh script.

I started having the issue when I switched my NetworkManager config from dhcp-provided IPs to static configuration.

What I found out is the following:

  • NetworkManager writes the /etc/resolv.conf with the static resolver/search/etc
  • the dispatcher script is called with $2 = up ; the script sets up /etc/resolv.conf correctly
  • the dispatcher script is called with $2 = connectivity-change ; does nothing ; /etc/resolv.conf stays identical (working)
  • NetworkManager overwrites /etc/resolv.conf with the static resolver/search/etc
  • the dispatcher script is not called; this leaves OpenShift unable to pull images from the private registry, and new pods unable to resolve Kubernetes services

I found a workaround: adding dns=none to the [main] section of the NetworkManager config makes things work in my case, since the dispatcher script is still called and thus sets up /etc/dnsmasq.d/origin-upstream-dns.conf correctly with the NetworkManager-provided resolvers, and /etc/resolv.conf with the local dnsmasq IP.
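
A minimal sketch of that workaround, assuming the stock config file location:

# /etc/NetworkManager/NetworkManager.conf
[main]
dns=none

systemctl restart NetworkManager   # so the dispatcher script runs again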

In my case (OCP 3.9) I had to add this to '/etc/dnsmasq.d/node-dnsmasq.conf' and restart dnsmasq.

server=/default.svc/172.30.0.1
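
Applied as commands, that workaround is roughly (172.30.0.1 being the kubernetes service ClusterIP shown earlier in this thread):

echo 'server=/default.svc/172.30.0.1' >> /etc/dnsmasq.d/node-dnsmasq.conf
systemctl restart dnsmasq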

Check the DNS in the NetworkManager CLI using this command: nmcli con show eth0

If the DNS is wrong, change it with this command: nmcli con modify eth0 ipv4.dns "DNS IP"
then restart the NetworkManager and dnsmasq services.
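
Put together, that check-and-fix looks roughly like this (eth0 and the DNS IP are placeholders for your own connection name and nameserver):

nmcli con show eth0 | grep ipv4.dns
nmcli con modify eth0 ipv4.dns "10.1.208.34"
systemctl restart NetworkManager dnsmasq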

I have created a 3.10 cluster using Ansible, this is what I currently have:

[root@master-node]# oc get all
NAME                           READY     STATUS              RESTARTS   AGE
pod/docker-registry-1-7lbtl    1/1       Running             0          25m
pod/docker-registry-2-deploy   0/1       Error               0          25m
pod/registry-console-1-9lxmp   1/1       Running             4          6d
pod/registry-console-1-txj95   0/1       MatchNodeSelector   0          7d
pod/router-1-5p8gz             1/1       Running             0          26m

NAME                                       DESIRED   CURRENT   READY     AGE
replicationcontroller/docker-registry-1    1         1         1         25m
replicationcontroller/docker-registry-2    0         0         0         25m
replicationcontroller/registry-console-1   1         1         1         7d
replicationcontroller/router-1             1         1         1         26m

NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                   AGE
service/docker-registry    ClusterIP   172.30.198.206   <none>        5000/TCP                  25m
service/kubernetes         ClusterIP   172.30.0.1       <none>        443/TCP,53/UDP,53/TCP     7d
service/registry-console   ClusterIP   172.30.125.193   <none>        9000/TCP                  7d
service/router             ClusterIP   172.30.31.91     <none>        80/TCP,443/TCP,1936/TCP   26m

NAME                                                  REVISION   DESIRED   CURRENT   TRIGGERED BY
deploymentconfig.apps.openshift.io/docker-registry    2          1         0         config
deploymentconfig.apps.openshift.io/registry-console   1          1         1         config
deploymentconfig.apps.openshift.io/router             1          1         1         config

NAME                                        HOST/PORT                                                   PATH      SERVICES           PORT      TERMINATION   WILDCARD
route.route.openshift.io/docker-registry    docker-registry-default.router.default.svc.cluster.local              docker-registry    <all>     passthrough   None
route.route.openshift.io/registry-console   registry-console-default.router.default.svc.cluster.local             registry-console   <all>     passthrough   None

When I try building an app I get this:

error: build error: Failed to push image: Get https://docker-registry.default.svc:5000/v1/_ping: dial tcp: lookup docker-registry.default.svc on 10.1.208.34:53: no such host

Which is using one of our external DNS servers to do the lookup.

I do see that in 3.10 the dnsmasq config looks correct though.

[root@master-node]# more /etc/dnsmasq.d/*
::::::::::::::
/etc/dnsmasq.d/node-dnsmasq.conf
::::::::::::::
server=/default.svc/172.30.0.1
::::::::::::::
/etc/dnsmasq.d/origin-dns.conf
::::::::::::::
no-resolv
domain-needed
no-negcache
max-cache-ttl=1
enable-dbus
dns-forward-max=10000
cache-size=10000
bind-dynamic
min-port=1024
except-interface=lo
# End of config
::::::::::::::
/etc/dnsmasq.d/origin-upstream-dns.conf
::::::::::::::
server=10.1.208.34

Any ideas? I followed the setup steps and deleted the original registry and router, but it's possible I missed a step.

@markandrewj What's in your /etc/resolv.conf on the host? It should have added cluster.local to the search path. What does dig +showsearch docker-registry.default.svc show you?

@sdodson That entry wasn't in the resolv.conf. I tried adding it to the resolv.conf and the request was still being routed incorrectly afterwards. Then I tried restarting the network service after, and the route was still incorrect. So as a last ditch effort I tried restarting the host. When it came back up /etc/NetworkManager/dispatcher.d/99-origin-dns.sh added the cluster.local entry to the search, and the nameserver IP, however the script also stripped out our DNS name server IP entries.

I added them back afterwards, this may resolve our issue, but I put the cluster in a weird state trying different things to fix the issue. What is the recommended way to add the entries so they are not overwritten on reboot?

I noticed that 3.11 was released while I was investigating this issue, and since this is a POC, I decided I would kickstart some fresh nodes and do a clean install of 3.11. I am just syncing the new packages to our Satellite server at the moment. I can update you once I have things going.

I encountered a similar problem with 3.11. In testing:

dig +showsearch docker-registry.default.svc resolved to 92.242.140.21
dig +showsearch docker-registry.default.svc.cluster.local resolved to 172.30.89.2

As a less than ideal workaround, adding '172.30.89.2 docker-registry.default.svc' to /etc/hosts and restarting the cluster (including docker) worked for me.

I encountered a similar problem with 3.11. In testing:

dig +showsearch docker-registry.default.svc resolved to 92.242.140.21
dig +showsearch docker-registry.default.svc.cluster.local resolved to 172.30.89.2

As a less than ideal workaround, adding '172.30.89.2 docker-registry.default.svc' to /etc/hosts and restarting the cluster (including docker) worked for me.

Worked for me

@dstanley @co-de do I need to change '172.30.89.2' or '92.242.140.21' to my own IP? If so, how do I find the two IPs? Thanks.

@dstanley @co-de do I need to change '172.30.89.2' or '92.242.140.21' to my own IP? If so, how do I find the two IPs? Thanks.

Yes, use your own cluster's IPs. Run "dig +showsearch docker-registry.default.svc" and "dig +showsearch docker-registry.default.svc.cluster.local" to find them.
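
For example (a sketch; the service IP returned is specific to your cluster):

dig +showsearch docker-registry.default.svc.cluster.local
echo '<service-ip> docker-registry.default.svc' >> /etc/hosts   # substitute the A record returned above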

Hey, sorry for the late reply.

We were able to get things working on 3.11 using this inventory:

[OSEv3:children]
masters
nodes
etcd

# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=root

openshift_deployment_type=openshift-enterprise
oreg_auth_user="{{ os_registry_user }}"
oreg_auth_password="{{ os_registry_pass }}"

openshift_master_default_subdomain=apps.openshift.subdomain
openshift_hosted_registry_routehost=registry.apps.openshift.subdomain

# uncomment the following to enable htpasswd authentication; defaults to DenyAllPasswordIdentityProvider
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]

# host group for masters
[masters]
node01.subdomain openshift_public_hostname=master.openshift.subdomain

# host group for etcd
[etcd]
node01.subdomain openshift_public_hostname=master.openshift.subdomain

# host group for nodes, includes region info
[nodes]
node01.subdomain openshift_public_hostname=node01.openshift.subdomain openshift_node_group_name='node-config-master-infra'
node02.subdomain openshift_public_hostname=node02.openshift.subdomain openshift_node_group_name='node-config-compute'

We had a couple issues which were causing different problems. There are still a few things we are working out as well. Thanks for the help.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

I'm marking this as closed, but please feel free to reopen this if there is still an active issue.

/close

@markandrewj: Closing this issue.

In response to this:

I'm marking this as closed, but please feel free to reopen this if there is still an active issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I used Ansible to set up OpenShift 4.1.
But I cannot push container images with podman.

Get an error like this

Get https://image-registry.openshift-image-registry.svc:5000/v2/: dial tcp: lookup image-registry.openshift-image-registry.svc on 127.0.0.1:53: no such host

What host should I use for the registry?

What host should I use for the registry?

assuming you're trying to push from a machine that's not part of your cluster, you should create a route for the registry and push to the route hostname.

https://docs.openshift.com/container-platform/4.1/registry/securing-exposing-registry.html#registry-exposing-secure-registry-manually_securing-exposing-registry
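
For 4.x the linked docs boil down to roughly this (a sketch; the resource and route names are the 4.x defaults as far as I know):

oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge -p '{"spec":{"defaultRoute":true}}'
oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}'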

Thanks @bparees !
Using the link you provided, I was able to find the HOST and use podman login to log in to the registry.

But now I cannot push the image successfully.
The error output looks like this:

podman push 5f6c26a1851a default-route-openshift-image-registry.apps.ocp4.example.com

Getting image source signatures
Error: Error copying image to the remote destination: Error trying to reuse blob sha256:0b8293539998d18431d3f74d3e9f89cc8ff7d057a234fae914e0213c8a0c63e3 at destination: Error checking whether a blob sha256:0b8293539998d18431d3f74d3e9f89cc8ff7d057a234fae914e0213c8a0c63e3 exists in docker.io/library/default-route-openshift-image-registry.apps.ocp4.a10networks.com: errors:
denied: requested access to the resource is denied
error parsing HTTP 401 response body: unexpected end of JSON input: ""


======================================

podman push --creds=kubeadmin:password 5f6c26a1851a default-route-openshift-image-registry.apps.ocp4.example.com
Getting image source signatures

Error: Error copying image to the remote destination: Error trying to reuse blob sha256:0b8293539998d18431d3f74d3e9f89cc8ff7d057a234fae914e0213c8a0c63e3 at destination: unable to retrieve auth token: invalid username/password

The second command uses kubeadmin:password taken from podman login -u kubeadmin -p password

Apparently the authentication does not work. What is the right/updated approach for pushing images from a remote machine?

Best

see: https://docs.openshift.com/container-platform/3.11/install_config/registry/accessing_registry.html#access-logging-in-to-the-registry

you can also use "oc registry login" to update your docker config.json w/ auth creds for pushing to the registry.

but the key is that you can't be using the default cluster-admin account as it has no token. You need to create a regular user or use a service account.
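
A sketch of the service-account route, with made-up names (pusher, myproject) and the registry route hostname from earlier in the thread:

oc create serviceaccount pusher -n myproject
oc policy add-role-to-user registry-editor -z pusher -n myproject
podman login -u pusher -p "$(oc sa get-token pusher -n myproject)" default-route-openshift-image-registry.apps.ocp4.example.com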

(though those are 3.11 docs, they are valid for 4.x. @adambkaplan @dmage @bmcelvee we need to get that content ported into 4.x if it's not there)

I'm also not sure what you're pushing to, since you only provided the registry hostname, no repository (namespace name) or image name.

(looks like the 4.x docs are here: https://docs.openshift.com/container-platform/4.1/registry/accessing-the-registry.html#registry-accessing-directly_accessing-the-registry but they are a bit mixed up since they talk about accessing the registry from within the cluster and using the registry service hostname instead of exposing a route and using the route hostname).

@bparees

but the key is that you can't be using the default cluster-admin account as it has no token. You need to create a regular user or use a service account.

This information is critical. But how do I create a user for just a lab setup environment? Is setting up an LDAP server mandatory? I tried the htpasswd approach but it shows an internal error on the command line.

Username: admin
Password:
Error from server (InternalError): Internal error occurred: unexpected response: 500

I'm also not sure what you're pushing to, since you only provided the registry hostname, no repository (namespace name) or image name.

Assume HOST is the exposed default registry route.
Do I have to create a repo name first to push my container image?

For example, if I tag my image like this, $HOST/default_reponame/app, will it just work?

Htpasswd is fine. Creating a service account and using its token is even easier.

The image repository name maps to an OpenShift project/namespace name. You have to have permission to create imagestreams in the namespace to be able to push an image to it.

Ben Parees | OpenShift
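
Concretely, a sketch of a push that lines up with that; the project name (myproject), image name (app), and tag are examples, and $HOST is the exposed route hostname from above:

oc new-project myproject
podman tag 5f6c26a1851a $HOST/myproject/app:latest
podman push $HOST/myproject/app:latest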


Htpasswd is fine. Creating a service account and using its token is even easier.

Thanks for the tips. I used a service account and successfully pushed my local image to an imagestream inside OpenShift. But then the process got stuck again: I created a pod YAML and tried to run the service, but it failed to pull the image. The YAML file looks like this. Am I missing anything here?

apiVersion: v1
kind: Pod
metadata:
  name: hello-test
  labels:
    app: hello-test
  namespace: default
spec:
  containers:
    - name: hello-test
      image: default/500p1b14
      ports:
        - containerPort: 22

@Piping refer to the following to reference an ImageStream tag directly in a pod or other k8s resource:

https://docs.okd.io/3.11/dev_guide/managing_images.html#using-is-with-k8s
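
If referencing the ImageStream tag directly doesn't suit, an alternative sketch is to point the pod at the full internal-registry pull spec; the registry service hostname is the one from the earlier 4.1 error in this thread, while the project (default), imagestream name (500p1b14), and tag (latest) are assumptions based on the YAML above:

spec:
  containers:
    - name: hello-test
      image: image-registry.openshift-image-registry.svc:5000/default/500p1b14:latest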
