Origin: Origin 3.6 internal registry resolution failure

Created on 1 Sep 2017 · 44 Comments · Source: openshift/origin

We installed an Origin 3.6 instance in our development environment and pushed some images into the internal registry successfully. If we try to start a new deployment with these images, the pod fails to pull from the registry because it can't resolve the address docker-registry.default.svc.

Failed to pull image "docker-registry.default.svc:5000/test-shared/haproxy@sha256:424a91dde92e2db9b8b9135bcb06e6b1c53645ee7c0ce274287c570e15f1a4b3": rpc error: code = 2 desc = Get https://docker-registry.default.svc:5000/v2/: dial tcp: lookup docker-registry.default.svc on 10.224.20.20:53: no such host
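
A quick way to see the failure at the node level is to run the lookup with and without the cluster search domain (a sketch; dig is assumed to be available on the node):

dig +showsearch docker-registry.default.svc
dig docker-registry.default.svc.cluster.local @127.0.0.1   # assumes the node's local dnsmasq/SkyDNS listens on 127.0.0.1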

Version
oc v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dev-openshift.test.it:8443
openshift v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
Steps To Reproduce
  1. from an external server, oc login and docker push the image to the internal registry
  2. from the web UI, add a new item to the project and run a new image with one replica
Current Result

The pod creation fails with the error reported above.

Expected Result

The pod pulls the image from the registry.

Additional Information
component/imageregistry component/networking kind/bug lifecycle/stale priority/P1

Most helpful comment

In my case (OCP 3.9) I had to add this to '/etc/dnsmasq.d/node-dnsmasq.conf' and restart dnsmasq.

server=/default.svc/172.30.0.1

All 44 comments

We're seeing this on 3.7.0-alpha.1 as well in our CI cluster.

/cc @smarterclayton

@sdodson for similar DNS issues

I tried configuring the master subnet VLAN 10.228.0.0/14 in the Ansible installation, but the results are the same.

openshift_portal_net=172.30.0.0/16
osm_cluster_network_cidr=10.128.0.0/14

If we add external images from the OpenShift catalog or another internal repository, everything works fine.

Marcello

Facing the same issue here. The pod is not able to pull images from the internal registry.

We also tried configuring an external registry to pull the images, but OpenShift uses the image streams to pull images through the internal registry, so the resolution problem is the same.

@knobunc looks more networking than registry related.

@cello86 does the docker-registry service exist in the default namespace? (oc get svc -n default)

The registry exists and it's reachable externally via the route.

[root@dev-openshift01 ~]# oc get svc -n default
NAME               CLUSTER-IP      EXTERNAL-IP   PORT(S)                   AGE
docker-registry    172.30.85.91    <none>        5000/TCP                  10d
kubernetes         172.30.0.1      <none>        443/TCP,53/UDP,53/TCP     10d
registry-console   172.30.83.228   <none>        9000/TCP                  10d
router             172.30.111.19   <none>        80/TCP,443/TCP,1936/TCP   10d
[root@dev-openshift01 ~]# oc get route -n default
NAME               HOST/PORT                                    PATH      SERVICES           PORT      TERMINATION   WILDCARD
docker-registry    docker-registry-default.devapps.test.it              docker-registry    <all>     passthrough   None
registry-console   registry-console-default.devapps.test.it             registry-console   <all>     passthrough   None

Definitely a networking/dns issue, then, thanks.

@cello86 The dispatcher script should've added cluster.local to the search line in /etc/resolv.conf, can you post the contents of that file?

This is the content of /etc/resolv.conf on one master node:

search test.it 

options timeout:1 rotate 

nameserver dns1
nameserver dns2
nameserver dns3

I have faced the same problem in my v3.6 environment, installed with the openshift-ansible project via RPM.

I found the cause to be the NetworkManager dispatcher script 99-origin-dns.sh.

I fixed the problem and verified it in my environment. The solution is adding one echo line to 99-origin-dns.sh:

  sed -e '/^nameserver.*$/d' /etc/resolv.conf >> ${NEW_RESOLV_CONF}
  echo "search svc.cluster.local cluster.local" >> ${NEW_RESOLV_CONF}
  echo "nameserver "${def_route_ip}"" >> ${NEW_RESOLV_CONF}

I will make a PR in the openshift_ansible project

@linzhaoming we have tested the change, but our /etc/resolv.conf is configured with the internal domain and DNS servers needed to resolve the proxy, and after the change the installation fails because it can't resolve our HTTP proxy. Did you notice this in your environment?

We put the HTTP proxy address into /etc/hosts to temporarily work around the problem, but the internal addresses still aren't resolvable. Could we replace the echo with a sed command like this to keep the internal domain:

sed -e '/^search/ s/$/ svc.cluster.local cluster.local/' /etc/resolv.conf >> ${NEW_RESOLV_CONF}

Also with this configuration the pull fails and the /etc/resolv.conf file is:

search test.it  svc.cluster.local cluster.local

options timeout:1 rotate 

nameserver dns1
nameserver dns2
nameserver dns2
nameserver 192.168.1.141 #internal address of the local server

The issue could be related to the dnsmasq configuration and the DNS servers already present in the /etc/resolv.conf file. In fact the OpenShift node tries to reach an external endpoint (our first DNS).

Marcello

@linzhaoming I think yours is a different root cause; you don't have a search line in your /etc/resolv.conf to start with, correct? I'll work with you on the PR to fix that variant.

@cello86 We recently fixed an issue where the script would fail if nameservers are only defined in /etc/resolv.conf. Can you try running the latest installer and see if the situation improves? See https://github.com/openshift/openshift-ansible/pull/5145

@sdodson I can test the installer, but we have actually fixed the issue with some changes, and the problem seems related to the nameservers present in /etc/resolv.conf. The current configuration is:

# cat /etc/resolv.conf 
search svc.cluster.local cluster.local
nameserver 192.168.1.144
# cat /etc/dnsmasq.d/*
server=/in-addr.arpa/127.0.0.1
server=/cluster.local/127.0.0.1
no-resolv
domain-needed
no-negcache
max-cache-ttl=1
enable-dbus
bind-interfaces
listen-address=192.168.1.144
server=<dns1>
server=<dns2>

@sdodson I tested the latest Ansible script and everything works fine. The /etc/resolv.conf is correct and I can pull images from the internal registry. I added the property reported below because the installation searched for the OpenShift images on the docker.io registry and failed.

openshift_disable_check=docker_image_availability
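
For reference, a minimal sketch of where that variable sits in the openshift-ansible inventory (the rest of the inventory is omitted; the check name is the one from above):

[OSEv3:vars]
openshift_disable_check=docker_image_availability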

Thanks for the followup, I'm going to say this was addressed by https://github.com/openshift/openshift-ansible/pull/5145 then

This happened again:

++ git --work-tree /data/src/github.com/openshift/origin describe --long --tags --abbrev=7 --match 'v[0-9]*' '5c49449^{commit}'
+ OS_GIT_VERSION=v3.9.0-alpha.4-367-g5c49449
...
2018-02-16T04:57:32.868379745Z Pushing image docker-registry.default.svc:5000/extended-test-dancer-repo-test-l6shc-lz94v/dancer-example:latest ...
2018-02-16T04:57:33.777538638Z Registry server Address: 
2018-02-16T04:57:33.777569802Z Registry server User Name: serviceaccount
2018-02-16T04:57:33.777574483Z Registry server Email: [email protected]
2018-02-16T04:57:33.777578126Z Registry server Password: <<non-empty>>
2018-02-16T04:57:33.797309512Z error: build error: Failed to push image: Get https://docker-registry.default.svc:5000/v1/_ping: dial tcp: lookup docker-registry.default.svc on 127.0.0.1:53: no such host

https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_image_ecosystem/388/

/cc @danwinship @dcbw

I was able to pull a new image until yesterday. Today I needed to fix a bug and when I tried to make a new build, It didn't work.

I am hitting this only on some of the nodes after upgrading from 3.6 to 3.7.42-1. I potentially have the same issue on one 3.6 node too, as its resolv.conf looks different from the others.

In my case it was fixed by just restarting NetworkManager. I rebooted the node afterwards to check, and it was still OK. Apparently some race condition is still going on.

I hit this exact issue in Origin 3.9 on CentOS 7.5 and added a lot of debug printing (to a file) in the dispatcher.d/99-origin-dns.sh script.

I started having the issue when I switched my NetworkManager config from dhcp-provided IPs to static configuration.

What I found out is the following:

  • NetworkManager writes the /etc/resolv.conf with the static resolver/search/etc
  • the dispatcher script is called with $2 = up ; the script sets up /etc/resolv.conf correctly
  • the dispatcher script is called with $2 = connectivity-change ; does nothing ; /etc/resolv.conf stays identical (working)
  • NetworkManager overwrites /etc/resolv.conf with the static resolver/search/etc
  • the dispatcher script is not called; this leaves OpenShift unable to pull images from the private registry, and new pods unable to resolve Kubernetes services

I found a workaround: adding dns=none to the [main] section of the NetworkManager config makes things work in my case, since the dispatcher script is still called and thus sets up /etc/dnsmasq.d/origin-upstream-dns.conf correctly with the NetworkManager-provided resolvers, and /etc/resolv.conf with the local dnsmasq IP.
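
A minimal sketch of that workaround, assuming the stock config file location:

# /etc/NetworkManager/NetworkManager.conf
[main]
dns=none

systemctl restart NetworkManager   # so the dispatcher script runs again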

In my case (OCP 3.9) I had to add this to '/etc/dnsmasq.d/node-dnsmasq.conf' and restart dnsmasq.

server=/default.svc/172.30.0.1
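
Applied as commands, that workaround is roughly (172.30.0.1 being the kubernetes service ClusterIP shown earlier in this thread):

echo 'server=/default.svc/172.30.0.1' >> /etc/dnsmasq.d/node-dnsmasq.conf
systemctl restart dnsmasq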

Check the DNS in the NetworkManager CLI using this command: nmcli con show eth0

If the DNS is wrong, change it with this command: nmcli con modify eth0 ipv4.dns "DNS IP"
then restart the NetworkManager and dnsmasq services.
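
Put together, that check-and-fix looks roughly like this (eth0 and the DNS IP are placeholders for your own connection name and nameserver):

nmcli con show eth0 | grep ipv4.dns
nmcli con modify eth0 ipv4.dns "10.1.208.34"
systemctl restart NetworkManager dnsmasq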

I have created a 3.10 cluster using Ansible, this is what I currently have:

[root@master-node]# oc get all
NAME                           READY     STATUS              RESTARTS   AGE
pod/docker-registry-1-7lbtl    1/1       Running             0          25m
pod/docker-registry-2-deploy   0/1       Error               0          25m
pod/registry-console-1-9lxmp   1/1       Running             4          6d
pod/registry-console-1-txj95   0/1       MatchNodeSelector   0          7d
pod/router-1-5p8gz             1/1       Running             0          26m

NAME                                       DESIRED   CURRENT   READY     AGE
replicationcontroller/docker-registry-1    1         1         1         25m
replicationcontroller/docker-registry-2    0         0         0         25m
replicationcontroller/registry-console-1   1         1         1         7d
replicationcontroller/router-1             1         1         1         26m

NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                   AGE
service/docker-registry    ClusterIP   172.30.198.206   <none>        5000/TCP                  25m
service/kubernetes         ClusterIP   172.30.0.1       <none>        443/TCP,53/UDP,53/TCP     7d
service/registry-console   ClusterIP   172.30.125.193   <none>        9000/TCP                  7d
service/router             ClusterIP   172.30.31.91     <none>        80/TCP,443/TCP,1936/TCP   26m

NAME                                                  REVISION   DESIRED   CURRENT   TRIGGERED BY
deploymentconfig.apps.openshift.io/docker-registry    2          1         0         config
deploymentconfig.apps.openshift.io/registry-console   1          1         1         config
deploymentconfig.apps.openshift.io/router             1          1         1         config

NAME                                        HOST/PORT                                                   PATH      SERVICES           PORT      TERMINATION   WILDCARD
route.route.openshift.io/docker-registry    docker-registry-default.router.default.svc.cluster.local              docker-registry    <all>     passthrough   None
route.route.openshift.io/registry-console   registry-console-default.router.default.svc.cluster.local             registry-console   <all>     passthrough   None

When I try building an app I get this:

error: build error: Failed to push image: Get https://docker-registry.default.svc:5000/v1/_ping: dial tcp: lookup docker-registry.default.svc on 10.1.208.34:53: no such host

Which is using one of our external DNS servers to do the lookup.

I do see that in 3.10 the dnsmasq config looks correct though.

[root@master-node]# more /etc/dnsmasq.d/*
::::::::::::::
/etc/dnsmasq.d/node-dnsmasq.conf
::::::::::::::
server=/default.svc/172.30.0.1
::::::::::::::
/etc/dnsmasq.d/origin-dns.conf
::::::::::::::
no-resolv
domain-needed
no-negcache
max-cache-ttl=1
enable-dbus
dns-forward-max=10000
cache-size=10000
bind-dynamic
min-port=1024
except-interface=lo
# End of config
::::::::::::::
/etc/dnsmasq.d/origin-upstream-dns.conf
::::::::::::::
server=10.1.208.34

Any ideas? I followed the setup steps and deleted the original registry and router, but it's possible I missed a step.

@markandrewj What's in your /etc/resolv.conf on the host? It should have added cluster.local to the search path. What does dig +showsearch docker-registry.default.svc show you?

@sdodson That entry wasn't in the resolv.conf. I tried adding it to the resolv.conf and the request was still being routed incorrectly afterwards. Then I tried restarting the network service after, and the route was still incorrect. So as a last ditch effort I tried restarting the host. When it came back up /etc/NetworkManager/dispatcher.d/99-origin-dns.sh added the cluster.local entry to the search, and the nameserver IP, however the script also stripped out our DNS name server IP entries.

I added them back afterwards, this may resolve our issue, but I put the cluster in a weird state trying different things to fix the issue. What is the recommended way to add the entries so they are not overwritten on reboot?

I noticed that 3.11 was released while I was investigating this issue, and since this is a POC, I decided I would kickstart some fresh nodes and do a clean install of 3.11. I am just syncing the new packages to our Satellite server at the moment. I can update you once I have things going.

I encountered a similar problem with 3.11. In testing:

dig +showsearch docker-registry.default.svc resolved to 92.242.140.21
dig +showsearch docker-registry.default.svc.cluster.local resolved to 172.30.89.2

As a less than ideal workaround, adding '172.30.89.2 docker-registry.default.svc' to /etc/hosts and restarting the cluster (including docker) worked for me.

I encountered a similar problem with 3.11. In testing:

dig +showsearch docker-registry.default.svc resolved to 92.242.140.21
dig +showsearch docker-registry.default.svc.cluster.local resolved to 172.30.89.2

As a less than ideal workaround, adding '172.30.89.2 docker-registry.default.svc' to /etc/hosts and restarting the cluster (including docker) worked for me.

Worked for me

@dstanley @co-de do I need to change '172.30.89.2' or '92.242.140.21' to my own IP? If so, how do I find the two IPs? Thanks.

@dstanley @co-de do I need to change '172.30.89.2' or '92.242.140.21' to my own IP? If so, how do I find the two IPs? Thanks.

Yes, use your own cluster's IPs. Run "dig +showsearch docker-registry.default.svc" and "dig +showsearch docker-registry.default.svc.cluster.local" to find them.
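
For example (a sketch; the service IP returned is specific to your cluster):

dig +showsearch docker-registry.default.svc.cluster.local
echo '<service-ip> docker-registry.default.svc' >> /etc/hosts   # substitute the A record returned above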

Hey, sorry for the late reply.

We were able to get things working on 3.11 using this inventory:

[OSEv3:children]
masters
nodes
etcd

# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=root

openshift_deployment_type=openshift-enterprise
oreg_auth_user="{{ os_registry_user }}"
oreg_auth_password="{{ os_registry_pass }}"

openshift_master_default_subdomain=apps.openshift.subdomain
openshift_hosted_registry_routehost=registry.apps.openshift.subdomain

# uncomment the following to enable htpasswd authentication; defaults to DenyAllPasswordIdentityProvider
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]

# host group for masters
[masters]
node01.subdomain openshift_public_hostname=master.openshift.subdomain

# host group for etcd
[etcd]
node01.subdomain openshift_public_hostname=master.openshift.subdomain

# host group for nodes, includes region info
[nodes]
node01.subdomain openshift_public_hostname=node01.openshift.subdomain openshift_node_group_name='node-config-master-infra'
node02.subdomain openshift_public_hostname=node02.openshift.subdomain openshift_node_group_name='node-config-compute'

We had a couple issues which were causing different problems. There are still a few things we are working out as well. Thanks for the help.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

I'm marking this as closed, but please feel free to reopen this if there is still an active issue.

/close

@markandrewj: Closing this issue.

In response to this:

I'm marking this as closed, but please feel free to reopen this if there is still an active issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I used Ansible to set up OpenShift 4.1.
But I cannot push container images with podman.

Get an error like this

Get https://image-registry.openshift-image-registry.svc:5000/v2/: dial tcp: lookup image-registry.openshift-image-registry.svc on 127.0.0.1:53: no such host

What host should I use for the registry?

What host should I use for the registry?

assuming you're trying to push from a machine that's not part of your cluster, you should create a route for the registry and push to the route hostname.

https://docs.openshift.com/container-platform/4.1/registry/securing-exposing-registry.html#registry-exposing-secure-registry-manually_securing-exposing-registry
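
For 4.x the linked docs boil down to roughly this (a sketch; the resource and route names are the 4.x defaults as far as I know):

oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge -p '{"spec":{"defaultRoute":true}}'
oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}'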

Thanks @bparees !
Using the link you provided, I was able to find the HOST and use podman login to log in to the registry.

But now I cannot push the image successfully.
The error output looks like this:

podman push 5f6c26a1851a default-route-openshift-image-registry.apps.ocp4.example.com

Getting image source signatures
Error: Error copying image to the remote destination: Error trying to reuse blob sha256:0b8293539998d18431d3f74d3e9f89cc8ff7d057a234fae914e0213c8a0c63e3 at destination: Error checking whether a blob sha256:0b8293539998d18431d3f74d3e9f89cc8ff7d057a234fae914e0213c8a0c63e3 exists in docker.io/library/default-route-openshift-image-registry.apps.ocp4.a10networks.com: errors:
denied: requested access to the resource is denied
error parsing HTTP 401 response body: unexpected end of JSON input: ""


======================================

podman push --creds=kubeadmin:password 5f6c26a1851a default-route-openshift-image-registry.apps.ocp4.example.com
Getting image source signatures

Error: Error copying image to the remote destination: Error trying to reuse blob sha256:0b8293539998d18431d3f74d3e9f89cc8ff7d057a234fae914e0213c8a0c63e3 at destination: unable to retrieve auth token: invalid username/password

The second command uses kubeadmin:password taken from podman login -u kubeadmin -p password

Apparently the authentication does not work. What is the right/updated approach for pushing images from a remote machine?

Best

see: https://docs.openshift.com/container-platform/3.11/install_config/registry/accessing_registry.html#access-logging-in-to-the-registry

you can also use "oc registry login" to update your docker config.json w/ auth creds for pushing to the registry.

but the key is that you can't be using the default cluster-admin account as it has no token. You need to create a regular user or use a service account.
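
A sketch of the service-account route, with made-up names (pusher, myproject) and the registry route hostname from earlier in the thread:

oc create serviceaccount pusher -n myproject
oc policy add-role-to-user registry-editor -z pusher -n myproject
podman login -u pusher -p "$(oc sa get-token pusher -n myproject)" default-route-openshift-image-registry.apps.ocp4.example.com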

(though those are 3.11 docs, they are valid for 4.x. @adambkaplan @dmage @bmcelvee we need to get that content ported into 4.x if it's not there)

I'm also not sure what you're pushing to, since you only provided the registry hostname, no repository (namespace name) or image name.

(looks like the 4.x docs are here: https://docs.openshift.com/container-platform/4.1/registry/accessing-the-registry.html#registry-accessing-directly_accessing-the-registry but they are a bit mixed up since they talk about accessing the registry from within the cluster and using the registry service hostname instead of exposing a route and using the route hostname).

@bparees

but the key is that you can't be using the default cluster-admin account as it has no token. You need to create a regular user or use a service account.

This information is critical. But how do I create a user for just a lab setup environment? Is setting up an LDAP server mandatory? I tried the htpasswd approach but it shows an internal error on the command line.

Username: admin
Password:
Error from server (InternalError): Internal error occurred: unexpected response: 500

I'm also not sure what you're pushing to, since you only provided the registry hostname, no repository (namespace name) or image name.

Assume HOST is the exposed default registry route.
Do I have to create a repo name first to push my container image?

For example, if I tag my image like this, $HOST/default_reponame/app, will it just work?

Htpasswd is fine. Creating a service account and using its token is even easier.

The image repository name maps to an OpenShift project/namespace name. You have to have permission to create imagestreams in the namespace to be able to push an image to it.

Ben Parees | OpenShift
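
Concretely, a sketch of a push that lines up with that; the project name (myproject), image name (app), and tag are examples, and $HOST is the exposed route hostname from above:

oc new-project myproject
podman tag 5f6c26a1851a $HOST/myproject/app:latest
podman push $HOST/myproject/app:latest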


Htpasswd is fine. Creating a service account and using its token is even easier.

Thanks for the tips. I used a service account and successfully pushed my local image to an imagestream inside OpenShift. But then the process got stuck again: I created a pod YAML and tried to run the service, but it failed to pull the image. The YAML file looks like this. Am I missing anything here?

apiVersion: v1
kind: Pod
metadata:
  name: hello-test
  labels:
    app: hello-test
  namespace: default
spec:
  containers:
    - name: hello-test
      image: default/500p1b14
      ports:
        - containerPort: 22

@Piping refer to the following to reference an ImageStream tag directly in a pod or other k8s resource:

https://docs.okd.io/3.11/dev_guide/managing_images.html#using-is-with-k8s
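
If referencing the ImageStream tag directly doesn't suit, an alternative sketch is to point the pod at the full internal-registry pull spec; the registry service hostname is the one from the earlier 4.1 error in this thread, while the project (default), imagestream name (500p1b14), and tag (latest) are assumptions based on the YAML above:

spec:
  containers:
    - name: hello-test
      image: image-registry.openshift-image-registry.svc:5000/default/500p1b14:latest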
