Depending on the network topology or DNS servers, a fully disconnected installation in some instances will not be able to resolve route URLs inside the cluster. This manifests as the Che server pod failing to retrieve the OpenID configuration at $PUBLIC_KEYCLOAK_URL/auth/realms/che/.well-known/openid-configuration.
I don't know exactly how OpenShift does DNS in different environments. I would expect in-cluster traffic to be able to resolve a route properly, but that does not appear to be the case in all scenarios.
curl $KEYCLOAK_ROUTE_URL/auth/realms/che/.well-known/openid-configuration times out, but
curl keycloak.namespace.svc:8080/auth/realms/che/.well-known/openid-configuration succeeds
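A quick way to see where each name points from inside the cluster (a sketch only; nslookup availability depends on the pod image, and $KEYCLOAK_ROUTE_HOST is a placeholder for the route's hostname):

# Service name: resolves to a ClusterIP, reachable without leaving the cluster
$ nslookup keycloak.namespace.svc
# Route host: resolves to the ingress load balancer, so traffic has to leave
# the cluster and come back in
$ nslookup $KEYCLOAK_ROUTE_HOST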
Start a Che installation in a disconnected environment.
Alternatively, you may be able to override these by setting customCheProperties fields in the CR. Working on the list of those.
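As a sketch of that override approach (the apiVersion and the placeholder namespace are assumptions; the property name comes from a later comment in this thread):

apiVersion: org.eclipse.che/v1
kind: CheCluster
metadata:
  name: eclipse-che
spec:
  server:
    customCheProperties:
      # Point the Che server at the in-cluster Keycloak service
      # instead of the public route (namespace is a placeholder).
      CHE_KEYCLOAK_AUTH__SERVER__URL: 'http://keycloak.<namespace>.svc:8080/auth'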
The same problem happens much sooner when installing with TLS and a self-signed certificate: while extracting the certificate, the operator creates a temporary route, but this route is not accessible:
time="2019-11-14T14:19:59Z" level=info msg="Creating a test route test to extract router crt"
time="2019-11-14T14:19:59Z" level=info msg="Creating a new object Route, name: test"
time="2019-11-14T14:20:29Z" level=error msg="An error occurred when reaching test TLS route: Get https://test-rhopp-air-gap-crw.apps.rhopp-airgap.qe.devcluster.openshift.com: dial tcp 3.130.28.239:443: i/o timeout"
time="2019-11-14T14:20:29Z" level=error msg="Failed to extract crt. Failed to create a secret with a self signed crt: Get https://test-rhopp-air-gap-crw.apps.rhopp-airgap.qe.devcluster.openshift.com: dial tcp 3.130.28.239:443: i/o timeout"
time="2019-11-14T14:20:30Z" level=info msg="Creating a test route test to extract router crt"
time="2019-11-14T14:20:31Z" level=error msg="An error occurred when reaching test TLS route: Get https:: http: no Host in request URL"
time="2019-11-14T14:20:31Z" level=error msg="Failed to extract crt. Failed to create a secret with a self signed crt: Get https:: http: no Host in request URL"
time="2019-11-14T14:20:32Z" level=info msg="Creating a test route test to extract router crt"
time="2019-11-14T14:20:33Z" level=error msg="An error occurred when reaching test TLS route: Get https:: http: no Host in request URL"
time="2019-11-14T14:20:33Z" level=error msg="Failed to extract crt. Failed to create a secret with a self signed crt: Get https:: http: no Host in request URL"
time="2019-11-14T14:20:34Z" level=info msg="Creating a test route test to extract router crt"
...
@rhopp This commit may be a fix for the issue you mentioned, but I'm not sure: https://github.com/eclipse/che-operator/pull/66/commits/db15bdb770dac7171f2cfb41451fc737bc4fb5c8
I've been able to successfully start che-server using the Kubernetes internal DNS name of the Keycloak service (in my case CHE_KEYCLOAK_AUTH__SERVER__URL: 'http://keycloak.rhopp-air-gap-crw.svc:8080/auth').
But then (as expected) the dashboard wasn't able to load (with the typical message "Authorization token is missed"), because my browser couldn't resolve that URL.
@rhopp @tomgeorge Would it be possible to check with the OpenShift teams whether it is expected that typical air-gapped OpenShift 4.2 installations do not allow pods to access external routes?
That seems like a very hard restriction that would probably make Che fail anyway.
By default, on AWS, GCP, and Azure, if cluster DNS zone configuration was provided to the OpenShift installer, OpenShift will manage wildcard DNS records for ingress in the configured zones (assuming ingress is being exposed by a LoadBalancer Service, which is the default on those platforms.)
On other platforms, or if cluster DNS zone configuration is omitted, wildcard DNS records for ingress are _not managed_ and it's up to the cluster owner to configure DNS to expose ingress (if desired.)
I hope that helps clarify some of the DNS management behavior. I can provide more specific details if someone can help me understand how the problematic clusters are being created (e.g. through the OpenShift installer IPI flow, UPI, etc.)
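As a quick check of whether a wildcard ingress record is in place, resolving an arbitrary name under the apps domain should return the ingress load balancer address (domain taken from the operator logs above):

$ dig +short anything.apps.rhopp-airgap.qe.devcluster.openshift.com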
Thanks to @ironcladlou for looking into this with me. The issue appears to be that the traffic is rejected by the LB or on the way back to the node. We should look at the way this cluster was configured in QE and see whether it matches the installation procedure in https://docs.openshift.com/container-platform/4.2/installing/installing_restricted_networks/installing-restricted-networks-aws.html
Response from Jianlin Liu, who has knowledge of how the cluster is configured:
Ah, I know what the problem is there.
In this disconnected env, we enabled proxy settings so that some core operators can connect to the cloud API.
The cluster-image-registry-operator loaded these proxy settings; that is why my testing went well.
sh-4.2$ env|grep -i proxy
NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.rhopp-airgap.qe.devcluster.openshift.com,api.rhopp-airgap.qe.devcluster.openshift.com,etcd-0.rhopp-airgap.qe.devcluster.openshift.com,etcd-1.rhopp-airgap.qe.devcluster.openshift.com,etcd-2.rhopp-airgap.qe.devcluster.openshift.com,localhost,test.no-proxy.com
HTTPS_PROXY=http://ec2-3-17-157-193.us-east-2.compute.amazonaws.com:3128
HTTP_PROXY=http://ec2-3-17-157-193.us-east-2.compute.amazonaws.com:3128
In this disconnected env, we drop all internet connections from the subnets to create an air-gapped env. The apps DNS points to an external ELB, which is provisioned by the ingress operator.
So that means you are trying to access an internet URL from inside the cluster. I think that is expected behavior.
Personally, I see the apps URL as mainly for external user access; if you want to access a cluster service from inside the cluster, why not use the Kubernetes svc endpoints?
If I understand the setup correctly and you really want to use Routes on an internal subnet (i.e. routes that can be accessed _only within the private subnet_), then with OpenShift 4.2 you can try replacing the default ingresscontroller with an internally-scoped variant that provisions the LB on the cluster's private subnet, e.g.
$ oc replace --force --wait --filename - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: default
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal
EOF
(See these Kubernetes docs for more detail on how this works)
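After replacing the controller, something along these lines should confirm the new scope (the jsonpath simply reads back the field set in the manifest above):

$ oc -n openshift-ingress-operator get ingresscontroller default \
    -o jsonpath='{.spec.endpointPublishingStrategy.loadBalancer.scope}'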
@rhopp @jianlinliu Could you please clarify what the expected and recommended OCP installation setup/config in airgap mode is regarding DNS / LB? If I understand correctly, we face this issue because DNS resolution of routes on the QA cluster happens over the public internet, and the only way to communicate is using the service name + port combo. What I don't understand is how OCP in airgap mode ends up falling back on the public internet for route resolution; shouldn't it use internal DNS by default?
This document is the best thing that we have for airgap/restricted network installations: https://docs.openshift.com/container-platform/4.2/installing/installing_restricted_networks/installing-restricted-networks-aws.html
The issue is not actually the DNS resolution but rather that there is no route for traffic to exit the cluster and return through the ELB. After looking at the templates in http://git.app.eng.bos.redhat.com/git/openshift-misc.git/plain/v3-launch-templates/functionality-testing/aos-4_2/hosts/upi_on_aws-cloudformation-templates/ it looks like it is using Route53 for DNS resolution.
I went through the CloudFormation templates that are used in this installation and compared them with the ones in the docs, and found that the only differences were in the VPC/networking configuration. The documented CloudFormation stack had:
- AWS::EC2::NatGateway resources tied to the public subnets
- AWS::EC2::EIP resources in the VPC's domain
- AWS::EC2::Route resources from the private routing tables to the NatGateway

The template used in cluster provisioning did not have these resources. Additionally, the template used in installation had:
- An AWS::EC2::SecurityGroup allowing ingress from all protocols to the VPC CIDR range
- com.amazonaws.${AWS_REGION}.ec2 and com.amazonaws.${AWS_REGION}.elasticloadbalancing endpoints

Could the lack of AWS routes from the private subnets to the NatGateway be the cause of this? The docs seem to indicate that not all of these resources are necessary in a restricted network environment:
You must have a public internet gateway, with public routes, attached to the VPC. In the provided templates, each public subnet has a NAT gateway with an EIP address. These NAT gateways allow cluster resources, like private-subnet instances, to reach the internet and are not required for some restricted network or proxy scenarios.
So it seems to come down to a difference in configuration from the documented way, plus the behavior of AWS ELBs, where the traffic must leave the AWS network and come back in.
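For reference, the private-subnet default route that the documented templates define (and that seems to be missing here) looks roughly like this in CloudFormation; the logical IDs are placeholders, not names from the QE templates:

PrivateRoute:
  Type: AWS::EC2::Route
  Properties:
    # Route table attached to the private subnets
    RouteTableId: !Ref PrivateRouteTable
    # Default route: send all non-local outbound traffic...
    DestinationCidrBlock: 0.0.0.0/0
    # ...through a NAT gateway that sits in a public subnet with an EIP
    NatGatewayId: !Ref NatGateway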
I wonder how hard it would be to refactor Che to use service hostnames wherever possible, and keep the public-facing route for client-side things?
Latest info from my side:
I'm not very sure about all the underlying networking/solutions... But since Jianlin said they are using a proxy, I've tried to deploy CRW 2.0 with the proxy configured and it works (the server successfully started and the dashboard loaded; the Keycloak redirection was working).
I wasn't able to try workspace startup and I don't have time to do that now; this will have to wait for Monday.
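For reference, a minimal sketch of what "deployed with the proxy configured" can look like in the CheCluster CR (the field names are my assumption about the operator CRD of this era; the host and port are taken from the env output quoted above, and the non-proxy list is a placeholder):

apiVersion: org.eclipse.che/v1
kind: CheCluster
metadata:
  name: codeready-workspaces
spec:
  server:
    # Proxy host/port from the cluster-wide proxy settings quoted above.
    proxyURL: 'http://ec2-3-17-157-193.us-east-2.compute.amazonaws.com'
    proxyPort: '3128'
    # Keep in-cluster traffic off the proxy.
    nonProxyHosts: 'localhost|127.0.0.1|.svc|.cluster.local'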
PR with the docs update has been merged: eclipse/che-docs#944. Closing.