Depending on the network topology or DNS servers, a fully disconnected installation in some instances will not be able to resolve route URLs inside the cluster. This manifests as the Che server pod failing to retrieve the OpenID configuration at $PUBLIC_KEYCLOAK_URL/auth/realms/che/.well-known/openid-configuration.
I don't know exactly how OpenShift does DNS in different environments. I would expect in-cluster traffic to be able to resolve a route properly, but that does not appear to be the case in all scenarios.
curl $KEYCLOAK_ROUTE_URL/auth/realms/che/.well-known/openid-configuration times out, but
curl keycloak.namespace.svc:8080/auth/realms/che/.well-known/openid-configuration succeeds
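A quick way to see where each name points from inside the cluster (a sketch only; nslookup availability depends on the pod image, and $KEYCLOAK_ROUTE_HOST is a placeholder for the route's hostname):

# Service name: resolves to a ClusterIP, reachable without leaving the cluster
$ nslookup keycloak.namespace.svc
# Route host: resolves to the ingress load balancer, so traffic has to leave
# the cluster and come back in
$ nslookup $KEYCLOAK_ROUTE_HOST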
Start a Che installation in a disconnected environment.
Alternatively, you may be able to override these by setting customCheProperties fields in the CR. Working on the list of those.
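As a sketch of that override approach (the apiVersion and the placeholder namespace are assumptions; the property name comes from a later comment in this thread):

apiVersion: org.eclipse.che/v1
kind: CheCluster
metadata:
  name: eclipse-che
spec:
  server:
    customCheProperties:
      # Point the Che server at the in-cluster Keycloak service
      # instead of the public route (namespace is a placeholder).
      CHE_KEYCLOAK_AUTH__SERVER__URL: 'http://keycloak.<namespace>.svc:8080/auth'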
The same problem happens much sooner when installing with TLS and a self-signed certificate: while extracting the certificate, the operator creates a temporary route, but this route is not accessible:
time="2019-11-14T14:19:59Z" level=info msg="Creating a test route test to extract router crt"
time="2019-11-14T14:19:59Z" level=info msg="Creating a new object Route, name: test"
time="2019-11-14T14:20:29Z" level=error msg="An error occurred when reaching test TLS route: Get https://test-rhopp-air-gap-crw.apps.rhopp-airgap.qe.devcluster.openshift.com: dial tcp 3.130.28.239:443: i/o timeout"
time="2019-11-14T14:20:29Z" level=error msg="Failed to extract crt. Failed to create a secret with a self signed crt: Get https://test-rhopp-air-gap-crw.apps.rhopp-airgap.qe.devcluster.openshift.com: dial tcp 3.130.28.239:443: i/o timeout"
time="2019-11-14T14:20:30Z" level=info msg="Creating a test route test to extract router crt"
time="2019-11-14T14:20:31Z" level=error msg="An error occurred when reaching test TLS route: Get https:: http: no Host in request URL"
time="2019-11-14T14:20:31Z" level=error msg="Failed to extract crt. Failed to create a secret with a self signed crt: Get https:: http: no Host in request URL"
time="2019-11-14T14:20:32Z" level=info msg="Creating a test route test to extract router crt"
time="2019-11-14T14:20:33Z" level=error msg="An error occurred when reaching test TLS route: Get https:: http: no Host in request URL"
time="2019-11-14T14:20:33Z" level=error msg="Failed to extract crt. Failed to create a secret with a self signed crt: Get https:: http: no Host in request URL"
time="2019-11-14T14:20:34Z" level=info msg="Creating a test route test to extract router crt"
...
@rhopp This commit may be a fix for the issue you mentioned, but I'm not sure: https://github.com/eclipse/che-operator/pull/66/commits/db15bdb770dac7171f2cfb41451fc737bc4fb5c8
I've been able to successfully start che-server using the Kubernetes internal DNS name of the Keycloak service (in my case CHE_KEYCLOAK_AUTH__SERVER__URL: 'http://keycloak.rhopp-air-gap-crw.svc:8080/auth').
But then (as expected) the dashboard wasn't able to load (with the typical message "Authorization token is missed"), because my browser couldn't resolve that URL.
@rhopp @tomgeorge Would it be possible to check with the OpenShift teams whether it is expected that typical air-gapped OpenShift 4.2 installations do not allow pods to access external routes?
That seems like a very hard restriction that would probably make Che fail anyway.
By default, on AWS, GCP, and Azure, if cluster DNS zone configuration was provided to the OpenShift installer, OpenShift will manage wildcard DNS records for ingress in the configured zones (assuming ingress is being exposed by a LoadBalancer Service, which is the default on those platforms.)
On other platforms, or if cluster DNS zone configuration is omitted, wildcard DNS records for ingress are _not managed_ and it's up to the cluster owner to configure DNS to expose ingress (if desired.)
I hope that helps clarify some of the DNS management behavior. I can provide more specific details if someone can help me understand how the problematic clusters are being created (e.g. through the OpenShift installer IPI flow, UPI, etc.)
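As a quick check of whether a wildcard ingress record is in place, resolving an arbitrary name under the apps domain should return the ingress load balancer address (domain taken from the operator logs above):

$ dig +short anything.apps.rhopp-airgap.qe.devcluster.openshift.com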
Thanks to @ironcladlou for looking into this with me. The issue appears to be that the traffic is rejected by the LB or on the way back to the node. We should look at the way this cluster was configured in QE and see whether it matches the installation procedure in https://docs.openshift.com/container-platform/4.2/installing/installing_restricted_networks/installing-restricted-networks-aws.html
Response from Jianlin Liu, who has knowledge of how the cluster is configured:
Ah, I know what the problem is there.
In this disconnected env, we enabled proxy settings so that some core operators can connect to the cloud API.
The cluster-image-registry-operator loaded these proxy settings; that is why my testing went well.
sh-4.2$ env|grep -i proxy
NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.rhopp-airgap.qe.devcluster.openshift.com,api.rhopp-airgap.qe.devcluster.openshift.com,etcd-0.rhopp-airgap.qe.devcluster.openshift.com,etcd-1.rhopp-airgap.qe.devcluster.openshift.com,etcd-2.rhopp-airgap.qe.devcluster.openshift.com,localhost,test.no-proxy.com
HTTPS_PROXY=http://ec2-3-17-157-193.us-east-2.compute.amazonaws.com:3128
HTTP_PROXY=http://ec2-3-17-157-193.us-east-2.compute.amazonaws.com:3128
In this disconnected env, we drop all internet connections from the subnets to create an air-gapped env. The apps DNS points to an external ELB, which is provisioned by the ingress operator.
So that means you are trying to access an internet URL from inside the cluster. I think that is expected behavior.
Personally, I see the apps URL as mainly for external user access; if you want to access a cluster service from inside the cluster, why not use the Kubernetes svc endpoints?
If I understand the setup correctly and you really want to use Routes on an internal subnet (i.e. routes that can be accessed _only within the private subnet_), then with OpenShift 4.2 you can try replacing the default ingresscontroller with an internally-scoped variant that provisions the LB on the cluster's private subnet, e.g.
$ oc replace --force --wait --filename - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: default
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal
EOF
(See these Kubernetes docs for more detail on how this works)
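After replacing the controller, something along these lines should confirm the new scope (the jsonpath simply reads back the field set in the manifest above):

$ oc -n openshift-ingress-operator get ingresscontroller default \
    -o jsonpath='{.spec.endpointPublishingStrategy.loadBalancer.scope}'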
@rhopp @jianlinliu Could you please clarify what the expected and recommended OCP installation setup/config in airgap mode is regarding DNS / LB? If I understand correctly, we face this issue because DNS resolution of routes on the QA cluster happens over the public internet, and the only way to communicate is using the service name + port combo. What I don't understand is how OCP in airgap mode ends up falling back on the public internet for route resolution; shouldn't it use internal DNS by default?
This document is the best thing that we have for airgap/restricted network installations: https://docs.openshift.com/container-platform/4.2/installing/installing_restricted_networks/installing-restricted-networks-aws.html
The issue is not actually the DNS resolution but rather that there is no route for traffic to exit the cluster and return through the ELB. After looking at the templates in http://git.app.eng.bos.redhat.com/git/openshift-misc.git/plain/v3-launch-templates/functionality-testing/aos-4_2/hosts/upi_on_aws-cloudformation-templates/ it looks like it is using Route53 for DNS resolution.
I went through the CloudFormation templates that are used in this installation and compared them with the ones in the docs, and found that the only differences were in the VPC/networking configuration. The documented CloudFormation stack had:
- AWS::EC2::NatGateway resources tied to the public subnets
- AWS::EC2::EIP resources in the VPC's domain
- AWS::EC2::Route resources from the private routing tables to the NatGateway

The template used in cluster provisioning did not have these resources. Additionally, the template used in installation had:
- An AWS::EC2::SecurityGroup allowing ingress from all protocols to the VPC CIDR range
- com.amazonaws.${AWS_REGION}.ec2 and com.amazonaws.${AWS_REGION}.elasticloadbalancing endpoints

Could the lack of AWS routes from the private subnets to the NatGateway be the cause of this? The docs seem to indicate that not all of these resources are necessary in a restricted network environment:
You must have a public internet gateway, with public routes, attached to the VPC. In the provided templates, each public subnet has a NAT gateway with an EIP address. These NAT gateways allow cluster resources, like private-subnet instances, to reach the internet and are not required for some restricted network or proxy scenarios.
So it seems to come down to a difference in configuration from the documented way, plus the behavior of AWS ELBs, where the traffic must leave the AWS network and come back in.
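For reference, the private-subnet default route that the documented templates define (and that seems to be missing here) looks roughly like this in CloudFormation; the logical IDs are placeholders, not names from the QE templates:

PrivateRoute:
  Type: AWS::EC2::Route
  Properties:
    # Route table attached to the private subnets
    RouteTableId: !Ref PrivateRouteTable
    # Default route: send all non-local outbound traffic...
    DestinationCidrBlock: 0.0.0.0/0
    # ...through a NAT gateway that sits in a public subnet with an EIP
    NatGatewayId: !Ref NatGateway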
I wonder how hard it would be to refactor Che to use service hostnames wherever possible, and keep the public-facing route for client-side things?
Latest info from my side:
I'm not very sure about all the underlying networking/solutions... But since Jianlin said they are using a proxy, I've tried to deploy CRW 2.0 with the proxy configured and it works (the server successfully started and the dashboard loaded; the Keycloak redirection was working).
I wasn't able to try workspace startup and I don't have time to do that now; this will have to wait for Monday.
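For reference, a minimal sketch of what "deployed with the proxy configured" can look like in the CheCluster CR (the field names are my assumption about the operator CRD of this era; the host and port are taken from the env output quoted above, and the non-proxy list is a placeholder):

apiVersion: org.eclipse.che/v1
kind: CheCluster
metadata:
  name: codeready-workspaces
spec:
  server:
    # Proxy host/port from the cluster-wide proxy settings quoted above.
    proxyURL: 'http://ec2-3-17-157-193.us-east-2.compute.amazonaws.com'
    proxyPort: '3128'
    # Keep in-cluster traffic off the proxy.
    nonProxyHosts: 'localhost|127.0.0.1|.svc|.cluster.local'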
PR with the docs update has been merged: eclipse/che-docs#944. Closing.