Serving: Network configuration when using private clusters

Created on 22 Jul 2019 · 21 Comments · Source: knative/serving

In what area(s)?

/area networking

Ask your question here:

We encountered an issue when using Knative in a private cluster environment. Consider the following architecture:

We have a cluster for our engineers running in GKE as a private cluster (master and nodes are inaccessible via the Internet). Unfortunately, when applying a Knative service it fails with:

Internal error occurred: failed calling webhook "webhook.serving.knative.dev": Post https://webhook.knative-serving.svc:443/?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Everything works as expected when installing the service on a public cluster. Any help on this is highly appreciated 🙂

area/networking kind/question

All 21 comments

This is quite strange, since https://webhook.knative-serving.svc:443/ is definitely a cluster local address.
Do you have logs from the webhook itself? Did it register successfully?
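A minimal sketch of how those questions could be checked. It only prints the kubectl commands so they can be reviewed first. The `knative-serving` namespace comes from the error message; the Deployment name `webhook` is an assumption about the install.

```shell
# Print kubectl commands for a first look at the webhook; nothing is executed.
# The namespace defaults to knative-serving; "deployment/webhook" is an assumed name.
webhook_checks() {
  local namespace="${1:-knative-serving}"
  # Is the webhook pod running and ready?
  printf 'kubectl -n %s get pods\n' "$namespace"
  # What does the webhook itself log?
  printf 'kubectl -n %s logs deployment/webhook\n' "$namespace"
  # Did it register its admission webhook configurations?
  printf 'kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations\n'
}

webhook_checks
```

To actually run the printed commands, pipe the output to a shell: `webhook_checks | sh`.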

I am experiencing the exact same issue. I have installed Knative (build/serving/eventing) on 1.11.x, 1.12.x, and 1.13.x private GKE clusters. These clusters have the latest Istio installed and have master authorized networks disabled (I have also tried with them enabled), and I am unable to create builds or ksvcs under any scenario. I have also tried installing Knative v0.6.x and v0.7.x under all of the above GKE settings, with no luck either.

Can you share information about how to create a cluster like the one where you are seeing this?

@mattmoor Below are the configurations that I'm using to create my gke cluster and to bootstrap it with knative.

# Generate legacy default auth credential file for use with terraform 
gcloud auth application-default login

# Download latest terraform client, if not already present
brew install terraform

# Create terraform file that uses  [official GCP GKE module](https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google/4.1.0/submodules/beta-private-cluster)
cat << EOF > main.tf
module "gke" {
  source                     = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster"
  version                    = "4.1.0"
  project_id                 = "my-project"
  name                       = "private-gke-cluster-1"
  regional                   = true
  region                     = "us-east1"
  zones                      = ["us-east1-b", "us-east1-c", "us-east1-d"]
  network                    = "default"
  subnetwork                 = "default"
  ip_range_pods              = ""
  ip_range_services          = ""
  http_load_balancing        = true
  horizontal_pod_autoscaling = true
  kubernetes_dashboard       = false
  network_policy             = false
  kubernetes_version         = "1.13.7-gke.8"
  issue_client_certificate   = true
  service_account            =  "[email protected]"
  enable_private_nodes       = true
  enable_private_endpoint    = false
  remove_default_node_pool   = true
  istio                      = true
  cloudrun                   = false

  node_pools = [
    {
      name               = "default-node-pool"
      machine_type       = "n1-standard-2"
      min_count          = 1
      max_count          = 100
      disk_size_gb       = 100
      disk_type          = "pd-standard"
      image_type         = "COS"
      auto_repair        = true
      auto_upgrade       = true
      service_account    = "[email protected]"
      preemptible        = false
      initial_node_count = 1
    },
  ]

  node_pools_oauth_scopes = {
    all = []

    default-node-pool = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }

  node_pools_labels = {
    all = {}

    default-node-pool = {
      default-node-pool = "true"
    }
  }

  node_pools_metadata = {
    all = {}

    default-node-pool = {
      node-pool-metadata-custom-value = "my-node-pool"
    }
  }

  node_pools_taints = {
    all = []

    default-node-pool = [
      {
        key    = "default-node-pool"
        value  = "true"
        effect = "PREFER_NO_SCHEDULE"
      },
    ]
  }

  node_pools_tags = {
    all = []

    default-node-pool = [
      "default-node-pool",
    ]
  }
}
EOF


# Create GKE cluster via standard terraform client commands
terraform init
terraform plan
terraform apply

# Manually remove the GKE cluster's master authorized network (as per [this issue](https://github.com/terraform-providers/terraform-provider-google/issues/3098))
gcloud container clusters update private-gke-cluster-1  --region us-east1 --no-enable-master-authorized-networks

# Install knative CRDs as per official guidance
kubectl apply --selector knative.dev/crd-install=true \
--filename https://github.com/knative/serving/releases/download/v0.7.0/serving.yaml \
--filename https://github.com/knative/build/releases/download/v0.7.0/build.yaml \
--filename https://github.com/knative/eventing/releases/download/v0.7.0/release.yaml \
--filename https://github.com/knative/serving/releases/download/v0.7.0/monitoring.yaml

# Install knative controllers etc (sans the monitoring stack)
kubectl apply --filename https://github.com/knative/serving/releases/download/v0.7.0/serving.yaml --selector networking.knative.dev/certificate-provider!=cert-manager \
--filename https://github.com/knative/build/releases/download/v0.7.0/build.yaml \
--filename https://github.com/knative/eventing/releases/download/v0.7.0/release.yaml

# Once knative pods are up, run the following knative-build hello-world example
cat << EOF > hello-knative-build.yaml
apiVersion: build.knative.dev/v1alpha1
kind: Build
metadata:
  name: hello
spec:
  steps:
  - image: busybox
    args: ['echo', 'Hello, World!']
EOF

kubectl apply -f hello-knative-build.yaml

After running kubectl apply on the build manifest, no build resources are ever created on the cluster and in about 30 seconds I receive the same timeout error message that the OP reported.

cc @tcnghia

Thanks for the detailed repro instructions. Early this week will be a bit chaotic shutting down 0.8, but this should be very helpful attempting to reproduce what you are seeing so that we can get your problem sorted out.

@mattmoor Hate to pester you, but I'm curious if there has been any update on the knative + private GKE issue.

I'll try to find someone to look into it. I pinged @tcnghia , but realized he is out today. Sorry for the delay.

No worries at all. Thanks for putting this on the radar.

I think this is a firewall issue, similar to that of https://github.com/elastic/cloud-on-k8s/issues/1437

Can you please try the workaround there? thanks

The short explanation is that a GKE private cluster by default only allows the GKE master to reach your nodes on TCP ports 443 and 10250. Our webhook listens on 8443 behind its Service (the Service exposes 443), so 8443 needs to be allowed through the firewall.

Instructions for that are here: https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#add_firewall_rules

Other webhooks, such as Istio's, may need to be allowed in the same way.
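As a rough sketch of that workaround, the script below prints (rather than runs) a gcloud command that opens TCP 8443 from the master's CIDR range to the nodes. The rule name, master CIDR, and node tag are placeholder assumptions and must be replaced with the values from your own cluster; the lookup commands in the comments follow the GKE private cluster docs linked above.

```shell
# Print a gcloud command that lets the GKE master reach the nodes on one TCP port.
# Every argument is supplied by the caller; the defaults used below are
# hypothetical placeholders, not values taken from this thread.
firewall_cmd() {
  local rule_name="$1" network="$2" master_cidr="$3" node_tag="$4" port="${5:-8443}"
  printf 'gcloud compute firewall-rules create %s --network %s --action ALLOW --direction INGRESS --source-ranges %s --rules tcp:%s --target-tags %s\n' \
    "$rule_name" "$network" "$master_cidr" "$port" "$node_tag"
}

# Look up the real values first, for example:
#   gcloud container clusters describe private-gke-cluster-1 --region us-east1 \
#     --format 'value(privateClusterConfig.masterIpv4CidrBlock)'
#   gcloud compute firewall-rules list --filter 'name~^gke-private-gke-cluster-1'
# The CIDR and tag below are made up for illustration.
firewall_cmd allow-master-to-knative-webhook default 172.16.0.0/28 gke-private-gke-cluster-1-node
```

Printing the command instead of executing it leaves room to check the source range and target tag before changing any firewall rules.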

My identical problem was resolved by @tcnghia's suggestion to add ingress 8443 to the firewall

BTW, the reason why 443 wasn't used is to avoid a privileged port (https://github.com/knative/build/pull/604).

I just looked at Istio's webhooks, and it looks like they use 443, so no additional rule is needed for Istio. Opening 8443 should be enough.
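One way to check which container port a webhook sits behind is to compare each Service's `port` with its `targetPort`. The sketch below only prints the kubectl commands. The `webhook` Service in `knative-serving` comes from the error message; the `istio-sidecar-injector` Service in `istio-system` is an assumption about the Istio install.

```shell
# Print a kubectl command that shows port and targetPort for one Service.
# Nothing is executed; namespace and Service name are supplied by the caller.
ports_cmd() {
  local namespace="$1" service="$2"
  printf "kubectl -n %s get service %s -o 'custom-columns=NAME:.metadata.name,PORT:.spec.ports[*].port,TARGET:.spec.ports[*].targetPort'\n" \
    "$namespace" "$service"
}

ports_cmd knative-serving webhook
# Assumed Istio webhook Service name; adjust to match your install.
ports_cmd istio-system istio-sidecar-injector
```

A `TARGET` other than 443 or 10250 suggests that port needs its own firewall rule on a private cluster.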

@sjmiller609 awesome! thanks a lot for confirmation.

@bbhuston if you could confirm this works, then we should discuss if/what changes we need to close this out.

@mattmoor Sorry for the delayed response. Was on an awesome vacation and was a little too lazy to check up on this.

Anyway, I reran the terraform/gke/knative setup that I posted above and then manually opened up port 8443 in the cluster's master and worker node firewall rules. And BOOM! It works. Thank you for the follow-up and please feel free to close this issue.

Thanks for confirming.

I think we'll need to update the doc with this information, since avoiding 443 is still a good path (avoiding privileged port)

/close

@tcnghia: Closing this issue.

I got this issue on microk8s on Windows:

D:\project_amanah\n8n-tutorial>kubectl apply -f serving-n8n.yaml
Error from server: error when creating "serving-n8n.yaml": conversion webhook for serving.knative.dev/v1, Kind=Service failed: Post https://webhook.knative-serving.svc:443/?timeout=30s: dial tcp 10.152.183.205:443: connect: connection refused

Any suggestions on what I should do to start diagnosing the cause and finding alternatives?
