/area networking
We encountered an issue when using Knative in a private cluster environment. Consider the following architecture:
We have a cluster for our engineers running in GKE as a private cluster (master and nodes are inaccessible via the Internet). Unfortunately, when applying a Knative service it fails with:
Internal error occurred: failed calling webhook "webhook.serving.knative.dev": Post https://webhook.knative-serving.svc:443/?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Everything works as expected when installing the service on a public cluster. Any help on this is highly appreciated 🙂
This is quite strange, since https://webhook.knative-serving.svc:443/ is definitely a cluster local address.
Do you have logs from the webhook itself? Did it register successfully?
I am experiencing the exact same issue. I have installed Knative (build/serving/eventing) on 1.11.x, 1.12.x, and 1.13.x private GKE clusters. These clusters have the latest Istio installed and have master authorized networks disabled (I have also tried with them enabled), and I am unable to create builds or ksvcs under any scenario. I have also tried installing Knative v0.6.x and v0.7.x under all of the above GKE settings, with no luck either.
Can you share information about how to create a cluster like the one where you are seeing this?
@mattmoor Below are the configurations that I'm using to create my gke cluster and to bootstrap it with knative.
# Generate legacy default auth credential file for use with terraform
gcloud auth application-default login
# Download latest terraform client, if not already present
brew install terraform
# Create terraform file that uses [official GCP GKE module](https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google/4.1.0/submodules/beta-private-cluster)
cat << EOF > main.tf
module "gke" {
source = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster"
version = "4.1.0"
project_id = "my-project"
name = "private-gke-cluster-1"
regional = true
region = "us-east1"
zones = ["us-east1-b", "us-east1-c", "us-east1-d"]
network = "default"
subnetwork = "default"
ip_range_pods = ""
ip_range_services = ""
http_load_balancing = true
horizontal_pod_autoscaling = true
kubernetes_dashboard = false
network_policy = false
kubernetes_version = "1.13.7-gke.8"
issue_client_certificate = true
service_account = "[email protected]"
enable_private_nodes = true
enable_private_endpoint = false
remove_default_node_pool = true
istio = true
cloudrun = false
node_pools = [
{
name = "default-node-pool"
machine_type = "n1-standard-2"
min_count = 1
max_count = 100
disk_size_gb = 100
disk_type = "pd-standard"
image_type = "COS"
auto_repair = true
auto_upgrade = true
service_account = "[email protected]"
preemptible = false
initial_node_count = 1
},
]
node_pools_oauth_scopes = {
all = []
default-node-pool = [
"https://www.googleapis.com/auth/cloud-platform",
]
}
node_pools_labels = {
all = {}
default-node-pool = {
default-node-pool = "true"
}
}
node_pools_metadata = {
all = {}
default-node-pool = {
node-pool-metadata-custom-value = "my-node-pool"
}
}
node_pools_taints = {
all = []
default-node-pool = [
{
key = "default-node-pool"
value = "true"
effect = "PREFER_NO_SCHEDULE"
},
]
}
node_pools_tags = {
all = []
default-node-pool = [
"default-node-pool",
]
}
}
EOF
# Create GKE cluster via standard terraform client commands
terraform init
terraform plan
terraform apply
# Manually remove the GKE cluster's master authorized network (as per [this issue](https://github.com/terraform-providers/terraform-provider-google/issues/3098))
gcloud container clusters update private-gke-cluster-1 --region us-east1 --no-enable-master-authorized-networks
# Install knative CRDs as per official guidance
kubectl apply --selector knative.dev/crd-install=true \
--filename https://github.com/knative/serving/releases/download/v0.7.0/serving.yaml \
--filename https://github.com/knative/build/releases/download/v0.7.0/build.yaml \
--filename https://github.com/knative/eventing/releases/download/v0.7.0/release.yaml \
--filename https://github.com/knative/serving/releases/download/v0.7.0/monitoring.yaml
# Install knative controllers etc (sans the monitoring stack)
kubectl apply --filename https://github.com/knative/serving/releases/download/v0.7.0/serving.yaml --selector networking.knative.dev/certificate-provider!=cert-manager \
--filename https://github.com/knative/build/releases/download/v0.7.0/build.yaml \
--filename https://github.com/knative/eventing/releases/download/v0.7.0/release.yaml
# Once knative pods are up, run the following knative-build hello-world example
cat << EOF > hello-knative-build.yaml
apiVersion: build.knative.dev/v1alpha1
kind: Build
metadata:
name: hello
spec:
steps:
- image: busybox
args: ['echo', 'Hello, World!']
EOF
kubectl apply -f hello-knative-build.yaml
After running kubectl apply on the build manifest, no build resources are ever created on the cluster, and after about 30 seconds I receive the same timeout error message that the OP reported.
cc @tcnghia
Thanks for the detailed repro instructions. Early this week will be a bit chaotic shutting down 0.8, but this should be very helpful attempting to reproduce what you are seeing so that we can get your problem sorted out.
@mattmoor Hate to pester you, but I'm curious if there has been any update on the knative + private GKE issue.
I'll try to find someone to look into it. I pinged @tcnghia , but realized he is out today. Sorry for the delay.
No worries at all. Thanks for putting this on the radar.
I think this is a firewall issue, similar to that of https://github.com/elastic/cloud-on-k8s/issues/1437
Can you please try the workaround there? Thanks.
8443 is the port that you need to allow
https://github.com/knative/serving/blob/master/config/400-webhook-service.yaml#L26
The short explanation is that a GKE private cluster by default only allows the GKE master to access your Services at port 443 or 80. Our webhook uses 8443 here, so it needs to be white-listed.
Instructions for that are here: https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#add_firewall_rules
There may be other webhooks, like Istio's, that also need to be white-listed.
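A minimal sketch of the firewall rule described above, assuming the `default` network and the `default-node-pool` node tag from the repro earlier in this thread. The rule name `allow-master-to-knative-webhook` is hypothetical, and `172.16.0.0/28` is only a placeholder for your cluster's master CIDR. The helper prints the `gcloud` command rather than running it, so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch: print a gcloud command that lets the GKE master reach port 8443 on the nodes.
# The rule name is hypothetical; master CIDR, network and node tag are inputs you supply.
set -eu

firewall_cmd() {
  local master_cidr="$1" network="$2" node_tag="$3"
  printf 'gcloud compute firewall-rules create %s --network %s --direction INGRESS --action ALLOW --rules tcp:8443 --source-ranges %s --target-tags %s\n' \
    "allow-master-to-knative-webhook" "$network" "$master_cidr" "$node_tag"
}

# Example values (assumptions): 172.16.0.0/28 stands in for the real master CIDR.
firewall_cmd "172.16.0.0/28" "default" "default-node-pool"
```

The master CIDR for the cluster in the repro can likely be read with `gcloud container clusters describe private-gke-cluster-1 --region us-east1 --format='value(privateClusterConfig.masterIpv4CidrBlock)'`. Check the node tags on your own instances before applying the rule, since they may differ from the ones in the repro.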
My identical problem was resolved by @tcnghia's suggestion to add ingress 8443 to the firewall
BTW, the reason why 443 wasn't used is to avoid a privileged port (https://github.com/knative/build/pull/604).
I just looked at Istio's webhooks and it looks like they use 443, so there is no need for an additional rule for Istio. 8443 should be enough.
@sjmiller609 awesome! thanks a lot for confirmation.
@bbhuston if you could confirm this works, then we should discuss if/what changes we need to close this out.
@mattmoor Sorry for the delayed response. Was on an awesome vacation and was a little too lazy to check up on this.
Anyway, I reran the terraform/gke/knative setup that I posted above and then manually opened up port 8443 in the cluster's master and worker node firewall rules. And BOOM! It works. Thank you for the follow-up and please feel free to close this issue.
Thanks for confirming.
I think we'll need to update the docs with this information, since avoiding 443 is still a good path (it avoids a privileged port).
/close
@tcnghia: Closing this issue.
Now on the official GCP documentation!
I got this issue on microk8s on Windows:
D:\project_amanah\n8n-tutorial>kubectl apply -f serving-n8n.yaml
Error from server: error when creating "serving-n8n.yaml": conversion webhook for serving.knative.dev/v1, Kind=Service failed: Post https://webhook.knative-serving.svc:443/?timeout=30s: dial tcp 10.152.183.205:443: connect: connection refused
Any suggestions on what I should do to start diagnosing the cause and finding alternatives?
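As a starting point, note that this error differs from the one in the original report: that one was a timeout, which turned out to be a firewall problem, while this one is a refused connection. The sketch below maps the two messages to a likely first place to look. The categories and hints are illustrative assumptions, not Knative output, and `classify_webhook_error` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: map a webhook error message to a likely first place to look.
# The categories and hints are illustrative assumptions, not Knative output.

classify_webhook_error() {
  case "$1" in
    *"connection refused"*)
      echo "refused: check that the webhook pod is Running and the Service has endpoints" ;;
    *"Client.Timeout exceeded"*|*"i/o timeout"*)
      echo "timeout: check firewall or network policy between the API server and port 8443" ;;
    *)
      echo "unknown: inspect the webhook logs" ;;
  esac
}

classify_webhook_error 'dial tcp 10.152.183.205:443: connect: connection refused'
```

For the refused case, `kubectl -n knative-serving get pods` and `kubectl -n knative-serving get endpoints webhook` show whether anything is backing the `webhook` Service. Assuming the Deployment is also named `webhook`, `kubectl -n knative-serving logs deployment/webhook` shows its logs.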