Serving: External route stops working after a while (503 Service Unavailable)

Created on 5 Nov 2019 · 5 Comments · Source: knative/serving

/area networking
/kind bug

tl;dr: a deployed KService stops working after a while and starts returning 503 from the gateway when invoked externally. The cluster-local domain keeps working fine. Redeploying fixes the issue, only for it to break again N hours later.

What version of Knative?

v0.9.0-gke.3, istio no-mesh mode

Expected Behavior

I deploy a very basic KService like this:

apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: hello
spec:
  template:
    spec:
      containers:
      - image: gcr.io/google-samples/hello-app:1.0

I query it like this:

curl -vH "Host: hello.default.example.com" 34.68.171.188

Nothing interesting.

Actual Behavior

After some time (~6-12 hours based on my estimates) this service starts returning HTTP 503 Service Unavailable from the gateway. However, the cluster-local route continues to work fine.

curl -vH "Host: hello.default.example.com" 34.68.171.188
< HTTP/1.1 503 Service Unavailable
< date: Tue, 05 Nov 2019 22:49:04 GMT
< server: istio-envoy
< content-length: 0
<
* Connection #0 to host 34.68.171.188 left intact

Traffic isn't even making it to the activator or the pod: the pod doesn't scale up from 0 to 1 when this happens.

If I take the same YAML, change name: hello to name: hello2, and redeploy it as a new KService, it works just fine.

I've been observing this for several days. Deleting and redeploying works around the problem, but I am not able to explain why. Here are some outputs:

kubectl get service
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"serving.knative.dev/v1alpha1","kind":"Service","metadata":{"annotations":{},"name":"hello","namespace":"default"},"spec":{"template":{"spec":{"containers":[{"image":"gcr.io/google-samples/hello-app:1.0"}]}}}}
    serving.knative.dev/creator: [email protected]
    serving.knative.dev/lastModifier: [email protected]
  creationTimestamp: "2019-11-05T05:28:51Z"
  generation: 1
  name: hello
  namespace: default
  resourceVersion: "43474394"
  selfLink: /apis/serving.knative.dev/v1/namespaces/default/services/hello
  uid: 26e02c24-ff8d-11e9-a378-42010a80012d
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containerConcurrency: 0
      containers:
      - image: gcr.io/google-samples/hello-app:1.0
        name: user-container
        readinessProbe:
          successThreshold: 1
          tcpSocket:
            port: 0
        resources: {}
      timeoutSeconds: 300
  traffic:
  - latestRevision: true
    percent: 100
status:
  address:
    url: http://hello.default.svc.cluster.local
  conditions:
  - lastTransitionTime: "2019-11-05T05:28:54Z"
    status: "True"
    type: ConfigurationsReady
  - lastTransitionTime: "2019-11-05T05:28:56Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2019-11-05T05:28:56Z"
    status: "True"
    type: RoutesReady
  latestCreatedRevisionName: hello-kmjgg
  latestReadyRevisionName: hello-kmjgg
  observedGeneration: 1
  traffic:
  - latestRevision: true
    percent: 100
    revisionName: hello-kmjgg
  url: http://hello.default.example.com


kubectl get revision

apiVersion: serving.knative.dev/v1
kind: Revision
metadata:
  annotations:
    serving.knative.dev/creator: [email protected]
    serving.knative.dev/lastPinned: "1572991246"
  creationTimestamp: "2019-11-05T05:28:51Z"
  generateName: hello-
  generation: 1
  labels:
    serving.knative.dev/configuration: hello
    serving.knative.dev/configurationGeneration: "1"
    serving.knative.dev/route: hello
    serving.knative.dev/service: hello
  name: hello-kmjgg
  namespace: default
  ownerReferences:
  - apiVersion: serving.knative.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Configuration
    name: hello
    uid: 26e19591-ff8d-11e9-a378-42010a80012d
  resourceVersion: "43690190"
  selfLink: /apis/serving.knative.dev/v1/namespaces/default/revisions/hello-kmjgg
  uid: 26e3420f-ff8d-11e9-a378-42010a80012d
spec:
  containerConcurrency: 0
  containers:
  - image: gcr.io/google-samples/hello-app:1.0
    name: user-container
    readinessProbe:
      successThreshold: 1
      tcpSocket:
        port: 0
    resources: {}
  timeoutSeconds: 300
status:
  conditions:
  - lastTransitionTime: "2019-11-05T05:29:54Z"
    message: The target is not receiving traffic.
    reason: NoTraffic
    severity: Info
    status: "False"
    type: Active
  - lastTransitionTime: "2019-11-05T05:28:54Z"
    status: "True"
    type: ContainerHealthy
  - lastTransitionTime: "2019-11-05T05:28:54Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2019-11-05T05:28:54Z"
    status: "True"
    type: ResourcesAvailable
  imageDigest: gcr.io/google-samples/hello-app@sha256:c62ead5b8c15c231f9e786250b07909daf6c266d0fcddd93fea882eb722c3be4
  logUrl: https://console.cloud.google.com/logs/viewer?advancedFilter=resource.type%3D%22k8s_container%22%0Aresource.labels.container_name%3D%22user-container%22%0Alabels.%22k8s-pod%2Fserving_knative_dev%2FrevisionUID%22%3D%2226e3420f-ff8d-11e9-a378-42010a80012d%22
  observedGeneration: 1
  serviceName: hello-kmjgg


kubectl get route

apiVersion: serving.knative.dev/v1
kind: Route
metadata:
  annotations:
    serving.knative.dev/creator: [email protected]
    serving.knative.dev/lastModifier: [email protected]
  creationTimestamp: "2019-11-05T05:28:51Z"
  finalizers:
  - routes.serving.knative.dev
  generation: 1
  labels:
    serving.knative.dev/service: hello
  name: hello
  namespace: default
  ownerReferences:
  - apiVersion: serving.knative.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Service
    name: hello
    uid: 26e02c24-ff8d-11e9-a378-42010a80012d
  resourceVersion: "43474393"
  selfLink: /apis/serving.knative.dev/v1/namespaces/default/routes/hello
  uid: 26e3fdf6-ff8d-11e9-a378-42010a80012d
spec:
  traffic:
  - configurationName: hello
    latestRevision: true
    percent: 100
status:
  address:
    url: http://hello.default.svc.cluster.local
  conditions:
  - lastTransitionTime: "2019-11-05T05:28:54Z"
    status: "True"
    type: AllTrafficAssigned
  - lastTransitionTime: "2019-11-05T05:28:56Z"
    status: "True"
    type: IngressReady
  - lastTransitionTime: "2019-11-05T05:28:56Z"
    status: "True"
    type: Ready
  observedGeneration: 1
  traffic:
  - latestRevision: true
    percent: 100
    revisionName: hello-kmjgg
  url: http://hello.default.example.com



kubectl get ingress.networking

apiVersion: networking.internal.knative.dev/v1alpha1
kind: Ingress
metadata:
  annotations:
    networking.knative.dev/ingress.class: istio.ingress.networking.knative.dev
    serving.knative.dev/creator: [email protected]
    serving.knative.dev/lastModifier: [email protected]
  creationTimestamp: "2019-11-05T05:28:54Z"
  generation: 1
  labels:
    serving.knative.dev/route: hello
    serving.knative.dev/routeNamespace: default
    serving.knative.dev/service: hello
  name: hello
  namespace: default
  ownerReferences:
  - apiVersion: serving.knative.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Route
    name: hello
    uid: 26e3fdf6-ff8d-11e9-a378-42010a80012d
  resourceVersion: "43690196"
  selfLink: /apis/networking.internal.knative.dev/v1alpha1/namespaces/default/ingresses/hello
  uid: 28a0037b-ff8d-11e9-a378-42010a80012d
spec:
  rules:
  - hosts:
    - hello.default.svc.cluster.local
    - hello.default.example.com
    http:
      paths:
      - retries:
          attempts: 3
          perTryTimeout: 15m0s
        splits:
        - appendHeaders:
            Knative-Serving-Namespace: default
            Knative-Serving-Revision: hello-kmjgg
          percent: 100
          serviceName: hello-kmjgg
          serviceNamespace: default
          servicePort: 80
        timeout: 15m0s
    visibility: ExternalIP
  visibility: ExternalIP
status:
  conditions:
  - lastTransitionTime: "2019-11-05T22:00:46Z"
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2019-11-05T05:28:54Z"
    status: "True"
    type: NetworkConfigured
  - lastTransitionTime: "2019-11-05T22:00:46Z"
    status: "True"
    type: Ready
  loadBalancer:
    ingress:
    - domainInternal: istio-ingress.gke-system.svc.cluster.local
  observedGeneration: 1
  privateLoadBalancer:
    ingress:
    - domainInternal: cluster-local-gateway.gke-system.svc.cluster.local
  publicLoadBalancer:
    ingress:
    - domainInternal: istio-ingress.gke-system.svc.cluster.local


kubectl get virtualservice

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  annotations:
    networking.knative.dev/ingress.class: istio.ingress.networking.knative.dev
    serving.knative.dev/creator: [email protected]
    serving.knative.dev/lastModifier: [email protected]
  creationTimestamp: "2019-11-05T05:28:54Z"
  generation: 1
  labels:
    serving.knative.dev/route: hello
    serving.knative.dev/routeNamespace: default
  name: hello
  namespace: default
  ownerReferences:
  - apiVersion: networking.internal.knative.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Ingress
    name: hello
    uid: 28a0037b-ff8d-11e9-a378-42010a80012d
  resourceVersion: "43474383"
  selfLink: /apis/networking.istio.io/v1alpha3/namespaces/default/virtualservices/hello
  uid: 28ae02b1-ff8d-11e9-a378-42010a80012d
spec:
  gateways:
  - knative-serving/cluster-local-gateway
  - knative-serving/gke-system-gateway
  - knative-serving/knative-ingress-gateway
  hosts:
  - hello.default
  - hello.default.example.com
  - hello.default.svc
  - hello.default.svc.cluster.local
  - c0d2f6b75318fcbab3006314bec06026.probe.invalid
  http:
  - match:
    - authority:
        regex: ^hello\.default\.example\.com(?::\d{1,5})?$
      gateways:
      - knative-serving/gke-system-gateway
      - knative-serving/knative-ingress-gateway
    - authority:
        regex: ^hello\.default(\.svc(\.cluster\.local)?)?(?::\d{1,5})?$
      gateways:
      - knative-serving/cluster-local-gateway
    retries:
      attempts: 3
      perTryTimeout: 15m0s
    route:
    - destination:
        host: hello-kmjgg.default.svc.cluster.local
        port:
          number: 80
      headers:
        request:
          add:
            Knative-Serving-Namespace: default
            Knative-Serving-Revision: hello-kmjgg
      weight: 100
    timeout: 15m0s
    websocketUpgrade: true
  - fault:
      abort:
        httpStatus: 200
        percent: 100
    match:
    - authority:
        exact: c0d2f6b75318fcbab3006314bec06026.probe.invalid
    route:
    - destination:
        host: null.invalid
        port:
          number: 80
      weight: 0

(no logs on istio-ingressgateway-* pod while I query the ksvc)

(no logs on activator* pod while I query the ksvc)

Steps to Reproduce the Problem

Use the yaml above, wait for several hours. Then query the service, observe 503.

Use the same yaml, change name, redeploy, query. Observe http 200.
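
Because the failure only shows up after several hours, it can help to narrow down when it starts. The following is a minimal sketch, not part of the original report, that repeats the curl check above on a timer and reports how long it took for the first non-200 response to appear. The function names, the 5-minute interval, and the probe limit are all assumptions chosen for illustration. Note that probing generates traffic of its own, so the service may not sit idle the way it did in the original observation.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_status(ip: str, host: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET to `ip` sent with the given Host header."""
    req = urllib.request.Request(f"http://{ip}/", headers={"Host": host})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Error responses such as the 503 from the gateway are raised as HTTPError.
        return err.code


def wait_for_failure(
    probe: Callable[[], int],
    interval_s: float = 300.0,   # assumed polling interval
    max_probes: int = 288,       # assumed limit (about 24 hours at 5 minutes)
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> Optional[float]:
    """Poll until `probe` returns a non-200 status.

    Returns the seconds elapsed since the first probe, or None if every
    probe returned 200.
    """
    start = clock()
    for attempt in range(max_probes):
        if probe() != 200:
            return clock() - start
        if attempt < max_probes - 1:
            sleep(interval_s)
    return None


if __name__ == "__main__":
    # Address and host taken from the curl command in this report.
    elapsed = wait_for_failure(
        lambda: fetch_status("34.68.171.188", "hello.default.example.com")
    )
    print("never failed" if elapsed is None else f"first failure after {elapsed:.0f}s")
```

The probe, sleep, and clock are passed in as callables so the polling logic can be exercised without a cluster.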

All 5 comments

How do you check that traffic doesn't make it to the activator?
Are there any relevant logs in the activator?

Because of an upgrade from 0.6 to 0.9, some orphaned VirtualServices were left in knative-serving, leading to an invalid Envoy config (non-existent backend).

Indeed that was the problem. We can /close if direct upgrades are not supported (at least for now).
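
For anyone hitting the same symptom, the diagnosis above (VirtualServices whose routes point at backends that no longer exist) can be checked roughly as follows. This is a sketch under stated assumptions, not the procedure used in this thread: the helper names are hypothetical, the cluster domain is assumed to be cluster.local, and only fully qualified destination hosts are compared. The null.invalid host used by the probe rule shown in the VirtualService above is ignored on purpose.

```python
import json
import sys
from typing import Dict, FrozenSet, Iterable, List, Set


def service_fqdns(services: Iterable[Dict], cluster_domain: str = "cluster.local") -> Set[str]:
    """Build the set of FQDNs for existing Kubernetes Services (cluster domain is an assumption)."""
    return {
        f"{svc['metadata']['name']}.{svc['metadata']['namespace']}.svc.{cluster_domain}"
        for svc in services
    }


def route_hosts(virtual_service: Dict) -> List[str]:
    """Collect every destination host referenced by a VirtualService's HTTP routes."""
    hosts = []
    for http_rule in virtual_service.get("spec", {}).get("http", []):
        for route in http_rule.get("route", []):
            host = route.get("destination", {}).get("host")
            if host:
                hosts.append(host)
    return hosts


def find_dangling(
    virtual_services: Iterable[Dict],
    known_hosts: Set[str],
    ignore: FrozenSet[str] = frozenset({"null.invalid"}),
) -> Dict[str, List[str]]:
    """Map "namespace/name" of each VirtualService to the destination hosts that match no Service."""
    dangling: Dict[str, List[str]] = {}
    for vs in virtual_services:
        missing = [h for h in route_hosts(vs) if h not in known_hosts and h not in ignore]
        if missing:
            key = f"{vs['metadata']['namespace']}/{vs['metadata']['name']}"
            dangling[key] = missing
    return dangling


if __name__ == "__main__":
    # Expects two files produced beforehand, for example with:
    #   kubectl get virtualservice --all-namespaces -o json > vs.json
    #   kubectl get service --all-namespaces -o json > svc.json
    with open(sys.argv[1]) as f:
        vs_items = json.load(f)["items"]
    with open(sys.argv[2]) as f:
        svc_items = json.load(f)["items"]
    for name, hosts in find_dangling(vs_items, service_fqdns(svc_items)).items():
        print(name, "->", ", ".join(hosts))
```

Any VirtualService this prints is only a candidate for an orphan. Destinations written as short names, or hosts defined through other mechanisms, would be reported as false positives, so each result should be reviewed before anything is deleted.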

/close

@JRBANCEL: Closing this issue.

