Serving: Service getting scaled down on Timeout before Timeout?

Created on 11 Dec 2019 · 17Comments · Source: knative/serving

In what area(s)?

/area serving

What version of Knative?

0.9

Expected Behavior

Deployed a service with default revision timeout (300s).. however the service gets marked as timeout after 2m28s causing the deployment to be scaled down to 0 before the service has a chance to come up.

Actual Behavior

NAME                                                     READY   STATUS    RESTARTS   AGE
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5   0/3     Pending   0          105s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5   0/3     Pending   0          106s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5   0/3     Init:0/1   0          111s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5   0/3     Init:0/1   0          2m28s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5   0/3     Terminating   0          2m28s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5   0/3     Terminating   0          2m28s

Events:

$ k get events -n petrodeg-mds5-1
LAST SEEN   TYPE      REASON                OBJECT                                                         MESSAGE
16s         Normal    Synced                azurekeyvaultsecret/mir-int-westus2-tenant-viennadroptestkey   AzureKeyVaultSecret synced successfully
24m         Warning   FailedScheduling      pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5     0/14 nodes are available: 14 node(s) didn't match node selector, 3 Insufficient memory.
24m         Normal    NotTriggerScaleUp     pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5     pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector
23m         Warning   FailedScheduling      pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5     0/15 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 14 node(s) didn't match node selector, 3 Insufficient memory.
23m         Normal    Scheduled             pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5     Successfully assigned petrodeg-mds5-1/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 to k8s-petrodeg-mds5-1-42413502-vmss000001
23m         Normal    Pulling               pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5     Pulling image "docker.io/istio/proxy_init:1.3.0"
23m         Normal    Pulled                pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5     Successfully pulled image "docker.io/istio/proxy_init:1.3.0"
23m         Normal    Created               pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5     Created container istio-init
23m         Normal    Started               pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5     Started container istio-init
25m         Normal    SuccessfulCreate      replicaset/petrodeg-ms5-default-v9rdc-deployment-766bf565cc    Created pod: petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5
23m         Normal    SuccessfulDelete      replicaset/petrodeg-ms5-default-v9rdc-deployment-766bf565cc    Deleted pod: petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5
25m         Normal    ScalingReplicaSet     deployment/petrodeg-ms5-default-v9rdc-deployment               Scaled up replica set petrodeg-ms5-default-v9rdc-deployment-766bf565cc to 1
23m         Normal    ScalingReplicaSet     deployment/petrodeg-ms5-default-v9rdc-deployment               Scaled down replica set petrodeg-ms5-default-v9rdc-deployment-766bf565cc to 0
25m         Normal    Updated               metric/petrodeg-ms5-default-v9rdc                              Successfully updated metric status petrodeg-mds5-1/petrodeg-ms5-default-v9rdc
23m         Normal    Updated               serverlessservice/petrodeg-ms5-default-v9rdc                   Successfully updated ServerlessService "petrodeg-mds5-1/petrodeg-ms5-default-v9rdc"
23m         Warning   InternalError         revision/petrodeg-ms5-default-v9rdc                            Operation cannot be fulfilled on deployments.apps "petrodeg-ms5-default-v9rdc-deployment": the object has been modified; please apply your changes to the latest version and try again
23m         Warning   UpdateFailed          podautoscaler/petrodeg-ms5-default-v9rdc                       Failed to update status for PA "petrodeg-ms5-default-v9rdc": Operation cannot be fulfilled on podautoscalers.autoscaling.internal.knative.dev "petrodeg-ms5-default-v9rdc": the object has been modified; please apply your changes to the latest version and try again
25m         Normal    Created               configuration/petrodeg-ms5-default                             Created Revision "petrodeg-ms5-default-v9rdc"
23m         Warning   LatestCreatedFailed   configuration/petrodeg-ms5-default                             Latest created revision "petrodeg-ms5-default-v9rdc" has failed

Revision:

$ k describe revision petrodeg-ms5-default-v9rdc -n petrodeg-mds5-1
Name:         petrodeg-ms5-default-v9rdc
Namespace:    petrodeg-mds5-1
Labels:  serving.knative.dev/configuration=petrodeg-ms5-default
              serving.knative.dev/configurationGeneration=1
              serving.knative.dev/service=
              serving.kubeflow.org/kfservice=petrodeg-ms5
Annotations:  autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
              autoscaling.knative.dev/maxScale: 1
              autoscaling.knative.dev/minScale: 1
              autoscaling.knative.dev/target: 1
              autoscaling.knative.dev/targetBurstCapacity: 0
              networking.knative.dev/customExternalIPGatewayName: petrodeg-mds5-1/petrodeg-ms5-1
API Version:  serving.knative.dev/v1alpha1
Kind:         Revision
Metadata:
  Creation Timestamp:  2019-12-11T17:55:08Z
  Generate Name:       petrodeg-ms5-default-
  Generation:          1
  Owner References:
    API Version:           serving.knative.dev/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Configuration
    Name:                  petrodeg-ms5-default
    UID:                   5ed02123-1c3f-11ea-b469-000d3a067a44
  Resource Version:        28603988
  Self Link:               /apis/serving.knative.dev/v1alpha1/namespaces/petrodeg-mds5-1/revisions/petrodeg-ms5-default-v9rdc
  UID:                     5ed5e000-1c3f-11ea-9e0d-000d3a067133
Spec:
  Container Concurrency:  0
  Containers:
    Args:
      http
    Image:  kcorer/jump
    Name:   user-container
    Ports:
      Container Port:  8888
      Protocol:        TCP
    Readiness Probe:
      Success Threshold:  1
      Tcp Socket:
        Port:  0
    Resources:
      Limits:
        Cpu:     400m
        Memory:  400Mi
      Requests:
        Cpu:             400m
        Memory:          400Mi
  Timeout Seconds:       300
Status:
  Conditions:
    Last Transition Time:  2019-12-11T17:57:37Z
    Message:               The target could not be activated.
    Reason:                TimedOut
    Severity:              Info
    Status:                False
    Type:                  Active
    Last Transition Time:  2019-12-11T17:55:09Z
    Reason:                Deploying
    Status:                Unknown
    Type:                  ContainerHealthy
    Last Transition Time:  2019-12-11T17:57:10Z
    Reason:                PodInitializing
    Status:                False
    Type:                  Ready
    Last Transition Time:  2019-12-11T17:57:10Z
    Reason:                PodInitializing
    Status:                False
    Type:                  ResourcesAvailable
  Image Digest:            index.docker.io/kcorer/jump@sha256:4bd27b8fcad2b575b4c30fd0e8c360375020dc4e28923e3d8540779c1f72748b
  Log URL:                 http://localhost:8001/api/v1/namespaces/knative-monitoring/services/kibana-logging/proxy/app/kibana#/discover?_a=(query:(match:(kubernetes.labels.knative-dev%2FrevisionUID:(query:'5ed5e000-1c3f-11ea-9e0d-000d3a067133',type:phrase))))
  Observed Generation:     1
  Service Name:            petrodeg-ms5-default-v9rdc
Events:
  Type     Reason         Age                From                 Message
  ----     ------         ----               ----                 -------
  Warning  InternalError  24m (x2 over 24m)  revision-controller  Operation cannot be fulfilled on deployments.apps "petrodeg-ms5-default-v9rdc-deployment": the object has been modified; please apply your changes to the latest version and try again

Steps to Reproduce the Problem

Create a KN service in a ns that does not have any nodes (using taints/tolerations)
Wait a bit
Remove a taint on a node to allow the pod to be scheduled

kinbug

Source

rakelkar

Most helpful comment

@vagababov @mattmoor Is it ok to increase the constant to 300s? KFServing downloads the ML models from cloud storage before starting up the server, for large models it can take 1~2 minutes to download the model and also we need to account for ~30s for pod start up time, so I think 5 minutes is reasonable, thoughts?

https://github.com/knative/serving/blob/af63abe0d4aaf9ef9999dd46dc8e933d59b980b3/pkg/reconciler/revision/resources/constants.go#L36

yuzisun on 19 Apr 2020

👍2

All 17 comments

So the current logic here is: https://github.com/knative/serving/blob/master/pkg/reconciler/autoscaling/kpa/kpa.go#L273
If desired scale is 0 (and it seems you didn't send any requests and don't have minScale>0) and revision never activated -- we mark it as timed out.

Perhaps we need to wait to get the revision to start before deciding to scale down.
/cc @markusthoemmes @mattmoor @tanzeeb

vagababov on 11 Dec 2019

I suspect this might be to do with progressDeadlineSeconds that is currently hardcoded to 120s...

rakelkar on 11 Dec 2019

If desired scale is 0 (and it seems you didn't send any requests and don't have minScale>0) and revision never activated -- we mark it as timed out.

@vagababov I don't think that's the case, the scaler shouldn't scale to 0 until the pod is active, see https://github.com/knative/serving/blob/master/pkg/reconciler/autoscaling/kpa/scaler.go#L148

tanzeeb on 11 Dec 2019

I suspect this might be to do with progressDeadlineSeconds that is currently hardcoded to 120s...

@rakelkar i think you're right. I'm curious about why it's at 2m28s instead of 1m30s

tanzeeb on 11 Dec 2019

So its not progressDeadlineSeconds - I tried it with progressDeadlineSeconds set to 1 and the revision comes up fine... It is however related (or caused) by the fact that the revision gets set to failed when revision lifecycle marks resources as unavailable.

            // Update the revision status if pod cannot be scheduled(possibly resource constraints)
            // If pod cannot be scheduled then we expect the container status to be empty.
            for _, cond := range pod.Status.Conditions {
                if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
                    // COMMENTING THIS OUT ALLOWS THE REVISION TO PROCEED
                    // rev.Status.MarkResourcesUnavailable(cond.Reason, cond.Message)
                    break
                }
            }

As an aside I observed some random behavior I cannot explain when changing progressDeadline and redeploying the controller... (not related to the main issue):

deploy controller with progressDeadline=5s
deploy a service (watch it come up as expected - say pod xyz)
deploy a controller with progressDeadline=120s
Observe pod xyz get killed and new pod abc come up for updated deployment (so far so good)
Then observe autoscaler scale the deployment down to 0 causing the newly created abc to die
You now have a happy revision that has min=max=1 but zero actual pods :(

rakelkar on 13 Dec 2019

Gah - have to up ProgressDeadlineSecs as well... since a few lines down the code fails the revision if progressDeadline is exceeded...:

} else if w := status.State.Waiting; w != nil && hasDeploymentTimedOut(deployment) {
    logger.Infof("%s marking resources unavailable with: %s: %s", rev.Name, w.Reason, w.Message)
    rev.Status.MarkResourcesUnavailable(w.Reason, w.Message)
}

rakelkar on 14 Dec 2019

Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

knative-housekeeping-robot on 13 Mar 2020

@vagababov any thoughts on this, this is definitely a bumper which we hit so many times.

yuzisun on 14 Mar 2020

👍1

What is the situation that you're seeing? The pods don't come up fast enough and get killed?

vagababov on 14 Mar 2020

What is the situation that you're seeing? The pods don't come up fast enough and get killed?

Yes sometimes image is big and take sometime to download, knative should not scale down before the pod is ready right ?

yuzisun on 14 Mar 2020

Right now, we just _fail_ to activate the revision (as said above). We might want to tweak that constant, but it still is going to be a constant.
/cc mattmoor

vagababov on 14 Mar 2020

Stale issues rot after 30 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.
Rotten issues close after an additional 30 days of inactivity.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle rotten

knative-housekeeping-robot on 14 Apr 2020

/remove-lifecycle rotten

yuzisun on 18 Apr 2020

https://github.com/knative/serving/blob/af63abe0d4aaf9ef9999dd46dc8e933d59b980b3/pkg/reconciler/revision/resources/constants.go#L36

yuzisun on 19 Apr 2020

👍2

Rather than arbitrarily bumping it, let's add a knob to config-deployment?