/area serving
0.9
Deployed a service with default revision timeout (300s).. however the service gets marked as timeout after 2m28s causing the deployment to be scaled down to 0 before the service has a chance to come up.
NAME READY STATUS RESTARTS AGE
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Pending 0 105s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Pending 0 106s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Init:0/1 0 111s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Init:0/1 0 2m28s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Terminating 0 2m28s
petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/3 Terminating 0 2m28s
Events:
$ k get events -n petrodeg-mds5-1
LAST SEEN TYPE REASON OBJECT MESSAGE
16s Normal Synced azurekeyvaultsecret/mir-int-westus2-tenant-viennadroptestkey AzureKeyVaultSecret synced successfully
24m Warning FailedScheduling pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/14 nodes are available: 14 node(s) didn't match node selector, 3 Insufficient memory.
24m Normal NotTriggerScaleUp pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector
23m Warning FailedScheduling pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 0/15 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 14 node(s) didn't match node selector, 3 Insufficient memory.
23m Normal Scheduled pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Successfully assigned petrodeg-mds5-1/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 to k8s-petrodeg-mds5-1-42413502-vmss000001
23m Normal Pulling pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Pulling image "docker.io/istio/proxy_init:1.3.0"
23m Normal Pulled pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Successfully pulled image "docker.io/istio/proxy_init:1.3.0"
23m Normal Created pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Created container istio-init
23m Normal Started pod/petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5 Started container istio-init
25m Normal SuccessfulCreate replicaset/petrodeg-ms5-default-v9rdc-deployment-766bf565cc Created pod: petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5
23m Normal SuccessfulDelete replicaset/petrodeg-ms5-default-v9rdc-deployment-766bf565cc Deleted pod: petrodeg-ms5-default-v9rdc-deployment-766bf565cc-k5ph5
25m Normal ScalingReplicaSet deployment/petrodeg-ms5-default-v9rdc-deployment Scaled up replica set petrodeg-ms5-default-v9rdc-deployment-766bf565cc to 1
23m Normal ScalingReplicaSet deployment/petrodeg-ms5-default-v9rdc-deployment Scaled down replica set petrodeg-ms5-default-v9rdc-deployment-766bf565cc to 0
25m Normal Updated metric/petrodeg-ms5-default-v9rdc Successfully updated metric status petrodeg-mds5-1/petrodeg-ms5-default-v9rdc
23m Normal Updated serverlessservice/petrodeg-ms5-default-v9rdc Successfully updated ServerlessService "petrodeg-mds5-1/petrodeg-ms5-default-v9rdc"
23m Warning InternalError revision/petrodeg-ms5-default-v9rdc Operation cannot be fulfilled on deployments.apps "petrodeg-ms5-default-v9rdc-deployment": the object has been modified; please apply your changes to the latest version and try again
23m Warning UpdateFailed podautoscaler/petrodeg-ms5-default-v9rdc Failed to update status for PA "petrodeg-ms5-default-v9rdc": Operation cannot be fulfilled on podautoscalers.autoscaling.internal.knative.dev "petrodeg-ms5-default-v9rdc": the object has been modified; please apply your changes to the latest version and try again
25m Normal Created configuration/petrodeg-ms5-default Created Revision "petrodeg-ms5-default-v9rdc"
23m Warning LatestCreatedFailed configuration/petrodeg-ms5-default Latest created revision "petrodeg-ms5-default-v9rdc" has failed
Revision:
$ k describe revision petrodeg-ms5-default-v9rdc -n petrodeg-mds5-1
Name: petrodeg-ms5-default-v9rdc
Namespace: petrodeg-mds5-1
Labels: serving.knative.dev/configuration=petrodeg-ms5-default
serving.knative.dev/configurationGeneration=1
serving.knative.dev/service=
serving.kubeflow.org/kfservice=petrodeg-ms5
Annotations: autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/maxScale: 1
autoscaling.knative.dev/minScale: 1
autoscaling.knative.dev/target: 1
autoscaling.knative.dev/targetBurstCapacity: 0
networking.knative.dev/customExternalIPGatewayName: petrodeg-mds5-1/petrodeg-ms5-1
API Version: serving.knative.dev/v1alpha1
Kind: Revision
Metadata:
Creation Timestamp: 2019-12-11T17:55:08Z
Generate Name: petrodeg-ms5-default-
Generation: 1
Owner References:
API Version: serving.knative.dev/v1alpha1
Block Owner Deletion: true
Controller: true
Kind: Configuration
Name: petrodeg-ms5-default
UID: 5ed02123-1c3f-11ea-b469-000d3a067a44
Resource Version: 28603988
Self Link: /apis/serving.knative.dev/v1alpha1/namespaces/petrodeg-mds5-1/revisions/petrodeg-ms5-default-v9rdc
UID: 5ed5e000-1c3f-11ea-9e0d-000d3a067133
Spec:
Container Concurrency: 0
Containers:
Args:
http
Image: kcorer/jump
Name: user-container
Ports:
Container Port: 8888
Protocol: TCP
Readiness Probe:
Success Threshold: 1
Tcp Socket:
Port: 0
Resources:
Limits:
Cpu: 400m
Memory: 400Mi
Requests:
Cpu: 400m
Memory: 400Mi
Timeout Seconds: 300
Status:
Conditions:
Last Transition Time: 2019-12-11T17:57:37Z
Message: The target could not be activated.
Reason: TimedOut
Severity: Info
Status: False
Type: Active
Last Transition Time: 2019-12-11T17:55:09Z
Reason: Deploying
Status: Unknown
Type: ContainerHealthy
Last Transition Time: 2019-12-11T17:57:10Z
Reason: PodInitializing
Status: False
Type: Ready
Last Transition Time: 2019-12-11T17:57:10Z
Reason: PodInitializing
Status: False
Type: ResourcesAvailable
Image Digest: index.docker.io/kcorer/jump@sha256:4bd27b8fcad2b575b4c30fd0e8c360375020dc4e28923e3d8540779c1f72748b
Log URL: http://localhost:8001/api/v1/namespaces/knative-monitoring/services/kibana-logging/proxy/app/kibana#/discover?_a=(query:(match:(kubernetes.labels.knative-dev%2FrevisionUID:(query:'5ed5e000-1c3f-11ea-9e0d-000d3a067133',type:phrase))))
Observed Generation: 1
Service Name: petrodeg-ms5-default-v9rdc
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning InternalError 24m (x2 over 24m) revision-controller Operation cannot be fulfilled on deployments.apps "petrodeg-ms5-default-v9rdc-deployment": the object has been modified; please apply your changes to the latest version and try again
Create a KN service in a ns that does not have any nodes (using taints/tolerations)
Wait a bit
Remove a taint on a node to allow the pod to be scheduled
So the current logic here is: https://github.com/knative/serving/blob/master/pkg/reconciler/autoscaling/kpa/kpa.go#L273
If desired scale is 0 (and it seems you didn't send any requests and don't have minScale>0) and revision never activated -- we mark it as timed out.
Perhaps we need to wait to get the revision to start before deciding to scale down.
/cc @markusthoemmes @mattmoor @tanzeeb
I suspect this might be to do with progressDeadlineSeconds that is currently hardcoded to 120s...
If desired scale is 0 (and it seems you didn't send any requests and don't have minScale>0) and revision never activated -- we mark it as timed out.
@vagababov I don't think that's the case, the scaler shouldn't scale to 0 until the pod is active, see https://github.com/knative/serving/blob/master/pkg/reconciler/autoscaling/kpa/scaler.go#L148
I suspect this might be to do with progressDeadlineSeconds that is currently hardcoded to 120s...
@rakelkar i think you're right. I'm curious about why it's at 2m28s instead of 1m30s
So its not progressDeadlineSeconds - I tried it with progressDeadlineSeconds set to 1 and the revision comes up fine... It is however related (or caused) by the fact that the revision gets set to failed when revision lifecycle marks resources as unavailable.
// Update the revision status if pod cannot be scheduled(possibly resource constraints)
// If pod cannot be scheduled then we expect the container status to be empty.
for _, cond := range pod.Status.Conditions {
if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
// COMMENTING THIS OUT ALLOWS THE REVISION TO PROCEED
// rev.Status.MarkResourcesUnavailable(cond.Reason, cond.Message)
break
}
}
As an aside I observed some random behavior I cannot explain when changing progressDeadline and redeploying the controller... (not related to the main issue):
Gah - have to up ProgressDeadlineSecs as well... since a few lines down the code fails the revision if progressDeadline is exceeded...:
} else if w := status.State.Waiting; w != nil && hasDeploymentTimedOut(deployment) {
logger.Infof("%s marking resources unavailable with: %s: %s", rev.Name, w.Reason, w.Message)
rev.Status.MarkResourcesUnavailable(w.Reason, w.Message)
}
Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.
Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.
/lifecycle stale
@vagababov any thoughts on this, this is definitely a bumper which we hit so many times.
What is the situation that you're seeing? The pods don't come up fast enough and get killed?
What is the situation that you're seeing? The pods don't come up fast enough and get killed?
Yes sometimes image is big and take sometime to download, knative should not scale down before the pod is ready right ?
Right now, we just _fail_ to activate the revision (as said above). We might want to tweak that constant, but it still is going to be a constant.
/cc mattmoor
Stale issues rot after 30 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.
Rotten issues close after an additional 30 days of inactivity.
If this issue is safe to close now please do so by adding the comment /close.
Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.
/lifecycle rotten
/remove-lifecycle rotten
@vagababov @mattmoor Is it ok to increase the constant to 300s? KFServing downloads the ML models from cloud storage before starting up the server, for large models it can take 1~2 minutes to download the model and also we need to account for ~30s for pod start up time, so I think 5 minutes is reasonable, thoughts?
Rather than arbitrarily bumping it, let's add a knob to config-deployment?
Sure, I can try to squeeze this today/tomorrow.
Thanks @vagababov @mattmoor !
Most helpful comment
@vagababov @mattmoor Is it ok to increase the constant to 300s? KFServing downloads the ML models from cloud storage before starting up the server, for large models it can take 1~2 minutes to download the model and also we need to account for ~30s for pod start up time, so I think 5 minutes is reasonable, thoughts?
https://github.com/knative/serving/blob/af63abe0d4aaf9ef9999dd46dc8e933d59b980b3/pkg/reconciler/revision/resources/constants.go#L36