/area API
We are currently specifying timeoutSeconds in the YAML. However, any integer greater than 600 throws the following error (where 601 is the value of timeoutSeconds in the YAML): "STDIN": Internal error occurred: admission webhook "webhook.serving.knative.dev" denied the request: mutation failed: expected 0s <= 601s <= 600s: spec.runLatest.configuration.revisionTemplate.spec.timeoutSeconds
We would like to be able to have quite long running tasks (potentially around the hour mark).
I couldn't find any info regarding this 600s limit and other people are having the same issue: https://stackoverflow.com/questions/55401253/how-to-increase-the-execution-time-for-apps-served-by-knative/55666887?noredirect=1#comment98138638_55666887
Our yaml setup is below:
apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: test
  namespace: production
spec:
  runLatest:
    configuration:
      revisionTemplate:
        metadata:
          annotations:
            autoscaling.knative.dev/maxScale: "200"
        spec:
          containerConcurrency: 80
          timeoutSeconds: 601
          container:
            image: $CONTAINER_IMAGE
            resources:
              requests:
                memory: "128Mi"
                cpu: "200m"
              limits:
                memory: "256Mi"
                cpu: "400m"
            env:
              - name: NODE_ENV
                value: production
This is using Knative 0.5
Knative is fab - thanks for all of your hard work! :)
/area api
/cc @mattmoor
The 600s timeout max seems to be based on the 10m timeout ( https://github.com/knative/serving/blob/master/pkg/apis/networking/register.go#L72 ) that we use for configuring the Istio ClusterIngress and VirtualService.
If the timeout for the Knative Service was set higher than 600s the connection would be interrupted at these layers before the request completes.
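To make the rejection concrete, here is a minimal Go sketch of the kind of range check that would produce the error quoted in the report. It is not Knative's actual validation code: the constant `maxRevisionTimeout` and the helper `validateTimeoutSeconds` are hypothetical names, and the 600s ceiling is simply the 10m value discussed above.

```go
package main

import (
	"fmt"
	"time"
)

// maxRevisionTimeout mirrors the 10-minute ceiling discussed in this thread.
// The name is hypothetical, not Knative's actual identifier.
const maxRevisionTimeout = 600 * time.Second

// validateTimeoutSeconds is a hypothetical helper that rejects any value
// outside the closed range [0s, maxRevisionTimeout].
func validateTimeoutSeconds(seconds int64) error {
	d := time.Duration(seconds) * time.Second
	if d < 0 || d > maxRevisionTimeout {
		return fmt.Errorf("expected 0s <= %ds <= %ds: timeoutSeconds",
			seconds, int64(maxRevisionTimeout/time.Second))
	}
	return nil
}

func main() {
	// Try a value well inside the range, the boundary, and one past it.
	for _, s := range []int64{300, 600, 601} {
		if err := validateTimeoutSeconds(s); err != nil {
			fmt.Printf("%d: rejected: %v\n", s, err)
			continue
		}
		fmt.Printf("%d: accepted\n", s)
	}
}
```

Under these assumptions 600 is still accepted and 601 is the first rejected value, which matches the behaviour described in the report.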
It is not clear to me why we picked 10 minutes here. It looks like it has been increased previously ( https://github.com/knative/serving/pull/2867 ).
How high are we talking? 15m? 30m? 10h? :)
We can consider raising this, but at some point extremely long sessions become a form of state, and we're not looking to support stateful workloads. Above a certain point it also starts to look a lot like we're abusing Serving as a batch system, which it is not.
40m would cover our use cases - we have a few longer running background tasks. (although ∞ sounds exciting... 🙈)
> We can consider raising this, but at some point extremely long sessions become a form of state, and we're not looking to support stateful workloads. Above a certain point it also starts to look a lot like we're abusing Serving as a batch system, which it is not.
I see your point. However, should Knative be forcing limits upon developers? It seems that if a longer time has no ill effects regarding the underlying infrastructure, then it should be the developer's choice. Instead of enforcing a limit, maybe the Knative documentation should advise developers to use a time limit of no longer than x, due to statefulness, batching, etc.
> it seems that if a longer time has no ill effects regarding the underlying infrastructure
I think the problem that Matt is trying to highlight is that as the timeout gets longer and longer it actually does affect the underlying infrastructure, because it limits the types of operations that Knative can perform without risk of impact to the application. To honor the timeout and prevent premature connection termination, we must wait for all connections to a Pod to drain before killing it. This can slow down the speed of autoscale downscaling, cluster node replacement, Pod rescheduling, and Knative sidecar updates. Leaving it completely open-ended does not make it easy for Knative to reason about operational actions.
> can slow down the speed of autoscale downscaling, cluster node replacement, Pod rescheduling, and Knative sidecar updates
Good points, had not thought of it from this angle.
Ultimately, it's the Knative team's call as you will all have the best understanding of the whole system, however, if there is any chance of getting to 40 mins that would be amazing (as I see it this time wouldn't allow proper long running stateful loads that can last hours, days, etc.)
I think part of the issue with the timeout is that it isn't documented in the service docs (happy to create a PR to add this once a conclusion is reached either way), so we were under the impression that, although we want to keep the timeout as low as possible, there was no limit. Also, Google Cloud Run's per-request timeout is 15m. From my understanding Cloud Run is based on Knative, so I assume the Google team has forked Knative to do this.
Appreciate all of your time - love Knative, amazing how fast your release cycle is!
Sorry for the slow reply. There is some work happening in this area (see related issue). Making the timeout we use for Istio configurable unblocks making this larger.