POST requests should not be retried in activator. Instead, activator should use health check signals to wait for the deployment to be fully resolved and working and then do a single POST request.
/assign @pivotal-sukhil-suresh
@mdemirhan - checking if this issue is blocked by #1241, which is still marked open?
I don't think it should be blocked on that work. We should make this work for the default port for now. When (and if) we fix #1241, we should do it in a way that it enhances your changes in this work to support different ports. Let me know if that makes sense.
@mdemirhan thanks for confirming that #1241 is not a blocker. I see your point; ideally, changes made for #1241 will be invisible to the changes (readiness probe calls) made for this issue. Happy to move ahead!
Paired with @shashwathi on this yesterday and some thoughts...
@mdemirhan: we assume the intent is to avoid retries for ALL requests GET/POST (when waiting for forwarding to the revision pod). Basing this on your previous comment on a separate issue - https://github.com/knative/serving/pull/1292#issuecomment-398840154
The activator/revision.go ActiveEndpoint already does a watch on the revision status and waits until the status is ready. And our understanding is that the activator/main.go retries are specifically for istio route rule update delay related problems. It would be nice if you could confirm.
Considering that the readiness/liveness probe is not mandated for user-container, it seems like the alternate is to use the queue-proxy readinessProbe. But, this /health endpoint only reports the readiness of the queue-proxy itself and does not necessarily reflect the user-container readiness.
CC @tcnghia @akyyy @nikkithurmond @josephburnett
When the health status shows that the pod is healthy, we found that it is not immediately usable by the activator. The reason is, activator use service Ip to communicate with the revision and a pod being ready does not immediately translate into the service being ready to take traffic. In our experience, we found a delay of 3-4 seconds after the pod is ready, that the service can route traffic to the pod (most likely ip table configuration for the service ip to pod ip takes some time).
I like the idea of checking for queue-proxy's health if the user doesn't have a liveness/readiness probe. That would be helpful. Based on your suggestions, how does the following flow work:
I am partial on whether we should do a small number of retries for GET or just a single one. Let me know if this makes sense.
@mdemirhan sorry about the delay in getting back. Yes, the [New behavior] that you have listed is doable and does make sense. Will proceed with this approach to replace the http retries.
Also, seems like there is a longer-term solution being considered in #1509. @shashwathi had suggested something similar last week - the idea of using TCP health checks using istio-proxy (the other container) on the revision pod. I had missed mentioning it in my previous comment
CC @dprotaso @tanzeeb @bsnchan
Paired with @shashwathi on this.
@mdemirhan: Wanted to get your thoughts on some problems we encountered
Based on the K8s docs, the readinessProbe action can be defined by HTTPGet, TCPSocket, and Exec. For the initial iteration, the intent was to focus on the HTTPGet and TCPSocket only.
The good news is that we were able to successfully use the user-defined HTTPGet based readiness probe (instead of making multiple attempts at forwarding request to the revision pod).
However, we are running into problems with the user-defined TCPSocket based probes. TCPSocket based probes have only two values - host and port. The user is not allowed to specify port by the webhook validation. And for the host address, irrespective of whatever the user specifies, we are using the internal app service address.
So, at this point, we think the story may be blocked on issue #1241 at least for user-defined TCPSocket based probes.
Does it sound reasonable to submit a PR with support for user-defined HTTPGet probes and fall back on the default probe (queue-proxy readiness check)? In this case, if the user provides any non-HTTP probes, activator ignores it and uses default probe.
Thanks for the detailed information Sukhil. That plan sounds good to me.
@tcnghia @akyyy @josephburnett FYI
Summarizing discussion with @mdemirhan & @shashwathi :
We are trying to tackle the problem if user-defined readiness check probes are not provided and we want to use queue-proxy readiness check. However, the queue proxy exposes only port 8012 (RequestQueuePort) through service port mapping and so we are unable to target the queue proxy health handler from the activator.
Currently, the queue proxy /health endpoint is served on port 8022 (RequestQueueAdminPort). This port is not accessible through the service FQDN, so we are thinking of following options:
Queue proxy starts another health server on X port and expose this X port in service port mapping. In this option, the activator can use service http://service-FQDN:X/health to do default readiness check. This option also involves removing the /health endpoint from 8022 (RequestQueueAdminPort).
Expose 8022 (RequestQueueAdminPort) on queue proxy through service port mapping. One potential problem is /quitquitquit endpoint on this port is used as preStop hook so we are not sure about security implications of exposing this in service port mapping.
We are leaning towards option 1. What do you think? @tcnghia @josephburnett @ZhiminXiang
Feel free to correct us if we misunderstood the conversation @mdemirhan
@mdemirhan @pivotal-sukhil-suresh @shashwathi @tcnghia
I am a little bit concerned about using health check signal of queue proxy without any retry when user-defined readiness check probes are not provided since this signal is not completely accurate. In this case, I think we still need some retries.
For the options about choosing/exposing which port for health check, I am also leaning towards option 1 since exposing /quitquitquit path could potentially cause that a pod is a killed by other pods by sending request to /quitquitquit.
I am a little bit concerned about using health check signal of queue proxy without any retry when user-defined readiness check probes are not provided since this signal is not completely accurate. In this case, I think we still need some retries.
We are also concerned with the fact that the queue-proxy readiness may not reflect user-container readiness and was raised in a prior comment
Everybody is in consensus about what to do if user-defined readinessProbes are provided but we need to arrive at a consensus for the scenario when user-defined readinessProbes are not provided. Ideally, we do not want to retry HTTP requests to the revision.
When user-defined readinessProbes are not provided, we could:
Mandate user defined readinessProbe. Extremely reliable but a change in contract for a user and knative.
Use queue proxy health endpoint readinessProbe and do not use retries. This is not reliable and the forwarded request may fail.
Continue using HTTP retries along with queue proxy readinessProbes but we are not moving away from retries.
Continue using HTTP retries when user-defined readinessProbes are not present.
@ZhiminXiang @mdemirhan
I think it is hard to avoid dropping some traffics when users don't specify a good health check. The users won't just hit issues when scaling 0->1, they will face the same issue for N->N+x as well, and activator isn't involved in N->N+x.
I think it makes sense to do option 2, which makes 0->1 most similar to N->N+x. In addition to that we want enough documentation to tell users about the error / error messages that they may see due to lacking a good health check and at the same time emphasize the importance of specifying healthchecks for smooth autoscaling.
Thanks for the quick feedback @tcnghia!
@shashwathi and I agree with what you said and are leaning towards option 2.
Dropping a note to say that (as of now) @shashwathi and I intend to do a multi PR approach for the issue
@pivotal-sukhil-suresh since you looked at #1509, can I assign it to you so you can post what you found there? thanks
@mdemirhan @tcnghia @ZhiminXiang CC @shashwathi
The initial PR (which handles the case when user-defined readinessProbe) is prepped and temporarily blocked on the KNative launch.
But for the follow-up PR (which handles the case when the user has NOT defined readinesProbe), it would be nice to agree on how to proceed.
Summarizing approach based on previous discussions on this thread. In the absence of a readinessProbe provided by the user:
The alternative is to wait on the health endpoint of the queue-proxy container (before forwarding request to the user service endpoint)
Querying the queue-proxy health will require exposing another port in the service port mapping
Querying the queue-proxy health through the user service URL will alleviate routing related concerns. The activator code already waits for Revision Pod to be in Ready status
It is understood that the queue-proxy container health may not be a true reflection of the user-container health and so there still MAY be traffic failure
Stop making HTTP retries for forwarding request to user service endpoint from the activator
/unassign @pivotal-sukhil-suresh
/assign @shashwathi
@markusthoemmes I remember that you made significant changes to activator retries. Is this issue still relevant?
Activator has changed significantly and only reporting stats, so without retrying requests we can no longer activate revisions.
(in fact retrying is the only other thing that activator does beside sending stats)
Most helpful comment
@pivotal-sukhil-suresh since you looked at #1509, can I assign it to you so you can post what you found there? thanks