Serving: Don't retry POST requests in activator

Created on 2 Jul 2018 · 21Comments · Source: knative/serving

POST requests should not be retried in activator. Instead, activator should use health check signals to wait for the deployment to be fully resolved and working and then do a single POST request.

areautoscale arenetworking kinbug

Source

mdemirhan

Most helpful comment

@pivotal-sukhil-suresh since you looked at #1509, can I assign it to you so you can post what you found there? thanks

tcnghia on 19 Jul 2018

👍2

All 21 comments

/assign @pivotal-sukhil-suresh

sukhil-suresh on 3 Jul 2018

@mdemirhan - checking if this issue is blocked by #1241, which is still marked open?

sukhil-suresh on 4 Jul 2018

I don't think it should be blocked on that work. We should make this work for the default port for now. When (and if) we fix #1241, we should do it in a way that it enhances your changes in this work to support different ports. Let me know if that makes sense.

mdemirhan on 5 Jul 2018

👍1

@mdemirhan thanks for confirming that #1241 is not a blocker. I see your point; ideally, changes made for #1241 will be invisible to the changes (readiness probe calls) made for this issue. Happy to move ahead!

sukhil-suresh on 5 Jul 2018

👍1

Paired with @shashwathi on this yesterday and some thoughts...

@mdemirhan: we assume the intent is to avoid retries for ALL requests GET/POST (when waiting for forwarding to the revision pod). Basing this on your previous comment on a separate issue - https://github.com/knative/serving/pull/1292#issuecomment-398840154

The activator/revision.go ActiveEndpoint already does a watch on the revision status and waits until the status is ready. And our understanding is that the activator/main.go retries are specifically for istio route rule update delay related problems. It would be nice if you could confirm.

Considering that the readiness/liveness probe is not mandated for user-container, it seems like the alternate is to use the queue-proxy readinessProbe. But, this /health endpoint only reports the readiness of the queue-proxy itself and does not necessarily reflect the user-container readiness.

sukhil-suresh on 6 Jul 2018

CC @tcnghia @akyyy @nikkithurmond @josephburnett

When the health status shows that the pod is healthy, we found that it is not immediately usable by the activator. The reason is, activator use service Ip to communicate with the revision and a pod being ready does not immediately translate into the service being ready to take traffic. In our experience, we found a delay of 3-4 seconds after the pod is ready, that the service can route traffic to the pod (most likely ip table configuration for the service ip to pod ip takes some time).

I like the idea of checking for queue-proxy's health if the user doesn't have a liveness/readiness probe. That would be helpful. Based on your suggestions, how does the following flow work:

[Current behavior] Wait until the endpoint is ready
[New behavior] Start checking the health probe using the service IP - if user is missing readiness/liveness, check queue proxy's readiness/liveness
[Modified behavior] For both GET POST requests, make a single call and return that - no retries

I am partial on whether we should do a small number of retries for GET or just a single one. Let me know if this makes sense.

mdemirhan on 6 Jul 2018

@mdemirhan sorry about the delay in getting back. Yes, the [New behavior] that you have listed is doable and does make sense. Will proceed with this approach to replace the http retries.

Also, seems like there is a longer-term solution being considered in #1509. @shashwathi had suggested something similar last week - the idea of using TCP health checks using istio-proxy (the other container) on the revision pod. I had missed mentioning it in my previous comment

CC @dprotaso @tanzeeb @bsnchan

sukhil-suresh on 9 Jul 2018

1509 or @shashwathi's suggestion is more of an addition to this, rather than a replacement. If the user specified a TCP or an HTTP health check in container spec, we should simply use that because that will be more reliable than a TCP health check to another container (in case the user code startup time is significant). However, if there is no health check endpoint specified in user-container's spec, doing a TCP probe to either queue proxy or istio proxy sounds like a good idea.

mdemirhan on 9 Jul 2018

👍2

Paired with @shashwathi on this.

@mdemirhan: Wanted to get your thoughts on some problems we encountered

Based on the K8s docs, the readinessProbe action can be defined by HTTPGet, TCPSocket, and Exec. For the initial iteration, the intent was to focus on the HTTPGet and TCPSocket only.

The good news is that we were able to successfully use the user-defined HTTPGet based readiness probe (instead of making multiple attempts at forwarding request to the revision pod).

However, we are running into problems with the user-defined TCPSocket based probes. TCPSocket based probes have only two values - host and port. The user is not allowed to specify port by the webhook validation. And for the host address, irrespective of whatever the user specifies, we are using the internal app service address.

So, at this point, we think the story may be blocked on issue #1241 at least for user-defined TCPSocket based probes.

Does it sound reasonable to submit a PR with support for user-defined HTTPGet probes and fall back on the default probe (queue-proxy readiness check)? In this case, if the user provides any non-HTTP probes, activator ignores it and uses default probe.

sukhil-suresh on 17 Jul 2018

Thanks for the detailed information Sukhil. That plan sounds good to me.

@tcnghia @akyyy @josephburnett FYI

mdemirhan on 17 Jul 2018

👍1

Summarizing discussion with @mdemirhan & @shashwathi :

We are trying to tackle the problem if user-defined readiness check probes are not provided and we want to use queue-proxy readiness check. However, the queue proxy exposes only port 8012 (RequestQueuePort) through service port mapping and so we are unable to target the queue proxy health handler from the activator.

Currently, the queue proxy /health endpoint is served on port 8022 (RequestQueueAdminPort). This port is not accessible through the service FQDN, so we are thinking of following options:

Queue proxy starts another health server on X port and expose this X port in service port mapping. In this option, the activator can use service http://service-FQDN:X/health to do default readiness check. This option also involves removing the /health endpoint from 8022 (RequestQueueAdminPort).
Expose 8022 (RequestQueueAdminPort) on queue proxy through service port mapping. One potential problem is /quitquitquit endpoint on this port is used as preStop hook so we are not sure about security implications of exposing this in service port mapping.

We are leaning towards option 1. What do you think? @tcnghia @josephburnett @ZhiminXiang

Feel free to correct us if we misunderstood the conversation @mdemirhan

sukhil-suresh on 18 Jul 2018

@mdemirhan @pivotal-sukhil-suresh @shashwathi @tcnghia

I am a little bit concerned about using health check signal of queue proxy without any retry when user-defined readiness check probes are not provided since this signal is not completely accurate. In this case, I think we still need some retries.

For the options about choosing/exposing which port for health check, I am also leaning towards option 1 since exposing /quitquitquit path could potentially cause that a pod is a killed by other pods by sending request to /quitquitquit.

ZhiminXiang on 18 Jul 2018

I am a little bit concerned about using health check signal of queue proxy without any retry when user-defined readiness check probes are not provided since this signal is not completely accurate. In this case, I think we still need some retries.

We are also concerned with the fact that the queue-proxy readiness may not reflect user-container readiness and was raised in a prior comment

Everybody is in consensus about what to do if user-defined readinessProbes are provided but we need to arrive at a consensus for the scenario when user-defined readinessProbes are not provided. Ideally, we do not want to retry HTTP requests to the revision.

When user-defined readinessProbes are not provided, we could:

Mandate user defined readinessProbe. Extremely reliable but a change in contract for a user and knative.
Use queue proxy health endpoint readinessProbe and do not use retries. This is not reliable and the forwarded request may fail.
Continue using HTTP retries along with queue proxy readinessProbes but we are not moving away from retries.
Continue using HTTP retries when user-defined readinessProbes are not present.

@ZhiminXiang @mdemirhan

sukhil-suresh on 18 Jul 2018

I think it is hard to avoid dropping some traffics when users don't specify a good health check. The users won't just hit issues when scaling 0->1, they will face the same issue for N->N+x as well, and activator isn't involved in N->N+x.

I think it makes sense to do option 2, which makes 0->1 most similar to N->N+x. In addition to that we want enough documentation to tell users about the error / error messages that they may see due to lacking a good health check and at the same time emphasize the importance of specifying healthchecks for smooth autoscaling.

tcnghia on 18 Jul 2018

👍1

Thanks for the quick feedback @tcnghia!
@shashwathi and I agree with what you said and are leaning towards option 2.

sukhil-suresh on 18 Jul 2018

Dropping a note to say that (as of now) @shashwathi and I intend to do a multi PR approach for the issue

Initial PR which handles the scenario of when the user has defined readinessProbe and so the HTTP retries are skipped. Working on this right now.
Followup PR will handle the scenario when the user has NOT defined the readiness probe.

sukhil-suresh on 18 Jul 2018

@pivotal-sukhil-suresh since you looked at #1509, can I assign it to you so you can post what you found there? thanks

tcnghia on 19 Jul 2018

👍2

@mdemirhan @tcnghia @ZhiminXiang CC @shashwathi

The initial PR (which handles the case when user-defined readinessProbe) is prepped and temporarily blocked on the KNative launch.

But for the follow-up PR (which handles the case when the user has NOT defined readinesProbe), it would be nice to agree on how to proceed.

Summarizing approach based on previous discussions on this thread. In the absence of a readinessProbe provided by the user:

The alternative is to wait on the health endpoint of the queue-proxy container (before forwarding request to the user service endpoint)
Querying the queue-proxy health will require exposing another port in the service port mapping
Querying the queue-proxy health through the user service URL will alleviate routing related concerns. The activator code already waits for Revision Pod to be in Ready status
It is understood that the queue-proxy container health may not be a true reflection of the user-container health and so there still MAY be traffic failure
Stop making HTTP retries for forwarding request to user service endpoint from the activator

sukhil-suresh on 24 Jul 2018

/unassign @pivotal-sukhil-suresh
/assign @shashwathi

sukhil-suresh on 27 Jul 2018

@markusthoemmes I remember that you made significant changes to activator retries. Is this issue still relevant?

tcnghia on 6 Nov 2018

Activator has changed significantly and only reporting stats, so without retrying requests we can no longer activate revisions.

(in fact retrying is the only other thing that activator does beside sending stats)

tcnghia on 4 Jan 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

latestRevision routes may point to older revisions if latestCreated is not ready

greghaynes · 4Comments

SSL hangs with network certmanager

alexnederlof · 5Comments

Support wildcard certificate for Auto TLS feature

ZhiminXiang · 3Comments

Installing bundled Istio frequently fails with "no matches for kind ..."

bbrowning · 6Comments

Support shortened versions of cluster-internal service name

tcnghia · 3Comments