/area test-and-release
TestAutoscaleSustaining is pretty flaky according to TestGrid on Istio 1.3 (no issues on Istio 1.2).
/cc @JRBANCEL
JR, does this fall into your Envoy crashing investigation bucket?
This test fails because of random 503s, which is consistent with the other tests failing with similar errors.
Hopefully this is because of the segfaults. Once we deploy the fix, we will know.
I have gone through a couple of logs from 1.3-__mesh__ and just wanted to note something in common.
- Some requests return 503, c.f. `error making requests for scale up: total = 63515, errors = 20, expected 0`
- The 503s happened on the connection from istio-ingressgateway (istio-proxy) to the activator.
- Only one of the multiple activator pods returns 503. E.g. if all Activator IPs are [10.56.1.32:8012 10.56.10.4:8012 10.56.2.15:8012 10.56.4.19:8012 10.56.9.23:8012], then only 10.56.1.32 returns 503 errors.
- That pod (10.56.1.32 in the above case) was scaling __down__ and its istio-proxy was in graceful termination when the 503s were returned.

I hope the segfaults caused the issue, but the mesh env might have another cause.
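One way to check whether the 503s really concentrate on a single activator is to tally them by upstream address. The following is only a minimal sketch: the `result` type and `errorsByBackend` function are hypothetical names, and the sample data is illustrative rather than taken from the actual logs.

```go
package main

import (
	"fmt"
	"sort"
)

// result is a hypothetical record of one probe request: which activator
// endpoint served it and the HTTP status that came back.
type result struct {
	Backend string
	Status  int
}

// errorsByBackend counts 503 responses per backend address.
// Backends that never returned a 503 do not appear in the map.
func errorsByBackend(results []result) map[string]int {
	counts := make(map[string]int)
	for _, r := range results {
		if r.Status == 503 {
			counts[r.Backend]++
		}
	}
	return counts
}

func main() {
	// Illustrative data only, not copied from the real test logs.
	results := []result{
		{"10.56.1.32:8012", 503},
		{"10.56.10.4:8012", 200},
		{"10.56.1.32:8012", 503},
		{"10.56.2.15:8012", 200},
		{"10.56.1.32:8012", 200},
	}

	counts := errorsByBackend(results)

	// Sort the backend addresses so the output order is stable.
	backends := make([]string, 0, len(counts))
	for b := range counts {
		backends = append(backends, b)
	}
	sort.Strings(backends)

	for _, b := range backends {
		fmt.Printf("%s: %d\n", b, counts[b])
	}
}
```

If the resulting map has a single key, the errors are coming from one pod, which matches the pattern described above.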
@mattmoor did work in that space and I think has fixed the activator termination part.
This is not related to the segfaults.
There are a handful of bugs in 1.3.x that lead to high CPU usage, which is the root cause of the 503 NR errors.
These have been fixed in the upcoming 1.3.3 release. I tested the pre-release bits and it looked good.
We probably need a test similar to TestDestroyPodInFlight (and its siblings) to destroy activators to make sure this can be caught with certainty.
/close
Addressed by #5812
If there are indeed issues with Activator in the shutdown path, the new test will catch them.
@JRBANCEL: Closing this issue.