Serving: e2e.TestAutoscaleSustaining flaky on istio-1.3-[no-]mesh

Created on 10 Oct 2019 · 8 Comments · Source: knative/serving

/area test-and-release

TestAutoscaleSustaining is quite flaky on Istio 1.3 according to TestGrid (no issues with Istio 1.2).

TestGrid: istio-1.3-mesh
TestGrid: istio-1.3-no-mesh

area/test-and-release kind/bug

ssmall

All 8 comments

/cc @JRBANCEL
JR, does this fall into your Envoy crashing investigation bucket?

vagababov on 10 Oct 2019

This test fails because of random 503s, which is consistent with the other tests failing with similar errors.
Hopefully this is because of the segfaults. Once we deploy the fix, we will know.

JRBANCEL on 10 Oct 2019

I went through a couple of logs from 1.3-__mesh__ and want to note what they have in common.

The failure is caused by some requests returning 503, e.g. "error making requests for scale up: total = 63515, errors = 20, expected 0".
The 503s occurred on the connection from istio-ingressgateway (istio-proxy) to the activator.
When the issue occurred, the activator had multiple pods (more than 3).
The 503s came from only one of those pods. E.g. if the Activator IPs are [10.56.1.32:8012 10.56.10.4:8012 10.56.2.15:8012 10.56.4.19:8012 10.56.9.23:8012], then only 10.56.1.32 returns 503 errors.
The problematic activator (e.g. 10.56.1.32 in the above case) was being scaled __down__, and its istio-proxy was in graceful termination when the 503s were returned.

I hope the segfaults caused the issue, but in the mesh environment there might be another cause.

nak3 on 13 Oct 2019

@mattmoor did work in that space and I think has fixed the activator termination part.

vagababov on 13 Oct 2019

This is not related to the segfaults.
There are a handful of bugs in 1.3.x that lead to high CPU usage, which is the root cause of the 503 NR errors.
These have been fixed in the upcoming 1.3.3 release. I tested the pre-release bits and it looked good.

JRBANCEL on 14 Oct 2019

We probably need a test similar to TestDestroyPodInFlight (and its siblings) that destroys activators, so that this can be caught with certainty.

tcnghia on 14 Oct 2019

/close
Addressed by #5812
If there are indeed issues with the Activator in the shutdown path, the new test will catch them.

JRBANCEL on 16 Oct 2019

@JRBANCEL: Closing this issue.