Serving: e2e.TestAutoscaleSustaining flaky on istio-1.3-[no-]mesh

Created on 10 Oct 2019 · 8 Comments · Source: knative/serving

/area test-and-release

TestAutoscaleSustaining is pretty flaky according to TestGrid for Istio 1.3 (no issues with Istio 1.2)

TestGrid: istio-1.3-mesh
TestGrid: istio-1.3-no-mesh

area/test-and-release kind/bug

All 8 comments

/cc @JRBANCEL
JR, does this fall into your Envoy crashing investigation bucket?

This test fails because of random 503s, which is consistent with other tests failing with similar errors.
Hopefully this is because of the segfaults. Once we deploy the fix, we will know.

I have gone through a couple of logs from 1.3-__mesh__ and just wanted to note something in common.

  • The failure is caused by 503 responses to some requests, e.g. "error making requests for scale up: total = 63515, errors = 20, expected 0".
  • The 503s happened on the connection from istio-ingressgateway (istio-proxy) to the activator.
  • When the issue happened, the activator had multiple pods (more than 3).
  • Only one of those pods returned 503s. E.g. if the activator IPs are [10.56.1.32:8012 10.56.10.4:8012 10.56.2.15:8012 10.56.4.19:8012 10.56.9.23:8012], then only 10.56.1.32 returns 503 errors.
  • The problematic activator (e.g. 10.56.1.32 in the above case) was scaling __down__, and its istio-proxy was in graceful termination when the 503s were returned.
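The per-pod observation above can be checked mechanically. Below is a minimal Go sketch, assuming the proxy logs have already been parsed into (upstream address, status) pairs. The `response` type and both helper functions are hypothetical names for illustration, not part of knative/serving or Istio.

```go
package main

import (
	"fmt"
	"net/http"
)

// response is a hypothetical record of one proxied request: the upstream
// (activator pod address) that served it and the status code it returned.
type response struct {
	Upstream string
	Status   int
}

// errorsByUpstream counts 503 responses per upstream address.
func errorsByUpstream(rs []response) map[string]int {
	counts := make(map[string]int)
	for _, r := range rs {
		if r.Status == http.StatusServiceUnavailable {
			counts[r.Upstream]++
		}
	}
	return counts
}

// soleOffender reports the upstream address when exactly one upstream
// produced every 503 in the sample.
func soleOffender(counts map[string]int) (string, bool) {
	if len(counts) != 1 {
		return "", false
	}
	for addr := range counts {
		return addr, true
	}
	return "", false
}

func main() {
	// Illustrative sample only; the addresses are taken from the comment above.
	rs := []response{
		{Upstream: "10.56.1.32:8012", Status: http.StatusServiceUnavailable},
		{Upstream: "10.56.10.4:8012", Status: http.StatusOK},
		{Upstream: "10.56.1.32:8012", Status: http.StatusServiceUnavailable},
		{Upstream: "10.56.2.15:8012", Status: http.StatusOK},
	}
	counts := errorsByUpstream(rs)
	if addr, ok := soleOffender(counts); ok {
		fmt.Printf("all %d 503s came from %s\n", counts[addr], addr)
	} else {
		fmt.Println("503s are spread across upstreams (or there are none)")
	}
}
```

If a single address accounts for all errors, correlating it with the pod's termination timestamp is the natural next step.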

I hope the segfaults caused the issue, but the mesh environment might have another cause.

@mattmoor did work in that space and I think has fixed the activator termination part.

This is not related to the segfaults.
There are a handful of bugs in 1.3.x that lead to high CPU usage, which is the root cause of the 503 NR errors.
These have been fixed in the upcoming 1.3.3 release. I tested the pre-release bits and it looked good.

We probably need a test similar to TestDestroyPodInFlight (and its siblings) to destroy activators to make sure this can be caught with certainty.
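The general shape of such a test is to keep sending requests while a backend is disrupted and then assert on the failure count. The Go sketch below illustrates only that pattern, under loud assumptions: it uses an in-process `httptest` server as a stand-in for an activator, and `probeDuring` and `newDrainableServer` are hypothetical helpers, not the knative e2e test API or the test added for this issue.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// newDrainableServer starts a stand-in backend that returns 200 until the
// returned drain function is called, and 503 afterwards. This only imitates
// a pod that starts rejecting traffic during shutdown.
func newDrainableServer() (*httptest.Server, func()) {
	var draining int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.LoadInt32(&draining) == 1 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	return srv, func() { atomic.StoreInt32(&draining, 1) }
}

// probeDuring sends GET requests to url while disrupt runs concurrently,
// then sends `after` more requests once disrupt has returned. It reports
// how many requests were made and how many failed (error or non-200).
func probeDuring(client *http.Client, url string, disrupt func(), after int) (total, failures int) {
	done := make(chan struct{})
	go func() {
		disrupt()
		close(done)
	}()
	remaining := after
	for {
		select {
		case <-done:
			if remaining == 0 {
				return
			}
			remaining--
		default:
		}
		resp, err := client.Get(url)
		total++
		if err != nil {
			failures++
			continue
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			failures++
		}
	}
}

func main() {
	srv, drain := newDrainableServer()
	defer srv.Close()
	total, failures := probeDuring(srv.Client(), srv.URL, drain, 5)
	fmt.Printf("total=%d failures=%d\n", total, failures)
}
```

In a real e2e test the disruption would delete an activator pod, and the assertion would be that `failures` stays at zero.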

/close
Addressed by #5812
If there are indeed issues with the Activator in the shutdown path, the new test will catch them.

@JRBANCEL: Closing this issue.

