I've been seeing a number of 502s on the playground just now.

(Hope it wasn't my fault!)
/cc @toothrot
I'll take a look this afternoon, but it just worked for me now on mobile. There are a couple more stability issues to resolve.
Change https://golang.org/cl/228438 mentions this issue: playground: show a meaningful error on build timeout
It looks like the sandbox instances are running out of memory. I'm investigating.
Never mind, that may be a red herring: just Docker doing its job.
I was getting a 502 on a GET to /. (See screenshot.) That wouldn't make its way back to the gVisor world.
I understand that, but when too many requests return 500s, the AppEngine LB will return a 502 for all requests served by that Flex instance.
Change https://golang.org/cl/229418 mentions this issue: sandbox: update configuration to match production
I've also been getting these consistently for the past 30 minutes


Thanks @odeke-em. We're aware of an ongoing issue and are still working to figure out the cause.
Sorry, closed the wrong issue while flipping through tabs.
For some context, the new architecture from #25224 introduces sandbox hosts that are separate from our GAE instances (which I have been calling frontends, in the Google parlance).
The frontend health check looks like this:
func (s *server) healthCheck() error {
	ctx := context.Background() // TODO: cap it to some reasonable timeout
	resp, err := compileAndRun(ctx, &request{Body: healthProg})
	if err != nil {
		return err
	}
	if resp.Errors != "" {
		return fmt.Errorf("compile error: %v", resp.Errors)
	}
	if len(resp.Events) != 1 || resp.Events[0].Message != "ok" {
		return fmt.Errorf("unexpected output: %v", resp.Events)
	}
	return nil
}
With the change to add new sandbox backends, this health check fails if all of the backends are down. AppEngine will mark our app as dead, and will try restarting it.
Until we understand the instability issues in our backend, I think we should change this health check to only do the "compile" phase, which happens on the frontend. It would still be helpful to mark them unhealthy if binary compilation is failing entirely.
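A rough sketch of what that compile-only check could look like, reusing the types from the snippet above (compileOnly is a hypothetical stand-in for the frontend's build-only path, not an existing function):

func (s *server) healthCheck() error {
	ctx := context.Background() // TODO: cap it to some reasonable timeout
	// compileOnly is a hypothetical helper that runs only the build step on
	// the frontend and never talks to the sandbox backends.
	resp, err := compileOnly(ctx, &request{Body: healthProg})
	if err != nil {
		return err
	}
	if resp.Errors != "" {
		return fmt.Errorf("compile error: %v", resp.Errors)
	}
	return nil
}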
This should help us stop serving 502s, but it doesn't get us out of the woods of our sandbox hosts being unstable. I have other CLs planned for investigating that, and it starts with measurements.
/cc @dmitshur @cagedmantis @andybons
I've discussed this with @toothrot and agree that it's a good next step.
It's also worth mentioning that we should move to split health checks: https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#updated_health_checks
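For reference, the split model gives us a cheap liveness probe (App Engine restarts the instance when it fails repeatedly) and a separate readiness probe (App Engine only stops routing traffic while it fails). A minimal sketch of serving the two default paths that app.yaml's liveness_check and readiness_check would point at; the handler bodies are illustrative, not our actual code:

package main

import (
	"fmt"
	"net/http"
)

// healthCheck stands in for the compile-only check discussed above.
func healthCheck() error { return nil }

func main() {
	// Liveness: cheap "is the process serving at all" check; failing it gets
	// the instance restarted.
	http.HandleFunc("/liveness_check", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	// Readiness: the heavier check; failing it only takes the instance out of
	// rotation, without restarting it.
	http.HandleFunc("/readiness_check", func(w http.ResponseWriter, r *http.Request) {
		if err := healthCheck(); err != nil {
			http.Error(w, "not ready: "+err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprintln(w, "ok")
	})
	http.ListenAndServe(":8080", nil)
}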
Change https://golang.org/cl/229677 mentions this issue: playground: only build binaries for health check
Change https://golang.org/cl/229679 mentions this issue: playground: instrument HTTP handlers with StackDriver
Change https://golang.org/cl/229680 mentions this issue: playground: instrument number of running containers
Change https://golang.org/cl/229681 mentions this issue: sandbox: add stackdriver scope
Change https://golang.org/cl/229958 mentions this issue: sandbox: correct stackdriver labels
Change https://golang.org/cl/229980 mentions this issue: sandbox: add container creation metrics
Change https://golang.org/cl/229981 mentions this issue: sandbox: reduce container starvation
After clearing up the concurrency issues, and modifying the frontend health check, the latest series of metrics and load testing provided us the results we were looking for:

I performed a load test of a trivial program to figure out how the sandbox scaled.
/healthz is a proxy for how long we have to wait for a container. Seeing that histogram go up, and eventually cause an instance to become unhealthy, tells us that we can do a better job of CPU utilization here. https://golang.org/cl/229980 adds container creation time metrics. Locally, it takes about 600ms to 1s to create a container, which means it could take as long as 8s to recover from 8 requests.
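The real instrumentation is in the CLs above; purely as an illustration, a latency distribution like this could be recorded with OpenCensus and exported to Stackdriver. The metric name and bucket bounds below are mine, not the ones from CL 229980:

package main

import (
	"context"
	"log"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// containerCreateLatency is an illustrative measure of how long container
// creation takes, in milliseconds.
var containerCreateLatency = stats.Float64(
	"playground/container_create_latency",
	"Container creation latency",
	stats.UnitMilliseconds,
)

func init() {
	if err := view.Register(&view.View{
		Name:        "playground/container_create_latency",
		Measure:     containerCreateLatency,
		Description: "Distribution of container creation latency",
		// Buckets chosen around the observed ~600ms-1s creation time.
		Aggregation: view.Distribution(250, 500, 750, 1000, 2000, 4000, 8000),
	}); err != nil {
		log.Fatal(err)
	}
}

func recordContainerCreation(ctx context.Context, start time.Time) {
	ms := float64(time.Since(start)) / float64(time.Millisecond)
	stats.Record(ctx, containerCreateLatency.M(ms))
}

func main() {
	start := time.Now()
	// ... create the container here ...
	recordContainerCreation(context.Background(), start)
}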
https://golang.org/cl/229981 addresses this issue by creating containers in parallel, allowing us to keep up with the request load. I'm expecting this to improve CPU utilization, hopefully triggering auto-scaling under load.
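A minimal sketch of that idea: keep a buffered pool of ready containers topped up by several workers instead of one. The names here (makeContainer, readyContainers, the worker count) are illustrative, not the actual sandbox code:

package main

import (
	"context"
	"errors"
	"log"
	"time"
)

type container struct{ id string }

func (c *container) Close() { /* tear down the underlying Docker container */ }

// makeContainer stands in for the real "docker create && start" step, which
// takes roughly 600ms-1s locally.
func makeContainer(ctx context.Context) (*container, error) {
	return nil, errors.New("not implemented in this sketch")
}

// readyContainers holds pre-created containers waiting for requests.
var readyContainers = make(chan *container, 8)

// fillContainerPool keeps readyContainers topped up using several workers, so
// a burst of requests doesn't have to wait for containers to be created one
// at a time.
func fillContainerPool(ctx context.Context, workers int) {
	for i := 0; i < workers; i++ {
		go func() {
			for ctx.Err() == nil {
				c, err := makeContainer(ctx)
				if err != nil {
					log.Printf("error creating container: %v", err)
					time.Sleep(time.Second) // back off instead of spinning
					continue
				}
				select {
				case readyContainers <- c:
				case <-ctx.Done():
					c.Close()
					return
				}
			}
		}()
	}
}

func main() {
	fillContainerPool(context.Background(), 4)
	// A request handler would then grab a warm container with <-readyContainers.
}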
The results of those changes are similar, but I was unable to cause the instances to become unhealthy, nor could I make the instances autoscale with trivial programs:

Many CPU-bound snippets would still trigger horizontal scaling successfully.
I honestly think this is good enough for this issue. For scaling further, we may want to consider:
runtime.NumCPU().
I'm going to close this issue as I think we're now in a good place when it comes to our daily usage. We can re-open it, or open a new one, if we notice something specific again.
I'm very happy about our new metrics, and hopefully we'll have way more insight into what's going on the next time something like this comes up for the playground.