Go: x/playground: serving 502 errors

Created on 19 Apr 2020 · 25 comments · Source: golang/go

I've been seeing a number of 502s on the playground just now.

Screen Shot 2020-04-19 at 1 16 14 PM

(Hope it wasn't my fault!)

/cc @toothrot

NeedsInvestigation


All 25 comments

I'll take a look this afternoon, but it just worked for me now on mobile. There are a couple more stability issues to resolve.

Change https://golang.org/cl/228438 mentions this issue: playground: show a meaningful error on build timeout

It looks like the sandbox instances are running out of memory. I'm investigating.

Never mind, that may be a red herring: Docker is just doing its job.

I was getting a 502 on a GET to / (see screenshot). That wouldn't make its way back to the gVisor world.

I understand that, but when too many requests return 500s, the App Engine load balancer will return a 502 for all requests from a particular Flex instance.

Change https://golang.org/cl/229418 mentions this issue: sandbox: update configuration to match production

I've also been getting these consistently for the past 30 minutes
Screen Shot 2020-04-22 at 4 57 38 PM
Screen Shot 2020-04-22 at 5 22 57 PM

Thanks @odeke-em. We're aware of an ongoing issue and are still working to figure out the cause.


Sorry, closed the wrong issue while flipping through tabs.

For some context, the new architecture from #25224 introduces sandbox hosts that are separate from our GAE instances (which I have been calling frontends, in the Google parlance).

The frontend healthcheck looks like this:

func (s *server) healthCheck() error {
    ctx := context.Background() // TODO: cap it to some reasonable timeout
    // Compile and run a known-good program end to end; any failure marks
    // this frontend instance unhealthy.
    resp, err := compileAndRun(ctx, &request{Body: healthProg})
    if err != nil {
        return err
    }
    if resp.Errors != "" {
        return fmt.Errorf("compile error: %v", resp.Errors)
    }
    // The health program is expected to produce a single "ok" event.
    if len(resp.Events) != 1 || resp.Events[0].Message != "ok" {
        return fmt.Errorf("unexpected output: %v", resp.Events)
    }
    return nil
}

With the change to add new sandbox backends, this health check fails if all of the backends are down. AppEngine will mark our app as dead, and will try restarting it.

Until we understand the instability issues in our backend, I think we should change this health check to only do the "compile" phase, which happens on the frontend. It would still be helpful to mark them unhealthy if binary compilation is failing entirely.
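For illustration, a compile-only variant might look roughly like the sketch below. It reuses the identifiers from the excerpt above (server, request, healthProg), and buildOnly is a hypothetical helper standing in for whatever the frontend's compile step ends up being called; this is a sketch of the idea, not the actual change.

func (s *server) healthCheck() error {
    // Cap the check so a hung build can't wedge the health endpoint.
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // buildOnly (hypothetical) compiles healthProg on the frontend but never
    // sends the binary to a sandbox backend, so backend outages alone can no
    // longer make App Engine mark the frontend dead.
    resp, err := buildOnly(ctx, &request{Body: healthProg})
    if err != nil {
        return err
    }
    if resp.Errors != "" {
        return fmt.Errorf("build error: %v", resp.Errors)
    }
    return nil
}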

This should help us stop serving 502s, but it doesn't get us out of the woods of our sandbox hosts being unstable. I have other CLs planned for investigating that, and it starts with measurements.

/cc @dmitshur @cagedmantis @andybons

I've discussed this with @toothrot and agree that it's a good next step.

It's also worth mentioning that we should move to split health checks: https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#updated_health_checks

Change https://golang.org/cl/229677 mentions this issue: playground: only build binaries for health check

Change https://golang.org/cl/229679 mentions this issue: playground: instrument HTTP handlers with StackDriver

Change https://golang.org/cl/229680 mentions this issue: playground: instrument number of running containers

Change https://golang.org/cl/229681 mentions this issue: sandbox: add stackdriver scope

Change https://golang.org/cl/229958 mentions this issue: sandbox: correct stackdriver labels

Change https://golang.org/cl/229980 mentions this issue: sandbox: add container creation metrics

Change https://golang.org/cl/229981 mentions this issue: sandbox: reduce container starvation

After clearing up the concurrency issues and modifying the frontend health check, the latest round of metrics and load testing gave us the results we were looking for:

image

I performed a load test with a trivial program to figure out how the sandbox scales.

  • The container pool can be exhausted without using very much CPU.

    • CPU-intensive programs would have a different footprint, but set that aside for a moment.

  • /healthz latency is a proxy for how long we have to wait for a container. Seeing that histogram climb, and eventually cause an instance to be marked unhealthy, tells us that we can do a better job of CPU utilization here.

    • As expected, the vast majority of our steady-state load (6 QPS) is from /healthz requests.

  • The instance group manager is configured to add instances above 50% CPU utilization, which is hard to reach on the sandboxes with heavy usage of trivial programs (which I expect most of our traffic to be).

https://golang.org/cl/229980 adds container creation time metrics. Locally, it takes about 600ms to 1s to create a container. Since containers are created one at a time, that means it could take as long as 8s to recover from a burst of 8 requests.
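For reference, recording a creation-latency distribution with OpenCensus (which the StackDriver CLs above suggest is the metrics library in play) might look roughly like this; the measure and view names are made up for illustration, not the ones the CL adds.

package main

import (
    "context"
    "log"
    "time"

    "go.opencensus.io/stats"
    "go.opencensus.io/stats/view"
)

// Hypothetical measure; the real change may use different names and tags.
var containerCreateLatency = stats.Float64(
    "playground/sandbox/container_create_latency_ms",
    "Time taken to create a sandbox container",
    stats.UnitMilliseconds)

func main() {
    // A distribution view lets us watch the histogram climb as creation slows.
    if err := view.Register(&view.View{
        Name:        "playground/sandbox/container_create_latency_ms",
        Description: "Distribution of container creation latency",
        Measure:     containerCreateLatency,
        Aggregation: view.Distribution(100, 250, 500, 1000, 2000, 4000),
    }); err != nil {
        log.Fatal(err)
    }

    ctx := context.Background()
    start := time.Now()
    // ... create the container here (elided) ...
    stats.Record(ctx, containerCreateLatency.M(float64(time.Since(start))/float64(time.Millisecond)))
}

Exporting the recorded data to StackDriver requires registering an exporter separately, which the sketch omits.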

https://golang.org/cl/229981 addresses this issue by creating containers in parallel, allowing us to keep up with the request load. I'm expecting this to improve CPU utilization, hopefully triggering auto-scaling under load.
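As a rough illustration of the parallel approach (a sketch, not the code from the CL; container, startContainer, and the pool size below are all hypothetical stand-ins), refilling the pool with one goroutine per missing container bounds recovery time by the slowest single creation instead of the sum of all of them:

package main

import (
    "context"
    "fmt"
    "log"
    "sync"
    "time"
)

type container struct{ id int }

// startContainer stands in for the real "docker run" wrapper; locally a
// container takes roughly 600ms-1s to create.
func startContainer(ctx context.Context, id int) (*container, error) {
    select {
    case <-time.After(800 * time.Millisecond):
        return &container{id: id}, nil
    case <-ctx.Done():
        return nil, ctx.Err()
    }
}

func main() {
    const poolSize = 8
    ready := make(chan *container, poolSize) // buffered channel acting as the pool

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Create all missing containers concurrently rather than one at a time.
    start := time.Now()
    var wg sync.WaitGroup
    for i := 0; i < poolSize; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            c, err := startContainer(ctx, id)
            if err != nil {
                log.Printf("startContainer(%d): %v", id, err)
                return
            }
            ready <- c
        }(i)
    }
    wg.Wait()
    fmt.Printf("refilled %d containers in %v\n", len(ready), time.Since(start))
}

With serial creation the same refill would take poolSize times as long, which matches the ~8s recovery estimate above.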

With those changes applied, the results look similar, but I was unable to make the instances unhealthy, nor could I make them autoscale with trivial programs:

image

Running many CPU-bound snippets would still trigger horizontal scaling successfully.

I honestly think this is good enough for this issue. For scaling further, we may want to consider:

  • Increasing worker count to greater than runtime.NumCPU().

    • For a traffic pattern of many small requests, we are currently bound more by I/O wait than by CPU.

  • Removing the run semaphore. The blocking call to getContainer is likely sufficient.

    • For calculating the maximum number of concurrent processes from binary memory size, we should do some math comparing our Docker RAM limit against our max binary size * number of workers (see the sketch after this list).

  • Increasing our number of containers to something more than max-workers.

    • It takes about 500ms to create a container (see below). A flood of tiny requests could overwhelm our ability to create them in the current configuration without triggering autoscaling.

  • Finally, if we ever get our load balancer, we can instead trigger autoscaling by QPS.

    • You can't configure managed instance groups to scale by custom metrics; otherwise we could use our own metrics to do this. :disappointed:
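For the RAM question in the second bullet above, the math is simple enough to sketch; the figures below are placeholders, not our real limits:

package main

import "fmt"

func main() {
    // Placeholder figures; substitute the real Docker memory limit and the
    // worst-case binary (plus runtime) footprint.
    const (
        dockerRAMLimitMB = 512
        maxBinarySizeMB  = 100
        workers          = 8 // e.g. runtime.NumCPU() on the sandbox host
    )

    // How many binaries fit in the memory budget at once.
    maxConcurrent := dockerRAMLimitMB / maxBinarySizeMB

    // What the workers could demand at peak.
    peakDemandMB := maxBinarySizeMB * workers

    fmt.Printf("memory budget supports %d concurrent binaries\n", maxConcurrent)
    if peakDemandMB > dockerRAMLimitMB {
        fmt.Printf("%d workers could demand %dMB, exceeding the %dMB limit; cap concurrency at %d\n",
            workers, peakDemandMB, dockerRAMLimitMB, maxConcurrent)
    }
}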

image

I'm going to close this issue, as I think we're now in a good place when it comes to our daily usage. We can re-open it, or open a new one, if we notice something specific again.

I'm very happy about our new metrics, and hopefully we'll have way more insight into what's going on the next time something like this comes up for the playground.
