Go: x/playground: serving 502 errors

Created on 19 Apr 2020 · 25 comments · Source: golang/go

I've been seeing a number of 502s on the playground just now.

Screen Shot 2020-04-19 at 1 16 14 PM

(Hope it wasn't my fault!)

/cc @toothrot

NeedsInvestigation


All 25 comments

I'll take a look this afternoon, but it just worked for me now on mobile. There are a couple more stability issues to resolve.

Change https://golang.org/cl/228438 mentions this issue: playground: show a meaningful error on build timeout

It looks like the sandbox instances are running out of memory. I'm investigating.

Never mind, that may be a red herring: Docker is just doing its job.

I was getting a 502 on a GET to / (see screenshot). That wouldn't make its way back to the gVisor world.

I understand that, but when too many requests return 500s, the App Engine load balancer will return a 502 for all requests from a particular Flex instance.

Change https://golang.org/cl/229418 mentions this issue: sandbox: update configuration to match production

I've also been getting these consistently for the past 30 minutes
Screen Shot 2020-04-22 at 4 57 38 PM
Screen Shot 2020-04-22 at 5 22 57 PM

Thanks @odeke-em. We're aware of an ongoing issue and are still working to figure out the cause.


Sorry, closed the wrong issue while flipping through tabs.

For some context, the new architecture from #25224 introduces sandbox hosts that are separate from our GAE instances (which I have been calling frontends, in the Google parlance).

The frontend healthcheck looks like this:

func (s *server) healthCheck() error {
    ctx := context.Background() // TODO: cap it to some reasonable timeout
    // Compile and run a known-good program end to end; any failure marks
    // this frontend instance unhealthy.
    resp, err := compileAndRun(ctx, &request{Body: healthProg})
    if err != nil {
        return err
    }
    if resp.Errors != "" {
        return fmt.Errorf("compile error: %v", resp.Errors)
    }
    // The health program is expected to produce a single "ok" event.
    if len(resp.Events) != 1 || resp.Events[0].Message != "ok" {
        return fmt.Errorf("unexpected output: %v", resp.Events)
    }
    return nil
}

With the change to add new sandbox backends, this health check fails if all of the backends are down. AppEngine will mark our app as dead, and will try restarting it.

Until we understand the instability issues in our backend, I think we should change this health check to only do the "compile" phase, which happens on the frontend. It would still be helpful to mark them unhealthy if binary compilation is failing entirely.
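For illustration, a compile-only variant might look roughly like the sketch below. It reuses the identifiers from the excerpt above (server, request, healthProg), and buildOnly is a hypothetical helper standing in for whatever the frontend's compile step ends up being called; this is a sketch of the idea, not the actual change.

func (s *server) healthCheck() error {
    // Cap the check so a hung build can't wedge the health endpoint.
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // buildOnly (hypothetical) compiles healthProg on the frontend but never
    // sends the binary to a sandbox backend, so backend outages alone can no
    // longer make App Engine mark the frontend dead.
    resp, err := buildOnly(ctx, &request{Body: healthProg})
    if err != nil {
        return err
    }
    if resp.Errors != "" {
        return fmt.Errorf("build error: %v", resp.Errors)
    }
    return nil
}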

This should help us stop serving 502s, but it doesn't get us out of the woods of our sandbox hosts being unstable. I have other CLs planned for investigating that, and it starts with measurements.

/cc @dmitshur @cagedmantis @andybons

I've discussed this with @toothrot and agree that it's a good next step.

It's also worth mentioning that we should move to split health checks: https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#updated_health_checks

Change https://golang.org/cl/229677 mentions this issue: playground: only build binaries for health check

Change https://golang.org/cl/229679 mentions this issue: playground: instrument HTTP handlers with StackDriver

Change https://golang.org/cl/229680 mentions this issue: playground: instrument number of running containers

Change https://golang.org/cl/229681 mentions this issue: sandbox: add stackdriver scope

Change https://golang.org/cl/229958 mentions this issue: sandbox: correct stackdriver labels

Change https://golang.org/cl/229980 mentions this issue: sandbox: add container creation metrics

Change https://golang.org/cl/229981 mentions this issue: sandbox: reduce container starvation

After clearing up the concurrency issues and modifying the frontend health check, the latest round of metrics and load testing gave us the results we were looking for:

image

I performed a load test with a trivial program to figure out how the sandbox scales.

  • The container pool can be exhausted without using very much CPU.

    • CPU-intensive programs would have a different footprint, but set that aside for a moment.

  • /healthz latency is a proxy for how long we have to wait for a container. Seeing that histogram climb, and eventually cause an instance to be marked unhealthy, tells us that we can do a better job of CPU utilization here.

    • As expected, the vast majority of our steady-state load (6 QPS) is from /healthz requests.

  • The instance group manager is configured to add instances above 50% CPU utilization, which is hard to reach on the sandboxes with heavy usage of trivial programs (which I expect most of our traffic to be).

https://golang.org/cl/229980 adds container creation time metrics. Locally, it takes about 600ms to 1s to create a container. Since containers are created one at a time, that means it could take as long as 8s to recover from a burst of 8 requests.
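For reference, recording a creation-latency distribution with OpenCensus (which the StackDriver CLs above suggest is the metrics library in play) might look roughly like this; the measure and view names are made up for illustration, not the ones the CL adds.

package main

import (
    "context"
    "log"
    "time"

    "go.opencensus.io/stats"
    "go.opencensus.io/stats/view"
)

// Hypothetical measure; the real change may use different names and tags.
var containerCreateLatency = stats.Float64(
    "playground/sandbox/container_create_latency_ms",
    "Time taken to create a sandbox container",
    stats.UnitMilliseconds)

func main() {
    // A distribution view lets us watch the histogram climb as creation slows.
    if err := view.Register(&view.View{
        Name:        "playground/sandbox/container_create_latency_ms",
        Description: "Distribution of container creation latency",
        Measure:     containerCreateLatency,
        Aggregation: view.Distribution(100, 250, 500, 1000, 2000, 4000),
    }); err != nil {
        log.Fatal(err)
    }

    ctx := context.Background()
    start := time.Now()
    // ... create the container here (elided) ...
    stats.Record(ctx, containerCreateLatency.M(float64(time.Since(start))/float64(time.Millisecond)))
}

Exporting the recorded data to StackDriver requires registering an exporter separately, which the sketch omits.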

https://golang.org/cl/229981 addresses this issue by creating containers in parallel, allowing us to keep up with the request load. I'm expecting this to improve CPU utilization, hopefully triggering auto-scaling under load.
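As a rough illustration of the parallel approach (a sketch, not the code from the CL; container, startContainer, and the pool size below are all hypothetical stand-ins), refilling the pool with one goroutine per missing container bounds recovery time by the slowest single creation instead of the sum of all of them:

package main

import (
    "context"
    "fmt"
    "log"
    "sync"
    "time"
)

type container struct{ id int }

// startContainer stands in for the real "docker run" wrapper; locally a
// container takes roughly 600ms-1s to create.
func startContainer(ctx context.Context, id int) (*container, error) {
    select {
    case <-time.After(800 * time.Millisecond):
        return &container{id: id}, nil
    case <-ctx.Done():
        return nil, ctx.Err()
    }
}

func main() {
    const poolSize = 8
    ready := make(chan *container, poolSize) // buffered channel acting as the pool

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Create all missing containers concurrently rather than one at a time.
    start := time.Now()
    var wg sync.WaitGroup
    for i := 0; i < poolSize; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            c, err := startContainer(ctx, id)
            if err != nil {
                log.Printf("startContainer(%d): %v", id, err)
                return
            }
            ready <- c
        }(i)
    }
    wg.Wait()
    fmt.Printf("refilled %d containers in %v\n", len(ready), time.Since(start))
}

With serial creation the same refill would take poolSize times as long, which matches the ~8s recovery estimate above.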

With those changes applied, the results look similar, but I was unable to make the instances unhealthy, nor could I make them autoscale with trivial programs:

image

Running many CPU-bound snippets would still trigger horizontal scaling successfully.

I honestly think this is good enough for this issue. For scaling further, we may want to consider:

  • Increasing worker count to greater than runtime.NumCPU().

    • For a traffic pattern of many small requests, we are currently bound more by I/O wait than by CPU.

  • Removing the run semaphore. The blocking call to getContainer is likely sufficient.

    • For calculating the maximum number of concurrent processes from binary memory size, we should do some math comparing our Docker RAM limit against our max binary size * number of workers (see the sketch after this list).

  • Increasing our number of containers to something more than max-workers.

    • It takes about 500ms to create a container (see below). A flood of tiny requests could overwhelm our ability to create them in the current configuration without triggering autoscaling.

  • Finally, if we ever get our load balancer, we can instead trigger autoscaling by QPS.

    • You can't configure managed instance groups to scale by custom metrics; otherwise we could use our own metrics to do this. :disappointed:
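For the RAM question in the second bullet above, the math is simple enough to sketch; the figures below are placeholders, not our real limits:

package main

import "fmt"

func main() {
    // Placeholder figures; substitute the real Docker memory limit and the
    // worst-case binary (plus runtime) footprint.
    const (
        dockerRAMLimitMB = 512
        maxBinarySizeMB  = 100
        workers          = 8 // e.g. runtime.NumCPU() on the sandbox host
    )

    // How many binaries fit in the memory budget at once.
    maxConcurrent := dockerRAMLimitMB / maxBinarySizeMB

    // What the workers could demand at peak.
    peakDemandMB := maxBinarySizeMB * workers

    fmt.Printf("memory budget supports %d concurrent binaries\n", maxConcurrent)
    if peakDemandMB > dockerRAMLimitMB {
        fmt.Printf("%d workers could demand %dMB, exceeding the %dMB limit; cap concurrency at %d\n",
            workers, peakDemandMB, dockerRAMLimitMB, maxConcurrent)
    }
}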

image

I'm going to close this issue, as I think we're now in a good place when it comes to our daily usage. We can re-open it, or open a new one, if we notice something specific again.

I'm very happy about our new metrics, and hopefully we'll have way more insight into what's going on the next time something like this comes up for the playground.
