Go: x/build/env/openbsd-386: openbsd-386-62 gomote instance crashes periodically

Created on 3 Feb 2020 · 16 Comments · Source: golang/go

@ianlancetaylor has reported in https://github.com/golang/go/issues/36996#issuecomment-581596950 that the openbsd-386-62 gomote instance was crashing, which made debugging an OpenBSD issue more difficult and time-consuming:

Unfortunately, the gomote then crashed before I could look at all the data. The gomote continues to crash periodically, forcing me to rebuild everything before I can do more testing.

We should investigate and try to fix that, or find another solution to make it easier to debug OpenBSD issues. This is the tracking issue for that. /cc @cagedmantis @toothrot

Builders OS-OpenBSD

All 16 comments

@ianlancetaylor Can you let us know on this bug a date and time next time this occurs? I want to investigate in our instance and coordinator logs.

It just happened a few minutes ago.

I'll try to capture the exact time next time.

The openbsd-386-62 gomote just crashed again, between 13:46:57 PST and 13:49:09 PST.

Perfect thanks, I'll take a look.

As I suspected, the coordinator is killing your instances, but I do not know why:

"2020/02/03 21:01:25 created buildlet user-iant-openbsd-386-62-1 for user-iant (GCE VM: buildlet-openbsd-386-62-rnb65ce41)
...
"2020/02/03 21:45:47 deleting VM "buildlet-openbsd-386-62-rnb65ce41" in zone "us-central1-c"; delete-at expiration ..."

It looks like it was up for about 45 minutes. I'll try to figure out how this is implemented.
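As a quick check of the "about 45 minutes" estimate, the two log timestamps quoted above can be subtracted directly. This is only a sketch: the helper name `uptime` and the layout constant are made up for illustration and are not part of the coordinator.

```go
package main

import (
	"fmt"
	"time"
)

// logLayout matches the timestamp prefix of the coordinator log lines
// quoted in this thread (hypothetical constant, for illustration only).
const logLayout = "2006/01/02 15:04:05"

// uptime returns the time elapsed between two log timestamps.
func uptime(created, deleted string) (time.Duration, error) {
	start, err := time.Parse(logLayout, created)
	if err != nil {
		return 0, err
	}
	end, err := time.Parse(logLayout, deleted)
	if err != nil {
		return 0, err
	}
	return end.Sub(start), nil
}

func main() {
	d, err := uptime("2020/02/03 21:01:25", "2020/02/03 21:45:47")
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // prints 44m22s
}
```

The VM lived for 44m22s, just under 45 minutes.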

The default timeout is 45 minutes: https://github.com/golang/build/blob/17a7d8724fa7128cd79bcb78e1fbe087043bf810/cmd/coordinator/coordinator.go#L140

I'm still tracing through this code, but it seems like this should happen for all GCE VMs. I'll keep digging.

45 minutes sounds like roughly the same timescale as in #28365. Perhaps they have the same root cause?

Thanks for looking at this. My understanding was that the coordinator would not kill the instance if I was connected to it via gomote ssh, as was the case here. But maybe my understanding was incorrect.

That was my understanding too, and I've seen gomote instances hang around for a long time (many hours) due to an active ssh connection. Perhaps it works for some builder types but not others.

I've remembered about issue #36802 still needing a resolution. There aren't any gomote sessions right now, so I'll use this as a chance to redeploy coordinator, so we can know that the latest version is in use. Edit: Done, see https://github.com/golang/go/issues/36802#issuecomment-581700676.

My current belief is this issue is specifically related to GCE VMs, which is a narrow-ish subset of our VMs. I'm still reading through the coordinator code to fully understand how it works before saying with confidence what is causing it, but I have my suspicions.

OK! I believe I have tracked it down.

When using gomote ssh, a property named Expires on RemoteBuildlet is updated every minute while an SSH session is active: https://github.com/golang/build/blob/5bb938ef020fb4b7f22d366b1e0dc8f9b425cc2f/buildlet/remote.go#L171

For GCE VMs, we also track a different attribute, delete-at, in the instance metadata. This attribute is not updated during an SSH session, so we eventually hit the default 45-minute timeout on these VMs and expire them here: https://github.com/golang/build/blob/5bb938ef020fb4b7f22d366b1e0dc8f9b425cc2f/cmd/coordinator/gce.go#L577

We could do one or more of the following:

  • Update the SSH session to also bump the instance metadata attribute where applicable
  • Improve the expiration check in gce.go to account for active SSH sessions
  • Not set a delete-at when SSHing, as we'll rely on the remote buildlet cleanup instead.

I'm not sure yet which is best, or whether some combination of them is best. I'll keep looking. I believe most of the knowledge of this code sits with @bradfitz and @crawshaw.

I'd do (2) .... "Improve the expiration check in gce.go to account for active SSH sessions"

Sorry, I thought it already did that.

Change https://golang.org/cl/217722 mentions this issue: cmd/coordinator,buildlet: keep active GCE SSH sessions alive

This change has been deployed. I've managed to keep a GCE based gomote session alive for hours.

I'm going to close this issue.

@ianlancetaylor Hopefully you can finish your debugging of OpenBSD now!

I just verified that my test instance was successfully deleted by the remote buildlet cleanup process (as opposed to the abandoned VM process), as intended.

Thanks!
