We've had several reports of memory corruption on Linux 5.3.x (or later) kernels from people running tip since asynchronous preemption was committed. This is a super-bug to track these issues. I suspect they all have one root cause.
Typically these are "runtime error: invalid memory address or nil pointer dereference" or "runtime: unexpected return pc" or "segmentation violation" panics. They can also appear as self-detected data corruption.
If you encounter a crash that could be random memory corruption, are running Linux 5.3.x or later, and are running a recent tip Go (after commit 62e53b79227dafc6afcd92240c89acb8c0e1dd56), please file a new issue and add a comment here. If you can reproduce it, please try setting "GODEBUG=asyncpreemptoff=1" in your environment and seeing if you can still reproduce it.
Duplicate issues (I'll edit this comment to keep this up-to-date):
- runtime: corrupt binary export data seen after signal preemption CL (#35326): Corruption in file version header observed by vet. Moderately reproducible. Strong leads.
- cmd/compile: panic during early copyelim crash (#35658): Invalid memory address in cmd/compile/internal/ssa.copyelim. Not reproducible. Nothing obvious in stack trace. Haven't dug into assembly.
- runtime: SIGSEGV in mapassign_fast64 during cmd/vet (#35689): Invalid memory address in runtime.mapassign_fast64 in vet. Stack trace includes random pointers. Some assembly decoding work.
- runtime: unexpected return pc for runtime.(*mheap).alloc (#35328): Unexpected return pc. Stack trace includes random pointers. Not reproducible.
- cmd/dist: I/O error: read src/xxx.go: is a directory (#35776): Random misbehavior. Not reproducible.
- runtime: "fatal error: mSpanList.insertBack" in mallocgc (#35771): Bad mspan next pointer (random and unaligned). Not reproducible.
- cmd/compile: invalid memory address or nil pointer dereference in gc.convlit1 (#35621): Invalid memory address in cmd/compile/internal/gc.convlit1. Evidence of memory corruption, though no obvious random pointers. Not reproducible.
- cmd/go: unexpected signal during runtime execution (#35783): Corruption in file version header observed by vet. Not reproducible.
- runtime: unexpected return pc for runtime.systemstack_switch (#35592): Unexpected return pc. Stack trace includes random pointers. Not reproducible.
- cmd/compile: random compile error running tests (#35760): Compiler data corruption. Not reproducible.
@aclements for your records, https://github.com/golang/go/issues/35328 and https://github.com/golang/go/issues/35776 might be related as well. Those two were on the same Linux 5.3.x machine of mine.
Thanks @mvdan. I've folded those into the list above.
@aclements just saw https://github.com/golang/go/issues/35783 for the record.
If you think we have enough "evidence" please say and I'll stop creating issues for now 😄
Have we roughly bisected which Linux versions are affected? Looking at the kernel changes in that region might yield a clue about where the bug is and whose it is.
5.3 = bad.
5.2 = ?
In https://github.com/golang/go/issues/35326#issuecomment-557223145, I used Arch's 4.19 LTS and could not reproduce the bexport corruption. However, the kernel configuration differs between 4.19 and 5.3, so that may be unscientific. (I'm letting my machine rebuild 5.3 without PREEMPT set to see if that's the problem, but I have doubts. EDIT: It was not PREEMPT, so setting up a builder with a newer kernel would likely be good regardless.)
What set of kernels do the current Linux builders use? That might provide a lower bound, as I've never seen the issue there.
(I'd bring up #9505 to advocate for an Arch builder, but that issue is more about everything _but_ the kernel version. I feel like there should be some builder which is at the latest Linux kernel, whatever that may be.)
The existing Go Linux builders use Container-Optimized OS with Linux kernel 4.19.72+.
Thanks @myitcv, I think we have enough reports. If you do happen to find another one that's reproducible, that would be very helpful, though.
To recap experiments last Friday (and I rechecked the test for the more mystifying of these Sunday afternoon), Cherry and I tried the following:
- Doubled the size of the sigaltstack, just in case. Also sanity-checked the bounds within gdb; they were okay.
- Modified the definition of fpstate to conform to what is defined in the Linux header files.
- Modified sigcontext to use the new Xstate:
  fpstate *Xstate // *fpstate1
- Wrote a method to allow us to store the ymm registers that were supplied (as registers) to the signal handler, then:
1) Tried an experiment in the assembly language handler to trash the YMM registers (not the data structures) before return. We never saw any sign of the trash, but this seemed to raise the failure rate (running "go vet all"). The trashing string stored was "This_is_a_test. "
2) Tried printing the saved and current ymm registers in sigtrampgo.
The saved ones looked like memmove artifacts (source code while running vet all), and the current ones were always zero. The memmove artifacts largely stayed unchanged between signals.
I rechecked the code that did this earlier today, just in case we got it wrong.
3) Made a copy of the saved xmm and ymm registers on sigtrampgo entry, then checked the copy against the saved registers, to see if our code ever somehow modified them. That check never fired.
I spent some time Saturday looking for "interesting" comments in the Linux git log; I have some to review. What I am wondering is whether there was some attempt to optimize saving of the ymm registers that got fouled up. One thing I wonder a little about is what they are doing for power management around AVX use; I saw some mention of that.
(I.e., what triggers AVX use, can they "save power" if they don't touch the registers, if they believe AVX is not being used? Suppose they rely on some hardware bit that isn't set under exactly the expected conditions?)
// Xstate mirrors the kernel's struct _xstate: the legacy FXSAVE area,
// the xstate header, and the YMM high halves.
type Xstate struct {
Fpstate Fpstate
Hdr Header
Ymmh Ymmh_state
}
type Fpstate struct {
Cwd uint16
Swd uint16
Twd uint16
Fop uint16
Rip uint64
Rdp uint64
Mxcsr uint32
Mxcsr_mask uint32
St_space [32]uint32
Xmm_space [64]uint32
Reserved2 [12]uint32
Reserved3 [12]uint32
}
type Header struct {
Xfeatures uint64
Reserved1 [2]uint64
Reserved2 [5]uint64
}
type Ymmh_state struct {
Space [64]uint32
}
// getymm stores YMM0-YMM15 into the buffer whose address is its first argument.
TEXT runtime·getymm(SB),NOSPLIT,$0
MOVQ 0(FP), AX
VMOVDQU Y0,(0*32)(AX)
VMOVDQU Y1,(1*32)(AX)
VMOVDQU Y2,(2*32)(AX)
VMOVDQU Y3,(3*32)(AX)
VMOVDQU Y4,(4*32)(AX)
VMOVDQU Y5,(5*32)(AX)
VMOVDQU Y6,(6*32)(AX)
VMOVDQU Y7,(7*32)(AX)
VMOVDQU Y8,(8*32)(AX)
VMOVDQU Y9,(9*32)(AX)
VMOVDQU Y10,(10*32)(AX)
VMOVDQU Y11,(11*32)(AX)
VMOVDQU Y12,(12*32)(AX)
VMOVDQU Y13,(13*32)(AX)
VMOVDQU Y14,(14*32)(AX)
VMOVDQU Y15,(15*32)(AX)
RET
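For reference, here is a hedged sketch of how the Go-side declaration and a caller for the getymm helper above might look. The function and parameter names and the 512-byte buffer framing are assumptions for illustration, not the actual experimental patch:

```go
// Hypothetical Go-side counterpart to the getymm assembly above,
// declared inside package runtime. 16 YMM registers x 32 bytes = 512 bytes.

//go:noescape
func getymm(buf *[16 * 32]byte)

// snapshotYMM captures the current YMM register contents, e.g. to compare
// against the state the kernel saved on the signal stack in sigtrampgo.
func snapshotYMM() [16 * 32]byte {
	var buf [16 * 32]byte
	getymm(&buf)
	return buf
}
```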
An update from over in #35326: I've bisected the issue to kernel commit torvalds/linux@d9c9ce34ed5c892323cbf5b4f9a4c498e036316a, which happened between v5.1 and v5.2. It also requires the kernel to be built with GCC 9 (GCC 8 does not reproduce the issue).
Not sure where Austin's reporting this or if he had time today, but:
arch/x86/kernel/fpu/signal.c
All of the progress updates have been going on #35326. (Most recently, https://github.com/golang/go/issues/35326#issuecomment-558371242.)
There is this commit that claims to be fixing something in the culprit commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b81ff1013eb8eef2934ca7e8cf53d553c1029e84
I don't know if it will help or not, but @aclements, if you have a test setup ready, it may be worth cherry-picking and trying.
I think that commit is already included in the 5.2 and 5.3 kernels, which still have the problem.
Thanks @dvyukov. I just re-confirmed that I can still reproduce it in the same way on 5.3, which includes that commit. I'll double check that I can still reproduce right at that commit, just in case it was somehow re-introduced later.
Reproduced at torvalds/linux@b81ff1013eb8eef2934ca7e8cf53d553c1029e84, as well as v5.4, which was just released.
I've filed the upstream kernel bug here: https://bugzilla.kernel.org/show_bug.cgi?id=205663
Amazing debugging!
So, the next question is... do we add an unconditional workaround in the Go runtime, or add a workaround guarded by a run-time check or environment setting, or advise users to avoid the affected kernels?
That's a good question.
@dr2chase suggested a clever, simple workaround, which is to use a CAS to ensure the top page of the gsignal stack is faulted in just before sending the signal (the CAS ensures it's write-faulted without the danger of a racing write). I may do that now just to head off more memory corruption bugs. Though I'm not positive this completely works with funny cgo thread and signal stack configurations.
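A minimal, self-contained sketch of that idea (the names and the user-space framing are mine, not the actual runtime change; in the runtime the buffer would be the gsignal stack and the touch would happen just before sending the preemption signal):

```go
package main

import (
	"sync/atomic"
	"unsafe"
)

// faultInStackTop forces a write fault on the last word of a buffer standing
// in for a signal stack, so the kernel maps its top page writable before any
// signal is delivered onto it. CASing 0 for 0 never changes memory even if it
// races with a real writer, but the locked compare-and-swap still needs write
// access to the page, which is all we want here.
func faultInStackTop(stack []byte) {
	p := (*uint32)(unsafe.Pointer(&stack[len(stack)-4]))
	atomic.CompareAndSwapUint32(p, 0, 0)
}

func main() {
	// Stand-in for a freshly allocated signal stack.
	sigStack := make([]byte, 32<<10)
	faultInStackTop(sigStack)
}
```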
On the other hand, this only happens in really bleeding-edge kernels. Assuming it gets fixed upstream quickly, the people who are running bleeding-edge kernels will continue to run bleeding-edge kernels, and will get the kernel fix.
Maybe we put in the workaround for 1.14 and remove it later.
If upstream responds quickly to the bug, and there's a Linux 5.3.x bugfix release before Go 1.14 is out, I'd say we could avoid adding a workaround in Go. But if they defer the fix until 5.4 I'm not so sure; bleeding edge systems typically take a few weeks or even months to pick up the latest major kernel releases, so we could end up with Go 1.14 released while some bleeding edge systems are still on 5.3.x.
This only works for the signals sent by the Go runtime, not external signals. But if we do this along with the initial faulting-in, the preemption signals will probably be frequent enough to keep the signal stack always paged in.
We just chatted about workarounds and the favorite workaround is to check the kernel version and disable AVX use on the known-bad kernels. This way it doesn't matter where the signals are coming from or who set up the signal stacks. The solution focuses on the AVX corruption. (We would, of course, also mention this in the release notes.)
Where "known-bad" means only >= 5.2 for now (without an upper bound), until a fix is released, and then we backport a change to the release branch to add an upper bound for when AVX gets enabled again?
Might be more robust to make “known-bad” mean 5.2–5.4, and backport a change if it still isn't fixed in 5.5 (or _is_ backported to others in that range).
[edit:] Hmm, but I guess the workaround probably isn't harmful elsewhere, so ≥5.2 seems ok.
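For concreteness, here is a rough user-space sketch of what such a "known-bad kernel" check could look like. The helper names are mine, and the real runtime would have to parse the uname release buffer itself rather than import golang.org/x/sys:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"

	"golang.org/x/sys/unix"
)

// kernelVersion extracts the major and minor version from the uname
// release string, e.g. "5.3.11-1-MANJARO" -> (5, 3). Pre-release
// suffixes like "-rc1" in the minor component are not handled.
func kernelVersion() (major, minor int, err error) {
	var uts unix.Utsname
	if err := unix.Uname(&uts); err != nil {
		return 0, 0, err
	}
	rel := unix.ByteSliceToString(uts.Release[:])
	parts := strings.SplitN(rel, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected kernel release %q", rel)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

func main() {
	major, minor, err := kernelVersion()
	if err != nil {
		panic(err)
	}
	// "Known-bad" as discussed above: 5.2 and later, with no upper
	// bound until a fixed kernel release exists.
	bad := major > 5 || (major == 5 && minor >= 2)
	fmt.Printf("kernel %d.%d, apply workaround: %v\n", major, minor, bad)
}
```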
Add #35158 to the list of issues likely caused by this.
Maybe this could be useful. I just tried to build tip (with all.bash) on two almost identical systems with gcc 9.2.0.
One, with kernel
Linux s1 5.3.11-1-MANJARO #1 SMP PREEMPT Wed Nov 13 12:21:14 UTC 2019 x86_64 GNU/Linux
builds and passes all the tests just fine.
The other, with kernel
Linux s2 5.3.12-1-MANJARO #1 SMP PREEMPT Thu Nov 21 10:55:53 UTC 2019 x86_64 GNU/Linux
builds successfully, but tests randomly fail.
@klauspost, I think #35158 is unrelated to this issue. While this issue technically applies to Go 1.12 and 1.13, it requires an application that's receiving a lot of signals. The corruption in that issue's stack traces also doesn't look like the corruption we typically see as a result of this issue.
We just chatted about workarounds and the favorite workaround is to check the kernel version and disable AVX use on the known-bad kernels.
Is the register file corruption limited to AVX? From looking at the relevant kernel commits, it would seem like any FPU registers could potentially be affected.
As already mentioned in #35326 there is a patch from Sebastian Andrzej Siewior now. He asks the reporters to test and verify.
@mdempsky, you're right. This also affects XMM registers. That means we can't work around it by just disabling AVX. We use XMM registers all over the place.
Don't know if Austin is verifying this already or not, but I'm about to fire off a build, go to lunch, then test.
@knweiss, thanks for pointing that out (not sure how I would have found it otherwise...). I've replied on the kernel bug, since I'm not subscribed to LKML.
Austin confirms a fix; I tested it on Monday's Linux tip and it also works for me.
Insofar as I understand the problem, the only reliable workaround would seem to be to mlock the signal stack, or at least the first page of the signal stack, into memory.
As @cherrymui said, using an atomic CAS before sending a signal only helps with signals that we send ourselves. Seems like programs will still be vulnerable to other signals. If we decide that we don't care about that--nobody has been reporting a problem, after all--then it seems simpler to just disable signal preemption on the kernels known to have this problem. Disabling signal preemption is not so bad; it just means that Go 1.14 acts like earlier versions of Go.
How bad is it to mlock the first page of each signal stack?
It seems to me that disabling signal preemption is strictly worse than pre-poking the page with a CAS. Both leave us equally vulnerable to this bug triggered by not-our-signals (an apparently tiny risk, since none have been reported), but disabling signal preemption means that we'll have OS-linked latency-glitches, for a few versions of Linux (5.2, 5.3, 5.4). The cost of the CAS compared to sending the signal is small.
But mlock still seems preferable -- maybe we only mlock on the affected versions of Linux? If anyone complains, we have an obvious workaround, and we eliminate all that risk without OS-linked latency weirdness.
You're right that disabling signal preemption is strictly worse, but it is very simple and unlikely to have any subtle ramifications.
Linux 5.2 is no longer supported upstream, and Linux 5.3 will be obsolete very soon. Only Linux 5.4 is LTS and the issue is very likely to be fixed in Linux 5.4.1.
See: https://en.wikipedia.org/wiki/Linux_kernel_version_history#Releases_5.x.y
I think I was hit by this issue today. Submitted the details to https://github.com/golang/go/issues/35900. Two crashes (unexpected return pc) and one pass when running all.bash on rev 8054b13536.
Running with export GODEBUG=asyncpreemptoff=1, I no longer get crashes. At least not after three consecutive runs of all.bash.
The patch didn't make 5.4.1, maybe 5.4.2..
The patch was merged: https://github.com/torvalds/linux/commit/59c4bd853abcea95eccc167a7d7fd5f1a5f47b98
Hopefully 5.4.2, then. :slightly_smiling_face:
Thanks for the updates. Out of curiosity, does anybody who follows Linux kernel development closer than I do know if this would get backported to a 5.2 or 5.3 patch release? I know 5.4 is an LTS release, but I don't know how that affects further patch releases of older minor releases.
@aclements, Linux 5.2 is EOL and will not have any more patch releases. Linux 5.3 is still active and will get bugfixes backported:
Change https://golang.org/cl/209597 mentions this issue: runtime: add a simple version number parser
Change https://golang.org/cl/209598 mentions this issue: runtime: disable signal-based preemption on Linux 5.2–5.4.1
Change https://golang.org/cl/209599 mentions this issue: runtime: work around SSE save bug in Linux 5.2–5.4.1
I've mailed two possible workarounds:
- CL 209598 checks the kernel version and disables signal-based preemption for the affected versions. This implements what we talked about, but it's a fair amount of code, mostly to figure out the kernel version.
- CL 209599 just touches the signal stack before sending the preemption signal. It's a one-line change and obviously harmless.
Both workarounds only affect preemption signals. There's still a danger of other asynchronous signals causing this, but working around that is much harder and still imperfect.
Personally I prefer the version number check but I don't feel that strongly about it. Thanks for writing these.
I don't feel very strongly about it, either. I lean toward touching the signal stack just because it's one line of code instead of 161, doesn't disable any features, and is likely to coincidentally mitigate any issues caused by this bug for other signals.
Would we remove the "touch the signal stack" fix eventually, such as when the tree reopens for 1.15?
Also, for those keeping count, Arch Linux just upgraded to 5.4.1, so it should get 5.4.2 as soon as that's out. That's one less bleeding edge distro that would be affected by this bug without a workaround.
Prefer touching the signal stack because it reduces the need to explain things to users, though I hope we're talking about a small number of people anyway (latency-sensitive Go users on latest-N-greatest Linux for the next few months). And, also, it will be trivial to remove if/when we get around to doing that.
Why not both? 🤷‍♂️
Since we've isolated the issue to 5.2–5.4, and 5.2 is EOL while 5.3 and 5.4 should have fixes soon... Will there still be a need to work around this issue in 2 months once Go 1.14 is released?
I think we need a workaround for the beta, which is imminent. Beyond that, my inclination is to keep a workaround in place for 1.14 because the workaround is low cost and the impact of this bug is so subtle. But I'm okay with removing the workaround in 1.15, given the kernel release cycles.
@mdempsky, if we were confident that the kernels won't be in use then, maybe. But if there's a chance people will be running Go on them, we really don't want more of these bug reports.
@bradfitz Can we change the GitHub issue template to supply us with the Linux kernel version? That would at least make it easy to recognize that someone is using an affected kernel and suggest they update.
I think both of Austin's CLs are fine. I just think if it's impossible to completely work around the issue from userspace, then we should really push on getting it fixed in the kernel and for systems to get those fixes.
@mdempsky, the kernel version is already in the output of go bug, but people don't use that often, and the environment where the Go binary is deployed and hitting problems a) likely doesn't have the "go" command available, and b) likely has a different kernel version than their dev machine.
I suppose we could update https://github.com/golang/go/blob/master/.github/ISSUE_TEMPLATE but the longer we make that, the more often people get overwhelmed and just delete it all.
Runtime (+ stdlib?) panics could print the OS version?
For anyone running Ubuntu, the mainline daily kernel build now contains the relevant fix:
https://kernel.ubuntu.com/~kernel-ppa/mainline/daily/2019-12-03/
As expected, a variant of @mvdan's reproducer fails to trigger the issue reported in https://github.com/golang/go/issues/35326:
while true; do td=$(mktemp -d); export GOCACHE=$td; go clean -testcache -cache; go vet encoding/... |& tee out; grep bexport out && break; rm -rf $td; done
I'll now run with this kernel and tip, upgrading to 5.4.2 when it's released on Ubuntu.
The patch is now in 5.3.15-rc1, so 5.3.15 will have the fix.
Change https://golang.org/cl/209899 mentions this issue: runtime: mlock top of signal stack on Linux 5.2–5.4.1
Okay, one more shot at the workaround. @cherrymui pointed out the issue also almost certainly affects profiling signals, and neither of the workarounds I posted can be applied to profiling signals. So I went ahead and wrote the mlock solution: CL 209899. It's only slightly more complex than the workaround to disable asynchronous preemption, since the complexity of both is dominated by getting and parsing the kernel version.
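A hedged sketch of the mlock approach in ordinary user-space Go (the real change lives inside the runtime and uses its own syscall wrappers; the helper names here are illustrative):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// mlockStackTop locks the page containing the top of a (signal) stack buffer
// into memory, so it can never be paged out and the kernel bug that corrupts
// vector registers when signal return faults on the stack can't trigger there.
func mlockStackTop(stack []byte) error {
	pageSize := uintptr(os.Getpagesize())
	hi := uintptr(unsafe.Pointer(&stack[len(stack)-1])) + 1
	top := (hi - 1) &^ (pageSize - 1) // start of the page containing the stack top
	if _, _, errno := syscall.Syscall(syscall.SYS_MLOCK, top, pageSize, 0); errno != 0 {
		// Mirrors the runtime's failure mode: a low RLIMIT_MEMLOCK
		// (ulimit -l) makes this fail with ENOMEM (errno 12).
		return fmt.Errorf("mlock of signal stack failed: %v", errno)
	}
	return nil
}

func main() {
	sigStack := make([]byte, 32<<10)
	if err := mlockStackTop(sigStack); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```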
And the patch was just merged into Linux 5.4.2.
Fixed by commit 8174f7fb2b64c221f7f80c9f7fd4d7eb317ac8bb (I messed up the magic commit syntax)
I've filed a tracking bug to remove the workaround for Go 1.15: #35979.
To close with a summary:
Linux 5.2 introduced a bug when compiled with GCC 9 that could cause vector register corruption on amd64 on return from a signal handler where the top page of the signal stack had not yet been paged in. This can affect any program in any language (assuming it uses at least SSE registers), including versions of Go before 1.14, and generally results in arbitrary memory corruption. It became significantly more likely in Go 1.14 because the addition of signal-based non-cooperative preemption significantly increased the number of asynchronous signals received by Go processes. It's also somewhat more likely in Go than other languages because Go regularly creates new OS threads with alternate signal stacks that are likely not to be paged in.
The kernel bug was fixed in Linux 5.3.15 and 5.4.2, and the fix will appear in all 5.5 and future releases. 5.4 is a long-term support release, and 5.4.2 was released with the fix just 10 days after 5.4.
For Go 1.14, we introduced a workaround that mlocks the top page of each signal stack on the affected kernel versions to ensure it is paged in and remains paged in.
Thanks everyone for helping track this down!
I'll keep this issue pinned until next week for anyone running a tip from before the workaround.
Do you have an idea how much memory is going to be mlocked? Distros have different values for RLIMIT_MEMLOCK, some of them are pretty low.
Looks like the workaround CL only applies to linux/amd64. Shouldn't it apply to linux/386 too?
Looks like the workaround CL only applies to linux/amd64. Shouldn't it apply to linux/386 too?
Elsewhere @aclements said he's been unable to reproduce the problem on 386.
@lmb How low is "pretty low"? The expected number of locked pages is O(threads) (not goroutines), since it is one page per thread. Unless you have a lot of goroutines tied to threads, it ought to be about GOMAXPROCS pages, plus a few for bad luck. And this is also tied to just a few versions of Linux that we hope nobody will be using a year from now.
Change https://golang.org/cl/210098 mentions this issue: runtime: give useful failure message on mlock failure
Elsewhere @aclements said he's been unable to reproduce the problem on 386.
Hm, where was that? I looked through this issue and #35326, and didn't notice any comments to that effect.
@aclements did mention that it also affects XMM registers, which are available on 386. The Linux kernel fix looks generic to all of x86, not amd64-specific.
I'm willing to believe it doesn't affect 386 executables, but then I'm curious why not.
@mdempsky In the comments on CL 209899.
@zikaeroh Hiding in plain sight. Thanks.
@mdempsky: https://go-review.googlesource.com/c/go/+/209899/3/src/runtime/os_linux_386.go#7 (it's a little buried)
It may just be harder to reproduce. But we do use the X registers in memmove on 386, so I would still have expected to see it.
@aclements Thanks. Do you mind elaborating how you tested 386? Like the C reproducer exhibits the issue when built with -m64 but not with -m32, when all else is the same (e.g., exact same kernel)?
@dr2chase I did an unrepresentative survey amongst colleagues. Debian (and Ubuntu) allows 64 MB of locked memory by default. Arch Linux only has 64 KB.
@lmb, thanks for the info. At least Arch users will now get a failure message when mlock fails, telling them to update their kernel to a fixed version (at which point mlock of stack tops won't be used).
Speaking of Arch, 5.4.2 just landed on their mirrors.
Do you mind elaborating how you tested 386?
I ran the go vet stress test with a toolchain built with GOHOSTARCH=386 GOARCH=386.
However, I just ran my C reproducer, changed to use XMM instead of YMM and compiled with gcc -m32 -msse4.2 -pthread testxmm.c, and it failed. So I guess 386 has this problem, too. :(
Reopening for 386 fix.
Change https://golang.org/cl/210299 mentions this issue: runtime: suggest 5.3.15 kernel upgrade for mlock failure
FYI, max locked memory is 64KB on Fedora by default (all.bash currently fails). It looks like a 5.3.15 update is in the pipeline, so this failure should be temporary.
I'm also on Fedora, getting
$ GODEBUG=asyncpreemptoff=1 ./make.bash
Building Go cmd/dist using /home/elias/dev/go-1.7. (go1.7.1 linux/amd64)
Building Go toolchain1 using /home/elias/dev/go-1.7.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/amd64.
runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.4.2 or later
fatal error: mlock failed
even with GODEBUG=asyncpreemptoff=1. How can I proceed while waiting for kernel 5.3.15? ulimit -l 4096 doesn't seem to make a difference (ulimit -l still reports 64).
@eliasnaur, modify/add the memlock value in /etc/security/limits.conf?
The cleanest official Fedora way is to create a new 95-memlock.conf file in /etc/security/limits.d/ that has the contents:
* - memlock 131072
Then unfortunately you need to log out and back in again (or ssh to your own machine) to get the new limits applied to your login session. Replace the 131072 by another number if you want a different limit than 128 MBytes; I aimed high because my Fedora machines are single-user machines with only me on them.
Change https://golang.org/cl/210345 mentions this issue: runtime: mlock top of signal stack on both amd64 and 386
The 5.3.15 kernel update has been released for Fedora 30/31. all.bash builds again.
I can confirm the C reproducer program runs correctly on Fedora with the 5.3.15 kernel:
$ gcc -pthread test.c
$ time ./a.out
real 1m0.009s
user 0m0.001s
sys 0m0.004s
$ uname -r
5.3.15-300.fc31.x86_64
This workaround can be problematic for applications that run with limited RLIMIT_MEMLOCK, e.g. systemd services, apps running as root. I bumped into the limit running gVisor with Docker, which inherits limits from the containerd service.
runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
Is there an option to disable mlock'ing pages, and possibly preemption, to avoid random corruptions? Is the plan to remove this workaround in 1.15?
You can disable preemption by setting the environment variable GODEBUG=asyncpreemptoff=1.
But the key point here is that that doesn't avoid random corruption. The random corruption can occur with any program in any language. Using async preemption does make the random corruption more likely. But it can happen regardless.
Therefore, since the mlock call is the only way we know to avoid the corruption, there is no way to disable that mlock call, other than upgrading or downgrading to a fixed kernel.
Change https://golang.org/cl/243658 mentions this issue: runtime: let GODEBUG=mlock=0 disable mlock calls
Change https://golang.org/cl/244059 mentions this issue: runtime: don't mlock on Ubuntu 5.4 systems
Change https://golang.org/cl/246200 mentions this issue: runtime: revert signal stack mlocking