--- FAIL: TestSetuidEtc (0.01s)
syscall_linux_test.go:632: [0] "Setegid(1)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t1\t0\t1" (bad)
syscall_linux_test.go:632: [2] "Seteuid(1)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t1\t0\t1" (bad)
syscall_linux_test.go:632: [4] "Setgid(1)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t1\t1\t1\t1" (bad)
syscall_linux_test.go:632: [6] "Setgroups([]int{0,1,2,3})" comparison: "/proc/22053/status" got:"Groups:" want:"Groups:\t0 1 2 3" (bad)
syscall_linux_test.go:632: [8] "Setgroups([]int{0})" comparison: "/proc/22053/status" got:"Groups:" want:"Groups:\t0" (bad)
syscall_linux_test.go:632: [9] "Setregid(101,0)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t0\t0" (bad)
syscall_linux_test.go:632: [10] "Setregid(0,102)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t102\t102\t102" (bad)
syscall_linux_test.go:632: [12] "Setreuid(1,0)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t1\t0\t0\t0" (bad)
syscall_linux_test.go:632: [13] "Setreuid(0,2)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t2\t2\t2" (bad)
syscall_linux_test.go:632: [15] "Setresgid(101,0,102)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t102\t0" (bad)
FAIL
FAIL syscall 0.309s
2020-11-06T19:42:05-362d25f/linux-ppc64-buildlet
2020-10-30T00:03:40-01efc9a/linux-ppc64-buildlet
2020-10-29T19:03:09-f2c0c2b/linux-ppc64-buildlet
2020-10-23T23:01:52-64dc25b/linux-ppc64-buildlet
2020-10-23T20:54:25-5f616a6/linux-ppc64-buildlet
See previously #42178 (@AndrewGMorgan, @laboger, @ianlancetaylor)
Marking as release-blocker at least until we determine whether this is due to an API change in Go 1.16 (per https://github.com/golang/go/issues/42178#issuecomment-716511284).
Please assign this bug report to me.
I noticed this on the dashboard. It seems like there is some kind of timing issue in reading the /proc/
Reviewing the failure logs, these three all share the same failure characteristics (root cause currently unknown)
Looking at build.golang.org, this failure mode appears to be rare.
These failures are believed to have been resolved in 3a819e8 11 days ago (ie., the failure signature matches something explicitly fixed with that commit):
Looking for patterns, the three failures are listed below. The Commonality visible in the first and third log appears to be that, as Lynn points out, the listed thread id is common to the failing test case.
I'm struck by how close together the three listed thread IDs are: 22053, 21921 and 21923. Is this to be expected? For example, is the build run in a container launched just to run the build tests, so the number of processes consumed by the time the test runs is generally about the same? Or is the available process range pretty small in this container and this is just evidence of ID reuse?
From debugging #42178 we determined that ppc64 (sans "le") runs without cgo, so this code is executing the syscall.AllThreadsSyscall() function behind the scenes when the test calls syscall.Set*(). This may or may not be important. I wonder if similar sporadic failures have been seen in the nocgo build tests on other platforms?
--- FAIL: TestSetuidEtc (0.01s)
syscall_linux_test.go:632: [0] "Setegid(1)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t1\t0\t1" (bad)
syscall_linux_test.go:632: [2] "Seteuid(1)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t1\t0\t1" (bad)
syscall_linux_test.go:632: [4] "Setgid(1)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t1\t1\t1\t1" (bad)
syscall_linux_test.go:632: [6] "Setgroups([]int{0,1,2,3})" comparison: "/proc/22053/status" got:"Groups:" want:"Groups:\t0 1 2 3" (bad)
syscall_linux_test.go:632: [8] "Setgroups([]int{0})" comparison: "/proc/22053/status" got:"Groups:" want:"Groups:\t0" (bad)
syscall_linux_test.go:632: [9] "Setregid(101,0)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t0\t0" (bad)
syscall_linux_test.go:632: [10] "Setregid(0,102)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t102\t102\t102" (bad)
syscall_linux_test.go:632: [12] "Setreuid(1,0)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t1\t0\t0\t0" (bad)
syscall_linux_test.go:632: [13] "Setreuid(0,2)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t2\t2\t2" (bad)
syscall_linux_test.go:632: [15] "Setresgid(101,0,102)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t102\t0" (bad)
--- FAIL: TestSetuidEtc (0.01s)
syscall_linux_test.go:632: [0] "Setegid(1)" comparison: "/proc/21921/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t1\t0\t1" (bad)
--- FAIL: TestSetuidEtc (0.04s)
syscall_linux_test.go:632: [1] "Setegid(0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t0\t0\t0" (bad)
syscall_linux_test.go:632: [2] "Seteuid(1)" comparison: "/proc/21923/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t1\t0\t1" (bad)
syscall_linux_test.go:632: [4] "Setgid(1)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t1\t1\t1\t1" (bad)
syscall_linux_test.go:632: [5] "Setgid(0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t0\t0\t0" (bad)
syscall_linux_test.go:632: [6] "Setgroups([]int{0,1,2,3})" comparison: "/proc/21923/status" got:"Groups:" want:"Groups:\t0 1 2 3" (bad)
syscall_linux_test.go:632: [8] "Setgroups([]int{0})" comparison: "/proc/21923/status" got:"Groups:" want:"Groups:\t0" (bad)
syscall_linux_test.go:632: [9] "Setregid(101,0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t101\t0\t0\t0" (bad)
syscall_linux_test.go:632: [10] "Setregid(0,102)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t102\t102\t102" (bad)
syscall_linux_test.go:632: [11] "Setregid(0,0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t0\t0\t0" (bad)
syscall_linux_test.go:632: [12] "Setreuid(1,0)" comparison: "/proc/21923/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t1\t0\t0\t0" (bad)
syscall_linux_test.go:632: [13] "Setreuid(0,2)" comparison: "/proc/21923/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t2\t2\t2" (bad)
Looking at this output using the following method, we see that there is evidence that the threads individually stop behaving as part of the collective process. In the first case, this is immediately - the Setegid(1) fails to take. In the third case that call succeeds, but the next Seteuid(1) call fails. Looking at the code for the test
https://github.com/golang/go/blob/master/src/syscall/syscall_linux_test.go#L596
All of the test cases after the first failing one also fail, unless the test is looking for the stuck state. What also seems to happen is that the test stops failing at some point before getting to the last test (which should be numbered [20]). This suggests that whatever is causing the test to fail, cleans itself up before the test reaches the end.
My suspicion is that this test is, in some way, racing thread termination on this architecture's test environment.
"grep uid" on the first and third of these yields:
syscall_linux_test.go:632: [2] "Seteuid(1)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t1\t0\t1" (bad)
syscall_linux_test.go:632: [12] "Setreuid(1,0)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t1\t0\t0\t0" (bad)
syscall_linux_test.go:632: [13] "Setreuid(0,2)" comparison: "/proc/22053/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t2\t2\t2" (bad)
and
syscall_linux_test.go:632: [2] "Seteuid(1)" comparison: "/proc/21923/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t1\t0\t1" (bad)
syscall_linux_test.go:632: [12] "Setreuid(1,0)" comparison: "/proc/21923/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t1\t0\t0\t0" (bad)
syscall_linux_test.go:632: [13] "Setreuid(0,2)" comparison: "/proc/21923/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t2\t2\t2" (bad)
and "grep gid" yields:
syscall_linux_test.go:632: [0] "Setegid(1)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t1\t0\t1" (bad)
syscall_linux_test.go:632: [4] "Setgid(1)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t1\t1\t1\t1" (bad)
syscall_linux_test.go:632: [9] "Setregid(101,0)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t0\t0" (bad)
syscall_linux_test.go:632: [10] "Setregid(0,102)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t102\t102\t102" (bad)
syscall_linux_test.go:632: [15] "Setresgid(101,0,102)" comparison: "/proc/22053/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t102\t0" (bad)
and
syscall_linux_test.go:632: [1] "Setegid(0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t0\t0\t0" (bad)
syscall_linux_test.go:632: [4] "Setgid(1)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t1\t1\t1\t1" (bad)
syscall_linux_test.go:632: [5] "Setgid(0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t0\t0\t0" (bad)
syscall_linux_test.go:632: [9] "Setregid(101,0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t101\t0\t0\t0" (bad)
syscall_linux_test.go:632: [10] "Setregid(0,102)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t102\t102\t102" (bad)
syscall_linux_test.go:632: [11] "Setregid(0,0)" comparison: "/proc/21923/status" got:"Gid:\t0\t1\t0\t1" want:"Gid:\t0\t0\t0\t0" (bad)
Racing with thread termination would also explain the second failure.
This passes for me (total running time about 1m26s) on linux-ppc64-buildlet:
go/bin/go test syscall -v -run=TestSetuidEtc -count=10000
Given the rarity of this failure mode, I'm going to have to think a bit more about how to fix it.
Without the -v it runs quicker, but similarly no failures. I'm going to explore modifying the test to see if I can make it fail quicker, and/or more verbosely.
Based on the theory that the issue is possibly a race somewhere with terminating a thread, I modified src/syscall/syscall_linux_test.go with the following to aggressively terminate OS threads [*]:
diff --git a/src/syscall/syscall_linux_test.go b/src/syscall/syscall_linux_test.go
index 41ae8cc5a1..051eb3f204 100644
--- a/src/syscall/syscall_linux_test.go
+++ b/src/syscall/syscall_linux_test.go
@@ -550,13 +550,13 @@ func compareStatus(filter, expect string) error {
for _, f := range fs {
tf := fmt.Sprintf("/proc/%s/status", f.Name())
d, err := ioutil.ReadFile(tf)
- if os.IsNotExist(err) {
- // We are racing against threads dying, which
- // is out of our control, so ignore the
- // missing file and skip to the next one.
- continue
- }
if err != nil {
+ if os.IsNotExist(err) || strings.HasSuffix(err.Error(), ": no such process") {
+ // We are racing against threads dying, which
+ // is out of our control, so ignore the
+ // missing file and skip to the next one.
+ continue
+ }
return fmt.Errorf("unable to read %q: %v", tf, err)
}
lines := strings.Split(string(d), "\n")
@@ -574,6 +574,14 @@ func compareStatus(filter, expect string) error {
return nil
}
+// killAThread locks the go routine to an OS thread and exits; this
+// causes a thread to terminate.
+func killAThread(c <-chan struct{}) {
+ runtime.LockOSThread()
+ <-c
+ return
+}
+
// TestSetuidEtc performs tests on all of the wrapped system calls
// that mirror to the 9 glibc syscalls with POSIX semantics. The test
// here is considered authoritative and should compile and run
@@ -624,6 +632,9 @@ func TestSetuidEtc(t *testing.T) {
}
for i, v := range vs {
+ c := make(chan struct{})
+ go killAThread(c)
+ close(c)
if err := v.fn(); err != nil {
t.Errorf("[%d] %q failed: %v", i, v.call, err)
continue
[*] Note, this patch includes a fix that is unrelated to the bug in question, and fixes an issue with the last test fix... I was able to get some failures on my non-ppc64 workstation, but those turned out to be due to the fact that os.IsNotExist(err) is not always the correct way to recognize a thread ending on Linux.
Using that I generated some similar failures on linux-ppc64-buildlet:
$ [..magic to run remotely..] go/bin/go test syscall -run=TestSetuidEtc -count=1000
--- FAIL: TestSetuidEtc (0.09s)
syscall_linux_test.go:643: [14] "Setreuid(0,0)" comparison: "/proc/1795/status" got:"Uid:\t0\t2\t2\t2" want:"Uid:\t0\t0\t0\t0" (bad)
syscall_linux_test.go:643: [15] "Setresgid(101,0,102)" comparison: "/proc/1795/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t102\t0" (bad)
--- FAIL: TestSetuidEtc (0.03s)
syscall_linux_test.go:643: [12] "Setreuid(1,0)" comparison: "/proc/2462/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t1\t0\t0\t0" (bad)
syscall_linux_test.go:643: [13] "Setreuid(0,2)" comparison: "/proc/2462/status" got:"Uid:\t0\t0\t0\t0" want:"Uid:\t0\t2\t2\t2" (bad)
--- FAIL: TestSetuidEtc (0.13s)
syscall_linux_test.go:643: [19] "Setresuid(0,2,1)" comparison: "/proc/3144/status" got:"Uid:\t1\t0\t2\t0" want:"Uid:\t0\t2\t1\t2" (bad)
--- FAIL: TestSetuidEtc (0.14s)
syscall_linux_test.go:643: [7] "Setgroups(nil)" comparison: "/proc/8741/status" got:"Groups:\t0 1 2 3" want:"Groups:" (bad)
syscall_linux_test.go:643: [8] "Setgroups([]int{0})" comparison: "/proc/8741/status" got:"Groups:\t0 1 2 3" want:"Groups:\t0" (bad)
syscall_linux_test.go:643: [9] "Setregid(101,0)" comparison: "/proc/8741/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t101\t0\t0\t0" (bad)
syscall_linux_test.go:643: [10] "Setregid(0,102)" comparison: "/proc/8741/status" got:"Gid:\t0\t0\t0\t0" want:"Gid:\t0\t102\t102\t102" (bad)
FAIL
FAIL syscall 30.086s
FAIL
Error running run: exit status 1
Change https://golang.org/cl/268717 mentions this issue: syscall: improve TestSetuidEtc() /proc/ parsing against races
The issue appears to be another race with interpreting the /proc/ filesystem in the testing code:
Note, in passing, that the master branch just threw up another failure for linux-ppc-buildlet:
https://build.golang.org/log/650f97099ef1a84e84bb7f2d17f63a20d4c04559
This doesn't yet include the https://golang.org/cl/268717 fix.
So, the first version of the changes unexpectedly failed:
2 of 23 SlowBots failed:
Failed on linux-amd64: https://storage.googleapis.com/go-build-log/fd4e8e86/linux-amd64_038a24f9.log
Failed on linux-386: https://storage.googleapis.com/go-build-log/fd4e8e86/linux-386_a3c5af8b.log
My plan is to revert the thread termination part of the change to misc/cgo/test/issue1435.go and retry. I'm not yet clear if this is a newly uncovered bug, or something related to what I thought I was fixing.
Some investigation later, there were two issues:
1) I had missed some needed bits from the syscall_linux_test.go -> issue1435.go test upgrade.
2) as suspected the forced thread termination is somehow not working as expected.
My plan is to submit a fix for this bug in two stages - fix the original bug as reported upto and including (1) of this comment.
Then investigate why (2) is failing. It could be some subtlety of glibc's thread implementation, since launching C.pthread_create() is the only way the issue1435.go test differs from the one in syscall_linux_test.go.
These two test continue to fail:
2 of 23 SlowBots failed:
Failed on linux-amd64: https://storage.googleapis.com/go-build-log/d73fad79/linux-amd64_0f499959.log
Failed on linux-386: https://storage.googleapis.com/go-build-log/d73fad79/linux-386_1a608de2.log
All evidence of (1) is gone now, so I'll focus on what is up with the cgo failures now.
The net change with this fix "shouldn't" have triggered any new failures, so I suspect we're tripping over what must have been let's say a heisenbug lurking all along with the cgo side of the feature... [Here's hoping it is not a glibc issue.]
I've decided to split this issue into two (see #42494 for the aspect of the issue I plan not to resolve in this present bug). For this present bug, I want to get the checked in source back to testing stability by addressing the issue with the ppc64 build testing.