In our Windows implementation of os/Process.Wait, we always do a 5 millisecond sleep:
func (p *Process) wait() (ps *ProcessState, err error) {
handle := atomic.LoadUintptr(&p.handle)
s, e := syscall.WaitForSingleObject(syscall.Handle(handle), syscall.INFINITE)
switch s {
case syscall.WAIT_OBJECT_0:
break
case syscall.WAIT_FAILED:
return nil, NewSyscallError("WaitForSingleObject", e)
default:
return nil, errors.New("os: unexpected result from WaitForSingleObject")
}
var ec uint32
e = syscall.GetExitCodeProcess(syscall.Handle(handle), &ec)
if e != nil {
return nil, NewSyscallError("GetExitCodeProcess", e)
}
var u syscall.Rusage
e = syscall.GetProcessTimes(syscall.Handle(handle), &u.CreationTime, &u.ExitTime, &u.KernelTime, &u.UserTime)
if e != nil {
return nil, NewSyscallError("GetProcessTimes", e)
}
p.setDone()
// NOTE(brainman): It seems that sometimes process is not dead
// when WaitForSingleObject returns. But we do not know any
// other way to wait for it. Sleeping for a while seems to do
// the trick sometimes. So we will sleep and smell the roses.
defer time.Sleep(5 * time.Millisecond)
defer p.Release()
return &ProcessState{p.Pid, syscall.WaitStatus{ExitCode: ec}, &u}, nil
}
Sleeps in code are always gross. They're either too fast (and still flaky) or too slow (incurring extra delays on everybody when not needed).
This bug is about figuring out what we're doing wrong on Windows that made us need that sleep in the first place.
https://go-review.googlesource.com/c/go/+/84175 was one attempt to remove it, but the goal then was fixing a Windows XP test failure (for #17245), and it didn't fix the XP failure. We no longer support XP, though, so maybe that CL is okay now.
But maybe it's not correct.
There was also https://golang.org/cl/84896 from @johnsonj to use jobs to wait for process completion on windows, but it was "not working as expected on Server 2008" and @alexbrainman had concerns in https://github.com/golang/go/issues/17245#issuecomment-353811567 (that whole thread is worth a read, even if it's mostly about debugging an XP issue, which might've been related to this bug.)
If you do not delete executable file immediately after its process exits, then you could remove that 5ms sleep, and nothing will change for you.
So we really need that extra sleep mainly to make cmd/go work properly, because cmd/go runs programs and deletes them immediately.
Perhaps we could move that 5ms sleep from os package and into cmd/go code.
Alex
It should be possible to delete a file after it finishes executing, for any program, not just cmd/go.
This bug is about figuring out what synchronization we're missing.
This bug is about figuring out what synchronization we're missing.
I understand that. I was suggesting what to do in case we cannot find such synchronization.
Alex
Is the sleep needed on every process we want to wait on?
WinApi docs state that
... if you have a thread that creates windows, use MsgWaitForMultipleObjects or MsgWaitForMultipleObjectsEx, rather than WaitForSingleObject.
We might want to use a different wait method on windows?
When I tried to remove the sleep in https://go-review.googlesource.com/c/go/+/84175 I was doing it mostly for debugging Windows XP problems (#2866 and #17245), and it apparently didn't work.
But we don't support Windows XP anymore, so maybe we can remove it now?
But I forget everything about Windows I had learned while researching that patch.
/cc @alexbrainman @jordanrh1
I've seen a lot of code that waits on a process handle, and none of it sleeps. It would be a major bug if you couldn't trust a process handle to be signaled at the right time, and would surely be fixed. The sleep must be masking a problem somewhere else. I'm all for removing the sleep and seeing what shakes loose.
I'm all for removing the sleep and seeing what shakes loose.
SGTM
Alex
Change https://golang.org/cl/145221 mentions this issue: os: remove sleep in windows Process.Wait
@bradfitz @jordanrh1
CL 145221 have broken windows-2008 builder - latest snapshot of https://build.golang.org/

Even one failure on windows-2012 https://build.golang.org/log/d37c2bad6c49d20fe4bec6f1a8d380e583ea79e8
What should we do now?
Alex
Haven't we seen those errors in the past, though? I guess removing the 5ms sleep just makes them more likely?
Is there some other handle we're not closing quickly?
@jordanrh1, could you audit some of these code paths and make sure we're doing things right? (clearly we aren't)
I'm not seeing anything obviously wrong with the code, but these are the things I'm suspecting:
To rule out (1), I want to see if I can repro the issue with a simple C program. If it is (1), then a Sleep may be the best solution, especially if Go's contract is that all files used by the process are released by the time Wait() returns.
Haven't we seen those errors in the past, though? I guess removing the 5ms sleep just makes them more likely?
Yes, these errors happens more often. Maybe it was happening on Windows XP is even more often.
Is there some other handle we're not closing quickly?
I think that once process handle is signaled - syscall.WaitForSingleObject returns - the process is exited. So opened handles could not matter here. But I could be wrong about that.
I'm suspecting:
- might be behavior of OS that it signals process handle before files are released
- might have to do with stdout/stderr handles
- might have to do with child processes
- might be a code defect in os.exec
To rule out (1), I want to see if I can repro the issue with a simple C program. If it is (1), then a Sleep may be the best solution, especially if Go's contract is that all files used by the process are released by the time Wait() returns.
I do not see how to write similar C program for (1). The program that fails is cmd/go tests - how do you write C program that imitate cmd/go test?
I don't see how it could be related to (2). Like I said above, we wait for syscall.WaitForSingleObject on process handle to return, so whatever handles process opens, they should be all closed now. And when new process starts, we use syscall.DuplicateHandle to create stdin, stdout and stderr handles for child process and close the handles immediately - see syscall.StartProcess for details.
I am not sure what to say about (3) and (4).
Should I revert CL 145221 in the meantime?
Alex
Should I revert CL 145221 in the meantime?
Sure? Super sad, but we don't have good short-term leads.
If you do, please reopen this bug and reference both this bug in the commit, and #23171.
/cc @aclements
I do not see how to write similar C program for (1). The program that fails is cmd/go tests - how do you write C program that imitate cmd/go test?
How about creating an executable file (probably by just making a copy of an existing executable), run it, WaitForSingleObject, and then try to delete it, and do that in a loop? If the theory is that Windows signals the process handle before releasing the file, that should test it.
Is there any way to wait for locks on a file to be released? We can't just do that unconditionally here (there could be other copies of the binary running, too), but maybe we could make remove willing to block for a little while if the file is locked before failing?
Yes, what @aclements suggested.
I should mention that I don't have any cycles to test this until the 19th.
Change https://golang.org/cl/148957 mentions this issue: Revert "os: remove sleep in windows Process.Wait"
reopen this bug and reference both this bug in the commit, and #23171.
https://go-review.googlesource.com/c/go/+/148957
How about creating an executable file (probably by just making a copy of an existing executable), run it,
WaitForSingleObject, and then try to delete it, and do that in a loop?
I will try to do what you have asked. I would not be surprised, if the system configuration or OS version will also matter here - imagine, if you have an antivirus program eager to catch executable with virus in them, or another "helpful" service.
Is there any way to wait for locks on a file to be released? We can't just do that unconditionally here (there could be other copies of the binary running, too), but maybe we could make remove willing to block for a little while if the file is locked before failing?
I am not sure what locks you are talking about. Please, explain more.
All errors, we see here, happen during cmd/go test. Perhaps, the exe in questions starts itself as a child, and then parent process exists, and we notice parent exit, but do not wait for the child? But all examples of faults are simple run of a program - they do not look like starting any child processes.
Alex
I will try to do what you have asked.
I created github.com/alexbrainman/winapi/tree/master/issue25965.TestIssue25965
The test runs required exercise for both Go version (only syscall package is used) and C version. The test fails immediately on my Windows 7 computer. But it was much harder to make it fail on my Windows 10 computer - I managed to make it fail after I added a loop of 100 iterations around the test - so be warned that the test might take a while to run. Anyway my Windows 10 output looks like this:
c:\>u:\test
PASS
c:\>u:\test
--- FAIL: TestIssue25965 (84.83s)
main_test.go:166: running c exe failed: exit status 1 iteration 56: CopyFile failed (5)
FAIL
c:\>u:\test
--- FAIL: TestIssue25965 (92.83s)
main_test.go:143: iteration 83 failed: CopyFile: Access is denied.
FAIL
c:\>u:\test
--- FAIL: TestIssue25965 (64.27s)
main_test.go:166: running c exe failed: exit status 1 iteration 14: CopyFile failed (5)
FAIL
c:\>
Let me know, if you want me to do anything else.
Alex
Keep in mind that users with a poorly implemented anti virus installed may get the issue "executable binary is locked a bit after the process exited" anyway, independent on how solid you do the implementation here.
I haven't tested but I think the right solution (if it works) would be:
My proposal assumes FILE_SHARE_DELETE is usable, as requested in #32088.
Keep in mind that users with a poorly implemented anti virus installed may get the issue "executable binary is locked a bit after the process exited" anyway, independent on how solid you do the implementation here.
I agree.
I haven't tested but I think the right solution (if it works) would be:
I wasted enough time here. If you want to try and implement this, go ahead. We will review your code.
Alex
Thanks, anyhow I'd prefer this to be slated on go1.14 since it's a significant behavior change that could have subtle behavior impact so I'll plan to work on this later.
I'd prefer this to be slated on go1.14 since it's a significant behavior change that could have subtle behavior impact so I'll plan to work on this later.
SGTM.
Alex
Change https://golang.org/cl/181541 mentions this issue: cmd/go: factor the I/O-retry logic out of renameio
This comment was originally part of a discussion in "syscall: starting a process with large number of arguments is way too slow #41825". It was suggested that I also post it here:
Thoughts on the 5ms delay:
I have been assured in past conversations with the Go team that without the delay errors sometimes occur when deleting the executable immediately after return from process wait.
A comment in the code says "// NOTE(brainman): It seems that sometimes process is not dead when WaitForSingleObject returns...."
I suspect that a more accurate statement would be "the process is complete but the executable has not yet been closed".
Is the contract of "wait" that the executable is closed on return?
By far most external process launches don't care whether the executable is closed before completion is signaled. Closing of the executable is part of the OS's cleanup after running a process and may be done concurrently after wait returns -- the caller should not have to wait for that cleanup given that it is usually not necessary to wait.
For Windows, those few programs that run processes and then delete the executable (such as "go run"?) might have to deal with an executable that is still open. Note that Unix does not have this concern, since it's OK to remove an open file (then the file is later physically deleted by the OS as soon as it has no openers left).
Suggestion:
This suggestion does require that client programs take care if they want to delete the executable, I would say that programs that don't are relying on undocumented behavior -- I have seen no promise in the documentation that the executable is closed on return from Wait.
Random thought:
Offer an alternative Wait function: WaitAndRemoveExecutable ??
Most helpful comment
This comment was originally part of a discussion in "syscall: starting a process with large number of arguments is way too slow #41825". It was suggested that I also post it here:
Thoughts on the 5ms delay:
I have been assured in past conversations with the Go team that without the delay errors sometimes occur when deleting the executable immediately after return from process wait.
A comment in the code says "// NOTE(brainman): It seems that sometimes process is not dead when WaitForSingleObject returns...."
I suspect that a more accurate statement would be "the process is complete but the executable has not yet been closed".
Is the contract of "wait" that the executable is closed on return?
By far most external process launches don't care whether the executable is closed before completion is signaled. Closing of the executable is part of the OS's cleanup after running a process and may be done concurrently after wait returns -- the caller should not have to wait for that cleanup given that it is usually not necessary to wait.
For Windows, those few programs that run processes and then delete the executable (such as "go run"?) might have to deal with an executable that is still open. Note that Unix does not have this concern, since it's OK to remove an open file (then the file is later physically deleted by the OS as soon as it has no openers left).
Suggestion:
This suggestion does require that client programs take care if they want to delete the executable, I would say that programs that don't are relying on undocumented behavior -- I have seen no promise in the documentation that the executable is closed on return from Wait.
Random thought:
Offer an alternative Wait function: WaitAndRemoveExecutable ??