Runtime: Intermittent: Empty Process stdout

Created on 13 Aug 2018 · 32Comments · Source: dotnet/runtime

Story

When running user-defined console-only processes using System.Diagnostics.Process simultaneously in parallel threads and redirecting their outputs, in some cases you can get empty or non-full output data (from redirected STDOUTs).

Platforms

All platforms (see quoted file in https://github.com/dotnet/runtime/issues/27128#issuecomment-583153102 and other platform-specific implementations).

Reproduction code

The error is intermittent, to reproduce it quickly, you need to run multiple instances of the reproduction program on a computer with multi-core processor (4+ threads recommended). Some "tests" would pass, but some of them would fail.

You can specify the path in process.StartInfo.FileName to any console program, that exits automatically. In the given example, I use the standard Windows console program ipconfig.exe.

````c#
using System;
using System.IO;
using System.Text;
using System.Threading;
using System.Diagnostics;

class EmptyStdOutRepro
{

public static void Main()
{
    new EmptyStdOutRepro().Run();
}

public void Run()
{

    while (true)
        Exec();

}

private void Exec()
{

    int lineCount = 0;
    string output = "";

    Process process = new Process();

    process.StartInfo.FileName = "ipconfig.exe";

    process.StartInfo.UseShellExecute = false;
    process.StartInfo.RedirectStandardOutput = true;
    process.StartInfo.CreateNoWindow = true;
    process.StartInfo.ErrorDialog = false;

    process.OutputDataReceived += new DataReceivedEventHandler((sender, e) =>
    {

        if (!String.IsNullOrEmpty(e.Data))
        {
            lineCount++;
            output += ("\n[" + lineCount + "]: " + e.Data);
        }

    });

    process.Start();

    process.BeginOutputReadLine();
    process.WaitForExit(1000); // Unpredictable behavior

    if (output != "")
        Console.WriteLine("PASSED!" + output.Length);
    else
        Console.WriteLine("FAILED!" + output.Length);

    process.WaitForExit();
    process.Close();

}

}

````

Initial reproduction, needs minimization

https://github.com/thesirkadirov/simplepm-server/tree/master/SProgramRunner

area-System.Diagnostics.Process

Source

sirkadirov

Most helpful comment

I suspect you might be seeing a known problem where there is a race condition between Process.Exited and Process.OutputDataReceived events.

See the remarks here:

https://docs.microsoft.com/en-us/dotnet/api/system.diagnostics.process.waitforexit?redirectedfrom=MSDN&view=netcore-3.1#System_Diagnostics_Process_WaitForExit_System_Int32

When standard output has been redirected to asynchronous event handlers, it is possible that output processing will not have completed when this method returns. To ensure that asynchronous event handling has been completed, call the WaitForExit() overload that takes no parameter after receiving a true from this overload.

If you are calling Process.WaitForExit(int timeout), you actually have to call it again without the timeout to ensure that stdout has been flushed e.g. change this:

return Process.WaitForExit(timeout);

to this:

bool result = Process.WaitForExit(timeout);
if (result)
{
    Process.WaitForExit();
}
return result;

This has to be a really common gotcha.

I've just found it in https://github.com/daveaglick/Buildalyzer/issues/125.

duncanawoods on 6 Feb 2020

❤3 👍2

All 32 comments

Are you able to reproduce it on both Windows and Linux?
Are you able to reproduce it on more than 1 machine?
Is the large repro sensitive to HW configuration?

karelz on 15 Aug 2018

Hi, @karelz!
I just moved my project from .NET framework/Mono due to not implremented System.Diagnostics features in second. After that, I noticed this problem.

Yes, this error rep on both OSes, tested on VMs - reproduces...

Maybe this error is mine, so now I'm rewriting that module. If problem returns, I let you know.

No, not sensitive, this must work on all hardware.

sirkadirov on 15 Aug 2018

I'll close this meantime, feel free to reopen if you determine it is .NET Core.

danmosemsft on 16 Aug 2018

👍1

OK, let us know if you see it is a real problem in .NET Core.

karelz on 16 Aug 2018

OK, I'll stay in touch.

sirkadirov on 16 Aug 2018

👍1

Rewrote the "testing subsystem" from scratch. Nothing helped.
Now testing it to get more information.
Also, I moved this "subsystem" from .NET Core to .NET Standard 2.0.3.
Repro coming soon, I hope to minimize it.

sirkadirov on 21 Aug 2018

@danmosemsft @karelz Checked many times, with different ways of stream reading - problem exists...

Base stream reader used

````c#
using System.IO;
using System.Threading;
using System.Threading.Tasks;

namespace SProgramRunner
{

public class AdvancedStreamReader
{

    private const int BufferSize = 4096;

    private Stream _source;
    private MemoryStream _outputMemoryStream;

    private CancellationTokenSource _cancellationToken;

    public AdvancedStreamReader(Stream source)
    {

        _source = source;
        _outputMemoryStream = new MemoryStream();

    }

    public void Run()
    {

        _cancellationToken = new CancellationTokenSource();

        Task.Run(async () =>
        {

            var buffer = new byte[BufferSize];

            while (true)
            {

                _cancellationToken.Token.ThrowIfCancellationRequested();

                var count = await _source.ReadAsync(buffer, 0, BufferSize, _cancellationToken.Token);

                if (count <= 0)
                    break;

                await _outputMemoryStream.WriteAsync(buffer, 0, count, _cancellationToken.Token);

                await _outputMemoryStream.FlushAsync(_cancellationToken.Token);

            }

        }, _cancellationToken.Token);

    }

    public byte[] KillAndGet()
    {

        _cancellationToken.Cancel();

        return _outputMemoryStream.ToArray();

    }

}

}
````

Exec this reader

c# process.Start(); _outputStreamReader = new AdvancedStreamReader(_process.StandardOutput.BaseStream); _outputStreamReader.Run();
on process exited:
c# _programRunningResult.ProgramOutputData = _outputStreamReader.KillAndGet();

Error - empty output on small count of tests from 50-100 or more... And this is not my code.

Hope this can help to identify the problem. Now working on repro.

sirkadirov on 26 Aug 2018

@stephentoub any thoughts come to mind?

danmosemsft on 27 Sep 2018

any thoughts come to mind?

Not until we have a repro :)

That said, accessing StreamReader.BaseStream is a bit suspicious. The StreamReader could have buffered data from the stream, such that it's been consumed from the Stream and is sitting in the StreamReader's buffer. The shared code isn't doing any manipulation of the reader, so this may not be the issue, but that's why having a repro that actually shows the problem is important.

stephentoub on 27 Sep 2018

@stephentoub I'll re-try to create a working repro.

sirkadirov on 27 Sep 2018

I suspect you might be seeing a known problem where there is a race condition between Process.Exited and Process.OutputDataReceived events.

See the remarks here:

https://docs.microsoft.com/en-us/dotnet/api/system.diagnostics.process.waitforexit?redirectedfrom=MSDN&view=netcore-3.1#System_Diagnostics_Process_WaitForExit_System_Int32

When standard output has been redirected to asynchronous event handlers, it is possible that output processing will not have completed when this method returns. To ensure that asynchronous event handling has been completed, call the WaitForExit() overload that takes no parameter after receiving a true from this overload.

If you are calling Process.WaitForExit(int timeout), you actually have to call it again without the timeout to ensure that stdout has been flushed e.g. change this:

return Process.WaitForExit(timeout);

to this:

bool result = Process.WaitForExit(timeout);
if (result)
{
    Process.WaitForExit();
}
return result;

This has to be a really common gotcha.

I've just found it in https://github.com/daveaglick/Buildalyzer/issues/125.

duncanawoods on 6 Feb 2020

❤3 👍2

I'm not familiar with this problem but since the parameterless version is just calling the other one with an infinite timeout. I'm guessing which you choose doesn't matter ie the text could probably be better worded as

To ensure that asynchronous event handling has been completed, call WaitForExit() after receiving a true from this overload.

danmosemsft on 6 Feb 2020

I'm guessing which you choose doesn't matter

There is an explicit difference in the descriptions of the two overloads so I would be fairly certain that the underlying implementation in how they handle stdout does differ in key ways. It might be within the same low-level method ultimately by choosing a different branch based on whether there is a timeout.

WaitForExit()

This overload ensures that all processing has been completed, including the handling of asynchronous events for redirected standard output

WaitForExit(Int32)

When standard output has been redirected to asynchronous event handlers, it is possible that output processing will not have completed when this method returns. To ensure that asynchronous event handling has been completed, call the WaitForExit() overload that takes no parameter after receiving a true from this overload

duncanawoods on 6 Feb 2020

When standard output has been redirected to asynchronous event handlers, it is possible that output processing will not have completed when this method returns. To ensure that asynchronous event handling has been completed, call the WaitForExit() overload that takes no parameter after receiving a true from this overload

If such a problem exists, but is extremely rare, it may be manifested when my code is executed due to many concurrent Process.Start(...) usages.

@duncanawoods Thank you for your desire to help me with this problem. I will check your solution as quick as possible.

sirkadirov on 6 Feb 2020

@duncanawoods appears this is what it refers to - it is a little counter intuitive that waiting for infinity and waiting for not-quite-infinity have potentially different behavior.

https://github.com/dotnet/runtime/blob/719d627a9e79ad9ed0255f1c6be1fc49cdffc88a/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Windows.cs#L174-L181

danmosemsft on 7 Feb 2020

@danmosemsft yep that's it. I looked at another FOSS project I use, https://github.com/SimonCropp/TextCopy, and it also fails to call WaitForExit twice after using a timeout. I fear this bug is lurking EVERYWHERE so well worth thinking about what can be done e.g. deprecating it, warnings etc. It's really almost perfectly evil: a bizarre requirement that will cause unpredictable failures only in high load scenarios. Gah!

it may be manifested when my code is executed due to many concurrent Process.Start(...) usages

@sirkadirov yep those were the conditions under which I saw it it: 10+ parallel invocations on a 12 core machine (Ryzen 3900X). I have used this library many times on quad-core machines and never saw this failure before.

duncanawoods on 7 Feb 2020

It looks like we inherited this from long ago. If there was a non-breaking way to improve things, great but I'm not sure there is (?) The Process class is slippery.

danmosemsft on 7 Feb 2020

@duncanawoods My code with additional Process.WaitForExit() works well on a small amount of tests. Need more time to verify whether it isn't only a partial solution, though.

@danmosemsft It is not a popular problem, the solution of which will bring more trouble than help. In my opinion, the best idea is to add some notes to the project's documentation.

I don't think that .NET Core platform is designed for this case. Third-party library will be a better solution for working deeply with processes.

sirkadirov on 7 Feb 2020

👍1

It looks like we inherited this from long ago. If there was a non-breaking way to improve things, great but I'm not sure there is (?)

I would wager that if we searched GitHub for all uses of WaitForExit(int timeout) combined with reading standard-out, we would find this bug in > 50% of the cases. I know that nuget.client had to fix it some time ago. https://github.com/NuGet/Home/issues/1902.

I also suspect that the frequency of observed stability problems will increase as core-counts increase and a lot of users won't track it down to this. I don't know what measures you use to decide on changes but to me this API is a guaranteed bug in user's code and must be fixed.

I appreciate the difficulty changing something so fundamental so I would suggest:

Mark WaitForExit(int timeout) as deprecated with a link to explanation and resolution in the warning. No need to delete it.
Create a new overload WaitForExit(int timeout, ProcessStreamControl processStreamControl) where the second argument is an enum to choose either the current behaviour or the expected behaviour e.g.

public enum ProcessStreamControl
{
    WaitForStreamsToFlush,
    ExitImmediately
}

Those names suck and yes it's ugly but it's clear and won't break anything.

duncanawoods on 7 Feb 2020

👍3

@duncanawoods I support your opinion. Had similar idea, but I don't know whether it will be accepted due to compatibility limitations.

Even if a decision is made to change the API, this little one will take too much time, also, there may be problems when porting old projects from .NET framework to .NET Core 3.x & .NET 5 (?). On the other side, we should quickly change some problematic APIs before .NET 5 release.

@danmosemsft Do you think we should create an issue in dotnet/standard regarding this problem?

sirkadirov on 7 Feb 2020

@duncanawoods i pushed a new version of textcopy with the double wait hack. let me know id you still experience the problem

SimonCropp on 29 Feb 2020

would having a new async process api fix this problem??

SimonCropp on 1 Mar 2020

👍2

Hi @SimonCropp Fantastic, thanks for making the fix.

The good news is that I have already tried to replicate the problem with the unfixed TextCopy with massively parallel copy-pasting of huge amounts of text but couldn't. I suspect that there are conditions on the called executable to reproduce the problem e.g. it does some work or yields before writing to stdout. I imagine that xclip is just a single system call so it's very unlikely for us to see the race condition. Better safe than sorry, so again, thanks for making the change!

Great idea for an async API to replace the blocking wait.

duncanawoods on 2 Mar 2020

@adamsitnik @danmosemsft given the extra info. should this be moved from "question" to "bug" label, and "needs more info" label be removed?

SimonCropp on 21 Jul 2020

as an aside. does this mean this bug exists in powershell? https://github.com/PowerShell/PowerShell/blob/master/src/Microsoft.PowerShell.Commands.Management/commands/management/Clipboard.cs#L53 cc @adityapatwardhan

SimonCropp on 21 Jul 2020

Looks like it. I will have a detailed look tomorrow. Thanks for reporting.

adityapatwardhan on 21 Jul 2020

again @adamsitnik @danmosemsft given the extra info. should this be moved from "question" to "bug" label, and "needs more info" label be removed?

SimonCropp on 27 Oct 2020

👍1

Sure, removed them. We don't use those labels consistently anyway.

danmosemsft on 27 Oct 2020

Can it be treated as a bug please? It is causing bugs in Microsoft's own tools.

duncanawoods on 27 Oct 2020

cc @adamsitnik @eiriktsarpalis (per https://github.com/dotnet/runtime/blob/master/docs/area-owners.md) to respond to that

danmosemsft on 27 Oct 2020

NuGet is being asked to do this double wait trick. If this is a problem, would prefer to have runtime fix, rather than put workarounds everywhere. Guidance?
@SimonCropp opened: https://github.com/NuGet/Home/issues/10189

rrelyea on 27 Oct 2020

👍2

NuGet is being asked to do this double wait trick. If this is a problem, would prefer to have runtime fix, rather than put workarounds everywhere. Guidance?
@SimonCropp opened: NuGet/Home#10189

Good idea, but as I see, a fix can potentially damage legacy/ported software projects. Maybe, new API proposal could help?

sirkadirov on 28 Oct 2020

Was this page helpful?

0 / 5 - 0 ratings