I have an issue after updating from Orleans 1.1.3 to 1.2.0 (the same applies to 1.2.1).
This is my service configuration (extract):
<Tracing DefaultTraceLevel="Warning" TraceToConsole="true" TraceToFile="Logs\{0}-{1}.log" PropagateActivityId="false" BulkMessageLimit="1000">
<TraceLevelOverride LogPrefix="Application" TraceLevel="Info" />
</Tracing>
When I start my silo via command line everything is logged properly into the console window, but in the logfile I see nothing or only the first 3 log entries (coming from Orleans itself). All subsequent log messages are missing.
Now when I stop my silo by pressing CTRL-C, all the missing log messages are flushed into the logfile.
However, when I start my silo via debugger, everything is logged properly into both the console and the file.
It seems that if the debugger is attached, log entry flushing is working okay, but if the debugger is not attached, flushing does not work properly.
I have written a workaround which flushes the messages into the logfile:
private void Flush()
{
    // Locate TraceLogger's Flush method via reflection and invoke it.
    // Flush is a static method, so the target passed to Invoke is null.
    (this.Logger as global::Orleans.Runtime.TraceLogger)?.GetType()
        .GetMethod("Flush", BindingFlags.Static | BindingFlags.Public | BindingFlags.NonPublic, null, Type.EmptyTypes, null)?
        .Invoke(null, null);
}
When I call this method after each call to the logger, the logfile is okay. This seems to be an Orleans bug.
Maybe something goes wrong with this delta: https://github.com/dotnet/orleans/commit/c2f192bc038f01113a668f582817db8bf38afa19
Another hint: I added the following section to my silo host – but it did not solve the issue:
<system.diagnostics>
<trace autoflush="true" indentsize="0">
<listeners>
<remove name="*" />
</listeners>
</trace>
<switches>
<add name="traceLevel" value="2" />
</switches>
</system.diagnostics>
Still need the workaround to make flushing to file work.
Interesting. @jdom, indeed, why do we need a flush interval? Why not flush after each statement? I think that was the original behaviour, if I'm not wrong.
@gabikliot, @jdom: I made another interesting observation: when I add a Thread.Sleep(1500) after each log message (instead of calling the Flush() workaround), the logfile is also okay.
After analysing the code, I saw that @sergeybykov added some flushing-related changes on 29 Feb 2016.
Here I see the lines:
private TimeSpan flushInterval = Debugger.IsAttached ? TimeSpan.FromMilliseconds(10) : TimeSpan.FromSeconds(1);
private DateTime lastFlush = DateTime.UtcNow;
// [...]
if ((DateTime.UtcNow - lastFlush) > flushInterval)
{
lastFlush = DateTime.UtcNow;
Flush();
}
This would explain why it works while debugging.
However, I don't get the idea behind it. Okay, log messages are buffered for an interval of one second. But what happens if a message is not flushed (because the interval has not yet expired), and then a long time passes before the next message is logged? The later message will flush the earlier one, but until then the earlier one is stuck in the queue.
Another observation: Sometimes the logfile is empty and only gets filled when the silo host is shut down. Maybe the reason is that the TraceLogger instance is initialized this way:...
private DateTime lastFlush = DateTime.UtcNow;
... and then in the following 999 ms (actually < 1000 ms) some messages are logged – but these messages will not get flushed, since
if ((DateTime.UtcNow - lastFlush) > flushInterval)
evaluates to false for the first second.
In other words: I think this would require a timer which fires after the interval expires and flushes any stuck messages. If a new message comes in before the timer fires, the timer is restarted.
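A minimal sketch of that idea (hypothetical code, not from Orleans; the type and member names are made up for illustration): every write re-arms a one-shot timer, so the last message in a burst is flushed at most one interval after it was written, even if no further messages ever arrive.

```csharp
using System;
using System.IO;
using System.Threading;

// Sketch: a writer whose pending output is flushed at most `interval`
// after the most recent write, even when no further writes follow.
public sealed class DebouncedFlushWriter : IDisposable
{
    private readonly TextWriter inner;
    private readonly Timer flushTimer;
    private readonly TimeSpan interval;
    private readonly object sync = new object();

    public DebouncedFlushWriter(TextWriter inner, TimeSpan interval)
    {
        this.inner = inner;
        this.interval = interval;
        // The timer starts disarmed; it is (re)armed on every write.
        this.flushTimer = new Timer(_ => FlushInner(), null, Timeout.Infinite, Timeout.Infinite);
    }

    public void WriteLine(string message)
    {
        lock (sync)
        {
            inner.WriteLine(message);
            // Restart the countdown: the flush fires `interval` after the
            // last write, replacing any previously scheduled flush.
            flushTimer.Change(interval, Timeout.InfiniteTimeSpan);
        }
    }

    private void FlushInner()
    {
        lock (sync) { inner.Flush(); }
    }

    public void Dispose()
    {
        flushTimer.Dispose();
        lock (sync) { inner.Flush(); }
    }
}
```

Under load, the timer is constantly pushed back and the normal buffered writes dominate; only the tail of a burst pays the timer-driven flush.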
I don't think we need to wait at all before flushing. Flush every time.
@sergeybykov and @jdom must have had a reason.
@gabikliot: +1. But maybe it was added for performance reasons. Writing one large block to disk is faster than writing multiple smaller blocks – maybe. And since the logging is not done asynchronously, someone may have thought it would be a good idea to write at one-second intervals instead of immediately.
It is indeed due to perf reasons. The reason not to flush every time is the same as why the Write API itself does not flush every time. Writing 1 MB in very small chunks is orders of magnitude slower than writing a single 1 MB chunk (or whatever the optimal chunk size is for the OS and the disk, which is determined by the runtime).
We could try to micro-optimize for this dev scenario, but in reality, under production load, we will probably mainly rely on the API automatically flushing once enough has been written... The 1 second interval just accounts for light loads (but not necessarily non-existent ones, as in this case).
I'm not opposed to adding a timer, though, but I wouldn't make it fire every second either... Maybe every 10 seconds, or some other multiple of the flush interval.
And yes, also because calling flush will block the thread until the bytes are flushed
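The cost being discussed can be seen with a crude comparison (a sketch, not a rigorous benchmark): flushing after every small write forces a syscall-sized write each time, while letting the StreamWriter buffer and flushing once batches the I/O.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Crude illustration of the perf argument: flush-per-line vs. letting the
// stream buffer and flushing only when the writer is disposed.
class FlushCost
{
    static TimeSpan Run(bool flushEachLine)
    {
        string path = Path.GetTempFileName();
        var sw = Stopwatch.StartNew();
        using (var writer = new StreamWriter(path))
        {
            for (int i = 0; i < 100_000; i++)
            {
                writer.WriteLine($"log line {i}");
                if (flushEachLine)
                    writer.Flush(); // pushes each small chunk to the OS immediately
            }
        } // disposing the writer flushes whatever remains in the buffer
        sw.Stop();
        File.Delete(path);
        return sw.Elapsed;
    }

    static void Main()
    {
        Console.WriteLine($"flush per line: {Run(true)}");
        Console.WriteLine($"buffered:       {Run(false)}");
    }
}
```

The exact ratio depends on the OS, disk, and buffer size, but flush-per-line is consistently the slower of the two.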
@jdom, this is all of course true, THEORETICALLY.
In practice, have you seen any problem due to that? We log only at info, which is VERY VERY little information overall. Like single Kbs a minute. Never seen any problem due to flushing for many years with Orleans.
And I was always telling @sergeybykov how the server log is the service's most valuable resource, its eyes into what happens; every piece of info is super valuable. It's not just "some other random stuff" to care about. Thought about this way, jeopardizing anything about it is wrong, in my opinion.
My strong recommendation is to leave it at every statement, unless you have a strong evidence it made anything severely worse.
I have built a few logging libraries while in p&p, and this is definitely not just theory.
If I take your word for it that only a few KBs per minute are written, then I wouldn't mind if there is a blocking and poorly performant flush call there. Feel free to change it.
Nevertheless, can we trust that assertion even for Orleans logging? Also, I thought that all application tracing done through grain.GetLogger("foo") would go through this same pipeline. If that's the case, then that's potentially a huge amount of logging, and that is where I would worry a lot.
It is definitely your call from now on.
It was my call when I had to troubleshoot those production logs. But it's yours now.
If I may, I wish to propose another approach. It might not be exactly what you're after, and maybe I over-engineered it, but I've implemented a solution for this issue as part of ZooKeeperNetEx. Look at NonBlockingFileWriter.cs. What it does is write to a concurrent queue which is consumed asynchronously, and it flushes after every write. The task that writes is spawned when needed and exits after the write burst has ended; there are never two concurrent write operations. You might find it useful.
@shayhatsor yeah, I think we could use something like that too. Not necessarily flushing on every single Write, but flushing once the queue is empty... if it's never empty, then there's actually no need to explicitly flush, as the OS will be doing that automatically for us. We should also be careful as to have a bounded capacity for the buffer... Something that we don't expect to reach, but still a cap so that it's not a huge memory leak in case there is something wrong when writing.
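The shape being discussed might look roughly like this (a sketch under our own names, not code from ZooKeeperNetEx or Orleans): producers enqueue into a bounded concurrent queue, a single consumer task drains it, and Flush is called only when the queue runs empty, so a busy writer batches naturally and a burst still ends with a flush.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Sketch of a non-blocking queued writer: one consumer at a time drains the
// queue and flushes only when the burst ends (i.e. the queue is empty).
public sealed class QueuedLogWriter
{
    private readonly BlockingCollection<string> queue;
    private readonly TextWriter inner;
    private int consumerRunning; // 0 = idle, 1 = a consumer task is active

    public QueuedLogWriter(TextWriter inner, int capacity = 10_000)
    {
        this.inner = inner;
        // Bounded capacity caps memory use if the disk falls behind;
        // producers block at the cap instead of growing the queue forever.
        this.queue = new BlockingCollection<string>(capacity);
    }

    public void WriteLine(string message)
    {
        queue.Add(message); // blocks only if the bound is reached
        // Spawn the consumer only when one is not already running,
        // so there are never two concurrent write operations.
        if (Interlocked.CompareExchange(ref consumerRunning, 1, 0) == 0)
            Task.Run(Drain);
    }

    private void Drain()
    {
        while (true)
        {
            while (queue.TryTake(out var line))
                inner.WriteLine(line);
            inner.Flush(); // the burst has ended: push everything to disk
            Volatile.Write(ref consumerRunning, 0);
            // A producer may have enqueued between TryTake failing and the
            // flag reset above; if so, try to become the consumer again.
            if (queue.Count == 0 || Interlocked.CompareExchange(ref consumerRunning, 1, 0) != 0)
                return;
        }
    }
}
```

With this shape, flushing on every single write is unnecessary: under sustained load the OS-level buffering does the batching, and the explicit Flush only runs at the tail of a burst.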
I'm not sure if it is the same issue, but I am seeing a problem on 1.2.2 where writing a log at certain times results in the log message getting 'stuck' – it never flushes on its own, even after the 1 second interval passes.
The only way the log file is updated is when another log is written from elsewhere in the code.
@nkm8 Orleans 1.2.2? Can you try 1.5.3?
@sergeybykov I have been trying to upgrade to 1.5.3, but I am having some issues. Our code persists instances of grains using Orleans.StorageProviders.SimpleSQLServerStorage and after upgrading Orleans and SimpleSQLServerStorage, the grains fail to deserialize:
!!!!!!!!!! Error clearing: GrainType=Orleans.Runtime.Versions.VersionStoreGrain GrainId=GrainReference:*grn/13BAB4D8/00000000+OrleansTest ETag= in to DataSource=
The grains were serialized as binary to a SQL Server DB before VersionStoreGrain was added, so the deserialize always fails and I can't find an easy way to add a fallback serializer to the old 1.2.2 way. This could work for new/clean installs, but existing installs of our current codebase would require deleting all of the stored grains before upgrading, which requires too much manual work to document and re-create all the grain configurations during an upgrade.
@nkm8 Please open a separate issue for the problem of data migration. We should be able to suggest a solution for it.
@sergeybykov Good idea - I opened https://github.com/dotnet/orleans/issues/4002. I am glad to hear that there may be some potential solutions because I am stumped.
@nkm8 Thanks. I'll close this one then, as we are mostly out of the log management business in 2.0.0.