Csvhelper: Add async/await from *Async methods

Created on 6 Nov 2013  Â·  65Comments  Â·  Source: JoshClose/CsvHelper

This would be really handy to have async all the way down to IO.

feature

Most helpful comment

I added issue #1047 for this.

All 65 comments

Do you have any preferences on how that would work; like what methods it would be implemented on?

Ideally an async version would exist for all the reading and writing logic.

For example, we have to do this currently:

public Task<Record> GetRecordAsync(string recordId)
{
    Guard.That(RecordId).IsNotNullOrEmpty();

    Record result = null;

    using (var reader = this._reader())
    {
        var relatedRecords = reader.GetRecords<Record>();

        if (relatedRecords != null)
        {
            result =
                relatedRecords.FirstOrDefault(
                    x => string.Equals(x.RecordId, recordId, StringComparison.OrdinalIgnoreCase));
        }
    }

    return Task.FromResult(result);
}

It would be great if this was async right down to the IO so that calling code could do the following:

public await Task<Record> GetRecordAsync(string recordId)
{
    Guard.That(RecordId).IsNotNullOrEmpty();

    Record result = null;

    using (var reader = this._reader())
    {
        var relatedRecords = await reader.GetRecordsAsync<Record>().ConfigureAwait(false);

        if (relatedRecords != null)
        {
            result =
                relatedRecords.FirstOrDefault(
                    x => string.Equals(x.RecordId, recordId, StringComparison.OrdinalIgnoreCase));
        }
    }

    return result;
}

I agree that this would be a fantastic feature to have in CsvHelper. I use this library almost whenever I need to parse CSV file since it makes life so much easier, but unfortunately in a project that's going async all the way I had to resort to manual parsing. Has any work been started on implementing this? I actually just forked the repo and am going to start implementing the Task Async pattern for .NET 4.5 when I have some spare time.

I haven't started anything on this yet. I should just do it, 'cause I don't think it'll be too many lines of code.

On second thought, let me know what your plans are on this.

The simple case would be to just run the current stuff as a Task.

c# public async Task<IEnumerable<T>> GetRecordsAsync<T>() { return Task.Run( () => { return GetRecords<T>().ToList(); } ); }

This would have to call .ToList() because it returns an IEnumerable that will get a row on each iteration.

The more complex scenario is to make it async all the way down to the TextReader/TextWriter level, so that each iteration of the IEnumerable is async, and you would be able to call GetRecordAsync<T>, and use the parser async also. Doing this would be quite a large change into the bowels of the parser.

I actually thought about doing that as well, but the IO calls made by the underlying TextWriter/TextReader are still made using synchronous blocking calls. I would do what you mentioned, but in a file IO intensive application the long term solution is probably to go async all the way down to the parser.

Yeah. That will be a little tricky to get those lower level calls to work in an async manner without having to duplicate code.

So after looking into this a little, I'm not sure it's worth having the async passed all the way down. Will that actually save any time? Just having the whole thing run on a background thread will probably be fine. There is no opportunity to have multiple things running at the same time.

I will probably make async Task<T> GetRecordAsync<T> methods.

I will probably also do a async Task<IEnumerable<T>> GetRecordsAsync<T> also, but that will not yield results; it'll just return everything when it's finished.

For the IEnumerable<T> scenario, I will probably make a IObservable<T> GetRecordsObservable<T> that will have the same effect as what you'd think you'd want with a Task<IEnumerable<T>>.

It looks like implementing IObservable is a big no-no, and pulling in parts of Rx to implement would be huge. I got up to around 30 classes and just gave up. You pretty much have to pull the whole framework in to do Observable.Create.

Maybe it's not worth it and wrapping in an async Task object is enough as long as it's configured to always run on a background thread.

@roryprimrose Do you have a reason that you'd like the async to go down to the IO level? I think having a task run at a higher level won't lose any speed. As long as the IO, parsing, and converting to an object is done in a task, I would think it would be fine. That whole part could be on a single separate thread.

It sounds ok because you would still be awaiting a task to complete that is in itself doing more than just the IO work. All the way down would be better but this is a good trade-off.

The reader pulls a bunch of data into a buffer at once, and most of the work is done elsewhere. I'll continue with just putting tasks at a higher level.

I'll shelve something in the next day or so to see what you guys think.

Will this affect apps built with ASP.NET MVC or Web API?

Sent from my Windows Phone


From: Josh Close
Sent: 3/21/2014 8:44 PM
To: JoshClose/CsvHelper
Subject: Re: [CsvHelper] Add async/await from *Async methods (#202)

The reader pulls a bunch of data into a buffer at once, and most of the work is done elsewhere. I'll continue with just putting tasks at a higher level.

I'll shelve something in the next day or so to see what you guys think.

Reply to this email directly or view it on GitHubhttps://github.com/JoshClose/CsvHelper/issues/202#issuecomment-38341018.

What do you mean? Affect them how?

There will be new methods called GetRecord(s)Async and WriteRecord(s)Async, so it will be new stuff and shouldn't change any existing code.

Alright. Can everyone have a look at this? It's pretty simple but should do that job. Any suggestions are welcome, but I would like to keep it simple if possible. https://github.com/JoshClose/CsvHelper/commit/2c3f5336db501d71e2b2ef937883f0a56437c662

Just want to make sure extra threads aren't being created. In a web app that's not ideal. However, if you're going to use task factory it should be ok.

Sent from my Windows Phone


From: Josh Close
Sent: 3/21/2014 9:02 PM
To: JoshClose/CsvHelper
Cc: Jeremiah Medina
Subject: Re: [CsvHelper] Add async/await from *Async methods (#202)

What do you mean? Affect them how?

There will be new methods called GetRecord(s)Async and WriteRecord(s)Async, so it will be new stuff and shouldn't change any existing code.

Reply to this email directly or view it on GitHubhttps://github.com/JoshClose/CsvHelper/issues/202#issuecomment-38341306.

I'm leaning towards not putting this feature in. The reasoning is in the discussion in the commit. https://github.com/JoshClose/CsvHelper/commit/2c3f5336db501d71e2b2ef937883f0a56437c662

Putting Task.Run() around everything (and everywhere) is indeed not the way to go.

@JoshClose was there no way to get the async done at a lower level? E.g. where you read the stream? (Stream has many async methods) I'm about to parse a 45MB stream provided by WebClient. It will be a slow stream...

It would have to be in the parser, which means basically every single method in the entire library would have to change to Async. I don't think there would be a performance improvement at all from it. It's simple enough for a consumer to do Task.Run where they want to.

Task.Run is a simple workaround, but the whole point of using async is to avoid using yet another worker thread (which cannot easily be canceled in a asp.net app) . If every library implementer out there responds the same, then we are back to square one. A lot of effort went into C# to avoid resorting to Task.Run() for IO bound operations.

In my current case however, I will only read my input file once per day, so hogging an extra worker thread for what is a very limited time is not much of an issue, other than violating my project's coding guidelines and being the odd duck out. Next time I might not be so lucky.

But yeah, the core would then need to be async, otherwise there is no point in doing it.

What stinks about this is way deep down there is a single call that happens. https://github.com/JoshClose/CsvHelper/blob/master/src/CsvHelper/FieldReader.cs#L108

I'll look into doing it again all the way down to the parser reading.

I think there will be a lot of code duplication. I can't have the sync methods call the async methods and wait on them because this needs to work in .NET 2.0.

Making things async would only make sense when an operation takes more than 10-15 ms IMHO. Reading a char doesn't make it faster.

What we are doing in our library is reading / writing the file stream into memory (async). Then we let the parser do its magic.

The only thing that could make (a bit) of sense is if CsvHelper would read batches from the memory stream async, but I don't think it's worth the effort.

If everything is ultimately chucked into a memory stream, then Geert has a point.

But is this really the case? If I start streaming a huge csv file from a slow website (actually my main usage of CsvHelper thus far), I first have to wait for everything to be stuffed into a MemoryStream..? That does not sound healthy.

But that can be done async (since you can pass in a stream to CsvHelper). So we wrote a method like this (of course this doesn't compile, but it's for explanation purposes only):

public async Task<IEnumerable<TRecord>> ReadCsvAsync<TRecord, TMap>(Uri uri)
{
    var downloadedBytes = await _downloader.DownloadSlowStuffAsync();

    using (var memoryStream = new MemoryStream(downloadedBytes))
    {
        return CsvHelper.ParseSync(memoryStream);
    }
}

The slow part in your code isn't the parsing, it's the retrieval of the file.

You could even split this up:

  1. Store the first line (as header)
  2. Download 1MB at a time. While the download continues, you can already start parsing the code (since the download is async anyway).

It's a bit more complex, but then you would have a true async method.

No data is read into a buffer of size chosen in config (2048 by default).
No memory stream inside, at least not anymore.

Is there any place other than TextReader.Read where an async call would occur? https://github.com/JoshClose/CsvHelper/blob/master/src/CsvHelper/FieldReader.cs#L109 Same would go for writing.

If you are really reading char by char, I would not even bother making this async (since thread switching, creating tasks, etc is more expensive).

It's reading a buffer of Configuration.BufferSize = 2048 size. That buffer is then read char by char and keeps track of positions so it can be pulled out of the buffer all at once for performance.

// Read from TextReader into buffer
charsRead = Reader.Read( buffer, bufferPosition, configuration.BufferSize );

// Pull data out of buffer into string
field.Append( new string( buffer, fieldStartPosition, length ) );

@GeertvanHorrik @JoshClose Agreed! While it's possible to add async functionality the only async function that could be called in the stacktrace from a reading standpoint is reading a character which gets called a lot so this story is probably not a good idea. I wanted to find a reason it would make sense to add async but there really isn't any.

I agree. There would have to be duplicate methods all the way up the tree too. I don't think it makes sense in this library. If someone just wants reading to not block, they can start a new task. I'll need some good convincing and a solid plan to open this one back up.

To jump in here- I don't think the right solution is to not offer *Async methods.

If the stream is being read character by character then I think the solution is to not read character by character but rather chunks at a time.

IO operations absolutely need async overloads. And the implementation should be appropriate for async. Giving up isn't the right solution.

The characters themselves are read on chunks into a buffer, but inside of
the read character method. Thus we would be creating a task for each
character. That would be highly inefficient.

On Jun 26, 2017 11:21 AM, "Sam" notifications@github.com wrote:

To jump in here- I don't think the right solution is to not offer *Async
methods.

If the stream is being read character by character then I think the
solution is to not read character by character but rather chunks at a time.

IO operations absolutely need async overloads. And the implementation
should be appropriate for async. Giving up isn't the right solution.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/JoshClose/CsvHelper/issues/202#issuecomment-311091798,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AD_ohW9ZDhHOhNlHHQ1bjM3qvJtTRUJ9ks5sH8yDgaJpZM4BLKDJ
.

@jamesbascle I understand that is the current implementation- I'm saying it should be rewritten so that it's done efficiently and not reading per character.

Every IO library nowadays is offering Async overloads. An answer that is essentially "it's impossible" seems extremely unlikely.

I see a way of refactoring the buffer reads so every CsvFieldReader.GetChar() doesn't have to be async.

Is there an easy way to have the same code ran for methods so I don't have to duplicate all the code for every method? Most of the code all the way down will be identical. There will just be Read vs ReadAsync at the bottom. I don't want to use any sort of wrapper that just does Wait or ConfigureWait.

I suggest putting a feature request in for an implementation of the parser which is not character based. There may be scenarios where one performs better than another. ICsvReader with implementations such as CsvCharReaderImpl, CsvLineReaderImpl, etc. I'm simplifying this of course but by only point is not to overload the existing classes where it doesn't fit. A number of other stories would be required for this effort including decoupling some of these classes and coming up with an interface design that would work for different implementations. Whether or not this effort would be worth it is another story. :)

I'm not sure I see the idea behind that post clearly.

The data is not loaded into memory char by char, but on a continuous basis
according to however large you set the buffer size to.

The parser, regardless of how the data is loaded into that buffer, has to
operate on a character by character basis, to detect control characters,
and other such things.

@jamesbascle I was thinking of using ReadLine (async) then parsing each line char by char after that. Just brainstorming mind you. If someone were to write an async parser that's where I'd start.

What if there is a \r\n in a field? ReadLine may not be a whole CSV line.

Currently it's reading into a buffer. The char by char reading is done from that buffer. I was thinking of just having ReadBuffer and ReadBufferAsync methods so the ReadChar wouldn't have to be async.

My current issue is finding away to not duplicate all the code in all the methods.

If you reached bufferlength - 10, you can start reading the next batch anyway. Since it's async it doesn't matter ;-)

I can't see a way of making async methods without making a duplicate async method for every method in CsvParser. CsvFieldReader.GetChar would not be async, but there would need to be an async method to load the buffer every time it ran out. A check to see if the buffer needs loading would have to happen any time before a char is retrieved. CsvParser gets a char in every method in that class. I'm not seeing a way around this. I don't want to duplicate the code in every method.

I've read some people talking about how you should either create a sync method or an async method, but not both. If the code is IO bound or is going to take a long time, then it should be async, otherwise sync. CsvHelper isn't IO bound by default. The user decides whether that is the case or not. If their TextReader is a network stream, then it's going to be downloading as it's parsing. The user could also choose to download it asynchronously ahead of time, then run it through CsvHelper.

Every IO library nowadays is offering Async overloads.

What are some examples of this? I'd like to see how they overcome all the duplication of code.

The thing that's really holding me back here is code duplication. I need a way to run the same code sync and async without duplication. If the TextReader isn't attached to network, it's going to be a lot faster to no use async methods.

Also, method wrapping won't work. Should I expose asynchronous wrappers for synchronous methods? Should I expose synchronous wrappers for asynchronous methods?

What are some examples of this? I'd like to see how they overcome all the duplication of code.

JSON.NET and Entity Framework are the two biggest examples. I'm sure there are lots of others as well.

If you don't want to duplicate code, one thing you could do is make every method async, and make the synchronous versions just call .Result on the async versions. This might not be "recommended" but it will effectively re-use code. (Just make sure all your async methods are all doing ConfigureAwait(false) or else this will cause big problems)

In WinRT, Microsoft actually eliminated the non-async overloads altogether, though this would be breaking so I don't recommend this.

For reference, Stephen Toub wrote a blog post about this topic.

So, looks like Json.NET just copies all the methods exactly and replaces calls with async calls. I will look into what that would entail here and how I can make it have minimal amounts of code duplication. This means the reader will probably only have the Read() method be async and the consumer would have to loop the records manually, which isn't a big deal.

await csv.Read();
csv.ReadHeader();
while( await csv.Read() )
{
    var record = csv.GetRecord<MyType>();
}

I'll add this to the 3.0 milestone.

The IDE experience isn't as nice, but another thing I've done here is use T4 templates to promote code re-use. (Alternative templating engines would also work, of course) That way you move all the code into templates and just use <# if(isAsync) #> or something similar to output essentially the same code but with the async tweaks.

lol, oh man. I really don't want to do the T4 thing, but I'll look into it. If it makes it easier and is still readable, maybe I'll do that.

Got any good T4 doc sources you recommend?

I learned how to use it from the MSDN docs. It's similar to writing ASP.NET, if you've ever done that.

The only tricky part is if you want it to be invoked as part of the build (it doesn't need to be, if you're ok just running it once via Visual Studio and checking in the generated files).

Entity Framework also uses it for their code generators, I believe.

Well, seems like T4 will work as I have all the parsing working using async methods. I found a VS plugin, tangible T4 that is helping a lot. I guess this is going to be a thing. ;)

Got all the async reading done. Now on to writing, which shouldn't take long at all.

Thanks @MgSam for the direction.

Here is a unit test to show how reading will work. Obviously returning a Task<IEnumerable<T>> won't work, so you have to write a couple more lines.

using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

namespace CsvHelper.Tests.Async
{
    [TestClass]
    public class ReadingTests
    {
        [TestMethod]
        public async Task ReadingTest()
        {
            using( var stream = new MemoryStream() )
            using( var writer = new StreamWriter( stream ) )
            using( var reader = new StreamReader( stream ) )
            using( var csv = new CsvReader( reader ) )
            {
                writer.WriteLine( "Id,Name" );
                writer.WriteLine( "1,one" );
                writer.WriteLine( "2,two" );
                writer.Flush();
                stream.Position = 0;

                var records = new List<Simple>();
                await csv.ReadAsync();
                csv.ReadHeader();
                while( await csv.ReadAsync() )
                {
                    records.Add( csv.GetRecord<Simple>() );
                }

                Assert.AreEqual( 2, records.Count );

                var record = records[0];
                Assert.AreEqual( 1, record.Id );
                Assert.AreEqual( "one", record.Name );

                record = records[1];
                Assert.AreEqual( 2, record.Id );
                Assert.AreEqual( "two", record.Name );
            }
        }

        private class Simple
        {
            public int Id { get; set; }

            public string Name { get; set; }
        }
    }
}

Done. db5945439d8213c38006df4e8a57fc4a55364f52

I was afraid you were going for the .Result solution, but just checked the code and it looks good! Well done, thanks for implementing this.

I read too many articles telling me not to do it. ;)

I was just waiting for a good solution to not reproduce code. T4 seemed to work well. The only hard part is doing active development on a template. I'm basically doing active development on the result file, and then will have to merge the changes back in later.

What happened to this? I cant find WriteRecordsAsync from https://github.com/JoshClose/CsvHelper/commit/2c3f5336db501d71e2b2ef937883f0a56437c662 in 6.0 version.

There is no WriteRecordsAsync. That part doesn't need to be asynchronous. There is FlushAsync and NextRecordAsync though. NextRecordAsync will call FlushAsync for you.

WriteRecords will never block? It holds the records until I call FlushAsync? Is that effective, when I call WriteRecords many times, with only few records in every write?

WriteRecords will write several objects at once and will call NextRecord which in turn calls Flush. This doesn't use the async stuff at all. If you want to write async methods, you'll need to loop your records and call WriteRecord, then NextRecordAsync.

@JoshClose - can you confirm that the following works when reading in a physical file?

 private static async Task<List<string>> ReadCsvAsync(string filename)
 {
      using (var streamReader = File.OpenText(filename))
      {
           using (var reader = new CsvReader(streamReader))
           {
                var list = new List<string>();

                while (await csv.ReadAsync())
                {
                     list.Add(csv.GetField(0));
                }
           }
      }
 }

As soon as my break point hits the WHILE loop with the ReadAsync() call, it goes into lala land and never moves to the next line. If I change it to Read() then everything works.

My guess is that brett-burkhart encountered a deadlock. A quick grep indicates that .ConfigureAwait(false); is not called anywhere in the code base? (which means deadlocks could be encountered in Winforms/WPF applications)

Good call rune. That is an important thing to so to async/await calls on
libraries so it works in the UI frameworks

On Tue, Jun 5, 2018, 5:07 PM 9Rune5 notifications@github.com wrote:

My guess is that brett-burkhart encountered a deadlock. A quick grep
indicates that .ConfigureAwait(false); is not called anywhere in the code
base? (which means deadlocks could be encountered in Winforms/WPF
applications)

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/JoshClose/CsvHelper/issues/202#issuecomment-394860726,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AD_ohRQBrfdEYU_Ydoohk6grVyMUJz25ks5t5vMmgaJpZM4BLKDJ
.

Are you saying CsvHelper should be using ConfigureAwait(false) or the caller of CsvHelper should be?

Yes.

Apply .ConfigureAwait(false) liberally within the library code. Basically every time you do not need to resume the same context after the await. For non-GUI code, this pretty much means 'always'.

It seems counterintuitive at first (why, oh why, isn't it the default?!), but eventually it becomes second nature.

BTW: It may seem unnecessary to do this for every await in any given method, but some tasks may end up executing synchronously (e.g. if you mock a certain function or a task finishes much sooner than anticipated), so AFAICT best practices dictates showering every await with a corresponding ConfigureAwait().

Tip: Read everything written by Stephen Cleary. Start here: https://msdn.microsoft.com/en-us/magazine/jj991977.aspx?f=255&MSPPError=-2147217396

Thanks for the link to Stephen Cleary. I would have asked for that. I'll change all async calls to use ConfigureAwait(false).

I added issue #1047 for this.

Done.

Was this page helpful?
0 / 5 - 0 ratings