Machinelearning: Make DataFrame ReadCsv more efficient

Created on 9 Sep 2019 · 3Comments · Source: dotnet/machinelearning

DataFrame.ReadCsv at the moment loops over all the rows in a file twice: once to find the number of rows and schema, and the 2nd time to append the data. Each line is read in as a string and string.Split is called on it. Profiling shows that a sizeable chunk of time is spend in the Split calls. We should explore using a ReadOnlySpan to parse the file and avoid all the intermediate strings.

Microsoft.Data.Analysis

Source

pgovind

Most helpful comment

Could we also make it such that it can handle escaping using quoted strings.

avinds on 17 Nov 2019

👍3

All 3 comments

Is there a reason to not make api ReadStream async?

MarcoRossignoli on 10 Sep 2019

Nope. This is not the most performant implementation. This exists at the moment so folks can work on real data. Making this method more efficient is on my backlog.

pgovind on 19 Sep 2019

👍1

Could we also make it such that it can handle escaping using quoted strings.

avinds on 17 Nov 2019

👍3

Was this page helpful?

0 / 5 - 0 ratings