Machinelearning: Make DataFrame ReadCsv more efficient

Created on 9 Sep 2019  路  3Comments  路  Source: dotnet/machinelearning

DataFrame.ReadCsv at the moment loops over all the rows in a file twice: once to find the number of rows and schema, and the 2nd time to append the data. Each line is read in as a string and string.Split is called on it. Profiling shows that a sizeable chunk of time is spend in the Split calls. We should explore using a ReadOnlySpan to parse the file and avoid all the intermediate strings.

Microsoft.Data.Analysis

Most helpful comment

Could we also make it such that it can handle escaping using quoted strings.

All 3 comments

Is there a reason to not make api ReadStream async?

Nope. This is not the most performant implementation. This exists at the moment so folks can work on real data. Making this method more efficient is on my backlog.

Could we also make it such that it can handle escaping using quoted strings.

Was this page helpful?
0 / 5 - 0 ratings