In https://github.com/jimhester/readr/commit/33b793621c33b915e896fb3778c5a47152ccd73d @jimhester implements a proof-of-concept of https://github.com/ben-strasser/fast-cpp-csv-parser, with impressive timings on a static 1.56 GB file (especially on the 'hot' second timing):
```r
> system.time(y <- readr:::read_trip_fare(normalizePath("trip_fare_1.csv")))
   user  system elapsed
  19.97    1.01   20.15
> system.time(y <- data.table::fread(normalizePath("trip_fare_1.csv")))
|--------------------------------------------------|
|==================================================|
   user  system elapsed
  23.88    0.75   18.24
> system.time(y <- readr:::read_trip_fare(normalizePath("trip_fare_1.csv")))
   user  system elapsed
  12.81    1.08   12.91
> system.time(y <- data.table::fread(normalizePath("trip_fare_1.csv")))
|--------------------------------------------------|
|==================================================|
   user  system elapsed
  24.36    0.66   17.92
> dim(y)
[1] 14776615       11
```
The purpose of this issue is to recognize its performance, to broadcast awareness of this parser, to anticipate comparisons with fread, and to gather what, if anything, can be learnt from this implementation. From what I understand, the function requires a lot of knowledge of the csv's structure well in advance of reading it. (In data.table parlance, perhaps, a 'fast and unfriendly file finagler'.) Nonetheless I believe there is a use case for such a function: a kind of plain-text cached version could be very valuable if fast.
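For concreteness, here is roughly what "telling the reader everything up front" already looks like with fread's `colClasses` argument. The column names below are my guess at the trip_fare schema, not a verified one:

```r
library(data.table)

# Sketch: supplying the schema up front so the reader can skip type
# detection -- analogous to what fast-cpp-csv-parser demands.
# Column names are assumptions based on the public NYC taxi trip_fare
# data, not a verified schema.
y <- fread(
  "trip_fare_1.csv",
  header     = TRUE,
  colClasses = list(
    character = c("medallion", "hack_license", "vendor_id",
                  "pickup_datetime", "payment_type"),
    numeric   = c("fare_amount", "surcharge", "mta_tax",
                  "tip_amount", "tolls_amount", "total_amount")
  )
)
```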
Thanks Hugh for bringing this to our attention. I wonder if you measured these timings yourself (and if so, what are the specs of the system where they were run), or found them on readr's blog somewhere (in which case, what version of data.table were they using?).
In particular, what strikes me as odd is that the "user" time is very similar to the "elapsed" time. On my machine the user time is usually much higher, because fread utilizes multiple cores:
```r
> system.time(fread("~/Downloads/trip_fare_1.csv"))
|--------------------------------------------------|
|==================================================|
   user  system elapsed
170.614   9.984  24.392
> system.time(fread("~/Downloads/trip_fare_1.csv"))
|--------------------------------------------------|
|==================================================|
   user  system elapsed
118.323   6.500  16.293
> system.time(fread("~/Downloads/trip_fare_1.csv"))
|--------------------------------------------------|
|==================================================|
   user  system elapsed
115.234   6.260  15.853
```
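If you want to check the multi-core explanation directly, something like the following should work (assuming a data.table version recent enough to export `setDTthreads()`):

```r
library(data.table)

# If fread is using multiple cores, user time should exceed elapsed
# time. Pinning data.table to one thread should bring the two back in
# line.
setDTthreads(1)
system.time(fread("~/Downloads/trip_fare_1.csv"))  # expect user ~ elapsed

setDTthreads(0)  # 0 = use all available logical CPUs again
system.time(fread("~/Downloads/trip_fare_1.csv"))  # expect user >> elapsed
```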
Yes, I ran the timings myself. I used the latest dev version of data.table.
Rerunning today:
```r
system.time(fread("~/../Downloads/trip_fare/trip_fare_1.csv"))
|--------------------------------------------------|
|==================================================|
   user  system elapsed
  29.80    1.38   24.34
system.time(fread("~/../Downloads/trip_fare/trip_fare_1.csv"))
|--------------------------------------------------|
|==================================================|
   user  system elapsed
  27.95    0.77   22.06
```
Intel i7-6800K @ 3.40 GHz, 128 GB installed RAM, Windows 10.
FWIW I noticed a large-ish file I was reading slow down a tad after going from a dev version in Dec/January or so to now (I noticed because before it didn't produce a progress bar; now it does).
I decided to document this using this script:
https://gist.github.com/MichaelChirico/afb9949027d720629f0934a5398108b7
and this script to run it:
https://gist.github.com/MichaelChirico/63ae2e4cf87079d9a45b7fb17082820e
The first script contains the commit hashes of the 35 most recent commits affecting fread.c. The second takes these as input, installs data.table at each commit, then times 5 runs of lapply(f, fread) on 10 files, each with about 4.5 million rows x 26 columns. I can't share the data. Roughly, the workflow looks like the sketch below.
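In outline (this is a paraphrase of the gists, not the gists themselves; `fread_commits.txt` and the `data/` directory are hypothetical stand-ins for the commit SHAs and the private files):

```r
library(remotes)

hashes <- readLines("fread_commits.txt")       # 35 SHAs touching fread.c
f      <- list.files("data", full.names = TRUE) # the 10 private csv files

elapsed <- sapply(hashes, function(h) {
  # safer in practice to do each install + timing in a fresh R session,
  # since the already-loaded namespace can mask the new install
  install_github(paste0("Rdatatable/data.table@", h), quiet = TRUE)
  median(replicate(5, system.time(lapply(f, data.table::fread))[["elapsed"]]))
})
```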
I'm on a MacBook Pro (High Sierra 10.13.3 / 2.5 GHz Intel Core i7 / 16 GB 1600 MHz DDR3 RAM); here's the result:

[chart: fread timings across the 35 commits]

Overall there hasn't been _crazy_ variation, but the variation is there. Speed appears to have peaked in early December and has been creeping up a bit since.
@MichaelChirico it would be great if you could put those tests in macrobenchmarking/data.table/fread.Rraw
You mean the scripts? Or the output?
There's always a trade-off between speed and robustness. For example, a recent change in the parsing of doubles added a few checks for extra digits and for the correctness of the number literal. Those checks probably slowed down the parser by a few percentage points, but with the benefit of improved functionality. Things like this may add up. On the other hand, genuine minor inefficiencies may also have been introduced -- hard to know... The overall change that you see is roughly a 5% slowdown, so it's not anything dramatic (it looks scary on the chart only because the origin is not at 0).
Also, some time ago there was a change in progress bar logic to make it appear earlier.
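To make the doubles point concrete, here is the sort of input those extra checks guard against (a sketch; exact type-bumping behavior may differ across data.table versions):

```r
library(data.table)

# "1.2.3" is not a valid double, so the column should be bumped to
# character rather than silently mis-parsed.
fread("x\n1.5\n1.2.3\n2.25\n")
#         x
# 1:    1.5
# 2:  1.2.3
# 3:   2.25    (column arrives as character after the bump)
```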
I agree it's minor; I set out to document this given the appearance of a new parser that claims to perform better.
I think it's great for fread to robustly/automatically handle such a diversity of weird input files (which, at scale, are common "in the wild"); as Hugh points out, though, if the user _does_ know that they have a prim-and-proper simple csv, the potential speed-up could be quite large.
@HughParsonage what's the memory performance of readr vis-a-vis fread for this case? I wonder how much of the second-run performance comes from a more liberal absorption of user memory by readr, perhaps...
I'm embarrassed to say I don't really know how to benchmark memory:
```r
> gc(1,1)
Garbage collection 26 = 16+5+5 (level 2) ...
25.7 Mbytes of cons cells used (51%)
6.6 Mbytes of vectors used (52%)
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 480733 25.7     940480 50.3   480733 25.7
Vcells 856189  6.6    1650153 12.6   856189  6.6
> pryr::mem_change(data.table::fread("~/../Downloads/trip_fare/trip_fare_1.csv"))
|--------------------------------------------------|
|==================================================|
140 MB
> gc(1,1)
Garbage collection 58 = 29+7+22 (level 2) ...
32.4 Mbytes of cons cells used (30%)
139.7 Mbytes of vectors used (9%)
            used  (Mb) gc trigger   (Mb) max used  (Mb)
Ncells    605389  32.4    2051488  109.6   605389  32.4
Vcells  18307526 139.7  199271127 1520.4 18307526 139.7
> pryr::mem_change(readr:::read_trip_fare(normalizePath("~/../Downloads/trip_fare/trip_fare_1.csv")))
864 B
> gc(1,1)
Garbage collection 66 = 29+7+30 (level 2) ...
34.2 Mbytes of cons cells used (33%)
140.0 Mbytes of vectors used (10%)
            used  (Mb) gc trigger   (Mb) max used  (Mb)
Ncells    638792  34.2    1946970  104.0   638792  34.2
Vcells  18338486 140.0  184058497 1404.3 18338486 140.0
> gc(1,1)
Garbage collection 67 = 29+7+31 (level 2) ...
34.2 Mbytes of cons cells used (33%)
140.0 Mbytes of vectors used (12%)
            used  (Mb) gc trigger   (Mb) max used  (Mb)
Ncells    638797  34.2    1946970  104.0   638797  34.2
Vcells  18338514 140.0  147246797 1123.5 18338514 140.0
```

Restarting R session...

```r
> gc(1,1)
Garbage collection 22 = 14+4+4 (level 2) ...
23.7 Mbytes of cons cells used (59%)
6.0 Mbytes of vectors used (47%)
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 443330 23.7     750400 40.1   443330 23.7
Vcells 783156  6.0    1650153 12.6   783156  6.0
> pryr::mem_change(readr:::read_trip_fare(normalizePath("~/../Downloads/trip_fare/trip_fare_1.csv")))
9.62 MB
> gc(1,1)
Garbage collection 43 = 18+4+21 (level 2) ...
29.9 Mbytes of cons cells used (27%)
14.6 Mbytes of vectors used (1%)
           used (Mb) gc trigger   (Mb) max used (Mb)
Ncells   559042 29.9    2051488  109.6   559042 29.9
Vcells  1912426 14.6  181141032 1382.0  1912426 14.6
> pryr::mem_change(y <- readr:::read_trip_fare(normalizePath("~/../Downloads/trip_fare/trip_fare_1.csv")))
1.51 GB
> pryr::mem_change(y2 <- data.table::fread(normalizePath("~/../Downloads/trip_fare/trip_fare_1.csv")))
|--------------------------------------------------|
|==================================================|
1.43 GB
```
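FWIW, one recipe I've seen for a rough peak-memory estimate from inside R (this only captures R-level allocations, so take it with a grain of salt; a tool like valgrind measures the whole process):

```r
# Reset the "max used" high-water mark, do the read, then inspect it.
gc(reset = TRUE)
y <- data.table::fread("~/../Downloads/trip_fare/trip_fare_1.csv")
gc()  # the "max used" column now reflects the peak during the read
```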
Neither do I, tbh... valgrind is the buzzword I have in mind 🤔
https://github.com/burntsushi/xsv
Just came across this; not sure if it's worth a separate issue, but it could also be useful to benchmark since it's getting Hacker News hype.
In the README, I see a timing of about 12 seconds to read and summarize this file:
http://burntsushi.net/stuff/worldcitiespop.csv
On my machine, with data.table, download+read+summarize took 60 seconds; read+summarize took 9.4 seconds out of the box, vs:
```
time xsv stats worldcitiespop.csv --everything | xsv table

field type sum min max min_length max_length mean stddev median mode cardinality
Country Unicode ad zw 2 2 cn 234
City Unicode bab el ahmar Þykkvibaer 1 91 san jose 2351892
AccentCity Unicode Bâb el Ahmar ïn Bou Chella 1 91 San Antonio 2375760
Region Unicode 00 Z9 0 2 13 04 397
Population Integer 2289584999 7 31480498 0 8 47719.570633597126 302885.5592040396 10779 28754
Latitude Float 86294096.37312101 -54.933333 82.483333 1 12 27.188165808468785 21.95261384912504 32.4972221 51.15 1038349
Longitude Float 117718483.57958724 -179.9833333 180 1 14 37.08885989656418 63.223010459241635 35.28 23.8 1167162

real    0m7.890s
user    0m15.361s
sys     0m1.475s
```
Of course the summarize command is _only_ doing that, whereas fread will bring the _entire_ dataset into memory before summarizing, so it still feels like fread has the advantage.
More potential benchmarks here. Might be useful to run all these commands and get the total time for xsv vs. data.table, to help illustrate the advantage of having the object in memory; a rough sketch follows below.
Anyway it looks like a nice tool for poking around CSVs on the command line.
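A rough sketch of such a comparison (the data.table summary is only a loose analogue of `xsv stats --everything`, not an exact reproduction, and it assumes a Unix-y shell with `xsv` on the PATH):

```r
library(data.table)

# Total time for read + a representative summary in data.table
system.time({
  DT <- fread("worldcitiespop.csv")
  DT[, .(min  = min(Population, na.rm = TRUE),
         max  = max(Population, na.rm = TRUE),
         mean = mean(Population, na.rm = TRUE),
         cardinality = uniqueN(City))]
})

# xsv timed from the same session for comparability
system.time(system("xsv stats worldcitiespop.csv --everything > /dev/null"))
```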
Nice find! Ostensibly a slightly different domain: xsv appears to be aimed more at one or two queries on a fresh csv, where the 'bottleneck' of fread is enough to make data.table slower. But the gap appears to close pretty damn fast, so it would seem data.table should dominate in almost all use-cases.
Loading data into R requires populating R's global string cache, which AFAIK is single-threaded, so it will be relatively easy for non-R tools to be faster than fread. A fair comparison in this case would be against the C-level fread, without involving the R layer. Anyway, hopefully we will address that (populating R's global string cache) in the future.
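A quick way to see the string-cache cost in isolation (a synthetic sketch, not a rigorous benchmark):

```r
library(data.table)

# Every distinct string read into R must be interned in the
# (single-threaded) global string cache; numeric columns skip it.
n <- 5e6
fwrite(data.table(x = as.character(sample(n))), "chr.csv")
fwrite(data.table(x = sample(n)),               "int.csv")

system.time(fread("chr.csv"))  # expect this to be noticeably slower...
system.time(fread("int.csv"))  # ...largely because of string interning
```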
In python datatable, the file was read in 0.54s; reading+summarizing took 4.9s
Excellent... maybe worth a short blog post then (could also be used to show off pydatatable, as well as the new-ish benchmarking vignette)... if I find time soon I'll at least outline one 👍