Julia: Poor performance of readdlm on master

Created on 7 Mar 2015 · 60 Comments · Source: JuliaLang/julia

Here is a simple benchmark for reading real-world data in TSV (tab-separated values) format. The data are financial time series from a proprietary source with 47 fields. The smaller samples are constructed by taking the first n rows of the original file, which has 100,654,985 rows and is 21.879 GiB in size.

Timings reported in seconds. Timings under 60s are best of 5. Timings over 60s were run once.

| number of rows | wc -l | grep -c | data.table fread | R read.table | pandas.read_csv | readdlm | DataFrames.readtable | timeit pandas.read_csv | gc_disabled readdlm | gc_disabled DataFrames.readtable |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1,000 | 0.003 | 0.003 | 0.013* | 0.035 | 0.023 | 0.028 | 0.033 | 0.013 | 0.029 | 0.020 |
| 10,000 | 0.004 | 0.007 | 0.052* | 0.316 | 0.100 | 0.353 | 0.357 | 0.080 | 0.316 | 0.202 |
| 100,000 | 0.017 | 0.035 | 0.461* | 5.190 | 0.666 | 6.177 | 20.090 | 0.749 | 3.424 | 2.643 |
| 1,000,000 | 0.132 | 0.278 | 4.604* | 48.302 | 7.846 | 268.673 | 47.400 | 6.704 | 35.731 | 24.813 |
| 10,000,000 | 1.292 | 2.732 | 47.503* | 549.017 | 93.413 | 25298.222 | 916.866 | 71.286 | 348.457 | 264.402 |
| 100,654,985 | 12.691 | 25.779 | 725.966* | 5147.951 | 799.608* | | | 825.816* | | |

*timings obtained from files with NUL characters stripped.

Note that the pandas timings obtained with timeit have the Python garbage collector disabled, and so are fair comparisons with the gc_disabled numbers in Julia. The first pandas column is timed with simple time.time() wrappers and is analogous to Julia's @time reports.

It's quite clear from the timings that garbage collection causes the scaling to become superlinear, with the result that readdlm and readtable become unusable for large data, even on a machine large enough to read all the data into memory. However, the baseline performance of pandas is still far superior to what we have to offer in Julia, even with garbage collection turned off. This is perhaps unsurprising, since the main pandas parser is written in Cython and not pure Python.

We could probably do much better if readdlm generated less garbage.


Actual commands run:

wc -l datafile.tsv
grep -c AAPL datafile.tsv
#Python 2.7.9 |Anaconda 2.1.0
t=time(); A=pandas.read_csv("datafile.tsv", sep="\t", header=None); time()-t
timeit.timeit('pandas.read_csv("datafile.tsv", sep="\t", header=None)', setup='import pandas', number=5)/5
#Julia 0.4-dev
gc(); readdlm("datafile.tsv", '\t');
gc(); DataFrames.readtable("datafile.tsv", separator = '\t', header = false);
gc_disable(); #repeat readdlm, readtable, etc.
#R 3.1.2
> system.time(read.table("datafile.tsv", sep="\t", header=FALSE, skipNul=TRUE))                                                                                                   
   user  system elapsed 
  0.034   0.000   0.035
... similarly
  0.316   0.000   0.316 
  5.148   0.040   5.190 
 47.191   0.802  48.302 
535.361  10.463 549.017
4867.606  229.902 5147.951

> system.time(fread("datafile.tsv", sep="\t", header=FALSE)) #Chokes on NUL characters
   user  system elapsed 
  0.008   0.004   0.013 
... similarly
  0.052   0.000   0.052 
  0.457   0.004   0.461 
  4.430   0.173   4.604 
 46.230   1.051  47.503 
703.847  18.296 725.966 

Julia @time info

readdlm (returns Array{Any,2})
10^3 elapsed time: 0.027763177 seconds (7 MB allocated)
10^4 elapsed time: 0.353573064 seconds (80 MB allocated, 29.06% gc time in 3 pauses with 0 full sweep)
10^5 elapsed time: 6.177088453 seconds (809 MB allocated, 52.92% gc time in 19 pauses with 4 full sweep)
10^6 elapsed time: 268.672912983 seconds (8097 MB allocated, 86.89% gc time in 160 pauses with 17 full sweep)
10^7 elapsed time: 25298.222635691 seconds (80990 MB allocated, 98.63% gc time in 2424 pauses with 177 full sweep)

DataFrames.readtable
10^3 elapsed time: 0.033111172 seconds (4 MB allocated)
10^4 elapsed time: 0.356714838 seconds (41 MB allocated, 11.54% gc time in 1 pauses with 0 full sweep)
10^5 elapsed time: 20.089927057 seconds (433 MB allocated, 79.65% gc time in 5 pauses with 3 full sweep)
10^6 elapsed time: 47.399711163 seconds (4176 MB allocated, 13.33% gc time in 40 pauses with 8 full sweep)
10^7 elapsed time: 916.86617322 seconds (40755 MB allocated, 52.70% gc time in 523 pauses with 39 full sweep)
Label: performance


All 60 comments

There's no doubt that Pandas is doing much better than we are.

That said, I'd be interested to see how CSVReaders compares. For very large data sets, it seems to do much better than readtable because it doesn't produce an intermediate representation of all strings.

Actually, I take that back. CSVReaders is slower: it just uses 50% as much memory, but takes noticeably longer to run.

One of the things I suspect is problematic for the readtable parser is the way that type inference interprets it. Lines like https://github.com/JuliaStats/DataFrames.jl/blob/3320b2b7d160be66bb6a4bd3f95f78b8be6e0d23/src/dataframe/io.jl#L510 are, I fear, a major problem. (And this is endemic with DataFrames because of the Vector{Any} representation of the columns.)

I tried the following code to get a sense of things:

function rd(f)
    nr = countlines(f)                    # first pass just to count rows
    fh = open(f)
    nc = length(split(readline(fh),','))  # infer column count from the first line
    seekstart(fh)
    d = Array(Float64, nr, nc)            # preallocate the full output matrix
    for i = 1:nr
        r = split(readline(fh),',')
        for j = 1:nc
            d[i,j] = float64(r[j])        # parse each cell into the matrix
        end
    end
    d
end

I compared this to readdlm on a simple delimited file of 5000000x4 random numbers:

julia> @time readdlm("rand3.csv",',');
elapsed time: 10.446378342 seconds (2791 MB allocated, 2.32% gc time in 34 pauses with 7 full sweep)

julia> @time rd("rand3.csv");
elapsed time: 9.385646328 seconds (2784 MB allocated, 2.10% gc time in 121 pauses with 1 full sweep)

"Only" a 10% speedup, but fairly shocking given how trivial my code is.

Are you shocked because it's a 10% speedup or because it's only a 10% speedup?

Because there's any speedup at all. A 500-line implementation ought to be much faster.

FWIW, I've done a bunch of profiling while working on CSVReaders. The places where you get killed are often surprising. For example, I got a huge speedup by hardcoding the sentinel values that you check for missing values and Booleans. Having those sentinel values stored in arrays that get looped through was roughly 50% slower than hardcoding them.
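To make the effect concrete, here is a rough illustration of the two styles (the names and sentinel values below are made up for this sketch, not what CSVReaders actually uses):

```julia
# Membership test by looping through an array of sentinel strings:
# every call walks the array and compares element by element.
const NA_STRINGS = ["NA", "NULL", ""]
isna_loop(s::AbstractString) = any(x -> x == s, NA_STRINGS)

# The same test with the sentinels hardcoded: no array traversal,
# and the compiler can specialize the chain of comparisons.
isna_hard(s::AbstractString) = s == "NA" || s == "NULL" || s == ""
```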

I'm not sure I agree with the heuristic that a larger implementation should be much faster. For example, a lot of the code in readtable is spent checking branches to make sure that type inference is still correct.

All that said, I would be ecstatic if we managed to figure out how to write a fast CSV parser in pure Julia, so I'm really happy to have you thinking about this.

Part of the key seems to be separating CSV, which is a very complex format, from simple TSV. If you happen to have a dead-simple TSV file, we should be able to read it super fast.

While that's clearly true in the abstract, I think it's important that the pandas benchmark is a reader for the full CSV format.


With 20 more minutes of work, I can beat readdlm in this case by a full factor of 2 (https://gist.github.com/JeffBezanson/83fb1d79deefa2316067). I guess my point is that if you're going to have a large implementation, it might as well include special cases like this. Of course full CSV is what matters, but it's _even worse_ to be slow on simple numeric TSV files.

Much of the complexity of readdlm comes from handling nuances of the csv format, like:

  • quoted columns that allow both row and column separators and quotes within them
  • trimming spaces
  • empty lines
  • missing columns, which necessitates parsing the whole file before determining output dimensions

Some of these can be disabled with appropriate flags, but they do have some residual impact.
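To make the first nuance concrete, here is a line where naively splitting on the separator miscounts the fields, because a quoted field itself contains a comma:

```julia
line = "1,\"Doe, John\",3.5"   # three fields, one of them quoted
split(line, ',')               # yields 4 pieces; a real CSV parser must track quote state
```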

Also, the approach of reading/mapping the whole file is probably no longer giving enough benefit, given speedups in other parts of Julia. It is now proving to be a bottleneck with large files. A way to handle files in chunks would probably help here.

It would be good to have a simple and fast method while being able to plug in one or more routines to handle the complexity, without a large impact on performance.

I think the approach of parsing the entire file before determining output dimensions is having a large impact. With that disabled (by specifying dims), here's what I got on a 10GB file with 60000000 x 10 floats on julia.mit.edu:

julia> @time a = readdlm("randf2.csv", ';', Float64; dims=(60000000, 10));
elapsed time: 401.356118774 seconds (33393 MB allocated, 0.58% gc time in 841 pauses with 1 full sweep)

julia> size(a)
(60000000,10)

But I was not able to read a 40GB file because it ran into memory/gc issues.
gc was enabled during the test.

@tanmaykm here are the timings for the original example when the dimensions are pre-specified:

julia> for i=3:7; @time A=readdlm("testfile.tsv", '\t', dims=(10^i,46)); end
elapsed time: 0.017998925 seconds (3 MB allocated)
elapsed time: 0.309120058 seconds (41 MB allocated, 41.21% gc time in 2 pauses with 1 full sweep)
elapsed time: 4.848010965 seconds (423 MB allocated, 57.23% gc time in 18 pauses with 2 full sweep)
elapsed time: 184.121671697 seconds (4234 MB allocated, 88.44% gc time in 154 pauses with 16 full sweep)
elapsed time: 15579.129143684 seconds (42359 MB allocated, 98.69% gc time in 1512 pauses with 154 full sweep)

With gc disabled:

julia> for i=3:7; @time A=readdlm("testfile.tsv", '\t', dims=(10^i,46)); end
elapsed time: 0.019123992 seconds (3 MB allocated)
elapsed time: 0.199007951 seconds (41 MB allocated)
elapsed time: 2.031139609 seconds (423 MB allocated)
elapsed time: 19.807757219 seconds (4234 MB allocated)
elapsed time: 175.472120788 seconds (42359 MB allocated)

It does look like pre-specifying the dimensions helps cut execution time almost by half. So much the better, since the dimensions can be determined from UNIX shell commands first for a fraction of the cost in Julia ;-)
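For reference, a minimal sketch of that workflow (using countlines in place of `wc -l`; the file name and column count are taken from the example above):

```julia
# Pay for one cheap pass to count rows, then let readdlm skip its sizing pass.
nrows = countlines("testfile.tsv")
A = readdlm("testfile.tsv", '\t', dims=(nrows, 46))
```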

Updated OP with timings for R's read.table.

@jiahao If you're looking for a fast function to benchmark, fread() in the data.table R package claims to be much faster than the standard read.table().

@jiahao I'll try and profile the code to figure out memory allocation hotspots. Also, the data is being read as an Any array, which is much slower than reading into a numeric/bitstype array. And parsing ASCII will be faster than UTF-8.

In this particular case, cleaning/converting these to numeric values/factors or separating them to be read from a different file may help:

  • columns 1,2,8,13 look like date/time stamps
  • column 15 looks like a category type
  • columns 23,24,25 seem to have nulls, and probably some non-ASCII bytes also...

Seconding @nalimilan, fread is the R functionality to test against. Never used read.table when I was using R every day.

I'll have to re-devise the test then. This test file contains NUL characters, which break fread.

Updated OP with timings for data.table fread(). The timings for the other commands did not change significantly when rerunning with files stripped of NULs.

If anyone was interested, stripping the original file of NULs with tr -d '\000' took all of 1 minute.

Looks like the SubString created here https://github.com/JuliaLang/julia/blob/master/base/datafmt.jl#L163 contributes most of the garbage generated in readdlm.
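One way to see why that matters: the substring exists only so a value can be parsed out of a byte range. A hypothetical helper that parses directly from the underlying byte buffer allocates nothing at all (a sketch only, not the actual datafmt.jl code):

```julia
# Parse a non-negative integer from buf[lo:hi] without materializing
# an intermediate SubString.
function parse_int_range(buf::Vector{UInt8}, lo::Int, hi::Int)
    n = 0
    for i in lo:hi
        c = buf[i]
        UInt8('0') <= c <= UInt8('9') || error("unexpected byte $c")
        n = 10n + Int(c - UInt8('0'))
    end
    return n
end
```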

Here is how Base.readdlm now performs on master after @tanmaykm's improvements:

| number of rows | elapsed (s) | % gc time | elapsed (s) with gc_disable |
| --- | --- | --- | --- |
| 10^3 | 0.020 | - | 0.025 |
| 10^4 | 0.323 | 35.93% gc time in 3 pauses with 0 full sweep | 0.244 |
| 10^5 | 4.246 | 39.30% gc time in 16 pauses with 4 full sweep | 2.419 |
| 10^6 | 25.186 | 6.06% gc time in 11 pauses with 7 full sweep | 21.251 |
| 10^7 | 9667.191 | 96.87% gc time in 1946 pauses with 110 full sweep | 283.008 |

Unfortunately it looks like the quadratic scaling behavior still persists for n = 10^7 rows. At >200x slower than R's fread, there is still a lot of work to be done.

@carnaval can we turn a knob in the GC to make it back off faster?

We can, the current heuristics are nowhere near perfect, but changing this requires at least a bit of regression testing w.r.t. memory usage, which is a pain.
TBH, in most cases where I've seen this, the best fix is to disable gc and re-enable it afterwards, since the vast majority of the memory is useful. I know it's not very satisfactory, but since the opposite problem is huge memory usage, which Julia is already pretty lenient about, I'm not sure what the right thing to do is.
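For the record, the workaround being described looks like this (using the same gc_* API that appears elsewhere in this thread):

```julia
gc_disable()                       # suspend collections for the hot parse
A = readdlm("datafile.tsv", '\t')
gc_enable()                        # turn the collector back on...
gc()                               # ...and pay for the accumulated garbage once
```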

Manually disabling GC is not acceptable. Surely the jump from 11 to 1946 pauses in the last table row can be avoided. I'm even ok with 39% GC time, just not 96%.

Of course 96% gc time is not acceptable, but even if you get it down to 40%, surely people will disable gc if, as in this case, it means no memory overhead and almost 2x performance.

If people want to disable gc to tweak performance, that's fine. But there is a massive difference. 2x slower is acceptable, 100x is not.

My tests on a 200M file, 46 cols (strings and numerics):
So there are two problems here. The first one is heuristics which are not very good for this kind of workload. The gc still increases the collect interval up to about 4GB (which is very high), but a few collections could be avoided by tuning. I'll have a look.
The second one is that we are boxing _every single value_ into a huge array. The array ends up in the old generation quickly but needs to be fully remarked at every collection. The only way around this is card marking (which we don't have).
I'm pretty sure the parser design could be more efficient, by being optimistic about value types and keeping a separate array for the few values that don't fit the inferred types.
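A hand-wavy sketch of that optimistic scheme, in later Julia syntax and with made-up names: parse each column into a concretely typed vector and keep the rare misfits on the side, instead of boxing every cell.

```julia
function parse_column_optimistic(cells::Vector{String})
    vals = Vector{Float64}(undef, length(cells))
    misfits = Dict{Int,String}()          # row index => raw text that didn't parse
    for (i, c) in enumerate(cells)
        v = tryparse(Float64, c)
        v === nothing ? (misfits[i] = c) : (vals[i] = v)
    end
    return vals, misfits
end
```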

There's probably more to be done in terms of writing a better reader too, separate from the gc issue.

julia> gc_disable();

julia> @time A=readall(open("try6")); #Read everything in
elapsed time: 0.693111661 seconds (768 MB allocated)

julia> B=[]; @time for l in readlines(open("try6"))
          push!(B, split(l, '\t'))
       end #Very dumb three-line version
elapsed time: 5.365409988 seconds (3262 MB allocated) 

julia> @time C=readdlm("try6");
elapsed time: 26.533051742 seconds (7410 MB allocated)

R: 4.604 seconds

grep: 0.278 seconds

The three-line Julia code (B) is comparable to R in speed, and is still ~8x slower than pure I/O speed (A). C obviously does more work, but I don't immediately see why it is 5x slower than B.

Also, @carnaval noted that reading in a 200 MB text file produced a 10^6 x 46 Array{Any, 2} that occupies 8 GB of memory. That's ~170 bytes of overhead per array entry, which seems excessive.

I know very little about this, but I've read a few papers that seem to find that doing GC based on time rather than allocation works better. Would that be an option, @carnaval?

It would help at least a bit not to have every object 1 word bigger than necessary (#10898).

@StefanKarpinski Never heard of this, but seeing how brittle _and_ performance sensitive these kinds of choices are, I wouldn't be surprised if this could give good results in some cases. If you have a link I'd be interested in those papers. Implementation could be very easy, depending on the kind of choices you make for the next collection time.

@JeffBezanson That would help for sure...

So I added (another) GC heuristic to cope with this situation. See the ob/gctune branch. The 2GB file (same one as in Jiahao's 10^7 test) now has only around 30% gc overhead. As you can imagine, it's a lot faster (around 7 min).
I won't merge this now because, even if this should only trigger in very specific heavy-retention/low-garbage cases, I'd still prefer to at least run the bench suite to check for regressions. It should also give worse memory usage in those cases, but at least you're not spending your whole time in GC. About memory usage, we could also probably lower max_collect_interval (and make it configurable for people with big machines), because I rarely have 4GB of useless heap space for the GC to slack off in.

I'll try implementing a different parser (the DLMHandler part) to avoid the double pass. Will post here if that turns out to be better.

Tried an implementation that:

  • parses into column vectors instead of a matrix
  • estimates nrows, column types and resizes/changes types when required
  • returns hcat(column vectors)

Running this on the test file try6, the row count and column type estimates are pretty accurate.
Memory allocation seems to be lower.

But there's not much improvement in time. It's probably running into some other issue I'm unable to identify.

With this change:

julia> @time C=readdlm("try6");
newsize: 1007563
final dims: 1000000,46
elapsed time: 49.610953899 seconds (5945 MB allocated, 50.52% gc time in 190 pauses with 12 full sweep)

Without this change:

julia> @time C=readdlm("try6");
elapsed time: 41.688899803 seconds (7473 MB allocated, 31.27% gc time in 14 pauses with 7 full sweep)

The modified source can be found at /home/tanmaykm/julia/base/datafmt.jl on the julia machine.
Is there anything that can be improved here?

Hey Tanmay, could you try it on the ob/gctune branch?
If this doesn't get any improvement I will have a look tomorrow.

Sure, will do.

I would guess the hcat step is pretty bad, since that re-introduces boxes for everything. For CSV files with columns of different types, I'm not sure there's much point in using anything other than a DataFrame.

It may not be a bad idea to have readdlm return only arrays of concrete types, and ask the user to use DataFrames, instead of returning Array{Any}. The other solution could be to return a tuple of vectors, but still enforce that all values in a column have the same type.
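For illustration, the suggested return shape might look like this: one concretely typed vector per column instead of a single Array{Any,2} (values made up):

```julia
cols = (Int64[1, 2, 3],
        Float64[1.5, 2.5, 3.5],
        ["a", "b", "c"])
map(eltype, cols)   # (Int64, Float64, String): every column stays concrete
```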

Yes, that's probably the way to go.

On the ob/gctune branch:

With this change:

julia> @time C=readdlm("try6");
newsize: 1007563
final dims: 1000000,46
elapsed time: 27.451447013 seconds (5945 MB allocated, 4.43% gc time in 10 pauses with 6 full sweep)

And without:

julia> @time C=readdlm("try6");
elapsed time: 40.300693973 seconds (7473 MB allocated, 27.22% gc time in 12 pauses with 7 full sweep)

I have pushed my changes to the tan/readdlm2 branch.

The hcat seems to be pretty fast (1 sec):

julia> @time C=readdlm("try6nonuls", '\t');
newsize: 1007257
colvectors created in 22.198546171188354 time
final dims: 1000000,46
hcat in 1.111128807067871 time
elapsed time: 25.18618537 seconds (5944 MB allocated, 4.85% gc time in 10 pauses with 6 full sweep)

At this stage, converting byte ranges to values seems to be the most expensive step.
Just parsing delimiters to identify cell offsets takes:

julia> @time C=readdlm("try6", '\t');
elapsed time: 3.700563769 seconds (206 MB allocated, 0.64% gc time in 1 pauses with 0 full sweep)

So the new heuristic seems to trigger correctly. I will merge when I get around to running the perf suite somewhere (let's say Monday as a deadline).

@carnaval, do you have access to the julia.mit.edu machine? If not, you should.

Yep but I'm not sure it's the right place given the whole swarm of VMs running buildbots. Maybe if I pin a CPU (and make sure blas threads are well behaved ... god I hate benchmarking).

We could give you an ec2 machine if it helps.

There is also csail cloud

Ok, I still haven't gotten around to doing this, sorry.

The first-order improvement would probably be to introduce a custom StringArray which stores every String linearly in an IOBuffer and the offsets as ranges in a separate array. You would avoid most gc/allocation traffic, have much better cache behavior, etc. Mutation becomes a problem, though, and pushing this to maximum performance would probably make you implement a specialized "depth-1" compacting gc for strings. I'm almost certain that doing it naively at first would already be a _huge_ improvement compared to allocating each string separately.

It's also easier for the gc because you only materialize the individual strings on getindex, and they are likely to stay in young gen and be cheap to free.

Since a lot of the columns seem to be strings, I think it would be very worthwhile to try it.
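A minimal sketch of such a StringArray, in today's syntax (the type and field names are invented; mutation and the compacting-gc ideas above are left out):

```julia
struct PackedStrings
    data::Vector{UInt8}               # every string's bytes, back to back
    offsets::Vector{UnitRange{Int}}   # byte range of each string within data
end

Base.length(p::PackedStrings) = length(p.offsets)
# Materialize a real String only on demand, as described above; the copy is
# small, young, and cheap for the gc to free.
Base.getindex(p::PackedStrings, i::Int) = String(p.data[p.offsets[i]])
```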

Bump. Does something have to be done on the GC front here, or on reorganizing readdlm?

Some more GC related weirdness.

I have a simple test: I create files containing n comma-separated integers, use readall to read it all in as a single string, split it with "," as the separator, print the length of the read-in string, and exit, using the Unix time command to get timing data. Here are the timings for Julia vs Python 2.7 on my machine (Ubuntu 14.04, 32 GB RAM):

| N | Python 2.7 | Julia 0.4-dev | Julia 0.4-dev gc off |
| --- | --- | --- | --- |
| 10k | 0.027 seconds | 0.207 seconds | 0.217 seconds |
| 100k | 0.036 seconds | 0.20 seconds | 0.229 seconds |
| 1 million | 0.053 seconds | 0.288 seconds | 0.257 seconds |
| 10 million | 0.475 seconds | 0.823 seconds | 0.622 seconds |
| 100 million | 3.812 seconds | 51.273 seconds | 3.053 seconds |
| 300 million | 13.15 seconds | 25.61 seconds | 10.98 seconds |
| 600 million | 33 seconds | Didn't complete after twenty minutes | 36 seconds |

The Python code

#just read contents of a file so I can time it
import string
def read_my_file(file_name):
    f = open(file_name)
    s = f.read()
    a = string.split(s,",")
    print(len(s))
    f.close()

read_my_file("sixhundredmillionints.txt")

The Julia code with GC on

#just read a file so I can time it
f = open("sixhundredmillionints.txt")
s = readall(f)
a = split(s,",")
print(length(s));
close(f)

Julia code without GC

# gc off to avoid gc flakiness
gc_disable();
f = open("sixhundredmillionints.txt")
s = readall(f)
a = split(s,",")
print(length(s));
close(f)

probably the same thing that is solved on ob/gctune

So (sorry newbie here, excuse the dumb question) are you saying the GC is fixed now and can handle large inputs without flailing? If so any idea when that branch will get merged into main?

Well, it is not merged yet, but at least on this branch it should not do what I think it's doing in your example, which is: believing it is doing well when it's not, and because of that not widening the collection interval.

There is no "general fix" for this, only improvements, although this one is much needed because otherwise, in cases like yours, the heuristics are actively making things worse.

Something I've been working on

Trying to load up to a 100 million row, 24 column dataset consisting of random integers, floats and strings (limits: no other datatypes, no nulls, only comma as separator, doesn't infer schema) into a data structure which is essentially this:

ColumnType = Union(Array{Int64,1},Array{Float64,1},Array{AbstractString,1})
#this should probably be a list, but array works for now
DataFrameType = Array{ColumnType,1}

i.e. a dataframe is an array of columns, each column being a Union of Int Array, Float Array, String Array (this is not an original idea; this is how the Pandas dataframe works, modulo the types).
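Building one of these by hand might look like the following (illustrative only; in practice the columns come from the parser):

```julia
df = DataFrameType()                         # empty dataframe: a vector of columns
push!(df, Int64[1, 2, 3])                    # an integer column
push!(df, Float64[0.1, 0.2, 0.3])            # a float column
push!(df, AbstractString["x", "y", "z"])     # a string column
```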

I'm getting the following loading times on my 4 core 32 GB Ubuntu 14.04 machine.

All times in seconds.

| Number of Rows | R read.table | R fread | Julia 0.4-dev gc on | gc % | Julia 0.4-dev gc off |
| --- | --- | --- | --- | --- | --- |
| 1000 | 0.02 secs | 0.003 secs | 0.06 secs | 0% | 0.06 secs |
| 10000 | 0.33 secs | 0.043 secs | 0.12 secs | 12.19% | 0.13 secs |
| 100000 | 3.489 secs | 0.237 secs | 0.79 secs | 23.54% | 0.6 secs |
| 1000000 | 19.459 secs | 1.939 secs | 13.3 secs | 57.85% | 6.2 secs |
| 10000000 | 163 secs | 16.389 secs | 635 secs | 91.65% | 59.03 secs |
| 100000000 | Killed after 30 minutes | 277 secs | Killed after 30 minutes | NA | Memory exhausted |

The code is here https://github.com/RaviMohan/loaddataframe . The Python code generates the dataset, the Julia code loads the data from the file into the above datastructure.

If the code is basically correct and not doing anything stupid (errors are always possible, all fixes appreciated in advance :-) ), it seems that without gc, plain Julia comes within a constant factor of R's fread, and handily beats R's read.table. Of course R's routines do much more, like inferring column datatypes, handling nulls, handling multiple formats, etc.

Still it _seems_ that fast loading of large csv files _is_ possible, once the gc issues are fixed and/or enough memory is available.

I don't know how to write idiomatic Julia, so the code might be too imperative etc. Any style pointers (and of course bug reports/fixes) greatly appreciated.

@RaviMohan sounds like progress! How does loaddataframe perform on the test data set in this issue?

Perhaps you and @quinnj could discuss how to further combine efforts between https://github.com/quinnj/CSV.jl and https://github.com/RaviMohan/loaddataframe. @quinnj has also implemented his own CSV reader and he showed me some tests showing also good performance.

(Also a quick note, the syntax f(x) is preferred over f (x))

@jiahao I haven't yet tested on the test data set, primarily because I haven't implemented Date/DateTime parsing, and I don't yet have it on my local machine. But the code is very generic and you should see similar speedups (not to mention that a 100 million row csv actually loads completely, unlike in your original tests, provided enough memory is available; you'll need a lot of it. On my 32 GB machine, 100 million rows with 24 columns, with very short strings, completely overwhelms the RAM, and it doesn't work with gc on at all).

My focus with this was on getting the underlying data structure correct, avoiding Array{Any, ...} etc. This is the primary learning which comes out of this effort: we _can_ avoid Array{Any} and still load 100 million lines of data (but schema inference is harder, see below). And of course the array-of-columns design vs. a 2D array as the underlying data structure.

Also, I have hardcoded comma as the separator, which is probably not very conducive to the original data set. This is primarily the last in many iterations of code explicitly trying to see if we can come close to/beat R's reading speed; it is very hacky and not really production quality code imo.

(I must say I am very impressed that fairly straightforward Julia code can come close to R's fread. That is some _very_ impressive language design/compiler work.)

The next important things to do are to handle nulls, and to generate the schema by examining/sampling the dataset (vs. hardcoding it as of now). I'm not quite sure (yet) how to do this, since it seems to involve the possibility of the ColumnType changing dynamically. A column can have the first 10 million values look like Ints, then have the rest be floating-point values, say. If we are not using Anys then the column's type has to change when you encounter the first float. Of course this is not a problem for dynamic languages.
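A minimal sketch of that promotion step (assuming Int64 to Float64 is the only widening needed):

```julia
col = Int64[1, 2, 3]                   # the column looked like Ints so far
col = convert(Vector{Float64}, col)    # widen the whole column once...
push!(col, 4.5)                        # ...and keep parsing as Float64
```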

To handle nulls I tried to use Union(Array{Nullable{Int64},1} .. ) etc., but the memory usage seems to blow up pretty badly on this and I can't figure out why. Something to pursue further.

I don't think @quinnj 's CSV reader has much to learn from this on the parsing side, as I am just using string split with '\n' and ',' respectively :-P . I don't deal with any of the complexities of CSV parsing, which can get pretty hairy and full of gnarly details. That said, of course I'd be glad to help @quinnj in any way I can.

String splitting also interacts weirdly with the gc when memory gets tight. I probably need to investigate and file a separate issue. (though it just might be an instance of https://github.com/JuliaLang/julia/issues/6103 )

Another thing I'm looking at in the coming weeks is to see if Spark's "Dataframe-as-API + multiple conforming components + a generic query compiler" pattern can be adapted to our needs. If so, users can use the API without really knowing or caring about implementation details, and multiple sources of data can be trivially added. We'll have simpler datastores than RDDs (of course!) but the basic architectural pattern might be valuable to decouple data sources and implementations.

@ViralBShah was mentioning something called "JuliaDB". As and when it takes concrete form, there might be commonalities/common design problem etc.

Thanks for the syntax pointer. Appreciated!

With all the recent successes of CSV.jl, I think it's fair to close this issue.

We should remove the base readdlm if we're going to tell people to use something different.

As I recollect, we probably should leave readdlm to read homogeneous data - all of the same type.

Julia should have been built to use a database transparently for the user and be able to do any operation streaming from it (including mixed effects regression or bayesian statistics).

@skanskan There's no question of whether "Julia should have been built" that way or not. readdlm is here to cater to simpler use cases, but more complex scenarios are better handled via packages. See e.g. the DataStreams.jl and OnlineStats.jl packages.

This line of comments does not belong on the issue tracker. Please ask these things on julia-users.
