Dataframes.jl: Provide indexing feature to allow for fast sort, join, and group-by operations

Created on 30 Jan 2018  路  7Comments  路  Source: JuliaData/DataFrames.jl

I tested IndexedTables' indexing feature and group-by performance is 10x that of DataFrames.jl, so it would be good to introduce indexing into DataFrames.jl

non-breaking

All 7 comments

Sure, feel free to make a PR if you find a way of integrating indexes to the existing code.

This is actually an important issue for me also. I would start with the following observations (if some of them are wrong please correct me - the issue is complex):

  • that I would not expect any of the operations you mention be slow on DataFrame if they are done one-time only (other data structures have to perform the same kind of work anyway - but maybe only one time not every time);
  • the only situation when the above would not be true is because DataFrame is not type stable, but e.g. join uses findrows to overcome this problem; I have not checked all methods in detail if they are careful enough to do it but they seem to do;
  • so we are left with the case of running some operation many times on DataFrame (e.g. joins); as DataFrame is a thin wrapper over AbstractVectors we do not have control if those vectors are updated so there is no way to ensure from DataFrame level that its index is not invalidated by the user
  • some solution (maybe not ideal but could solve some problems) is to add a set of specialized concrete subtypes of something that could be called AbstractIndexedVector{T} that would make sure that when updated with setindex! it would set dirty flag and recalculate its index if needed when lookup is required to the index (the exact API could be discussed). Probably we would need concrete types for indices allowing and disallowing duplicates and special types for categorical vectors. But this should be doable.

Then all join, group-by, sort etc. functions could be specialized to take advantage of those types if they are encountered. This approach is something we can do in Julia, but is out of reach in other languages that do not have such a rich type system (it has the disadvantage that indexing on multiple columns would be a bit slower than the theoretically optimal performance but would already give us benefits I think).

Any thoughts about that?

that I would not expect any of the operations you mention be slow on DataFrame

They are currently slow if you benchmark to R's data.table, see my discourse post

Then all join, group-by, sort etc. functions could be specialized to take advantage of those types if they are encountered.

Good idea. In IndexedTables.jl the indexing is just sorting the data by those keys. This is the approach in data.table as well. I don't see a better way than sorting it. Especially if it's in memory.

Agreed. What I wanted to say is that we have two separate issues here:

  • making the operation fast "one time" (and this in theory should be as fast on DataFrame as on any other data structure - of course there is room for optimizations, e.g. because Julia does not intern strings then either other string type or categorical vectors should be recommended for speed); @xiaodaigh I believe you have a very promising progress in this area; now even sort! for DataFrame on column of Float64 is much slower than order in data.table
  • making an index type (and sorting is a natural approach) - which ensures that repeated calls are fast (then we need to make sure that sorting is fast);

Also an issue here is that we could consider having a fast path in groupby/by that could take advantage of knowing that the data frame is sorted already.

Would this be similar to pandas' DataFrame.index (one per table), or would indices apply to individual columns?

No - adding an index to a DataFrame is achieved via running groupby on it.

The point here is that issorted is much (orders of magnitude) faster than by. Therefore we could run issorted test on grouping columns before grouping. If this test passes then we could potentially use a faster path in groupby/by (as we know that groups will be consecutive continuous blocks of rows in the data frame).

Was this page helpful?
0 / 5 - 0 ratings

Related issues

mattBrzezinski picture mattBrzezinski  路  5Comments

rofinn picture rofinn  路  3Comments

yakir12 picture yakir12  路  6Comments

tlienart picture tlienart  路  8Comments

abieler picture abieler  路  7Comments