I tested IndexedTables' indexing feature and group-by performance is 10x that of DataFrames.jl, so it would be good to introduce indexing into DataFrames.jl
Sure, feel free to make a PR if you find a way of integrating indexes to the existing code.
This is actually an important issue for me also. I would start with the following observations (if some of them are wrong please correct me - the issue is complex):
DataFrame if they are done one-time only (other data structures have to perform the same kind of work anyway - but maybe only one time not every time);DataFrame is not type stable, but e.g. join uses findrows to overcome this problem; I have not checked all methods in detail if they are careful enough to do it but they seem to do;DataFrame (e.g. joins); as DataFrame is a thin wrapper over AbstractVectors we do not have control if those vectors are updated so there is no way to ensure from DataFrame level that its index is not invalidated by the userAbstractIndexedVector{T} that would make sure that when updated with setindex! it would set dirty flag and recalculate its index if needed when lookup is required to the index (the exact API could be discussed). Probably we would need concrete types for indices allowing and disallowing duplicates and special types for categorical vectors. But this should be doable.Then all join, group-by, sort etc. functions could be specialized to take advantage of those types if they are encountered. This approach is something we can do in Julia, but is out of reach in other languages that do not have such a rich type system (it has the disadvantage that indexing on multiple columns would be a bit slower than the theoretically optimal performance but would already give us benefits I think).
Any thoughts about that?
that I would not expect any of the operations you mention be slow on DataFrame
They are currently slow if you benchmark to R's data.table, see my discourse post
Then all join, group-by, sort etc. functions could be specialized to take advantage of those types if they are encountered.
Good idea. In IndexedTables.jl the indexing is just sorting the data by those keys. This is the approach in data.table as well. I don't see a better way than sorting it. Especially if it's in memory.
Agreed. What I wanted to say is that we have two separate issues here:
sort! for DataFrame on column of Float64 is much slower than order in data.tableAlso an issue here is that we could consider having a fast path in groupby/by that could take advantage of knowing that the data frame is sorted already.
Would this be similar to pandas' DataFrame.index (one per table), or would indices apply to individual columns?
No - adding an index to a DataFrame is achieved via running groupby on it.
The point here is that issorted is much (orders of magnitude) faster than by. Therefore we could run issorted test on grouping columns before grouping. If this test passes then we could potentially use a faster path in groupby/by (as we know that groups will be consecutive continuous blocks of rows in the data frame).