DataFrames.jl: Materializing TableTraits sources via Tables.jl is slow

Created on 23 Feb 2019 · 4 Comments · Source: JuliaData/DataFrames.jl

I just ran some more performance tests, and the current Tables.jl code that kicks in when one materializes a TableTraits.jl (for example a Query.jl query) source into a DataFrame is slow, and in fact much slower than it used to be when that part was still powered by IterableTables.jl.

Here is what I get. Base case (with Tables#master):

julia> using BenchmarkTools, DataFrames, Query, TableTraitsUtils

julia> n = 10_000_000;

julia> df = DataFrame(a=rand(n), b=rand(n), c=randn(n));

julia> @benchmark DataFrame(df |> @map(_))
BenchmarkTools.Trial:
  memory estimate:  1.56 GiB
  allocs estimate:  60000081
  --------------
  minimum time:     1.022 s (17.90% GC)
  median time:      1.050 s (19.78% GC)
  mean time:        1.049 s (20.06% GC)
  maximum time:     1.084 s (19.55% GC)
  --------------
  samples:          5
  evals/sample:     1

When I use the code that is in IterableTables.jl to materialize a query into a DataFrame, I get this:

julia> function _DataFrame(x)
           cols, names = create_columns_from_iterabletable(x, na_representation=:missing)
           return DataFrames.DataFrame(cols, names)
       end
_DataFrame (generic function with 1 method)

julia> @benchmark _DataFrame(df |> @map(_))
BenchmarkTools.Trial:
  memory estimate:  228.89 MiB
  allocs estimate:  124
  --------------
  minimum time:     173.888 ms (32.76% GC)
  median time:      194.136 ms (40.33% GC)
  mean time:        192.946 ms (39.90% GC)
  maximum time:     198.544 ms (41.18% GC)
  --------------
  samples:          26
  evals/sample:     1

Right now the code in IterableTables.jl only kicks in for DataFrame versions that predate the Tables.jl story (so pretty much never these days).

When I run the same tests with IndexedTables.jl (which has its own materialization code), I get results about as good as what I get from IterableTables.jl.

So I think right now the code in Tables.jl is a pretty drastic regression in performance for this scenario. That is a real issue because this is of course one of the main/central scenarios for Query.jl and the rest of Queryverse.jl.

I guess there are really two ways out: 1) @quinnj manages to get the performance of Tables.jl for this scenario on par with what we had previously or 2) we go back to using the code from the Queryverse side of things for materializing TableTraits.jl sources into DataFrame.
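
For readers who want a concrete picture of what "materializing" means here: the slow run above reports 60000081 allocations for 10,000,000 rows, roughly six per row, while the IterableTables.jl path reports 124 in total. The sketch below shows one way to turn an iterator of named-tuple rows into preallocated, concretely typed column vectors using only Base. It is an illustration under stated assumptions, not the code in Tables.jl or IterableTables.jl: the helper names `materialize_columns` and `fill_columns!` are hypothetical, and the row type and row count are assumed to be known up front.

```julia
# Hypothetical sketch using only Base; NOT the Tables.jl or IterableTables.jl code.

# Function barrier: `cols` arrives with a concrete tuple type, so this loop can be
# compiled for the specific column element types.
function fill_columns!(cols::Tuple, rows)
    for (i, row) in enumerate(rows)
        # `map` over two tuples pairs each column with the matching field of the row
        # and stores the value at position i.
        map((col, val) -> setindex!(col, val, i), cols, Tuple(row))
    end
    return cols
end

# Assumes the row type (a NamedTuple type) and the number of rows `n` are known.
function materialize_columns(rows, ::Type{NamedTuple{names, types}}, n::Integer) where {names, types}
    # One preallocated vector per column, typed from the row schema.
    cols = map(T -> Vector{T}(undef, n), fieldtypes(types))
    fill_columns!(cols, rows)
    return NamedTuple{names}(cols)
end

# Usage: five rows with three Float64 fields, mirroring the a/b/c layout above.
rows = ((a = Float64(i), b = 2.0 * i, c = -1.0 * i) for i in 1:5)
table = materialize_columns(rows, NamedTuple{(:a, :b, :c), NTuple{3, Float64}}, 5)
```

The result is a named tuple of column vectors, for example `table.a`. Whether the regression reported here comes from a missing step of this kind is not established in this issue; the sketch only shows the general shape of a column-building loop whose allocation count does not grow with the number of rows.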

davidanthoff

All 4 comments

Thanks for the report; I'll take a look.

quinnj on 23 Feb 2019

👍3

Alright, fix is up: https://github.com/JuliaData/Tables.jl/pull/73. With the fix, I now see:

julia> @benchmark DataFrame(m)
BenchmarkTools.Trial:
  memory estimate:  228.88 MiB
  allocs estimate:  45
  --------------
  minimum time:     102.626 ms (16.89% GC)
  median time:      105.275 ms (17.51% GC)
  mean time:        108.390 ms (19.70% GC)
  maximum time:     166.579 ms (44.02% GC)
  --------------
  samples:          47
  evals/sample:     1

with Tables.jl.

quinnj on 26 Feb 2019

Thank you!