DataFrames.jl: Excessive allocations for simple loop over rows

Created on 26 May 2019 · 1 Comment · Source: JuliaData/DataFrames.jl

It seems like there should be a faster way to iterate over the rows of a DataFrame:

julia> d = [(a=rand(),b=rand()) for _ in 1:10^6];

julia> df = DataFrame(d);

julia> function f(xs)
           s = 0.0
           for x in xs
               s += x.a * x.b
           end
           s
       end
f (generic function with 1 method)

julia> function g(xs)
           s = 0.0
           for x in eachrow(xs)
               s += x.a * x.b
           end
           s
       end
g (generic function with 1 method)

julia> @btime f($d)
  577.269 μs (0 allocations: 0 bytes)
249855.20496448214

julia> @btime g($df)
  105.782 ms (6998979 allocations: 122.05 MiB)
249855.20496448386

julia> versioninfo()
Julia Version 1.3.0-DEV.0
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.0 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_CMDSTAN_HOME = /tmp/cmdstan-2.19.1

Most helpful comment

A DataFrame is not type stable (the column types are not part of its type), so, unfortunately, this is what you will experience when looping over eachrow.

If you use a function barrier, you can speed it up and avoid the allocations:

julia> g(xs) = _g(Tables.rows(xs))
g (generic function with 1 method)

julia> function _g(xsr)
           s = 0.0
           for x in xsr
               s += x.a * x.b
           end
           s
       end
_g (generic function with 1 method)

julia> @btime g($df)
  1.089 ms (16 allocations: 688 bytes)
249854.8799023578

Alternatively, you could extract the vectors you want to work with and pass them to an inner function behind a function barrier; this should also be fast.
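
A minimal sketch of that second approach, for illustration only. The names sum_products and h are hypothetical, and the code assumes the columns are called a and b as in the issue. The wrapper only pulls the column vectors out; the loop runs in the inner function, which Julia compiles for the concrete vector types it receives.

```julia
# Inner kernel: receives concretely typed vectors, so the loop is type stable.
# `sum_products` is a hypothetical name used only for this sketch.
function sum_products(a::AbstractVector, b::AbstractVector)
    s = 0.0
    for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end

# Outer wrapper: the property lookups are not inferable for a DataFrame,
# but they happen once per call instead of once per row.
h(xs) = sum_products(xs.a, xs.b)
```

With the df from the issue this would be called as h(df). The wrapper relies only on property access, so it should also accept other containers that expose a and b as vectors, such as a NamedTuple of vectors.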
