Dataframes.jl: remove CategoricalArrays dependency, for performance

Created on 20 Jul 2020 · 12Comments · Source: JuliaData/DataFrames.jl

Because CategoricalArrays gets loaded, due to issues such as https://github.com/JuliaData/CategoricalArrays.jl/issues/177, everything you do with DataFrames can become a lot slower. For example:

With CategoricalArrays:

julia> @time begin
              @time using Plots
              @time using DelimitedFiles
              @time using HTTP
              @time using DataFrames
              @time file = HTTP.get("https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv")
              @time stats = DelimitedFiles.readdlm(IOBuffer(file.body), ',', header=true)
              @time statsdf = DataFrame(stats[1], map(Symbol, vec(stats[2])))
              @time stats2 = groupby(statsdf, :Country)
              @time ukstats=stats2[179]
              @time xy = [ukstats.Confirmed,ukstats.Deaths,ukstats.Recovered]
              @time plt = plot(ukstats.Date, xy, labels=["Confirmed" "Deaths" "Recovered"])
              @time display(plt)
              end;
  4.300588 seconds (9.47 M allocations: 622.370 MiB, 5.86% gc time)
  0.329027 seconds (588.77 k allocations: 33.346 MiB, 6.31% gc time)
  0.000152 seconds (293 allocations: 18.578 KiB)
  0.253159 seconds (745.07 k allocations: 53.301 MiB, 5.43% gc time)
  3.376627 seconds (9.46 M allocations: 539.943 MiB, 9.51% gc time)
  0.529758 seconds (2.02 M allocations: 103.027 MiB, 7.79% gc time)
  0.677973 seconds (2.58 M allocations: 172.079 MiB, 6.55% gc time)
  0.961566 seconds (4.36 M allocations: 245.538 MiB, 9.20% gc time)
  0.069549 seconds (148.21 k allocations: 10.844 MiB)
  0.034701 seconds (98.86 k allocations: 5.750 MiB)
  9.646682 seconds (37.82 M allocations: 2.197 GiB, 9.66% gc time)
  3.674751 seconds (7.06 M allocations: 423.162 MiB, 3.51% gc time)
 23.888708 seconds (74.41 M allocations: 4.358 GiB, 7.71% gc time)

Without CategoricalArrays:

  4.287277 seconds (9.47 M allocations: 622.460 MiB, 6.18% gc time)
  0.303269 seconds (588.77 k allocations: 33.346 MiB)
  0.000122 seconds (293 allocations: 18.578 KiB)
  0.123709 seconds (287.40 k allocations: 20.340 MiB, 14.98% gc time)
  2.913315 seconds (8.00 M allocations: 454.598 MiB, 7.58% gc time)
  0.586269 seconds (2.00 M allocations: 101.454 MiB, 16.35% gc time)
  0.237549 seconds (715.00 k allocations: 43.179 MiB)
  0.786341 seconds (3.92 M allocations: 212.980 MiB, 7.46% gc time)
  0.076198 seconds (121.19 k allocations: 8.256 MiB, 16.47% gc time)
  0.036821 seconds (98.81 k allocations: 5.746 MiB)
  2.549315 seconds (6.63 M allocations: 386.266 MiB, 3.65% gc time)
  3.423970 seconds (6.59 M allocations: 386.271 MiB, 2.62% gc time)
 15.357997 seconds (38.48 M allocations: 2.225 GiB, 5.56% gc time)

breaking decision performance

Source

vtjnash

👍6

Most helpful comment

Actually my hope is that DataAPI.jl can allow to use CategoricalArrays.jl in DataFrames.jl without DataFrames.jl being aware of it most of the time (to be confirmed by @nalimilan)

bkamins on 30 Aug 2020

👍3

All 12 comments

This is the major point why https://github.com/JuliaData/DataFrames.jl/issues/1764 is a priority (after we stabilize the API).

As I guess there is no hope for solving these issues on the Julia Base level, an alternative that has been recently considered is to avoid having to define the functionality described in https://github.com/JuliaData/CategoricalArrays.jl/issues/177 that causes the problems.

Just to comment why CategoricalArrays.jl is a dependency:

in some functionalities like stack/unstack it is needed to keep the information about the order of levels (but potentially we could drop this functionality if this were the only reason to keep it)
in split-apply-combine and joins, if we know we are working with CategoricalArrays we can perform operations faster taking advantage of this information (and then there is a challenge where such functionality should live - now it lives in DataFrames.jl; but possibly we could handle it via DataAPI.jl using traits and functions defined there).

bkamins on 20 Jul 2020

Why can't things just work on a CategoricalArray (like for most other array types)? As far as I see, almost all problems are due to CategoricalValue and the weird convert methods etc defined on them. Does CategoricalValue flow around a lot by themselves or is it a trick to not have to define so many methods on the array type itself?

KristofferC on 20 Jul 2020

👍1

I would wait for @nalimilan to get back on-line to comment on this 😄.

As you 👍 in https://github.com/JuliaLang/julia/pull/32613#issuecomment-661167503 - removing these convert methods is another avenue to go. In the past there was a hope that Julia Base can be changed so that these invalidations would not happen, but it seems this is not likely to happen, so maybe indeed as @quinnj noted there we should opt for explicit unwrapping. This would also solve a problem with missing we currently have in CategoricalArrays.jl, as:

julia> x = categorical(["a", missing])
2-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "a"
 missing

julia> eltype(x)
Union{Missing, CategoricalValue{String,UInt32}}

so in a generic code you anyway have to check if you are not getting missing (which is not a CategoricalValue), which is almost as having to do unwrapping.

However, this change is hugely breaking, and I am sure @nalimilan has spent days thinking about it. In particular in https://github.com/JuliaData/CategoricalArrays.jl/issues/177#issuecomment-661350949 I have posted an idea that maybe would solve the issue without having to be breaking.

bkamins on 20 Jul 2020

I've tried investigating this to see what are the problematic methods, and I couldn't replicate @vtjnash's findings on Julia master from two weeks ago nor on 1.4.2. My machine appears to be much slower than yours, but the gain from removing CategoricalArrays isn't that clear (-15%). At any rate, the code is much faster than on Julia 1.4.2.

# CategoricalArrays and DataFrames master
  9.728152 seconds (9.27 M allocations: 610.886 MiB, 3.02% gc time)
  0.394274 seconds (475.04 k allocations: 26.099 MiB, 8.22% gc time)
  1.609266 seconds (2.29 M allocations: 129.476 MiB, 3.46% gc time)
  0.589606 seconds (666.57 k allocations: 49.767 MiB, 2.21% gc time)
  6.782716 seconds (9.69 M allocations: 533.719 MiB, 6.61% gc time)
  0.931219 seconds (2.13 M allocations: 105.685 MiB, 2.45% gc time)
  1.443975 seconds (2.85 M allocations: 165.589 MiB, 5.38% gc time)
  1.770543 seconds (4.19 M allocations: 226.813 MiB, 4.96% gc time)
  0.090733 seconds (124.64 k allocations: 7.625 MiB, 19.48% gc time)
  0.097976 seconds (127.60 k allocations: 6.877 MiB)
 14.099407 seconds (33.44 M allocations: 1.855 GiB, 7.03% gc time)
  6.791970 seconds (7.38 M allocations: 416.923 MiB, 2.21% gc time)
 44.373523 seconds (72.67 M allocations: 4.083 GiB, 4.94% gc time)

# Without CategoricalArrays in DataFrames
 10.159092 seconds (9.27 M allocations: 610.900 MiB, 3.10% gc time)
  0.432334 seconds (475.04 k allocations: 26.099 MiB, 7.72% gc time)
  1.721721 seconds (2.29 M allocations: 129.476 MiB, 2.43% gc time)
  0.343618 seconds (263.00 k allocations: 17.314 MiB)
  5.404452 seconds (8.36 M allocations: 459.254 MiB, 3.53% gc time)
  0.968662 seconds (2.12 M allocations: 104.740 MiB, 5.64% gc time)
  0.768228 seconds (1.12 M allocations: 64.918 MiB, 14.33% gc time)
  2.336761 seconds (4.70 M allocations: 252.763 MiB, 5.80% gc time)
  0.070152 seconds (113.14 k allocations: 6.773 MiB)
  0.091751 seconds (127.19 k allocations: 6.855 MiB)
  4.992754 seconds (6.33 M allocations: 354.590 MiB, 2.32% gc time)
 10.326137 seconds (18.00 M allocations: 1.003 GiB, 3.59% gc time)
 37.656776 seconds (53.21 M allocations: 2.992 GiB, 3.63% gc time)

@vtjnash What did you remove exactly in DataFrames to get rid of CategoricalArrays? What Julia version did you use? FWIW, I've pushed my code to the nl/removeCategoricalArrays branch.

nalimilan on 28 Jul 2020

@vtjnash - you can index GroupedDataFrame by key, so stats2[("United Kingdom",)] would work :) if you have not known the group number for UK.

My tests are on Ubuntu 20.04 LTS, Julia Version 1.6.0-DEV.538:

With CategoricalArrays.jl:

  4.080434 seconds (8.70 M allocations: 589.713 MiB, 6.73% gc time)
  0.419881 seconds (766.76 k allocations: 43.370 MiB, 6.79% gc time)
  1.061400 seconds (2.61 M allocations: 152.809 MiB, 3.17% gc time)
  0.173148 seconds (463.57 k allocations: 35.841 MiB)
  4.309586 seconds (9.51 M allocations: 539.608 MiB, 8.82% gc time)
  0.574789 seconds (2.00 M allocations: 100.478 MiB, 3.99% gc time)
  0.752547 seconds (2.64 M allocations: 173.395 MiB, 8.42% gc time)
  1.026466 seconds (4.36 M allocations: 243.742 MiB, 6.36% gc time)
  0.093290 seconds (147.28 k allocations: 10.714 MiB)
  0.048722 seconds (98.91 k allocations: 5.723 MiB)
 11.226519 seconds (37.76 M allocations: 2.180 GiB, 8.15% gc time)
  4.792697 seconds (7.06 M allocations: 421.353 MiB, 2.51% gc time)
 28.579304 seconds (76.17 M allocations: 4.444 GiB, 6.66% gc time)

and without CategoricalArrays.jl (used my own way to remove the dependency)

  4.094485 seconds (8.70 M allocations: 589.409 MiB, 6.45% gc time)
  0.424463 seconds (766.76 k allocations: 43.370 MiB, 7.45% gc time)
  1.086510 seconds (2.61 M allocations: 152.808 MiB, 4.20% gc time)
  0.087799 seconds (151.24 k allocations: 11.263 MiB, 12.46% gc time)
  3.984401 seconds (8.12 M allocations: 458.792 MiB, 5.99% gc time)
  0.567963 seconds (1.98 M allocations: 98.842 MiB, 2.68% gc time)
  0.326275 seconds (844.22 k allocations: 50.122 MiB, 11.99% gc time)
  0.837568 seconds (3.85 M allocations: 207.346 MiB, 5.87% gc time)
  0.077490 seconds (124.31 k allocations: 8.547 MiB)
  0.076980 seconds (98.86 k allocations: 5.719 MiB, 28.92% gc time)
  3.014615 seconds (8.05 M allocations: 463.429 MiB, 4.83% gc time)
  4.102979 seconds (6.30 M allocations: 368.396 MiB, 2.09% gc time)
 18.702140 seconds (41.65 M allocations: 2.403 GiB, 5.07% gc time)

bkamins on 28 Jul 2020

I think you meant to @KristofferC there. And I was only looking at the absolute time; since I can make the overall relative percentage arbitrarily small it's not a particularly convenient metric.

vtjnash on 28 Jul 2020

😕2

Here are a few timings from various tests I've just made (on Antarctic):

34s with DataFrames and CategoricalArrays master
27s without problematic similar, ==, isequal, <, isless and convert
23s without problematic similar, ==, isequal, <, isless and convert and deprecations
28s restricting types that CategoricalArrays supports to Union{AbstractString, AbstractChar, Number} and without deprecations
27s restricting types that CategoricalArrays supports to Union{AbstractString, AbstractChar, Number} and without problematic similar and deprecations
23s without the CategoricalArrays dependency

So it looks like we can go a long way by just dropping deprecated methods and restricting supported types to a fixed list. This restriction avoids defining methods for Any, which triggers lots of invalidations.

nalimilan on 13 Aug 2020

I suggest that we drop CategoricalArrays support from DataFrames for the 1.0 release.
Bringing it back is non-breaking,
but removing it would be breaking.

Removing it now leaves us open to being able to bring it, or something like it but with a slightly different (breakingly different) API back later.

oxinabox on 30 Aug 2020

Actually my hope is that DataAPI.jl can allow to use CategoricalArrays.jl in DataFrames.jl without DataFrames.jl being aware of it most of the time (to be confirmed by @nalimilan)

bkamins on 30 Aug 2020

👍3

could you say what would change for users?

using DataFrames, CategoricalArrays
when stacking columns, :variable is not ordered anymore
?

EDIT: You were once considering factoring out DataFramesBase.jl. Wouldn't that help as well?

greimel on 30 Aug 2020

:variable is not ordered anymore

by default it will be ordered alphabetically. It will be opt-in to use CategoricalVector to keep order of appearance.

and

categorical and categorical! function will not be defined.

Other than that we hope that in user-facing code things will not change (there will be some internal changes, but not user visible).

bkamins on 30 Aug 2020

👍1

Given discussions with @nalimilan it seems to be possible to:

deprecate all dependencies on CategoricalArrays.jl in 0.22 release (and I would stop exporting it)
we will keep dependency on PooledArrays.jl for reshaping, but I guess it is not a big problem as it is a lightweight package

bkamins on 30 Aug 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings