Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
OS: Linux (i686-pc-linux-gnu)
CPU: Intel(R) Core(TM) i9-9980H CPU @ 2.30GHz
WORD_SIZE: 32
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
I've run into an issue when joining two large DataFrames. The 32-bit CI runners for Julia 1.0, 1.1, and 1.2 all fail, while the 64-bit versions work perfectly fine.
Stacktrace from the CI:
BoundsError: attempt to access 1365559-element Array{Int32,1} at index [0]
Stacktrace:
[1] getindex at ./array.jl:731 [inlined]
[2] group_rows(::DataFrame, ::Bool, ::Bool, ::Bool) at /mnt/builds/DcGs9yxw/0/{REDACTED}/depot/packages/DataFrames/0Em9Q/src/dataframerow/utils.jl:255
[3] group_rows at /mnt/builds/DcGs9yxw/0/{REDACTED}/depot/packages/DataFrames/0Em9Q/src/dataframerow/utils.jl:248 [inlined]
[4] #join#237(::Array{Symbol,1}, ::Symbol, ::Bool, ::Nothing, ::Tuple{Bool,Bool}, ::Function, ::DataFrame, ::DataFrame) at /mnt/builds/DcGs9yxw/0/{REDACTED}/depot/packages/DataFrames/0Em9Q/src/abstractdataframe/join.jl:344
[5] (::getfield(Base, Symbol("#kw##join")))(::NamedTuple{(:on, :makeunique),Tuple{Array{Symbol,1},Bool}}, ::typeof(join), ::DataFrame, ::DataFrame) at ./none:0
...
I've been playing around in VirtualBox with an Ubuntu 32-bit instance. Below is an example of how I am using DataFrames. Note this example on my VirtualBox instance causes a SIGABRT.
using DataFrames
using DataFramesMeta
df_1 = DataFrame(A=1:2000000)
df_1 = @transform(df_1, time=first.(:A))
df_2 = DataFrame(A=1:2000000)
df_2 = @transform(df_2, time=first.(:A))
join(df_1, df_2, on=[:time, :A], makeunique=true)
After spending some time looking at group_rows, I was able to create another example which forces the above stacktrace. It will always crash after g_ix=57979.
using DataFrames
df = DataFrame(A=1:2000000)
groups = Vector{Int}(undef, nrow(df))
ngroups, rhashes, gslots, sorted = DataFrames.row_group_slots(ntuple(i -> df[i], ncol(df)), Val(true), groups, false)
stops = zeros(Int, ngroups)
for g_ix in groups
    stops[g_ix] += 1
end
I decided to dig into this a little by looking at the row_group_slots function: https://github.com/JuliaData/DataFrames.jl/blob/b0d8a87dd8edfadfb458a2121eda78210ec13e0f/src/dataframerow/utils.jl#L102
It looks like the rhashes vector is getting 450 collisions here, and thus 450 elements in the groups array are being set to 0. Since Julia arrays are 1-indexed, stops[g_ix] += 1 throws a BoundsError as soon as it is run with g_ix == 0.
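The failure mode described here can be reproduced without DataFrames. This is only a minimal sketch: `count_groups` is a hypothetical helper that mirrors the counting loop from the example above, and the input vectors are made up for illustration.

```julia
# Hypothetical helper mirroring the `stops[g_ix] += 1` loop from the example above.
function count_groups(groups::AbstractVector{<:Integer}, ngroups::Integer)
    stops = zeros(Int, ngroups)
    for g_ix in groups
        stops[g_ix] += 1   # throws BoundsError when g_ix == 0
    end
    return stops
end

# Every row has a valid group index, so counting succeeds.
count_groups([1, 2, 2], 2)

# A row left with group index 0 triggers the same kind of error as in the stacktrace.
try
    count_groups([1, 0, 2], 2)
catch err
    err isa BoundsError || rethrow()
    println("BoundsError at index ", err.i)
end
```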
@nalimilan - you probably have most experience with this part of code base (if you are not available please let me know and I will have a look at this issue).
I spent some more time looking into this just now. I took this code from hash_rows and hashrows_cols!.
using DataFrames
df = DataFrame(A=1:2000000)
tup = ntuple(i -> df[i], ncol(df))
rhashes = zeros(UInt, length(tup[1]))
for (i, col) in enumerate(tup)
    @inbounds for j in eachindex(rhashes)
        el = col[j]
        rhashes[j] = hash(el, rhashes[j])
    end
end
nrow(df) - length(unique(rhashes)) # 450 collisions
The root cause of this is most likely in here: https://github.com/JuliaLang/julia/blob/master/base/hashing2.jl#L30
EDIT: In my example the first instance of this issue can be replicated with:
hash(40237, 0x00000000)
hash(57970, 0x00000000)
These both evaluate to 0x38b05917
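To locate colliding inputs like the pair above, something along these lines could be used. It is a sketch only: `first_collision` is a hypothetical helper, not part of Base or DataFrames, and the hash function is passed in as a keyword so that the search can be tried with any hash. Whether `hash` actually collides on a given range depends on the platform and Julia version, so no result is shown.

```julia
# Hypothetical helper: return the first pair of distinct values in `xs` whose
# hashes under `h` are equal, or `nothing` if there is no such pair.
function first_collision(xs; h = x -> hash(x, zero(UInt)))
    seen = Dict{UInt,eltype(xs)}()   # hash value => first value seen with that hash
    for x in xs
        hx = UInt(h(x))
        if haskey(seen, hx) && !isequal(seen[hx], x)
            return (seen[hx], x)
        end
        seen[hx] = x
    end
    return nothing
end

# Search the same range as the example above, using the default `hash`.
first_collision(1:2_000_000)
```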
Interesting. Yes, hash collisions are expected to happen, and the code is supposed to be able to handle them. Apparently, I broke that by moving this break to the wrong place at https://github.com/JuliaData/DataTables.jl/pull/79:
https://github.com/JuliaData/DataFrames.jl/blob/b0d8a87dd8edfadfb458a2121eda78210ec13e0f/src/dataframerow/utils.jl#L135
Can you check whether https://github.com/JuliaData/DataFrames.jl/pull/1979 fixes it? If so, we should try to add tests for that (hopefully that won't use too much memory for Travis/AppVeyor)
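For readers following along, here is a rough sketch of what "handling collisions" means for this kind of row grouping. It is not the DataFrames implementation of row_group_slots; `group_indices` and its `h` keyword are hypothetical names, and the table uses plain linear probing. The point it illustrates is that the probe loop should only exit once the row has actually been assigned a group, either by finding an equal value or by claiming an empty slot.

```julia
# Hypothetical sketch: assign a group index to every element of `xs` using an
# open-addressing hash table with linear probing. `h` is the hash function and
# is a keyword argument so that collisions can be forced for testing.
function group_indices(xs::AbstractVector; h = x -> hash(x, zero(UInt)))
    n = length(xs)
    sz = nextpow(2, max(2n, 2))   # power-of-two size, so free slots always remain
    mask = UInt(sz - 1)
    slots = zeros(Int, sz)        # 0 = empty, otherwise the row that claimed the slot
    groups = zeros(Int, n)        # group index for each row
    ngroups = 0
    for i in 1:n
        slot = Int(UInt(h(xs[i])) & mask) + 1
        while true
            j = slots[slot]
            if j == 0
                # Empty slot: row i starts a new group.
                ngroups += 1
                slots[slot] = i
                groups[i] = ngroups
                break
            elseif isequal(xs[j], xs[i])
                # Same value already seen: reuse its group.
                groups[i] = groups[j]
                break
            end
            # Slot taken by a different value (a collision): keep probing.
            slot = slot == sz ? 1 : slot + 1
        end
    end
    return groups, ngroups
end

# Force every value into the same initial slot to exercise the collision path.
groups, ngroups = group_indices([10, 20, 10, 30, 20]; h = x -> UInt(0))
```

With this structure no entry of `groups` can remain 0, even when every hash collides, because each iteration of the outer loop ends only after an assignment.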
Just tested #1979; this resolves the issue!