Julia: make Dict ordered?

Created on 5 Jan 2020  ·  21Comments  ·  Source: JuliaLang/julia

@bkamins pointed out on slack:

julia> using Serialization;
julia> d = Dict{Symbol, Vector{Int}}(Symbol.('a':'z') .=> Ref([1]));
julia> serialize("test.bin", d);
julia> d2 = deserialize("test.bin");
julia> hcat(collect.(keys.([d, d2]))...)
26×2 Array{Symbol,2}:
 :o  :j
 :b  :x
 :p  :d
 :n  :k
 :j  :g
 :e  :u
 :c  :r
 :h  :a
 :l  :m
 :w  :y
 :x  :i
 :d  :o
 :k  :b
 :s  :p
 :v  :n
 :g  :e
 :u  :c
 :q  :h
 :r  :l
 :z  :w
 :a  :s
 :f  :v
 :m  :z
 :y  :q
 :i  :f
 :t  :t

But it doesn't have to be this way.

If we redefine things so it remembers how many slots it should have,
then it comes out the same as it came in.

hintsize(dict::AbstractDict) = length(dict)
hintsize(dict::Dict) = length(dict.keys)

function Serialization.deserialize_dict(s::AbstractSerializer, T::Type{<:AbstractDict})
    n = read(s.io, Int32)
    sz = read(s.io, Int32)
    t = T();
    sizehint!(t, sz)
    Serialization.deserialize_cycle(s, t)
    for i = 1:n
        k = deserialize(s)
        v = deserialize(s)
        t[k] = v
    end
    return t
end

function Serialization.serialize_dict_data(s::AbstractSerializer, d::AbstractDict)
    write(s.io, Int32(length(d)))
    write(s.io, Int32(length(d.slots)))
    for (k,v) in d
        serialize(s, k)
        serialize(s, v)
    end
end

But this is annoying because it changes the serialization format.
I would rather change sizehint!(::Dict) or how we call it.
The problem is that sizehint!(Dict(), 26) gives it 32 slots,
but the d had 64 slots.


In python this was one of thing things that really caught me out.
Because python salts its hashes wiith a random salt selected each time it starts.
But julia doesn't.

collections decision

Most helpful comment

Or we could just make Dicts ordered.

All 21 comments

A vaguely justifyable (from what we normally do when rehashing)
is

function Serialization.deserialize_dict(s::AbstractSerializer, T::Type{<:AbstractDict})
    n = read(s.io, Int32)
    t = T();
    sizehint!(t, n > 64000 ? n : n*2)
    Serialization.deserialize_cycle(s, t)
    for i = 1:n
        k = deserialize(s)
        v = deserialize(s)
        t[k] = v
    end
    return t
end

But I don't know that this will always be right -- it depends on the dictionaries history of adds and deletes I guess as how many times it has been resized.

Actually, I guess add and delete paterns might also screw up things even in the solution I posted in the OP.
But maybe we can solve those patterns by rehash!ing before serialization?

My thinking was that it should be rather an update of serialization of Dict, if we decided to change this (other AbstractDicts might have different fields), is just to serialize and deserialize its fields verbatim. This is slightly less efficient but guarantees reproducibility.

What's the reason why one would care about this? Some caching optimization (tokens) that gets invalidated?

For my purposes is mostly just counter intuitive.
Its fine for the keys to be in arbitary order.
But i expect them to stay the same across operations like serialization.

Silly example (but not so different from something that actually messed me up in python).

results::Dict{String, Float64}

function convert_to_percent!(res::Dict{String, Float64},)
    for model in keys(res)
        res[model]*=100
    end
    return res
end

function znormalize!(res::Dict{String, Float64},)
    mean_ = mean(values(res))
    std_ = std(values(res))
    for model in keys(res)
        res[model] = (res[model]-mean_)/std_
    end
    return res
end

percent, zscore = pmap([convert_to_percent!, znormalize]) do f
    f(results)
end

df = DataFrame()
df.models = keys(results)
df.scores = values(results)
df.percent = values(percent)
df.zscore = values(zscore)

The reason to have it is reproducibility. A I have discussed with @pszufe a user can reasonably expect that in:

julia> using Distributed

julia> addprocs(1)
1-element Array{Int64,1}:
 2

julia> @everywhere using Distributed

julia> @everywhere using Random

julia> @everywhere Random.seed!(1234)

julia> d = Dict(Symbol.('a':'z') .=> 1:26);

julia> pmap(x -> myid() => rand(x), [d])
1-element Array{Pair{Int64,Pair{Symbol,Int64}},1}:
 2 => (:c => 3)

julia> pmap(x -> myid() => rand(x), [d], distributed=false)
1-element Array{Pair{Int64,Pair{Symbol,Int64}},1}:
 1 => (:u => 21)

you should be able to get the same results in both pmap calls. Otherwise the result depends on where you execute the task to be performed (which preferably should not be the case).

It would even more common if someone uses https://github.com/ChrisRackauckas/ParallelDataTransfer.jl.

Oh wow. Why don't we serialize and deserialize Dict as a simple struct containing arrays? It would seem to be the simplest solution. (But then again, I just realized that I don't know much about serialization in Julia... are there any problems with this?)

If we redefine things so it remembers how many slots it should have, then it comes out the same as it came in.

@oxinabox unfortunately with insertions and deletions it means that even with a given number of slots there is some freedom left in the ordering, so that won't quite work. :(

We could make this backwards-compatible for reading using a trick like this: save the negative of the length, then the sizehint size. If the first number is negative we know we have the new format.

Or we could just make Dicts ordered.

Or we could just make Dicts ordered.

We'd better follow JavaScript semantics then. You know, for consistency. :trollface:

The iteration order for objects follows a certain set of rules since ES2015, but it does not (always) follow the insertion order. Simply put, the iteration order is a combination of the insertion order for strings keys, and ascending order for number-like keys:

// key order: 1, foo, bar
const obj = { "foo": "foo", "1": "1", "bar": "bar" }

But it's an interesting point, the default Base dictionary could be ordered, it could be useful for supporting sorting of dictionary keys or values out-of-the box, etc.

Unfortunately we can't just save the fields of the Dict struct verbatim, since some hashes might depend on address --- rehashing has to happen on the other machine anyway. So it seems to me that an ordered dict is strictly required to get this property.

since some hashes might depend on address

That's a good point - thanks.

In that case - does anyone know if there some good benchmarks for OrderedCollections.jl in comparison to Dict/Set? It would be interesting to know the performance characteristics.

We did a bunch of work on it in #10116. Iteration is way faster of course (and the change might be worth it just for that). Some other workloads are a bit faster, and IIRC the main cost is that some deletion-heavy workloads are slower.

Just to add, the implementation in OrderedCollections was derived from #10116 (although the Dict implementation which #10116 was based off of has diverged some since then.)

I'll rename this to reflect the fact that it's equivalent to making Dict ordered, which there is a PR for but no issue AFAICT.

@JeffBezanson Can we just make Dict to be OrderedDict, for Julia 1.6, to be done with it? Or at least add it as an extra, however we do it, either copy it (your, by now modified code) as stdlib, or the full OrderedCollections.jl, which has also an alternative ordered:

It also implements LittleDict which is a ordered dictionary, that is much faster than any other AbstractDict (ordered or not) for small collections.

My longer argument here: https://github.com/JuliaLang/julia/issues/37761#issuecomment-699218475

I realize there are not many days left of the month, and Julia 1.6 is due this month. Can we at least change the implementation and merge to get packageeval, and see, it may actually be faster, even for Julia itself. We could always revert before the end of the month.

I did make a change to Julia (locally, PR was forthcoming, unless people are not interested) for Dict->OrderedDict.

I didn't mean to sound pushy, I just thought, and still think we might want to do a packageeval. I see however immediately that my change made using slower, from 6% to 2x depending on package (checked Revise and OhMyREPL)

I did make a change to Julia (locally, PR was forthcoming, unless people are not interested) for Dict->OrderedDict.

At least I would be interested in seeing a version of Julia using OrderedDict.

I see however immediately that my change made using slower, from 6% to 2x depending on package (checked Revise and OhMyREPL)

I never had any luck making OrderedDict faster for insert/delete heavy workloads.

If ordered hash maps are inherently slower at insertion/deletion then I'd say Julia might want actually both ordered and unordered versions built in. The ordered one being the default dictionary in Base. The unordered one, being better for cache-style workloads, would be used by the compiler (since we care about latency) and if it's in Core.Compiler and of good quality (which it already is) it may as well be exported from Base as well (since users may also have cache-style workloads to optimize). One possible version of this + #37761 that follows this line of reasoning would rename Dict to UnorderedDict globally (including compiler), then add the OrderedDict code to Base but calling it Dict. If packageeval passes on that, well, that would seem very interesting, to me.

I didn't mean to sound pushy, I just thought, and still think we might want to do a packageeval. I see however immediately that my change made using slower, from 6% to 2x depending on package (checked Revise and OhMyREPL)

That sounds more like an error in the methodology. I highly doubt that loading e.g. OhMyREPL would be an incredibly 2x slower just because of this change to dictionaries so it is likely something else going on.

Yes, I suspected something about precompiling missing, but I did try using the other package first, and to see if it was only the first using, and I did actually get 2x. I guess I still shouldn't rule out that theory, with only your package hitting some codepath for dicts.

Let's say the slowdown was limited to 6% for using and dicts, would that be ok with for the ordered guarantee?

Should I push my code as an early ("[WIP]") with the slowdown I get? Instead I was exploring other possibly faster options, and those are currently failing.* I probably should recover the working version, while I remember the changes I made before recompiling again.

I'd say Julia might want actually both ordered and unordered versions built in.

Yes, then it's less important if there is slowdown. I was just trying to to the simplest thing at first, one version, replacing the other, also to see the effect. AND only comparing to the status quo, the current unordered Dict, that's probably outdated:

SwissDict is available, based on Google's:
https://abseil.io/blog/20180927-swisstables

The “flat” Swiss tables should be your default choice.

also Ordered and regular RobinDict, but I wasn't sure about adding more recent code (less tested?).

* https://discourse.julialang.org/t/ordered-dict-in-base-is-littledict-not-thread-safe-and-possibly-the-reason-i-cant-add-it-to-base/47372

Was this page helpful?
0 / 5 - 0 ratings

Related issues

musm picture musm  ·  3Comments

wilburtownsend picture wilburtownsend  ·  3Comments

omus picture omus  ·  3Comments

arshpreetsingh picture arshpreetsingh  ·  3Comments

sbromberger picture sbromberger  ·  3Comments