As mentioned on the mailing list, it might be nice to have a rand(::Set)
function. Since this is hard to implement efficiently without access to the Set
internals, it is something that almost has to be in Base. Here is a sample O(1) implementation:
import Base: GLOBAL_RNG, isslotfilled, rand

function rand(r::AbstractRNG, s::Set)
    isempty(s) && throw(ArgumentError("set must be non-empty"))
    n = length(s.dict.slots)
    while true
        i = rand(r, 1:n)
        isslotfilled(s.dict, i) && return s.dict.keys[i]
    end
end
rand(s::Set) = rand(GLOBAL_RNG, s)
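For reference, rand on sets and dicts did eventually land in the Random stdlib (Julia 1.1, if I recall correctly), so on recent versions the intended behavior can be checked directly, without reaching into the internals above:

```julia
using Random

s = Set([2, 4, 6, 8])
x = rand(s)                    # uniformly random element of s
@assert x in s

rng = MersenneTwister(1234)    # seeded RNG for reproducibility
y = rand(rng, s)
@assert y in s
```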
If someone wants to put together a test case and documentation, it could make a good intro PR.
Probably also add a rand(::Set, dims) method (etcetera) to supplement rand(::Array, dims), which could be accomplished just by converting rand arguments from AbstractArray to Union{AbstractArray,AbstractSet} in a few places.
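A hedged sketch of what such a dims method could look like, built on scalar rand(::Set) from recent Julia versions (rand_dims is an illustrative name, not a proposal for the actual signature):

```julia
using Random

# Hypothetical dims method: draw prod(dims) independent samples and
# reshape them into an array of the requested shape.
rand_dims(s::Set, dims::Integer...) =
    reshape([rand(s) for _ in 1:prod(dims)], dims...)

s = Set([1, 2, 3])
A = rand_dims(s, 2, 3)
@assert size(A) == (2, 3)
@assert all(x -> x in s, A)
```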
Note that my implementation above is efficient for the usual case that the set's internal hash table is at most a few times larger than the size of the set. You can defeat this by deleting the contents of the set and then re-inserting a much smaller number of elements. For example, my code will be extremely inefficient for:
s = Set(1:10^7)
empty!(s)
push!(s, 1)
length(s.dict.slots) / length(s) # gives 1.7e7!

With this, @time rand(s) takes almost 1 second for a single rand(s) call.
I suppose you could test whether length(s.dict.slots) > somecoefficient * length(s)^2 and use an O(length(s)) algorithm in that case, where somecoefficient is determined by benchmarking the two algorithms and finding (roughly) the crossover point. However, even iteration over the set is very inefficient in this case [it's O(length(s.dict.slots)), not O(length(s)), I think], so there may be no good algorithm, and it may not be worth worrying about.
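For concreteness, the O(length(s)) fallback can be sketched using only the public iteration interface (rand_nth is a hypothetical name; note that iterating still visits every slot under the hood, which is exactly the inefficiency described above):

```julia
using Random

# Pick a random index and walk the set's iterator to it.
# No Dict internals required, unlike rejection sampling over slots.
function rand_nth(s::Set)
    isempty(s) && throw(ArgumentError("set must be non-empty"))
    i = rand(1:length(s))
    for (k, x) in enumerate(s)
        k == i && return x
    end
end

s = Set(1:100)
@assert rand_nth(s) in s
```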
I'd like to take a look at this, if you don't mind.
@PythonNut, that would be great.
@stevengj wrote: "You can defeat this by deleting the contents of the set and then re-inserting a much smaller number of elements."
I'm new to the Julia scene, but this strikes me as something that has performance implications outside this issue. You mention iteration, and I wouldn't be surprised if other operations would also be impeded by large numbers of empty bins.
One possible solution is to rehash! (but in the down direction, which isn't currently implemented, AFAICT) when length(s.dict.slots)/length(s) gets too large. This also saves space, which is nice. Performance obviously takes a hit if you're constantly varying the size of the hash table by large factors, but that sounds somewhat uncommon (idk, is it really?).
Does this sound like a better path forward?
I think it would be good to fix Base.rehash! so that it can shrink the hash table if needed. I don't know about automating this; the need for it seems pretty rare. Typical uses of a Set seem to grow it but almost never shrink it (except to empty and refill to about the same size).
(We should also check what e.g. Python does in this case.)
Anyway, that should be a separate issue and a separate PR.
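In the meantime, one workaround sketch is simply rebuilding the set, assuming the Set constructor sizes the fresh table for the current contents rather than copying the oversized one:

```julia
s = Set(1:10^5)
empty!(s)          # the oversized internal table is kept
push!(s, 1)

compact = Set(s)   # rebuild: fresh set with a table sized for one element
@assert length(compact) == 1
@assert 1 in compact
```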
@stevengj that sounds quite reasonable.
I've run some profiling, the results of which are visualized here:
- O(1) is the algorithm you supplied
- O(n) is rand(collect(s))
As you can see, the crossover points vary with the size of the set (in addition to the number of bins), and also vary by machine. In addition, the O(n) solution only beats your O(1) solution when the proportion of unfilled bins is very large (>99.9% in all tests). Do you still think trying to nail the threshold down with more testing is the right way to go?
@PythonNut, nice job. In principle, one could do a bit better than rand(collect(s)) with i = rand(1:length(s)); state = start(s); for k = 1:i-1; _, state = next(s, state); end; return next(s, state)[1], since this avoids allocating a temporary array. But I doubt it will change the crossover point by orders of magnitude. With such a high crossover point, I would only bother to implement the O(1) algorithm.
Note also that rather than tic/toq, you can benchmark an expression e with t = @elapsed e.
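For instance, a minimal sketch (exact timings will of course vary by machine):

```julia
using Random

s = Set(1:10_000)
t = @elapsed rand(s)      # seconds for a single call
@assert t >= 0
println("rand(s) took $t seconds")
```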
@stevengj yes, looks like the effect isn't that significant.
(Benchmark plots: allocation-avoiding technique vs. original technique.)

I understand that your technique has the additional advantage of short-circuiting the iteration once the value is found, so it performs better on average. However, they have almost identical worst-case times, which I can't explain, since we should be saving a bit of allocation. ╮(╯_╰)╭
"Note also that rather than tic/toq, you can benchmark an expression e with t = @elapsed e."
Thanks. I was looking for something like that, but couldn't find it for some reason.
I guess I'll transition to copying your implementation over and writing tests. It might be a while, since I'm new to the julia codebase.
Your thoroughness is inspiring.
@stevengj would you also like a rand(::Dict) method?

rand(::Dict) would return a key => value pair, I guess? That might make sense, since then rand(s::Set) could just call rand(s.dict)[1], so you get two methods for the price of one.
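A rough sketch of that shape, using linear iteration purely to illustrate the API (rand_dict_pair and rand_from_set are hypothetical names; a real implementation would use the O(1) slot-sampling trick above):

```julia
using Random

# Return a uniformly random key => value Pair from a non-empty Dict.
function rand_dict_pair(d::Dict)
    isempty(d) && throw(ArgumentError("dict must be non-empty"))
    i = rand(1:length(d))
    for (k, kv) in enumerate(d)   # iterating a Dict yields key => value Pairs
        k == i && return kv
    end
end

# A Set wraps a Dict internally, so the set method falls out for free:
rand_from_set(s::Set) = rand_dict_pair(s.dict).first

d = Dict("a" => 1, "b" => 2)
kv = rand_dict_pair(d)
@assert kv isa Pair
@assert d[kv.first] == kv.second
@assert rand_from_set(Set([10, 20])) in (10, 20)
```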
@stevengj Yup, that's what I was thinking.
Sorry, I'm having some trouble building my development Julia since I don't have much disk space. :/ Don't worry, I'll think of something.
Sorry guys, it looks like I'll need a much bigger VM to build Julia, and I haven't got around to making it yet—my job is making it hard.
Hopefully, I get a spare moment to build it and start work on the tests for this.