As mentioned on the mailing list, it might be nice to have a rand(::Set)
function. Since this is hard to implement efficiently without access to the Set
internals, it is something that almost has to be in Base. Here is a sample O(1) implementation:
import Base: GLOBAL_RNG, isslotfilled, rand

function rand(r::AbstractRNG, s::Set)
    isempty(s) && throw(ArgumentError("set must be non-empty"))
    n = length(s.dict.slots)
    while true
        i = rand(r, 1:n)
        isslotfilled(s.dict, i) && return s.dict.keys[i]
    end
end
rand(s::Set) = rand(GLOBAL_RNG, s)
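For reference, rand on sets and dicts did eventually land in the Random stdlib (Julia 1.1, if I recall correctly), so on recent versions the intended behavior can be checked directly, without reaching into the internals above:

```julia
using Random

s = Set([2, 4, 6, 8])
x = rand(s)                    # uniformly random element of s
@assert x in s

rng = MersenneTwister(1234)    # seeded RNG for reproducibility
y = rand(rng, s)
@assert y in s
```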
If someone wants to put together a test case and documentation, it could make a good intro PR.
Probably also add a rand(::Set, dims) method (etcetera) to supplement rand(::Array, dims), which could be accomplished just by converting rand arguments from AbstractArray to Union{AbstractArray,AbstractSet} in a few places.
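A hedged sketch of what such a dims method could look like, built on scalar rand(::Set) from recent Julia versions (rand_dims is an illustrative name, not a proposal for the actual signature):

```julia
using Random

# Hypothetical dims method: draw prod(dims) independent samples and
# reshape them into an array of the requested shape.
rand_dims(s::Set, dims::Integer...) =
    reshape([rand(s) for _ in 1:prod(dims)], dims...)

s = Set([1, 2, 3])
A = rand_dims(s, 2, 3)
@assert size(A) == (2, 3)
@assert all(x -> x in s, A)
```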
Note that my implementation above is efficient for the usual case that the set's internal hash table is at most a few times larger than the size of the set. You can defeat this by deleting the contents of the set and then re-inserting a much smaller number of elements. For example, my code will be extremely inefficient for:
s = Set(1:10^7)
empty!(s)
push!(s, 1)
length(s.dict.slots) / length(s) # gives 1.7e7!

With this, @time rand(s) takes almost 1 second for a single rand(s) call.
I suppose you could test whether length(s.dict.slots) > somecoefficient * length(s)^2 and use an O(length(s)) algorithm in that case, where somecoefficient is determined by benchmarking the two algorithms and finding (roughly) the crossover point. However, even iteration over the set is very inefficient in this case [it's O(length(s.dict.slots)), not O(length(s)), I think], so there may be no good algorithm, and it may not be worth worrying about.
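For concreteness, the O(length(s)) fallback can be sketched using only the public iteration interface (rand_nth is a hypothetical name; note that iterating still visits every slot under the hood, which is exactly the inefficiency described above):

```julia
using Random

# Pick a random index and walk the set's iterator to it.
# No Dict internals required, unlike rejection sampling over slots.
function rand_nth(s::Set)
    isempty(s) && throw(ArgumentError("set must be non-empty"))
    i = rand(1:length(s))
    for (k, x) in enumerate(s)
        k == i && return x
    end
end

s = Set(1:100)
@assert rand_nth(s) in s
```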
I'd like to take a look at this, if you don't mind.
@PythonNut, that would be great.
@stevengj wrote: "You can defeat this by deleting the contents of the set and then re-inserting a much smaller number of elements."
I'm new to the Julia scene, but this strikes me as something that has performance implications outside this issue. You mention iteration, and I wouldn't be surprised if other operations would also be impeded by large numbers of empty bins.
One possible solution is to rehash! (but in the down direction, which isn't currently implemented, AFAICT) when length(s.dict.slots)/length(s) gets too large. This also saves space, which is nice. Performance obviously takes a hit if you're constantly varying the size of the hash table by large factors, but that sounds somewhat uncommon (idk, is it really?).
Does this sound like a better path forward?
I think it would be good to fix Base.rehash! so that it can shrink the hash table if needed. I don't know about automating this; the need for it seems pretty rare. Typical uses of a Set seem to grow it but almost never shrink it (except to empty and refill to about the same size).
(We should also check what e.g. Python does in this case.)
Anyway, that should be a separate issue and a separate PR.
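In the meantime, one workaround sketch is simply rebuilding the set, assuming the Set constructor sizes the fresh table for the current contents rather than copying the oversized one:

```julia
s = Set(1:10^5)
empty!(s)          # the oversized internal table is kept
push!(s, 1)

compact = Set(s)   # rebuild: fresh set with a table sized for one element
@assert length(compact) == 1
@assert 1 in compact
```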
@stevengj that sounds quite reasonable.
I've run some profiling, the results of which are visualized here:
- O(1) is the algorithm you supplied
- O(n) is rand(collect(s))
As you can see, the crossover points vary with the size of the set (in addition to the number of bins), and also vary by machine. In addition, the O(n) solution only beats your O(1) solution when the proportion of unfilled bins is very large (>99.9% in all tests). Do you still think trying to nail the threshold down with more testing is the right way to go?
@PythonNut, nice job. In principle, one could do a bit better than rand(collect(s)) with i = rand(1:length(s)); state = start(s); for k = 1:i-1; _, state = next(s, state); end; return next(s, state)[1], since this avoids allocating a temporary array. But I doubt it will change the crossover point by orders of magnitude. With such a high crossover point, I would only bother to implement the O(1) algorithm.
Note also that rather than tic/toq, you can benchmark an expression e with t = @elapsed e.
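For instance, a minimal sketch (exact timings will of course vary by machine):

```julia
using Random

s = Set(1:10_000)
t = @elapsed rand(s)      # seconds for a single call
@assert t >= 0
println("rand(s) took $t seconds")
```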
@stevengj yes, looks like the effect isn't that significant.
(Benchmark plots: allocation-avoiding technique vs. original technique.)

I understand that your technique has the additional advantage of short-circuiting the iteration once the value is found, so it performs better on average. However, they have almost identical worst-case times, which I can't explain, since we should be saving a bit of allocation. ╮(╯_╰)╭
"Note also that rather than tic/toq, you can benchmark an expression e with t = @elapsed e."
Thanks. I was looking for something like that, but couldn't find it for some reason.
I guess I'll transition to copying your implementation over and writing tests. It might be a while, since I'm new to the julia codebase.
Your thoroughness is inspiring.
@stevengj would you also like a rand(::Dict) method?

rand(::Dict) would return a key => value pair, I guess? That might make sense, since then rand(s::Set) could just call rand(s.dict)[1], so you get two methods for the price of one.
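A rough sketch of that shape, using linear iteration purely to illustrate the API (rand_dict_pair and rand_from_set are hypothetical names; a real implementation would use the O(1) slot-sampling trick above):

```julia
using Random

# Return a uniformly random key => value Pair from a non-empty Dict.
function rand_dict_pair(d::Dict)
    isempty(d) && throw(ArgumentError("dict must be non-empty"))
    i = rand(1:length(d))
    for (k, kv) in enumerate(d)   # iterating a Dict yields key => value Pairs
        k == i && return kv
    end
end

# A Set wraps a Dict internally, so the set method falls out for free:
rand_from_set(s::Set) = rand_dict_pair(s.dict).first

d = Dict("a" => 1, "b" => 2)
kv = rand_dict_pair(d)
@assert kv isa Pair
@assert d[kv.first] == kv.second
@assert rand_from_set(Set([10, 20])) in (10, 20)
```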
@stevengj Yup, that's what I was thinking.
Sorry, I'm having some trouble building my development Julia since I don't have much disk space. :/ Don't worry, I'll think of something.
Sorry guys, it looks like I'll need a much bigger VM to build Julia, and I haven't got around to making it yet—my job is making it hard.
Hopefully, I get a spare moment to build it and start work on the tests for this.