Julia: Enhancement request: optimize Set{Void}

Created on 6 Mar 2017  ·  22Comments  ·  Source: JuliaLang/julia

Ref: https://discourse.julialang.org/t/strange-behavior-with-void/2480

Would it be possible to special-case creating a set out of a large vector of Voids? This is admittedly a weird thing to ask, but it looks like it will be significant for some upcoming LightGraphs work. Right now, if I have a large vector of Voids, Julia knows not to allocate any memory for that vector, but once I try turning it into a Set, it appears that sizehint! is applied based on the number of elements, and this allocates a large amount of memory for a single-element Dict.

collections

Most helpful comment

@StefanKarpinski yes, of course.

The issues as I see them are twofold, listed in order of (my selfish) priority:

1) a = Vector{Void}(10_000_000) takes no memory, but b = Set(a) takes a (relative) lot.
2) a = Vector{Void}(10_000_000) takes no time, but b = Set(a) takes a (relative) lot.

Issue 1 is likely caused by sizehint! not understanding that it should allocate the minimum of the "type width*" and the length of the iterator. Improvements here can have impacts on all "small-width" types.

Issue 2 is caused because in most cases, one needs to iterate through the entire list to populate the Set**, but in the particular case of Void (and perhaps a few others) no iteration is necessary: if the list is non-empty, then it's a single fixed Dict; if the list is empty, it's another.

(*There's probably a correct term for this, but I'm talking about the "keyspace" of the type; that is, how many unique values it can take: for Bool, it's 2, for Void, it's 1, for UInt8, it's 256, etc.)

(**Seems to me that if we know the type width, we can break after the length of the Set/Dict equals it.)

(Sorry if this is obvious to everyone else - I just wanted to get my understanding out there to make sure it's right.)

All 22 comments

I don't think it worth adding a special case for this. The sizehint! should be smarter though.

My main concern is that a Set{Void} should never have the opportunity to grow to be 131 MB. Vector does the right thing.

Vector does nothing special but Set (Dict) need extra allocation for the index so the two are totally different.

The sizehint does seem to be too aggressive but this certainly doesn't worth a special case at all. There are many more cases where a vector might contain a lot of duplication so the sizehint would not be appropreate in those case.

I (think I) understand, but the specific case of Void is that the Dict underlying the Set will never grow beyond one element, no matter how large the Vector{Void} is. This is a pathological case that, IMO, is ripe for optimization - you could turn this into an O(1) memory and O(1) time operation. There's certainly no reason to call sizehint! or union! in this case: the dict for Set(Vector{Void}(n)) will always be

Dict{Void,Void} with 1 entry:
  nothing => nothing

for any n.

Well, the point is though that there's nothing special about Void.

Except that using it as the key type for a Dict or Set is pretty useless.

It's useless unless you're trying to create an implementation for an abstract data structure that can take an (optional) data type, and you don't need it in this particular case. Consider a case where you have

immutable Edge{ED} <: AbstractEdge
  src::Int
  dst::Int
  data::ED
end

function write_unique_edge_data{ED}(io::IO, ev::Vector{Edge{ED}})
  unique_edge_data = Set([x.data for x in ev])
  for i in unique_edge_data
     write(io, "$i")
  end
end

One would like the function to work with any type of edge data, including Void, in the most efficient way possible. I could parameterize the function and test for Void specifically, but I'd have to do that across the board. There's really no reason to do this when the global optimal solution is a one-liner or so in Base, and can help everyone.

is a one-liner or so in Base

And similarly a one line fix for the package that needs it which would be much better anyway if Void is an important case for you since the allocation is not at all why the line takes a long time.

I agree there's an issue in the union! implementation but it applies to all collections with a lot of elements and even on the type level small types that can't represent too many values. We should not add a workaround for a bug that only works for a case that's not generic at all and in general the least important (compare to Bool, Int8 for example).

I don't see why we can't handle all singleton types efficiently. Using zero storage for vectors of singleton types seemed like a cute gimmick until we found that it had real useful applications. This is a similar situation – if we automatically represent dicts of singleton types efficiently, then every usage everywhere gets the benefit automatically. Otherwise every user needs to handle this specially.

Using zero storage for vectors of singleton types seemed like a cute gimmick until we found that it had real useful applications.

👍

Not to derail, but this is what's going to allow some very cool abstractions in an upcoming version of LightGraphs.

we found that it had real useful applications

Which is?

Otherwise every user needs to handle this specially.

As stated above, this doesn't free the user from that.

I do think it makes sense for dict and set sizehint to know about the number of possible values of types, so we don't allocate tons of memory for sets of void, bool, int8, etc.

It would be pretty easy to use 1 << sizeof(T) as an upper bound in sizehint! if isbits(T) && sizeof(T) < sizeof(Int). Or define a num_values(T) method (defaulting to typemax(Int)) that does this, so that we can define specialized methods for Bool and enums.

Would be even cooler if it could somehow automatically switch over to an IntSet-like implementation if num_values is sufficiently small.

we found that it had real useful applications

Which is?

https://github.com/JuliaLang/julia/blob/0827820b3d6d1e6db0a0d4431f2faacf/base/set.jl#L4

As stated above, this doesn't free the user from that.

How doesn't it? If Base Julia uses the most efficient set representation for any type, then the user doesn't have to worry about it – they can just make a Set object with the correct type and benefit.

@sbromberger: there are two possible Dict{Void,Void} values: empty and nonempty, so you need at least a bit of storage, but that's all. This can probably be worked into the storage and hashing.

If Base Julia uses the most efficient set representation for any type, then the user doesn't have to worry about it – they can just make a Set object with the correct type and benefit.

Because the representation has nothing to do why it's slow.

Would be even cooler if it could somehow automatically switch over to an IntSet-like implementation if num_values is sufficiently small.

The num_values being sufficiently small isn't quite enough, since one can imagine a 64-bit type where some 256 values are spread throughout the possible set of representations. In that case, you'd need minimum perfect hashing for the IntSet representation to work. So maybe asking isbits(T) && sizeof(T) <= 2 is a better criterion since that basically means that you can use the values directly as hashes.

Because the representation has nothing to do why it's slow.

The complaint isn't that it's slow, it's that it takes up a huge amount of memory for a 1-bit type.

@StefanKarpinski yes, of course.

The issues as I see them are twofold, listed in order of (my selfish) priority:

1) a = Vector{Void}(10_000_000) takes no memory, but b = Set(a) takes a (relative) lot.
2) a = Vector{Void}(10_000_000) takes no time, but b = Set(a) takes a (relative) lot.

Issue 1 is likely caused by sizehint! not understanding that it should allocate the minimum of the "type width*" and the length of the iterator. Improvements here can have impacts on all "small-width" types.

Issue 2 is caused because in most cases, one needs to iterate through the entire list to populate the Set**, but in the particular case of Void (and perhaps a few others) no iteration is necessary: if the list is non-empty, then it's a single fixed Dict; if the list is empty, it's another.

(*There's probably a correct term for this, but I'm talking about the "keyspace" of the type; that is, how many unique values it can take: for Bool, it's 2, for Void, it's 1, for UInt8, it's 256, etc.)

(**Seems to me that if we know the type width, we can break after the length of the Set/Dict equals it.)

(Sorry if this is obvious to everyone else - I just wanted to get my understanding out there to make sure it's right.)

pinging on this - it would be really great to see this change at some point.

@rfourquet - thank you for sticking with this!!

Was this page helpful?
0 / 5 - 0 ratings