This is a "companion issue" to #31601 in determining what abstractions would need to be in Julia Base (or let's say a stdlib) to allow to write packages for tabular data with no dependency. My intuition here is fuzzier than in #31601 so I hope the proposal makes sense, feel free to correct me if it does not. See this comment and this comment for a little bit of background.
There are at least three distinct array types (WeakRefStrings.StringArray
, PooledArrays.PooledArray
and CategoricalArrays.CategoricalArray
) that each have a corresponding "array of references" (WeakRefStrings
in the first case and say UInt8
in the latter two cases). The array of references (let's call it fast_array(v)
, though the name should be chosen with care) has the following two properties in these cases:
fast_array(v)[i] == fast_array(v)[j]
if and only if v[i] == v[j]
(but the first comparison is in general much faster)permute!(v, p)
and permute!(fast_array(v), p)
have the same effect or similarly copyto!(v, v[p])
and copyto!(fast_array(v), fast_array(v)[p])
have the same effect (again, the fast_array
version is much faster)What I wanted to ask is whether being able to create a fast_array
with the same guarantees as above is sufficiently general as to deserve to have a "fallback" in Base (simply fast_array(v::AbstractArray) = v
, as it has the two properties defined above). Then the various packages could extend it as needed and external tabular data packages could have efficient row comparison (and efficient permute!(v)
and copyto!(v, v[p])
or more generally sorting functionality) without having to depend on all these array packages.
If Base is not a good location for this interface, I'm also open to suggestion as to what could be a reasonable place for this thing to live (it could also be together with the defaultarray
function from #31601).
cc @mbauman
To add even more ramblings, I wanted to point out that it would actually be helpful to have two things:
fast_array(v)
as aboveBasically the second point would be some function recover_original(v, ref)
so that recover_original(v, fast_array(v)[i]) = v[i]
. In this way for example when converting to a PooledArray
(which pools elements together) a v::StringArray
, one could just pool the fast_array(v)
and then convert the (few) unique values to the actual string values. Not having access to this part of the interface led to this conditional definition in a Requires block in Tables and to this strange optimization in a Requires block in StructArrays.
+1, but I don't know whether this should be in Base or in a package. Also it would be nice to find a more specific name.
Do you think we also need levels
as an O(1) way to get the list of possible values, in the order corresponding to references in fast_array(v)
? Currently it lives in Missings.jl (because it skips missing values automatically). Without it, we would have to do something like v[indexin(unique(fast_array(v)), fast_array(v))]
to get the list of unique values, which is O(N). There's a difference between CategoricalArrays/PooledArrays and WeakRefStringArrays in that regard, which is that the former guarantee that references are contiguous and point to a pool of values (returned by levels
), while the latter only returns a custom struct which isn't even an integer. Knowing the range of possible integer values is essential e.g. to preallocate a vector mapping integer references in fast_array(v)
to references ordered according to sortperm(unique(v))
when grouping (to avoid a second pass just for sorting).
(BTW, WeakRefStringArray
is supposed to be deprecated in favor of StringArray
(https://github.com/JuliaData/WeakRefStrings.jl/issues/36), but that doesn't really affect this issue AFAICT.)
EDIT: Ooops, for some reason your comment only appeared when I posted mine. We're talking about the same thing: levels(v)[i]
allows retrieving the value corresponding to value i
in fast_array(v)
, but it only works when that array contains contiguous integer indices (i.e. not for WeakRefStringArrays).
Interesting. At first blush based upon your naming and use of indexing in the examples, I was thinking you were talking about a function that would allow an array to decide if it's faster to copy into a dense Array
or to use directly. So I think we need a better name here 鈥斅燼nd ideally a more general construct.
In some senses, it's kinda like hash.(A)
, only with stricter guarantees, right? In fact, it's essentially perfecthash.(A)
, yeah? Of course, it can't really be a broadcast function because it's the array itself that knows the hashing scheme. Is there more language around perfect hashing that would help us describe levels
and the mapping? My cursory perusal of https://en.wikipedia.org/wiki/Perfect_hash_function was unsatisfactory.
@nalimilan : glad to see we are on the same page (despite GitHub glitches)! It would be ideal to have a unique interface for StringArray
and Categorical/PooledArray
: I like the function because it encompasses both scenarios (being levels(v)[i]
for the categorical case and just String
for the WeakRefStrings case). The sortperm
issue you mentioned was done in PooledArrays in a bit of a "hacky" way by feeding to Perm
a modified version of the vector that has equivalent sorting but is a AbstractVector{<:Integer}
which sorts more quickly. I'm still trying to wrap up my mind as to whether fast_array
and recover_original
could be enough, or if we need a levels
dictionary?
A possible API in my mind would be:
refarray(v::AbstractArray) = v
valuefromref(v::AbstractArray, ref) = ref
with specializations for the various array types. Or we could have a custom RefArray
type that incorporates both the refarray(v)
and the closure ref -> valuefromref(v, ref)
? Not sure.
As to where this should live, I personally have no preferences as long as PooledArrays, WeakRefStrings and CategoricalArrays accept it (so that I can remove "performance hacks" from StructArrays). Maybe it can be tried in a package and then moved to Base in the future?
In some senses, it's kinda like
hash.(A)
, only with stricter guarantees, right?
It's a bit tricky, as in my mind the pooled arrays (arrays that represents data with a vector of hashes and a dictionary) would return hash.(v), ref -> hashdict[ref]
, whereas the StringArray
would return a strange array of custom objects and the method String
to turn them into actual string, and I would like there to be a fallback of v, identity
: I'm afraid that the hash
terminology applies to the first scenario but not to the others.
I'm still trying to wrap up my mind as to whether
fast_array
andrecover_original
could be enough, or if we need alevels
dictionary?
I guess it would work for some use cases, but (as I said) some optimizations only work when you know the references are contiguous integers (so not for WeakRefStringArray
). In that case, you also need a way to know the number (or range) of the references, and the list of unique values. And of course you need a way to know whether you're in this case or not (some kind of trait?).
So in summary one could have (I prefer allowing refvaluedict
to return nothing
rather than adding a trait, but either way is fine for me):
refarray(v::AbstractArray)::AbstractArray = v
refvaluedict(v)::Union{AbstractDict, Nothing} = nothing # nothing if info is not easily available
function refvaluemap(v::AbstractArray)::Function # or callable type
dict = refvaluedict(v)
dict === nothing ? identity : t -> dict[t] # hopefully all of this gets optimized away
end
We just need to come up with sensible names and decide where this should live. @mbauman does this look like something that would be useful in Base? Otherwise it may just be better to have a tiny package for this.
Looks good. Not sure about the names, it's always the hardest part. :-)
We should probably start with a package, and maybe move it to Base later if we're happy with the API and it feels essential enough. Indeed Base doesn't contain functions which aren't used by any type defined in Base AFAIK.
Two remarks:
refvaluedict
, I think we should just require it to return an object which implements length
and keys
/values
and can be indexed using values in refarray
(or nothing
). In practice that means PooledArrays and CategoricalArrays can return a vector, and WeakRefStringArrays a dict.refvaluemap
doesn't seem to be required, since you gave a very simple and generic definition that packages can copy.In practice that means
PooledArrays
andCategoricalArrays
can return a vector, andWeakRefStringArrays
a dict.
That's a good point, vector makes a lot of sense for PooledArray
and CategoricalArray
. I have a basic confusion though: for me StringArray
should return nothing
here as there is no efficient small "refvaluedict" in the general case. However StringArray
should somehow give the information that one should call String
to get the concrete object. Should we just "force it" by annotating the return type, meaning the general function to get the actual values would be (here is an example function to recover values that we would not include in the code but maybe just in the README):
julia
function ref2value(v::AbstractArray{T}, t)::T where T
dict = refvaluedict(v)
dict === nothing ? t : dict[t]
end
So the rule would be: if there is a refvaluedict(v)
then refarray(v)[i]
is the key for refvaluedict(v)
to get the correct value. If instead refvaluedict(v)
is nothing
, refarray(v)[i]
should be something that converts v[i]
.
If indeed we only need this two functions, I find the PooledArrays terminology (refs
and pool
) to be the clearest. I made a small package here with these two definitions so that we can start trying out PRs at PooledArrays, CategoricalArrays and WeakRefStrings.
If instead
refvaluedict(v)
isnothing
,refarray(v)[i]
should be something that convertsv[i]
.
Can you develop? refarray(v)[i]
would return an UInt64
offset for StringArray
. How would you convert that to a String
?
What I had in mind was something like:
refarray(v::StringArray) = convert(StringArray{WeakRefString{UInt8}}, v)
So basically the refarray
would keep the buffer and then refarray(v)[i]
would be a WeakRefString
that could be converted to string easily. The advantage compared to the original array is that to compare two entries we don't need to allocate strings (indexing just returns the WeakRefString
object without materializing a new string). For example:
julia> s = StringArray(["asda", "sada"]);
julia> w = convert(StringArray{WeakRefString{UInt8}}, s);
julia> @btime $s[1] == $s[2]
34.966 ns (2 allocations: 64 bytes)
false
julia> @btime $w[1] == $w[2]
4.773 ns (0 allocations: 0 bytes)
false
Drawback here is that things like a[1:2] = a[3:4]
can't really be done efficiently because a[3:4]
would copy the buffer, whereas I would like to simply work on the a.offsets
and a.lengths
arrays. Maybe this could be fixed by requiring that slicing doesn't copy the buffer for StringArray{<:WeakRefString}
(which are a type that should be "hidden" to the user, so maybe this is OK).
I imagine you are thinking something along the lines of refarray(v) = (v.offsets, v.lengths)
(you also need the length as FWIU in principle there is no guarantee that a string ends where the following begins) which solves the a[1:2] = a[3:4]
problem but doesn't allow comparison, as strings may start at different offsets but be equal.
Makes sense. I think we should wait until @quinnj is available to comment, though.
Drawback here is that things like
a[1:2] = a[3:4]
can't really be done efficiently becausea[3:4]
would copy the buffer, whereas I would like to simply work on thea.offsets
anda.lengths
arrays. Maybe this could be fixed by requiring that slicing doesn't copy the buffer forStringArray{<:WeakRefString}
(which are a type that should be "hidden" to the user, so maybe this is OK).
Maybe that's not an issue? These arrays aren't supposed to be transformed anyway, and one would better use views to take subsets.
I imagine you are thinking something along the lines of
refarray(v) = (v.offsets, v.lengths)
(you also need the length as FWIU in principle there is no guarantee that a string ends where the following begins) which solves thea[1:2] = a[3:4]
problem but doesn't allow comparison, as strings may start at different offsets but be equal.
Good point.
Maybe that's not an issue? These arrays aren't supposed to be transformed anyway, and one would better use views to take subsets.
Yes, actually by looking closer this is probably not needed. I'm getting convinced that refarray(v::StringArray) = convert(StringArray{WeakRefString{UInt8}}, v)
(with some care for missings) should be the correct way forward.
Most helpful comment
+1, but I don't know whether this should be in Base or in a package. Also it would be nice to find a more specific name.
Do you think we also need
levels
as an O(1) way to get the list of possible values, in the order corresponding to references infast_array(v)
? Currently it lives in Missings.jl (because it skips missing values automatically). Without it, we would have to do something likev[indexin(unique(fast_array(v)), fast_array(v))]
to get the list of unique values, which is O(N). There's a difference between CategoricalArrays/PooledArrays and WeakRefStringArrays in that regard, which is that the former guarantee that references are contiguous and point to a pool of values (returned bylevels
), while the latter only returns a custom struct which isn't even an integer. Knowing the range of possible integer values is essential e.g. to preallocate a vector mapping integer references infast_array(v)
to references ordered according tosortperm(unique(v))
when grouping (to avoid a second pass just for sorting).(BTW,
WeakRefStringArray
is supposed to be deprecated in favor ofStringArray
(https://github.com/JuliaData/WeakRefStrings.jl/issues/36), but that doesn't really affect this issue AFAICT.)EDIT: Ooops, for some reason your comment only appeared when I posted mine. We're talking about the same thing:
levels(v)[i]
allows retrieving the value corresponding to valuei
infast_array(v)
, but it only works when that array contains contiguous integer indices (i.e. not for WeakRefStringArrays).