Julia: Julep: Generalize indexing with and by `Associative`s

Created on 6 Oct 2017  Β·  54Comments  Β·  Source: JuliaLang/julia

I've always loved how in Julia (and MATLAB) one can create a new array from an old one, using what is now the APL indexing rules. Basically if you index a collection of values with a collection of indices, you get a new collection of the indexed values. Beautiful, simple. Indexing has also been extended by allowing arrays that don't use 1-based indexing by e.g. the OffsetArrays.jl package.

I'm not sure if this issue exists elsewhere as its own entity (cleaning up distinctions between arrays and associatives was surely mentioned in #20402 and this Julep seems to be a logical extension of #22907), but here I propose specifically that we extend indexing of and by Associative and make related changes so that the semantics are consistent across these two types of container. I prototyped ideas at https://github.com/andyferris/AssociativeArray.jl and basically came up with the ability to (with simple code):

  • Index an Associative{K,V} with an Associative{I,K} to get an Associative{I,V}. E.g. Dict(:a=>1, :b=>2, c:=>3)[Dict("a"=>:a, "c"=>:c)] == Dict("a"=>1, "c"=>3).
  • Index an Associative{K,V} with an AbstractArray{K,N} to get an AbstractArray{V,N}. E.g. Dict(:a=>1, :b=>2, c:=>3)[[:c, :a]] == [3,1].
  • Index an AbstractArray{T,N} with an Associative{K,I} to get an Associative{K,T} (where I might be Int for linear indexing, or a CartesianIndex{N} for Cartesian indexing). E.g. [11,12,13][Dict(:a=>1, :c=>3)] == Dict(:a=>11, :c=>13).

The semantics are consistent across arrays and dictionaries, and provide that for out = a[b]:

  • The output container out shares the indices of b (note: these are CartesianRange for arrays)
  • The values out[i] correspond to a[b[i]].

This is fully consistent with both the Base arrays and the OffsetArrays.jl package (We can do something similar for setindex!).

To make everything consistent, it helps to make the following associated changes:

  • Make Associatives be containers of values, not of index=>value pairs, so that arrays and dictionaries are consistent on this fundamental point. Use the existing pairs function when necessary (and ideally make it preserve indexability).
  • Make similar always return a container with the same indices, even for dictionaries. Ideally, unify similar across Associatives and Arrays (for example a dictionary which is similar to a distributed array might also be distributed) via use of the indices.
  • Have an new empty function that makes empty Dicts and Vectors to which elements should be added. (Done, #24390).
  • Consider whether we want to have collection of things you call getindex and setindex! with be called indices, rather than keys (and rename the current indices(::AbstractArray) to something else)
  • Have view work for the various combinations where getindex works.

The demonstration package also prototypes making AbstractArray{T, N} <: Associative{CartesianIndex{N}, T} - I don't think this is strictly necessary but it helped (me) to highlight which parts of the existing interface were inconsistent. The package does demonstrate that we can put something simple together without excessive amounts of code (some performance tuning is surely required).

Finally, a word on what motivates this: lately I've been playing with what fundamental data operations (such as mapping, grouping, joining or filtering) would be useful for both generic data structures and tables/dataframes (that iterate rows), and I found whenever I created say a grouping (using a dictionary of groups), I immediately felt the loss of ability to do complex indexing and other operations with the result (as well have to worry whether the output iterates values or key-value pairs, etc).

collections design julep

Most helpful comment

I like this a lot. The disconnect between associatives and arrays has always bothered me. I like the notion of array as a specialization of associative collections. The most visible cost of changing this would be having to write for (k, v) in pairs(d) instead of for (k, v) in d but I think that's a pretty negligible cost for all the benefits it brings. In particular, I think that if you want to address the question of "how should dicts iterate?" you fundamentally have to think about "how should map work on dicts?" since those two are so deeply connected. The current map has several issues:

  1. It's unclear that map(f, dict) should return another dict at all. What if f doesn't return pairs?
  2. What if the keys returned by f are non-unique? There are several different possible behaviors in that case, none of which is obviously right.
  3. The 0.6 behavior where map(f, dict) passes key-value pairs to f as a pair of arguments is convenient but fundamentally wrong if we consider dicts to be collections of key-value pairs. The 0.7 behavior where map(f, dict) passes key-value pairs as a single pair argument is correct with respect to iteration of key-values pairs but super annoying to use.

If we change the model for associative collections from being a collection of pairs to being an indexed collection of values and accordingly iterate values instead of pairs, that would clarify the behavior of map(f, dict): it transforms values and that's all. It's obvious that it should return a dict because it leaves the indexing structure of the input untouched just as it does for arrays. We could introduce mappairs and mapkeys functions, but I'm not sure that we need to since we can just write Dict(f(k) => g(v) for (k, v) in pairs(d)) with appropriate f and g.

All 54 comments

I should have also mentioned that "logical indexing" dict[inds], where the values of inds are Bool and the indices of inds match the indices of dict, should probably also work, just like they do for AbstractArray. That would basically be adding Dict(:a=>1, :b=>2, c:=>3)[Dict(:a=>true, :b=>false, :c=>true)] == Dict(:a=>1, :c=>3).

This looks really cool! I haven't had the chance to look through it thoroughly yet, but it's very appealing to me that arrays can be thought of as associative containers with a shape.

Would it be a goal of AssociativeArray to support say , if v = Vec([3,2,1]), something like v[[1,1,1,1]] (which should be [3,3,3,3])? It doesn't look like that works at present, or is there special syntax for that?

@macd Yes, but does v[Vec([1, 1, 1, 1])] do what you expect? I didn't put any effort into making Assoc interoperate with Base types like Vector - it's just for demonstration but we can add more to the package if there is more design space to explore.

Very cool, does work as I expect. I didn't get the interoperability issue, but it makes perfect sense.

As someone who has worked on this kind of stuff, I found it interesting that I was initially surprised by some of your choices, until you pointed out the guiding principle. I completely agree that out = a[b] implies out[i] = a[b[i]] is the most reliable foundation for these design decisions. :+1:

Thanks @timholy. I should point out that this is 100% motivated by a few of your comments along the lines of "arrays are really just some special kind of associative container" (in regards to motivating/justifying offset arrays) and I've been trying to make sense of that since :)

OK, so far I'm seeing positive feedback here, at least on the indexing behavior. I'd like to know people's opinions on my list of suggested breaking changes that IMO would make all this come together consistently. Enumerated:

  1. Have Associative iterate values not index-value pairs. When I first saw Jeff mention this some time ago, I was pretty surprised (my initial reaction was scepticism). However, after thinking about this since then, I pretty much consider this a given now - it will help a lot for generic programming to arbitrary indexables (and we aren't going to make Array iterate pairs, now, are we?).
  2. similar preserves indices and empty creates an empty container (with no indices). This is breaking for similar(::Associative), but consistent with similar(::AbstractArray). I can already see that this will enormously help in creating code/patterns which work across Associative and AbstractArray. It's about clear semantics.
  3. Rename keys to indices and keytype to indextype. While this might be seen as pure bikeshedding, I think it might be more intuitive to work with getindex, setindex! and indices (or else I would be tempted to propose getindex -> getkey, etc).
  4. The crazy part of AssociativeArray.jl: make AbstractArray a subtype of Associative. I still don't know how to feel about this one.

Anyway, some design feedback here would help in creating an implementation.

Very interesting indeed. Still pondering this. One more major breaking point: Allowing non-scalar indexing on Associatives means you'll no longer be able to have arrays or associatives as keys. That is,

julia> d = Dict(:a=>1, :b=>2, :c=>3, [:c,:a]=>4)
Dict{Any,Int64} with 4 entries:
  :a             => 1
  :b             => 2
  :c             => 3
  Symbol[:c, :a] => 4

julia> d[[:c,:a]]
4

@mbauman Yes this is an interesting case. Note that for cases where the indices are strongly typed (such as Dict{Vector{Symbol}, Int}) we can catch this correctly at dispatch, so not all is lost. What to do exactly when the index type is not concrete or not known is a bit less clear. (I would tend to try and prioritize scalar indexing so this isn't a breaking change at all, but this might introduce a run-time cost in certain situations. However, if the index type is Any, well, things are probably going to be a bit slow anyway).

Also - it would help if I knew of a use case for such index types. I know we've worried about hashing of vectors and unit ranges and so-on, but I haven't come across a use for this yet...

In any case we need a nicer way to express that an array should be treated as scalar in an expression where arrays usually are splashed, broadcasted or unboxed in other ways, compare the crutch Scalar from StaticArrays to prevent broadcast. If we figure out that on a high enough level, then also this julep benefits.

Yes, this is also currently a problem for non-scalar indexed assignments for arrays of arrays. That is, A[:] = 1:2 is always interpreted as A[1] = 1; A[2] = 2 and not fill!(A, 1:2). It can really be a pain for generic programming.

I have a very strong distaste for changing semantics based upon eltype changes.

That does seem like a tricky case... I'm wondering what the solution is. We could always favour scalar indexing unless we can prove it doesn't work, or something...

(It seems to me that the scalar case is the more fundamental kind of indexing (at least of getindex, I'm still torn on that setindex! example). @mschauer I was more wondering if we should force users to use Base.Slice or some-such to opt-in to non-scalar indexing in ambiguous cases, rather than ask them to opt-in to scalar indexing. OTOH - using something like Scalar might be easier to use?).

Make Associatives be containers of values, not of index=>value pairs

Then this might no longer be an Associative data-structure. It’s dangerous to generalize the behavior of a concretion, Dict, to the properties of all abstract Associative data structures. An Associative Array is defined as "a collection of (key, value) pairs" (https://en.wikipedia.org/wiki/Associative_array). If you pick that collection to be an unordered Set over the keyspace, then you get a Dict, but if you pick it to be something else, you can still get an Associative subtype. Some other collections that can hold associative pairs include a binary tree (making a SortedDict), an array (making a MultiDict, or the ImmutableDict implementation in Base).

Wikipedia also notes some examples of common Associative data-structures which are not uniquely indexable, including HTML queries and header. They are ordered and can be iterated as pairs, but attempting to index them may return a non-unique result (one name for these is: ordered-multimaps).

It's worth noting that if I were developing non-scalar indexing from scratch right now, I'd totally detangle the scalar and non-scalar behaviors with dot-broadcasting. A[I,J,K] actually behaves very much like the indices are broadcast together, except that J is projected into a higher dimension such that it is always completely orthogonal to I, and K even higher, and so on. This is APL indexing in a nutshell.

Actually, this makes me realize that we could deprecate the non-scalar indexed assignment of many values to many indices to A[idxs] .= [values...]. Then we could reclaim A[idxs] = [value] to place the same array into many locations.

@vtjnash. It is interesting to consider all the possible data structures here. I guess in some senses I've mostly thought Associative to mean "indexable". To expand, from Wikipedia:

In computer science, an associative array, map, symbol table, or dictionary is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears at most once in the collection.

I'm pretty sure the uniqueness of the keys is pretty important here - we'll need to make some assumptions to make a good indexing interface suitable for generic programming. For my uses, I don't see any problem with using the pairs function as required - it's also useful for e.g. Arrays. The benefits to having for example map(indexable) and map(pairs(indexable)) and so-on work in a sensible (index-preserving) and consistent fashion with the same semantic for AbstractArray and Associative seem great (not only will some generic code work for both, but perhaps more importantly it reduces the cognitive load for the code author or reader). I do appreciate that it may be traditional to consider such associative maps as containers of key=>value pairs but I am questioning whether that is the best default choice for iteration for usability in Julia.

I consider the more "exotic" multi-maps and XML structures as being something a bit different to a Julia Associative (e.g. it seems to me that a MultiDict{K,V} is just as useful as a Dict{K,Vector{V}}, with perhaps a bit of an expanded interface to make life easy).

One interesting thing from the Wikipedia article that I have been thinking about is that there is a difference between updating the value corresponding to an existing index, and adding a new index to the collection. It's extremely convenient that I can add new indices to a Dict with setindex! but when reading code I do sometimes get confused which operation is being performed (and of course there is no run-time safety here either, to constrain it to one operation or the other). Not sure how/if we should deal with this somehow (I'm thinking something like push!(collection, index, value)).

Actually, this makes me realize that we could deprecate the non-scalar indexed assignment of many values to many indices to A[idxs] .= [values...]

This seems like a reasonable idea to me. (It reminds me, I've of done something like @view(A[idxs]) .= some dot expression before to take advantage of dot-fusion, so it would be nice to get the fusion for free without the @view).

That should work just fine without the @view:

julia> expand(:(A[idxs] .= sqrt.(B)))
:((Base.broadcast!)(sqrt, (Base.dotview)(A, idxs), B))

dotview is a view for non-scalar indexing expressions, regular getindex otherwise.

Oh, that's nice to know :)

You've now got me wondering if it should be A[i] .= sqrt.(B) for scalar i and A.[inds] .= sqrt.(B) for non-scalar inds (and A[i] vs. A.[i] more generally... I think this is what you were saying above... hugely breaking I know but still interesting... and if we get @ for field referencing from #21912 then something like A@[inds] for view(A, inds)... mind blown. Pity it's a bit of an ASCII scramble though).

The major issue with A.[] is logical indexing and other representations of many indices that aren't just arrays of Ints. We're getting a bit far afield from your original idea here, though. We can keep discussing it over at #22858.

Only because out[i] = a[b[i]] or out[I] = [out[i] for i in I] is the design principle in the "Bracket calculus" document I wrote last year, allow me to link it here: https://gist.github.com/mschauer/b04e000e9d0963e40058, see e.g. https://gist.github.com/mschauer/b04e000e9d0963e40058#indexing-of-arrays

map(indexable) and map(pairs(indexable)) and so-on work in a sensible (index-preserving) and consistent fashion with the same semantic for AbstractArray and Associative seem great

I think there's two operations here being described as map: I'll call them enumerable-map and associative-map. Usually in Julia, map refers to enumerable-map (e.g. map(identity, (i for i in 1:10)) is perfectly valid, and is not an indexable container). However, we can also define an associative map operation that applies the function just to the value as follows: associative-map(f, a) = map( (k => v) -> (k => f(v)), a ). In other languages, this function seems to often be called mapvalues.

We can extend this function to multiple associative collections by making the definition varargs: associative-map(f, a1, as...) = map( (k => v1) -> (k => f(v1, map(a -> a[k], as)...)), a1 )

I like this a lot. The disconnect between associatives and arrays has always bothered me. I like the notion of array as a specialization of associative collections. The most visible cost of changing this would be having to write for (k, v) in pairs(d) instead of for (k, v) in d but I think that's a pretty negligible cost for all the benefits it brings. In particular, I think that if you want to address the question of "how should dicts iterate?" you fundamentally have to think about "how should map work on dicts?" since those two are so deeply connected. The current map has several issues:

  1. It's unclear that map(f, dict) should return another dict at all. What if f doesn't return pairs?
  2. What if the keys returned by f are non-unique? There are several different possible behaviors in that case, none of which is obviously right.
  3. The 0.6 behavior where map(f, dict) passes key-value pairs to f as a pair of arguments is convenient but fundamentally wrong if we consider dicts to be collections of key-value pairs. The 0.7 behavior where map(f, dict) passes key-value pairs as a single pair argument is correct with respect to iteration of key-values pairs but super annoying to use.

If we change the model for associative collections from being a collection of pairs to being an indexed collection of values and accordingly iterate values instead of pairs, that would clarify the behavior of map(f, dict): it transforms values and that's all. It's obvious that it should return a dict because it leaves the indexing structure of the input untouched just as it does for arrays. We could introduce mappairs and mapkeys functions, but I'm not sure that we need to since we can just write Dict(f(k) => g(v) for (k, v) in pairs(d)) with appropriate f and g.

Another question here: should we continue to treat d[1,2] as d[(1,2)]? Given that we've never seen anyone even consider requesting such a monster, I'm not sure it's worth burning time on a multidimensional dictionary that would have a tuple of array-like indices and have values that span the cartesian product thereof.

But we may want to deprecate the implicit tuplization behavior here in order to make the two more similar.

That syntax has always seemed to me like a convenient poor man's sparse matrix.

What if f doesn't return pairs?

That's the smoking gun. The usual structure of a map operation is (A->B) -> Collection[A] -> Collection[B]. For Dict, A is a Pair type, but since B can be anything this structure does not match. But if we take Dict{K} as the Collection type, it matches the structure of mapping over the values.

Also - it would help if I knew of a use case for such index types. I know we've worried about hashing of vectors and unit ranges and so-on, but I haven't come across a use for this yet...

Memoization is a big one.

Index an Associative{K,V} with an AbstractArray{K,N} to get an AbstractArray{V,N}. E.g. Dict(:a=>1, :b=>2, c:=>3)[[:c, :a]] == [3,1].

Mappings and functions are the same thing, so why not replace your proposed indexing with function application: Dict(:a=>1, :b=>2, :c=>3) ∘ [:c, :a] == [3,1]?

It would be error-prone to write generic code where dict[x] is either a look-up, or APL-style indexing, depending on x's type.

@cstjean ∘ already has a meaning which we must preserve, where Dict(...) ∘ [...] creates a function x -> [...](Dict(...))(x). I'm not really comfortable with punning function calls for some kind of indexing behavior.

Thanks @StefanKarpinski for that post. I was trying to mostly avoid the pair-iteration topic here (thinking we should possibly have a separate github issue for that), but you hit the nail on the head when you discuss map, and in fact that is the entire motivation for me supporting having Associative iterate values. In my explorations I found that workflows involving higher-order operations like map, reduce, filter and group (my own function) quickly became unworkable whenever an Associative was created. This change alone would make data manipulation with Associative so much easier to handle (especially the synergy with having a similar interface to AbstractArray).

@mbauman @TotalVerb I would be tempted to deprecate the current dict[a,b] behaviour meaning dict[(a,b)] and free it up to allow the package ecosystem to create the "monster" of Cartesian product spaces for keys of associatives if someone wants (at the very least it would be an interesting experiment and I believe not very difficult either). People will still be free to use tuples (and named tuples) as dictionary keys.

∘ already has a meaning which we must preserve, where Dict(...) ∘ [...]

That's a side-effect of the way ∘ is implemented, not useful functionality.

I'm not really comfortable with punning function calls for some kind of indexing behavior.

It's not a pun, it's an isomorphism: every function f(x::A)::B over a finite countable domain A can be represented as a Dict{A, B}. "Composition of associatives" could behave like function composition.

EDIT: Perhaps I should have mentioned that Index an Associative{K,V} with an AbstractArray{K,N} to get an AbstractArray{V,N}. E.g. Dict(:a=>1, :b=>2, c:=>3)[[:c, :a]] == [3,1]. _is_ a composition of the two associative containers. In any case, @mbauman's suggestion looks more practical.

We already have a very nice scheme for detangling scalar and nonscalar meanings. And the one we need β€” d.[vector] β€” is completely available. It'll take some thought, but I totally believe that we can create a workable definition for it.

I'm not seeing how d[a,b] meaning d[(a,b)] is at odds with this. Wouldn't that dovetail nicely with the view of dicts as a map between an index and a value? In a matrix, the index is the tuple (i,j) which is mapped to the matrix entry M[i,j]. Although I do agree that deprecating it is the safest option in case we discover some inconsistency later on. We can always reintroduce it if it turns out to be useful and consistent.

That's fine, but I thought each index of an Array{T,N} was a CartesianIndex{N}, not an NTuple{N, Int} - I kind-of feel we should pick one or the other and stick with it. :)

At this stage I'm not sure that worrying about Cartesian indexing on Associative is extremely important unless someone is seriously considering using e.g. the type alias AbstractArray{T,N} = Associative{<:AbstractCartesianIndex{N}, T}. But in AssociativeArray.jl I did get the feeling that we can more easily build Cartesian indexing machinery like Slice if the keytype is some Cartesian type, neatly separating everything from cases where the key type is some arbitrary tuple (not restricted to a Cartesian product of sets).

It's worth mentioning that in #24086 we discussed using the syntax a.[b] for non-scalar getindex and a[b] for the scalar case. This helps disambiguate which operation you want, especially for Associative containers where keys might be collections, and relates to an existing ambiguity in setindex! for arrays (as discussed in #24086).

How about the following:

struct MatrixDict{T1, T2, V}
    dict::Dict{Tuple{T1, T2}, V}
    MatrixDict{T1, T2, V}() where {T1, T2, V} = new(Dict{Tuple{T1, T2}, V}())
end

Base.getindex(d::MatrixDict, k1, k2) = d.dict[(k1, k2)]
Base.setindex!(d::MatrixDict, v, k1, k2) = d.dict[(k1, k2)] = v

d = MatrixDict{Int, UnitRange{Int}, Int}()

julia> d[1, 1:3] = 1
1


julia> d[2, 1:3] = 2
2

julia> d.[1:2, 1:3]
????

IIUC, the last expression would give an error since it would broadcast the 1:3. How would I write to only broadcast on the 1:2 while keeping 1:3 as a "scalar".

That's effectively the same problem as marking specific arguments in a f.() expression to be treated as scalar. And the solution is the same: just "protect" the scalar inside an array/collection: d.[1:2, [1:3]]. APL indexing rules would mean that adds a dimension, but you could similarly do d.[1:2, fill(1:3)] to drop it.

That's another motivation for a non-dimension adding scalar wrapper type: https://github.com/JuliaLang/julia/issues/18379

From triage: for 1.0 we mostly need to consider dict iteration and possible renamings. The rest of the indexing stuff can be handled later.

Also from triage, Stefan proposed that we make the interface returned by pairs/keys/values be more complete in order to better support all uses cases of array and dict objects (and sets too). In particular, the proposal was that we should change their return type/behavior to define them as transformations between the 3 fundamental containers in Julia. In particular, define them such that:

pairs(array/dict)::Associative # effectively equivalent to `Associative(zip(keys, values))`
values(array/dict)::AbstractArray # (indexed by the original keys)
keys(array/dict/set)::AbstractSet

I think we can live with dicts iterating either pairs or values --- if they iterate pairs, there will be contexts where you need to write values(d), and if they iterate values there will be contexts where you need to write pairs(d).

However, it seems to me there are lots of functions that operate on keys and values in certain useful patterns. For example indmin gives you the index of the value that minimum returns. It seems obvious to me that indmin on a dict should give you the key of the minimum value (as it now does). Therefore minimum should give the minimum value. That implies that either (1) minimum should call values on its argument, or (2) dicts should iterate values. (1) works, but values is the identity function for everything except Associative. Is anything else going to have a meaningful implementation of values? It seems ugly for a function that can work on arbitrary iterables to have to call values on its argument.

Then again, sorted dictionaries are sorted by key, so perhaps there's no way to have a totally consistent view.

I've been thinking about sets and keys. Here's a relatively sane property that I'm considering which may be desirable across arrays and associatives:

a[keys(a)] == a

That is, both the keys and values will be preserved by this indexing operation. (Perhaps this will be a.[keys(a)] == a if @mbauman has his way).

To achieve this:

  1. keys(a) returns an array or associative which is an "identity mapping" (I made up that term). If b is an identity mapping then b[i] == i for all i in keys(b). Base.OneTo is an example of an identity mapping (and I see no reason why a CartesianRange can't be an array supporting getindex). From memory, I think @timholy might have made some custom unit-range objects which have this property for offset arrays.
  2. AbstractSets are identity mappings, and we should in theory be able to have AbstractSet{T} <: Associative{T,T}. Set can follow the same internal implementation as current, but support getindex and so-on. We can still talk about element in set if we have 3.
  3. AbstractSets and Associatives will iterate values, not key-value pairs. We won't need a values function.
  4. We follow the indexing property in the OP (possibly implemented later, as the triage people indicate).

I think if we have 1-3 for v1.0, things may be pretty consistent and tied together nicely and ready for 4 in the future.

The final thing I'd suggest for v1.0 is to allow parsing (not lowering) of a.[i] so that we can experiment with macros in packages on how to implement the OP without doing weird stuff to e.g. indexing of Dict{Any}s.

a[keys(a)] == a

This property is true of arrays, but is not true of all associative containers. One counter-example is a multi-dicts, but another more commonly encountered counter-example is the min-heap.

You keep talking about multi-dicts, but they are not, in my view, a thing. If m is a "multidict" then what does m[k] return? Multiset, sure, that's a thing – it's an unordered collection of values, which can have repeats, in which one can test containment/count in O(1) time – also known as a Dict{K,Int}. If by "multi-dict" you mean something that maps keys to sets of values, then that's a Dict{K,Set{V}}.

but they are not, in my view, a thing

You also keep trying to claim this, despite that Julia provides an implementation of one (https://github.com/JuliaLang/julia/blob/master/base/dict.jl#L747), and which also fails to conform to your suggestion that it's just a "Dict{K, Set{V}}"

An ImmutableDict is "persistent" (I believe that term is used in Clojure) in that previous mappings aren't destroyed. But I think the fact that this is exposed by iteration is more of a bug.

another more commonly encountered counter-example is the min-heap

What does indexing mean here? Wouldn't the user-facing interface be that of a priority queue (i.e. deque- or bag-like) rather than as an indexable?

The OP is specifically referring to collections which provide mappings from a unique set of keys to a single value, which is what I thought Associative{K,V} tries to encapsulate. You can of course use that to represent more complex data structures if the values are themselves containers, or you can make your own abstract type which isn't an Associative with an appropriate interface for non-unique keys.

But I think the fact that this is exposed by iteration is more of a bug.

It's entirely reasonable to additionally define and create another persistent dict type without needing to first accuse a different data structure of being a bug. It's a feature that we can also implement a ordered multi-dict with the same interface as other associative mappings.

which is what I thought Associative{K,V} tries to encapsulate

By restricting the classification, it'll be a tradeoff. As with any classification system, making the definition more specific ("a unique mapping from a key to a single value") sometimes makes it more useful (you can assume indexing is possible). But it also then excludes other possibilities (the counter-examples I gave), which now will form some new categories. In these cases, these perhaps might be a Queue{ElementT} (or perhaps more generally an Iterable{ElementT}) – which currently is just represented as an interface (push/pop or iterate) over other structures – and a MultiMap{KeyT, ValueT}. We can change the definition of these classes pretty much however we want – there isn’t one right or wrong way to define them – but we should at least acknowledge there are repercussions of the changes: both pros and cons.

The OP is specifically referring to collections which provide mappings from a unique set of keys to a single value

There are alternative ways of writing the OP which don’t make this choice of definition. Here’s one generic implementation option that instead assumes the existence of an appropriate definition for similar and of pair iteration:

function getindex.(iterable, keys::Associative)
    newmap = similar(first(eltype(keys)) => last(eltype(pairs(iterable)), iterable)
    for (newkey, oldkey) in keys
        for (key, value) in pairs(iterable)
            if key == oldkey
                push!(newmap, newkey => value)
            end
       end
    end
    return newmap
end
function getindex.(iterable, keys::AbstractArray)
    newarray = similar(last(eltype(pairs(iterable))), iterable)
    for oldkey in keys
        for (key, value) in pairs(iterable)
            if key == oldkey
                push!(newarray, value)
            end
       end
    end
    return newarray
end

FYI I've started some prototyping at https://github.com/andyferris/Indexing.jl

I think the Indexing.jl prototype is working pretty well now. It's not too much code (it probably needs a bit more code to make it really fast) so it doesn't seem too scary to port to Base at some point.

It's probably a good time to think about future syntax, in case depredations are necessary. I would suggest ending up with something like:

a[i]               --> getindex(a, i)       # scalar only
a.[inds]           --> getindices(a, inds)  # or view(a, inds)?
a[i] = v           --> setindex!(a, v, i)   # scalar only
a.[inds] = v       --> setindices!(a, v, inds)
a[i] .= v          --> broadcast!(identity, getindex(a, i), v)
a.[inds] .= values --> broadcast!(identity, view(a, inds), values)

Note the lack of dotview and maybeview. The last two could support dot-fusion on the RHS. Also, the default for a.[inds] could potentially move to view rather than getindices (or we can make it dovetail with lazy broadcasting a la https://github.com/JuliaLang/julia/pull/23692#issuecomment-352487735 nicely).

IIUC we'd need to deprecate dotview and perhaps a semantically-ambiguous setindex! method on arrays (this one noted by @mbauman) in v0.7, if we want the above to appear in v1.x. Thoughts?

FYI, the second bullet point has been implemented in NamedArrays:

n = NamedArrays.NamedArray(rand(3), ([:a, :b, :c],))
@show n[[:c, :a]];
n[[:c, :a]] = 2-element Named Array{Float64,1}
A  β”‚ 
───┼─────────
:c β”‚ 0.410735
:a β”‚ 0.253642

(ended up here searching for a way to extend REPL completion for Associative-like structures)

the second bullet point has been implemented

Technically, the rules described further in the OP might imply that indexing with a Vector would return a Vector, so you'd need to index with a named array like:

n[NamedArray([:a, :c], ([:a, :c]))

which isn't exactly ergonomic for your purposes!

(The way I think about it is the rules described in the OP is basically equivalent to getindices(a, inds) = broadcast(i -> a[i], inds), so the output shares a lot of similarity with inds - in fact, if broadcast on Associative worked on values not pairs then we could simply make a.[inds] lower to the RHS above and not have a getindices function at all).

Technically, the rules described further in the OP might imply that indexing with a Vector would return a Vector

I wasn't really suggesting that NamedArray is a satisfying solution to your OP. In fact I struggled with how to implement n[[:a :c; :b :a]]. Currently, this still gives a NamedArray, but I had to make up the names of the indices. In the 1D-index case, the names could be obtained from the original NamedArray, but with 2D indexing this no longer makes sense.

I suppose that for type stability reasons (I probably am the most type instable programmer ever) NamedArrays (taking the role of the Associative) should implement your OP, and always produce the type of the indexing array with elements of the type of the indexed array.

I thought it might be relevant to reference Dictionaries.jl which was recently released, and implements a system for dictionary non-scalar indexing in combination with _Indexing.jl_.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

manor picture manor  Β·  3Comments

Keno picture Keno  Β·  3Comments

dpsanders picture dpsanders  Β·  3Comments

omus picture omus  Β·  3Comments

sbromberger picture sbromberger  Β·  3Comments