Julia: Julep: `broadcast` with dictionaries

Created on 6 Feb 2018  Â·  28Comments  Â·  Source: JuliaLang/julia

Current behavior

I just learned that apparently broadcast doesn't work as I expected on dictionaries:

julia> d = Dict(:a => "Alice", :b => "Bob", :c => "Charlie")
Dict{Symbol,String} with 3 entries:
  :a => "Alice"
  :b => "Bob"
  :c => "Charlie"

julia> map(kv -> kv[1] => kv[2], d)
Dict{Symbol,String} with 3 entries:
  :a => "Alice"
  :b => "Bob"
  :c => "Charlie"

julia> broadcast(kv -> kv[1] => kv[2], d)
ERROR: KeyError: key 1 not found
Stacktrace:
 [1] getindex at ./dict.jl:474 [inlined]
 [2] (::##17#18)(::Dict{Symbol,String}) at ./REPL[11]:1
 [3] broadcast(::Function, ::Dict{Symbol,String}) at ./broadcast.jl:455

julia> broadcast(length, d)
3

AFAICT it seems broadcast is working on dictionaries like they are scalars or something. Someone correct me if I'm wrong - but maybe we never added any broadcast methods for dictionaries?

Proposal

Make broadcast work with dictionaries. I can see two possibilities, of which I favor option 2:

  1. broadcast for 1D indexable containers generally works like map, so we can extend this to AbstractDict and have broadcast map the key-value pairs
  2. broadcast deals with matching indices and performing some operation on values. So broadcast(f, dict) could be a bit like map(f, values(dict)) except the dictionary structure and it's indices are preserved in the output.

It seems to me that map deals with the iteration protocol and broadcast is more about matching indices and mapping those values. Using option 2 would give:

julia> length.(d)
Dict{Symbol,Int64} with 3 entries:
  :a => 5
  :b => 3
  :c => 7

IMO this kind of syntax could help a lot with manipulating dictionaries (one example is using AbstractDict for output of groupby-style operations, and then acting on the groups) while making no breaking changes to dictionary iteration or map.

To me it also dovetails with the "getindices julep" via this comment https://github.com/JuliaLang/julia/issues/24019#issuecomment-363253768 which basically suggests we can get multi-valued getindex between different indexable container types including dictionaries via the lowering transformation:

a.[inds] --> broadcast(i -> a[i], inds)

EDIT: Addendum (27/07/2019)

I forgot to mention a crucial use case of broadcasting over dictionaries where you have two or more dictionaries with the same keys and you want to match up the values based on the keys. For example dict1 .+ dict2. For our hash-based Dict, the user can't really predict the iteration order and they may actually differ between dict1 and dict2 even if the user can guarantee that they share the same set of keys (for example, dict2 might have been rehashed and dict1 not, but IMO this implementation detail shouldn't affect the semantics of the broadcast operation).

broadcast collections julep

Most helpful comment

I don't think strings are the best example to decide how broadcast should behave, since they are really the only collection which is more often considered as a scalar than as an actual collection. Better decide the general behavior based on other containers (arrays, tuples, dicts, sets...), and only then ask whether strings should follow that convention or be an exception IMHO.

All 28 comments

Duplicate of #18618. Do you mind if we try to keep all the discussion consolidated there?

The additional complexity compared with #18618 is that for Dicts we need to decide whether the broadcasted function should receive values or pairs. So maybe we can discuss that issue here, and keep the discussion regarding general iterables at #18618.

I suggest throwing an error when a Dict is passed to broadcast for now (just like I think we should throw an error for an iterator which doesn't implement the right methods), so that we can discuss the design and implement it in 1.x.

I suggest throwing an error when a Dict is passed to broadcast for now

I agree wit that; we probably shouldn't ship with a known broken implementation of broadcast on AbstractDict that doesn't error out.

I don't think it's "broken." It's not a bug, or unintentional. It was a conscious decision that broadcast is different from map: the former treats things as scalars by default, and the latter as iterators; the former "broadcasts" scalars to match array arguments, whereas the latter errors for mismatched dimensions.

Dictionaries, in particular, are not treated as "arrays" because (a) they don't have a shape/dimensionality, (b) they have an undefined iteration order, and (c) we intentionally had only a small whitelist of things that are array-like containers for broadcast.

Of course a "do what I mean" behavior would be better, but the point is that there are some times when you want one behavior and some times the other.

Making it an error just means you get neither behavior. It doesn't seem like an improvement.

For an example of where the current behavior is useful, think of anything where you are broadcasting over the keys:

julia> d = Dict(3=>4, 5=>6, 7=>20)
Dict{Int64,Int64} with 3 entries:
  7 => 20
  3 => 4
  5 => 6

julia> get.(d, 1:10, 0)
10-element Array{Int64,1}:
  0
  0
  4
  0
  6
  0
 20
  0
  0
  0

Erroring on a Dict argument to broadcast would break this, and gain us nothing useful in return.

Personally, I find it quite unexpected that map(f, x) is different from f.(x). And these issues suggest I'm not alone. In fact, I think making these two behave the same is a crucial element to any proposed solution to #18618. That seems to be where that issue is slowly converging.

Note that we've deprecated map(f, dict) mapping over pairs in #5794 with the intention that we can change it to map over values in the future, so have space to design map here, too. That'll change.

The big question is if we have:

map(sqrt, Dict(:a=>4,:b=>9)) # => Dict(:a=>2.0, :b=>3.0)

and

sqrt.(Dict(:a=>4,:b=>9))    # => Dict(:a=>2.0, :b=>3.0)
Dict(:a=>4, :b=>9) .+ [1,2] # => Dict(:a=>5, :b=>11) # EDIT: THIS MAKES NO SENSE, AND SHOULD ERROR

Then what would happen in the case like Dict(:a=>4, :b=>9) .+ [1 2]? We simply don't have a data structure (built in) for that. I suppose it could be an error or a matrix. Edit: Never mind — we shouldn't allow this operation since the keys between the dict and array don't match.

f.(x) should be either broadcast or map. Longstanding usage for things like .* suggest the former. And those two functions are definitely not the same. map(f, array, scalar) will presumably always throw an error, and map(f, rowvector, columnvector) will never produce a matrix, no?

Almost all of these things seem to boil down to "I want map, not broadcast."

Absolutely these two functions are not the same. But I think they should collapse to the same behavior in the one-argument case.

Even in the one-argument case, there are differences. e.g. map(f, string) maps over characters. But broadcast (rightly, I think), treats strings as "scalars", which is extremely useful for processing collections of strings.

I don't think it makes broadcast more intuitive if its behavior suddenly changes when you have only a single argument.

So your proposal is that whether broadcast treats an argument as a collection or an atom should depend on the number of arguments?

No, definitively not. My preference would be to systematically re-work broadcast along the lines of https://github.com/JuliaLang/julia/issues/18618#issuecomment-360594955, but I know you feel differently.

I know there are examples where it's useful for broadcast to treat strings or dicts (or anything else for that matter, including arrays) as scalars, but "seems useful" does not generalize. broadcast should have a simple general behavior that naturally collapses to map in the 1-argument case.

If we were only dealing with the word broadcast, I could be quite flexible. But with that dot, broadcast is occupying some valuable real estate and so it needs to play nice.

I don't think strings are the best example to decide how broadcast should behave, since they are really the only collection which is more often considered as a scalar than as an actual collection. Better decide the general behavior based on other containers (arrays, tuples, dicts, sets...), and only then ask whether strings should follow that convention or be an exception IMHO.

My mental model of broadcasting is basically that broadcast(f, x_1, ..., x_n) should mean something semantically equivalent to:

  1. For each x_i in x_1...x_n, what are its indices inds_i (also includes what's currently referred to as keys, and the implicit linear indices of types that satisfy the iteration interface but don't have indices). If an argument doesn't have either kind of indices, it is a scalar, which carries a single implicit index (maybe it's wrapped in a Ref to make this possible, and also to allow users to designate scalars).
  2. Try to create a new set of indices inds such that the sequence of functions g_1...g_n is a natural, structure-preserving way to map each index in inds to a single index in each inds_i. If this isn't possible, throw an error.
  3. Return the unique y whose value at each index ind in inds is f(x_1[g_1(ind)], ..., x_n[g_n(ind)]) and whose type is a predictable, minimally complicated type that can hold all such values.

In practice, such a function g_i would have to return a tuple of indices which is splatted into x_i's getindex in order to support Ref, whose getindex takes no arguments.

As @mbauman, such questions of the preferred model are being discussed in #18618. In particular, the question is whether to define that (a) broadcast operates on any iterable (so broadcasting values over keys would mean writing broadcast(f, values(dict)), to express the situation where iteration coincides with indexing), or whether (b) the broadcast operates on indexes (so it only makes sense to use it when the indexes form a cartesian shape, and broadcast(f, dict) is the same as f(dict)), or whether (c) it can also broadcast non-cartesian keys (so broadcasting a dict would require that all other elements be either scalars or have the same indexes, or be an nd-sparse table with more complex rules; and the alternate version are written as either broadcast(f, (dict,)) or broadcast(f, collect(p for p in dict))

Currently, we most nearly implement (b). This proposes we implement (c), which is mostly just a superset of (b). The issue #18618 proposes that we do (a) instead. Summary, in table form, of the possible combinations of (1-D) operations you might want to do and how to do them:

Operation | Current / b | a | c | d | Generator
--|--|--|--|--|--
as-value | broadcast(f, dict) | broadcast(f, (dict,)) | broadcast(f, (dict,)) | broadcast(f, (dict,)) | f(dict)
value-map-keys | Dict(zip(keys(dict), broadcast(f, collect(values(dict))))) | broadcast(f, values(dict)) | broadcast(f, dict) | broadcast(f, dict) | Dict(k => f(v) for (k, v) in dict)
value-map-order | broadcast(f, collect(values(dict))) | broadcast(f, (v for v in values(dict))) | broadcast(f, collect(values(dict)))) | broadcast(f, (v for v in values(dict)))) | Array(f(v) for (k, v) in dict)
pair-map-order | broadcast(f, collect(p for p in dict)) | broadcast(f, (p for p in dict)) | broadcast(f, collect(p for p in dict)) | broadcast(f, (p for p in dict)) | Array(f(p) for p in dict)
pair-map-keys | Dict(broadcast(f, collect(p for p in dict))) | broadcast(f, dict) | Dict(broadcast(f, collect(p for p in dict))) | Dict(broadcast(f, (p for p in dict))) | Dict(f(p) for p in dict)

edit: added column d as suggested below by jekbradbury, which is mostly a superset of d, but where objects without a shape (cartesian axes) and without keys (non-cartesian indices) would instead assume ordered iteration

edit: also added a column of what this broadcast operation would mean in terms of a Generator

I guess I'm arguing that broadcast should operate on indices but fall back to iteration? So column c but without the need for collects?

Ah, good call. I added a column (d) and a note to describe it. It's mostly (c) without collect, but where value-map-order is the same as column (a)

And now also added one more column "Generator", to give an equivalent result in terms of a different expression construct, for more clarity

Based on the discussion with @mbauman here is a list of possible test cases to consider (if they should error and if not what result should be produced):

  • (a=1,b=1) .+ [1 2]
  • (a=1,b=1) .+ (1,2)
  • (a=1,b=1) .+ [1,2]
  • (a=1,b=1) .+ (a=1,b=2)
  • (a=1,b=1) .+ (c=1,d=2)

EDIT: And I "vote up" for a higher priority for this decision as I get requests for DataFrames.jl functionalities that depend on this decision. Thank you!

My opinion has become firmly that broadcast should match-up the values from keys (there is some trickery allowed here for CartesianIndex's) rather than worry about iteration order.

Therefore the 4th one makes sense to me, as does:

(a=1,b=1) .+ (b=2,a=1) --> (a=2,b=3)

The others seem odd to me (and could be errors), except perhaps that named tuples have two kinds of indexes, the Symbols and the Integers (but I personally don't see the latter as particularly useful when you can unwrap it into a tuple so trivially, yet their presence serves to complicate a bit the indexing API).

One of the reasons I think keys is really important and fundamental is really the OP - I want this to work for hash-based Dicts and to be able to write things like dict1 .+ dict2. Unfortunately, for all intents and purposes the relative ordering of the elements of two different hash-based Dicts is not deterministic from the perspective of the programmer (it might actually be deterministic underneath it all but there's no real way for users to align the ordering of two Dicts to be the same, and even if they tried that's not necessarily cheap).

@bkamins If you're thinking of DataFrames.jl I suspect you'll want to match column names together, right?

(I've updated the OP with the two-Dict example)

Currently in DataFrames.jl we have a rule that names must match exactly (order and values), except for push!, when we allow for a mixed order match on names and setindex! of AbstractDict on RHS (where clearly we can only look at values). So we have a slight inconsistency, which I want to get rid of.

If we had a decision in Base (via this PR) what are the rules of combining collections with names I would copy them to DataFrames.jl to be consistent.

OK, thanks, I agree.

I have two more examples I think are important. The first is named tuples and dictionaries:

out = (a = 0, b = 1) .+ Dict(:a => 2, :b => 4)

If broadcast were iteration based then the set of values (Set(values(out))) could depend on details like whether we are running Julia on a 32-bit or 64-bit system.

Another example is mixing different types of AbstractDict:

sortdict .+ hashdict

I feel the only semantically useful definition here is that the broadcast operates by matching up the keys and basically ignoring the iteration order. I can't see anything else that is useful or helpful to users.

Basically, I'd like to be able to write generic code over AbstractDict, like function f(a::AbstractDict, b::AbstractDict); ... ; end. It would be nice to be able to use something like broadcast inside this generic function. Going further, I'd love to be able to duck type this function function f(a, b); ... ; end and let me mix-and-match different containers like dictionaries and named tuples, which really have very similar interfaces.

To reach that goal I believe the best way forward is that map obeys iteration such that first(map(f, a, b)) == f(first(a), first(b)) and broadcast obeys indexing such that broadcast(f, a, b)[i] = f(a[i], b[i]) (for single-dimensional containers; we would do something slightly different where i isa CartesianIndex). With these guarantees I can use map and broadcast in more generic code, the two functions operate have well-defined semantics, and they have somewhat orthogonal concerns (that tend to overlap for e.g. AbstractArray).

a list of possible test cases to consider (if they should error and if not what result should be produced):

  • (a=1,b=1) .+ [1 2]
  • (a=1,b=1) .+ (1,2)
  • (a=1,b=1) .+ [1,2]
  • (a=1,b=1) .+ (a=1,b=2)
  • (a=1,b=1) .+ (c=1,d=2)

How about the test case involving the lhs?

y = Dict(:b=>1)
y .= (a = 1,)

I agree that LHS (broadcasting assignment) test cases are relevant, but they are more complex, so I have left them out intentionally for later.

The reason is that copyto! is usually overloaded by a custom type of LHS anyway, and the first thing we should settle on is what Style should have a Broadcasted object have on RHS (this is needed anyway to implement a proper copyto! method as this Broadcasted object is passed to it anyway).

But I fully agree that we should have a consistent broadcasting + broadcasting assignment system for types defined in Base.

Just for a reference - in DataFrames.jl we have settled that if we have a broadcasting assignment and we have a data frame on LHS then:

  • we use assignment using column and row number;
  • we require column names to match exactly including order if RHS defines column names (currently this is only data frames, but after this Julep I guess we will have more possibilities).

I agree it's clear that broadcasting on dicts should only care about keys and not about order. But the situation is more difficult for named tuples, since they are somewhat a hybrid of a tuple and of a dict. I suggest this:

  • broadcasting a named tuple with a tuple or vector should use the order of values
  • broadcasting a named tuple with a dict should use names. It should also probably return a named tuple, so that the order of the named tuple is preserved
  • broadcasting a named tuple with a named tuple should require both names and order to match

Before implementing everything in Base in one go, how about implementing this outside Base? More specifically, how about _allowing_ to construct Broadcasted object but throw the not-implemented error in the materialization phase? Note that this is impossible at the moment because the error is thrown in broadcastable which is invoked before constructing Broadcasted. This would let people use something like

joinstyle.((a=1,b=1) .+ (a=1,b=2))

where joinstyle is a "magic" function that sets a custom broadcast style implemented in some package.

Importantly, this code would be forward-compatible even after something similar is implemented in Base. I think this let people try this in more serious code base than some toy experimental packages.

Extending this idea, it would be interesting to have different kind "join style":

innerjoin.((a=1,b=1) .+ (a=1,c=2))  # => (a=2,)
outerjoin.((a=1,b=1) .+ (a=1,c=2))  # => (a=2,b=1,c=2)  #???

(I discussed a similar idea in https://github.com/JuliaLang/julia/issues/32081#issuecomment-494648042)

I thought it might be relevant to reference Dictionaries.jl which was recently released, and implements a system for dictionary broadcasting.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

StefanKarpinski picture StefanKarpinski  Â·  3Comments

felixrehren picture felixrehren  Â·  3Comments

m-j-w picture m-j-w  Â·  3Comments

omus picture omus  Â·  3Comments

TotalVerb picture TotalVerb  Â·  3Comments