Julia: Julep: `broadcast` with dictionaries

Created on 6 Feb 2018 · 28Comments · Source: JuliaLang/julia

Current behavior

I just learned that apparently broadcast doesn't work as I expected on dictionaries:

julia> d = Dict(:a => "Alice", :b => "Bob", :c => "Charlie")
Dict{Symbol,String} with 3 entries:
  :a => "Alice"
  :b => "Bob"
  :c => "Charlie"

julia> map(kv -> kv[1] => kv[2], d)
Dict{Symbol,String} with 3 entries:
  :a => "Alice"
  :b => "Bob"
  :c => "Charlie"

julia> broadcast(kv -> kv[1] => kv[2], d)
ERROR: KeyError: key 1 not found
Stacktrace:
 [1] getindex at ./dict.jl:474 [inlined]
 [2] (::##17#18)(::Dict{Symbol,String}) at ./REPL[11]:1
 [3] broadcast(::Function, ::Dict{Symbol,String}) at ./broadcast.jl:455

julia> broadcast(length, d)
3

AFAICT it seems broadcast is working on dictionaries like they are scalars or something. Someone correct me if I'm wrong - but maybe we never added any broadcast methods for dictionaries?

Proposal

Make broadcast work with dictionaries. I can see two possibilities, of which I favor option 2:

broadcast for 1D indexable containers generally works like map, so we can extend this to AbstractDict and have broadcast map the key-value pairs
broadcast deals with matching indices and performing some operation on values. So broadcast(f, dict) could be a bit like map(f, values(dict)) except the dictionary structure and it's indices are preserved in the output.

It seems to me that map deals with the iteration protocol and broadcast is more about matching indices and mapping those values. Using option 2 would give:

julia> length.(d)
Dict{Symbol,Int64} with 3 entries:
  :a => 5
  :b => 3
  :c => 7

IMO this kind of syntax could help a lot with manipulating dictionaries (one example is using AbstractDict for output of groupby-style operations, and then acting on the groups) while making no breaking changes to dictionary iteration or map.

To me it also dovetails with the "getindices julep" via this comment https://github.com/JuliaLang/julia/issues/24019#issuecomment-363253768 which basically suggests we can get multi-valued getindex between different indexable container types including dictionaries via the lowering transformation:

a.[inds] --> broadcast(i -> a[i], inds)

EDIT: Addendum (27/07/2019)

I forgot to mention a crucial use case of broadcasting over dictionaries where you have two or more dictionaries with the same keys and you want to match up the values based on the keys. For example dict1 .+ dict2. For our hash-based Dict, the user can't really predict the iteration order and they may actually differ between dict1 and dict2 even if the user can guarantee that they share the same set of keys (for example, dict2 might have been rehashed and dict1 not, but IMO this implementation detail shouldn't affect the semantics of the broadcast operation).

broadcast collections julep

Source

andyferris

👍1

Most helpful comment

I don't think strings are the best example to decide how broadcast should behave, since they are really the only collection which is more often considered as a scalar than as an actual collection. Better decide the general behavior based on other containers (arrays, tuples, dicts, sets...), and only then ask whether strings should follow that convention or be an exception IMHO.

nalimilan on 7 Feb 2018

👍6 ❤2

All 28 comments

Duplicate of #18618. Do you mind if we try to keep all the discussion consolidated there?

mbauman on 6 Feb 2018

The additional complexity compared with #18618 is that for Dicts we need to decide whether the broadcasted function should receive values or pairs. So maybe we can discuss that issue here, and keep the discussion regarding general iterables at #18618.

I suggest throwing an error when a Dict is passed to broadcast for now (just like I think we should throw an error for an iterator which doesn't implement the right methods), so that we can discuss the design and implement it in 1.x.

nalimilan on 6 Feb 2018

👀1 👎1 👍1

I suggest throwing an error when a Dict is passed to broadcast for now

I agree wit that; we probably shouldn't ship with a known broken implementation of broadcast on AbstractDict that doesn't error out.

andyferris on 7 Feb 2018

I don't think it's "broken." It's not a bug, or unintentional. It was a conscious decision that broadcast is different from map: the former treats things as scalars by default, and the latter as iterators; the former "broadcasts" scalars to match array arguments, whereas the latter errors for mismatched dimensions.

Dictionaries, in particular, are not treated as "arrays" because (a) they don't have a shape/dimensionality, (b) they have an undefined iteration order, and (c) we intentionally had only a small whitelist of things that are array-like containers for broadcast.

Of course a "do what I mean" behavior would be better, but the point is that there are some times when you want one behavior and some times the other.

Making it an error just means you get neither behavior. It doesn't seem like an improvement.

stevengj on 7 Feb 2018

For an example of where the current behavior is useful, think of anything where you are broadcasting over the keys:

julia> d = Dict(3=>4, 5=>6, 7=>20)
Dict{Int64,Int64} with 3 entries:
  7 => 20
  3 => 4
  5 => 6

julia> get.(d, 1:10, 0)
10-element Array{Int64,1}:
  0
  0
  4
  0
  6
  0
 20
  0
  0
  0

Erroring on a Dict argument to broadcast would break this, and gain us nothing useful in return.

stevengj on 7 Feb 2018

Personally, I find it quite unexpected that map(f, x) is different from f.(x). And these issues suggest I'm not alone. In fact, I think making these two behave the same is a crucial element to any proposed solution to #18618. That seems to be where that issue is slowly converging.

Note that we've deprecated map(f, dict) mapping over pairs in #5794 with the intention that we can change it to map over values in the future, so have space to design map here, too. That'll change.

The big question is if we have:

map(sqrt, Dict(:a=>4,:b=>9)) # => Dict(:a=>2.0, :b=>3.0)

and

sqrt.(Dict(:a=>4,:b=>9))    # => Dict(:a=>2.0, :b=>3.0)
Dict(:a=>4, :b=>9) .+ [1,2] # => Dict(:a=>5, :b=>11) # EDIT: THIS MAKES NO SENSE, AND SHOULD ERROR

~~Then what would happen in the case like Dict(:a=>4, :b=>9) .+ [1 2]? We simply don't have a data structure (built in) for that. I suppose it could be an error or a matrix.~~ Edit: Never mind — we shouldn't allow this operation since the keys between the dict and array don't match.

mbauman on 7 Feb 2018

👍3 ❤1

f.(x) should be either broadcast or map. Longstanding usage for things like .* suggest the former. And those two functions are definitely not the same. map(f, array, scalar) will presumably always throw an error, and map(f, rowvector, columnvector) will never produce a matrix, no?

Almost all of these things seem to boil down to "I want map, not broadcast."

stevengj on 7 Feb 2018

Absolutely these two functions are not the same. But I think they should collapse to the same behavior in the one-argument case.

mbauman on 7 Feb 2018

👍3

Even in the one-argument case, there are differences. e.g. map(f, string) maps over characters. But broadcast (rightly, I think), treats strings as "scalars", which is extremely useful for processing collections of strings.

I don't think it makes broadcast more intuitive if its behavior suddenly changes when you have only a single argument.

stevengj on 7 Feb 2018

So your proposal is that whether broadcast treats an argument as a collection or an atom should depend on the number of arguments?

stevengj on 7 Feb 2018

No, definitively not. My preference would be to systematically re-work broadcast along the lines of https://github.com/JuliaLang/julia/issues/18618#issuecomment-360594955, but I know you feel differently.

mbauman on 7 Feb 2018

👍3

I know there are examples where it's useful for broadcast to treat strings or dicts (or anything else for that matter, including arrays) as scalars, but "seems useful" does not generalize. broadcast should have a simple general behavior that naturally collapses to map in the 1-argument case.

If we were only dealing with the word broadcast, I could be quite flexible. But with that dot, broadcast is occupying some valuable real estate and so it needs to play nice.

JeffBezanson on 7 Feb 2018

👍3

nalimilan on 7 Feb 2018

👍6 ❤2

My mental model of broadcasting is basically that broadcast(f, x_1, ..., x_n) should mean something semantically equivalent to:

For each x_i in x_1...x_n, what are its indices inds_i (also includes what's currently referred to as keys, and the implicit linear indices of types that satisfy the iteration interface but don't have indices). If an argument doesn't have either kind of indices, it is a scalar, which carries a single implicit index (maybe it's wrapped in a Ref to make this possible, and also to allow users to designate scalars).
Try to create a new set of indices inds such that the sequence of functions g_1...g_n is a natural, structure-preserving way to map each index in inds to a single index in each inds_i. If this isn't possible, throw an error.
Return the unique y whose value at each index ind in inds is f(x_1[g_1(ind)], ..., x_n[g_n(ind)]) and whose type is a predictable, minimally complicated type that can hold all such values.

In practice, such a function g_i would have to return a tuple of indices which is splatted into x_i's getindex in order to support Ref, whose getindex takes no arguments.

jekbradbury on 7 Feb 2018

As @mbauman, such questions of the preferred model are being discussed in #18618. In particular, the question is whether to define that (a) broadcast operates on any iterable (so broadcasting values over keys would mean writing broadcast(f, values(dict)), to express the situation where iteration coincides with indexing), or whether (b) the broadcast operates on indexes (so it only makes sense to use it when the indexes form a cartesian shape, and broadcast(f, dict) is the same as f(dict)), or whether (c) it can also broadcast non-cartesian keys (so broadcasting a dict would require that all other elements be either scalars or have the same indexes, or be an nd-sparse table with more complex rules; and the alternate version are written as either broadcast(f, (dict,)) or broadcast(f, collect(p for p in dict))

Currently, we most nearly implement (b). This proposes we implement (c), which is mostly just a superset of (b). The issue #18618 proposes that we do (a) instead. Summary, in table form, of the possible combinations of (1-D) operations you might want to do and how to do them:

edit: added column d as suggested below by jekbradbury, which is mostly a superset of d, but where objects without a shape (cartesian axes) and without keys (non-cartesian indices) would instead assume ordered iteration

edit: also added a column of what this broadcast operation would mean in terms of a Generator

vtjnash on 7 Feb 2018

❤3 👍2

I guess I'm arguing that broadcast should operate on indices but fall back to iteration? So column c but without the need for collects?

jekbradbury on 7 Feb 2018

Ah, good call. I added a column (d) and a note to describe it. It's mostly (c) without collect, but where value-map-order is the same as column (a)

vtjnash on 7 Feb 2018

And now also added one more column "Generator", to give an equivalent result in terms of a different expression construct, for more clarity

vtjnash on 7 Feb 2018

👍1

Based on the discussion with @mbauman here is a list of possible test cases to consider (if they should error and if not what result should be produced):

(a=1,b=1) .+ [1 2]
(a=1,b=1) .+ (1,2)
(a=1,b=1) .+ [1,2]
(a=1,b=1) .+ (a=1,b=2)
(a=1,b=1) .+ (c=1,d=2)

EDIT: And I "vote up" for a higher priority for this decision as I get requests for DataFrames.jl functionalities that depend on this decision. Thank you!

bkamins on 26 Jul 2019

❤1

My opinion has become firmly that broadcast should match-up the values from keys (there is some trickery allowed here for CartesianIndex's) rather than worry about iteration order.

Therefore the 4th one makes sense to me, as does:

(a=1,b=1) .+ (b=2,a=1) --> (a=2,b=3)

The others seem odd to me (and could be errors), except perhaps that named tuples have two kinds of indexes, the Symbols and the Integers (but I personally don't see the latter as particularly useful when you can unwrap it into a tuple so trivially, yet their presence serves to complicate a bit the indexing API).

One of the reasons I think keys is really important and fundamental is really the OP - I want this to work for hash-based Dicts and to be able to write things like dict1 .+ dict2. Unfortunately, for all intents and purposes the relative ordering of the elements of two different hash-based Dicts is not deterministic from the perspective of the programmer (it might actually be deterministic underneath it all but there's no real way for users to align the ordering of two Dicts to be the same, and even if they tried that's not necessarily cheap).

@bkamins If you're thinking of DataFrames.jl I suspect you'll want to match column names together, right?

andyferris on 27 Jul 2019

(I've updated the OP with the two-Dict example)

andyferris on 27 Jul 2019

Currently in DataFrames.jl we have a rule that names must match exactly (order and values), except for push!, when we allow for a mixed order match on names and setindex! of AbstractDict on RHS (where clearly we can only look at values). So we have a slight inconsistency, which I want to get rid of.

If we had a decision in Base (via this PR) what are the rules of combining collections with names I would copy them to DataFrames.jl to be consistent.

bkamins on 27 Jul 2019

❤1

OK, thanks, I agree.

I have two more examples I think are important. The first is named tuples and dictionaries:

out = (a = 0, b = 1) .+ Dict(:a => 2, :b => 4)

If broadcast were iteration based then the set of values (Set(values(out))) could depend on details like whether we are running Julia on a 32-bit or 64-bit system.

Another example is mixing different types of AbstractDict:

sortdict .+ hashdict

I feel the only semantically useful definition here is that the broadcast operates by matching up the keys and basically ignoring the iteration order. I can't see anything else that is useful or helpful to users.

Basically, I'd like to be able to write generic code over AbstractDict, like function f(a::AbstractDict, b::AbstractDict); ... ; end. It would be nice to be able to use something like broadcast inside this generic function. Going further, I'd love to be able to duck type this function function f(a, b); ... ; end and let me mix-and-match different containers like dictionaries and named tuples, which really have very similar interfaces.

To reach that goal I believe the best way forward is that map obeys iteration such that first(map(f, a, b)) == f(first(a), first(b)) and broadcast obeys indexing such that broadcast(f, a, b)[i] = f(a[i], b[i]) (for single-dimensional containers; we would do something slightly different where i isa CartesianIndex). With these guarantees I can use map and broadcast in more generic code, the two functions operate have well-defined semantics, and they have somewhat orthogonal concerns (that tend to overlap for e.g. AbstractArray).

andyferris on 27 Jul 2019

a list of possible test cases to consider (if they should error and if not what result should be produced):

(a=1,b=1) .+ [1 2]

(a=1,b=1) .+ (1,2)

(a=1,b=1) .+ [1,2]

(a=1,b=1) .+ (a=1,b=2)

(a=1,b=1) .+ (c=1,d=2)

How about the test case involving the lhs?

y = Dict(:b=>1)
y .= (a = 1,)

tkf on 28 Jul 2019

I agree that LHS (broadcasting assignment) test cases are relevant, but they are more complex, so I have left them out intentionally for later.

The reason is that copyto! is usually overloaded by a custom type of LHS anyway, and the first thing we should settle on is what Style should have a Broadcasted object have on RHS (this is needed anyway to implement a proper copyto! method as this Broadcasted object is passed to it anyway).

But I fully agree that we should have a consistent broadcasting + broadcasting assignment system for types defined in Base.

Just for a reference - in DataFrames.jl we have settled that if we have a broadcasting assignment and we have a data frame on LHS then:

we use assignment using column and row number;
we require column names to match exactly including order if RHS defines column names (currently this is only data frames, but after this Julep I guess we will have more possibilities).

bkamins on 28 Jul 2019

👍1

I agree it's clear that broadcasting on dicts should only care about keys and not about order. But the situation is more difficult for named tuples, since they are somewhat a hybrid of a tuple and of a dict. I suggest this:

broadcasting a named tuple with a tuple or vector should use the order of values
broadcasting a named tuple with a dict should use names. It should also probably return a named tuple, so that the order of the named tuple is preserved
broadcasting a named tuple with a named tuple should require both names and order to match

nalimilan on 28 Jul 2019

👍2 ❤1

Before implementing everything in Base in one go, how about implementing this outside Base? More specifically, how about _allowing_ to construct Broadcasted object but throw the not-implemented error in the materialization phase? Note that this is impossible at the moment because the error is thrown in broadcastable which is invoked before constructing Broadcasted. This would let people use something like

joinstyle.((a=1,b=1) .+ (a=1,b=2))

where joinstyle is a "magic" function that sets a custom broadcast style implemented in some package.

Importantly, this code would be forward-compatible even after something similar is implemented in Base. I think this let people try this in more serious code base than some toy experimental packages.

Extending this idea, it would be interesting to have different kind "join style":

innerjoin.((a=1,b=1) .+ (a=1,c=2))  # => (a=2,)
outerjoin.((a=1,b=1) .+ (a=1,c=2))  # => (a=2,b=1,c=2)  #???

(I discussed a similar idea in https://github.com/JuliaLang/julia/issues/32081#issuecomment-494648042)

tkf on 19 Sep 2019

I thought it might be relevant to reference Dictionaries.jl which was recently released, and implements a system for dictionary broadcasting.

andyferris on 9 Dec 2019

❤1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Getting Openssl error while building from source on ubuntu 14.04.3 cmake version 2.8.12.2

arshpreetsingh · 3Comments

Broadcasting a function on an array gives an expression

dpsanders · 3Comments

Consider adding a `VecTuple{N,T}` to hold N `VecElements{T}`s

m-j-w · 3Comments

Literal NaNs in ASTs

Keno · 3Comments

The devdocs page for functions is outdated

ararslan · 3Comments