I just learned that apparently broadcast
doesn't work as I expected on dictionaries:
julia> d = Dict(:a => "Alice", :b => "Bob", :c => "Charlie")
Dict{Symbol,String} with 3 entries:
:a => "Alice"
:b => "Bob"
:c => "Charlie"
julia> map(kv -> kv[1] => kv[2], d)
Dict{Symbol,String} with 3 entries:
:a => "Alice"
:b => "Bob"
:c => "Charlie"
julia> broadcast(kv -> kv[1] => kv[2], d)
ERROR: KeyError: key 1 not found
Stacktrace:
[1] getindex at ./dict.jl:474 [inlined]
[2] (::##17#18)(::Dict{Symbol,String}) at ./REPL[11]:1
[3] broadcast(::Function, ::Dict{Symbol,String}) at ./broadcast.jl:455
julia> broadcast(length, d)
3
AFAICT it seems broadcast
is working on dictionaries like they are scalars or something. Someone correct me if I'm wrong - but maybe we never added any broadcast
methods for dictionaries?
Make broadcast
work with dictionaries. I can see two possibilities, of which I favor option 2:
broadcast
for 1D indexable containers generally works like map
, so we can extend this to AbstractDict
and have broadcast
map the key-value pairsbroadcast
deals with matching indices and performing some operation on values. So broadcast(f, dict)
could be a bit like map(f, values(dict))
except the dictionary structure and it's indices are preserved in the output.It seems to me that map
deals with the iteration protocol and broadcast
is more about matching indices and mapping those values. Using option 2 would give:
julia> length.(d)
Dict{Symbol,Int64} with 3 entries:
:a => 5
:b => 3
:c => 7
IMO this kind of syntax could help a lot with manipulating dictionaries (one example is using AbstractDict
for output of groupby
-style operations, and then acting on the groups) while making no breaking changes to dictionary iteration or map.
To me it also dovetails with the "getindices julep" via this comment https://github.com/JuliaLang/julia/issues/24019#issuecomment-363253768 which basically suggests we can get multi-valued getindex between different indexable container types including dictionaries via the lowering transformation:
a.[inds] --> broadcast(i -> a[i], inds)
I forgot to mention a crucial use case of broadcasting over dictionaries where you have two or more dictionaries with the same keys and you want to match up the values
based on the keys
. For example dict1 .+ dict2
. For our hash-based Dict
, the user can't really predict the iteration order and they may actually differ between dict1
and dict2
even if the user can guarantee that they share the same set of keys
(for example, dict2
might have been rehashed and dict1
not, but IMO this implementation detail shouldn't affect the semantics of the broadcast operation).
Duplicate of #18618. Do you mind if we try to keep all the discussion consolidated there?
The additional complexity compared with #18618 is that for Dict
s we need to decide whether the broadcasted function should receive values or pairs. So maybe we can discuss that issue here, and keep the discussion regarding general iterables at #18618.
I suggest throwing an error when a Dict
is passed to broadcast
for now (just like I think we should throw an error for an iterator which doesn't implement the right methods), so that we can discuss the design and implement it in 1.x.
I suggest throwing an error when a Dict is passed to broadcast for now
I agree wit that; we probably shouldn't ship with a known broken implementation of broadcast
on AbstractDict
that doesn't error out.
I don't think it's "broken." It's not a bug, or unintentional. It was a conscious decision that broadcast
is different from map
: the former treats things as scalars by default, and the latter as iterators; the former "broadcasts" scalars to match array arguments, whereas the latter errors for mismatched dimensions.
Dictionaries, in particular, are not treated as "arrays" because (a) they don't have a shape/dimensionality, (b) they have an undefined iteration order, and (c) we intentionally had only a small whitelist of things that are array-like containers for broadcast
.
Of course a "do what I mean" behavior would be better, but the point is that there are some times when you want one behavior and some times the other.
Making it an error just means you get neither behavior. It doesn't seem like an improvement.
For an example of where the current behavior is useful, think of anything where you are broadcasting over the keys:
julia> d = Dict(3=>4, 5=>6, 7=>20)
Dict{Int64,Int64} with 3 entries:
7 => 20
3 => 4
5 => 6
julia> get.(d, 1:10, 0)
10-element Array{Int64,1}:
0
0
4
0
6
0
20
0
0
0
Erroring on a Dict
argument to broadcast
would break this, and gain us nothing useful in return.
Personally, I find it quite unexpected that map(f, x)
is different from f.(x)
. And these issues suggest I'm not alone. In fact, I think making these two behave the same is a crucial element to any proposed solution to #18618. That seems to be where that issue is slowly converging.
Note that we've deprecated map(f, dict)
mapping over pairs in #5794 with the intention that we can change it to map over values in the future, so have space to design map
here, too. That'll change.
The big question is if we have:
map(sqrt, Dict(:a=>4,:b=>9)) # => Dict(:a=>2.0, :b=>3.0)
and
sqrt.(Dict(:a=>4,:b=>9)) # => Dict(:a=>2.0, :b=>3.0)
Dict(:a=>4, :b=>9) .+ [1,2] # => Dict(:a=>5, :b=>11) # EDIT: THIS MAKES NO SENSE, AND SHOULD ERROR
Then what would happen in the case like Edit: Never mind — we shouldn't allow this operation since the keys between the dict and array don't match.Dict(:a=>4, :b=>9) .+ [1 2]
? We simply don't have a data structure (built in) for that. I suppose it could be an error or a matrix.
f.(x)
should be either broadcast
or map
. Longstanding usage for things like .*
suggest the former. And those two functions are definitely not the same. map(f, array, scalar)
will presumably always throw an error, and map(f, rowvector, columnvector)
will never produce a matrix, no?
Almost all of these things seem to boil down to "I want map
, not broadcast
."
Absolutely these two functions are not the same. But I think they should collapse to the same behavior in the one-argument case.
Even in the one-argument case, there are differences. e.g. map(f, string)
maps over characters. But broadcast
(rightly, I think), treats strings as "scalars", which is extremely useful for processing collections of strings.
I don't think it makes broadcast
more intuitive if its behavior suddenly changes when you have only a single argument.
So your proposal is that whether broadcast
treats an argument as a collection or an atom should depend on the number of arguments?
No, definitively not. My preference would be to systematically re-work broadcast along the lines of https://github.com/JuliaLang/julia/issues/18618#issuecomment-360594955, but I know you feel differently.
I know there are examples where it's useful for broadcast to treat strings or dicts (or anything else for that matter, including arrays) as scalars, but "seems useful" does not generalize. broadcast
should have a simple general behavior that naturally collapses to map
in the 1-argument case.
If we were only dealing with the word broadcast
, I could be quite flexible. But with that dot, broadcast is occupying some valuable real estate and so it needs to play nice.
I don't think strings are the best example to decide how broadcast
should behave, since they are really the only collection which is more often considered as a scalar than as an actual collection. Better decide the general behavior based on other containers (arrays, tuples, dicts, sets...), and only then ask whether strings should follow that convention or be an exception IMHO.
My mental model of broadcasting is basically that broadcast(f, x_1, ..., x_n)
should mean something semantically equivalent to:
x_i
in x_1...x_n
, what are its indices
inds_i
(also includes what's currently referred to as keys
, and the implicit linear indices of types that satisfy the iteration interface but don't have indices). If an argument doesn't have either kind of indices, it is a scalar, which carries a single implicit index (maybe it's wrapped in a Ref
to make this possible, and also to allow users to designate scalars).inds
such that the sequence of functions g_1...g_n
is a natural, structure-preserving way to map each index in inds
to a single index in each inds_i
. If this isn't possible, throw an error.y
whose value at each index ind
in inds
is f(x_1[g_1(ind)], ..., x_n[g_n(ind)])
and whose type is a predictable, minimally complicated type that can hold all such values.In practice, such a function g_i
would have to return a tuple of indices which is splatted into x_i
's getindex
in order to support Ref
, whose getindex
takes no arguments.
As @mbauman, such questions of the preferred model are being discussed in #18618. In particular, the question is whether to define that (a) broadcast
operates on any iterable (so broadcasting values over keys would mean writing broadcast(f, values(dict))
, to express the situation where iteration coincides with indexing), or whether (b) the broadcast
operates on indexes (so it only makes sense to use it when the indexes form a cartesian shape, and broadcast(f, dict)
is the same as f(dict)
), or whether (c) it can also broadcast non-cartesian keys (so broadcasting a dict would require that all other elements be either scalars or have the same indexes, or be an nd-sparse table with more complex rules; and the alternate version are written as either broadcast(f, (dict,))
or broadcast(f, collect(p for p in dict)
)
Currently, we most nearly implement (b). This proposes we implement (c), which is mostly just a superset of (b). The issue #18618 proposes that we do (a) instead. Summary, in table form, of the possible combinations of (1-D) operations you might want to do and how to do them:
Operation | Current / b | a | c | d | Generator
--|--|--|--|--|--
as-value | broadcast(f, dict) | broadcast(f, (dict,)) | broadcast(f, (dict,)) | broadcast(f, (dict,)) | f(dict)
value-map-keys | Dict(zip(keys(dict), broadcast(f, collect(values(dict))))) | broadcast(f, values(dict)) | broadcast(f, dict) | broadcast(f, dict) | Dict(k => f(v) for (k, v) in dict)
value-map-order | broadcast(f, collect(values(dict))) | broadcast(f, (v for v in values(dict))) | broadcast(f, collect(values(dict)))) | broadcast(f, (v for v in values(dict)))) | Array(f(v) for (k, v) in dict)
pair-map-order | broadcast(f, collect(p for p in dict)) | broadcast(f, (p for p in dict)) | broadcast(f, collect(p for p in dict)) | broadcast(f, (p for p in dict)) | Array(f(p) for p in dict)
pair-map-keys | Dict(broadcast(f, collect(p for p in dict))) | broadcast(f, dict) | Dict(broadcast(f, collect(p for p in dict))) | Dict(broadcast(f, (p for p in dict))) | Dict(f(p) for p in dict)
edit: added column d as suggested below by jekbradbury, which is mostly a superset of d, but where objects without a shape (cartesian axes) and without keys (non-cartesian indices) would instead assume ordered iteration
edit: also added a column of what this broadcast operation would mean in terms of a Generator
I guess I'm arguing that broadcast
should operate on indices but fall back to iteration? So column c but without the need for collect
s?
Ah, good call. I added a column (d) and a note to describe it. It's mostly (c) without collect
, but where value-map-order
is the same as column (a)
And now also added one more column "Generator", to give an equivalent result in terms of a different expression construct, for more clarity
Based on the discussion with @mbauman here is a list of possible test cases to consider (if they should error and if not what result should be produced):
(a=1,b=1) .+ [1 2]
(a=1,b=1) .+ (1,2)
(a=1,b=1) .+ [1,2]
(a=1,b=1) .+ (a=1,b=2)
(a=1,b=1) .+ (c=1,d=2)
EDIT: And I "vote up" for a higher priority for this decision as I get requests for DataFrames.jl functionalities that depend on this decision. Thank you!
My opinion has become firmly that broadcast
should match-up the values from keys
(there is some trickery allowed here for CartesianIndex
's) rather than worry about iteration order.
Therefore the 4th one makes sense to me, as does:
(a=1,b=1) .+ (b=2,a=1) --> (a=2,b=3)
The others seem odd to me (and could be errors), except perhaps that named tuples have two kinds of indexes, the Symbol
s and the Int
egers (but I personally don't see the latter as particularly useful when you can unwrap it into a tuple so trivially, yet their presence serves to complicate a bit the indexing API).
One of the reasons I think keys
is really important and fundamental is really the OP - I want this to work for hash-based Dict
s and to be able to write things like dict1 .+ dict2
. Unfortunately, for all intents and purposes the relative ordering of the elements of two different hash-based Dict
s is not deterministic from the perspective of the programmer (it might actually be deterministic underneath it all but there's no real way for users to align the ordering of two Dict
s to be the same, and even if they tried that's not necessarily cheap).
@bkamins If you're thinking of DataFrames.jl I suspect you'll want to match column names together, right?
(I've updated the OP with the two-Dict
example)
Currently in DataFrames.jl we have a rule that names must match exactly (order and values), except for push!
, when we allow for a mixed order match on names and setindex!
of AbstractDict
on RHS (where clearly we can only look at values). So we have a slight inconsistency, which I want to get rid of.
If we had a decision in Base (via this PR) what are the rules of combining collections with names I would copy them to DataFrames.jl to be consistent.
OK, thanks, I agree.
I have two more examples I think are important. The first is named tuples and dictionaries:
out = (a = 0, b = 1) .+ Dict(:a => 2, :b => 4)
If broadcast were iteration based then the set of values (Set(values(out))
) could depend on details like whether we are running Julia on a 32-bit or 64-bit system.
Another example is mixing different types of AbstractDict
:
sortdict .+ hashdict
I feel the only semantically useful definition here is that the broadcast
operates by matching up the keys
and basically ignoring the iteration order. I can't see anything else that is useful or helpful to users.
Basically, I'd like to be able to write generic code over AbstractDict
, like function f(a::AbstractDict, b::AbstractDict); ... ; end
. It would be nice to be able to use something like broadcast
inside this generic function. Going further, I'd love to be able to duck type this function function f(a, b); ... ; end
and let me mix-and-match different containers like dictionaries and named tuples, which really have very similar interfaces.
To reach that goal I believe the best way forward is that map
obeys iteration such that first(map(f, a, b)) == f(first(a), first(b))
and broadcast
obeys indexing such that broadcast(f, a, b)[i] = f(a[i], b[i])
(for single-dimensional containers; we would do something slightly different where i isa CartesianIndex
). With these guarantees I can use map
and broadcast
in more generic code, the two functions operate have well-defined semantics, and they have somewhat orthogonal concerns (that tend to overlap for e.g. AbstractArray
).
a list of possible test cases to consider (if they should error and if not what result should be produced):
(a=1,b=1) .+ [1 2]
(a=1,b=1) .+ (1,2)
(a=1,b=1) .+ [1,2]
(a=1,b=1) .+ (a=1,b=2)
(a=1,b=1) .+ (c=1,d=2)
How about the test case involving the lhs?
y = Dict(:b=>1)
y .= (a = 1,)
I agree that LHS (broadcasting assignment) test cases are relevant, but they are more complex, so I have left them out intentionally for later.
The reason is that copyto!
is usually overloaded by a custom type of LHS anyway, and the first thing we should settle on is what Style
should have a Broadcasted
object have on RHS (this is needed anyway to implement a proper copyto!
method as this Broadcasted
object is passed to it anyway).
But I fully agree that we should have a consistent broadcasting + broadcasting assignment system for types defined in Base.
Just for a reference - in DataFrames.jl we have settled that if we have a broadcasting assignment and we have a data frame on LHS then:
I agree it's clear that broadcasting on dicts should only care about keys and not about order. But the situation is more difficult for named tuples, since they are somewhat a hybrid of a tuple and of a dict. I suggest this:
Before implementing everything in Base
in one go, how about implementing this outside Base
? More specifically, how about _allowing_ to construct Broadcasted
object but throw the not-implemented error in the materialization phase? Note that this is impossible at the moment because the error is thrown in broadcastable
which is invoked before constructing Broadcasted
. This would let people use something like
joinstyle.((a=1,b=1) .+ (a=1,b=2))
where joinstyle
is a "magic" function that sets a custom broadcast style implemented in some package.
Importantly, this code would be forward-compatible even after something similar is implemented in Base. I think this let people try this in more serious code base than some toy experimental packages.
Extending this idea, it would be interesting to have different kind "join style":
innerjoin.((a=1,b=1) .+ (a=1,c=2)) # => (a=2,)
outerjoin.((a=1,b=1) .+ (a=1,c=2)) # => (a=2,b=1,c=2) #???
(I discussed a similar idea in https://github.com/JuliaLang/julia/issues/32081#issuecomment-494648042)
I thought it might be relevant to reference Dictionaries.jl which was recently released, and implements a system for dictionary broadcasting.
Most helpful comment
I don't think strings are the best example to decide how
broadcast
should behave, since they are really the only collection which is more often considered as a scalar than as an actual collection. Better decide the general behavior based on other containers (arrays, tuples, dicts, sets...), and only then ask whether strings should follow that convention or be an exception IMHO.