I think this has been suggested in various places before (i.e. I deserve no credit for this idea), but I couldn't find an issue for it, so here it is.
The motivation for something like this is the @group .. into statement in Query.jl. With those one often gets an array of named tuples, and a super typical next step is that one wants to run some aggregation function over one specific field of the named tuple. Say A is an array of named tuples, then I might want to write something like mean(map(i->i.b,A)) to take the mean of column b.
Idea 1 would be to simply make A..b syntactic sugar for map(i->i.b,A). The aggregation expression would then be written as mean(A..b).
Idea 2 is based on an observation by @JeffBezanson in #21875:
Some languages use .a as short for x -> x.a, which is kind of nice.
Which is probably somehow related to this issue, but I'm not entirely sure.
I think maybe idea 2a might be something like .b.(A) instead of A..b? Not sure, more putting this out here for discussion. The aggregation would then be written as mean(.b.(A)). I find that a bit confusing, though.
Maybe idea 2b could be to still have .b mean i->i.b, and then make sure that all aggregation functions like mean etc. take an anonymous function as their first argument, so that one could always write these aggregations as, say, mean(.b, A).
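To make the comparison concrete, here is a minimal sketch of the two forms that already work today. The sample data in A is made up for illustration, and mean comes from the Statistics standard library. The second form is the one idea 2b would shorten to mean(.b, A).

```julia
using Statistics  # standard library; provides mean

# Hypothetical sample data: an array of named tuples, as a grouped query might return
A = [(a = 1, b = 2.0), (a = 3, b = 4.0), (a = 5, b = 9.0)]

# Today's verbose form: materialize column b first, then aggregate
m1 = mean(map(i -> i.b, A))

# Map-reduce form: pass the accessor function as the first argument
m2 = mean(i -> i.b, A)
```

Both calls give the same mean; the second avoids allocating the intermediate array.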
davidanthoff/Query.jl#121 in Query.jl currently implements A..b within queries, but I'm a bit hesitant to add too much special syntax in Query.jl, especially around things where we might end up with some other solution in Base.
UPDATE: It seems pretty clear that idea 1 is not a good one, so I changed the title of this issue to refer to idea 2b, which seems the most plausible one.
.. is widely used in math packages to mean an interval, so this would be quite breaking for packages. I also find the syntax .b.(A) quite odd. An abbreviated syntax for this kind of map already exists as getfield.(A, :b), which is equivalent to broadcast(i->i.b, A).
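As a quick check of that claim, the sketch below runs the three spellings on a small made-up array of named tuples. It only compares the resulting values and says nothing about what inference does with each form.

```julia
# Hypothetical sample data
A = [(a = 1, b = 2.0), (a = 3, b = 4.0)]

# Dot-call on getfield; the Symbol :b is treated as a scalar by broadcast
via_getfield = getfield.(A, :b)

# Explicit broadcast of an anonymous accessor
via_broadcast = broadcast(i -> i.b, A)

# Plain map of the same accessor
via_map = map(i -> i.b, A)
```

All three produce a vector holding the b values of the elements of A.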
.. is widely used in math packages to mean an interval, so this would be quite breaking for packages.
Ah, that wouldn't be good. Just out of curiosity, what is an example package like that?
An abbreviated syntax for this kind of map already exists as getfield.(A, :b)
That doesn't seem type stable, whereas both a broadcast and map version are type stable. It also is a tad too verbose for my taste.
Given the .. conflict with other packages, I think my current preference would be idea 2b in that case.
what is an example package like that?
That doesn't seem type stable
I'm confused, why is getfield.(A, :b) not type stable but map(i->i.b, A) is? The former lowers to the same code as broadcast(i->i.b, A).
I think my current preference would be idea 2b
Of those proposed I do prefer 2b as well, though I'm still not really a fan of it. i->i.b, while more verbose, is IMO clearer than .b, since we use prefix . for dot-broadcasted infix operators. Explicitly providing the i in i.b makes it clear that it's a getfield rather than a broadcasted operator of some kind.
I'm confused, why is getfield.(A, :b) not type stable
I have no idea, I just looked at the output from @code_warntype for all three variants, and the getfield. version was the one that looked type unstable.
Agree that we should keep .. as an operator for intervals, and it's also useful for range queries. I'm fine with the syntax .a for x->x.a though.
When you look at code_warntype for getfield.(A, :b), it applies typeof to all the arguments first, so you'll see code for type Symbol as the final argument. But at a particular call site the constant :b will be taken into account.
This is a little sketchy to me. It's sort of introducing a global namespace of field names. What kind of accessor is .name, and what kinds of properties do you expect of this operation? I don't think you can really say, and so these things can't be used in generic code.
We already have a global namespace of field names, as does every other object-oriented language. In any case, those issues apply equally to a.b and getfield(a, :b); .b is just syntax for the same thing.
We already have a global namespace of field names
Those are called without a leading . though.
It still seems really weird and confusing to me to be omitting the object from which you're getting the field. What's wrong with i->i.b?
-1 from me. . can already be a daunting, seemingly-magical concept to newcomers because of the broadcast lowering. The last thing we need is for it to have more magical properties.
What's wrong with i->i.b?
For my Query.jl use case it is just too verbose (e.g. see this comment).
. can already be a daunting, seemingly-magical concept to newcomers because of the broadcast lowering. The last thing we need is for it to have more magical properties.
I hear you, that worries me too. I'm not particularly wedded to this syntax, but so far I couldn't think of anything better, and (at least from my perspective) the benefits of having something for this use-case outweigh the costs, even if we end up using the .b notation.
If we adopt @JeffBezanson's suggestion for dot overloading, then Field{:b}(x) could be defined as x.b.
In my mind, the main use for this is for things like map and broadcast/dot calls. For example: map(Field{:b}, x) or sqrt.(Field{:b}.(foo.(x))). Or, in @davidanthoff's example, @select {g.key.metric, m = myfun(Field{:score}.(g), Field{:track_id}.(g)) }.
Field{:b} is reasonably terse while remaining fairly readable and explicit. (And if it is not terse enough, we could use dot overloading to make this equivalent to Field.b.)
(Is there a problem that dot overloading doesn't solve? 😉 )
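For illustration, a minimal sketch of how such a Field type could be written by hand. The name Field is taken from the comment above, but the definition itself is an assumption and not something that exists in Base; getproperty is the property-access function that x.b lowers to in Julia 0.7 and later. The sample data is made up.

```julia
# Hypothetical marker type: the field name lives in the type parameter.
# This definition is an assumption for illustration, not part of Base.
struct Field{name} end

# Calling Field{:b}(x) returns x.b instead of constructing a Field
Field{name}(x) where {name} = getproperty(x, name)

# Hypothetical sample data
x = [(a = 1, b = 4.0), (a = 2, b = 9.0)]

# Field{:b} is a callable type, so it works with map and dot calls
col = map(Field{:b}, x)
roots = sqrt.(Field{:b}.(x))
```

Because the field name is a type parameter, each Field{:b} call is specialized on the name at compile time.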
Another possibility would be to use $.b as sugar for x -> x.b and $[i] as sugar for x -> x[i], but $ is pretty overloaded already.
Or _.b and _[i], since we're already turning _ into a quasi-magical placeholder symbol (#9343)?
I definitely feel your pain about verboseness though, @davidanthoff. Perhaps we just have to resort to having a macro that goes in front of a query rather than relying on changes to Julia syntax though.
resort to having a macro that goes in front of a query than relying on changes to Julia syntax
I think it's highly valuable to try to think of generally-usable syntax that makes macros less necessary.
OK sure, I am all for bending Julia's syntax to be more accommodating to data analysis :) I was just trying to be sensitive to the valid complaints that Julia syntax should not become the symbol soup of Mathematica etc.
I would still like to have a terse function syntax based on _ so that _[i] and _.b work as @stevengj mentions above, but it's not a feature we need for 1.0 and since _ is already disallowed as an r-value, we're in the clear to give it some new meaning in the future.
Basically, _ could become an implicit single-argument currying syntax when used as an r-value. f(_, y) would be sugar for x -> f(x, y), and _.b and _[i] would just be special cases of this for getfield and getindex. People have also suggested using ~ for this. (See also #5571 and #554.)
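To spell out what the proposed sugar would stand for, here is a sketch that writes the expansions by hand. The _ syntax itself is only a proposal in this thread, so the code uses ordinary anonymous functions, plus Base.Fix2, which fixes the second argument of a two-argument function. The sample data and variable names are made up.

```julia
# Hypothetical sample data
A = [(a = 1, b = 2.0), (a = 3, b = 4.0)]

# What `_.b` would abbreviate
get_b = x -> x.b

# What `_[2]` would abbreviate
second = x -> x[2]

# What `f(_, y)` would abbreviate, here with f = + and y = 10
add10 = x -> x + 10

# The same accessor built with Base.Fix2: Fix2(f, y)(x) calls f(x, y)
get_b_fix = Base.Fix2(getproperty, :b)

col = map(get_b, A)
col_fix = map(get_b_fix, A)
```

The Fix2 version carries the function and the fixed argument in its type, which an anonymous function does not.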
_.b is definitely an appealing option here. The syntax rule could be that the anonymous function contains the single function call directly containing the _. (Similar to how T{<:S} puts where outside one set of curly braces.)
Here's a previous discussion with a bunch of good examples to check against: https://github.com/JuliaLang/julia/issues/5571#issuecomment-157424665
Note that @davidanthoff can already use the _.b syntax in Query.jl, since it parses just fine.
I really like the _.b idea, and especially that I can use it now :)
For my use-case it does kind of rely on reducer functions having a combined map-reduce method that accepts a map function as an argument. Currently many reducer functions don't have such a method. Over in #20402 @StefanKarpinski has one item "Reducers APIs. Make sure reducers have consistent behaviors – all take a map function before reduction; congruent dimension arguments, etc." I guess if that happens for 1.0 all is good and we would have a pretty elegant solution for the Query.jl use-case (and many others). Thanks all for the great ideas :)
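As a rough sketch of what that consistency looks like: several Base reducers already accept a function as their first argument, and a reducer that lacks such a method can be given one in a line. The function spread below is a made-up example of the latter, not an existing API, and the sample data is hypothetical.

```julia
# Hypothetical sample data
A = [(a = 1, b = 2.0), (a = 3, b = 4.0), (a = 5, b = 9.0)]
getb = i -> i.b

# Base reducers that already take a map function first
total = sum(getb, A)
largest = maximum(getb, A)
product = mapreduce(getb, *, A)

# Hypothetical reducer without a built-in map-reduce method
spread(itr) = maximum(itr) - minimum(itr)
# Adding the map-first method by delegating to the plain one
spread(f, itr) = spread(map(f, itr))

range_b = spread(getb, A)
```

With a method like this on every reducer, an accessor shorthand such as _.b could be passed uniformly as the first argument.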
How about .b being Field{:b}? Then .b.a would be broadcast(Field{:b}, a).
Eh maybe inconsistent if .field is essentially a function with special suffix syntax. Still, plain old .field would be very useful, because once the compiler knows field is a type parameter, not a value, all sorts of operations can be shifted from run time to compile time.
I thought a bit more about this, and I think I could actually solve the original issue in Query.jl that motivated this issue in a much more elegant way if we had dot overloading a la #1974. So from my point of view we could close this issue and just add one more cheer for #1974.
Essentially, I could then extend the Grouping container that holds results from a @group operation in Query.jl so that g.a would extract column a from the group g if g happens to be a collection of NamedTuples. That would be much more consistent with some future table type where df.a would extract a column from a table type, something that would also be enabled by #1974.
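A minimal sketch of that idea, assuming dot overloading via Base.getproperty as it exists in Julia 0.7 and later. The type MyGrouping and its fields are invented for illustration and are not Query.jl's actual Grouping implementation.

```julia
# Hypothetical stand-in for a grouping container: a key plus the grouped rows
struct MyGrouping{K,T<:NamedTuple} <: AbstractVector{T}
    key::K
    elements::Vector{T}
end

# Minimal AbstractVector interface; getfield bypasses the overloaded dot
Base.size(g::MyGrouping) = size(getfield(g, :elements))
Base.getindex(g::MyGrouping, i::Int) = getfield(g, :elements)[i]

# g.key returns the group key; any other name extracts that column from the rows
function Base.getproperty(g::MyGrouping, name::Symbol)
    name === :key && return getfield(g, :key)
    return [getproperty(row, name) for row in getfield(g, :elements)]
end

g = MyGrouping("x", [(a = 1, b = 2.0), (a = 3, b = 4.0)])
col_a = g.a
total_b = sum(g.b)
```

One design wrinkle in this sketch: a column that happens to be called key would be shadowed by the group key.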
I'm going to close this issue because I can essentially solve this in a really good way for Query.jl with the new dot-overloading.
Having said that, one crazy idea might be to add such a dot-overloaded method to any AbstractArray. A modest version would be for any AbstractArray that holds named tuples, the radical option would be for just any AbstractArray. In that world, if a is an AbstractArray, a.b would always end up extracting a collection of the b properties of the individual elements of a.
Wouldn't that be implicit vectorization of the kind we've moved away from?
Hm, I'm not sure? It would unify the user API for arrays-of-struct and struct-of-array containers in the table world. I assume DataFrame at some point will get df.a as a shortcut for df[:a], and then a DataFrame and an array of named tuples would both provide x.a as a way to get the a column. I'm not sure that is good, but it could be done ;) I guess another question is what else a.b could mean...
But in any case, clearly not 1.0 stuff.
To me, a table-like thing is semantically always an array or collection of structs. It might be stored as a struct of arrays, but should have the same API. So for example map(i->i.a, table)
can be O(1) and non-copying if the table is stored as a struct of arrays. That needs better syntax, but you get the idea.
It would be convenient to make . syntax available to package authors, lowered to something like .variable => dot(:variable), where dot isn't defined in Base. I'd like to be able to use dot to create custom keys. Currently the syntax to get symbols into the type domain is somewhat ugly; alternatives like Dot(:variable) and dot"variable" are definitely not as pretty as .variable.
Why do you need it in the type domain? IPO on 0.7 will propagate symbols as constants to any inlined functions. From there you can lift them to the type domain yourself if you really wish.
Yeah, I've worked pretty heavily trying to get constant propagation to work. Even with the changes in 0.7, constant propagation is finicky. It's disabled during recursion, and it doesn't work through slurps, making lispy tuple programming very difficult (though not impossible with the judicious use of @pure). Unless constant propagation becomes a semantic guarantee, it's a lot more reliable just to keep everything in the type domain as early as possible. Making dot available to package authors could potentially satisfy both my and David's needs.
See for example https://discourse.julialang.org/t/is-this-pure/8050/6
@bramtayl, in #24990, _.variable already gets lowered to a Fix2{typeof(getproperty),...} object that you could dispatch on.
And #26826 seeks to address constant propagation through varargs.
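To show what dispatching on such an object could look like, here is a sketch that builds the Fix2 by hand with Base.Fix2, since the _.variable lowering is specific to that pull request. The alias PropGetter and the function describe are made-up names for illustration.

```julia
# Type of a getproperty call with its second argument fixed to a Symbol
const PropGetter = Base.Fix2{typeof(getproperty), Symbol}

# Hypothetical function that treats property accessors specially
describe(f::PropGetter) = "property accessor for $(f.x)"
describe(f) = "some other function"

# Built by hand here; stands in for what `_.b` would lower to under that proposal
getb = Base.Fix2(getproperty, :b)

value = getb((a = 1, b = 2))
label = describe(getb)
other = describe(sqrt)
```

The fixed Symbol is available as the x field of the Fix2 object, so a package can recover the property name at run time.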
Another issue is that constant propagation doesn't survive keyword arguments and named tuples.
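Oh and here's another option: define a custom type with overloaded dots, something like
struct Key{s} end  # assumed definition; Key is not defined anywhere in this thread
struct K end
# getproperty has to be extended from Base for k.a to dispatch here
@inline Base.getproperty(k::K, s::Symbol) = Key{s}()
const k = K()
k.a  # returns Key{:a}()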