Altair: ENH: add functions for aggregations

Created on 16 Apr 2018 · 23 Comments · Source: altair-viz/altair

The more I use Altair, the more the string aggregation shortcuts feel awkward to me.

I would propose we add an alternate syntax for creating aggregates based on top-level altair functions, and also add a top-level bin() function as a shortcut to binning.

For example, instead of

chart.encode(
    x=alt.X('x', bin=alt.Bin(maxbins=20)),
    y='mean(y):Q'
)

we could instead have a syntax like this:

chart.encode(
    x=alt.bin('x', maxbins=20),
    y=alt.mean('y', 'Q')
)
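To make the proposal concrete, here is a minimal sketch of what such top-level helpers might return. It uses plain dicts in place of Altair's channel classes. The names `mean` and `bin_` are hypothetical, chosen for illustration only, and are not part of Altair's API.

```python
# Hypothetical helpers: plain dicts stand in for Altair channel objects.
TYPE_CODES = {"Q": "quantitative", "O": "ordinal", "N": "nominal", "T": "temporal"}


def mean(field, type_code=None):
    """Build a field definition equivalent to the 'mean(field):TYPE' shorthand."""
    spec = {"field": field, "aggregate": "mean"}
    if type_code is not None:
        spec["type"] = TYPE_CODES[type_code]
    return spec


def bin_(field, **params):
    """Build a binned field definition; keyword arguments become bin parameters."""
    return {"field": field, "bin": params if params else True}


encoding = {
    "x": bin_("x", maxbins=20),
    "y": mean("y", "Q"),
}
print(encoding)
```

Each helper returns a complete field definition, which is why it can stand in for the whole channel. That is also the point the later comments about the conceptual model push back on.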

The downside is that this is another step away from the user directly touching the structure of the spec, and might encourage a wrong mental map that may lead to other confusions. To some degree, though, that is already an issue with the current string-based shortcuts.

The upside is that it is much more natural in Python to use actual functions for these functions rather than writing them out in strings.

@ellisonbg @kanitw @domoritz I'd love to hear your thoughts on this. If they're generally positive, I can work on an implementation (there will be a couple of subtleties, but it shouldn't be too bad).

enhancement

All 23 comments

I like it, as it is more Pythonic, easier to debug (typos can be caught), and enables autocomplete.

I like alt.bin('x', maxbins=20) too! Great idea!

I feel like this breaks the conceptual model.

I feel like this breaks the conceptual model.

That's my concern too, particularly for the bin function.

But I think we've already bent the conceptual model a bit by supporting the "mean(x):Q" form.

That said, I think it would be a net loss to fully maintain the conceptual model and require the user to type alt.X('x', aggregate='mean', type='quantitative') instead.

This suggestion was mainly inspired by comments from a few people who were turned off by the string-parsing aspect of our shorthand approach. With that in mind, what if I simplify the proposal and just make it so that anywhere the string-style "mean(x):Q" shorthand is used, we can use alt.mean('x') instead (with an optional second argument defaulting to "quantitative")?

I think the simplified proposal is better - it keeps those functions simple and doesn't try to combine them with other unrelated functionality. Although I think the question of the type shorthand should be mostly separate to this. For example, I think it makes a bit more sense to do alt.mean('x:Q') than try to teach these functions about types.

The other consideration is this: will users see that these are Python functions and try things like myfunction('x')? The benefit of having these functions as strings is that it is clear to users that they are not in Python land.

Also, I don't think we want to have these functions assume a default type, as that will override our existing pandas based type inference which happens later.

I am open to finding a better way of expressing these aggregations, just think we need to keep iterating...

I feel like this breaks the conceptual model.

I’m not as familiar with pythonic idioms as y’all so I’d be curious to hear more about how/why the original proposal breaks the conceptual model.

What if instead of alt.mean('x') we use alt.agg_mean('x')?

Benefits are two-fold: it indicates that there's something different going on than np.mean, and allows alt.agg_<TAB> to list available aggregates.
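A rough sketch of how such prefixed helpers could be generated and discovered. `alt_sketch` is a hypothetical stand-in namespace, not the real `altair` module, and the helpers simply return the existing shorthand strings. The list of operations is a small subset of Vega-Lite's aggregates.

```python
from types import SimpleNamespace

# A small subset of Vega-Lite aggregate operations, for illustration.
AGGREGATES = ("count", "max", "mean", "median", "min", "sum")


def _make_aggregate(op):
    """Create a helper that renders the existing shorthand, e.g. 'mean(x)'."""
    def helper(field):
        return f"{op}({field})"
    helper.__name__ = f"agg_{op}"
    return helper


# Hypothetical namespace standing in for the altair module.
alt_sketch = SimpleNamespace(
    **{f"agg_{op}": _make_aggregate(op) for op in AGGREGATES}
)

print(alt_sketch.agg_mean("x"))
# Roughly what tab completion on the "agg_" prefix would offer:
print([name for name in dir(alt_sketch) if name.startswith("agg_")])
```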

I’d be curious to hear more about how/why the original proposal breaks the conceptual model.

The way I'd put it is that the conceptual model is this: Altair fundamentally is a Python object wrapper to the vega-lite schema. Shortcuts that change that (like making x=alt.bin('x', maxbins=20) effectively map to x=alt.X(field='x', bin=alt.BinParams(maxbins=20))) may be convenient for some users in the short-term, but lead users to having the wrong mental model for how things work and thus raise a barrier to going deeper and building more complicated charts.

For the conceptual model issue, I'd argue that bin, aggregate, and timeUnit are functions, so having function forms in Python would match users' mental model of the operations better (but yeah, it is a bit of a mismatch with Vega-Lite's JSON syntax).

In a way, if Vega-Lite weren't using JSON, we would have made it look more like a function.

Note that we thought about using expression strings like mean(field_name) in VL, but that would make it hard to do programmatic generation + introduce another different expression language (in addition to normal Vega expression) so we decided not to do that in Vega-Lite.

The part that is problematic for me is the usage of an aggregation function as the entire encoding channel (as in x=alt.mean('col')). That confuses a "has a" relationship with an "is a" relationship. Channels have an aggregation, but they are not one. The key point is that the second you want to add anything else to the channel (axes, scale, etc.), it starts to look really weird (do you pass a scale to mean?). That is a sign that the abstractions of a channel and its aggregation are being confused and are no longer orthogonal.

--
Brian E. Granger
Associate Professor of Physics and Data Science
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub

Thanks Brian. My problem is that we've already broken that abstraction, by allowing users to use x='mean(x):Q' instead of x=alt.X('x', aggregate='mean', type='quantitative'). Given that the abstraction is already broken, I think a more Pythonic API is preferable.

the second you want to add anything else to the channel (axes, scale, etc). it starts to look really weird

This is how I would see that. The old way:

alt.Chart(data).mark_point().encode(
   x='mean(x):Q',
   y=alt.Y('mean(y):Q', axis=alt.Axis(...))
)

The new way:

alt.Chart(data).mark_point().encode(
   x=alt.mean('x', 'Q'),
   y=alt.Y(alt.mean('y', 'Q'), axis=alt.Axis(...))
)

But since aggregations are almost always quantitative, it would actually be this:

alt.Chart(data).mark_point().encode(
   x=alt.mean('x'),
   y=alt.Y(alt.mean('y'), axis=alt.Axis(...))
)

The alternative notation is not a conceptual break from the old notation... it just removes the need to embed function calls within strings.
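One way to picture this, as a sketch only: the aggregation helper produces a field definition, and a channel wrapper merges it with channel-level options, so the channel still "has" an aggregation. Both `mean` and `channel` are hypothetical names operating on plain dicts, and the default type follows the "almost always quantitative" assumption above.

```python
def mean(field, type_="quantitative"):
    """Hypothetical helper: a field definition with a mean aggregate."""
    return {"field": field, "aggregate": "mean", "type": type_}


def channel(definition, **options):
    """Hypothetical channel wrapper: merge a field definition with
    channel-level options such as axis or scale."""
    spec = dict(definition)  # copy so the helper's output is not mutated
    spec.update(options)
    return spec


encoding = {
    "x": mean("x"),
    "y": channel(mean("y"), axis={"title": "Mean of y"}),
}
print(encoding["y"])
```

The axis is passed to the channel, never to `mean`, which keeps the two concerns separate in the same way the string shorthand inside alt.Y does.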

Ahh, now I see your motivation a bit more clearly. I was focused more on the bin case, where you were also passing additional arguments (maxbins).

Thinking out loud a bit more...

The real motivation in my mind is to make the simple case really simple: x='column'. To me, that syntax, with the type suffix :Q, allows very quick exploration of the space of visualizations. A user can quickly change any of the following with little typing:

I think that is the main usage case for allowing an argument other than an actual channel. The shorthand x='mean(x):Q' allows a broader set of explorations to be done with minimal changes.

The syntax x='mean(x):Q' does break the abstractions, but that breakage is also there for simple x='x'. If we add the syntax x=alt.mean('x') it involves even more edits to go through the sequence of explorations x='x' -> x='x:Q' -> x=alt.mean('x:Q') -> x=alt.X(alt.mean('x:Q'), scale=alt.Scale(...)).

If this new syntax was a full replacement for the very simple cases (x='x'), the benefit would be more clear. Looking at the two cases:

  • x=alt.X('x', aggregate='mean')
  • x=alt.X(alt.agg.mean('x'))

It isn't obvious that I would use the second. However, I have observed students being uncertain about our existing syntax:

  • This: x='mean(x:Q)' ?
  • Or this: x='mean(x):Q' ?

That is a sign that there are other reasons to reconsider this notation. I think your proposals for fixing this are pretty natural ways of resolving this awkwardness, but it feels like the original goal of keeping the simple things simple has been lost a bit. Hmm, thoughts?

To be honest, with something like alt.X('x', aggregate='mean', type='quantitative') I'm still not sure whether the quantitative type is supposed to apply to the input x or the output mean(x), so I think that confusion is unrelated to the particular syntax used for aggregation.

And in any case, since the vast majority of aggregations are quantitative (I'm having a hard time thinking of examples of non-quantitative aggregations, to be honest), this new syntax would remove the confusion about what part of the expression the type refers to.

whether the quantitative type is supposed to apply to the input x or the output mean(x)

It is applied to the output of the transform. For example, if we have a category field cat, we can use {"aggregate": "distinct", "field": "cat", "type": "quantitative"}.

The rationale for making the type post-transform is that we actually visualize the output of the transform, and Vega-Lite needs to know the type of this output in order to configure scales correctly.

Thus, x='mean(x):Q' is actually more accurate than x='mean(x:Q)'.
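The mapping from shorthand to field definition can be sketched with a small parser. This is a simplified illustration under stated assumptions (field names limited to word characters, single-letter type codes only), not Altair's actual shorthand parser; `parse_shorthand` is a hypothetical name.

```python
import re

TYPE_CODES = {"Q": "quantitative", "O": "ordinal", "N": "nominal", "T": "temporal"}

# Either "aggregate(field)" or a bare "field", followed by an optional ":TYPE".
SHORTHAND = re.compile(
    r"^(?:(?P<aggregate>\w+)\((?P<inner>\w*)\)|(?P<field>\w+))"
    r"(?::(?P<type>[QONT]))?$"
)


def parse_shorthand(shorthand):
    """Turn e.g. 'mean(x):Q' into a Vega-Lite style field definition dict."""
    match = SHORTHAND.match(shorthand)
    if match is None:
        raise ValueError(f"cannot parse shorthand: {shorthand!r}")
    spec = {}
    if match["aggregate"]:
        spec["aggregate"] = match["aggregate"]
    field = match["inner"] if match["aggregate"] else match["field"]
    if field:
        spec["field"] = field
    if match["type"]:
        # The type sits outside the parentheses: it describes the output.
        spec["type"] = TYPE_CODES[match["type"]]
    return spec


print(parse_shorthand("mean(x):Q"))
print(parse_shorthand("distinct(cat):Q"))
```

In this sketch, 'mean(x:Q)' raises ValueError, because the type code is only accepted after the closing parenthesis.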

Note: this reminds me that we should explain this better in VL docs - so I'm adding https://github.com/vega/vega-lite/pull/3617

I agree with Brian that the goal should be to have few edits when you want to make a small change. The shorthand may be an exception but I don't think it justifies other exceptions.

I agree with ham that the type should be defined outside the aggregate transform/function.

I'm having a hard time thinking of examples of non-quantitative aggregations, to be honest

Since the type refers to the output, the examples I always think of are max and min.

Since the type refers to the output, the examples I always think of are max and min.

Do you mean something like taking the max of a quantitative value, and then treating the output as a nominal?

I'm thinking that the max of an ordinal is still an ordinal.

Also, Altair has the opportunity to use smart defaults for the types as it knows the raw type (string, int, ...). Vega-Lite cannot do this as it never sees the data.

Altair could use nominal by default for strings, ordinal for low cardinality integers, temporal for dates, and quantitative for high cardinality ints and floats. Then users only have to set the type when they want to override the default behavior.
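A sketch of the heuristic described above, written against plain Python values rather than a dataframe. The function name `infer_type` and the cardinality threshold of 10 are assumptions made for illustration; this is not Altair's actual inference code.

```python
from datetime import date


def _is_number(value):
    """True for ints and floats, but not bools (which subclass int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def infer_type(values, max_ordinal_cardinality=10):
    """Guess a Vega-Lite type for a column of raw Python values."""
    values = [v for v in values if v is not None]
    if not values:
        return "nominal"
    # datetime is a subclass of date, so this covers both.
    if all(isinstance(v, date) for v in values):
        return "temporal"
    if all(isinstance(v, str) for v in values):
        return "nominal"
    if all(_is_number(v) for v in values):
        all_ints = all(isinstance(v, int) for v in values)
        if all_ints and len(set(values)) <= max_ordinal_cardinality:
            return "ordinal"
        return "quantitative"
    # Mixed or unrecognized values: fall back to nominal.
    return "nominal"


print(infer_type(["a", "b", "a"]))
print(infer_type([1, 2, 3, 1]))
print(infer_type(list(range(100))))
print(infer_type([date(2018, 4, 16)]))
```

An explicit type from the user would simply bypass this function, which matches the "only set the type to override the default" behavior suggested above.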

Altair could use nominal by default for strings, ordinal for low cardinality integers, temporal for dates, and quantitative for high cardinality ints and floats. Then users only have to set the type when they want to override the default behavior.

This is already the case when data is passed as a dataframe

I'm new to Altair; I like the idea of using functions rather than strings as a more pythonic API.

For those of us following along, what was the resolution to this issue, if any?

The resolution was that we couldn't agree on a change, so we closed the proposal.

@jakevdp Gotcha, thanks!
