Plotly.js: value ascending / value descending

Created on 5 Mar 2019 · 32Comments · Source: plotly/plotly.js

I'm making categorical histograms (with and without z and histfunc: sum). I'd like to be able to sort these histograms in descending by value order without using transforms.

Maybe this would fit under categoryorder: 'value ascending' and categoryorder: 'value descending'? https://github.com/plotly/plotly.js/pull/419/files#diff-0e41c2e162564438ff091d0ed6b5b455R472

feature

Source

chriddyp

Most helpful comment

wanted to braindump part of a conversation I had with @antoinerg here. It's not critical to resolve this for 1.48 but I want to make sure we don't lock ourselves out of a more complete API.

It seems that in categoryorder = "value (a|de)scending" mode there are at least 3 extra pieces of information we need to specify the flavours of ordering I can imagine:

Which traces' values should be included in computing the ordering at all?
How/should we aggregate the per-trace values if there are 2 or more values per trace in a given category?
How/should we aggregate the among-trace values if there are two or more traces with values in a given category?

The motivation for 1 is that I might want to sort the axis by a single trace, and we'd likely want to specify a way of breaking ties, so in extremis we'd want to allow an exhaustive ordered list, and we'd also want to specify "no among-trace order preference".

The motivation for 2 is that in the case of grouped bars at least, I might want to sort by the height of the maximum stack, rather than the maximum single element.

The motivation for 3 is pretty clear: we need to aggregate somehow.

Taking just the case of 3 and 2 together, we might imagine an attr that takes two functions that interact such that my max-stack idea would be "max sum". In that case, the options that @antoinerg already implemented already work nicely as "max max", "min min" and "sum sum". It would also allow us to do something like "mean median" to order grouped box-plots in a particular way. Not all options make sense, clearly, such as "min max" or "mean mean" or "median median" so we could imagine a world where we enumerate the options perhaps.

So in terms of a half-baked attempt at an API, we could accept two attrs: a composite aggregation function that accepts "sum", "max sum" and friends (default=sum?), and a trace-scope array that defaults to [] meaning "no preference" but accepts an ordered list of ... trace.uids? trace indices?

nicolaskruchten on 15 May 2019

👍2 ❤1

All 32 comments

I suppose similarly with stacked bar charts. in this case, it'd be nice to keep the groups (grouping the bars by color) but sort within each group. the sort could be the sum of all bars _or_ the bars within that trace.

chriddyp on 5 Mar 2019

Maybe this would fit under categoryorder: 'value ascending' and categoryorder: 'value descending'

It _could_, but maybe we want this sorting-by-value feature for non-category axes also?

etpinard on 5 Mar 2019

What would it mean to sort a non-category (ie continuous) axis?

alexcjohnson on 5 Mar 2019

maybe we want this sorting-by-value feature for non-category axes also

I can't think of any use case for this off the top of my head

chriddyp on 5 Mar 2019

In the presence of multiple traces, one should be able to specify how to aggregate the values (sum, mean, max, median, etc...). This will require a new attribute. @nicolaskruchten can you remind me the name you suggested for this attribute.

antoinerg on 24 Apr 2019

I don't think I had a good name for it :) categoryorderfunc ?

Please note that this will need to apply not only across traces but across all values of the same category (i.e. you can have multiple data rows at the same category within a trace too!)

nicolaskruchten on 24 Apr 2019

👍1

In branch sort-by-value (compare to master), I use the values obtained from calc to create new sorted categorical axes that are then used to perform a second calc operation.

Unfortunately, running calc again is not optimal, but it is necessary at the moment since the calc routine is sometimes coupled to the axis object (for example in histogram, it is used for the autobinning logic). Note that this second calc call will only happen if categoryorder is "value (de|a)scending".

cc @plotly/plotly_js, @alexcjohnson

antoinerg on 25 Apr 2019

Very nicely done @antoinerg

I don't think we can implement this w/o calling _module.calc twice, so :ok_hand: . We should make sure though that we _only_ call _module.calc twice on subplots with categoryorder: value (de|a)scending" axes (hint: try using gd._fullLayout.xaxis._traceIndices).

Looks like the trickiest part of this PR now becomes this block:

                for(var k = 0; k < cd.length; k++) {
                    if(type === 'scatter') {
                        categoriesValue[cd[k].x][1] += cd[k].y;
                    } else if(type === 'histogram') {
                        categoriesValue[cd[k].p][1] += cd[k].s;
                    }
                }

which will be hard to generalize. Note that categoryorder works also on polar, carpet and gl3d (which doesn't use gd.calcdata) axes. Note also that "key" in questions depends not only on the trace type but also on the axis letter. For example,

                    if(type === 'scatter') {
                        categoriesValue[cd[k].x][1] += cd[k].y;
                    }

works when the x-axis has a set categoryorder, but we should have:

                    if(type === 'scatter') {
                        categoriesValue[cd[k].y][1] += cd[k].x;
                    }

for y-axes with categoryorder.

etpinard on 25 Apr 2019

👍1

Quick question: for horizontal bars, when ordering categories by ascending value, should we have the biggest value at the top or the bottom? In the figure below, the biggest value is at the top. I think I prefer the opposite but I'm not completely sure either. What do you think?
newplot (70)

antoinerg on 25 Apr 2019

should we have the biggest value at the top or the bottom?

You could say the same about category (ascending|descending) - ascending will put a at the bottom and e at the top so they're actually in reverse alphabetical order. In fact you could even make that argument about the order based on data in the trace.

I guess with a categorical Y axis you generally do read the graph from top to bottom, as opposed to numerical axes that you almost always read bottom to top. But making that the default behavior I think has much more extensive consequences than just choosing it for the new feature. So unless we're prepared to alter the rest of that behavior, I think ascending has to put the biggest at the top.

alexcjohnson on 25 Apr 2019

👍2

Looks like the trickiest part of this PR now becomes this block:

                for(var k = 0; k < cd.length; k++) {
                    if(type === 'scatter') {
                        categoriesValue[cd[k].x][1] += cd[k].y;
                    } else if(type === 'histogram') {
                        categoriesValue[cd[k].p][1] += cd[k].s;
                    }
                }

which will be hard to generalize.

Yes, indeed. What trace types should initially be supported?

@nicolaskruchten What aggregation do we want to support initially? Would sum, min and max be sufficient?

antoinerg on 25 Apr 2019

Sum/min/max is a fine start for this release, yes!

nicolaskruchten on 25 Apr 2019

👍1

Adding avg can be done in one pass also though no?

nicolaskruchten on 25 Apr 2019

👍1

Re trace types I would say bar and histogram are quite important

nicolaskruchten on 25 Apr 2019

👍1

What trace types should initially be supported?

ALL 2d cartesian please.

etpinard on 26 Apr 2019

ALL 2d cartesian please.

Ok, then I will need to inspect the output of calc data for each traces.

Looks like the trickiest part of this PR now becomes this block:

                for(var k = 0; k < cd.length; k++) {
                    if(type === 'scatter') {
                        categoriesValue[cd[k].x][1] += cd[k].y;
                    } else if(type === 'histogram') {
                        categoriesValue[cd[k].p][1] += cd[k].s;
                    }
                }

which will be hard to generalize.

To support all traces, we could either have a long list of else if in src/plots/plots.js or we could also add a new field with a standardized name to each trace's calcdata.

To be continued

antoinerg on 26 Apr 2019

add a new field with a standardized name to each trace's calcdata.

sounds like the winner to me! To be computed & added only when necessary.

alexcjohnson on 26 Apr 2019

👍1

sounds like the winner to me! To be computed & added only when necessary.

Good, I also think it's a winner. histogram already has x and y and all traces on cartesian axes should probably have them. I could simply remove the distinction in commit https://github.com/plotly/plotly.js/commit/d53e52cb78078de433ebaa786e8d9a806846a0a1 and it just works.

I now need to write tests to ensure all cartesian traces have x y in their calcdata and compute/add it when necessary in the ones that don't.

Thanks @alexcjohnson for pointing me in the right direction!

antoinerg on 26 Apr 2019

Ok, I made good progress in branch sort-by-value (compare to master), all 2d cartesian traces work except for:

scattergl
contour
contourcarpet
heatmap which work but is buggy because of issue https://github.com/plotly/plotly.js/issues/1097

To be continued

antoinerg on 27 Apr 2019

Awesome work @antoinerg !

Would you be interested in trying to fix #1097 at some point this week? Sounds like fixing that bug and adding some custom logic for scattergl should be enough to :white_check_mark: this feature. I'm not sure if contourcarpet traces even work on category axes, I'm not sure if they _should_ work either :smirk:

etpinard on 29 Apr 2019

I'm not sure if contourcarpet traces even work on category axes, I'm not sure if they _should_ work either

Ok good to know. They are indeed a bit different and I wasn't sure how to tackle them on Friday :) I'll focus on scattergl and fixing #1097!

antoinerg on 29 Apr 2019

🎉1

Note: if we add median as a possible sort ordering we get stuff like Ridgeplots almost for free, which is really really nice :)

nicolaskruchten on 2 May 2019

👍1

Does the current design leave room for a future way of saying "sort by the values of trace X" ?

nicolaskruchten on 6 May 2019

Does the current design leave room for a future way of saying "sort by the values of trace X" ?

How "future"? Does the data/traces split count? 😉

alexcjohnson on 6 May 2019

less future than that :) ... I guess if we added "first" and "last" and expected people to reorder their traces to match that'd be a workaround. But it would be nice to maybe sort by the nth?

nicolaskruchten on 6 May 2019

Does the current design leave room for a future way of saying "sort by the values of trace X" ?

Thank you for the comment @nicolaskruchten. It will be easy to include in the code: when collecting values associated with each category, we could loop only over the traces we care about at this line: https://github.com/plotly/plotly.js/compare/sort-by-value#diff-ad4f76ccd6044ed16514297078e13b84R2855.

As for how it should be specified via attributes, I am not sure yet. In the end, we need to deduce the indices of the traces we want to consider. So maybe an array of trace indices? :thinking:

antoinerg on 6 May 2019

Just leaving a note here: I assume that this sorting will take into account matching axes, right? I.e. if I have two traces on separate subplots where one's axis matches the other, sorting will apply to both and take both into account?

nicolaskruchten on 7 May 2019

question re sorting within/across traces...

For a chart like this:

what should sort by "max" do? the height of the blues implies "p2, p1" but the maximum individual bars ignoring trace-membership implies "p1,p2"

nicolaskruchten on 7 May 2019

❤1

Note: if we add median as a possible sort ordering we get stuff like Ridgeplots almost for free, which is really really nice :)

As discussed in private, adding support for median or say mean raises the following question in the presence of multiple traces: do you want the mean of the means or the mean of all values. We probably will need to support both.

antoinerg on 10 May 2019

wanted to braindump part of a conversation I had with @antoinerg here. It's not critical to resolve this for 1.48 but I want to make sure we don't lock ourselves out of a more complete API.

It seems that in categoryorder = "value (a|de)scending" mode there are at least 3 extra pieces of information we need to specify the flavours of ordering I can imagine:

Which traces' values should be included in computing the ordering at all?
How/should we aggregate the per-trace values if there are 2 or more values per trace in a given category?
How/should we aggregate the among-trace values if there are two or more traces with values in a given category?

The motivation for 2 is that in the case of grouped bars at least, I might want to sort by the height of the maximum stack, rather than the maximum single element.

The motivation for 3 is pretty clear: we need to aggregate somehow.

nicolaskruchten on 15 May 2019

👍2 ❤1

I'll leave the comment above as-is but upon further reflection, I think that point 1 is maybe interesting but doesn't allow you to sort by 'trace 1 with ties broken by trace 3'. For that we need a modification on point 3.

Elaborating: sort by max of per-trace sum of values of trace 1 is fine, but sort by max of per-trace sum of values of traces 1 and 3 isn't the same as sort by the sum of the values of trace 1, breaking ties with the sum of values of trace 3 so simply adding a trace-scoping operator won't be enough.

Maybe the 'outer' operator needs to be able to say not only sum or max or whatever but also [trace 1, trace 3] so you could say sort by [select trace 1, trace 3] of sum of values.

nicolaskruchten on 16 May 2019

@antoinerg can you open a new issue about up-coming category* improvements?

etpinard on 17 May 2019

Was this page helpful?

0 / 5 - 0 ratings