Altair: Non-deterministic behavior with `mark_boxplot`

Created on 9 May 2019 · 22 Comments · Source: altair-viz/altair

The following code will not order the categories on the x-axis the same way on repeated runs:

import pandas as pd
import numpy as np
import altair as alt

d = pd.DataFrame({
    'x': np.repeat(np.arange(3), 10),
    'y': np.concatenate([np.random.normal(i, 0.1, 10) for i in range(3)]),
    'z': np.repeat(np.arange(3, 0, -1), 10),
}).sort_values(
    'z',
)

alt.Chart(
    d,
).mark_boxplot(
    opacity=1.0,
).encode(
    x=alt.X('x:N', sort=None),
    y='y:Q',
    color='z:N',
)

This does not seem to happen with mark_circle.

My understanding was that using sort=None with alt.X would respect the order in the data (i.e., z).

All 22 comments

It appears that the non-determinism is coming from the data, not from Altair (repeating the plot with the same dataset always results in the same sort order). It seems that when sort is set to None, the exact order of the axes depends on the contents of the data, which seems reasonable to me. If you want a specific sort order that is consistent across different input datasets, you can specify it explicitly.
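One way to check this is to fix the random seed so that every run builds the same DataFrame. The sketch below is only an illustration: `make_data` is a hypothetical helper, and the use of `np.random.default_rng` and `kind='stable'` are additions that are not in the original report.

```python
import numpy as np
import pandas as pd


def make_data(seed):
    """Rebuild the example frame from the report with a fixed seed (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'x': np.repeat(np.arange(3), 10),
        'y': np.concatenate([rng.normal(i, 0.1, 10) for i in range(3)]),
        'z': np.repeat(np.arange(3, 0, -1), 10),
    }).sort_values('z', kind='stable')  # stable sort keeps ties in their original order


first = make_data(0)
second = make_data(0)

# Same seed, same data: if two renders of these frames still differed,
# the difference would have to come from the plotting side, not the input.
print(first.equals(second))
```

With the input pinned down like this, repeated renders can be compared directly.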

I thought that might be what's going on, but I ruled it out because in all my repeats I never observed a case where a y value for x = 2 was lower than for x = 1; but then shouldn't x = 2 always be first (or last, depending on the convention that Altair's using)? Sometimes it's in the middle even though it has the highest values of y. Sorry if I'm missing something obvious here.

I don't think there's any obvious reason why it would be sorted one way or the other... it's more of an implementation detail of Vega/Vega-Lite.

My understanding is this: when you say sort=None, you're explicitly saying that the sort order does not matter, so internal implementation details may affect the order. If the order is important to you, you should supply a sort argument that is not None.

I see. Then I guess this is more of a feature request: it would be nice to have a way of specifying that the order I want is the order in which the data appears. But it sounds like this may be a pain to implement, given that it depends on Vega-Lite implementation details.

Incidentally, the reason I want this is because I couldn't find out how to sort on multiple columns with EncodingSortField.

I guess I'm not understanding what you need... in the example you gave, you can specify whatever sort order for x you wish in the chart spec. Can you give an example of where you're unable to do that?

Yes, with one sorting variable, Altair's sorting works fine, but if there were another column (say, w), then how can I sort on w first and then z (i.e., df.sort_values(['w', 'z'])) with EncodingSortField? As far as I can tell it takes only a single field.

I just discovered that this doesn't work, presumably because this boxplot implementation, like so many others, does not work on non-consecutive intervals on the x-axis?

import altair as alt
import pandas as pd
from vega_datasets import data

source = pd.read_json(data.population.url)

alt.Chart(source.sample(100)).mark_boxplot(extent='min-max').encode(
    x='age:O',
    y='people:Q'
)

The plot does not show, no error message either.

@michaelaye When I run your code, I see this (using Colab with the most recent version of Altair: https://colab.research.google.com/drive/1C36B4r9Wo_OWHDqlxxabp_mj-YyWsOLE)
[Image: chart rendered in Colab]

Can you share more details about what frontend you're using (JupyterLab, Jupyter Notebook, Nteract, Vegascope, etc.), what version of Altair, and whether there are any error messages in the Javascript console?

(Oh, totally separately: you can use data.population() in place of pd.read_json(data.population.url))

Interesting, I just had a passing case, but it was again when all bins were filled, so I think my presumption is correct. It's really fascinating that nobody has a working implementation of boxplot over time/non-consecutive data points.

Sure can provide more info, I was saving a detailed report for a new issue if you think it's warranted:

Here's the console error:

[Screenshot: console error]

My system:

  • Safari 12.1.1 on macOS 10.14.5.
  • altair 3.1.0
  • jupyter lab 1.0.2

conda info:

     active environment : py37
    active env location : /Users/klay6683/miniconda3/envs/py37
            shell level : 4
       user config file : /Users/klay6683/.condarc
 populated config files : /Users/klay6683/.condarc
                          /Users/klay6683/miniconda3/envs/py37/.condarc
          conda version : 4.7.5
    conda-build version : not installed
         python version : 3.7.3.final.0
       virtual packages : 
       base environment : /Users/klay6683/miniconda3  (writable)
           channel URLs : https://conda.anaconda.org/michaelaye/osx-64
                          https://conda.anaconda.org/michaelaye/noarch
                          https://conda.anaconda.org/conda-forge/osx-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/osx-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/osx-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /Users/klay6683/miniconda3/pkgs
                          /Users/klay6683/.conda/pkgs
       envs directories : /Users/klay6683/miniconda3/envs
                          /Users/klay6683/.conda/envs
               platform : osx-64
             user-agent : conda/4.7.5 requests/2.22.0 CPython/3.7.3 Darwin/18.6.0 OSX/10.14.5
                UID:GID : 273771:2260
             netrc file : None
           offline mode : False

Interesting... I can't reproduce that at all. Ran it several dozen times to get different random seeds. I tried running with smaller samples to try to reproduce your hypothesis of it being due to non-contiguous bins. I'm not sure how to help since I can't reproduce the issue myself.

Can you specify a particular random seed for which you see this problem?

(Also, your presumption that "nobody has a working implementation of boxplot over time/non-consecutive data points" is not accurate or useful here)

> (Also, your presumption that "nobody has a working implementation of boxplot over time/non-consecutive data points" is not accurate or useful here)

You may have misread my intent. This is not a complaint but an attempt to understand why I have failed to find a plotting library that places the boxes of a box plot mathematically instead of categorically.
In that sense, stating this presumption might indeed be helpful, because it might point to a mismatch in expectations about what this kind of plot is supposed to do.

Not accurate? Please point to a plotting library that does plot boxplots over x-axis positions according to their values instead of equi-distantly placed categories? I checked matplotlib, seaborn, plotly, holoviews, bokeh. Also your above plot shows regular x-axis points without any bin missing.

To be more clear, I removed the ages of 25:

subsample = source.query("age!= 25")

and plotted above code using that:

[Screenshot: resulting plot]

This does not throw an error, but it does not do what I need: in this case, I expect a hole, with no box placed at the x-axis value of 25; instead, the box for 30 appears where, mathematically, 25 should be.
In other words, the boxes are placed in a non-mathematical way as pure category bins, not at their mathematically correct linear position. Not sure how to say it differently.

OK, so are you no longer seeing the error you reported?

No, the error is still there, for example using

subsample = source.sample(100, random_state=0)
alt.Chart(subsample).mark_boxplot(extent='min-max').encode(
    x='age:O',
    y='people:Q'
)

When I run that code I see this, using the most recent version of Altair:
[Image: rendered chart]

As to your other question about identifying a plotting library that places boxes mathematically instead of categorically in a box plot: if you would like to force bins without data to be part of the x scale, in Altair you can use the scale domain argument:

subsample = source.query("age!= 25")
alt.Chart(subsample).mark_boxplot(extent='min-max').encode(
    x=alt.X('age:O', scale=alt.Scale(domain=np.arange(0, 95, 5).tolist())),
    y='people:Q'
)

[Image: rendered chart]

And by "most recent", do you mean 3.1.0 or GH master? On which frontend?
I tried switching to the classic notebook using the /tree URL of the running JLab server, and it shows the same problem:

[Screenshot]

I just switched to my Mac and tried on Safari, and I can see the behavior you reported (it works fine on Chrome and Firefox). It's not an Altair issue, but rather a Vega-Lite issue (you can see it here in the Vega editor).

I spent a while trying to find Safari's developer tools to attempt to diagnose the issue, but gave up because it's Saturday night :smile:

I would report this issue on the Vega-Lite issue tracker.

Thanks for creating the issue. I had trouble understanding how to minimize the spec; I first needed to learn all the vocab, like "spec".
I have one quick question, if you'll allow me to abuse this GH issue once more: why does setting the x type to quantitative not produce the plot you created with alt.X('age:O', scale=alt.Scale(domain=np.arange(0, 95, 5).tolist()))? Isn't it conceptually the same idea, to use age at its face value for the axis scale?

I'm getting this when I try:

[Screenshot]

It actually kinda works, b/c one can see that no box-median is drawn exactly where I expect the holes, it's just that the graphic is messed up, so it's getting very close.

That looks like a bug in Vega-Lite's boxplot macro. Would you like to report it there?

Oh, so you are saying my understanding is correct and it should work? Sure, I can report it.
