The following code will not order the categories on the x-axis the same way on repeated runs:
import pandas as pd
import numpy as np
import altair as alt

d = pd.DataFrame({
    'x': np.repeat(np.arange(3), 10),
    'y': np.concatenate([np.random.normal(i, 0.1, 10) for i in range(3)]),
    'z': np.repeat(np.arange(3, 0, -1), 10),
}).sort_values(
    'z',
)

alt.Chart(
    d,
).mark_boxplot(
    opacity=1.0,
).encode(
    x=alt.X('x:N', sort=None),
    y='y:Q',
    color='z:N',
)
This does not seem to happen with mark_circle.
My understanding was that using sort=None with alt.X would respect the order in the data (i.e., z).
It appears that the non-determinism is coming from the data, not from Altair (repeating the plot with the same dataset always results in the same sort order). It seems that when sort is set to None, the exact order of the axes depends on the contents of the data, which seems reasonable to me. If you want a specific sort order that is consistent across different input datasets, you can specify it explicitly.
I thought that might be what's going on, but I ruled it out because in all my repeats I never observed a case where a y value for x = 2 was lower than for x = 1; but then shouldn't x = 2 always be first (or last, depending on the convention that Altair's using)? Sometimes it's in the middle even though it has the highest values of y. Sorry if I'm missing something obvious here.
I don't think there's any obvious reason why it would be sorted one way or the other... it's more of an implementation detail of Vega/Vega-Lite.
My understanding is this: when you say sort=None, you're explicitly saying that the sort order does not matter, so internal implementation details may affect the order. If the order is important to you, you should supply a sort argument that is not None.
I see. Then I guess this is more of a feature request: it would be nice to have a way of specifying that the order I want is the order that the data appears in. But it sounds like this may be a pain to implement, given that it depends on Vega-Lite implementation details.
Incidentally, the reason I want this is because I couldn't find out how to sort on multiple columns with EncodingSortField.
I guess I'm not understanding what you need... in the example you gave, you can specify whatever sort order for x you wish in the chart spec. Can you give an example of where you're unable to do that?
Yes, with one sorting variable, Altair's sorting works fine, but if there were another column (say, w), then how can I sort on w first and then z (i.e., df.sort_values(['w', 'z'])) with EncodingSortField? As far as I can tell it takes only a single field.
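One possible workaround, sketched under the assumption that the multi-column sort can be done in pandas before plotting: sort the DataFrame on both columns, then read off the order in which the categories first appear. The helper category_order and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def category_order(df, by, category):
    """Return the values of `category` in order of first appearance
    after sorting `df` by the columns listed in `by` (hypothetical helper)."""
    ordered = df.sort_values(by)
    return ordered[category].drop_duplicates().tolist()

# Toy data: x is the plotted category, w and z are the sort keys.
df = pd.DataFrame({
    'x': ['a', 'b', 'c', 'd'],
    'w': [1, 0, 1, 0],
    'z': [2, 2, 1, 1],
})

# Sort on w first, then z, and collect the resulting category order.
order = category_order(df, ['w', 'z'], 'x')
```

The resulting list can then be passed as sort=order to alt.X('x:N', ...) in place of an EncodingSortField.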
I just discovered that this doesn't work, presumably because this boxplot implementation, like so many others, does not work on non-consecutive intervals on the x-axis?
import pandas as pd
import altair as alt
from vega_datasets import data
source = pd.read_json(data.population.url)
alt.Chart(source.sample(100)).mark_boxplot(extent='min-max').encode(
x='age:O',
y='people:Q'
)
The plot does not show, no error message either.
@michaelaye When I run your code, I see this (using Colab with the most recent version of Altair: https://colab.research.google.com/drive/1C36B4r9Wo_OWHDqlxxabp_mj-YyWsOLE)

Can you share more details about what frontend you're using (JupyterLab, Jupyter Notebook, Nteract, Vegascope, etc.), what version of Altair, and whether there are any error messages in the Javascript console?
(Oh, totally separately: you can use data.population() in place of pd.read_json(data.population.url))
Interesting, I just had a passing case, but it was again when all bins were filled, so I think my presumption is correct. It's really fascinating that nobody has a working implementation of boxplot over time/non-consecutive data points.
Sure, I can provide more info; I was saving a detailed report for a new issue, if you think it's warranted:
Here's the console error:

My system:
conda info:
active environment : py37
active env location : /Users/klay6683/miniconda3/envs/py37
shell level : 4
user config file : /Users/klay6683/.condarc
populated config files : /Users/klay6683/.condarc
                         /Users/klay6683/miniconda3/envs/py37/.condarc
conda version : 4.7.5
conda-build version : not installed
python version : 3.7.3.final.0
virtual packages :
base environment : /Users/klay6683/miniconda3 (writable)
channel URLs : https://conda.anaconda.org/michaelaye/osx-64
               https://conda.anaconda.org/michaelaye/noarch
               https://conda.anaconda.org/conda-forge/osx-64
               https://conda.anaconda.org/conda-forge/noarch
               https://repo.anaconda.com/pkgs/main/osx-64
               https://repo.anaconda.com/pkgs/main/noarch
               https://repo.anaconda.com/pkgs/r/osx-64
               https://repo.anaconda.com/pkgs/r/noarch
package cache : /Users/klay6683/miniconda3/pkgs
                /Users/klay6683/.conda/pkgs
envs directories : /Users/klay6683/miniconda3/envs
                   /Users/klay6683/.conda/envs
platform : osx-64
user-agent : conda/4.7.5 requests/2.22.0 CPython/3.7.3 Darwin/18.6.0 OSX/10.14.5
UID:GID : 273771:2260
netrc file : None
offline mode : False
Interesting... I can't reproduce that at all. Ran it several dozen times to get different random seeds. I tried running with smaller samples to try to reproduce your hypothesis of it being due to non-contiguous bins. I'm not sure how to help since I can't reproduce the issue myself.
Can you specify a particular random seed for which you see this problem?
(Also, your presumption that "nobody has a working implementation of boxplot over time/non-consecutive data points" is not accurate or useful here)
You may be misreading my intent. This is not a complaint but an attempt to understand why I have failed to find a plotting library that places the boxes of a box plot mathematically instead of categorically.
In that sense, stating this presumption might indeed be helpful, because it might point to a mismatch in expectations about what this kind of plot is actually supposed to do.
Not accurate? Please point to a plotting library that places boxplots at x-axis positions according to their values instead of at equidistant categories. I checked matplotlib, seaborn, plotly, holoviews, and bokeh. Also, your plot above shows regular x-axis points without any bin missing.
To be clearer, I removed the rows with age 25:
subsample = source.query("age != 25")
and ran the code above on that subsample:

This does not throw an error, but it does not do what I need: in this case I expect a hole, with no box placed at the x-axis value of 25; instead, the box for 30 appears where, mathematically, 25 should be.
In other words, the boxes are placed in a non-mathematical way, as pure category bins, not at their mathematically correct linear positions. Not sure how to say it differently.
OK, so are you no longer seeing the error you reported?
No, the error is still there, for example using
subsample = source.sample(100, random_state=0)
alt.Chart(subsample).mark_boxplot(extent='min-max').encode(
x='age:O',
y='people:Q'
)
When I run that code I see this, using the most recent version of Altair:

As to your other question, about identifying a plotting library that places the boxes of a box plot mathematically instead of categorically: if you would like to force bins without data to be part of the x scale, in Altair you can use the scale domain argument:
subsample = source.query("age != 25")
alt.Chart(subsample).mark_boxplot(extent='min-max').encode(
x=alt.X('age:O', scale=alt.Scale(domain=np.arange(0, 95, 5).tolist())),
y='people:Q'
)
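If hard-coding the domain is inconvenient, it could be derived from the data instead. The following is only a sketch, assuming integer values on an evenly spaced grid; the helper full_domain and the sample ages are hypothetical.

```python
import numpy as np

def full_domain(values, step):
    """List every bin from the smallest to the largest of `values`,
    spaced `step` apart, including bins with no data (hypothetical helper).
    Assumes integer values that lie on a regular grid."""
    lo, hi = int(np.min(values)), int(np.max(values))
    return np.arange(lo, hi + step, step).tolist()

ages = [0, 5, 10, 15, 20, 30, 35]  # 25 is missing from the data
domain = full_domain(ages, 5)
```

The result can be passed as alt.Scale(domain=domain), so the empty bin at 25 still gets its own slot on the ordinal axis.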

And by "most recent", do you mean 3.1.0 or GH master? On which frontend?
I tried switching to notebook using the /tree URL for the running JLab server, and it shows the same problem:

I just switched to my Mac and tried on Safari, and I can see the behavior you reported (it works fine on Chrome and Firefox). It's not an Altair issue, but rather a Vega-Lite issue (you can see it here in the Vega editor).
I spent a while trying to find Safari's developer tools to attempt to diagnose the issue, but gave up because it's Saturday night :smile:
I would report this issue on the Vega-Lite issue tracker.
Thanks for creating the issue. I had trouble understanding how to minimize the spec; I first needed to learn all the vocabulary, like "spec".
I have one quick question, if you'll allow me to abuse this GH issue once more: why does setting the x type to quantitative not produce the plot you created with the alt.X('age:O', scale=alt.Scale(domain=np.arange(0, 95, 5).tolist())) setting? Isn't it conceptually the same idea, to use age at its face value for the axis scale?
I'm getting this when I try:

It actually kinda works, b/c one can see that no box-median is drawn exactly where I expect the holes, it's just that the graphic is messed up, so it's getting very close.
That looks like a bug in Vega-Lite's boxplot macro. Would you like to report it there?
Oh, so you are saying my understanding is correct and it should work? Sure, I can report it.
Reported in https://github.com/vega/vega-lite/issues/5259