Altair: Monthly data and ordinal encoding create an "off by one" error

Created on 17 Jul 2018  Â·  9Comments  Â·  Source: altair-viz/altair

I uncovered this "bug" during some data exploration for development of a open-source library tracking the U.S. Consumer Price Index.

The source data is monthly, with the values linked with the first date of each month.

import pandas as pd
import altair as alt
df = pd.read_json("https://raw.githubusercontent.com/datadesk/cpi/master/notebooks/last_13.json", dtype={"date_label": pd.np.datetime64})
df.head(13)[['date',  'pct_change_rounded']]

screenshot from 2018-07-16 18-33-09

My aim was to format the dates in the x-axis labels in the same manner as the government's sample chart.

screenshot from 2018-07-16 18-36-58

I tried to do that with timeUnit and the format option to axis.

alt.Chart(df).mark_bar().encode(
    x=alt.X("date:O", timeUnit="yearmonth", axis=alt.Axis(format="%b %y")),
    y="pct_change_rounded:Q"
)

Here's what I got:

visualization 39

Look closely and you can see that the latest value, June 1, is rendered as May.

What's up with that?

bug vega-lite-related

Most helpful comment

Update: thanks to some input from @domoritz I understand this a bit better. It looks like if we want to ensure dates are parsed as local times, we need to either use unix timestamps or fully-qualified ISO-8601 dates. This is not a characteristic of Vega or Vega-Lite, but of Javascript itself. For example:

Given a date string of "March 7, 2014", parse() assumes a local time zone, but given an ISO format such as "2014-03-07" it will assume a time zone of UTC (ES5 and ECMAScript 2015). Therefore Date objects produced using those strings may represent different moments in time depending on the version of ECMAScript supported unless the system is set with a local time zone of UTC. This means that two date strings that appear equivalent may result in two different values depending on the format of the string that is being converted.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/parse

So if we replace df[col_name].astype(str) with df[col_name].dt.strftime('%Y-%m-%dT%H:%M:%SZ') in sanitize_dataframe, then these kinds of off-by-one errors will be avoided.

All 9 comments

Looks like something to do with Vega-Lite. I've opened an issue there.

Per response in https://github.com/vega/vega-lite/issues/4044, t looks like the issue is that Javascript parses dates in UTC. You can fix this here by using timeUnit='utcyearmonth' rather than timeUnit='yearmonth'

I'm not sure I totally get it. If I submit "2018-06-01," how is that parsed by VegaLite? What would the Python datetime object come back as?

My understanding is that it is parsed (in UTC) as 00:00 on 2018-06-01, which is then interpreted (in ISO) as midnight on 2018-05-31, leading to the issue you see.

Edit: I'm wrong here – see new comment below

The solution is to parse the month and year in UTC.

I agree that this is not very satisfactory, but it sounds like it's pretty deeply baked into the vega/vega-lite stack.

I just learned something about ISO.

On Thu, Jul 19, 2018, 11:30 AM Jake Vanderplas notifications@github.com
wrote:

My understanding is that it is parsed (in UTC) as 00:00 on 2018-06-01,
which is then interpreted (in ISO) as midnight on 2018-05-31, leading to
the issue you see.

The solution is to parse the month and year in UTC.

I agree that this is not very satisfactory, but it sounds like it's pretty
deeply baked into the vega/vega-lite stack.

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/altair-viz/altair/issues/1027#issuecomment-406372355,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAAnCU2-ggCsqC133sX8pjZLA4DNn9fQks5uINBggaJpZM4VSC5E
.

OK, so it turns out I was completely wrong on the reason for this issue when I speculated above.

The issue is that the timestamps are being parsed as if they're UTC, and then displayed in local time (i.e. compensating for timezone). Because the west coast is 8 hours behind london, this shifts the time to the previous month for the dates you're using. If you were east of London when running this code, you wouldn't see this issue :smile:.

So the best solution is to make sure dates are both parsed and displayed as local time, so no time zone correction is required.

Looking at the Vega-Lite docs on UTC time, it looks like (and this seems entirely crazy to me) the way you make sure dates are parsed as local time is to not use ISO format. Altair serializes datetime data in ISO format by default, so timezone corrections will always be applied.

But if you change the serialization to be non-ISO compliant, you can make things work in as expected:

df['date2'] = df['date'].dt.strftime('%b %d %Y')  # non-ISO serialization 

alt.Chart(df).mark_bar().encode(
    x=alt.X("date2:O", timeUnit="yearmonth", axis=alt.Axis(format="%b %y")),
    y="pct_change_rounded:Q"
)

visualization 18

This format-dependent parsing of dates in Vega is more than a bit surprising to me, and I hope that it can be addressed upstream. But if not, I'd propose we change the way we serialize dates in Altair so that they will be parsed the same way they are displayed, without any implicit time-zone conversion.

We'd just need to modify this line to use a non-ISO date format that is correctly parsed by the vega stack:
https://github.com/altair-viz/altair/blob/da9d6141efe74211f768adf2efaafceeb5962495/altair/utils/core.py#L127

Update: thanks to some input from @domoritz I understand this a bit better. It looks like if we want to ensure dates are parsed as local times, we need to either use unix timestamps or fully-qualified ISO-8601 dates. This is not a characteristic of Vega or Vega-Lite, but of Javascript itself. For example:

Given a date string of "March 7, 2014", parse() assumes a local time zone, but given an ISO format such as "2014-03-07" it will assume a time zone of UTC (ES5 and ECMAScript 2015). Therefore Date objects produced using those strings may represent different moments in time depending on the version of ECMAScript supported unless the system is set with a local time zone of UTC. This means that two date strings that appear equivalent may result in two different values depending on the format of the string that is being converted.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/parse

So if we replace df[col_name].astype(str) with df[col_name].dt.strftime('%Y-%m-%dT%H:%M:%SZ') in sanitize_dataframe, then these kinds of off-by-one errors will be avoided.

I'm still not sure I totally get this. But that is almost always the case when it comes to time zones! ;)

If you make a patch and want me to test my code again, I'm happy to help.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

nielsmde picture nielsmde  Â·  4Comments

galloramiro picture galloramiro  Â·  3Comments

breadbaron picture breadbaron  Â·  4Comments

dzonimn picture dzonimn  Â·  3Comments

floringogianu picture floringogianu  Â·  3Comments