Prophet: Incorrect handling for overlapping holiday dates

Created on 24 Aug 2020  路  8Comments  路  Source: facebook/prophet

The generated_holidays table in R contains a non-existing holiday, Bevrijdingsdag, Hemelvaart, that is actually a concatenation of two holidays (Bevrijdingsdag and Hemelvaart) that happen to coincide in that year, but in general do not fall on the same date.
This has prophet fit a coefficient for this joint holiday instead of the separate holidays. A tidyr::separate_rows(holiday, sep = ", ") fixes this, but it should maybe be fixed in the source data instead.

> prophet::generated_holidays %>% filter(country == "NL", ds == "2005-05-05")

          ds                    holiday country year
1 2005-05-05 Bevrijdingsdag, Hemelvaart      NL 2005
bug ready

All 8 comments

Thanks for raising this, and for the PR that fixed it in the table.

The table is generated here, using the holidays Python package: https://github.com/facebook/prophet/blob/master/python/scripts/generate_holidays_file.py

The issue is coming from how overlapping holidays are handled upstream:

>>> import holidays
>>> holidays.NL(years=[2005])
{datetime.date(2005, 1, 1): 'Nieuwjaarsdag', datetime.date(2005, 3, 27): 'Eerste paasdag', datetime.date(2005, 3, 28): 'Tweede paasdag', datetime.date(2005, 5, 5): 'Bevrijdingsdag, Hemelvaart', datetime.date(2005, 5, 15): 'Eerste Pinksterdag', datetime.date(2005, 5, 16): 'Tweede Pinksterdag', datetime.date(2005, 12, 25): 'Eerste Kerstdag', datetime.date(2005, 12, 26): 'Tweede Kerstdag', datetime.date(2005, 4, 30): 'Koninginnedag'}

Basically, in the holidays package the object for computing holidays is a dictionary keyed by date, and so if there are multiple holidays on a date they must be combined, and that is being done by combining them with comma separation. This is the code upstream where that happens:
https://github.com/dr-prodigy/python-holidays/blob/02d5d510202a02471cd0a939a11525cf02ca115c/holidays/holiday_base.py#L121

This issue affects not just the R table generation, it also affects the Python version:

>>> from fbprophet.make_holidays import make_holidays_df
>>> holidays = make_holidays_df(year_list=[2005], country='NL')
>>> print(holidays)
          ds                     holiday
0 2005-01-01               Nieuwjaarsdag
1 2005-03-27              Eerste paasdag
2 2005-03-28              Tweede paasdag
3 2005-05-05  Bevrijdingsdag, Hemelvaart
4 2005-05-15          Eerste Pinksterdag
5 2005-05-16          Tweede Pinksterdag
6 2005-12-25             Eerste Kerstdag
7 2005-12-26             Tweede Kerstdag
8 2005-04-30               Koninginnedag

And it's not just NL, it will be any time there is an overlapping holiday. For instance, looking at the generated holidays table, I can see an instance of this happening with "New Year Holiday" and "New Year's Day (Observed)". We'll definitely need to get this fixed.

I don't know if the holidays package uses commas in names only for this purpose, and never uses them in the names of individual holidays. I'm hoping that is the case. If so, then the fix on our end will be to parse them out when creating the holidays table. Right now the holidays table is being created in 3 places. Once for use in the Python model:
https://github.com/facebook/prophet/blob/e992e0b7b6666da9bae11d5d2c7571aaac4023f0/python/fbprophet/make_holidays.py#L56-L64

and then in 2 places for generating the R file:
https://github.com/facebook/prophet/blob/e992e0b7b6666da9bae11d5d2c7571aaac4023f0/python/scripts/generate_holidays_file.py#L60-L62
https://github.com/facebook/prophet/blob/e992e0b7b6666da9bae11d5d2c7571aaac4023f0/python/scripts/generate_holidays_file.py#L72-L74

We should update this so that there is only once place (the script for generating the R file should directly call fbprophet.make_holidays.make_holidays_df), and then fix that function so that it correctly parses these comma-separated holiday names.

I don't know if the holidays package uses commas in names only for this purpose, and never uses them in the names of individual holidays. I'm hoping that is the case.

I'm not sure that is the case. Here is one example:
name = "Martin Luther King, Jr. - Idaho Human Rights Day"
https://github.com/dr-prodigy/python-holidays/blob/02d5d510202a02471cd0a939a11525cf02ca115c/holidays/countries/united_states.py#L107

Interestingly, that's only for Idaho. All other states list it as Martin Luther King Jr. Day, with no comma.

One solution might be to try and split it by comma, and seeing whether the separate holidays already exist. If both holidays don't occur elsewhere for the same country, don't split them. If they do, split them.

For Bevrijdingsdag, Hemelvaart, since Bevrijdingsdag and Hemelvaart both exist for NL, they are indeed separate holidays, and we should split them.
For Martin Luther King, Jr. - Idaho Human Rights Day, neither Martin Luther King nor Jr. - Idaho Human Rights Day exist for US, so they should stay together.

@raffg Good catch!
@basjacobs93 I think one of the biggest challenges here is that as far as I can tell there is no way to extract the full list of holidays for a country other than after they have been comma-joined. If you look at something like https://github.com/dr-prodigy/python-holidays/blob/02d5d510202a02471cd0a939a11525cf02ca115c/holidays/countries/united_states.py#L107 , the list of holidays is defined only in the _populate method, and code like here
https://github.com/dr-prodigy/python-holidays/blob/02d5d510202a02471cd0a939a11525cf02ca115c/holidays/countries/united_states.py#L108
is where they are being joined into a comma-separated string with any previous holidays that have that same date that year.

Looking more at the code for holidays, I think it was originally intended that commas should be a unique indicator of two holidays. The Holidays base class has a method get_list that is splitting holidays by commas:
https://github.com/dr-prodigy/python-holidays/blob/02d5d510202a02471cd0a939a11525cf02ca115c/holidays/holiday_base.py#L144-L145

Running this code:

import holidays
x = holidays.NL(years=[2005], expand=False)  # note: expand=False is important
for date, holiday_str in x.items():
    print(date, holiday_str, x.get_list(date))

Produces this:

2005-01-01 Nieuwjaarsdag ['Nieuwjaarsdag']
2005-03-27 Eerste paasdag ['Eerste paasdag']
2005-03-28 Tweede paasdag ['Tweede paasdag']
2005-05-05 Bevrijdingsdag, Hemelvaart ['Bevrijdingsdag', 'Hemelvaart']
2005-05-15 Eerste Pinksterdag ['Eerste Pinksterdag']
2005-05-16 Tweede Pinksterdag ['Tweede Pinksterdag']
2005-12-25 Eerste Kerstdag ['Eerste Kerstdag']
2005-12-26 Tweede Kerstdag ['Tweede Kerstdag']
2005-04-30 Koninginnedag ['Koninginnedag']

where it has correctly separated the holiday. (Of course, for ID this doesn't work because of the case highlighted above).

Here's what I would propose: when generating the holiday table in fbprophet.make_holidays.make_holidays_df, we should change it to use the get_list method as above to separate out any overlapping holidays.

Then, to resolve the ID issue (which may exist for other countries too), we can raise an upstream issue in the holidays package and get it fixed there, where it could be fixed either by removing commas from all holiday names, or by changing the delimiter to something else. Until the fix is in upstream, we would still be producing incorrect holidays for ID but we'd at least be in a much better state than we are now.

The Holidays package fixed Martin Luther King, Jr. Day at the country-level in https://github.com/dr-prodigy/python-holidays/issues/295. It looks like they just missed the state-level. I have already submitted a new issue for that (https://github.com/dr-prodigy/python-holidays/issues/351).

I don't want to step on anyone's toes on this issue, but I'd be happy to fix it and make a PR if no one else is already on it.

@raffg Go for it, that'd be awesome!

Thanks for fixing this @raffg! I updated the R generated file too, so this should all be fixed (pending the upstream fix).

Quick update, the Idaho MLK holiday has been updated in the beta branch of holidays: https://github.com/dr-prodigy/python-holidays/pull/354. I searched all countries and I believe this is the last remaining comma in a holiday name. Once this beta branch is merged with master, Prophet should not have any more issues with overlapping holiday dates.

Was this page helpful?
0 / 5 - 0 ratings