Hi, everyone. I'm new to Prophet, but I've already built some models with very good accuracy. However, lately I have been trying to develop an hourly model for traffic on a highway. My data starts at 2015-01-01 00:00:00 and ends at 2018-12-31 23:00:00.
```
head(Df)
ds y
1 2015-01-01 00:00:00 239
2 2015-01-01 01:00:00 541
3 2015-01-01 02:00:00 609
4 2015-01-01 03:00:00 671
5 2015-01-01 04:00:00 578
6 2015-01-01 05:00:00 725
tail(Df)
ds y
35059 2018-12-31 18:00:00 3491
35060 2018-12-31 19:00:00 3706
35061 2018-12-31 20:00:00 3215
35062 2018-12-31 21:00:00 3181
35063 2018-12-31 22:00:00 2410
35064 2018-12-31 23:00:00 816
```
However, when I fit the dataframe with prophet, **m$start** becomes **2015-01-01 02:00:00**, and so does the start of the future dataframe (make_future_dataframe). Somehow, the first two rows of my original dataframe weren't considered. Also, both m and the future dataframe end two hours later than expected, "compensating" for the first two rows that were ignored. Any ideas about what is happening here? I would like my model to start at **2015-01-01 00:00:00**, just like the input dataframe.
```
Vol_Passeio=data_horario[43825:78888,3]
head(Vol_Passeio)
Passeio
1 239
2 541
3 609
4 671
5 578
6 725
ds=data.frame(ds=seq(from=as.POSIXct('2015-01-01 00:00'),to=as.POSIXct('2018-12-31 23:00'),by="hour"))
Df=cbind(ds,Vol_Passeio)
names(Df)[2]="y"
head(Df)
ds y SAT_13
1 2015-01-01 00:00:00 239 331
2 2015-01-01 01:00:00 541 735
3 2015-01-01 02:00:00 609 708
4 2015-01-01 03:00:00 671 520
5 2015-01-01 04:00:00 578 408
6 2015-01-01 05:00:00 725 479
tail(Df)
ds y SAT_13
35059 2018-12-31 18:00:00 3491 1567
35060 2018-12-31 19:00:00 3706 1527
35061 2018-12-31 20:00:00 3215 1461
35062 2018-12-31 21:00:00 3181 1403
35063 2018-12-31 22:00:00 2410 919
35064 2018-12-31 23:00:00 816 376
Df$SAT_13=SAT_13$Vol_Autos[26305:61368]
m=prophet()
m=add_regressor(m,'SAT_13')
m=fit.prophet(m,Df)
m$start
> [1] "2015-01-01 02:00:00 GMT"
prev=make_future_dataframe(m,periods=365*24,freq = 3600)
head(prev)
ds
1 2015-01-01 02:00:00
2 2015-01-01 03:00:00
3 2015-01-01 04:00:00
4 2015-01-01 05:00:00
5 2015-01-01 06:00:00
6 2015-01-01 07:00:00
```
Hello @crhistian123,
So this looks like an issue with the date timezone conversions (it's caught me out a few times as well). In the codebase this happens here, when you call make_future_dataframe and the history is appended to the future dates.
The timezone is converted to GMT. If you pass in a dataframe with a timezone other than GMT or UTC, you will see an offset (2 hrs in your case). So you will need to convert the dates back to the original timezone. I've shown a few examples below.
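The mechanism can be reproduced in base R without Prophet. This is a minimal sketch: the zone name `America/Sao_Paulo` is an assumption picked only because it sits two hours behind GMT in January 2015; the thread does not state the poster's actual timezone.

```r
# Hourly timestamps created in a non-GMT zone (zone name is an assumption).
ds_local <- seq(as.POSIXct("2015-01-01 00:00", tz = "America/Sao_Paulo"),
                by = "hour", length.out = 3)

# Re-express the same instants in GMT, as happens to the history internally.
ds_gmt <- ds_local
attr(ds_gmt, "tzone") <- "GMT"

# The instants are identical; only the displayed clock time shifts.
all(as.numeric(ds_local) == as.numeric(ds_gmt))
format(ds_local, "%Y-%m-%d %H:%M", usetz = TRUE)
format(ds_gmt, "%Y-%m-%d %H:%M", usetz = TRUE)
```

With this zone, midnight local time prints as 02:00 once shown in GMT, which looks like two missing rows at the start and two extra rows at the end even though no data was dropped.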
Example from the yosemite data - no offset
```
> library(lubridate)
> library(prophet)
> df <- read.csv('../examples/example_yosemite_temps.csv')
# data is in 5 min intervals so converted this to hourly data to match your example
> df1 <- df[minute(df$ds) == 0, ]
> row.names(df1) <- NULL #reset row index
> head(df1)
ds y
1 2017-05-01 00:00:00 27.8
2 2017-05-01 01:00:00 20.8
3 2017-05-01 02:00:00 13.4
4 2017-05-01 03:00:00 10.2
5 2017-05-01 04:00:00 8.6
6 2017-05-01 05:00:00 7.7
> m <- prophet(df1)
> future <- make_future_dataframe(m, periods = 365)
> head(future)
ds
1 2017-05-01 00:00:00
2 2017-05-01 01:00:00
3 2017-05-01 02:00:00
4 2017-05-01 03:00:00
5 2017-05-01 04:00:00
6 2017-05-01 05:00:00
# timestamps are exactly the same as they are both UTC or GMT format
```
Now pass in the same data, but with the timezone set by as.POSIXct, which gives an offset
```
# this will set to current timezone (in my case BST (British Summer Time))
> df1$ds <- as.POSIXct(df1$ds)
> head(df1)
ds y
1 2017-05-01 00:00:00 27.8
2 2017-05-01 01:00:00 20.8
3 2017-05-01 02:00:00 13.4
4 2017-05-01 03:00:00 10.2
5 2017-05-01 04:00:00 8.6
6 2017-05-01 05:00:00 7.7
> m <- prophet(df1)
> future <- make_future_dataframe(m, periods = 365)
> head(future)
# dates in future are offset by -1 hr relative to df1
ds
1 2017-04-30 23:00:00
2 2017-05-01 00:00:00
3 2017-05-01 01:00:00
4 2017-05-01 02:00:00
5 2017-05-01 03:00:00
6 2017-05-01 04:00:00
> df1$ds[1]
[1] "2017-05-01 BST"
> tz(future$ds)
[1] "GMT"
# this returns BST (in my case) in the original dataframe and GMT in future df
# created from `make_future_dataframe`. GMT is 1hr behind BST, hence the 1 hr offset.
```
Convert the future df dates back to the timezone you want
```
# this will convert to local timezone unless you specify specific timezone to tz argument
> library(lubridate)
> future$ds <- with_tz(future$ds, "")
> future$ds[1]
[1] "2017-05-01 00:00:00 BST"
```
Hopefully this answers your question. It's a subtle issue but probably worth mentioning in the main docs.
Thank you, @ryankarlos. It really solved the problem! I would never have imagined that. Do you know of a way to use another timezone inside make_future_dataframe?
@crhistian123 at the moment, I don't think there is support for directly supplying another tz to make_future_dataframe, as this function just adds future dates with the same timezone setting as history$ds (which is 'GMT'). It seems this issue is also mentioned in #793, though with an alternative workaround.
It should definitely be mentioned in the documentation though. Not sure if there is a specific reason why dates are converted to 'GMT' in history before fitting, @bletham?
Thanks for reporting this, this is a bug.
Time zones get really complicated so we've tried to avoid them entirely. This is the bad behavior we have tried to avoid:
```
> ds <- as.Date(c('2020-01-01', '2020-01-02'))
> print(as.POSIXct(ds, format = "%Y-%m-%d"))
[1] "2019-12-31 16:00:00 PST" "2020-01-01 16:00:00 PST"
```
What happened is that as.Date assumed GMT, and then as.POSIXct used the system timezone. To avoid that, we tell as.POSIXct to use GMT. That happens here:
https://github.com/facebook/prophet/blob/f16d9df33337be93d34f0f69904e9fa25e1b6a10/R/R/prophet.R#L259-L264
and then also use GMT when constructing the dates in make_future_dataframe:
https://github.com/facebook/prophet/blob/f16d9df33337be93d34f0f69904e9fa25e1b6a10/R/R/prophet.R#L1627
The idea is meant to be that everything is fixed to be in GMT and we don't have things getting mixed up with the local system timezone anywhere.
What's happening here is that the input ds has a timezone specified that is not GMT, so then things don't work, as you note. We should definitely be handling this setting better.
The Py version handles this a little more aggressively and explicitly requires the user to strip timezones before fitting:
https://github.com/facebook/prophet/blob/f16d9df33337be93d34f0f69904e9fa25e1b6a10/python/fbprophet/forecaster.py#L263-L267
I'm not entirely sure exactly what approach we want to take here. I don't think R has the same concept of a timezone-less POSIXct (it will just default to GMT). The simplest fix would be to check the timezone in the dataframe, error if it is not GMT, and ask users to convert to GMT; that sounds a bit obnoxious though. We could also record the timezone of the history and then use that everywhere instead of GMT, and error in predict if the dates are not in the same timezone as was specified in the history. I'm a little worried that explicitly handling time zones like this might get hairy, but right now I think it would be reasonable and that this is what we should try doing.
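For illustration only, the "simplest fix" could look something like the sketch below. `require_gmt` is a hypothetical helper, not a function in Prophet, and it is deliberately conservative: a blank `tzone` attribute (meaning system local time) is rejected too, even if the system happens to run in UTC.

```r
# Hypothetical check: stop unless the column is explicitly labelled GMT or UTC.
require_gmt <- function(ds) {
  tz <- attr(ds, "tzone")
  if (is.null(tz) || !(tz[1] %in% c("GMT", "UTC"))) {
    stop("ds must be a POSIXct in GMT/UTC; please convert it before fitting.")
  }
  invisible(ds)
}

# Passes silently for an explicitly GMT column.
require_gmt(as.POSIXct("2015-01-01 00:00", tz = "GMT"))

# Raises an error for any other zone (zone name chosen for illustration).
res <- try(require_gmt(as.POSIXct("2015-01-01 00:00", tz = "America/Sao_Paulo")),
           silent = TRUE)
inherits(res, "try-error")
```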
The workaround for now is to convert things to GMT before fitting, and then convert results back from GMT.
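A minimal sketch of that workaround in base R, assuming what is wanted is for the clock times to stay as they are and simply be labelled GMT. `relabel_as_gmt` is a hypothetical helper name, and the zone in the usage line is only an example.

```r
# Option 1: build the hourly index directly in GMT, so nothing shifts later.
ds <- seq(from = as.POSIXct("2015-01-01 00:00", tz = "GMT"),
          to   = as.POSIXct("2018-12-31 23:00", tz = "GMT"),
          by   = "hour")
length(ds)  # one timestamp per hour over the four years

# Option 2: keep the wall-clock reading of an existing column and label it GMT.
# Caveat: local times around daylight-saving changes can repeat or be skipped.
relabel_as_gmt <- function(x) {
  as.POSIXct(format(x, "%Y-%m-%d %H:%M:%S"), tz = "GMT")
}

local_ds <- as.POSIXct("2015-01-01 00:00", tz = "America/Sao_Paulo")
gmt_ds <- relabel_as_gmt(local_ds)
format(gmt_ds, "%Y-%m-%d %H:%M:%S", usetz = TRUE)
```

After forecasting, the same idea applies in reverse: format the GMT result and parse it again with the original zone if the output needs to carry the local label.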
Thank you guys. At first I thought that I would never find a reasonable explanation, but now I understand.
I converted the original dataframe to the GMT tz and then everything went well. Thanks.
I'm going to leave this open because I do consider it a bug in its current state that we need to fix :-)
Hm, I thought this would be fairly straightforward to fix, but I've gotten a bit stuck.
My intended solution was to check if the input dates have a timezone, and if so, use that as the timezone instead of GMT.
But it seems that a POSIXct can have a timezone that doesn't show up when getting with attr, but is taken into account when setting with attr. Example:
```
> ds <- as.POSIXct('2018-01-01')
> print(ds) # POSIXct
[1] "2018-01-01 PST" # Shows PST as timezone (my system time)
> attr(ds, "tzone")
[1] "" # tzone attribute is blank. I expected this to be PST.
> attr(ds, "tzone") <- "GMT"
> print(ds)
[1] "2018-01-01 08:00:00 GMT" # time has been converted from PST to GMT. So it did know it was PST.
```
In order to fix this issue, we need to be able to identify that as.POSIXct('2018-01-01') has PST tzone. Anyone have any ideas?
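One possible direction, offered as a sketch rather than a settled fix: R documents a blank `tzone` as meaning "the current time zone", so a missing or empty attribute could be resolved through `Sys.timezone()`. `effective_tz` is a hypothetical helper name. Note that `Sys.timezone()` returns a zone name such as "America/Los_Angeles" rather than the abbreviation "PST", and it can return `NA` on systems where the zone cannot be determined.

```r
# Hypothetical helper: resolve the timezone a POSIXct is displayed in.
effective_tz <- function(x) {
  tz <- attr(x, "tzone")
  if (is.null(tz) || !nzchar(tz[1])) {
    Sys.timezone()  # blank or missing tzone: fall back to the session timezone
  } else {
    tz[1]
  }
}

effective_tz(as.POSIXct("2018-01-01"))              # session timezone name
effective_tz(as.POSIXct("2018-01-01", tz = "GMT"))  # explicit zone is returned as-is
```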