We'd like to explore making cmdstanpy (official project from the Stan team) an option for Prophet, making it possible to install/run Prophet without pystan (which is GPLv3).
Some considerations:
Some additional comments for this:
During package use:
The interface between fbprophet and PyStan is fairly small during package use. There are only three things that are done that use PyStan:
(1) We load a pickled StanModel that contains the compiled model:
https://github.com/facebook/prophet/blob/e8ddded4fec128191f08be83ae9b5acd7db6f563/python/fbprophet/models.py#L14
(2) We call StanModel.sampling and extract the samples:
https://github.com/facebook/prophet/blob/e8ddded4fec128191f08be83ae9b5acd7db6f563/python/fbprophet/forecaster.py#L1116
(3) We call StanModel.optimizing and extract the MAP fit:
https://github.com/facebook/prophet/blob/e8ddded4fec128191f08be83ae9b5acd7db6f563/python/fbprophet/forecaster.py#L1131
And that's it!
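Since the surface is just those three calls, one way to picture it is as a small backend interface. The following is only a sketch under assumptions: `StanBackend`, `FakeBackend`, and their method signatures are hypothetical names made up for illustration, not existing fbprophet code, and the fake backend returns placeholder numbers rather than calling Stan.

```python
from abc import ABC, abstractmethod


class StanBackend(ABC):
    """Hypothetical interface covering the three PyStan touch points."""

    @abstractmethod
    def load_model(self, path):
        """(1) Load the model that was compiled at install time."""

    @abstractmethod
    def sampling(self, data, init, **kwargs):
        """(2) Run MCMC and return draws keyed by parameter name."""

    @abstractmethod
    def optimizing(self, data, init, **kwargs):
        """(3) Run optimization and return the MAP fit keyed by parameter name."""


class FakeBackend(StanBackend):
    """Stand-in so the sketch runs without PyStan or cmdstanpy installed."""

    def load_model(self, path):
        # A real backend would unpickle or locate a compiled model here.
        self.model_path = path
        return self

    def sampling(self, data, init, **kwargs):
        # Placeholder draws, not a real Stan run.
        return {"theta": [0.2, 0.25, 0.3]}

    def optimizing(self, data, init, **kwargs):
        # Placeholder arithmetic (mean of y), not a real Stan fit.
        return {"theta": sum(data["y"]) / data["N"]}


backend = FakeBackend().load_model("model.pkl")
toy_data = {"N": 10, "y": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]}
map_fit = backend.optimizing(toy_data, init=None)
```

With something along these lines, a PyStan-backed class and a cmdstanpy-backed class could sit behind the same three methods, and the forecaster would not need to know which one it is talking to.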
During package install:
We compile the StanModel and pickle it:
https://github.com/facebook/prophet/blob/e8ddded4fec128191f08be83ae9b5acd7db6f563/python/setup.py#L31
If we ultimately want to swap out PyStan for cmdstanpy, then we'll need to be sure that we don't break the setup of anyone currently using PyStan. I think this means making sure that cmdstanpy installs without issue in any environment in which PyStan installs without issue, and adding it to conda-forge.
For the short-term it seems like it might make sense to aim to support both so that if we do want to migrate, that migration can happen a bit more gradually. I think supporting both might not be too terrible since the interface between PyStan and fbprophet is limited to just the few points that I listed above.
I'll comment on the test section since I've been fooling around with cmdstanpy. I've been trying for a while to make the bernoulli.stan example produce the same results for a toy dataset. After a significant time investment, I can't make pystan and cmdstanpy produce the _exact_ same results, even after fussing around with the seed, number of iterations, number of warmups, thinning factor, etc. However, the results are broadly the same, assuming that sensible defaults are used.
This means that point 3 - testing - is a risk. If you can't find a way to make the results consistent across runs for the same input, then you end up having to run stochastic tests, which are fairly expensive because their guarantees hold only in the limit.
# pystan
from pystan import StanModel
# modelcode is the text of bernoulli.stan (listed below) as a string
sm = StanModel(model_code=modelcode)
bern_data = { "N" : 10, "y" : [0,1,0,0,0,0,0,0,0,1] }
results = sm.sampling(data=bern_data, iter=1000, seed=15)
# cmdstanpy
from cmdstanpy import cmdstan_path, compile_model, sample, get_drawset, summary, diagnose
# bernoulli_stan is the path to bernoulli.stan (listed below)
bernoulli_model = compile_model(bernoulli_stan)
bern_data = { "N" : 10, "y" : [0,1,0,0,0,0,0,0,0,1] }
bern_fit = sample(bernoulli_model, data=bern_data, seed=15)
# bernoulli.stan
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);
  for (n in 1:N)
    y[n] ~ bernoulli(theta);
}
# toy dataset
{ "N" : 10,
"y" : [0,1,0,0,0,0,0,0,0,1]}
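To make the "broadly the same" comparison concrete, here is a rough sketch of a tolerance-based check on posterior means. Everything in it is an assumption for illustration: `draws_agree` is a hypothetical helper, the 4-standard-error threshold is arbitrary, and the draws are simulated with Python's `random` module from Beta(3, 9) (the exact posterior for this toy dataset under the beta(1,1) prior) instead of coming from PyStan or cmdstanpy.

```python
import random
import statistics


def draws_agree(draws_a, draws_b, n_sigma=4.0):
    """Return True if the two sample means are within n_sigma combined standard errors.

    This treats draws as independent, which real MCMC draws are not,
    so it is a loose sanity check and not a formal test.
    """
    mean_a = statistics.fmean(draws_a)
    mean_b = statistics.fmean(draws_b)
    se_a = statistics.stdev(draws_a) / len(draws_a) ** 0.5
    se_b = statistics.stdev(draws_b) / len(draws_b) ** 0.5
    tolerance = n_sigma * (se_a**2 + se_b**2) ** 0.5
    return abs(mean_a - mean_b) <= tolerance


# Simulated stand-ins for the draws each backend would return for theta.
rng_a, rng_b = random.Random(1), random.Random(2)
pystan_like = [rng_a.betavariate(3, 9) for _ in range(4000)]
cmdstanpy_like = [rng_b.betavariate(3, 9) for _ in range(4000)]
print(draws_agree(pystan_like, cmdstanpy_like))
```

A check like this trades exact reproducibility for a statistical guarantee, which is the cost described above: it can still fail occasionally by chance, and tightening the threshold means drawing more samples.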
Super helpful @daikonradish (oh HI Jireh!!!)
Chatted with @mitzimorris about this a bit and she said that cmdstanpy is at about 80% complete, which means we'd have to pitch in a bit of work on that. In particular, optimization isn't implemented yet which is an immediate blocker.
I don't think it's too much work but it's also not just ready to be dropped in as a replacement quite yet. Once cmdstanpy has a set of issues blocking their release, we can track those and then start a PR on Prophet.
@seanjtaylor Have you looked into using pycmdstan? It has the optimize function.
if cmdstanpy is priority in the near future, then here's a suggestion:
(1) Prophet object takes an extra keyword arg to its __init__: use_cmdstanpy, default to False.
(2) [...] pystan currently. Wherever use_cmdstanpy is set to true, switch backends accordingly.
(3) [...] epsilon, that you will allow the _average_ predictions to diverge against. Each of the predictions should have an _absolute_ tolerance - likely you'll want to test if directionally the predictions are the same.
(4) [...] use_cmdstanpy in some release to true.
(5) [...] use_cmdstanpy entirely, and remove pystan from the package.
I've filed two CmdStanPy issues: https://github.com/stan-dev/cmdstanpy/issues/58 and https://github.com/stan-dev/cmdstanpy/issues/57. The former is an umbrella issue. The latter came up earlier this week because in the future, we'd like to have a lightweight R wrapper "CmdStanR".
The goal of lightweight wrappers is to allow latest Stan release features to be available to R and Python users. CmdStan releases are kept in sync with Stan releases - the 2.19 Stan release hasn't yet made it through the eye of the CRAN needle.
@mitzimorris - just a casual question from a downstream consumer - what will the dev cycles for Pystan and Rstan be, and will they be entirely separate from CmdStan? If I maintain a project today with Rstan, should I be looking to switch, and what are three benefits (aside from the license)?
the dev cycles for PyStan and RStan are (almost) entirely separate from CmdStan, because it's an open source ecosystem - people contribute to the facet of the project that matters to them (and/or in the language they know). in the applied statistics world, there are a lot of R hackers.
PyStan and RStan have dependencies on the Stan APIs. CmdStan is a very simple wrapper around Stan's services layers, so it's almost trivial to do a CmdStan release as soon as a Stan release happens.
the reasons for preferring a lightweight wrapper interface are:
Have you considered just distributing cmdstan (pre)compiled models and calling them with cmdstanpy?
This way users would not need to install a C++ compiler.
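As a rough illustration of what "calling a precompiled model" involves: a compiled CmdStan model is a standalone executable driven by command-line arguments, so a wrapper mostly has to assemble that command. The sketch below is an assumption-laden example, not cmdstanpy's implementation: `build_cmdstan_command` and `run_precompiled` are hypothetical helpers, and the file names are placeholders.

```python
import subprocess


def build_cmdstan_command(exe_path, method, data_file, output_file, seed=None):
    """Build the argument list for a compiled CmdStan model executable."""
    if method not in ("sample", "optimize"):
        raise ValueError(f"unsupported method: {method}")
    command = [exe_path, method, "data", f"file={data_file}"]
    if seed is not None:
        command += ["random", f"seed={seed}"]
    command += ["output", f"file={output_file}"]
    return command


def run_precompiled(exe_path, method, data_file, output_file, seed=None):
    """Run the executable; exe_path must have been built by CmdStan beforehand."""
    command = build_cmdstan_command(exe_path, method, data_file, output_file, seed)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return output_file


# Placeholder paths for illustration only.
command = build_cmdstan_command(
    "./bernoulli", "optimize", "bernoulli.data.json", "output.csv", seed=15
)
```

One caveat with this approach is that a binary built on one platform generally won't run on another, so shipping precompiled models would mean building per platform.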
Also PyStan3 is going with ISC licence
@ahartikainen We had not considered pre-compiling, but it probably would be nice for users to avoid that step. I think rstanarm does this so I guess it's possible in principle. I would love to see examples if you have any.
For PyStan3, it's been a few months since it was updated, so I think folks are a bit concerned about future plans for the project.
@mitzimorris Thank you for filing those issues! I believe @yamamotoseiji has some resources to pitch in on this.
Any thoughts on the existing pycmdstan project that @nuskab2000 mentioned? It looks a bit stale and obviously we'd prefer to invest in a project that has as much community support as possible. I wonder if @maedoc (the top contributor on that project) could chime in and let us know about plans.
Here is a minimal example.
https://github.com/ahartikainen/Precompile_cmdstan
PyStan3 is chugging along. It is still in the alpha stage, and might be at that level for a while. Nevertheless, it does work (on linux/osx and with some hacking on Windows).
Regarding PyStan 3, it would indeed be great if a stable version came out that got rid of the GPL3 license. What we're a bit worried about is that it's been promised since 2017 (https://discourse.mc-stan.org/t/pystan-license/274/15), so it's not clear when it'll finally see the light of day.
The discussion here suggests that what we can do is make Prophet compatible either with PyStan or something lighter weight that is GPL-free, such as cmdstanpy or pycmdstan. We have 3 engineers ready to close out whatever work is needed for the lightweight GPL-free option; the first step is just choosing whether to invest in cmdstanpy or pycmdstan. @mitzimorris's cmdstanpy seems to be more active. Is it somehow related to the official Stan project? Unfortunately it doesn't have the optimize method implemented yet. pycmdstan has the optimize method, but the repo hasn't been active for 8 months.
None of this need affect Prophet's use of PyStan, so if PyStan 3 comes out sometime in the future perhaps that could also be leveraged by Prophet.
@mitzimorris, thanks for opening the two new issues on cmdstanpy. There are a bunch of other open issues on the repo. Which ones need to be resolved to enable it to work with Prophet?
@maedoc, similar question. Among the open issues on pycmdstan, which ones need to be resolved to get the package production ready for Prophet?
yikes. I really should have asked this in another forum, sorry for hijacking this one, Prophet dev team. I'll answer a couple more Prophet questions to pay it forward.
> Any thoughts on the existing pycmdstan project that @nuskab2000 mentioned? It looks a bit stale and obviously we'd prefer to invest in a project that has as much community support as possible. I wonder if @maedoc (the top contributor on that project) could chime in and let us know about plans.
I built and use pycmdstan regularly, but provided the code as a jumping-off point which @mitzimorris et al have done a great job renovating. I still use pycmdstan because cmdstanpy is still "beta" in my mind, but it would definitely be the community-backed choice.
> It looks a bit stale
I support a team and community of students, postdocs, and researchers on various platforms, HPC and cloud, so stale is a major feature for me.
> Among the open issues on pycmdstan, which ones need to be resolved to get the package production ready for Prophet?
pycmdstan has been pip installable for nearly a year, has 95% coverage, and passes on a CI pipeline, so it's production ready. There may be a few sharp edges, but the package is tiny and easy to read.
> Have you looked into using pycmdstan? It has the optimize function.
Stan devs are understandably a little biased toward HMC/NUTS, but this should be a fairly trivial PR to add to cmdstanpy.
Great, that's helpful @maedoc. As I mentioned on the other thread, since you're not planning to continue developing pycmdstan due to other priorities we'll pitch in on the cmdstanpy repo since that will be the community-backed repo. Thanks for clearing things up! @wsuchy is going to get started on the optimize method for cmdstanpy and we can allocate other folks to additional items as they come up. Once cmdstanpy is in good shape, we can also help with the necessary changes to Prophet. Looking forward to getting Prophet free of GPL3!
I've created an initial PR where pystan got replaced with cmdstanpy (https://github.com/facebook/prophet/pull/1083). There are however a few things to notice:
CmdStan is slower? that's very unexpected
what version of CmdStan? latest CmdStan - 2.20 was released last Thursday.
what is the compiler optimization level?
(there was a bug in the initial CmdStan release that did make CmdStan slower than PyStan - https://discourse.mc-stan.org/t/cmdstan-slower-than-pystan/8332 - this was fixed by release 2.19.1.)
is there any chance that the slowness is file-based I/O vs whatever PyStan is doing? can you profile?
no cmdstan available for the build job
Jenkins config for CmdStanPy uses the install_cmdstan script to get CmdStan installed. this, unfortunately, adds about 7 minutes to each test run, and there seem to be issues where the initial get request to GitHub fails.
Although it is nearly impossible to get pystan and cmdstan to produce the same results (even after fixing seeds), they are close and they are repeatable on the same machine.
are you specifying inits as well? not sure if this applies to optimize - cf https://github.com/stan-dev/pystan/issues/549#issuecomment-455620653
cmdstan doesn't work with Python 2
original spec for CmdStanPy called for Python agnostic - but
This is now pushed to PyPI in v0.6, thanks to @wsuchy!
It is enabled by specifying STAN_BACKEND=CMDSTANPY as an environment variable prior to running the install script, as shown here:
https://github.com/facebook/prophet/blob/46e56119835f851714d22b285d2e4081853b9fb1/.travis.yml#L14-L17
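For readers wondering how an install script might act on that variable, here is a minimal sketch. It is not the code from the Prophet repository: `get_backend_name` is a hypothetical helper, and treating `PYSTAN` as the default value is an assumption; only `STAN_BACKEND=CMDSTANPY` comes from the comment above.

```python
import os

# "CMDSTANPY" is the value mentioned above; "PYSTAN" is an assumed default name.
SUPPORTED_BACKENDS = ("PYSTAN", "CMDSTANPY")


def get_backend_name(environ=os.environ, default="PYSTAN"):
    """Read STAN_BACKEND from the environment and validate it."""
    name = environ.get("STAN_BACKEND", default).strip().upper()
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"STAN_BACKEND={name!r} is not one of {SUPPORTED_BACKENDS}"
        )
    return name


# Passing a plain dict keeps the example independent of the real environment.
selected = get_backend_name({"STAN_BACKEND": "CMDSTANPY"})
```

Failing loudly on an unrecognized value seems preferable to silently falling back, since a typo in the variable would otherwise build the wrong backend.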
I'll leave this open until we add this to the documentation somewhere.