Statsmodels: ENH: GAMLSS - Box-Cox Cole Green distribution

Created on 14 Feb 2018  Â·  6Comments  Â·  Source: statsmodels/statsmodels

Would it be possible to include a functionality for fitting a model like in GAMLSS? (already mentioned in issue #3915 and #2636)

I am thinking of a simple model with one independent and one dependent variable:
GAMLSS example
It should have the option to choose the link functions and distribution. I am interested in fitting the Box-Cox Cole Green distribution (BCCG) published by Cole (also called LMS method). This method has found application in many medical fields for creating reference curves.

Rigby and Stasinopoulos give enough information in their publications for the implementation (e.g. RS() fitting procedure):

  • R. A. Rigby and D. M. Stasinopoulos, “Smooth centile curves for skew and kurtotic data modelled using the Box-Cox power exponential distribution,” Stat. Med., vol. 23, no. 19, pp. 3053–3076, 2004.
  • R. A. Rigby, D. M. Stasinopoulos, and P. W. Lane, “Generalized additive models for location, scale and shape,” J. R. Stat. Soc. Ser. C Appl. Stat., vol. 54, no. 3, pp. 507–554, 2005.

I tried to implement it but was not so lucky so far.
.

Most helpful comment

about the slow: you can replace for i in x loops by a vectorized numpy computation. Also 'nelder-mead' is slow. If the starting values are not very bad, then bfgs or lbfgsb will be much faster. Analytical derivatives will make it numerically more precise and most likely also faster.

I'm not sure whether something like the RS algorithm is necessary. It looks like a lot of work, and I would try first to see how far you can get with the scipy optimizers. IRLS in GLM is a fisher scoring variant, It might be possible to reuse GLM when all other shape parameter are fixed.
(I'm a bit doubtful about any claims that RS works well, especially the cases and datasets are less "nice")

All 6 comments

Three parts:

  • the penalized splines are not yet available, PR still needs work (that I keep postponing). patsy splines work pretty well but amount of smoothing depends on number of knots or knot selection
  • distribution itself: I didn't look enough at the articles to see how the transformation is used to define the distribution. This might be relatively straightforward for given underlying distribution but might be more work to get all the elements for several possible underlying distributions.
  • a new model with a multiparameter distribution: This should be feasible now by using a standard likelihood model class (not within GLM which is for one-parameter families). We don't have anything currently as general as GAMLSS, BetaRegression (unmerged, #2030 #4238) was an early example for a two parameter distribution with two link functions. Last year we added count models with multi-parameter or multi-part models, that are however more restrictive (hardcoded link function or limited choice of individual parts)
    e.g. GeneralizedPoisson #3727 merged after rebase in #3795 and NegativeBinomialP #3832, merged in #3874, and zero-inflated models for two-part models with possibly two exog)

In general, I think getting a prototype version to work should not be too difficult and requires loglikelihood and keeping track of and connecting/stacking the parameters.
Analytical derivatives for the optimization will most likely be more work and requires hand coding various partial derivatives that can be combined into the full derivatives using chain rule, e.g. that's what GLM does. In at least an initial version this can be replaced by numerical derivatives.

  • I did not know that penalized splines are not available in Python. Parametric functions like polynoms as smoothing curves for median, etc. should work for the start
  • implementation of the distributions is quite easy. I implemented the Box-Cox Cole Green (BCCG) and the Box-Cox Power Exponential (BCPE) distribution in my new repo: gamlss
    Fitting the distribution to some data is easy, too (one dimensional/1D fit, like Python's fitter). However, it is a bit slow which could be because of my poor programming skills. By minimizing the likelihood function, I get the same results like in GAMLSS and R.
  • As you said, it gets more complicated for 2D fitting. The loglikelihood approach fails because there are too many parameters to estimate. Cole and Stasinopoulos propose using Fisher scoring. For that, we need the derivatives (first and second order) and some coding. For BCPE, derivatives and other needed terms are given in Stasinopoulos (2004). Only missing part is the RS() algorithm. Should be doable.

That feature would definitely attract a lot of statisticians and other scientists.

about the slow: you can replace for i in x loops by a vectorized numpy computation. Also 'nelder-mead' is slow. If the starting values are not very bad, then bfgs or lbfgsb will be much faster. Analytical derivatives will make it numerically more precise and most likely also faster.

I'm not sure whether something like the RS algorithm is necessary. It looks like a lot of work, and I would try first to see how far you can get with the scipy optimizers. IRLS in GLM is a fisher scoring variant, It might be possible to reuse GLM when all other shape parameter are fixed.
(I'm a bit doubtful about any claims that RS works well, especially the cases and datasets are less "nice")

Changing the loop to a vector made it of course a lot faster, from 4.3 to 0.6 seconds, thanks! "Nelder-mead" is quite stable and fast, bfgs and lbfgsb did not really succeed (will try it later again). 1D fitting is now pretty good.

I tried to implement the 2D fitting of a BCPE distribution (Python code). I used a simple model with linear median (M) and the other parameters constant (S, L, T). Optimizing again with "nelder-mead" for five parameters (two for M, and one for S, L, T respectively). Results are promising, I get similar parameters like the gamlss fit:
python fit
M = 45.53*x + 1.935
S = 0.033,
L = 1.04,
T = 1.28

gamlss fit
M = 45.535*x + 1.929
S = 0.0312
L = 0.99
T = 1.29

However, results highly depend on the starting values for M, S, L and T. Initial values have to be picked wisely. Maybe other optimizer will help. Next step should be higher degrees of freedom for the parameters.

adding a reference from citing articles (I didn't look at it but abstract sounds interesting)

Borghi, Elaine, Mercedes de Onis, Cutberto Garza, Jan Van den Broeck, Edward A. Frongillo, Laurence Grummer‐Strawn, S. Van Buuren et al. "Construction of the World Health Organization child growth standards: selection of methods for attained growth curves." Statistics in medicine 25, no. 2 (2006): 247-265.
https://doi.org/10.1002/sim.2227

It looks to me like their power exponential distribution is gennorm in scipy, with a slightly different parameterization, normed so that the standard distribution has mean 0 and variance 1.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gennorm.htm
https://en.wikipedia.org/wiki/Generalized_normal_distribution#Version_1

the messy part is that the transformation with y> 0 requires truncation which they seem to ignore mostly in practice

gennorm has an explicit "special" cdf and pdf using gammaincc and gammainccinv

gennorm is symmetric so it might be good to combine with skewing transformations

Was this page helpful?
0 / 5 - 0 ratings

Related issues

stoffprof picture stoffprof  Â·  4Comments

bashtage picture bashtage  Â·  3Comments

jseabold picture jseabold  Â·  4Comments

big-o picture big-o  Â·  3Comments

akiross picture akiross  Â·  4Comments