Statsmodels: ENH: GAMLSS - Box-Cox Cole Green distribution

Created on 14 Feb 2018 · 6Comments · Source: statsmodels/statsmodels

Would it be possible to include a functionality for fitting a model like in GAMLSS? (already mentioned in issue #3915 and #2636)

I am thinking of a simple model with one independent and one dependent variable:
GAMLSS example
It should have the option to choose the link functions and distribution. I am interested in fitting the Box-Cox Cole Green distribution (BCCG) published by Cole (also called LMS method). This method has found application in many medical fields for creating reference curves.

Rigby and Stasinopoulos give enough information in their publications for the implementation (e.g. RS() fitting procedure):

R. A. Rigby and D. M. Stasinopoulos, “Smooth centile curves for skew and kurtotic data modelled using the Box-Cox power exponential distribution,” Stat. Med., vol. 23, no. 19, pp. 3053–3076, 2004.
R. A. Rigby, D. M. Stasinopoulos, and P. W. Lane, “Generalized additive models for location, scale and shape,” J. R. Stat. Soc. Ser. C Appl. Stat., vol. 54, no. 3, pp. 507–554, 2005.

I tried to implement it but was not so lucky so far.
.

Source

xi2pi

👍4

Most helpful comment

about the slow: you can replace for i in x loops by a vectorized numpy computation. Also 'nelder-mead' is slow. If the starting values are not very bad, then bfgs or lbfgsb will be much faster. Analytical derivatives will make it numerically more precise and most likely also faster.

I'm not sure whether something like the RS algorithm is necessary. It looks like a lot of work, and I would try first to see how far you can get with the scipy optimizers. IRLS in GLM is a fisher scoring variant, It might be possible to reuse GLM when all other shape parameter are fixed.
(I'm a bit doubtful about any claims that RS works well, especially the cases and datasets are less "nice")

josef-pkt on 15 Feb 2018

👍2

All 6 comments

Three parts:

the penalized splines are not yet available, PR still needs work (that I keep postponing). patsy splines work pretty well but amount of smoothing depends on number of knots or knot selection
distribution itself: I didn't look enough at the articles to see how the transformation is used to define the distribution. This might be relatively straightforward for given underlying distribution but might be more work to get all the elements for several possible underlying distributions.
a new model with a multiparameter distribution: This should be feasible now by using a standard likelihood model class (not within GLM which is for one-parameter families). We don't have anything currently as general as GAMLSS, BetaRegression (unmerged, #2030 #4238) was an early example for a two parameter distribution with two link functions. Last year we added count models with multi-parameter or multi-part models, that are however more restrictive (hardcoded link function or limited choice of individual parts)
e.g. GeneralizedPoisson #3727 merged after rebase in #3795 and NegativeBinomialP #3832, merged in #3874, and zero-inflated models for two-part models with possibly two exog)

In general, I think getting a prototype version to work should not be too difficult and requires loglikelihood and keeping track of and connecting/stacking the parameters.
Analytical derivatives for the optimization will most likely be more work and requires hand coding various partial derivatives that can be combined into the full derivatives using chain rule, e.g. that's what GLM does. In at least an initial version this can be replaced by numerical derivatives.

josef-pkt on 14 Feb 2018

👍2

I did not know that penalized splines are not available in Python. Parametric functions like polynoms as smoothing curves for median, etc. should work for the start
implementation of the distributions is quite easy. I implemented the Box-Cox Cole Green (BCCG) and the Box-Cox Power Exponential (BCPE) distribution in my new repo: gamlss
Fitting the distribution to some data is easy, too (one dimensional/1D fit, like Python's fitter). However, it is a bit slow which could be because of my poor programming skills. By minimizing the likelihood function, I get the same results like in GAMLSS and R.
As you said, it gets more complicated for 2D fitting. The loglikelihood approach fails because there are too many parameters to estimate. Cole and Stasinopoulos propose using Fisher scoring. For that, we need the derivatives (first and second order) and some coding. For BCPE, derivatives and other needed terms are given in Stasinopoulos (2004). Only missing part is the RS() algorithm. Should be doable.

That feature would definitely attract a lot of statisticians and other scientists.

xi2pi on 15 Feb 2018

👍1

josef-pkt on 15 Feb 2018

👍2

Changing the loop to a vector made it of course a lot faster, from 4.3 to 0.6 seconds, thanks! "Nelder-mead" is quite stable and fast, bfgs and lbfgsb did not really succeed (will try it later again). 1D fitting is now pretty good.

I tried to implement the 2D fitting of a BCPE distribution (Python code). I used a simple model with linear median (M) and the other parameters constant (S, L, T). Optimizing again with "nelder-mead" for five parameters (two for M, and one for S, L, T respectively). Results are promising, I get similar parameters like the gamlss fit:
python fit
M = 45.53*x + 1.935
S = 0.033,
L = 1.04,
T = 1.28

gamlss fit
M = 45.535*x + 1.929
S = 0.0312
L = 0.99
T = 1.29

However, results highly depend on the starting values for M, S, L and T. Initial values have to be picked wisely. Maybe other optimizer will help. Next step should be higher degrees of freedom for the parameters.

xi2pi on 16 Feb 2018

👍1

adding a reference from citing articles (I didn't look at it but abstract sounds interesting)

Borghi, Elaine, Mercedes de Onis, Cutberto Garza, Jan Van den Broeck, Edward A. Frongillo, Laurence Grummer‐Strawn, S. Van Buuren et al. "Construction of the World Health Organization child growth standards: selection of methods for attained growth curves." Statistics in medicine 25, no. 2 (2006): 247-265.
https://doi.org/10.1002/sim.2227

josef-pkt on 26 Jan 2021

It looks to me like their power exponential distribution is gennorm in scipy, with a slightly different parameterization, normed so that the standard distribution has mean 0 and variance 1.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gennorm.htm
https://en.wikipedia.org/wiki/Generalized_normal_distribution#Version_1

the messy part is that the transformation with y> 0 requires truncation which they seem to ignore mostly in practice

gennorm has an explicit "special" cdf and pdf using gammaincc and gammainccinv

gennorm is symmetric so it might be good to combine with skewing transformations

josef-pkt on 26 Jan 2021

Was this page helpful?

0 / 5 - 0 ratings