Scipy: fitting discrete distributions

Created on 27 Apr 2020  Â·  4Comments  Â·  Source: scipy/scipy

Actually we can use scipy.stats.rv_continuous.fit method to extract the parameters for a theoretical continuous distribution from empirical data, however, it is not implemented for discrete distributions e.g. negative binomial and Poisson... may it be implemented in a near future?

enhancement scipy.stats

Most helpful comment

discrete distributions taken from here:
https://github.com/scipy/scipy/blob/master/scipy/stats/_distr_params.py#L115

| dist name | mle | mme |
| ----------|-------|------|
| bernoulli | sample mean |. |
|betabinom| |mme |
|binom| | n + p only p only n* |
|boltzmann| see note |. |
|dlaplace|mle^2 |. |
|geom|mle |. |
|hypergeom|. |. |
|logser| |. |
|nbinom|mle^1 |. |
|planck| |. |
|poisson|mle | |
|randint|mle for upper bound |. |
|skellam|mle^3 |. |
|zipf| |. |
|yulesimon| mle | only if >1 |

  • note there are several proposals for the 'only n' case.
    ^1 note the UMVUE dominates the MLE, and MLE is biased, and is not the MME
    ^2 MLE \hat{p}= \frac{T}{1+\sqrt{1+T}} where T=\sum_{i=1}^n | x_i| eqn (32) in linked paper and immediately afterwards they state that MLE = MME and they also give a distributional result.
    ^3 the MLE for Skellam is a special case of the root finding problem considered in section 3.2 of that paper, corresponds to COM-Poisson with the \nu_i =1 for i =1,2

Boltzmann note: Boltzmann is a truncated geometric distribution. An MLE for the probability of a geometric exists, so incorporation of a truncation point should be straightfoward. Estimation of truncation point is not known (to me) but also straightforward to work out. Joint estimation of truncation point and probability would need to be worked out. Unclear to me at this time if there are technical difficulties in this joint case.

Note: the remaining distributions: logseries, hypergeometric, planck, zipf, may exist but I was unable to find references.

Existence of estimators in discrete case is better than I expected but note that if a fit method is added this could preclude inclusion of some distributions. For example, there is an open PR with a poisson-binomial distribution, AFAIK there is no known MLE for poisson-binomial.

After doing the research I've gone from a -0.0 on this to a +0.3 on adding the fit. I'd recommend to open a discussion thread on the topic on the scipy-dev list. You'll reach more people there than here and get better opinions than mine. :)

All 4 comments

I agree this would be quite nice to have for the discrete distributions in scipy.stats.rv_discrete, and I would be happy to work on a PR if people think it's worthwhile. For a number of the current distributions (e.g. Poisson obviously), there are closed form solutions to the maximum likelihood estimates, others need optimizers / iterative algorithms, e.g. beta-binomial.

@WarrenWeckesser thought you should be included here.

I'd be happy to help review the discrete fit method PRs. Note that there is currently a lot of work in progress on the continuous side of things. gh-11695 adds method of moments, and @swallan has been looking into which distributions have closed form solutions for MLE and MM fits in gh-11782. Would someone be willing to help review gh-11695?

On Apr 27, 2020, at 8:36 PM, Matt Haberland notifications@github.com wrote:


@WarrenWeckesser thought you should be included here.

I'd be happy to help review the discrete fit method PRs.

Note some distributions have some fairly complicated procedures and constraints. For example, a binomial distribution with a known success probability but unknown n parameter is non standard and I’m not sure how to handle the case if both p and n are not specified, there is probably a paper somewhere on that case though.

I’m not opposed to the idea but it would be nice to see some research h into which discrete distributions we have that have closed form estimators, which require a root finding, and which are (currently) unknown.
IIRC there are currently 14 or so discrete distributions so shouldn’t be too excessive to research.

If we agree to support we’d want to make sure we could support the list we have for discrete.

If others agree that support for discrete distribution fits should be supported then I’d recommend open a tracking issue with distributions listed and pursue one distribution per PR, e.g. one distribution at a time.

Note that there is currently a lot of work in progress on the continuous side of things. gh-11695 adds method of moments, and @swallan has been looking into which distributions have closed form solutions for MLE and MM fits in gh-11782.

@mdhaber

Would someone be willing to help review gh-11695?

If I have time this weekend on Sunday I’ll take another look at 11695.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or unsubscribe.

discrete distributions taken from here:
https://github.com/scipy/scipy/blob/master/scipy/stats/_distr_params.py#L115

| dist name | mle | mme |
| ----------|-------|------|
| bernoulli | sample mean |. |
|betabinom| |mme |
|binom| | n + p only p only n* |
|boltzmann| see note |. |
|dlaplace|mle^2 |. |
|geom|mle |. |
|hypergeom|. |. |
|logser| |. |
|nbinom|mle^1 |. |
|planck| |. |
|poisson|mle | |
|randint|mle for upper bound |. |
|skellam|mle^3 |. |
|zipf| |. |
|yulesimon| mle | only if >1 |

  • note there are several proposals for the 'only n' case.
    ^1 note the UMVUE dominates the MLE, and MLE is biased, and is not the MME
    ^2 MLE \hat{p}= \frac{T}{1+\sqrt{1+T}} where T=\sum_{i=1}^n | x_i| eqn (32) in linked paper and immediately afterwards they state that MLE = MME and they also give a distributional result.
    ^3 the MLE for Skellam is a special case of the root finding problem considered in section 3.2 of that paper, corresponds to COM-Poisson with the \nu_i =1 for i =1,2

Boltzmann note: Boltzmann is a truncated geometric distribution. An MLE for the probability of a geometric exists, so incorporation of a truncation point should be straightfoward. Estimation of truncation point is not known (to me) but also straightforward to work out. Joint estimation of truncation point and probability would need to be worked out. Unclear to me at this time if there are technical difficulties in this joint case.

Note: the remaining distributions: logseries, hypergeometric, planck, zipf, may exist but I was unable to find references.

Existence of estimators in discrete case is better than I expected but note that if a fit method is added this could preclude inclusion of some distributions. For example, there is an open PR with a poisson-binomial distribution, AFAIK there is no known MLE for poisson-binomial.

After doing the research I've gone from a -0.0 on this to a +0.3 on adding the fit. I'd recommend to open a discussion thread on the topic on the scipy-dev list. You'll reach more people there than here and get better opinions than mine. :)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

leynier picture leynier  Â·  3Comments

gvr37leo picture gvr37leo  Â·  3Comments

alonemanuel picture alonemanuel  Â·  4Comments

asteppke picture asteppke  Â·  5Comments

fakekarma picture fakekarma  Â·  3Comments