Is there any plan to add a Student T distribution as a possible distribution in the future? I'd love to help anyway I could
No current plan, but I would welcome it! If you want to try adding it to the distributions.pyx
file, with appropriate unit tests in the tests/test_distributions.py file, that would be great! You might want to follow the NormalDistribution
object as a guide. Let me know if you want to take a stab at it.
Okay, I'll try it and send a PR
Do you need any help to get started with this? I know that it can be difficult to get started with Cython.
I haven't started yet because I was dealing with a paper, but I will start on Friday. That said, I will probably need help, but I hope to do it by opening a PR as I progress so you can make sure I'm on the right track
Of course. There is no rush. Open a [WIP] PR any time you feel like you have some progress you'd like me to look over. And if you feel you can't finish it for one reason or another, there is no pressure. :)
Hey, my biggest hiccup is trying to understand what the summarize and from_summaries functions do and how they work.
Essentially, the summarize method takes a batch of data and reduces it to a set of sufficient statistics that are additive with previous sufficient statistics. That way, datasets can be partitioned into batches easily for things like parallelization and out-of-core learning. No modification is made to the model / distribution in this step; only the stored sufficient statistics are updated. The from_summaries method takes these additive statistics and uses them to actually update the parameters of the model / distribution.
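To make the pattern concrete, here is a tiny pure-Python sketch of the idea. The real pomegranate implementation lives in Cython and differs in detail; the class name here is purely illustrative.

```python
import math

class StreamingNormal:
    """Toy illustration (not pomegranate's actual code) of the
    summarize / from_summaries pattern for a normal distribution."""

    def __init__(self, mu=0.0, sigma=1.0):
        self.mu = mu
        self.sigma = sigma
        # Additive sufficient statistics: count, sum, sum of squares.
        self._n = 0.0
        self._sum = 0.0
        self._sum_sq = 0.0

    def summarize(self, batch):
        # Reduce a batch to statistics that simply add onto prior ones.
        # The parameters mu / sigma are NOT touched here.
        for x in batch:
            self._n += 1
            self._sum += x
            self._sum_sq += x * x

    def from_summaries(self):
        # Turn the accumulated statistics into a parameter update.
        if self._n == 0:
            return
        self.mu = self._sum / self._n
        var = self._sum_sq / self._n - self.mu ** 2
        self.sigma = math.sqrt(max(var, 0.0))
        self._n = self._sum = self._sum_sq = 0.0

d = StreamingNormal()
d.summarize([1.0, 2.0])   # first batch
d.summarize([3.0, 4.0])   # second batch: statistics just add
d.from_summaries()
# d.mu is now 2.5, identical to fitting all four points at once
```

Because the three stored statistics are additive, summarizing two batches separately yields exactly the same update as summarizing their concatenation, which is what makes batching, parallelization, and out-of-core learning fall out for free.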
The kernel density models seem to just memorize the input data. Is this true, or am I misinterpreting this class? (Sorry, I'm not extremely familiar with Cython.)
You're correct. That's how kernel densities work: they are non-parametric distributions for which the summaries are just the datasets. Essentially, you can think of them as a distribution that is made up of a lot of tiny Gaussian distributions (for GaussianKernelDensity) placed on top of each sample. If many samples cluster together, the empirical PDF grows with the overlap between these Gaussian distributions. There are likely more efficient ways to store them, but this is just the naive approach.
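A minimal sketch of that intuition, assuming a fixed bandwidth of 1; the function name is hypothetical, not pomegranate's API:

```python
import math

def gaussian_kde_logpdf(x, points, bandwidth=1.0):
    """Toy sketch of what a Gaussian kernel density does conceptually:
    place a small Gaussian on every stored sample and average them.
    `points` is simply the memorized training data."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * bandwidth)
    density = sum(
        norm * math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
        for p in points
    ) / len(points)
    return math.log(density)

data = [0.0, 0.1, 0.2, 5.0]
near = gaussian_kde_logpdf(0.1, data)  # inside the cluster of three points
far = gaussian_kde_logpdf(5.0, data)   # only one memorized sample out here
```

Overlapping kernels reinforce each other, so `near` comes out larger than `far`: the density is highest where the training samples cluster.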
Hi @jmschrei @arose13,
any news on this? I could not find it in master or a PR. I can volunteer to pick this up, let me know.
S
Hey @sdia,
As far as I got, I could not think of a streaming way to fit the Student's t distribution, the degrees-of-freedom parameter in particular.
Hello @arose13, thank you for the reply! I am an absolute beginner on this, not even sure I understand your message :-) Anyway, I understand from your answer that this is still open, so I will work on it; btw feel free to share with me anything you already have.
Thank you guys! I love this project!
@sdia I unfortunately already removed the repo from my computer when I gave up.
Check out this link for the distribution interface/API: Custom distribution
For example, the parameters of a normal distribution can be solved for using summary statistics obtained from a 'stream' of data: mean = sum(x_i) / N, where you can imagine just counting N as you iterate through the data, and the sum is just the running sum.
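That running computation can be sketched in a few lines:

```python
# Running-sum sketch of the streaming mean described above: partition
# the data into batches however you like; the statistics simply accumulate.
n, total = 0, 0.0
for batch in ([1.0, 2.0], [3.0, 4.0, 5.0]):
    for x in batch:
        n += 1          # count N as you stream past each point
        total += x      # running sum of the x_i
mean = total / n        # 3.0, same as the mean of the full dataset
```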
My problem was that I could not think of any way of doing the same for the nu/degrees-of-freedom parameter of the Student's t distribution.
@arose13 thanks for taking a stab at it though!
@sdia the idea is that all distributions / models have a summarize and a from_summaries method that are used to distill a data set down to its sufficient statistics (the moments), and to update the parameters of the model based on the stored sufficient statistics. These statistics are additive, and so one can summarize one batch of data, then summarize the next batch of data, and get the same update as if they'd seen both batches together initially. This allows for a bunch of cool stuff, like out-of-core learning, minibatch learning, parallelized learning, semi-supervised learning, and learning with missing data. Slides 11-14 here can help explain: https://github.com/jmschrei/pomegranate/blob/master/slides/pomegranate%20PyData%20NYC%202017.pdf
In this setup, the fit call is simply a call to both summarize and from_summaries together. In your personal case, you could redefine summarize to function the same as fit and ignore from_summaries. This would work when you're only calling fit, but it would prevent you from using these other features. If those features aren't too relevant, no harm.
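A hypothetical skeleton of that relationship (the names mirror the discussion, not the actual Cython classes, and the bodies just record the call order):

```python
class Distribution:
    """Hypothetical skeleton showing that fit is simply
    summarize followed by from_summaries."""

    def __init__(self):
        self.calls = []

    def summarize(self, X):
        # Would accumulate additive sufficient statistics from X.
        self.calls.append("summarize")

    def from_summaries(self):
        # Would turn the stored statistics into new parameters.
        self.calls.append("from_summaries")

    def fit(self, X):
        # fit is just the two phases chained together.
        self.summarize(X)
        self.from_summaries()

d = Distribution()
d.fit([1.0, 2.0])
```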
Hey @jmschrei, your explanation is really clear: being able to summarize a distribution using additive statistics is very useful.
But I have a newbie question: after thinking it over, it does not make sense to me to summarize the Student's t distribution, as its only parameter is the dof. Is there any use case where you would want to fit a t-distribution? It's like trying to fit the standard normal distribution, which has its mean and sigma fixed. What am I missing?
Ref: https://en.wikipedia.org/wiki/Student%27s_t-distribution
I think that the typical Student's t-distribution has mean 0 with sigma corresponding to the dof. I'm not sure whether that is the only formulation of the t-distribution. Would it be possible to have a non-zero-centered t-distribution that's calculated in the same manner as a normal distribution, but whose probabilities have larger tails? I don't know the answer to this question. I would suggest asking someone who is more familiar with the t-distribution. However, you can still implement the log probability calculation that takes in only a dof and calculates log probabilities while we look for the answer to this question.
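For reference, a hedged sketch of what that dof-only log probability could look like (illustrative, not the eventual pomegranate code), using the standard density formula:

```python
import math

def student_t_logpdf(x, df):
    """Sketch of the standard (location 0, scale 1) Student's t
    log probability, parameterized only by the degrees of freedom df."""
    half = (df + 1.0) / 2.0
    return (math.lgamma(half) - math.lgamma(df / 2.0)
            - 0.5 * math.log(df * math.pi)
            - half * math.log1p(x * x / df))

# Sanity check: with df = 1 this is the Cauchy distribution,
# whose log density at 0 is -log(pi).
```

The heavier tails show up immediately: at a point like x = 4, the df = 1 curve assigns far more log probability than a standard normal would.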
All right @jmschrei, please have a look at https://github.com/jmschrei/pomegranate/pull/480.
S
Hey folks,
I just wanted to clarify a point: the Student's t distribution takes three parameters (location, scale, and degrees of freedom).
Only the degrees of freedom affects the weight of the tails.
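A sketch of that three-parameter (location-scale) form, extending the standard dof-only formula; the function name is illustrative:

```python
import math

def t_logpdf(x, df, loc=0.0, scale=1.0):
    """Sketch of the three-parameter (location-scale) Student's t.
    loc and scale shift/stretch the curve like mu/sigma of a normal;
    only df changes how heavy the tails are."""
    z = (x - loc) / scale
    half = (df + 1.0) / 2.0
    return (math.lgamma(half) - math.lgamma(df / 2.0)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - half * math.log1p(z * z / df))

# Shifting loc just translates the curve; raising df thins the tails,
# so far from the center a low-df t is more probable than a high-df t.
```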
You should be able to implement your own distributions now with the addition of #521