Is there any plan to add a Student T distribution as a possible distribution in the future? I'd love to help anyway I could
No current plan, but I would welcome it! If you want to try adding it to the distributions.pyx
file, with appropriate unit tests in the tests/test_distributions.py file, that would be great! You might want to follow the NormalDistribution
object as a guide. Let me know if you want to take a stab at it.
Okay, I'll try it and send a PR
Do you need any help to get started with this? I know that it can be difficult to get started with Cython.
I haven't started yet because I was dealing with a paper, but I will start on Friday. That said, I will probably need help, but I hope to do it by opening a PR as I progress so you can make sure I'm on the right track
Of course. There is no rush. Open a [WIP] PR any time you feel like you have some progress you'd like me to look over. And if you feel you can't finish it for one reason or another, there is no pressure. :)
Hey, my biggest hiccup is trying to understand what the summarize and from_summaries functions do and how they work.
Essentially, the summarize method takes a batch of data and reduces it to a set of sufficient statistics that are additive with previous sufficient statistics. That way, datasets can be partitioned into batches easily for things like parallelization and out-of-core learning. No modification is made to the model / distribution in this step; only the stored sufficient statistics are updated. The from_summaries method takes these additive statistics and uses them to actually update the parameters of the model / distribution.
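To make the pattern concrete, here is a tiny pure-Python sketch of the idea. The real pomegranate implementation lives in Cython and differs in detail; the class name here is purely illustrative.

```python
import math

class StreamingNormal:
    """Toy illustration (not pomegranate's actual code) of the
    summarize / from_summaries pattern for a normal distribution."""

    def __init__(self, mu=0.0, sigma=1.0):
        self.mu = mu
        self.sigma = sigma
        # Additive sufficient statistics: count, sum, sum of squares.
        self._n = 0.0
        self._sum = 0.0
        self._sum_sq = 0.0

    def summarize(self, batch):
        # Reduce a batch to statistics that simply add onto prior ones.
        # The parameters mu / sigma are NOT touched here.
        for x in batch:
            self._n += 1
            self._sum += x
            self._sum_sq += x * x

    def from_summaries(self):
        # Turn the accumulated statistics into a parameter update.
        if self._n == 0:
            return
        self.mu = self._sum / self._n
        var = self._sum_sq / self._n - self.mu ** 2
        self.sigma = math.sqrt(max(var, 0.0))
        self._n = self._sum = self._sum_sq = 0.0

d = StreamingNormal()
d.summarize([1.0, 2.0])   # first batch
d.summarize([3.0, 4.0])   # second batch: statistics just add
d.from_summaries()
# d.mu is now 2.5, identical to fitting all four points at once
```

Because the three stored statistics are additive, summarizing two batches separately yields exactly the same update as summarizing their concatenation, which is what makes batching, parallelization, and out-of-core learning fall out for free.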
The kernel density models seem to just memorize the input data. Is this true, or am I misinterpreting this class? (Sorry, I'm not extremely familiar with Cython.)
You're correct. That's how kernel densities work: they are non-parametric distributions for which the summaries are just the datasets. Essentially, you can think of them as a distribution that is made up of a lot of tiny Gaussian distributions (for GaussianKernelDensity) placed on top of each sample. If many samples cluster together, the empirical PDF grows with the overlap between these Gaussian distributions. There are likely more efficient ways to store them, but this is just the naive approach.
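A minimal sketch of that intuition, assuming a fixed bandwidth of 1; the function name is hypothetical, not pomegranate's API:

```python
import math

def gaussian_kde_logpdf(x, points, bandwidth=1.0):
    """Toy sketch of what a Gaussian kernel density does conceptually:
    place a small Gaussian on every stored sample and average them.
    `points` is simply the memorized training data."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * bandwidth)
    density = sum(
        norm * math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
        for p in points
    ) / len(points)
    return math.log(density)

data = [0.0, 0.1, 0.2, 5.0]
near = gaussian_kde_logpdf(0.1, data)  # inside the cluster of three points
far = gaussian_kde_logpdf(5.0, data)   # only one memorized sample out here
```

Overlapping kernels reinforce each other, so `near` comes out larger than `far`: the density is highest where the training samples cluster.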
Hi @jmschrei @arose13,
any news on this? I could not find it in master or a PR. I can volunteer to pick this up, let me know.
S
Hey @sdia,
As far as I got, I could not think of a streaming way to fit the Student's t distribution, the degrees-of-freedom parameter in particular.
Hello @arose13, thank you for the reply! I am an absolute beginner on this, not even sure I understand your message :-) Anyway, I understand from your answer that this is still open, so I will work on it; btw feel free to share with me anything you already have.
Thank you guys! I love this project!
@sdia I unfortunately already removed the repo from my computer when I gave up.
Check out this link for the distribution interface/API: Custom distribution
For example, the parameters of a normal distribution can be solved for using summary statistics obtained from a 'stream' of data: mean = sum(x_i) / N, where you can imagine just counting N as you iterate through the data, and the sum is just the running sum.
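That running computation can be sketched in a few lines:

```python
# Running-sum sketch of the streaming mean described above: partition
# the data into batches however you like; the statistics simply accumulate.
n, total = 0, 0.0
for batch in ([1.0, 2.0], [3.0, 4.0, 5.0]):
    for x in batch:
        n += 1          # count N as you stream past each point
        total += x      # running sum of the x_i
mean = total / n        # 3.0, same as the mean of the full dataset
```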
My problem was that I could not think of any way of doing the same for the nu/degrees-of-freedom parameter of the Student's t distribution.
@arose13 thanks for taking a stab at it though!
@sdia the idea is that all distributions / models have a summarize and a from_summaries method that are used to distill a data set down to its sufficient statistics (the moments), and to update the parameters of the model based on the stored sufficient statistics. These statistics are additive, and so one can summarize one batch of data, then summarize the next batch of data, and get the same update as if they'd seen both batches together initially. This allows for a bunch of cool stuff, like out-of-core learning, minibatch learning, parallelized learning, semi-supervised learning, and learning with missing data. Slides 11-14 here can help explain: https://github.com/jmschrei/pomegranate/blob/master/slides/pomegranate%20PyData%20NYC%202017.pdf
In this setup, the fit call is simply a call to both summarize and from_summaries together. In your personal case, you could redefine summarize to function the same as fit and ignore from_summaries. This would work when you're only calling fit, but it would prevent you from using these other features. If those features aren't too relevant, no harm.
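A hypothetical skeleton of that relationship (the names mirror the discussion, not the actual Cython classes, and the bodies just record the call order):

```python
class Distribution:
    """Hypothetical skeleton showing that fit is simply
    summarize followed by from_summaries."""

    def __init__(self):
        self.calls = []

    def summarize(self, X):
        # Would accumulate additive sufficient statistics from X.
        self.calls.append("summarize")

    def from_summaries(self):
        # Would turn the stored statistics into new parameters.
        self.calls.append("from_summaries")

    def fit(self, X):
        # fit is just the two phases chained together.
        self.summarize(X)
        self.from_summaries()

d = Distribution()
d.fit([1.0, 2.0])
```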
Hey @jmschrei, your explanation is really clear: being able to summarize a distribution using additive statistics is very useful.
But I have a newbie question: after thinking it over, it does not make sense to me to summarize the Student's t distribution, as its only parameter is the dof. Is there any use case where you would want to fit a t-distribution? It's like trying to fit the standard normal distribution, which has its mean and sigma fixed. What am I missing?
Ref: https://en.wikipedia.org/wiki/Student%27s_t-distribution
I think that the typical Student's t-distribution has mean 0 with sigma corresponding to the dof. I'm not sure whether that is the only formulation of the t-distribution. Would it be possible to have a non-zero-centered t-distribution that's calculated in the same manner as a normal distribution, but whose probabilities have larger tails? I don't know the answer to this question. I would suggest asking someone who is more familiar with the t-distribution. However, you can still implement the log probability calculation that takes in only a dof and calculates log probabilities while we look for the answer to this question.
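For reference, a hedged sketch of what that dof-only log probability could look like (illustrative, not the eventual pomegranate code), using the standard density formula:

```python
import math

def student_t_logpdf(x, df):
    """Sketch of the standard (location 0, scale 1) Student's t
    log probability, parameterized only by the degrees of freedom df."""
    half = (df + 1.0) / 2.0
    return (math.lgamma(half) - math.lgamma(df / 2.0)
            - 0.5 * math.log(df * math.pi)
            - half * math.log1p(x * x / df))

# Sanity check: with df = 1 this is the Cauchy distribution,
# whose log density at 0 is -log(pi).
```

The heavier tails show up immediately: at a point like x = 4, the df = 1 curve assigns far more log probability than a standard normal would.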
All right @jmschrei, please have a look at https://github.com/jmschrei/pomegranate/pull/480.
S
Hey folks,
I just wanted to clarify a point: the Student's t distribution takes three parameters (location, scale, and degrees of freedom).
Only the degrees of freedom affects the weight of the tails.
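A sketch of that three-parameter (location-scale) form, extending the standard dof-only formula; the function name is illustrative:

```python
import math

def t_logpdf(x, df, loc=0.0, scale=1.0):
    """Sketch of the three-parameter (location-scale) Student's t.
    loc and scale shift/stretch the curve like mu/sigma of a normal;
    only df changes how heavy the tails are."""
    z = (x - loc) / scale
    half = (df + 1.0) / 2.0
    return (math.lgamma(half) - math.lgamma(df / 2.0)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - half * math.log1p(z * z / df))

# Shifting loc just translates the curve; raising df thins the tails,
# so far from the center a low-df t is more probable than a high-df t.
```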
You should be able to implement your own distributions now with the addition of #521