RVM is a Bayesian framework for obtaining sparse solutions to regression and classification tasks. It used a model of identical form to SVM ( Support Vector Machine). It solves the following disadvantages of SVM:
-- The number of basis function in SVM grows linearly with the size of the training set
In RVM, we start with 0 basis and incrementally update (add/delete) the set of basis function until convergence.
-- SVM predictions are not probabilistic while RVM's are probabilistic
-- It is necessary in SVM to estimate the margin trade-off parameter 'C' which is not the case in RVM
-- SVM kernel should be positive definite. In RVM we can use any kernel.
It is already implemented in dlib http://dlib.net/dlib/svm/rvm_abstract.h.html and there is also a matlab implementation here http://www.vectoranomaly.com/downloads/downloads.htm. These codes should serve as a guide.
I think it will be a good idea to add it to scikit-learn.
References :
1- Tipping, M. E. and A. C. Faul (2003). Fast marginal likelihood maximisation for sparse Bayesian models. In C. M. Bishop and B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, Jan 3-6.
2- Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211–244.
I'd have to read up on it again but in general I think RVMs would be a good addition.
dlib is boost licensed which should be compatible. It might not be so easy to wrap because of the boost-heavy coding style, though.
Is the problem optimized using SMO? Is it sensible if we implement SMO?
Gotta grab my Bishop.
What is the relation between ARD and RVM? Is RVM just the "basis function" version of ARD?
Btw is anyone ever bothered by the fact that the section Generalized Linear Models
doesn't contain any generalized models?
Ok so we should use the sequential sparse learning algorithm
Bishop p. 352 following I guess?
Know yourself out ;)
I wonder whether there is a similar method for ARD? That would be cool as the current ARD implementation is quite slow :-/
No the implementation of RVM definitely does not use SMO. I think SMO is only used for SVM optimization.
Yes we should use the sequential sparse learning algorithm
in reference 1 page 7. ( is it in bishop p 352? which one). This algorithm is "simple" enough and we can write it without using dlib. I was thinking about writing it in python and then use cython for optimization. In this case we can take full advantage of the matlab implementation. What do you think?
Anyway, It should be possible to write in C++. But for that we will need a good linear algebra library in C++. I am not sure if by default scikit-learn comes with one.
Bishop is "Machine Learning and Pattern Recognition".
The algorithm is probably not completely easy to get right. If you want to start on this, it will definitely be a bigger project. If you are interested and want to implement it any way, go ahead.
Implementing it is also quite a bit different from getting it into scikit-learn. That also involves writing tests, documentation and a user guide - and pretty code.
For a first try you should definitely use just numpy. It uses blas internally and is therefore quite fast.
Speeding up using Cython only makes sense if there is a lot of python overhead. If all the time is spent in the BLAS calls, using Cython doesn't make much sense.
OK for Cython and numpy. I didn't know bishop talk about RVM.
For the ARD and RVM relationship. I don't know a lot about ARD. But in reference 2 the authors said that RVM is based on ARD : "We term those training vectors associated with the remaining non-zero weights 'relevance' vectors, in deference to the principle of automatic relevance determination which motivates the presented approach" Page 3 (213) line 8.
Anyway how does ARD works?
ARD is also explained in Bishops book and in the user guide. It puts a diagonal Gaussian prior on the weights and tries to estimate the variances, which (as I understand it) is the same that RVM does. Is that correct?
I realize that the ref mentioned :
http://books.nips.cc/papers/files/nips20/NIPS2007_0976.pdf
is not the implementation we use. I think that @vmichel implemented the
Bishop approach while this paper purposes a fixed point approach similar to
a coordinate descent approach. This code definitely needs some love...
Thanks @agramfort, I was wondering about that. I didn't go through the details but I thought as the paper was the only reference...
I would very much appreciate it if you could add a comment there citing the chapter of bishop that was used an maybe saying we should implement the other paper instead.
(btw of of the slowest things in the test suite right now is fitting ARD on the boston dataset in the common tests)
Thanks @agramfort, I was wondering about that. I didn't go through the details but I thought as the paper was the only reference...
I would very much appreciate it if you could add a comment there citing the chapter of bishop that was used an maybe saying we should implement the other paper instead.
see : https://github.com/scikit-learn/scikit-learn/pull/1530
Btw is anyone ever bothered by the fact that the section Generalized Linear
Models doesn't contain any generalized models?
It does: logistic regressions.
I wonder whether there is a similar method for ARD? That would be cool as the
current ARD implementation is quite slow :-/
I believe that the most promising rapid solver of ARD is to implement the
strategy exposed in:
http://books.nips.cc/papers/files/nips20/NIPS2007_0976.pdf
I put on a gist a code wrote a while ago.
If somebody wants to work on ARD I thought it could be useful.
https://gist.github.com/4494613
WARNING : it's not much tested and I don't guarantee correctness but it
seems to work.
@amueller
I have analysed ARD in http://books.nips.cc/papers/files/nips20/NIPS2007_0976.pdf and I think RVM and ARD want to optimize the same objective function. The difference appears in the method used to optimize this function. In RVM, the authors noticed that most of the weight will be near zero and they used this to derive a "fast" algorithm.
That sounds odd. If the objective is the same, you should be able to use the same methods for optimization, right?
Yes for sure you should, but I guess the authors of RVM used a different optimization strategy to have a faster and more sparse algorithm.
@yedtoss I'm pretty sure there is some other difference. As I said before, this might be that RVM work in a feature space or with a kernel or something. Otherwise you could just replace the ARD implementation? That is regression, though, and you want classification, right?
@agramfort do you know anything about the difference of ARD and RVM?
@amueller
Initially RVM is a regression technique. But the authors presented a way to use it for classification. RVM used any kind of kernel.
What I mean is the log likelihood of ARD (equation 2 of A New View of Automatic Relevance Determination) and of RVM ( equation 7 of Fast marginal likelihood maximisation for sparse Bayesian models) are identical.
I guess I'd have to read the papers to know whats going on....
sorry guys I am not much of a Bayesian guy... I don't know well the
subtilties...
RVMs are patented by Microsoft.
crazy.
@larsmans @amueller while there is a patent in the US for RVM, the author recommends a GPLv2 Matlab implementation on his web page, so I guess it is ok to implement it...
http://www.miketipping.com/sparsebayes.htm
Best,
Angelos
@kalxas License and patent are quite orthogonal and GPLv2 in particular didn't address software patents. The rights you have with such an implementation are the intersection of the rights granted by the GPL and those granted by the patent holder.
That said, I found out in the meantime that support vector machines are patented by AT&T but the patent was apparently never enforced. If something similar can be proven of RVMs, I might change my mind about them.
@larsmans I wrote a pure numpy/python port of dlib implementation (awfully slow at the moment, I'll try to cythonize it). According to the header, dlib's implem has been around since 2008 and they seem fine with it. Would you consider changing your mind about having RVM in sklearn ?
Let's hear @GaelVaroquaux's opinion on this. The dlib implem doesn't show a thing as long as you can't prove it's widely used without a patent license.
Are there any updates on this topic? I've been looking into RVMs lately and was wondering if there was any code around there...
I don't think anyone has tried for a fast implementation, and we are still unsure about the legal status.
@jlopezpena Take a look at dlib, the code is pretty clear and it's header (templates) only. It should be fairly easy to build a C extension for use from numpy
Hi everyone,
I recently translated Mike Tipping’s freely-available SparseBayes MATLAB program, which mainly implements the RVM, from MATLAB into Python. It can be found here: https://github.com/jhallock7/SparseBayes-Python . I contacted Mike Tipping, and he said that Microsoft’s patent is only for the original slow algorithm, whereas the SparseBayes program uses the faster one found here: http://www.miketipping.com/papers/met-fastsbl.pdf . So it would be fine if some form of his program was folded into scikit-learn. I am relatively new to Python, so my translation can undoubtedly be improved or modified.
Thanks for wanting to contribute and also thanks for checking the patent status.
There is another question though, which is if this algorithms is widely useful.
I haven't looked at the algorithm and its use since I last read Bishop's book, which is a while ago.
I think an interesting example would be to show that it either gives better uncertainty then calibrating
an SVM, or that it is faster (calibrating an SVM and searching over C needs a lot of cross-validation).
@amueller RVM (mainly revelance vector regression (RVR)) is pretty useful in neuroimaging data analysis. Lots of papers uses this method rather than SVR to predict. It will be perfect, if this method can be added to the scikit learn toolbox.
@amueller I implemented slow version of RVM which can use either EM or fixed-point algorithm to fit model (mainly for learning / academic purpose) and major difference between RVM and SVR that I noted from couple of examples is sparsity i.e. number of 'support' vectors used in prediction . In many cases RVM produces results comparable with SVR with number of support vectors being only a fraction of what SVR uses
(here is one simple example that is also used in Tipping 2001)
@amueller (Adding to previous comment) And obviously small number of support vectors will imply very fast prediction.
Another advantage of RVM is probabilistic treatmen. With RVM for every data point in test set you find not only point estimate but also predictive distribution
does RVR provide probability distributions, too?
Sounds like RVR and RVM are reasonable candidates for inclusion. I'm not sure about the state of the art algorithms, though. Is it http://www.miketipping.com/papers/met-fastsbl.pdf ? That seems pretty old. The SparseBayes coding style is.... interesting, and I think it would serve better as a reference than as a basis for the sklearn implementation.
Yes RVR provides probability distributions, however in some cases variance of predictive distribution can be smaller for data points outside the domain of the training set example.
To the best of my knowledge paper you mentioned is last version of RVM it also corresponds to Matlab implementation on Tipping's website (version 2).
I also found interesting comparison of RVM and SVM speed in Kevin Murphy's book:
"RVM is also fastest to train. This is despite the fact that RVM code is in Matlab and the SVM code is in C"(Chapter 14, p.490). However it seems that they made comparisons for small datasets only.
@amueller RVM (mainly revelance vector regression (RVR)) is pretty useful in
neuroimaging data analysis.
I am not convinced (and I do neuroimaging). I haven't seen a good
empirical comparison.
I see the way to go is to have RVMs in a separate package, with
scikit-learn API, and encourage good empirical work to show their
usefulness. If they are useful, merge them in scikit-learn.
Some neuroimaging studies have used relevance vector regression (RVR) and
made comparisons between RVR and SVR.
To list a few:
http://www.sciencedirect.com/science/article/pii/S1053811910000108
http://www.sciencedirect.com/science/article/pii/S1053811910012644
http://www.sciencedirect.com/science/article/pii/S1053811910003459
http://www.nature.com/npp/journal/v39/n3/abs/npp2013251a.html
And, the RVR is implemented in a pattern recognition toolbox for
neuroimaging data:
http://www.mlnl.cs.ucl.ac.uk/pronto/
Hope that RVR can be incorporated in scikit-learn.
Best wishes
Zaixu
On Thu, Oct 15, 2015 at 12:57 PM, Gael Varoquaux [email protected]
wrote:
@amueller RVM (mainly revelance vector regression (RVR)) is pretty
useful in
neuroimaging data analysis.I am not convinced (and I do neuroimaging). I haven't seen a good
empirical comparison.I see the way to go is to have RVMs in a separate package, with
scikit-learn API, and encourage good empirical work to show their
usefulness. If they are useful, merge them in scikit-learn.—
Reply to this email directly or view it on GitHub
https://github.com/scikit-learn/scikit-learn/issues/1513#issuecomment-148281784
.
I see the way to go is to have RVMs in a separate package, with scikit-learn API, and encourage good empirical work to show their usefulness. If they are useful, merge them in scikit-learn.
+1
I mean there is https://github.com/AmazaspShumik/Bayesian-Regression-Methods/blob/master/Relevance%20Vector%20Machine%20%26%20ARD/rvm.py which looks relatively compatible. needs set_params
and get_params
or inheriting from BaseEstimator
.
And there is https://github.com/jhallock7/SparseBayes-Python which could be wrapped.
@ZaixuCui why do you want it to be in scikit-learn when there is a ready to use implementation out there?
I tend to agree with @GaelVaroquaux and @mblondel . If no-one published on algorithms in nearly ten years, people don't seem to be very interested. [oh, the standard algorithm is from 2003 even. but then again libsvm is 2005]
Because I use scikit learn to do SVR, elastic-net.
So, if there is a RVR implementation, I will not need to do use matlab when
do machine learning analysis.
Thanks very much.
Wishes you all the best
Zaixu
On Thu, Oct 15, 2015 at 11:13 AM, Andreas Mueller [email protected]
wrote:
I mean there is
https://github.com/AmazaspShumik/Bayesian-Regression-Methods/blob/master/Relevance%20Vector%20Machine%20%26%20ARD/rvm.py
which looks relatively compatible. needs set_params and get_params or
inheriting from BaseEstimator.And there is https://github.com/jhallock7/SparseBayes-Python which could
be wrapped.@ZaixuCui https://github.com/ZaixuCui why do you want it to be in
scikit-learn when there is a ready to use implementation out there?I tend to agree with @GaelVaroquaux https://github.com/GaelVaroquaux
and @mblondel https://github.com/mblondel . If no-one published on
algorithms in nearly ten years, people don't seem to be very interested.—
Reply to this email directly or view it on GitHub
https://github.com/scikit-learn/scikit-learn/issues/1513#issuecomment-148440665
.
Because I use scikit learn to do SVR, elastic-net.
So, if there is a RVR implementation, I will not need to do use matlab when
do machine learning analysis.
You can use the Python code doing RVR that we pointed to in the
discussion.
OK,thanks
On Mon, Oct 19, 2015 at 8:29 AM, Gael Varoquaux [email protected]
wrote:
Because I use scikit learn to do SVR, elastic-net.
So, if there is a RVR implementation, I will not need to do use matlab
when
do machine learning analysis.You can use the Python code doing RVR that we pointed to in the
discussion.—
Reply to this email directly or view it on GitHub
https://github.com/scikit-learn/scikit-learn/issues/1513#issuecomment-149213151
.
Couldnt we implement this as very lightweight class based on our new Gaussian Process implementation? As a far as I understand, RVR is only the name given to a GP with a special kind of kernel.
Though this would require only minimal effort, basing the implementation of RVR on the one of GP may not be the most appropriate thing to do? CC: @jmetzen
@amueller @GaelVaroquaux @ZaixuCui @yedtoss @jlopezpena
Hi, I implemented fast version of Relevance Vector Machine with scikit-learn API,
so if anybody intends to use it feel free to do it.
Code: https://github.com/AmazaspShumik/sklearn_bayes/blob/master/sklearn_bayes/rvm/fast_rvm.py
There are four implemented classes in the code:
-RegressionARD
-ClassificationARD
-RVC
-RVR
So may be RegressionARD and ClassificationARD can be useful as well
@AmazaspShumik thank you so much for your implementation. Great work :+1:
@AmazaspShumik
Thank you for your efforts very much.
I will definitely try this package.
Wish you all the best.
Zaixu
Has anyone had trouble implementing @AmazaspShumik predict_proba method?
any one here have library for RVM on php? i dont understand with RVm can explain for me?
any one have library RVM for PHP?
Some pointed in discussion links to @AmazaspShumik implementation are broken, just put them here for people who interested in (and some other implementations):
https://github.com/AmazaspShumik/sklearn_bayes - RVM + some other algs implementation
https://github.com/JamesRitchie/scikit-rvm - simple implementations using scipy
https://github.com/siavashserver/neonrvm - C implementation with Python binding
Also, here is relevant papers collection:
http://www.miketipping.com/sparsebayes.htm
There is a C++ implementation of RVM in there (supplementary material of a paper):
https://pubs.acs.org/doi/abs/10.1021/ci200128w
Precisely here:
https://pubs.acs.org/doi/suppl/10.1021/ci200128w/suppl_file/ci200128w_si_001.zip
Microsoft patent has expired. Could we possibly add it to sklearn?
It easily clears the standard requirements, so I don't see why not. Maybe having some good/convincing examples might be interesting. scikit-rvm
and sklearn_bayes
seem unmaintained but might still be useful.
Probably mostly needs a champion now that actually wants to put the work in.
On Murphy's book he claims the RVMs performance is really similar to SVMs but has the advantage of being a true probabilistic method so it gives calibrated probabilities as answers. Here https://github.com/probml/pmtk3/blob/master/docs/tutorial/html/tutKernelClassif.html he compared the methods using a small dataset
Can you provide a link to the rendered version?
Also, itn's not surprising that it's on a small dataset, given that is is likely to have scalability issues.
@amueller http://htmlpreview.github.io/?https://github.com/probml/pmtk3/blob/master/docs/tutorial/html/tutKernelClassif.html
I will try to work on the implementation. Any help would be highly appreciated.
IIRC, one advantage of RVM over SVM, is that you can find the optimal C parameter without doing an optimization pass.
I wonder if the original author would be willing to contribute his reference implementation.
Well, it is in Matlab and Mike Tipping is not even on github...
Hi @amueller and everybody! We saw this thread and decided to implement a sklearn-compatible version of the RVM (https://github.com/Mind-the-Pineapple/sklearn-rvm). We based a lot of what we did on the JamesRitchie implementation. It would be great if someone would be willing to take a look at it and feedbacks and contributions (@themrzmaster) are welcome :)
We have plans of maintaining this repo, implementing the fast version on version 0.2 and hope that maybe some day this code could be helpful for scikit-learn main repository 😊 🍍
Hi @amueller and everybody! We saw this thread and decided to implement a sklearn-compatible version of the RVM (https://github.com/Mind-the-Pineapple/sklearn-rvm). We based a lot of what we did on the JamesRitchie implementation. It would be great if someone would be willing to take a look at it and feedbacks and contributions (@themrzmaster) are welcome :)
We have plans of maintaining this repo, implementing the fast version on version 0.2 and hope that maybe some day this code could be helpful for scikit-learn main repository 😊 🍍
Hi @PedroFerreiradaCosta
I have tested this scikit-learn api and it seems it is still so slow (seems not responding yet). Do you think the reason could be it is implemented on Windows? Below is what I used:
EMRVC(kernel='rbf', gamma='scale', n_iter_posterior=10, max_iter=500, compute_score=True , verbose=True ) Thanks for your answer @themrzmaster @PedroFerreiradaCosta
Hi @mustuner ! Thank you for trying out our API!
The RVM does have a higher computational complexity than for example the SVM (O(M^3)), which might make it slower for cases where you have a large number of basis functions. Either way, there are a number of things that you can do to speed up the process. You can pre-compute the kernel matrix and feed it to our algorithm instead of X, (set kernel='precomputed') or you can also decrease the scale of the alpha_threshold (set by default at 1e5). Please have in mind that this second option might lead to lower accuracy of the model.
Hope this helps! And feel free to open an issue for us to help further.
Best,
Pedro
Hi @mustuner ! Thank you for trying out our API!
The RVM does have a higher computational complexity than for example the SVM (O(M^3)), which might make it slower for cases where you have a large number of basis functions. Either way, there are a number of things that you can do to speed up the process. You can pre-compute the kernel matrix and feed it to our algorithm instead of X, (set kernel='precomputed') or you can also decrease the scale of the alpha_threshold (set by default at 1e5). Please have in mind that this second option might lead to lower accuracy of the model.
Hope this helps! And feel free to open an issue for us to help further.Best,
Pedro
Thank you @PedroFerreiradaCosta Let me try that
Most helpful comment
Hi @amueller and everybody! We saw this thread and decided to implement a sklearn-compatible version of the RVM (https://github.com/Mind-the-Pineapple/sklearn-rvm). We based a lot of what we did on the JamesRitchie implementation. It would be great if someone would be willing to take a look at it and feedbacks and contributions (@themrzmaster) are welcome :)
We have plans of maintaining this repo, implementing the fast version on version 0.2 and hope that maybe some day this code could be helpful for scikit-learn main repository 😊 🍍