Just probing here a little bit to see if there is interest in a Hessian-free optimizer, e.g. as in theano-hf by boulanni. I don't know the Keras code much yet, but I'd like some input on whether that's doable in a reasonable amount of time...
It would definitely be possible to add it as an Optimizer (see keras/optimizers.py).
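For anyone who wants to try: the pattern is to subclass `Optimizer` and override `get_updates`. Below is a rough skeleton of that pattern, based on the Keras-1.x-era `get_updates(params, constraints, loss)` signature (the exact API has changed across Keras versions); the HF-specific parts are just placeholders, not an implementation of Martens' method:

```python
from keras.optimizers import Optimizer


class HessianFree(Optimizer):
    """Skeleton only: shows where an HF step would plug in.

    A real HF optimizer would replace the plain gradient step below
    with a conjugate-gradient inner loop over curvature-vector
    products (plus damping), as in theano-hf.
    """

    def __init__(self, lr=0.01, **kwargs):
        super(HessianFree, self).__init__(**kwargs)
        self.lr = lr

    def get_updates(self, params, constraints, loss):
        grads = self.get_gradients(loss, params)
        updates = []
        for p, g, c in zip(params, grads, constraints):
            # Placeholder step: -lr * g stands in for the CG-computed
            # HF step direction.
            new_p = p - self.lr * g
            updates.append((p, c(new_p)))
        return updates
```

You would then pass an instance to `model.compile(optimizer=HessianFree(lr=0.01), loss=...)` like any built-in optimizer.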
It's not a priority, since HF has been shown to underperform RMSprop and Adagrad, while being more computationally intensive.
Oh OK, I wasn't aware of that... can you point me to a paper or some benchmarks?
I can't find any paper, but what I would recommend is simply to implement it in Keras (it should be quick to adapt boulanni's code) and benchmark it (time and accuracy) against other optimizers. Then you can answer the question without having to rely on what I remember reading.
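If anyone wants to run that comparison, a minimal way is to time `model.fit` under each optimizer and record the final loss. A rough sketch, assuming a recent Keras version; the toy model and random data are placeholders, not anything from this thread:

```python
import time

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Placeholder data: swap in the task you actually care about.
x_train = np.random.rand(1000, 100)
y_train = np.eye(10)[np.random.randint(0, 10, 1000)]

def build_model():
    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=100))
    model.add(Dense(10, activation='softmax'))
    return model

results = {}
for opt in ['sgd', 'rmsprop', 'adagrad']:  # add the HF optimizer here once it exists
    model = build_model()
    model.compile(optimizer=opt, loss='categorical_crossentropy')
    start = time.time()
    history = model.fit(x_train, y_train, epochs=5, verbose=0)
    results[opt] = (time.time() - start, history.history['loss'][-1])

for opt, (seconds, final_loss) in results.items():
    print('%s: %.1fs, final loss %.4f' % (opt, seconds, final_loss))
```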
The fact is, nobody uses Hessian-Free optimization. If there was any advantage to it, everybody would be using it.
Just an FYI to everybody seeing this: a paper by Sutskever et al. discusses Hessian-free vs. momentum methods here:
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_sutskever13.pdf
Quote: "Momentum-accelerated SGD, despite being a firstorder
approach, is capable of accelerating directions
of low-curvature just like an approximate Newton
method such as HF. Our experiments support the idea
that this is important, as we observed that the use of
stronger momentum (as determined by 碌) had a dramatic
e鈫礶ct on optimization performance, particularly
for the RNNs. Moreover, we showed that HF can be
viewed as a first-order method, and as a generalization
of NAG in particular, and that it already derives some
of its benefits through a momentum-like mechanism."
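For reference, these are the classical momentum and NAG updates the quote is comparing, with μ the momentum coefficient and ε the learning rate, in the notation of the Sutskever et al. paper:

```latex
% Classical momentum (CM)
v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}

% Nesterov's accelerated gradient (NAG): gradient evaluated at the look-ahead point
v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}
```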
So yeah, probably Hessian-free is not very useful...
For what it's worth (responding to an old thread), the above quote by @harpone does not establish the conclusion. Actually, the 2013 ICML paper quoted above demonstrates that careful use of momentum methods _almost_ closes the gap in performance that existed between HF-based methods and other first-order methods. That is, as of 2013, the HF-based methods of Martens + Sutskever were still better than the momentum methods employed in the 2013 paper (at least for a number of important tasks, especially training SRNs).
Perhaps the field subsequently evolved to the point that HF-based methods are no longer on top after the advent of Adam and others, but that requires more reading. Perhaps I will find the paper Chollet refers to about Adagrad and RMSprop as I read further.
I also have a need for an HF optimizer for estimating RNN parameters. I have played around with various first-order optimizers, but I am seeing too much noise. I was wondering if anyone has built an HF optimizer for Keras yet, or whether I need to build one from scratch. I am currently looking at the Optimizer class to figure out what needs to be done.
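In case it helps anyone starting from scratch: the core primitive HF needs is a Hessian-vector (or Gauss-Newton-vector) product, which you can get by double backprop without ever forming the Hessian. A minimal sketch assuming a TensorFlow 2.x / tf.keras backend (which is not what this thread originally targeted); the CG inner loop, damping, and Gauss-Newton substitution from Martens' method are not shown:

```python
import tensorflow as tf

def hessian_vector_product(loss_fn, params, vector):
    """Compute H·v for each parameter via double backprop, without forming H.

    loss_fn: zero-argument callable returning a scalar loss.
    params:  list of tf.Variable being optimized.
    vector:  list of tensors with the same shapes as params.
    """
    with tf.GradientTape() as outer_tape:
        with tf.GradientTape() as inner_tape:
            loss = loss_fn()
        grads = inner_tape.gradient(loss, params)
        # Differentiating (grad · v) w.r.t. the parameters gives H·v.
        grad_dot_v = tf.add_n([tf.reduce_sum(g * v)
                               for g, v in zip(grads, vector)])
    return outer_tape.gradient(grad_dot_v, params)


# Example usage on a tiny placeholder model with random data:
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
x = tf.random.normal((8, 4))
y = tf.random.normal((8, 1))
loss_fn = lambda: tf.reduce_mean(tf.square(model(x) - y))
v = [tf.ones_like(p) for p in model.trainable_variables]
hv = hessian_vector_product(loss_fn, model.trainable_variables, v)
```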