Keras: Hierarchical Softmax

Created on 24 Jul 2015 · 22 Comments · Source: keras-team/keras

I am working with @RobGrimm on a Hierarchical Softmax to speed up training with a large number of classes. Would this be a welcome PR? It's only two-level right now, but that already gives a dramatic speed-up. The plain softmax is an Activation, but the hierarchical version seems more of a core layer: what do you reckon?

stale


All 22 comments

How would it work?

As a layer, I imagine it would belong in layers.advanced_activations.

advanced_activations is a great idea -- I didn't think of that. The prototype is here: https://github.com/RobGrimm/benchmark_hierarchical_softmax

Ok. Yes, it could be a welcome PR.

One problem: for this hierarchical softmax layer, we also need access to the correct y's for a minibatch (only during training, of course). Is there an easy way to access a minibatch's correct y-output from within a Layer?

we also need access to the correct y's for a minibatch (only during training, of course)

Um, really? Softmax is a normalization operation over an array of scores; isn't hierarchical softmax merely a hierarchical (structured as a binary tree, therefore faster) version of softmax?

Is there an easy way to access a minibatch's correct y-output from within a Layer?

No, in that case you'd have to implement it as a loss function.

It would definitely make sense to implement it as a loss function (since that provides easy access to y_true), but it also has parameters that need to be optimized (the path that you optimize through the graph during training only depends on the correct target label; this is what makes it fast: you only compute the cost for the correct target path). Can we pass parameters to a model from within a loss function?

As a layer used at the top of a stack, you have indirect access to the labels through the gradient updates.

Wouldn't it be possible to learn the paths via such updates?

Could you be more specific? It seems like you're suggesting that we somehow extract the y labels from get_updates() in models.py and then redefine the train_loss function? It would be better if the new layer had direct access to y, but we understand that is very difficult to arrive at in the current architecture...

I just mean, would there be any way to learn hierarchical softmax via descent of the gradient of the loss with regard to the parameters of the HS layer, thus requiring no explicit knowledge of the labels?

Could you describe succinctly the hierarchical softmax algorithm?

It is possible to compute the gradients without knowing the true class labels, but then you don't save any time, and you might as well use flat softmax.

With HS, you encode the output labels using a tree or a graph $H$, where each possible path from the root to a leaf corresponds to a label. You can look at a path as the set of nodes $P$ which you have to traverse when taking that path. Each label $y$ is thus associated with a path $P_y$.

Each node in $H$ is associated with an embedding. Given a training example $t$, you compute the similarity between $t$ and each node's embedding. The probability of $y$ being the correct label for $t$ is then given by the product of similarity scores of the nodes in $P_y$.

When you have access to the true class label, you only need to compute the product of scores for the path that corresponds to the true class label. If you don't know the true class label, you have to compute the product for each possible path, and then you don't have any gain compared to flat softmax.
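
For concreteness, here is a minimal NumPy sketch of that idea for the two-level case. The 10 x 10 node layout, the softmax-per-level scoring, and all names are illustrative assumptions, not the prototype's actual code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical two-level tree: each class y maps to one level-1 node and one
# level-2 node, so its path P_y contains exactly two nodes. With 10 x 10 nodes
# we can encode 100 classes while scoring only 20 nodes per example.
n_in, n_level1, n_level2 = 64, 10, 10
rng = np.random.RandomState(0)
W1 = rng.randn(n_level1, n_in) * 0.01  # level-1 node "embeddings"
W2 = rng.randn(n_level2, n_in) * 0.01  # level-2 node "embeddings"

def path_probability(x, y):
    """P(y | x): product of the per-node scores along y's path P_y."""
    node1, node2 = divmod(y, n_level2)  # P_y = {node1, node2}
    s1 = softmax(W1 @ x)[node1]         # score of the level-1 node
    s2 = softmax(W2 @ x)[node2]         # score of the level-2 node
    return s1 * s2

x = rng.randn(n_in)
print(path_probability(x, y=42))        # ~2 * sqrt(K) scores instead of K
```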

We have been making good progress over the past few days, but now we're stuck.

@fchollet: could you have a look? With the code there, this should make it easier for you to get a sense of what we are trying to achieve. It would be great if we can figure this out, since we don't seem that far away from a solution.

@mikekestemont could you link a paper that introduces this type of softmax? Until now I thought HS used a binary tree with as many nodes as there are classes?

@RobGrimm will have more details on this, but our approach is a more general/simplistic two-level one, where the learning task boils down to selecting two paths. My own keras-fork has a working implementation, so you can check out the details there, although this version still needs tuning with respect to the activations used.

As Mike said, this is a more simplistic approach. I'm not aware of a paper that describes it.

One problem we are having with it is that the HS does not learn to discriminate well among different classes. It correctly maximizes the probability of the true class label (so you see a decrease in the loss), but probabilities for the other classes remain high (so you do not see a corresponding increase in accuracy). Presumably this could be solved by working with more than just two levels.

So while it works in principle, the thing is not finished at this point.

Did you get it to work on some dataset? I've noticed your implementation uses shared weights at the second level
self.W2 = theano.shared(value=numpy.zeros((n_in, self.n_level2_nodes)))
while the theano implementation does not.

Aha, the theano implementation seems to be a new development and wasn't around when we experimented with this. Are you interested in this? @RobGrimm: do you want to take this up again? My own keras-fork has a script in the examples folder where I got our previous implementation to work on a language modelling task.

I am interested in this, but the implementation in your fork doesn't scale very well. I replaced the activations with softmaxes and used my own Graph to support two inputs. I merged the weights to speed up computations. But still, the performance is quite bad compared to standard softmax. I wanted to save some memory via this weight sharing, but it seems like it won't work after all.

You mean that the HS implementation is faster than the flat version, but not by as much as it should be, right?

I think a large part of this has to do with keras requiring that you calculate a probability for every output class. During training, this is not necessary for the HS implementation, since you are only going to calculate the probability of the true class label.

But because keras requires a probability for every class, we create a matrix at every training step and fill it with zeros, except for the entry holding the probability of the true class label. (Maybe there's a more efficient way to work around this.)

I made the creation of this matrix an optional parameter in this version here, which runs without keras: https://github.com/RobGrimm/hierarchical_softmax/blob/master/HierarchicalSoftmax.py

It saves a lot of time if you don't have to create the matrix but instead return only the probability for the true class label.
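
To make the overhead concrete, here is a toy sketch of the two options; the shapes and numbers are made up, and this is not the code from the repository above:

```python
import numpy as np

batch_size, n_classes = 4, 100000
true_labels = np.array([3, 17, 42, 99999])
p_true = np.array([0.7, 0.1, 0.4, 0.9])  # P(true label | x) for each example

# Option A: what keras expects -- a full (batch_size, n_classes) matrix, all
# zeros except for the entry holding the true-class probability in each row.
full = np.zeros((batch_size, n_classes), dtype=np.float32)
full[np.arange(batch_size), true_labels] = p_true

# Option B: skip the matrix and return only the per-example probability of the
# true label, which is all the training loss actually needs.
compact = p_true                          # shape (batch_size,)
```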

@mikekestemont sure, I'm up for working on this some more

You mean that the HS implementation is faster than the flat version, but not by as much as it should be, right?

No, it's actually a lot faster. But due to parameter sharing, it does not learn.

But because keras requires a probability for every class, we create a matrix at every training step and fill it with zeros, except for the entry holding the probability of the true class label. (Maybe there's a more efficient way to work around this.)

You can pass only a vector of int8 ones. If the HS returns the probabilities for the true labels (passed as a vector of int16), then the loss is simply binary_crossentropy. Profiling Theano shows that it does not slow down the learning (only a small amount of time is spent on data transfer).
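
A rough sketch of that trick, assuming the HS layer's output is the per-example probability of the true label (the numbers below are made up):

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# The HS output is P(true label | x) per example (the integer labels are fed to
# the layer as an extra input), so the target passed to the loss is just a
# vector of ones: maximizing P(true label | x) means pushing each output to 1.
p_true = np.array([0.7, 0.1, 0.4, 0.9])        # hypothetical HS outputs
targets = np.ones_like(p_true, dtype=np.int8)  # the "vector of int8 ones"
print(binary_crossentropy(targets, p_true).mean())
```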

But again, I think the main problem is that if you share parameters, the HS will not have enough flexibility to learn. My initial experiments confirm this.

We can use theano's implementation at the expense of increased memory usage.

Really cool what you guys are working on here.

Would this type of softmax be applicable for classes of words? I'm trying to find the best way to predict a sequence of words given a sequence of words. (Say sentence to sentence).

In other words, if we had a 100k vocab, we wouldn't want to do a softmax over all 100k words, but rather descend through a hierarchy of word classes until we get to the correct word. Hinton's Coursera course illustrates this very well in lectures 4-5.
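
As a rough back-of-the-envelope comparison, assuming (hypothetically) that the 100k vocabulary is split into 1,000 classes of 100 words each; the paper linked below actually learns the hierarchy rather than fixing it by hand:

```python
# Flat softmax: one score per word in the vocabulary.
vocab_size = 100000
flat_scores = vocab_size                           # 100,000 scores per step

# Two-level factorization: P(word | h) = P(class | h) * P(word | class, h).
n_classes, words_per_class = 1000, 100
hierarchical_scores = n_classes + words_per_class  # 1,100 scores per step

print(flat_scores / float(hierarchical_scores))    # roughly 90x fewer output scores
```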

The paper that illustrates this hierarchical setup is: https://www.cs.toronto.edu/~amnih/papers/hlbl_final.pdf

He does say, though, that it still takes a long time to process the hierarchical setup. He doesn't mention anything about an inability to learn (what you guys were saying earlier?).

Did anything like this ever get integrated into Keras, or is it still up to users to implement it?

Any news on this? Sounds like a great feature!
