Keras: Split (or multiple objective functions)

Created on 13 Jun 2015 · 28 comments · Source: keras-team/keras

Apologies if this has already been asked; I couldn't find anything on it. Is there a way to use multiple objective functions, say via a "Split" analogous to "Merge"? The goal would be to implement a model like the one given here:

https://github.com/rbgirshick/fast-rcnn

That is, have a classifier and a regression objective function that both provide gradient information to backprop.
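
For concreteness, here is a minimal Theano-level sketch of what I mean (not Keras code; layer sizes and names are made up): a shared trunk feeding a softmax classification head and a linear regression head, with the two losses summed so that both heads push gradients into the shared weights.

import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX
x = T.matrix('x')            # shared input batch
y_cls = T.ivector('y_cls')   # integer class labels
y_reg = T.matrix('y_reg')    # regression targets

def weights(n_in, n_out):
    return theano.shared(0.01 * np.random.randn(n_in, n_out).astype(floatX))

W_trunk, W_cls, W_reg = weights(10, 32), weights(32, 5), weights(32, 4)

h = T.tanh(T.dot(x, W_trunk))             # shared representation
p_cls = T.nnet.softmax(T.dot(h, W_cls))   # classification head
y_hat = T.dot(h, W_reg)                   # regression head

loss = (T.nnet.categorical_crossentropy(p_cls, y_cls).mean()
        + T.sqr(y_hat - y_reg).mean())    # both objectives, summed

params = [W_trunk, W_cls, W_reg]
grads = T.grad(loss, params)              # gradients flow from both heads
train = theano.function([x, y_cls, y_reg], loss,
                        updates=[(p, p - 0.01 * g) for p, g in zip(params, grads)])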

Most helpful comment

@iskandr I have a fix that might actually be correct/troubles me less. I'll have a PR later today.

All 28 comments

This feature is not yet implemented, but it's in the plans. It will be introduced as a "Fork" model.

Will it be possible to do multi-task learning using a Fork model? I have a dataset with several dozen distinct but related output types and I'd like to train a single model which uses a shared representation to predict all the outputs but only do backprop from the output(s) available in the data.

Also, do you need any help implementing this? Is there a branch or PR which needs attention?

Will it be possible to do multi-task learning using a Fork model?

Yes, assuming the problem is tractable. However, why use a single model for every output instead of having separate models? It's bound to have a significant negative impact on performance.

Also, do you need any help implementing this? Is there a branch or PR which needs attention?

There is no branch yet. If you're interested, we can start a discussion about this. This is not just implementation; we would have to discuss architecture choices, since it is not entirely clear what a Fork model would entail (we might end up implementing it as a Fork layer instead, just like the Merge layer).

However, why use a single model for every output instead of having separate models? It's bound to have a significant negative impact on performance.

The outputs are distinct types of assays which measure the same underlying quantity but with slightly different biases and response profiles. Presumably being able to use all the data in a single model will improve performance via a more robust shared representation.

This has worked for other problem domains (e.g. "A unified architecture for natural language processing: deep neural networks with multitask learning"), and though I'm not yet sure how useful multi-task learning will be for me, I'd at least like to try it out.

If you're interested, we can start a discussion about this. This is not just implementation; we would have to discuss architecture choices, since it is not entirely clear what a Fork model would entail (we might end up implementing it as a Fork layer instead, just like the Merge layer).

I'm definitely interested. Do you want to start another thread and maybe lay out some of your thoughts about what might need to change to accommodate Fork models or layers?

First, a heads up: with the current implementation of Merge, merging the results of a fork will produce a broken Theano function. It looks like this only requires minimal changes to fix.

I think there are a few natural options for how Fork should look in Keras:

Option A: fork as a function

models = fork(base_model, nb_branches=6)
#and
left, right = fork(base_model, nb_branches=2)

Option B: a Fork layer

left = Sequential()
left.add(Fork(base_model))
right = Sequential()
right.add(Fork(base_model))

Option C: a Fork model which would allow you to branch away from another model:

branch = Fork(base_model)
#and
branches = [Fork(base_model) for i in range(3)]

Given the way Merge works right now, adding to one of the pre-merged models after creating the merged model changes the resulting merged model:

model_merged = Sequential()
model_merged.add(Merge([model1, model2], mode='concat'))
#why would anyone do this in this order?
model1.add(Activation('relu'))
model2.add(Activation('relu'))
#Even with this weird order, the expectation is fairly clear

This behavior is not problematic in the case of Merge, but if Fork works the same way it may prove frustrating and counter to how users naturally expect a fork to behave. This is especially likely to cause confusion for options B and C:

#model_forked = Fork(model)
model_forked = Sequential()
model_forked.add(Fork(model))
#what should I expect this to do:
model.add(Dense(1024,1024))
#yikes! have we changed the forked model too?

On another note, I need to start using this functionality by early next week. I’m already working on an implementation and would rather produce something that could be contributed back to the project if possible. Let me know your thoughts.

Thanks for sharing your thoughts, this is very interesting.

For reference, this was the initial proposal for a Fork model: https://github.com/fchollet/keras/issues/104

model = Sequential()
model.add(Dense(10, 128))

two_headed_model = Fork(model, n=2)
two_headed_model.add(Dense(128, 64), position=0)
two_headed_model.add(Dense(128, 1), position=1)

two_headed_model.compile(optimizer, objective)
two_headed_model.fit(X, [y1, y2])

A Fork layer would work essentially in the same way:

model = Sequential()
model.add(Dense(10, 128))

model.add(Fork(n=2))
model.add(Dense(128, 64), position=0)
model.add(Dense(128, 1), position=1)

model.compile(optimizer, objective)
model.fit(X, [y1, y2])

I do not know at this point what the best API and architecture would be. What would be some pro/cons of the above, compared to a fork function, for instance?

I think the crux is that whatever the syntax, we need to end up with a single object providing the model interface for training and prediction.

The Fork Layer:
The fork layer above is awkward. It relies on the add() function of the underlying model to specify the position, which should be a concern of the fork object. It falls apart when we consider a model with more than one fork: which position of which fork are we specifying when we call add() on the model?

The Fork Model:
The fork model more naturally accommodates multiple forking. But when you look at complex models, you see a mess of forking and multiple auxiliary classifiers (see GoogLeNet, http://arxiv.org/pdf/1409.4842.pdf). In such a case there clearly isn't going to be just one fork model that we can interact with for training, so we won't end up with a final manageable, trainable model. In the simple case, however, the fork model in #104 seems the best option so far.

To distill the requirements as I understand them: We would like to take n input layers (multi-modal) to m output layers with at most one (weighted) objective function applied to each of these m output layers. We should be able to compile and fit the model in the same way as the sequential model.

To that end, I wonder if the most sane route might be adding a graph model like what I sketch below:

#sub models like text_model become immutable after fork or merge to prevent surprises.
model_graph = Graph()

text_model = model_graph.create_input()
text_model.add(…)
text_model.add(…)
…

image_model = model_graph.create_input()
image_model.add(…)
image_model.add(…)
…

#adding auxiliary classification
image_model, image_intermediate = model_graph.fork(image_model, n=2)
image_intermediate.add(Dense(4096,1000))
image_intermediate.add(Activation('softmax'))
model_graph.create_output(image_intermediate, objective1)

#merging models
text_and_image_model = model_graph.merge((text_model, image_model), mode='concat')
text_and_image_model.add(…)
text_and_image_model.add(…)

model_graph.create_output(text_and_image_model, objective2)

model_graph.compile(optimizer)
model_graph.fit([X_text, X_image], [y_image, y_final])

What I don't yet understand, regardless of whether Fork is a model or layer, is:

  • how I specify distinct objective functions for my multiple outputs (if, for example, I'm doing both regression and classification)
  • how do I specify which output "heads" a given input is labeled with
  • how we combine gradients downstream of the fork

how I specify distinct objective functions for my multiple outputs (if, for example, I'm doing both regression and classification)

The solutions I've entertained for this issue all seem painfully inelegant. What do you think of the layer graph model I sketched above? That seems to solve the objective specification issue and allows you to keep a single model interface even after substantial forking/merging.

how do I specify which output "heads" a given input is labeled with

I'm not sure I follow.

how we combine gradients downstream of the fork

Summation. Here's 'fork' for Torch: https://github.com/torch/nn/blob/master/Replicate.lua. Also take a look at RepeatVector in Keras.
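
A tiny Theano illustration of why (nothing Keras-specific): the gradients from branches downstream of a fork simply add up at the shared node.

import theano
import theano.tensor as T

x = T.dscalar('x')
branch_a = 3 * x          # first branch after the "fork"
branch_b = x ** 2         # second branch
total = branch_a + branch_b
grad = T.grad(total, x)   # = 3 + 2*x: the branch gradients sum
print(theano.function([x], grad)(2.0))   # 7.0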

What do you think of the layer graph model I sketched above?

I might have missed where the objective actually gets injected. In your example:

left = Sequential()
left.add(Fork(base_model))
right = Sequential()
right.add(Fork(base_model))

Would you then compile left and right with different objective functions?

On which object do you then call fit?

how do I specify which output "heads" a given input is labeled with

I'm not sure I follow.

If I have a two output sub-networks (left and right), and a data point can have associated ground truth labels for either or both of them, how is that specified? I guess this is related to my confusion about which fit function you would call and what you would pass to it.

Hi, I'm a recent convert from Pylearn2.

If I have a two output sub-networks (left and right), and a data point can have associated ground truth labels for either or both of them, how is that specified? I guess this is related to my confusion about which fit function you would call and what you would pass to it.

In order to optimize a network (call .fit), you should merge it so that left and right get combined into a single objective. So you'd add a Merge layer which sums (with weights!) the losses from the left and right heads.

The major change to the interface in this case is that the model objective would not be specified when compile is called, but in the layers themselves. Maybe have an Objective or Loss layer be a thing, and then the final loss is a weighted combination of the Loss layers. I think this would enable GoogLeNet-style architectures in Keras.
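
A rough sketch of the weighted combination I mean (the names here are hypothetical, nothing below exists in Keras):

loss_weights = [1.0, 0.3]   # e.g. main head, auxiliary head

def combined_loss(head_losses, weights=loss_weights):
    # head_losses: one scalar loss per output head (Theano scalars or plain floats)
    return sum(w * l for w, l in zip(weights, head_losses))

print(combined_loss([0.7, 2.0]))   # 1.0*0.7 + 0.3*2.0 = 1.3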

I actually meant the layer graph from this comment. One potential solution could be to specify targets only for the objectives you want to optimize in a given batch:

model_graph.compile(optimizer)
#train a batch against both objectives
model_graph.train([X_1_batch, X_2_batch, X_3_batch], [y_1_batch, y_2_batch])
#train a batch against only the first objective
model_graph.train([X_1_batch, X_2_batch, X_3_batch], [y_1_batch, None])

Another option would be to expose each output as a separate model that can be trained separately, but this would result in redundant computation in the case when more than one objective is being used.

model_output_1 = model_graph.create_output(model_intermediate, objective1)
...
model_output_1.train([X_1_batch, X_2_batch, X_3_batch], y_1_batch)

In the case of a Fork model, you could train each resulting model separately; or you could create some kind of combined objective model to train them in one procedure; or you could use a Merge model and a combined objective function on the concatenated model outputs. None of those sound like great solutions to me.

@pdermyer Sorry, I'm not sure how I missed the bottom of your previous comment! That API actually seems fine to me.

One possible refinement: For sufficiently complicated topologies, we might want to add named labels to inputs and outputs, to allow for example:

network_graph.fit({"text": X_test, "image": X_image}, {"output_intermediate": y1, "output_final": y2})

For samples w/ missing outputs in a multi-output network, you can just omit a label for a training batch:

network_graph.fit({"text": X_test, "image": X_image}, {"output_final": y2})

In the network definition you could add these labels as arguments to create_output:

model_graph.create_output(image_intermediate, objective1, name="output_intermediate")

@iskandr I think that's a great idea. Generally, I was trying to avoid using string names, but I can see an advantage in many respects here. It'd be best if it could work either way. I'm actually working on an implementation right now. I'll see if I can add this functionality.

@pdermyer Let me know if you need help with anything or want any eyeballs on a branch. I'm very interested in generalizing Keras toward multiple output functions.

Actually, the first thing we need to solve is how the Merge layer should function. Right now, if you merge models that overlap, it appears to fail: https://gist.github.com/pdermyer/bd8d9f97e3e67d8eef3d

I think in this toy case, it's easy to think 'why would you do this' but I suspect it'll be a problem if we try to merge anything that came from a Fork as well. I was able to get something that appears to work, but I have no confidence that I really fixed the issue rather than hid it poorly. Maybe someone can take a look?

To save others the effort, this is the error that arises from using the same model twice as inputs to a Merge layer:

UnusedInputError: Variable <TensorType(float64, 4D)> is used twice in inputs to theano.function, at indices 0 and 1.  This would result in values provided for it being ignored. Please do not duplicate variables in the inputs list.
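
The failure reproduces directly at the Theano level, outside of Keras; the obvious (possibly naive) workaround is to deduplicate the inputs list before compiling. A minimal sketch:

import theano
import theano.tensor as T

x = T.matrix('x')
y = x * 2

# theano.function rejects a duplicated input variable, which is what happens
# when the same model's input tensor appears twice in the merged inputs list.
try:
    theano.function([x, x], y)
except Exception as e:
    print(e)

# Naive workaround: deduplicate by identity, preserving order, before compiling.
unique_inputs = []
for v in [x, x]:
    if not any(v is u for u in unique_inputs):
        unique_inputs.append(v)
f = theano.function(unique_inputs, y)   # compiles fine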

It's not immediately clear to me what the right solution is. @pdermyer what did you change in Keras that makes this code work?

@iskandr I have a fix that might actually be correct/troubles me less. I'll have a PR later today.

A functioning Fork is trivial, but the fork layer I'm using isn't very user-friendly for the reason I described above, and I don't have a good answer yet.

@fchollet Adding multiple output support for any new models is either going to result in significant duplication of the Model code or there will need to be refactoring of the central Model to support multiple outputs analogous to the modifications made for Merge. It should be possible to keep everything functionally the same for single output models. Is the multi-output refactor route something that makes sense for the project?

@fchollet Adding multiple output support for any new models is either going to result in significant duplication of the Model code or there will need to be refactoring of the central Model to support multiple outputs analogous to the modifications made for Merge. It should be possible to keep everything functionally the same for single output models. Is the multi-output refactor route something that makes sense for the project?

I was thinking of integrating list-output support into the abstract Model class, in the exact same way as Model supports list-inputs. In the single-output case, the tensor y would be put in a list [y]. In multiple output cases, you would have y = [y1, y2 ...].
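
A sketch of the normalization that implies (the helper name is hypothetical, not current Keras code):

def to_list(y):
    # Wrap a single output so downstream code can always iterate over a list.
    return list(y) if isinstance(y, (list, tuple)) else [y]

print(to_list(42))          # [42]       -- single-output case
print(to_list([1, 2, 3]))   # [1, 2, 3]  -- multi-output case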

@fchollet list output support certainly seems like a very good (and consistent) option. I assume we'd also need to change the compilation loss argument to a list?

model.compile(loss=[loss1, loss2, loss3], optimizer='sgd')

Edit: Ah I guess specifying the objectives is done on a per model basis in the graph approach suggested by @pdermyer, and the compilation on the graph level.

@vzhong +1 for uncoupling the objective from compilation, it's an output-specific concept.

@fchollet It sounds like we are on the same page.

When y is returned, we should return just y in the single case (for compatibility) and [y1, y2, …, yn] in the multi-case. Similarly, for returning loss and accuracy, returning [loss1, loss2, …, lossn], [acc1, acc2, …, accn] for the multiple output case and the original loss, acc for the single output case seems sensible.

I don’t think we will find an abstract compile() that works for most/all multi-output setups. (It'd be so cumbersome to support all of the different concerns). I think the best course is to support the generalization @vzhong suggests, and try to refactor some of the functionality of compile into separate private methods to make maintaining different compile functions simpler. Then we can leave messy niceties like per output weighting of loss functions or training against subsets of outputs (for missing ground-truth etc) to the models that need them.

Hey, I have working code for that (did it months ago while working on Caffe support).

Will submit it in a few hours.


As it currently stands, what is the best way to perform multiple-output regression in the case where we are partially missing ground truth (e.g. the assays example discussed above)?

Assuming I only have one objective function, is there no simple way to zero out the gradients for missing outputs before backprop?
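
For context, the kind of thing I mean is a masked loss where missing targets contribute neither loss nor gradient (a plain-numpy sketch, not an existing Keras objective):

import numpy as np

def masked_mse(y_true, y_pred, mask):
    # mask is 1.0 where ground truth exists and 0.0 where it is missing
    err = (y_true - y_pred) ** 2
    return (err * mask).sum() / max(mask.sum(), 1.0)

y_true = np.array([[1.0, 0.0], [2.0, 5.0]])
y_pred = np.array([[0.5, 3.0], [2.0, 4.0]])
mask   = np.array([[1.0, 0.0], [1.0, 1.0]])  # second target of sample 0 is missing
print(masked_mse(y_true, y_pred, mask))      # (0.25 + 0 + 0 + 1) / 3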

I'm also interested in this, particularly if it can work for thousands of related tasks with shared structure.
