Here is a proposal for a new API to replace Graph: https://gist.github.com/fchollet/314085fffa200de9c3da
Graph was verbose, inelegant, and at times impractical. This is a more beautiful way to build networks.
This is planned to be 100% compatible: anything that used to work will still work after these changes.
Indeed it does. Amazingly enough, though, it was developed independently after watching an advanced user react negatively to the Keras Graph API. I realized they were right (I never liked Graph, tbh) and started to search for ways to make Keras functional. It turns out there is only one API (modulo a few details) that you can converge to in order to make layers functional while keeping 100% backward compatibility, and it is this one.
Survey: which "internal" layer/container methods are you using in your custom layers? This is important so I can plan ahead and avoid breaking your custom layers in the upgrade.
Is it possible for later layers to easily access information from previous layers, for example the output_shape of a previous layer (which is not a fixed value for an FCN) at run time?
For example, something similar to this:
add_node(ResizeLayer(output_shape=layer_3.output_shape), name='layer_10')
@shampool that's possible in the current version and will still be possible in the new one, of course.
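For illustration, a sketch of how that could look in the functional API (ResizeLayer is hypothetical; this assumes a layer exposes an output_shape property once it has been called on an input):
from keras.layers import Input, Dense

a = Input(shape=(32,))
layer_3 = Dense(16)
h = layer_3(a)
# layer_3.output_shape is inferred once the layer has been called on an input
resized = ResizeLayer(output_shape=layer_3.output_shape)(h)  # ResizeLayer is hypothetical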
Heads up: I am planning on removing the train=True/False argument used when computing layer outputs and replacing it with symbolic control flow in the computation graph. For instance, you would apply train-time dropout via:
x = K.in_train_phase(K.dropout(x, level=0.5), x)
Thoughts, objections?
Note that this allows us to have a single computation graph in our model (instead of two separate ones for train and test time). This also forces us to always pass the training phase as a boolean scalar input to the graph.
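As a minimal sketch of what that looks like at the backend level (assuming K.learning_phase() is that boolean scalar input, fed as 1 for training and 0 for testing):
from keras import backend as K
import numpy as np

x = K.placeholder(shape=(None, 100))
y = K.in_train_phase(K.dropout(x, level=0.5), x)  # dropout branch only in the train phase

f = K.function([x, K.learning_phase()], [y])
data = np.random.random((32, 100)).astype('float32')
out_train = f([data, 1])  # training phase: dropout active
out_test = f([data, 0])   # test phase: identity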
Also, we will have modular metrics in the new version (instead of hard-coded access to accuracy only).
Questions:
What are the advantages of making dropout (and batch normalization?) more verbose? I know most of the time we don't have differences between train and test in the computation graph, but what are the other reasons?
+1 to more metrics. We could use a dict of (id, name) pairs as input to "compile". We could have a metrics.py the same way we have objectives.py. That would be easy for metrics that depend only on the outputs and targets.
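For example, a metric could follow the same (y_true, y_pred) signature as an objective and be written purely in backend ops (a sketch; the rmse name and its placement in a metrics.py are assumptions):
from keras import backend as K

def rmse(y_true, y_pred):
    # root mean squared error as a symbolic backend expression
    return K.sqrt(K.mean(K.square(y_pred - y_true)))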
It simplifies Keras graph management (it's nice not to have two hardcoded redundant copies of everything), and it reduces compilation time and memory overhead because there is only one computation graph underneath.
+1 to metrics. Things like gradient monitoring would also be really nice to add.
I'd like to see user-defined metrics. The interface should be similar to how you define a loss function, but of course its value is recorded in history instead of being added to the gradient.
I'd also like to see a way to put the same layer in multiple locations in a graph. For example, you create a dense layer and then clone it. The new copy uses the same weights as the first, but you are free to put them in different locations as long as the sizes match. This is much more flexible than the current add_shared_node.
I'd also like to see a way to put the same layer in multiple locations in a graph.
Yes, that is part of what is described in the API proposal above.
I'd like to see user-defined metrics. The interface should be similar to how you define a loss function, but of course its value is recorded in history instead of being added to the gradient.
Yes, that's what is planned. But can we answer the main question: should metrics computation be part of the symbolic computation graph, or happen in numpy-land as a post-processing step? Which metrics would you like to see?
Things like gradient monitoring would also be really nice to add in
Thanks for the suggestion, will think about it.
For binary classification, I would like:
-recall/precision
-AUC
-lift
For multiclassification:
-log loss
For regression:
-RMSE, of course
... and many more, but that would be a good start.
And basically you could add all the metrics used in Kaggle competitions (https://www.kaggle.com/wiki/Metrics).
I am not very familiar with Keras yet, but if you were to build these metrics into your graph, could you still apply the graph to data where you don't know the desired result (for example in production environments)? If the answer is yes, then I would probably advocate for layer-like metrics ;-)
I am planning on removing the train=True/False argument used when computing layer outputs and replacing it with symbolic control flow in the computation graph. For instance, you would apply train-time dropout via:
If this is transparent to the user, then I think it's a net positive to add verbosity to the internals in order to simplify life for the user. However, I wonder if it can't be simplified with the kind of pattern used in Chainer. Something along these lines:
def dropout_base(x, level=0.5):
    return ...  # bla bla bla (the raw dropout op on x)

def dropout(x, level=0.5):
    # only apply the raw dropout op in the training phase
    return K.in_train_phase(dropout_base(x, level=level), x)

# in other code somewhere now...
x = K.dropout(x)  # basically expands to x = K.in_train_phase(K.dropout_base(x, level=0.5), x)
So that the verbosity penalty applies only once, when you define the function - and then becomes mostly transparent when used.
I think I wasn't clear the first time around, so let me clarify. My dropout example was specifically for applying dropout _inside_ a layer (would formerly have been inside get_output(), although it is getting renamed). It is what you would use _when writing custom layers_.
At the model level, when you are adding a dropout layer to a model the syntax doesn't change. You would still do:
model = Sequential()
model.add(...)
model.add(Dropout(0.5))
# functional API
a = Input(...)
b = Dropout(0.5)(a)
Now what would change would be the _internals_ of the Dropout and BatchNorm layers, as well as any custom layer you have that uses a different behavior at training time and testing time.
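For instance, a rough sketch of what such a layer's internals could look like after the change (the class name is made up, and the Keras 1-style call(self, x, mask=None) signature is assumed):
from keras import backend as K
from keras.engine.topology import Layer

class MyDropout(Layer):
    def __init__(self, level=0.5, **kwargs):
        super(MyDropout, self).__init__(**kwargs)
        self.level = level
        self.uses_learning_phase = True  # this layer behaves differently at train and test time

    def call(self, x, mask=None):
        # apply dropout only in the train phase, identity otherwise
        return K.in_train_phase(K.dropout(x, level=self.level), x)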
I am looking for some robust loss functions that can be switched in after some initial standard loss function is used to get to a reasonably low loss. For example: http://arxiv.org/abs/1412.6596
I couldn't find a way to switch out the loss function without recompiling the model, which resets the learning rate, weights, etc. Not sure if I wasn't using the API correctly, but this is something I am interested in.
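One possible workaround today is to simply call compile again with a different loss; a sketch, under the assumption that layer weights are not re-initialized by compile (only the optimizer state and training function are rebuilt), with my_robust_loss, X_train and y_train as placeholders:
# phase 1: train with a standard loss
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.fit(X_train, y_train, nb_epoch=5)

# phase 2: switch to a robust loss once the loss is reasonably low
model.compile(optimizer='sgd', loss=my_robust_loss)  # my_robust_loss: hypothetical custom objective
model.fit(X_train, y_train, nb_epoch=5)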
I think the proposal is solid. It will allow us more freedom in developing our own layers and reduce the "hacks" the current version requires to get a new layer working. As far as what functions should be there, I think you are pretty much the expert on this. One special point for me, though, would be easy usage of the time-distributed layers functionality in my custom layers. Furthermore, something that I personally like as a "metric" would be the ability to use a layer's weights as an input to a custom metric. It would make it possible to visualize the changes from epoch to epoch, or just to see the dynamic changes in my feature maps, which is not only cool, but gives lots of insight into what the network is learning. Same goes for language modelling (given a rigid training/testing regime). All in all, I think if Keras keeps going it's going to blow all other frameworks out of the water, especially with TF 0.7 coming out soon. Btw, since I mentioned TF, will Keras be able to use its distributed capabilities out of the box?
Metric calculation: I would prefer the computational-graph option, because it's more elegant, and since TF is going distributed it will allow the computation of heavier metrics to be distributed and GPU-powered if needed.
Internal layer/container methods: the already-mentioned time-distributed containers, plus all of the TF layer building blocks, which I am sure you would have included even if I were not to put it here.
Distributed TF: will it work out of the box?
That's all for this time. Have a nice day.
@fchollet Monitors as part of the graph is the best option because it is more flexible. We could add monitors the same way we add a dropout, something like:
x = K.in_train_phase(x, monitors=["output_norm", "gradient_norm", "number_of_zeros", "is_nan"])
The drawback will be longer compilation times. But let us remember that monitors will be optional, not mandatory like they were for #RipPylearn2.
I have been thinking about writing Debug layers, something like a layer that does nothing but prints something about the layer output or weights. That may be easier to do if we add Monitors. The thing is maybe "monitors" are more user friendly than callbacks in some cases.
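A rough sketch of such a Debug layer with the TF backend (the layer just passes its input through while printing a summary statistic; the debug_print name is made up):
import tensorflow as tf
from keras.layers import Lambda

def debug_print(x):
    # identity op that prints the mean activation every time the graph is run
    return tf.Print(x, [tf.reduce_mean(x)], message='mean activation: ')

x = Lambda(debug_print)(x)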
Re metrics, we use some ranking order based metrics, specifically:
I believe the network architecture visualisation should come back.
+1 to top-k and network visualizer.
What I think is a bottleneck right now is the possibility of changing the training procedure:
You need to use a lot of hacks and counterintuitive procedures to implement things like:
Another point, more related to the new functional API: consider the following issue to be taken into account in the way we calculate the gradient for each param. http://deeplearning.net/software/theano/tutorial/faq_tutorial.html
By the way, I really like this idea in general.
@fchollet
Have you thought about making losses a Layer? I believe this would make Keras more flexible (training with hierarchical softmax or negative sampling is currently hacky / ugly / impossible) and more elegant (writing autoencoder.fit(X) instead of autoencoder.fit(X, X)). Overall, the new API is neat.
Have you thought about making losses a Layer?
Can you give a few examples? How would it allow things that are not currently possible?
But is there such control flow in TF already? Last time I checked there was only something switch-like which evaluated both of the expressions?
Just like in Theano. Control flow as a symbolic op, part of the graph.
In the gist example I don't see how you define which outputs contribute to the loss.
Perhaps a loss layer will be an easy way to solve this problem.
I tried to implement a Ladder network on top of Keras and it wasn't natural in any way.
In a Ladder network you compare the output of a decoder to the output of the MLP at different layers and sum the results with weights. How would you do that in the new Keras?
Have you thought about making losses a Layer?
Can you give a few examples? How would it allow things that are not currently possible?
Actually one thing I was just looking at: using sampled loss functions from TF, like e.g. sampled_softmax_loss here: https://www.tensorflow.org/versions/r0.7/api_docs/python/nn.html#sampled_softmax_loss
You'll note that it specifies:
This operation is for training only. It is generally an underestimate of the full softmax loss.
so because Keras' layer API includes information about whether it's train or test time, you could have it use sampled_softmax during training but not at test time. The objectives API does not support this.
In a Ladder network you compare the output of a decoder to the output of the MLP at different layers and sum the results with weights. How would you do that in the new Keras?
The topology of a network is part of the network definition, not a layer add-on. Why couldn't this be built with the functional API outlined above?
Oh another useful metric, of course, is perplexity.
Since Keras can use TF as a backend, should it be able to seamlessly load pretrained TF models?
@fchollet sorry for the late reply.
As mentioned before, sampling (e.g. Noise Contrastive Estimation) is the simplest example: here you want to evaluate only part of the softmax. Which part? That depends on y.
My intuition is that the more general solution should usually be preferred: why should we be constrained to simple (X, y) training schemes? When building a model I want to describe how certain inputs are turned into a loss signal, which will be fed into SGD. Sure, sometimes the loss is f(y, preds), but sometimes it is f(input, preds), f(input, gradients, preds), and so on.
I often find it easier to work in pure TF than to hack Keras into giving me the control I want.
@elanmart if you need a custom training procedure, but still want to benefit from Keras layers, there is no obstacle to defining your network in Keras then training it in pure TF, outside of Keras. I do that sometimes. Then you can write whatever loss you like, get access to the gradients, etc.
Keras is not meant to be 100% flexible, but you can choose to only use some of its features. It's not all or nothing.
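A minimal sketch of that pattern (TF backend assumed; the manual cross-entropy and placeholder names are illustrative):
import tensorflow as tf
from keras.layers import Input, Dense

# define the network with Keras layers
inp = Input(shape=(784,))
preds = Dense(10, activation='softmax')(inp)

# ... then write the loss and training op in plain TF
labels = tf.placeholder(tf.float32, shape=(None, 10))
loss = -tf.reduce_mean(tf.reduce_sum(labels * tf.log(preds + 1e-8), 1))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
# then run train_step in a TF session, feeding inp and labels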
@fchollet Could it be flexible enough to support morphing networks? I think that in the near future we will be facing networks whose structure changes dynamically, even during training.
And conditional subgraph execution/growing/morphing, where one of the intermediate output losses could also be part of the computation/morphing/growing condition of another branch (e.g. bootstrap a localization regression, then after a threshold on its loss activate the classification subgraph).
Last call: should the computation of metrics (accuracy, etc.) during training be part of the computation graph (i.e. metric functions would be written in Theano/TF/K), or should it happen in Numpy-space?
+1 for metrics in TF/K space.
+1 NP space
TF/K space! At the data scales we use Keras with, our evaluation takes a significant amount of time (hours), and that's on the GPU.
Going to go with K-space metrics, it feels cleaner. Also, if users want Numpy metrics, it's much easier to add them on their own (via callbacks, etc.) than the other way around.
So modular metrics are now working, and they are added to the computation graph. The way it works: you pass a metrics argument to compile (could be a list, dictionary, dict of lists... of functions or strings). Allows you to monitor any number of metrics for any or all model outputs. Fully configurable.
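For example (the model is assumed to already be defined; my_metric is an arbitrary custom function written in backend ops):
from keras import backend as K

def my_metric(y_true, y_pred):
    # mean absolute error, expressed symbolically so it runs inside the graph
    return K.mean(K.abs(y_pred - y_true))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy', my_metric])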
@fchollet
if you need a custom training procedure, but still want to benefit from Keras layers, there is no obstacle to defining your network in Keras then training it in pure TF, outside of Keras.
Yeah, that's what I usually do. The new API looks sweet, when can we expect it on HEAD?