Here is a proposal for a new API to replace Graph: https://gist.github.com/fchollet/314085fffa200de9c3da
Graph was verbose, inelegant, and at times impractical. This is a more beautiful way to build networks.
This is planned to be 100% compatible: anything that used to work will still work after these changes.
Indeed it does. Amazingly enough, though, it was developed independently after watching an advanced user react negatively to the Keras Graph API. I realized they were right (I never liked Graph, tbh) and started to search for ways to make Keras functional. It turns out there is only one API (modulo a few details) that you can converge to in order to make layers functional while keeping 100% backward compatibility, and it is this one.
Survey: which "internal" layer/container methods are you using in your custom layers? This is important so I can plan ahead and avoid breaking your custom layers in the upgrade.
Is it possible for later layers to easily access information from previous layers, for example the output_shape of a previous layer (which is not a fixed value for an FCN) at run time?
For example, something similar to this:
add_node(ResizeLayer(output_shape=layer_3.output_shape), name='layer_10')
@shampool that's possible in the current version and will still be possible in the new one, of course.
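For illustration, a sketch of how that could look in the functional API (ResizeLayer is hypothetical; this assumes a layer exposes an output_shape property once it has been called on an input):
from keras.layers import Input, Dense

a = Input(shape=(32,))
layer_3 = Dense(16)
h = layer_3(a)
# layer_3.output_shape is inferred once the layer has been called on an input
resized = ResizeLayer(output_shape=layer_3.output_shape)(h)  # ResizeLayer is hypothetical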
Heads up: I am planning on removing the train=True/False argument used when computing layer outputs and replacing it with symbolic control flow in the computation graph. For instance, you would apply train-time dropout via:
x = K.in_train_phase(K.dropout(x, level=0.5), x)
Thoughts, objections?
Note that this allows us to have a single computation graph in our model (instead of two separate ones for train and test time). This also forces us to always pass the training phase as a boolean scalar input to the graph.
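As a minimal sketch of what that looks like at the backend level (assuming K.learning_phase() is that boolean scalar input, fed as 1 for training and 0 for testing):
from keras import backend as K
import numpy as np

x = K.placeholder(shape=(None, 100))
y = K.in_train_phase(K.dropout(x, level=0.5), x)  # dropout branch only in the train phase

f = K.function([x, K.learning_phase()], [y])
data = np.random.random((32, 100)).astype('float32')
out_train = f([data, 1])  # training phase: dropout active
out_test = f([data, 0])   # test phase: identity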
Also, we will have modular metrics in the new version (instead of hard-coded access to accuracy only).
Questions:
What are the advantages of making dropout (and batch normalization?) more verbose? I know most of the time we don't have differences between train and test in the computation graph, but what are the other reasons?
+1 to more metrics. We could use a dict of (id, name) pairs as input to "compile". We could have a metrics.py the same way we have objectives.py. That would be easy for metrics that depend only on the outputs and targets.
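For example, a metric could follow the same (y_true, y_pred) signature as an objective and be written purely in backend ops (a sketch; the rmse name and its placement in a metrics.py are assumptions):
from keras import backend as K

def rmse(y_true, y_pred):
    # root mean squared error as a symbolic backend expression
    return K.sqrt(K.mean(K.square(y_pred - y_true)))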
It simplifies Keras graph management (it's nice not to have two hardcoded redundant copies of everything), and it reduces compilation time and memory overhead because there is only one computation graph underneath.
+1 to metrics. Things like gradient monitoring would also be really nice to add.
I'd like to see user-defined metrics. The interface should be similar to how you define a loss function, but of course its value is recorded in history instead of being added to the gradient.
I'd also like to see a way to put the same layer in multiple locations in a graph. For example, you create a dense layer and then clone it. The new copy uses the same weights as the first, but you are free to put them in different locations as long as the sizes match. This is much more flexible than the current add_shared_node.
I'd also like to see a way to put the same layer in multiple locations in a graph.
Yes, that is part of what is described in the API proposal above.
I'd like to see user-defined metrics. The interface should be similar to how you define a loss function, but of course its value is recorded in history instead of being added to the gradient.
Yes, that's what is planned. But can we answer the main question: should metrics computation be part of the symbolic computation graph, or happen in numpy-land as a post-processing step? Which metrics would you like to see?
Things like gradient monitoring would also be really nice to add in
Thanks for the suggestion, will think about it.
For binary classification, I would like:
-recall/precision
-AUC
-lift
For multiclassification:
-log loss
For regression:
-RMSE, of course
... and many more, but that would be a good start.
And basically you could add all the metrics used in Kaggle competitions (https://www.kaggle.com/wiki/Metrics).
I am not very familiar with Keras yet, but if you were to build these metrics into your graph, could you still apply the graph to data where you don't know the desired result (for example in production environments)? If the answer is yes, then I would probably advocate for layer-like metrics ;-)
I am planning on removing the train=True/False argument used when computing layer outputs and replacing it with symbolic control flow in the computation graph. For instance, you would apply train-time dropout via:
If this is transparent to the user, then I think it's a net positive to add verbosity to the internals in order to simplify life for the user. However, I wonder if it can't be simplified with the kind of pattern used in Chainer. Something along these lines:
def dropout_base(x, level=0.5):
    return ...  # bla bla bla (the raw dropout op on x)

def dropout(x, level=0.5):
    # only apply the raw dropout op in the training phase
    return K.in_train_phase(dropout_base(x, level=level), x)

# in other code somewhere now...
x = K.dropout(x)  # basically expands to x = K.in_train_phase(K.dropout_base(x, level=0.5), x)
So that the verbosity penalty applies only once, when you define the function - and then becomes mostly transparent when used.
I think I wasn't clear the first time around, so let me clarify. My dropout example was specifically for applying dropout _inside_ a layer (would formerly have been inside get_output(), although it is getting renamed). It is what you would use _when writing custom layers_.
At the model level, when you are adding a dropout layer to a model the syntax doesn't change. You would still do:
model = Sequential()
model.add(...)
model.add(Dropout(0.5))
# functional API
a = Input(...)
b = Dropout(0.5)(a)
Now what would change would be the _internals_ of the Dropout and BatchNorm layers, as well as any custom layer you have that uses a different behavior at training time and testing time.
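For instance, a rough sketch of what such a layer's internals could look like after the change (the class name is made up, and the Keras 1-style call(self, x, mask=None) signature is assumed):
from keras import backend as K
from keras.engine.topology import Layer

class MyDropout(Layer):
    def __init__(self, level=0.5, **kwargs):
        super(MyDropout, self).__init__(**kwargs)
        self.level = level
        self.uses_learning_phase = True  # this layer behaves differently at train and test time

    def call(self, x, mask=None):
        # apply dropout only in the train phase, identity otherwise
        return K.in_train_phase(K.dropout(x, level=self.level), x)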
I am looking for some robust loss functions that can be switched in after some initial standard loss function is used to get to a reasonably low loss. For example: http://arxiv.org/abs/1412.6596
I couldn't find a way to switch out the loss function without recompiling the model, which resets the learning rate, weights, etc. Not sure if I wasn't using the API correctly, but this is something I am interested in.
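One possible workaround today is to simply call compile again with a different loss; a sketch, under the assumption that layer weights are not re-initialized by compile (only the optimizer state and training function are rebuilt), with my_robust_loss, X_train and y_train as placeholders:
# phase 1: train with a standard loss
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.fit(X_train, y_train, nb_epoch=5)

# phase 2: switch to a robust loss once the loss is reasonably low
model.compile(optimizer='sgd', loss=my_robust_loss)  # my_robust_loss: hypothetical custom objective
model.fit(X_train, y_train, nb_epoch=5)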
I think the proposal is solid. It will allow us more freedom in developing our own layers and reduce the "hacks" the current version requires to get a new layer working. As far as what functions should be there, I think you are pretty much the expert on this. One special point for me, though, would be easy usage of the time-distributed layers functionality in my custom layers. Furthermore, something that I personally like as a "metric" would be the ability to use a layer's weights as an input to a custom metric. It would make it possible to visualize the changes from epoch to epoch, or just to see the dynamic changes in my feature maps, which is not only cool, but gives lots of insight into what the network is learning. Same goes for language modelling (given a rigid training/testing regime). All in all, I think if Keras keeps going it's going to blow all other frameworks out of the water, especially with TF 0.7 coming out soon. Btw, since I mentioned TF, will Keras be able to use its distributed capabilities out of the box?
Metric calculation: I would prefer the computational-graph option, because it's more elegant, and since TF is going distributed it will allow the computation of heavier metrics to be distributed and GPU-powered if needed.
Internal layer/container methods: the already-mentioned time-distributed containers, plus all of the TF layer building blocks, which I am sure you would have included even if I were not to put it here.
Distributed TF: will it work out of the box?
That's all for this time. Have a nice day.
@fchollet Monitors as part of the graph is the best option because it is more flexible. We could add monitors the same way we add a dropout, something like:
x = K.in_train_phase(x, monitors=["output_norm", "gradient_norm", "number_of_zeros", "is_nan"])
The drawback will be longer compilation times. But let us remember that monitors will be optional, not mandatory like they were for #RipPylearn2.
I have been thinking about writing Debug layers, something like a layer that does nothing but prints something about the layer output or weights. That may be easier to do if we add Monitors. The thing is maybe "monitors" are more user friendly than callbacks in some cases.
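A rough sketch of such a Debug layer with the TF backend (the layer just passes its input through while printing a summary statistic; the debug_print name is made up):
import tensorflow as tf
from keras.layers import Lambda

def debug_print(x):
    # identity op that prints the mean activation every time the graph is run
    return tf.Print(x, [tf.reduce_mean(x)], message='mean activation: ')

x = Lambda(debug_print)(x)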
Re metrics, we use some ranking order based metrics, specifically:
I believe the network architecture visualisation should come back.
+1 to top-k and network visualizer.
What I think is a bottleneck right now is the possibility of changing the training procedure:
You need to use a lot of hacks and counterintuitive procedures to implement things like:
Another point, more related to the new functional API: consider the following issue to be taken into account in the way we calculate the gradient for each param. http://deeplearning.net/software/theano/tutorial/faq_tutorial.html
By the way, I really like this idea in general.
@fchollet
Have you thought about making losses a Layer? I believe this would make Keras more flexible (training with hierarchical softmax or negative sampling is currently hacky / ugly / impossible) and more elegant (writing autoencoder.fit(X) instead of autoencoder.fit(X, X)). Overall, the new API is neat.
Have you thought about making losses a Layer?
Can you give a few examples? How would it allow things that are not currently possible?
But is there such control flow in TF already? Last time I checked there was only something switch-like which evaluated both of the expressions?
Just like in Theano. Control flow as a symbolic op, part of the graph.
In the gist example I don't see how you define which outputs contribute to the loss.
Perhaps a loss layer will be an easy way to solve this problem.
I tried to implement a Ladder network on top of Keras and it wasn't natural in any way.
In a Ladder network you compare the output of a decoder to the output of the MLP at different layers and sum the results with weights. How would you do that in the new Keras?
Have you thought about making losses a Layer?
Can you give a few examples? How would it allow things that are not currently possible?
Actually one thing I was just looking at: using sampled loss functions from TF, like e.g. sampled_softmax_loss here: https://www.tensorflow.org/versions/r0.7/api_docs/python/nn.html#sampled_softmax_loss
You'll note that it specifies:
This operation is for training only. It is generally an underestimate of the full softmax loss.
so because Keras' layer API includes information about whether it's train or test time, you could have it use sampled_softmax during training but not at test time. The objectives API does not support this.
In a Ladder network you compare the output of a decoder to the output of the MLP at different layers and sum the results with weights. How would you do that in the new Keras?
The topology of a network is part of the network definition, not a layer add-on. Why couldn't this be built with the functional API outlined above?
Oh another useful metric, of course, is perplexity.
Since Keras can use TF as a backend, should it be able to seamlessly load pretrained TF models?
@fchollet sorry for the late reply.
As mentioned before, sampling (e.g. Noise Contrastive Estimation) is the simplest example: here you want to evaluate only part of the softmax. Which part? That depends on y.
My intuition is that the more general solution should usually be preferred: why should we be constrained to simple (X, y) training schemes? When building a model I want to describe how certain inputs are turned into a loss signal, which will be fed into SGD. Sure, sometimes the loss is f(y, preds), but sometimes it is f(input, preds), f(input, gradients, preds), and so on.
I often find it easier to work in pure TF than to hack Keras into giving me the control I want.
@elanmart if you need a custom training procedure, but still want to benefit from Keras layers, there is no obstacle to defining your network in Keras then training it in pure TF, outside of Keras. I do that sometimes. Then you can write whatever loss you like, get access to the gradients, etc.
Keras is not meant to be 100% flexible, but you can choose to only use some of its features. It's not all or nothing.
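A minimal sketch of that pattern (TF backend assumed; the manual cross-entropy and placeholder names are illustrative):
import tensorflow as tf
from keras.layers import Input, Dense

# define the network with Keras layers
inp = Input(shape=(784,))
preds = Dense(10, activation='softmax')(inp)

# ... then write the loss and training op in plain TF
labels = tf.placeholder(tf.float32, shape=(None, 10))
loss = -tf.reduce_mean(tf.reduce_sum(labels * tf.log(preds + 1e-8), 1))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
# then run train_step in a TF session, feeding inp and labels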
@fchollet Could it be flexible enough to support morphing networks? I think that in the near future we will be facing networks whose structure changes dynamically, even during training.
And conditional subgraph execution/growing/morphing, where one of the intermediate output losses could also be part of the computation/morphing/growing condition of another branch (e.g. bootstrap a localization regression, then after a threshold on its loss activate the classification subgraph).
Last call: should the computation of metrics (accuracy, etc.) during training be part of the computation graph (i.e. metric functions would be written in Theano/TF/K), or should it happen in Numpy-space?
+1 for metrics in TF/K space.
+1 NP space
TF/K space! At the data scales we use Keras with, our evaluation takes a significant amount of time (hours), and that's on the GPU.
Going to go with K-space metrics, it feels cleaner. Also, if users want Numpy metrics, it's much easier to add them on their own (via callbacks, etc.) than the other way around.
So modular metrics are now working, and they are added to the computation graph. The way it works: you pass a metrics argument to compile (could be a list, dictionary, dict of lists... of functions or strings). Allows you to monitor any number of metrics for any or all model outputs. Fully configurable.
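For example (the model is assumed to already be defined; my_metric is an arbitrary custom function written in backend ops):
from keras import backend as K

def my_metric(y_true, y_pred):
    # mean absolute error, expressed symbolically so it runs inside the graph
    return K.mean(K.abs(y_pred - y_true))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy', my_metric])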
@fchollet
if you need a custom training procedure, but still want to benefit from Keras layers, there is no obstacle to defining your network in Keras then training it in pure TF, outside of Keras.
Yeah, that's what I usually do. The new API looks sweet, when can we expect it on HEAD?