I find SoftmaxOutput very confusing; it becomes a pitfall for new MXNet users like me.
http://mxnet.io/tutorials/python/mnist.html
In this tutorial, SoftmaxOutput makes its appearance very casually. It appears only once and is never mentioned again in the other tutorials.
# The softmax and loss layer
mlp = mx.sym.SoftmaxOutput(data=fc3, name='softmax')
It says in the COMMENT that it is the softmax AND loss layer.
I guess this was made as a convenience function so that programmers don't have to write:
act3 = mx.sym.softmax(data=fc3, name='softmax')
loss = .......
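Spelled out, the elided loss line might look something like this. This is only a sketch under the assumption of one-hot labels; the MakeLoss wrapper, the epsilon, and the variable names are my additions, not tutorial code:

label = mx.sym.Variable('softmax_label')                   # hypothetical: assumes one-hot labels
ce = -mx.sym.sum(label * mx.sym.log(act3 + 1e-8), axis=1)  # per-sample cross-entropy
loss = mx.sym.MakeLoss(mx.sym.mean(ce), name='ce_loss')    # gradients originate from this op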
However, this becomes a hidden gotcha later on.
For example,
http://mxnet.io/tutorials/computer_vision/detection.html
If I were to implement Faster R-CNN, I'd have to define the activation and the loss separately and explicitly, since the region predictions are regressions, not classifications.
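For instance, a bounding-box regression head might look like this. A sketch only; the names and the feature layer fc2 are made up, not taken from the tutorial:

bbox_label = mx.sym.Variable('bbox_label')                                    # hypothetical regression targets
bbox_pred = mx.sym.FullyConnected(data=fc2, num_hidden=4, name='bbox_pred')   # fc2: some feature layer
bbox_loss = mx.sym.LinearRegressionOutput(data=bbox_pred, label=bbox_label, name='bbox_loss')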
Another hidden gotcha is that the output of the symbolic network definition must be a loss function, whereas in Keras activations (or any tensor) are the output of the 'symbolic' network. In Keras you define the loss later, somewhere else, or write it in short form like:
model = convnet()
model.compile(loss='mse', optimizer='sgd')
This means some users are not used to defining the output as a loss function.
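Spelled out a bit more, the Keras version might look like this (a hypothetical sketch; the layers standing in for convnet() are made up):

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential([              # a stand-in for convnet(): the network ends in an activation
    Dense(128, input_dim=784),
    Activation('relu'),
    Dense(10),
    Activation('softmax'),
])
model.compile(loss='mse', optimizer='sgd')  # the loss is named only here, outside the network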
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/examples/tutorials/mnist/mnist_softmax.py
Even in TensorFlow, the convention is to define "model" (which does tensor computation) and "loss" separately.
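The linked example follows that convention; from memory it does roughly this (paraphrased, not a verbatim quote of the file):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)        # the "model": pure tensor computation
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(               # the "loss": defined separately from the model
    -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))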
However, SoftmaxOutput hides this critical point by combining the loss function and the activation in the tutorial.
Because of the two problems mentioned above, I suffered from the error below for many hours:
mxnet.base.MXNetError: [10:33:58] D:\chhong\mxnet\src\symbol\symbol.cc:122: Symbol.InferShapeKeyword argument name angle not found.
as I didn't know I had to make the loss the output until I read the source code very, very, very carefully.
I think the tutorial would be better if the MNIST tutorial
http://mxnet.io/tutorials/python/mnist.html
were modified so that, in the example network, SoftmaxOutput is not used and the activation and loss are separated:
act3 = mx.sym.softmax(data=fc3, name='softmax')
loss = .......
Then another piece of source code should follow, saying: "You can combine the above lines and write them in short form like this. Note that the last output of the symbolic network must be a loss function."
mlp = mx.sym.SoftmaxOutput(data=fc3, name='softmax')
It's the same with caffe's SoftmaxWithLoss.
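For anyone who wants to check this behaviour directly, here is a small experiment I believe demonstrates it (a sketch; shapes and names are arbitrary):

import mxnet as mx
import numpy as np

data = mx.sym.Variable('data')
label = mx.sym.Variable('softmax_label')
out = mx.sym.SoftmaxOutput(data=data, label=label, name='softmax')

x = mx.nd.array(np.random.randn(2, 3))
y = mx.nd.array([0, 2])
exe = out.bind(mx.cpu(), args={'data': x, 'softmax_label': y},
               args_grad={'data': mx.nd.zeros((2, 3))})
exe.forward(is_train=True)
print(exe.outputs[0].asnumpy())          # forward: softmax probabilities, not a scalar loss
exe.backward()                           # no head gradient needed; the op supplies it
print(exe.grad_dict['data'].asnumpy())   # backward: cross-entropy gradient, i.e. prob - one_hot(label)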
We do need to improve the docs.
https://github.com/dmlc/mxnet/blob/master/example/python-howto/multiple_outputs.py
In relation to this issue: in the multiple-output example, how is it that fc1 doesn't use any loss function before binding?
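For context, my reading of that example's pattern is roughly the following (reconstructed from memory, so details may differ from the actual file):

data = mx.sym.Variable('data')
fc1 = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=128)
out = mx.sym.SoftmaxOutput(data=fc1, name='softmax')
group = mx.sym.Group([fc1, out])   # fc1 is exposed read-only; having no loss, it contributes no gradient of its own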
I second the complaint that it is confusing that the softmax layer outputs the softmax values while silently computing the loss on the backward pass. This design is problematic in relation to the MakeLoss function: MakeLoss causes the output of the network to be the loss value, but all of the built-in output layers output the last layer of the net and silently compute the gradient with respect to the loss on the backward pass.
I also second forcecore's question above. The multiple-output example is confusing; it should be expanded to explain how the loss is computed when using multiple outputs. If I use MakeLoss, can one output be the loss and the other the final layer? Otherwise, how is one expected to access the final layer for making predictions?
Overall, what is the proper way to define a custom loss in the MXNet framework? Is it to define a custom operator that outputs the final layer on the forward pass but computes the gradient with respect to a loss function on the backward pass? If so, how would one plot the loss over time during training?
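Concretely, something like this is what I am imagining (a sketch; the MakeLoss/BlockGrad/Group combination is my guess at the intended pattern, and the names are made up):

data = mx.sym.Variable('data')
label = mx.sym.Variable('label')
pred = mx.sym.FullyConnected(data=data, num_hidden=1, name='pred')

mse = mx.sym.MakeLoss(mx.sym.mean(mx.sym.square(pred - label)), name='mse_loss')  # custom loss drives backward
pred_out = mx.sym.BlockGrad(pred, name='pred_out')   # raw predictions; BlockGrad stops gradient flowing here
net = mx.sym.Group([mse, pred_out])                  # output 0: loss value (loggable per batch); output 1: predictions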
Thanks for taking the time to explain your design.
@forcecore Do you have an example of how to define a loss function? I'm also confused by this.
This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!
Yes, I think so.
Yes, I concur. I fell into the same pitfall. I would have expected the loss function to be outside of the network definition. And even knowing that it is in the network definition, I would have put it outside of a layer definition.
And I would add that the docs won't necessarily help, since there are several supported languages and thus several sets of docs. I would not necessarily think of consulting the documentation for another language that does not share the same philosophy (and has a distant API). And the C++ docs are kind of small - I think some other basics should be covered there before explaining this.
The same problem hit me when I was looking at some code written in MXNet and couldn't find the loss function. (In fact, there are several SoftmaxOutput calls in the code.)