Darknet: [Feature Request] Mish Activation function +0.5 AP wrt. Swish

Created on 26 Sep 2019  ·  59 Comments  ·  Source: AlexeyAB/darknet

Mish:
f(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x))

https://arxiv.org/abs/1908.08681

https://github.com/digantamisra98/Mish
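
For reference, a minimal, numerically naive C sketch of the formula above (illustrative only, not the darknet implementation; as discussed further down in the thread, a production version needs an overflow-safe softplus):

```c
#include <math.h>

/* Naive Mish: f(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x)).
 * Illustrative only: expf(x) overflows for large x, so real
 * implementations threshold the softplus (see the discussion below). */
static float mish_naive(float x)
{
    float sp = log1pf(expf(x));  /* softplus(x) = ln(1 + e^x) */
    return x * tanhf(sp);
}
```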

enhancement

All 59 comments

[screenshots: the Mish formula and its derivative from the paper]

The concept of non-linearity in a Neural Network is introduced by an activation function, which serves an integral role in the training and performance evaluation of the network. Over the years of theoretical research, many activation functions have been proposed; however, only a few are widely used across most applications, including ReLU (Rectified Linear Unit), TanH (Tan Hyperbolic), Sigmoid, Leaky ReLU and Swish. In this work, a novel neural activation function called Mish is proposed. The experiments show that Mish tends to work better than both ReLU and Swish, along with other standard activation functions, in many deep networks across challenging datasets. For instance, in Squeeze Excite Net-18 for CIFAR-100 classification, the network with Mish had an increase in Top-1 test accuracy of 0.494% and 1.671% as compared to the same network with Swish and ReLU respectively. The similarity to Swish, along with the boost in performance and its simplicity of implementation, makes it easier for researchers and developers to use Mish in their Neural Network models.

@AlexeyAB Was wondering if you're considering adding Mish? In that regard, based on the above screenshot, there is a mistake in the derivative formula which I have updated in my paper. The link to the updated paper and additional results are in my repository here - https://github.com/digantamisra98/Mish
Thanks!

@digantamisra98

I added MISH-activation.
use activation=mish in [convolutional] layers

Please, check that implementation is correct: https://github.com/AlexeyAB/darknet/commit/bf8ea4183dc265ac17f7c9d939dc815269f0a213


Thanks! So the error was in delta?


Just checked, the implementation is correct. Thanks. Yes, the error was a typo in the delta term.

now training.

I usually get NaN. Do I need to adjust the learning rate schedule?

burn_in=2000
learning_rate=0.1
policy=poly
power=4
max_batches=1600000
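
(For context, a rough C sketch of how these fields are commonly interpreted: the poly policy decays the rate as learning_rate * (1 - iter/max_batches)^power, while burn_in ramps the rate up over the first iterations. The exact burn_in ramp used here is an assumption, not verified against darknet's source.)

```c
#include <math.h>

/* Sketch of the schedule implied by the config above (my reading of the
 * poly policy; the burn_in ramp exponent is an assumption, not checked
 * against darknet's source). */
static float poly_lr(int iter, float base_lr, int burn_in,
                     int max_batches, float power)
{
    if (iter < burn_in)  /* warm-up: ramp the rate up from ~0 */
        return base_lr * powf((float)iter / burn_in, power);
    /* poly decay: learning_rate * (1 - iter/max_batches)^power */
    return base_lr * powf(1.0f - (float)iter / max_batches, power);
}

/* e.g. poly_lr(800000, 0.1f, 2000, 1600000, 4.0f) ~= 0.1 * 0.5^4 = 0.00625 */
```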

@WongKinYiu What model are you trying to train?

A DenseNet-based model. I'll try a Darknet-based model first.

@WongKinYiu Here's the DenseNet code I used to test Mish - https://github.com/digantamisra98/Mish/blob/master/Notebooks/cifar-10-DenseNet121_Mish.ipynb
I'd usually advise a lower learning rate, probably around 1e-3 (0.01 - 0.001).
Can you share the log or the code to reproduce the NaN?

The Darknet-based model also gets NaN.

[net]
batch=128
subdivisions=1
height=224
width=224
channels=3
momentum=0.9
decay=0.0005
max_crop=320

learning_rate=0.1
policy=poly
power=4
max_batches=1600000

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=mish

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=mish

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=mish

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=mish

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=mish

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=mish

[maxpool]
size=2
stride=2
padding=1

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=mish

[avgpool]

[convolutional]
filters=1000
size=1
stride=1
pad=1
activation=linear

[softmax]
groups=1

@WongKinYiu
Do you train on ILSVRC2012?
How many iterations did you train before NaN occurred?
Do you use GPU=1 CUDNN=1?

Try to use

[maxpool]
size=2
stride=1

instead of

[maxpool]
size=2
stride=2
padding=1

Yes, I train on ILSVRC2012.
I get NaN after about 3k~5k iterations.
I use GPU=1 and CUDNN=1.

@WongKinYiu Did you try to use initial learning_rate=0.01 or 0.001 ?

Both 0.1 and 0.05 get NaN.
I'll try other settings after I finish my breakfast.
Thanks for your advice.

@WongKinYiu I'll go through Mish's implementation again shortly, confirm that everything is alright, and also give it a try myself to validate it. Thanks for raising the issue.

https://github.com/AlexeyAB/darknet/issues/3994#issuecomment-551143150
I get NaN after 300 iterations.

@digantamisra98 Thanks, I'll also have time to check the implementation after 11/17.

@WongKinYiu Did you try to use initial learning_rate=0.01 or 0.001 ?

0.1, 0.05, 0.01, and 0.001 all get NaN.

With 0.001, I get NaN after 10 iterations.

@AlexeyAB @nhaxin204 @WongKinYiu I went through the implementation again and I believe it's correct. Still, I'm going to implement it in practice this week (sorry, I was a bit occupied last week). I will also ask the fast.ai forum folks to check the implementation to make sure I'm not missing anything.

@AlexeyAB This is Tom's response regarding the NaN issue:

That implementation is not at all numerically stable. All the exps quickly lead to overflow and hence NaN. Should be possible to adapt either the Eigen based implementation from tensorflow contrib or my mostly pure C++ implementation (mostly as it’s using the PyTorch dispatch/templating but is otherwise standard C++). The TF one is probably slightly more stable given handling of both underflow and overflow but will require more adaptation to remove the Eigen dependency.

Here is the TensorFlow Addons commit for Mish - https://github.com/tensorflow/addons/commit/093cdfa85d334cbe19a37624c33198f3140109ed

Tom's CUDA implementation - https://github.com/thomasbrandon/mish-cuda
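
To illustrate the overflow issue Tom describes, here is a hedged C sketch of the thresholded-softplus trick (loosely paraphrasing the linked implementations; the threshold value and function name are assumptions, not the darknet code). Mish then becomes x * tanhf(softplus_stable(x, 20.0f)), which stays finite for large |x|.

```c
#include <math.h>

/* softplus(x) = ln(1 + e^x). expf(x) overflows a float near x ~ 88, so the
 * thresholded form returns x directly for large x (where ln(1 + e^x) ~ x)
 * and e^x for very negative x (where ln(1 + e^x) ~ e^x). The threshold of
 * 20 is an assumption, loosely following the linked PyTorch/TF softplus
 * code; this is not the darknet implementation. */
static float softplus_stable(float x, float threshold)
{
    if (x >  threshold) return x;        /* avoid overflow in expf         */
    if (x < -threshold) return expf(x);  /* avoid log1pf losing precision  */
    return log1pf(expf(x));
}
```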

@digantamisra98

So: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L26-L31

if (in < threshold) new_in = log( expf(in) );
else new_in = in;

gradient = in * ((1 - tanh(new_in)*tanh(new_in)) * (1 - exp(-new_in))) + tanh(new_in);

delta = delta * gradient;
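
(Side note on the formula above: if new_in is the full softplus ln(1 + e^in), then 1 - exp(-new_in) = e^in / (1 + e^in), i.e. the sigmoid of the input, which is exactly the derivative of softplus. The gradient line is then the standard symbolic form d/dx[ x·tanh(softplus(x)) ] = tanh(sp) + x·(1 - tanh(sp)^2)·sigmoid(x).)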

@digantamisra98 @WongKinYiu @LukeAI @nhaxin204
I fixed MISH to this implementation: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L26-L31

Thanks, the behavior is normal now.

@WongKinYiu Can you post the log?

[training chart]

@deimsdeutsch

How do you think this issue should be solved?

[screenshot with several questions]

@AlexeyAB

  1. The answer to that question is discussed in Google Brain's Swish paper - https://arxiv.org/pdf/1710.05941v1.pdf
  2. A simple fully connected net:

"To observe how increasing the number of layers in a network while maintaining other parameters constant affects the test accuracy, fully connected networks of varying depths, with each layer having 500 neurons, were trained on MNIST. Residual connections were not used because they enable the training of arbitrarily deep networks. BatchNorm was used to lessen the dependence on initialization, along with a dropout of 25%. The networks were optimized using SGD with a batch size of 128, and for a fair comparison, the same learning rate was maintained for each activation function."

  3. No residual connections were used.

  4. Currently benchmarking on ImageNet.

  5. I'll take a look again and get back to you on that.

@digantamisra98

  4. Currently benchmarking on ImageNet.

What model do you use for benchmarking on ImageNet?
Is it ResNet-101, EfficientNet or MixNet?

@AlexeyAB As of right now, I'm doing ResNet-56, MobileNet v2, NasNet-A, SEResNet-50, and ShuffleNet v1. Currently ShuffleNet is in progress.

@AlexeyAB This is the response that Tom provided in regards to your question of varying thresholds:

The differences between PyTorch and TF reflect slight differences in their implementations of softplus. The single threshold in my CUDA version reflects the PyTorch logic. I don’t think that the differences are big enough that there’s any strong reason to use the same implementation so think you could just as well use the TF logic for Mish in PyTorch. They just both come from borrowing the relevant softplus implementation.
I’m not sure the differences make a real impact and wouldn’t prevent converting models, at least not between TF and PyTorch. As noted this would also potentially apply to any model using softplus.
If there is indeed no threshold in MXNet then that may cause issues. But this also depends on other details. There may be other handling of non-finite values that would mitigate issues. It also depends on the datatypes used. In general this is mostly an issue for 16-bit floats. Though I think I did see some issues with 32-bit floats, that was with the quite unstable calculation involving multiple exponents rather than the symbolically derived gradient calculation.

Oh and I’ve responded to that post.
I’d also note that you pointed to the Autograd implementation which should reduce memory usage but will result in lower performance. The JIT version combines both the lower memory usage and better performance so should generally be preferred.
The one issue is support in older PyTorch versions. It should be fine in PyTorch 1.2 and 1.3 (though I’ve mostly tested in 1.3). I think it should probably also work in 1.1 and maybe even 1.0 in which case it should always be fine as I can’t imagine you’d want to support pre-1.0 anymore.
But the JIT version should probably be preferred unless older support is key. I’d also note that I don’t think my CUDA version will work pre-1.2 so the JIT version should offer equivalent performance and version support. I just need to run a few extra tests on the JIT version and then will likely update the repo to indicate the JIT version should be preferred.

@digantamisra98

The differences between PyTorch and TF reflect slight differences in their implementations of softplus. The single threshold in my CUDA version reflects the PyTorch logic. I don’t think that the differences are big enough that there’s any strong reason to use the same implementation so think you could just as well use the TF logic for Mish in PyTorch.

I think these are clearly 2 different MISH functions, so weights trained in Pytorch can't be used in TF and vice versa.
This is not only due to 1 vs 2 thresholds, but also due to different formulas - actually different activation functions:

  • Pytorch: output = input * tanh(log( exp(input) ))
  • TF: output = input * tanh(log( exp(input) + 1 ));

  • Pytorch: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L17-L20

    if (input < THRESHOLD) output  = input * tanh(log( exp(input) ))
    else output  = input * tanh(input)
    
  • TF: https://github.com/tensorflow/addons/blob/093cdfa85d334cbe19a37624c33198f3140109ed/tensorflow_addons/custom_ops/activations/cc/kernels/mish_op.h#L40-L49
    ```cpp
    if (input > THRESHOLD) output = input * tanh( input ); // too large
    else if (input < -THRESHOLD) output = input * tanh( exp(input) ); // too small
    else output = input * tanh(log( exp(input) + 1 ));
    ```

http://fooplot.com/#W3sidHlwZSI6MCwiZXEiOiJ0YW5oKGxuKGV4cCh4KSsxKSkiLCJjb2xvciI6IiMwMDAwMDAifSx7InR5cGUiOjAsImVxIjoidGFuaChsbihleHAoeCkpKSIsImNvbG9yIjoiIzAwMDAwMCJ9LHsidHlwZSI6MTAwMH1d

[plot: tanh(ln(exp(x)+1)) vs tanh(ln(exp(x)))]


Also about thresholds:

  • The threshold in Pytorch doesn't change the activation function much, so it is fine: output = input * tanh( input ) ~= output = input * tanh(log( exp(input) ))
  • But the second threshold in TF changes the activation function noticeably, at least in some range (maybe if input < -THRESHOLD it doesn't matter): tanh( exp(x) ) != tanh(ln(exp(x) + 1))

http://fooplot.com/#W3sidHlwZSI6MCwiZXEiOiJ0YW5oKGV4cCh4KSkiLCJjb2xvciI6IiMwMDAwMDAifSx7InR5cGUiOjAsImVxIjoidGFuaChsbihleHAoeCkrMSkpIiwiY29sb3IiOiIjMDAwMDAwIn0seyJ0eXBlIjoxMDAwfV0-

[plot: tanh(exp(x)) vs tanh(ln(exp(x)+1))]
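
A quick numeric check (not from the thread; plain C, illustrative only) of how far apart the two branch formulas are at a few negative inputs. At moderately negative x the two curves differ visibly, while far below a typical threshold the difference falls under float precision, which matches the parenthetical above.

```c
#include <math.h>
#include <stdio.h>

/* Compares the small-input branch tanh(exp(x)) with the exact
 * tanh(ln(1 + e^x)) at a few negative inputs (illustrative check only,
 * not darknet code). */
int main(void)
{
    const float xs[] = { -1.0f, -5.0f, -20.0f };
    for (int i = 0; i < 3; i++) {
        float x = xs[i];
        printf("x = %6.1f  tanh(exp(x)) = %.9f  tanh(ln(1+e^x)) = %.9f\n",
               x, tanhf(expf(x)), tanhf(log1pf(expf(x))));
    }
    return 0;
}
```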


I’d also note that you pointed to the Autograd implementation which should reduce memory usage but will result in lower performance. The JIT version combines both the lower memory usage and better performance so should generally be preferred.

Which link are you talking about?

@AlexeyAB Agreed on the different functional implementations. I guess I'll do a PR to change it. Thanks for clarifying, I completely missed that. Regarding the comparison between JIT and Autograd, I've asked him for further clarification.

@AlexeyAB Hello,
I trained my model using the 11/13 repo and tested on the ILSVRC 2012 val set.

| type | top-1 | top-5 |
| :-: | :-: | :-: |
| leaky | 70.9 | 90.2 |
| swish | 71.7 | 90.8 |
| mish | 70.9 | 90.2 |

I see there were some fixes to Mish yesterday.
Do I need to retrain the Mish model using the latest repo?

@AlexeyAB The PyTorch implementation by Tom has log1p instead of log, which computes log(x+1) and not just log(x).
@WongKinYiu Can you point me to the repository with the code you use to train on ImageNet?
What model did you use?

@digantamisra98 Yes, you are right. I implemented MISH with 2 thresholds as in TF.

@WongKinYiu Try to train with the latest code. I fixed MISH today: https://github.com/AlexeyAB/darknet/commit/b9ca5ec781291f01174d6b496a9c3ebc59303c1f
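
For readers following along, a hedged C sketch of the TF-style two-threshold logic described above (the threshold constant is an assumption for illustration; see the linked commit for the actual darknet code):

```c
#include <math.h>

/* TF-style Mish with two thresholds, as described earlier in the thread.
 * MISH_THRESHOLD = 20 is assumed for illustration; refer to the linked
 * darknet commit for the real constant and the CPU/CUDA kernels. */
#define MISH_THRESHOLD 20.0f

static float mish_tf_style(float x)
{
    if (x >  MISH_THRESHOLD) return x * tanhf(x);         /* softplus(x) ~ x   */
    if (x < -MISH_THRESHOLD) return x * tanhf(expf(x));   /* softplus(x) ~ e^x */
    return x * tanhf(log1pf(expf(x)));                    /* exact ln(1 + e^x) */
}
```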

@WongKinYiu are you working on training ImageNet currently using the updated Mish implementation?

@digantamisra98

No, I'm training res2netlite72.
I'll retrain the Mish model and report results.
It will take 1~2 weeks.

@AlexeyAB Mish performs well after being fixed in https://github.com/AlexeyAB/darknet/issues/3994#issuecomment-557495489.

| Model | Activation | Top-1 | Top-5 |
| :---- | :--------: | :---: | :---: |
| PeleeNet | LReLU | 70.7 | 90.0 |
| PeleeNet | Swish | 71.5 (+0.8) | 90.7 (+0.7) |
| PeleeNet | Mish | 71.4 (+0.7) | 90.4 (+0.4) |
| | | | |
| CSPPeleeNet | LReLU | 70.9 | 90.2 |
| CSPPeleeNet | Swish | 71.7 (+0.8) | 90.8 (+0.6) |
| CSPPeleeNet | Mish | 71.2 (+0.3) | 90.3 (+0.1) |
| | | | |
| CSPResNeXt-50 | LReLU | 77.9 | 94.0 |
| CSPResNeXt-50 | Mish | 78.9 (+1.0) | 94.5 (+0.5) |
| CSPResNeXt-50 | Swish | 64.5 (-13.4) | 86.0 (-8.0) |

@WongKinYiu Thanks for sharing the results. These are single runs, right?

@WongKinYiu Thanks! It seems MISH sometimes isn't better than SWISH on ImageNet, especially on large models.

@digantamisra98 Are there other MISH tests for ImageNet?
Or for recurrent networks (RNN, LSTM, convolutional-LSTM, ...) and Transformer/BERT models? As I can see, ImageNet and Transformer are on the roadmap: https://github.com/digantamisra98/Mish#future-work-coming-soon

@digantamisra98
Yes, I cannot afford multiple runs currently.
But in my previous experiments, darknet always gives me similar results if I use the same machine and the same settings for training.

@AlexeyAB
In my experiments, Mish is more stable than Swish.
For ResNeXt-based models, Swish can drop accuracy by more than 10% on ImageNet.

@AlexeyAB Yes, there are a lot of further benchmarks coming in the next updated version of the paper by January. I'm still working on it.
Though I'm interested to see the statistical stability and the CI scores of Swish, because so far in my results Mish is much more stable than Swish, as @WongKinYiu just pointed out. So I won't completely rely on single-run tests.

What's important to see is the consistency, which is simply the standard deviation of the results. I'm doing those benchmarks on more standard models like ResNets, SENet, etc.
Additionally, I am doing intensive mathematical tests to prove it's better than Swish, not just based on empirical benchmark scores.

@WongKinYiu Can you show the result for CSPResNeXt-50 + Swish?

@AlexeyAB

Updated https://github.com/AlexeyAB/darknet/issues/3994#issuecomment-565692356.
I trained it twice; both runs got 6x% top-1 accuracy.

@digantamisra98 In your opinion, what is the reason for NaN appearing during training?
Are you planning to somehow modify the MISH activation to avoid NaN? Or is using a threshold the best solution?

@AlexeyAB I was experiencing NaNs at a very early stage of experimentation. When I adopted the PyTorch Softplus implementation, which has a threshold for the Softplus function, I didn't experience NaN errors. I'm guessing there's some numerical stability issue with Softplus. I'm working with a few colleagues to optimize Mish to address that problem.

@AlexeyAB Additionally, I strongly believe there is something we haven't yet figured out about information propagation with increasing network depth. This is a very strong point, since Mish consistently outperforms Swish as depth increases. I'll plot the residuals of these models and see what the underlying driver affecting performance is.

@WongKinYiu I need some help with ImageNet. Is there some way I can discuss it with you? Thanks!

@digantamisra98

We can discuss it here or by e-mail.
Sorry, my English is so poor, I can't do a conference call.

@WongKinYiu email would be great. What's your ID?

@digantamisra98

[image: e-mail address]

Here you are.

@WongKinYiu dropped an email :)

@AlexeyAB Hello,

When using activation=mish in a [shortcut] layer, a detector with random=1 cannot be trained.

@AlexeyAB Thanks a lot.

@digantamisra98 Hi, did you compare MISH activation with MAXOUT activation: convergence time, final accuracy, ...?

@AlexeyAB A further set of benchmarks, including against Maxout, will be out in March. I'm a bit occupied with my semester exams this month.
