Darknet: [Feature Request] Mish Activation function +0.5 AP wrt. Swish

Created on 26 Sep 2019  ·  59 Comments  ·  Source: AlexeyAB/darknet

Mish:
f(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x))

https://arxiv.org/abs/1908.08681

https://github.com/digantamisra98/Mish
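
For reference, a minimal, numerically naive C sketch of the formula above (illustrative only, not the darknet implementation; as discussed further down in the thread, a production version needs an overflow-safe softplus):

```c
#include <math.h>

/* Naive Mish: f(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x)).
 * Illustrative only: expf(x) overflows for large x, so real
 * implementations threshold the softplus (see the discussion below). */
static float mish_naive(float x)
{
    float sp = log1pf(expf(x));  /* softplus(x) = ln(1 + e^x) */
    return x * tanhf(sp);
}
```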

enhancement

All 59 comments

[screenshots: the Mish formula and its derivative from the paper]

The concept of non-linearity in a Neural Network is introduced by an activation function, which serves an integral role in the training and performance evaluation of the network. Over the years of theoretical research, many activation functions have been proposed; however, only a few are widely used across most applications, including ReLU (Rectified Linear Unit), TanH (Tan Hyperbolic), Sigmoid, Leaky ReLU and Swish. In this work, a novel neural activation function called Mish is proposed. The experiments show that Mish tends to work better than both ReLU and Swish, along with other standard activation functions, in many deep networks across challenging datasets. For instance, in Squeeze Excite Net-18 for CIFAR-100 classification, the network with Mish had an increase in Top-1 test accuracy of 0.494% and 1.671% as compared to the same network with Swish and ReLU respectively. The similarity to Swish, along with the boost in performance and its simplicity of implementation, makes it easier for researchers and developers to use Mish in their Neural Network models.

@AlexeyAB Was wondering if you're considering adding Mish? In that regard, based on the above screenshot, there is a mistake in the derivative formula which I have updated in my paper. The link to the updated paper and additional results are in my repository here - https://github.com/digantamisra98/Mish
Thanks!

@digantamisra98

I added MISH-activation.
use activation=mish in [convolutional] layers

Please, check that implementation is correct: https://github.com/AlexeyAB/darknet/commit/bf8ea4183dc265ac17f7c9d939dc815269f0a213


Thanks! So the error was in delta?


Just checked, the implementation is correct. Thanks. Yes, the error was a typo in the delta term.

now training.

I usually get NaN. Do I need to adjust the learning rate schedule?

burn_in=2000
learning_rate=0.1
policy=poly
power=4
max_batches=1600000
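
(For context, a rough C sketch of how these fields are commonly interpreted: the poly policy decays the rate as learning_rate * (1 - iter/max_batches)^power, while burn_in ramps the rate up over the first iterations. The exact burn_in ramp used here is an assumption, not verified against darknet's source.)

```c
#include <math.h>

/* Sketch of the schedule implied by the config above (my reading of the
 * poly policy; the burn_in ramp exponent is an assumption, not checked
 * against darknet's source). */
static float poly_lr(int iter, float base_lr, int burn_in,
                     int max_batches, float power)
{
    if (iter < burn_in)  /* warm-up: ramp the rate up from ~0 */
        return base_lr * powf((float)iter / burn_in, power);
    /* poly decay: learning_rate * (1 - iter/max_batches)^power */
    return base_lr * powf(1.0f - (float)iter / max_batches, power);
}

/* e.g. poly_lr(800000, 0.1f, 2000, 1600000, 4.0f) ~= 0.1 * 0.5^4 = 0.00625 */
```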

@WongKinYiu What model are you trying to train?

A DenseNet-based model. I'll try a Darknet-based model first.

@WongKinYiu Here's the DenseNet code I used to test Mish - https://github.com/digantamisra98/Mish/blob/master/Notebooks/cifar-10-DenseNet121_Mish.ipynb
I'd usually advise a lower learning rate, probably around 1e-3 (0.01 - 0.001).
Can you share the log or the code to reproduce the NaN?

The Darknet-based model also gets NaN.

[net]
batch=128
subdivisions=1
height=224
width=224
channels=3
momentum=0.9
decay=0.0005
max_crop=320

learning_rate=0.1
policy=poly
power=4
max_batches=1600000

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=mish

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=mish

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=mish

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=mish

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=mish

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=mish

[maxpool]
size=2
stride=2
padding=1

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=mish

[avgpool]

[convolutional]
filters=1000
size=1
stride=1
pad=1
activation=linear

[softmax]
groups=1

@WongKinYiu
Do you train on ILSVRC2012?
How many iterations did you train before NaN occurred?
Do you use GPU=1 CUDNN=1?

Try to use

[maxpool]
size=2
stride=1

instead of

[maxpool]
size=2
stride=2
padding=1

Yes, I train on ILSVRC2012.
I get NaN after about 3k~5k iterations.
I use GPU=1 and CUDNN=1.

@WongKinYiu Did you try to use initial learning_rate=0.01 or 0.001 ?

Both 0.1 and 0.05 get NaN.
I'll try other settings after I finish my breakfast.
Thanks for your advice.

@WongKinYiu I'll go through Mish's implementation again shortly, confirm that everything is alright, and also give it a try myself to validate it. Thanks for raising the issue.

https://github.com/AlexeyAB/darknet/issues/3994#issuecomment-551143150
I get NaN after 300 iterations.

@digantamisra98 Thanks, I'll also have time to check the implementation after 11/17.

@WongKinYiu Did you try to use initial learning_rate=0.01 or 0.001 ?

0.1, 0.05, 0.01, and 0.001 all get NaN.

With 0.001, I get NaN after 10 iterations.

@AlexeyAB @nhaxin204 @WongKinYiu I went through the implementation again and I believe it's correct. Still, I'm going to implement it in practice this week (sorry, I was a bit occupied last week). I will also ask the fast.ai forum folks to check the implementation to make sure I'm not missing anything.

@AlexeyAB This is Tom's response regarding the NaN issue:

That implementation is not at all numerically stable. All the exps quickly lead to overflow and hence NaN. Should be possible to adapt either the Eigen based implementation from tensorflow contrib or my mostly pure C++ implementation (mostly as it’s using the PyTorch dispatch/templating but is otherwise standard C++). The TF one is probably slightly more stable given handling of both underflow and overflow but will require more adaptation to remove the Eigen dependency.

Here is the TensorFlow Addons commit for Mish - https://github.com/tensorflow/addons/commit/093cdfa85d334cbe19a37624c33198f3140109ed

Tom's CUDA implementation - https://github.com/thomasbrandon/mish-cuda
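
To illustrate the overflow issue Tom describes, here is a hedged C sketch of the thresholded-softplus trick (loosely paraphrasing the linked implementations; the threshold value and function name are assumptions, not the darknet code). Mish then becomes x * tanhf(softplus_stable(x, 20.0f)), which stays finite for large |x|.

```c
#include <math.h>

/* softplus(x) = ln(1 + e^x). expf(x) overflows a float near x ~ 88, so the
 * thresholded form returns x directly for large x (where ln(1 + e^x) ~ x)
 * and e^x for very negative x (where ln(1 + e^x) ~ e^x). The threshold of
 * 20 is an assumption, loosely following the linked PyTorch/TF softplus
 * code; this is not the darknet implementation. */
static float softplus_stable(float x, float threshold)
{
    if (x >  threshold) return x;        /* avoid overflow in expf         */
    if (x < -threshold) return expf(x);  /* avoid log1pf losing precision  */
    return log1pf(expf(x));
}
```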

@digantamisra98

So: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L26-L31

if (in < threshold) new_in = log( expf(in) );
else new_in = in;

gradient = in * ((1 - tanh(new_in)*tanh(new_in)) * (1 - exp(-new_in))) + tanh(new_in);

delta = delta * gradient;
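
(Side note on the formula above: if new_in is the full softplus ln(1 + e^in), then 1 - exp(-new_in) = e^in / (1 + e^in), i.e. the sigmoid of the input, which is exactly the derivative of softplus. The gradient line is then the standard symbolic form d/dx[ x·tanh(softplus(x)) ] = tanh(sp) + x·(1 - tanh(sp)^2)·sigmoid(x).)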

@digantamisra98 @WongKinYiu @LukeAI @nhaxin204
I fixed MISH to this implementation: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L26-L31

Thanks, the behavior is normal now.

@WongKinYiu Can you post the log?

[training chart]

@deimsdeutsch

How do you think this issue should be solved?

[screenshot with several questions]

@AlexeyAB

  1. The answer to that question is discussed in Google Brain's Swish paper - https://arxiv.org/pdf/1710.05941v1.pdf
  2. A simple fully connected net:

"To observe how increasing the number of layers in a network while maintaining other parameters constant affects the test accuracy, fully connected networks of varying depths, with each layer having 500 neurons, were trained on MNIST. Residual connections were not used because they enable the training of arbitrarily deep networks. BatchNorm was used to lessen the dependence on initialization, along with a dropout of 25%. The networks were optimized using SGD with a batch size of 128, and for a fair comparison, the same learning rate was maintained for each activation function."

  3. No residual connections were used.

  4. Currently benchmarking on ImageNet.

  5. I'll take a look again and get back to you on that.

@digantamisra98

  4. Currently benchmarking on ImageNet.

What model do you use for benchmarking on ImageNet?
Is it ResNet-101, EfficientNet or MixNet?

@AlexeyAB As of right now, I'm doing ResNet-56, MobileNet v2, NasNet-A, SEResNet-50, and ShuffleNet v1. Currently ShuffleNet is in progress.

@AlexeyAB This is the response that Tom provided in regards to your question of varying thresholds:

The differences between PyTorch and TF reflect slight differences in their implementations of softplus. The single threshold in my CUDA version reflects the PyTorch logic. I don’t think that the differences are big enough that there’s any strong reason to use the same implementation so think you could just as well use the TF logic for Mish in PyTorch. They just both come from borrowing the relevant softplus implementation.
I’m not sure the differences make a real impact and wouldn’t prevent converting models, at least not between TF and PyTorch. As noted this would also potentially apply to any model using softplus.
If there is indeed no threshold in MXNet then that may cause issues. But this also depends on other details. There may be other handling of non-finite values that would mitigate issues. It also depends on the datatypes used. In general this is mostly an issue for 16-bit floats. Though I think I did see some issues with 32-bit floats, that was with the quite unstable calculation involving multiple exponents rather than the symbolically derived gradient calculation.

Oh and I’ve responded to that post.
I’d also note that you pointed to the Autograd implementation which should reduce memory usage but will result in lower performance. The JIT version combines both the lower memory usage and better performance so should generally be preferred.
The one issue is support in older PyTorch versions. It should be fine in PyTorch 1.2 and 1.3 (though I’ve mostly tested in 1.3). I think it should probably also work in 1.1 and maybe even 1.0 in which case it should always be fine as I can’t imagine you’d want to support pre-1.0 anymore.
But the JIT version should probably be preferred unless older support is key. I’d also note that I don’t think my CUDA version will work pre-1.2 so the JIT version should offer equivalent performance and version support. I just need to run a few extra tests on the JIT version and then will likely update the repo to indicate the JIT version should be preferred.

@digantamisra98

The differences between PyTorch and TF reflect slight differences in their implementations of softplus. The single threshold in my CUDA version reflects the PyTorch logic. I don’t think that the differences are big enough that there’s any strong reason to use the same implementation so think you could just as well use the TF logic for Mish in PyTorch.

I think these are clearly 2 different MISH functions, so weights trained in Pytorch can't be used in TF and vice versa.
This is not only due to 1 vs 2 thresholds, but also due to different formulas - actually different activation functions:

  • Pytorch: output = input * tanh(log( exp(input) ))
  • TF: output = input * tanh(log( exp(input) + 1 ));

  • Pytorch: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L17-L20

    if (input < THRESHOLD) output  = input * tanh(log( exp(input) ))
    else output  = input * tanh(input)
    
  • TF: https://github.com/tensorflow/addons/blob/093cdfa85d334cbe19a37624c33198f3140109ed/tensorflow_addons/custom_ops/activations/cc/kernels/mish_op.h#L40-L49
    ```cpp
    if (input > THRESHOLD) output = input * tanh( input ); // too large
    else if (input < -THRESHOLD) output = input * tanh( exp(input) ); // too small
    else output = input * tanh(log( exp(input) + 1 ));
    ```

http://fooplot.com/#W3sidHlwZSI6MCwiZXEiOiJ0YW5oKGxuKGV4cCh4KSsxKSkiLCJjb2xvciI6IiMwMDAwMDAifSx7InR5cGUiOjAsImVxIjoidGFuaChsbihleHAoeCkpKSIsImNvbG9yIjoiIzAwMDAwMCJ9LHsidHlwZSI6MTAwMH1d

[plot: tanh(ln(exp(x)+1)) vs tanh(ln(exp(x)))]


Also about thresholds:

  • The threshold in Pytorch doesn't change the activation function much, so it is fine: output = input * tanh( input ) ~= output = input * tanh(log( exp(input) ))
  • But the second threshold in TF changes the activation function noticeably, at least in some range (maybe if input < -THRESHOLD it doesn't matter): tanh( exp(x) ) != tanh(ln(exp(x) + 1))

http://fooplot.com/#W3sidHlwZSI6MCwiZXEiOiJ0YW5oKGV4cCh4KSkiLCJjb2xvciI6IiMwMDAwMDAifSx7InR5cGUiOjAsImVxIjoidGFuaChsbihleHAoeCkrMSkpIiwiY29sb3IiOiIjMDAwMDAwIn0seyJ0eXBlIjoxMDAwfV0-

[plot: tanh(exp(x)) vs tanh(ln(exp(x)+1))]
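
A quick numeric check (not from the thread; plain C, illustrative only) of how far apart the two branch formulas are at a few negative inputs. At moderately negative x the two curves differ visibly, while far below a typical threshold the difference falls under float precision, which matches the parenthetical above.

```c
#include <math.h>
#include <stdio.h>

/* Compares the small-input branch tanh(exp(x)) with the exact
 * tanh(ln(1 + e^x)) at a few negative inputs (illustrative check only,
 * not darknet code). */
int main(void)
{
    const float xs[] = { -1.0f, -5.0f, -20.0f };
    for (int i = 0; i < 3; i++) {
        float x = xs[i];
        printf("x = %6.1f  tanh(exp(x)) = %.9f  tanh(ln(1+e^x)) = %.9f\n",
               x, tanhf(expf(x)), tanhf(log1pf(expf(x))));
    }
    return 0;
}
```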


I’d also note that you pointed to the Autograd implementation which should reduce memory usage but will result in lower performance. The JIT version combines both the lower memory usage and better performance so should generally be preferred.

Which link are you talking about?

@AlexeyAB Agreed on the different functional implementations. I guess I'll do a PR to change it. Thanks for clarifying, I completely missed that. Regarding the comparison between JIT and Autograd, I've asked him for further clarification.

@AlexeyAB Hello,
I trained my model using the 11/13 repo and tested on the ILSVRC 2012 val set.

| type | top-1 | top-5 |
| :-: | :-: | :-: |
| leaky | 70.9 | 90.2 |
| swish | 71.7 | 90.8 |
| mish | 70.9 | 90.2 |

I see there were some fixes to Mish yesterday.
Do I need to retrain the Mish model using the latest repo?

@AlexeyAB The PyTorch implementation by Tom has log1p instead of log, which computes log(x+1) and not just log(x).
@WongKinYiu Can you point me to the repository with the code you use to train on ImageNet?
What model did you use?

@digantamisra98 Yes, you are right. I implemented MISH with 2 thresholds as in TF.

@WongKinYiu Try to train with the latest code. I fixed MISH today: https://github.com/AlexeyAB/darknet/commit/b9ca5ec781291f01174d6b496a9c3ebc59303c1f
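
For readers following along, a hedged C sketch of the TF-style two-threshold logic described above (the threshold constant is an assumption for illustration; see the linked commit for the actual darknet code):

```c
#include <math.h>

/* TF-style Mish with two thresholds, as described earlier in the thread.
 * MISH_THRESHOLD = 20 is assumed for illustration; refer to the linked
 * darknet commit for the real constant and the CPU/CUDA kernels. */
#define MISH_THRESHOLD 20.0f

static float mish_tf_style(float x)
{
    if (x >  MISH_THRESHOLD) return x * tanhf(x);         /* softplus(x) ~ x   */
    if (x < -MISH_THRESHOLD) return x * tanhf(expf(x));   /* softplus(x) ~ e^x */
    return x * tanhf(log1pf(expf(x)));                    /* exact ln(1 + e^x) */
}
```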

@WongKinYiu are you working on training ImageNet currently using the updated Mish implementation?

@digantamisra98

No, I'm training res2netlite72.
I'll retrain the Mish model and report results.
It will take 1~2 weeks.

@AlexeyAB Mish performs well after being fixed in https://github.com/AlexeyAB/darknet/issues/3994#issuecomment-557495489.

| Model | Activation | Top-1 | Top-5 |
| :---- | :--------: | :---: | :---: |
| PeleeNet | LReLU | 70.7 | 90.0 |
| PeleeNet | Swish | 71.5 (+0.8) | 90.7 (+0.7) |
| PeleeNet | Mish | 71.4 (+0.7) | 90.4 (+0.4) |
| | | | |
| CSPPeleeNet | LReLU | 70.9 | 90.2 |
| CSPPeleeNet | Swish | 71.7 (+0.8) | 90.8 (+0.6) |
| CSPPeleeNet | Mish | 71.2 (+0.3) | 90.3 (+0.1) |
| | | | |
| CSPResNeXt-50 | LReLU | 77.9 | 94.0 |
| CSPResNeXt-50 | Mish | 78.9 (+1.0) | 94.5 (+0.5) |
| CSPResNeXt-50 | Swish | 64.5 (-13.4) | 86.0 (-8.0) |

@WongKinYiu Thanks for sharing the results. These are single runs, right?

@WongKinYiu Thanks! It seems MISH sometimes isn't better than SWISH on ImageNet, especially on large models.

@digantamisra98 Are there other MISH tests for ImageNet?
Or for recurrent networks (RNN, LSTM, convolutional-LSTM, ...) and Transformer/BERT models? As I can see, ImageNet and Transformer are on the roadmap: https://github.com/digantamisra98/Mish#future-work-coming-soon

@digantamisra98
Yes, I cannot afford multiple runs currently.
But in my previous experiments, darknet always gives me similar results if I use the same machine and the same settings for training.

@AlexeyAB
In my experiments, Mish is more stable than Swish.
For ResNeXt-based models, Swish can drop accuracy by more than 10% on ImageNet.

@AlexeyAB Yes, there are a lot of further benchmarks coming in the next updated version of the paper by January. I'm still working on it.
Though I'm interested to see the statistical stability and the CI scores of Swish, because so far in my results Mish is much more stable than Swish, as @WongKinYiu just pointed out. So I won't completely rely on single-run tests.

What's important to see is the consistency, which is simply the standard deviation of the results. I'm doing those benchmarks on more standard models like ResNets, SENet, etc.
Additionally, I am doing intensive mathematical tests to prove it's better than Swish, not just based on empirical benchmark scores.

@WongKinYiu Can you show the result for CSPResNeXt-50 + Swish?

@AlexeyAB

Updated https://github.com/AlexeyAB/darknet/issues/3994#issuecomment-565692356.
I trained it twice; both runs got 6x% top-1 accuracy.

@digantamisra98 In your opinion, what is the reason for NaN appearing during training?
Are you planning to somehow modify the MISH activation to avoid NaN? Or is using a threshold the best solution?

@AlexeyAB I was experiencing NaNs at a very early stage of experimentation. When I adopted the PyTorch Softplus implementation, which has a threshold for the Softplus function, I didn't experience NaN errors. I'm guessing there's some numerical stability issue with Softplus. I'm working with a few colleagues to optimize Mish to address that problem.

@AlexeyAB Additionally, I strongly believe there is something we haven't yet figured out about information propagation with increasing network depth. This is a very strong point, since Mish consistently outperforms Swish as depth increases. I'll plot the residuals of these models and see what the underlying driver affecting performance is.

@WongKinYiu I need some help with ImageNet. Is there some way I can discuss it with you? Thanks!

@digantamisra98

We can discuss it here or by e-mail.
Sorry, my English is so poor, I can't do a conference call.

@WongKinYiu email would be great. What's your ID?

@digantamisra98

[image: e-mail address]

Here you are.

@WongKinYiu dropped an email :)

@AlexeyAB Hello,

When using activation=mish in a [shortcut] layer, a detector with random=1 cannot be trained.

@AlexeyAB Thanks a lot.

@digantamisra98 Hi, did you compare MISH activation with MAXOUT activation: convergence time, final accuracy, ...?

@AlexeyAB A further set of benchmarks, including against Maxout, will be out in March. I'm a bit occupied with my semester exams this month.
