Yolov3: Activation Function Experiments

Created on 10 Aug 2019 · 16 Comments · Source: ultralytics/yolov3

This issue documents studies on the YOLOv3 activation function. The PyTorch 1.2 release updated some of the BatchNorm2d weight initializations (from 0-1 uniform random to all 1s), so I thought this would be a good time to benchmark the model and test the repo default (item 1 below) against 3 possible improvements (items 2-4):

  1. Default: nn.LeakyReLU(0.1, inplace=True)
  2. Swish: class Swish(nn.Module)
  3. PReLU (per channel): nn.PReLU(num_parameters=filters)
  4. PReLU (shared): nn.PReLU(num_parameters=1)
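
As a rough sketch of how these four candidates might be instantiated in PyTorch: the `Swish` module below is a minimal out-of-place definition, and `filters` (set to 64 here) is a placeholder for a layer's channel count. Both are assumptions for illustration, not the repo's exact code.

```python
import torch
import torch.nn as nn


class Swish(nn.Module):
    """Minimal out-of-place Swish: x * sigmoid(x) (illustrative definition)."""

    def forward(self, x):
        return x * torch.sigmoid(x)


filters = 64  # hypothetical channel count of the preceding conv layer

activations = {
    "leaky": nn.LeakyReLU(0.1, inplace=True),
    "swish": Swish(),
    "prelu_per_channel": nn.PReLU(num_parameters=filters),
    "prelu_shared": nn.PReLU(num_parameters=1),
}

x = torch.randn(2, filters, 8, 8)  # (batch, channels, height, width)
for name, act in activations.items():
    y = act(x.clone())  # clone so the in-place LeakyReLU leaves x untouched
    n_params = sum(p.numel() for p in act.parameters())
    print(f"{name}: output {tuple(y.shape)}, learnable parameters: {n_params}")
```

Only the two PReLU variants add learnable parameters (one slope per channel, or a single shared slope); LeakyReLU and Swish are parameter-free.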

I benchmarked commit 5ff6e6b3a56d8c271e4ab6947b4af1b31f813942 with each of the above activations on the small coco_64img.data tutorial dataset:

python3 train.py --img-size 416 --data data/coco_64img.data --batch-size 16 --accumulate 4 --nosave

[Figure: results_activations_416]

PReLU looks promising, but we can't draw any conclusions from this small dataset. In my next post I will plot the results on the full COCO dataset, trained for 10% of the full number of epochs, which should be a much more useful comparison.

python3 train.py --img-size 320 --data data/coco.data --batch-size 32 --accumulate 2 --epochs 27 --nosave

Labels: Stale, enhancement, question

Most helpful comment

Experiments on 5ff6e6b3a56d8c271e4ab6947b4af1b31f813942 below. Results are test.py mAP at conf_thres = 0.1 > conf_thres = 0.001 > --save-json at the end of 27 coco.data epochs. No multi-scale.

Swish produces the best results, with the highest mAP and lowest validation losses across almost all epochs (not just the final epoch), but the difference is small and the increase in GPU memory is significant. LeakyReLU is 'inplace', which reduces GPU memory, whereas Swish (being a custom module) requires about 50% more GPU memory and PReLU about 30% more.

python3 train.py --img-size 320 --epochs 27 --batch-size 64 --accumulate 1  --nosave
  1. nn.LeakyReLU(0.1, inplace=True) (old default): 44.6
  2. class Swish(nn.Module): 44.9
  3. nn.PReLU(num_parameters=filters, init=0.10): 43.4
  4. scale_xy=1.2: 44.3
  5. scale_xy=1.1: 44.0
  6. scale_xy=1.5: 44.4
  7. ((1.0 - giou) ** 2).mean()  # giou^2 loss: 44.2
  8. yolov3-spp-pan.cfg: 44.4
  9. Initialize cls/obj biases to -5 (new default): 44.9
  10. Adam 9E-5: 45.2
  11. Adam uFBCE 8192: 45.4

[Figure: results]

All 16 comments

@glenn-jocher Following https://github.com/lukemelas/EfficientNet-PyTorch/pull/88, GPU memory consumption for Swish decreases if the Swish implementation inherits from torch.autograd.Function (code taken from the same pull request):

import torch
import torch.nn as nn


class SwishImplementation(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i * torch.sigmoid(i)
        # Save only the input; sigmoid(i) is recomputed in backward.
        ctx.save_for_backward(i)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        # Retrieve the input saved in forward.
        i = ctx.saved_tensors[0]
        sigmoid_i = torch.sigmoid(i)
        # d/di [i * sigmoid(i)] = sigmoid(i) * (1 + i * (1 - sigmoid(i)))
        return grad_output * (sigmoid_i * (1 + i * (1 - sigmoid_i)))


class MemoryEfficientSwish(nn.Module):
    def forward(self, x):
        return SwishImplementation.apply(x)

However, this will increase the training time. If you are still using Swish for some of your experiments and getting out of memory errors, it could be useful.
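
One way to sanity-check a hand-written backward like this is `torch.autograd.gradcheck`, which compares it against finite differences. The snippet below is a minimal sketch under that assumption; it repeats the class so that it runs on its own, and the input size and tolerances are arbitrary choices.

```python
import torch


class SwishImplementation(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        ctx.save_for_backward(i)  # keep only the input for backward
        return i * torch.sigmoid(i)

    @staticmethod
    def backward(ctx, grad_output):
        (i,) = ctx.saved_tensors
        sigmoid_i = torch.sigmoid(i)
        return grad_output * (sigmoid_i * (1 + i * (1 - sigmoid_i)))


# gradcheck compares the analytic backward against finite differences;
# double precision keeps the numerical estimate reliable.
x = torch.randn(20, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(SwishImplementation.apply, (x,), eps=1e-6, atol=1e-4)
print("backward matches numerical gradient:", ok)
```

`gradcheck` raises an error on a mismatch, so reaching the final print indicates the analytic gradient agrees with the numerical one on this input.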

@okanlv nice find!! It looks like Swish is indeed improving performance in this repo, so this new class may be very useful. I will test it on 1 epoch of COCO using the command below on a V100 GCP instance.

python3 train.py --data data/coco.data --cfg cfg/yolov3s.cfg --weights '' --epochs 1

Comparing it to the default LeakyReLU(0.1) and our current Swish() implementation:

class Swish(nn.Module):
    def forward(self, x):
        return x.mul_(torch.sigmoid(x))

Activation | Loss | mAP@0.5 | Time | Mem
-- | -- | -- | -- | --
LeakyReLU(0.1, inplace=True) | 15.5 | 0.0309 | 17:20 | 9.6G
Swish() | 15.6 | 0.0483 | 19:28 | 13.0G
MemoryEfficientSwish() | 15.7 | 0.0445 | 19:54 | 10.6G

That's strange, the two Swish versions are returning different losses and mAPs, with the memory efficient version worse in both. I had expected them to produce exactly the same results.

Hmm, I didn't expect that. If I find anything else, I will keep you updated.

To double-check, I trained to 27 epochs and saw the same pattern. MemoryEfficientSwish() produces worse results: 49.3 mAP vs 49.7 mAP for the default Swish() implementation. I don't know exactly why. I use Apex for mixed precision training BTW, not sure if that has any effect.

Edit: Both forward and backward for both functions produce the same results, as expected. I have updated the code to plot gradients for both functions.
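
For reference, the hand-written backward in SwishImplementation follows from the product rule together with the sigmoid identity σ'(x) = σ(x)(1 − σ(x)). Here σ denotes the sigmoid and x stands for the tensor called i in the code:

```latex
\frac{d}{dx}\bigl[x\,\sigma(x)\bigr]
  = \sigma(x) + x\,\sigma'(x)
  = \sigma(x) + x\,\sigma(x)\bigl(1 - \sigma(x)\bigr)
  = \sigma(x)\Bigl(1 + x\bigl(1 - \sigma(x)\bigr)\Bigr)
```

This is the expression that backward multiplies by grad_output, so analytically the custom function and the autograd-derived gradient of x * sigmoid(x) should agree.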

@glenn-jocher, I might have found the problem. The in-place operation x.mul_() in the Swish class also modifies its input. Running the following code shows the difference. Changing x.mul_() to x.mul() fixes this problem. I have not traced the loss backward through the computation graph, so I cannot explain the higher mAP at the moment.

import torch
import torch.nn as nn
import matplotlib.pyplot as plt


class Swish(nn.Module):
    def __init__(self):
        super(Swish, self).__init__()

    def forward(self, x):
        return x.mul_(torch.sigmoid(x))


class SwishImplementation(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i * torch.sigmoid(i)
        ctx.save_for_backward(i)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        i = ctx.saved_tensors[0]
        sigmoid_i = torch.sigmoid(i)
        return grad_output * (sigmoid_i * (1 + i * (1 - sigmoid_i)))


class MemoryEfficientSwish(nn.Module):
    def __init__(self):
        super(MemoryEfficientSwish, self).__init__()

    def forward(self, x):
        return SwishImplementation.apply(x)


f1 = Swish()
f2 = MemoryEfficientSwish()

# 1st method
# raises RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.
# x = torch.linspace(-5, 5, 1000)
# x1 = x.clone().detach()
# x1.requires_grad = True
# y1 = f1(x1)

# 2nd method
x = torch.linspace(-5, 5, 1000, requires_grad=True)
x_copy = x.clone().detach()
x_copy.requires_grad = True
x1 = x.clone()
x2 = x_copy.clone()

y1 = f1(x1)
y2 = f2(x2)

print('\nDid Swish change its input?')
print(not torch.allclose(x, x1))
print('\nDid MemoryEfficientSwish change its input?')
print(not torch.allclose(x, x2))

plt.xlim(-6, 6)
plt.ylim(-1, 6)
plt.plot(x.detach().numpy(), y1.detach().numpy())
plt.plot(x.detach().numpy(), y2.detach().numpy())
plt.title('Swish functions')
plt.legend(['Swish', 'MemoryEfficientSwish'], loc='upper left')
plt.show()

y1.backward(torch.ones_like(x))
y2.backward(torch.ones_like(x))

assert torch.allclose(y1, y2)
assert torch.allclose(x.grad, x_copy.grad)

def getBack(var_grad_fn):
    print(var_grad_fn)
    for n in var_grad_fn.next_functions:
        if n[0]:
            try:
                tensor = getattr(n[0], 'variable')
                print('\t', n[0])
                # print('Tensor with grad found:', tensor)
                # print(' - gradient:', tensor.grad)
                print()
            except AttributeError:
                getBack(n[0])


print('\nTracing backward functions for Swish')
getBack(y1.grad_fn)
print('\nTracing backward functions for MemoryEfficientSwish')
getBack(y2.grad_fn)

plt.xlim(-6, 6)
plt.ylim(-1, 2)
plt.plot(x.detach().numpy(), x.grad.detach().numpy())
plt.plot(x.detach().numpy(), x_copy.grad.detach().numpy())
plt.title('Swish gradient functions')
plt.legend(['Swish', 'MemoryEfficientSwish'], loc='upper left')
plt.show()
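
The fix mentioned above (replacing mul_ with mul) can be isolated in a smaller sketch. The class names `SwishInplace` and `SwishOutOfPlace` are hypothetical labels for the two variants, and gradients are left out so that the in-place version runs without the leaf-variable error.

```python
import torch
import torch.nn as nn


class SwishInplace(nn.Module):
    def forward(self, x):
        return x.mul_(torch.sigmoid(x))  # overwrites x with the result


class SwishOutOfPlace(nn.Module):
    def forward(self, x):
        return x.mul(torch.sigmoid(x))  # allocates a new tensor, x is untouched


x = torch.linspace(-5, 5, 1000)

a = x.clone()
y_inplace = SwishInplace()(a)
b = x.clone()
y_out = SwishOutOfPlace()(b)

print("in-place changed its input:    ", not torch.equal(a, x))
print("out-of-place changed its input:", not torch.equal(b, x))
print("outputs agree:", torch.allclose(y_inplace, y_out))
```

Both variants return the same values; the difference is only that the in-place version returns its (now overwritten) input tensor, which matters if any other part of the graph still needs the original values.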

How much better are the results?

@FranciscoReveriano I am referring to @glenn-jocher's results in this thread. I have not trained the model myself.

@okanlv ah, so you are saying that the in-place operator in Swish() is interfering with the gradient computation? That's odd, because I previously trained Swish() with and without the in-place operator .mul_ and got identical results (the in-place operator reduced memory slightly, so I kept it).

So do you think the better results with Swish() might be a random occurrence?

I think they are a random occurrence.

@glenn-jocher @FranciscoReveriano I am not sure, actually, because using x.clone() before applying both Swish functions produced the same output and the same gradient for x. Also, you did not get any errors during training with this in-place operator, whereas it raised a RuntimeError in my case. Since we are creating yolov3 with nn.Sequential(), the internal backward calculation might be different. That being said, we can create yolov3 models with the different Swish implementations and compare their parameters (values and grads) after a backward pass. I will test this in a few days.
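
The comparison proposed here could look roughly like the sketch below. It uses a toy conv + BatchNorm + activation stack as a stand-in (the `make_model` helper, layer sizes, and the dummy loss are all hypothetical, not the YOLOv3 architecture or its loss), and an out-of-place `Swish` so that it runs without in-place errors.

```python
import torch
import torch.nn as nn


class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)  # out-of-place, gradient derived by autograd


class SwishImplementation(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        ctx.save_for_backward(i)
        return i * torch.sigmoid(i)

    @staticmethod
    def backward(ctx, grad_output):
        (i,) = ctx.saved_tensors
        sigmoid_i = torch.sigmoid(i)
        return grad_output * (sigmoid_i * (1 + i * (1 - sigmoid_i)))


class MemoryEfficientSwish(nn.Module):
    def forward(self, x):
        return SwishImplementation.apply(x)


def make_model(act):
    # Toy stand-in for a conv + BN + activation block (hypothetical sizes).
    return nn.Sequential(
        nn.Conv2d(3, 8, 3, padding=1, bias=False),
        nn.BatchNorm2d(8),
        act(),
        nn.Conv2d(8, 4, 3, padding=1),
    )


torch.manual_seed(0)
model_a = make_model(Swish)
model_b = make_model(MemoryEfficientSwish)
model_b.load_state_dict(model_a.state_dict())  # identical starting weights

x = torch.randn(2, 3, 16, 16)
model_a(x).pow(2).mean().backward()  # dummy scalar loss
model_b(x).pow(2).mean().backward()

# Largest absolute difference between corresponding parameter gradients.
max_diff = max(
    (pa.grad - pb.grad).abs().max().item()
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(f"largest gradient difference: {max_diff:.3e}")
```

If the two implementations are equivalent, the difference should be at the level of floating-point rounding; the same loop could be pointed at full models built from the repo's cfg files.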

@okanlv hmm interesting, ok, keep us updated! I think Swish might be something that we'll want to integrate more in the future, as it does seem to increase mAP a bit in most circumstances.

Yeah I am looking more into understanding Swish. Might be very beneficial.

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

@glenn-jocher Hi, I wonder: when you change the loss (say from SmoothL1 to GIoU) or the activation (say from ReLU to Swish), do you train the entire model from scratch, or do you load part of the pretrained weights from the version before the change as a starting point?

@sudo-rm-covid19 from scratch always.
