Apex: negligible performance gains and non-convergence on DCGAN using apex (what to change?)

Created on 13 Nov 2018 · 23 Comments · Source: NVIDIA/apex

I bought an RTX 2070 with the goal of training my DCGAN in fp16 for bigger and faster models. After carefully adjusting my models and trying vanilla model.half() without apex, as well as AMP and FP16_Optimizer, I'm not too convinced by the results. Maybe I did something wrong?

The architecture:

       #Loss Function: 
        criterion = nn.BCELoss()


       # Generator
       "512px output": (
        nn.Sequential(
        # Input Z (100x1x1)
        nn.ConvTranspose2d(nz, ngf * 64, 4, 1, 0, bias=False),
        nn.BatchNorm2d(ngf * 64),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 4x4x(ngf*64)

        nn.ConvTranspose2d(ngf * 64, ngf * 32, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 32),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 8x8x(ngf*32)

        nn.ConvTranspose2d(ngf * 32, ngf * 16, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 16),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 16x16x(ngf*16)

        nn.ConvTranspose2d(ngf * 16, ngf * 8, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 8),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 32x32x(ngf*8)

        nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 4),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 64x64x(ngf*4)

        nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 2),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 128x128x(ngf * 2)

        nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 256x256x(ngf)

        nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
        nn.Tanh()
        # 512x512x3 Output
    ),

    # Discriminator
    nn.Sequential(
        # Input 512x512x3
        nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf),
        nn.LeakyReLU(0.2, inplace=True),
        # 256x256xndf

        nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 2),
        nn.LeakyReLU(0.2, inplace=True),
        # 128x128x(ndf * 2)

        nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 4),
        nn.LeakyReLU(0.2, inplace=True),
        # 64x64x(ndf * 4)

        nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 8),
        nn.LeakyReLU(0.2, inplace=True),
        # 32x32x(ndf * 8)

        nn.Conv2d(ndf * 8, ndf * 16, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 16),
        nn.LeakyReLU(0.2, inplace=True),
        # 16x16x(ndf * 16)

        nn.Conv2d(ndf * 16, ndf * 32, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 32),
        nn.LeakyReLU(0.2, inplace=True),
        # 8x8x(ndf * 32)

        nn.Conv2d(ndf * 32, ndf * 64, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 64),
        nn.LeakyReLU(0.2, inplace=True),
        # 4x4x(ndf * 64)

        nn.Conv2d(ndf * 64, 1, 4, 1, 0, bias=False),
        nn.Sigmoid()
        # 1x1x1
    )),
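The spatial sizes noted in the comments can be checked with the standard output-size formulas for convolutions and transposed convolutions. The following is a minimal sketch in plain Python, assuming dilation 1 and no output_padding; the helper names conv_out and conv_transpose_out are made up for illustration, and the (kernel, stride, padding) triples are copied from the layers above.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a Conv2d layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding):
    """Spatial output size of a ConvTranspose2d layer (dilation 1, output_padding 0)."""
    return (size - 1) * stride - 2 * padding + kernel


# Generator: (kernel, stride, padding) per ConvTranspose2d, starting from a 1x1 latent.
gen_layers = [(4, 1, 0)] + [(4, 2, 1)] * 7
size = 1
gen_sizes = []
for k, s, p in gen_layers:
    size = conv_transpose_out(size, k, s, p)
    gen_sizes.append(size)

# Discriminator: (kernel, stride, padding) per Conv2d, starting from a 512x512 image.
disc_layers = [(4, 2, 1)] * 7 + [(4, 1, 0)]
size = 512
disc_sizes = []
for k, s, p in disc_layers:
    size = conv_out(size, k, s, p)
    disc_sizes.append(size)

print(gen_sizes)   # [4, 8, 16, 32, 64, 128, 256, 512]
print(disc_sizes)  # [256, 128, 64, 32, 16, 8, 4, 1]
```

Each stride-2 layer halves (or doubles) the resolution, so the last discriminator convolution (kernel 4, stride 1, no padding) sees a 4x4 feature map and reduces it to 1x1.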

I changed the following parts of my code to accommodate FP16:

network_to_half(netG)
network_to_half(netD)

optimizerD = FP16_Optimizer(optimizerD, dynamic_loss_scale=True, verbose=False)
optimizerG = FP16_Optimizer(optimizerG, dynamic_loss_scale=True, verbose=False)



In the training loop:

for i, data in enumerate(dataloader, 0):
    # making the input fp16
    input_batch = data[0].cuda().half()
    ....
    # collect gradients for real batch in discriminator
    optimizerD.backward(errD_real, update_master_grads=False)
    ....
    # collect gradients for fake batch in discriminator
    optimizerD.backward(errD_fake, update_master_grads=False)
    ....
    # backprop discriminator
    optimizerD.update_master_grads()
    optimizerD.step()
    ....
    # collect gradients for generated batch in generator and backprop generator
    optimizerG.backward(errG)
    optimizerG.step()
    ....

Results:

using stock model.half() without apex: the model is 2x slower and not converging after 1 epoch
using AMP: the model is 1.5x slower and not converging after 1 epoch
using FP16_Optimizer: the model is 1.2x slower and converging if dynamic_loss_scale is used

Basically, the model only somewhat behaves when I use dynamic_loss_scale in FP16_Optimizer, and even then it produces garbage outputs, although the architecture is unchanged from the FP32 model that worked.

AMP should use dynamic_loss_scale automatically but it always collapses after 1 iteration and is very slow.
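For readers unfamiliar with why loss scaling matters here: small gradients can round to zero in fp16, and scaling the loss shifts them into the representable range before they are unscaled again in higher precision. The sketch below illustrates this with the standard library only, using struct's half-precision "e" format to emulate fp16 rounding. The gradient value, the loss value and the scale of 1024 are hypothetical, and this is not how apex implements it.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def overflows_fp16(x):
    """True if x is too large to be stored as a finite fp16 number."""
    try:
        struct.pack("e", x)
        return False
    except OverflowError:
        return True


grad = 1e-8          # hypothetical small gradient
loss_scale = 1024.0  # hypothetical static loss scale

unscaled = to_fp16(grad)             # below the smallest fp16 subnormal: becomes 0.0
scaled = to_fp16(grad * loss_scale)  # survives as an fp16 subnormal
recovered = scaled / loss_scale      # unscaled again in higher precision

print(unscaled)   # 0.0
print(recovered)  # close to 1e-8

# The flip side: a scale that is too large pushes values past the fp16 maximum (65504).
loss = 100.0      # hypothetical loss value
print(overflows_fp16(loss * loss_scale))  # True
```

Dynamic loss scaling exists to balance these two failure modes: it lowers the scale when an overflow is detected and raises it again after a run of clean steps.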

I expected the model to be faster and at least converge like the FP32 model did. The only benefit is that the model occupies around 51% less space on the GPU, so bigger models can be trained.

Questions:

What do I need to change in my architecture and training setup to make FP16 work with this DCGAN?

System information

PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: 9.2

OS: Microsoft Windows 10 Home
GCC version: Could not collect
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.2.148
GPU models and configuration: GPU 0: GeForce RTX 2070
Nvidia driver version: 416.81
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] Could not collect
[conda] cuda92 1.0 0 pytorch
[conda] pytorch 0.4.1 py37_cuda92_cudnn7he774522_1 [cuda92] pytorch
[conda] torchvision 0.2.1

toemm

Most helpful comment

@bearpelican, if I read your notebook right, for the U-Net you get a slowdown in fp16 for the upsampling architecture and a ~50% speed-up for the conv transpose architecture? Upsampling in fp16 is slow (slower than in fp32) because the backward pass of the upsampling layer is implemented with atomicAdd, and since there is no native support for fp16 atomicAdd, the performance is pretty bad https://github.com/pytorch/pytorch/blob/master/aten/src/THCUNN/SpatialUpSamplingNearest.cu#L94. You'd be better off converting upsampling layers to fp32.
For the conv transpose architecture, the speed-up is smaller than for ResNet because there are many strided transposed convolution layers, and those provide a worse speed-up. If you are not using cuDNN 7.4.1 or 7.4.2, try those versions; they improve strided convolution performance.

ngimel on 28 Dec 2018

All 23 comments

From a performance perspective, I notice you're using a lot of strided convolutions. Cudnn has been known to exhibit poor performance on strided/dilated convolutions. You used a conda install, correct? If so, this means you have an older version of cudnn statically baked in. Strided convolutions have improved in the most recent cudnn (7.4) although I can't guarantee they will cover your use case.

If you were on Linux, at this point I'd recommend that you try one of our recently released Docker containers, which have a preinstalled version of Pytorch compiled against cudnn 7.4, and get you running immediately. Unfortunately, these Docker containers can't be run on Windows :( Right now, to use cudnn 7.4, your best bet is to download and install it on bare metal, then clone Pytorch and build from source on bare metal. I know this is a hassle but on Windows I can't think of an easier way. Do you also have a Linux partition on your machine?

From a functionality perspective, what you're seeing is a bit more worrisome. Was there any combination of half or mixed precision options that ended up actually converging? You said FP16_Optimizer with dynamic loss scaling was "1.2x slower and converging", but then you also said it "produces garbage outputs", so can you clarify that?

mcarilli on 15 Nov 2018

Thanks for the reply.
Yes, I was using Windows for my first test. I have since moved to Ubuntu 16.04 (Python 3.5) and also went with the newest cudnn 7.4.1 like you recommended.

I only had the time to do a couple of runs with FP16_Optimizer but it looks like it is miles better than before.

Speed is the only remaining issue: with the same model architecture, FP16 is less than 5% faster (time per epoch) than my old FP32 models. That's still better than on Windows, where it was 20% slower.

Moving to Ubuntu and upgrading CUDNN helped with the model output, though, and I'm seeing good convergence so far (like the old model at the same number of epochs); I hope it stays that way. I had to increase my learning_rate by a factor of 10, though. I don't really know why my old learning_rate doesn't converge; maybe it does later in training.
I disabled dynamic loss scaling as well, because I track the gradient norms manually to check whether the model behaves, and since the loss is scaled so much, this gave me absurdly high gradient norms.
So using a static loss scale with a 10x bigger learning_rate seems to work, albeit with no increase in speed BUT of course more VRAM available (around 30-40% more), which, if the model gives good output, is still huge for such a small overhead (which is a compliment to your team for this great package btw :D).
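On the gradient-norm point: gradients produced by backward() on a scaled loss are larger than the true gradients by exactly the loss scale, so dividing by the current scale recovers the true norm even when the scale changes over time. The following is a toy sketch of the usual dynamic loss scaling policy in plain Python. The class DynamicLossScaler and its parameters are hypothetical and only illustrate the idea; this is not apex's implementation.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler (hypothetical; not apex's actual implementation)."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, window=1000):
        self.scale = init_scale
        self.factor = factor
        self.window = window  # clean steps required before growing the scale
        self.good_steps = 0

    def unscale(self, scaled_grads):
        """Return true-magnitude gradients, or None if this step must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale /= self.factor  # overflow: back off and skip the step
            self.good_steps = 0
            return None
        grads = [g / self.scale for g in scaled_grads]
        self.good_steps += 1
        if self.good_steps % self.window == 0:
            self.scale *= self.factor  # stable for a while: try a larger scale
        return grads


def grad_norm(grads):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))


scaler = DynamicLossScaler(init_scale=1024.0, window=2)
true_grads = [3e-4, 4e-4]                        # hypothetical gradients, norm 5e-4
scaled = [g * scaler.scale for g in true_grads]  # what a scaled loss would produce

unscaled = scaler.unscale(scaled)
skipped = scaler.unscale([float("inf"), 1.0])    # simulated overflow

print(grad_norm(scaled))    # inflated by the loss scale
print(grad_norm(unscaled))  # back to the true magnitude
print(skipped, scaler.scale)
```

Tracking the norm of the unscaled gradients keeps the monitoring meaningful regardless of whether the scale is static or dynamic.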

So what could I do to increase the speed? Would you say that Pytorch 1.0 + CUDA 10.0 + CUDNN 7.4.1 is the recommended combination that would theoretically leverage FP16 the most right now?
On a side note: nvidia-smi reports CUDA 10.0, but nvcc -V says CUDA 9.2 is installed. Does CUDA 10.0 come with the NVIDIA driver on Ubuntu?

toemm on 15 Nov 2018

@toemm Can you check if the speed difference between fp16 and fp32 is bigger if you just run the discriminator and feed some fixed tensor as the discriminator input? The reason I think this might be a useful test is that I'm suspecting that conv transpose is the culprit of small speed-ups.

If the above results in big perf deltas between fp16 and fp32, the next thing to test would be the data pipeline. Could you run the whole GAN with fp16 and fp32, but with some fixed tensor filled with random values? If you're still seeing major fp16 speed-ups, then the issue might be the data pipeline.

mkolod on 16 Nov 2018

@mkolod Thanks for the suggestions, here are my results:

1. Removing the Generator architecture to isolate if nn.ConvTranspose2d is slowing down training
FP32: 1.19min per epoch
FP16: 1.14min per epoch

2. In addition to removing the Generator, feeding fixed random images instead of iterating over the dataloader (for i, data in enumerate(dataloader, 0):), to isolate whether the data pipeline is slowing down training
FP32: 0.85min per epoch
FP16: 0.85min per epoch

3. Adding the Generator back into the architecture BUT keeping the discriminator input fixed (again to test if the data pipeline is the issue)
FP32: 1.57min per epoch
FP16: 1.50min per epoch

4. Same as 3. but under Windows
FP32: 1.73min per epoch
FP16: 1.66min per epoch

So it seems that neither ConvTranspose2d nor the data pipeline is responsible. Do you think that CUDA 10.0 and PyTorch 1.0 would help?

toemm on 16 Nov 2018

I think it’s a good idea to test with cuda 10, cudnn 7.4, and pytorch 1.0 to make sure we aren’t missing anything.

The backward pass for strided 2d convolutions is dilated 2d transposed convolutions, so unfortunately in this case you can’t just “get rid” of transposed convolutions by commenting out the generator. 2d transposed convolutions will be invoked by the discriminator as well, they’re just hidden in backward().
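The claim that transposed convolutions hide in the backward pass of strided convolutions can be checked on a toy case. The sketch below uses plain Python lists and a 1-D convolution with stride 2 and no padding (a simplification of the 2-D layers in this thread; all values are made up). It compares the gradient with respect to the input, computed as a transposed convolution of the upstream gradient, against finite differences.

```python
def conv1d(x, w, stride):
    """Strided 1-D cross-correlation, no padding: y[i] = sum_k w[k] * x[i*stride + k]."""
    n_out = (len(x) - len(w)) // stride + 1
    return [sum(w[k] * x[i * stride + k] for k in range(len(w))) for i in range(n_out)]


def conv_transpose1d(g, w, stride):
    """Transposed 1-D convolution: scatter g[i] * w into the output at offset i*stride."""
    out = [0.0] * ((len(g) - 1) * stride + len(w))
    for i, gi in enumerate(g):
        for k, wk in enumerate(w):
            out[i * stride + k] += gi * wk
    return out


x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, 3.0]
w = [1.0, -2.0, 0.5, 0.25]
stride = 2
g = [1.0, -1.0, 2.0]  # upstream gradient dL/dy, one entry per output element


def loss(xv):
    """Scalar loss whose gradient with respect to y is exactly g."""
    return sum(gi * yi for gi, yi in zip(g, conv1d(xv, w, stride)))


# Analytic input gradient: a transposed convolution of g with the same weights.
grad_x = conv_transpose1d(g, w, stride)

# Numerical input gradient via central differences.
eps = 1e-6
numeric = []
for j in range(len(x)):
    xp, xm = list(x), list(x)
    xp[j] += eps
    xm[j] -= eps
    numeric.append((loss(xp) - loss(xm)) / (2 * eps))

max_err = max(abs(a - b) for a, b in zip(grad_x, numeric))
print(max_err)
```

So even a discriminator built only from strided Conv2d layers exercises transposed-convolution kernels during backward(), which is why commenting out the generator does not remove them from the profile.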

What are the “baseline” numbers for fp16 and fp32 when both generator and discriminator are present, operating on real input data?

mcarilli on 16 Nov 2018

> What are the “baseline” numbers for fp16 and fp32 when both generator and discriminator are present, operating on real input data?

I tested it again because I discarded the model from the above test. New results with another model:

Baseline
FP32: 1.35min per epoch
FP16: 1.33min per epoch

Generator commented out, not used
FP32: 0.77min per epoch
FP16: 0.78min per epoch

Generator back in but input random noise to isolate data pipeline
FP32: 1.26min per epoch
FP16: 1.23min per epoch

Behaves as expected but still no increase in speed between FP16 and FP32.

> I think it’s a good idea to test with cuda 10, cudnn 7.4, and pytorch 1.0 to make sure we aren’t missing anything.

I tried to get this to work, but I can't find a pytorch 1.0 wheel for Ubuntu 16.04 with CUDNN 7.4 support, and I'm not sure how to build from source to get it.
I tried conda install pytorch-nightly -c pytorch but then print(torch.backends.cudnn.version()) still outputs CUDNN 7.1.
How do I get this to work?

toemm on 17 Nov 2018

You've got a couple options (everything below is for Ubuntu).

1. Use a Docker container like I said earlier (recommended). You'll need nvidia-docker.
   Option 1a: Pull and run our latest public Pytorch container from NGC, which includes Cuda 10, Pytorch 1.0 and cudnn 7.4 preinstalled. Instructions can be found here: https://ngc.nvidia.com/catalog/containers/nvidia%2Fpytorch. The UX for this website used to be a dumpster fire but it looks like they've greatly improved it since the last time I visited. If you go this route, please let me know if you have any problems.
   Option 1b: Build your own Pytorch container based on nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04 from Dockerhub:
       git clone https://github.com/pytorch/pytorch.git
       vim docker/pytorch/Dockerfile
       <change FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 to FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04>
       docker build -f docker/pytorch/Dockerfile -t my_awesome_cudnn74_container .
2. Rebuild on bare metal. This requires that you have the Cuda 10 Toolkit installed on bare metal (the Dockerfile methods 1a. and 1b. do not have this requirement, because Cuda 10 is preinstalled in the containers). To make this work, you need to install cudnn** on bare metal, then
       git clone https://github.com/pytorch/pytorch.git
       cd pytorch
       python setup.py install

1b. and 2. can take up to an hour, because they both involve building Pytorch from source, but being able to rebuild from source is a useful tool in general. For example, maybe at some point you want to mess with the C++ backend functions (e.g. put in some debugging print statements and recompile).

**the install instructions show working with a cuda-9.0 package, but that's just as an example. You can select Download cuDNN v7.4.1 (Nov 8, 2018), for CUDA 10.0 when you're on the download page and make appropriate substitutions while installing.

mcarilli on 17 Nov 2018

Thanks for that great writeup. I went with option 1a and pulled the latest docker image (docker pull nvcr.io/nvidia/pytorch:18.11-py3).
torch.backends.cudnn.version() outputs 7401 as expected with the correct CUDA and PyTorch versions, everything worked flawlessly.
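For reference, the integer reported above packs the cuDNN version as major*1000 + minor*100 + patch in the 7.x series. A small helper (the name decode_cudnn_version is made up for this sketch) makes the value readable:

```python
def decode_cudnn_version(v):
    """Split a cuDNN 7.x integer version (major*1000 + minor*100 + patch) into a tuple."""
    major, rest = divmod(v, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


print(decode_cudnn_version(7401))  # (7, 4, 1)
```

This is a quick way to confirm which cuDNN build a given PyTorch install was compiled against, assuming the 7.x encoding; newer cuDNN major versions may use a different scheme.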

Sadly I got basically the same results as above...
FP16: 1.37min per epoch, FP32: 1.34min per epoch

I tested other models as well, and the docker configuration was on average 2secs/epoch slower than my old Cuda9.2, CUDNN 7.14, Pytorch 0.4.1 setup.

So do you think that the strided convolutions are the culprit? Is this something that can be improved on in later versions?

toemm on 18 Nov 2018

Well, thanks for meeting us halfway.

Can you provide a minimal repro (ideally a standalone script using synthetic data) that is representative of your use case? If so, we'll throw it through the profiler and see if anything stands out.

mcarilli on 20 Nov 2018

@mcarilli

Here is the code that reproduces this issue: https://pastebin.com/X1Xs6y42
Usage: Comment and uncomment the model(...) for FP16 or FP32 respectively in the main() function

Explanation:
It's a simple DCGAN structure for generating 256x256 images. The input is random data (real_batch = torch.randint(1000, (batch_size, 3, 256, 256)).cuda()) instead of real images for the test case. The Generator should learn this distribution very quickly (converge to Generator loss = 0), as it does in the FP32 test case after around 17 iterations. The FP16 model only somewhat converges (Generator loss = 0) with dynamic_loss_scale=True after 19 or more iterations (sometimes after a few epochs or never...).

Without dynamic_loss_scale=True the FP16 model doesn't converge (in the test setup: model(fp16=True, dyn_loss_scale=False)). Also if I use the loss criterion = nn.BCEWithLogitsLoss() and remove the last Sigmoid layer in the Discriminator, the model doesn't converge even though that loss is recommended in the AMP documentation. This could explain why my FP16 results are different/worse than with FP32 for my original bigger models, because I've been using BCEWithLogits without dynamic_loss_scale so far. Am I using the BCEWithLogits loss incorrectly? Should I rather use BCELoss with dynamic_loss_scale?
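As background on why a logits-based loss is usually recommended for half precision: a sigmoid output close to 1 rounds to exactly 1.0 in fp16, so a loss computed from that probability has to take the log of 0, whereas the loss computed directly from the logit stays finite. The sketch below shows this with the standard library only, emulating fp16 rounding with struct's "e" format. The logit value is hypothetical, and this illustrates the numerical argument only; it does not diagnose the non-convergence reported above.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy computed directly from the logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# Hypothetical case: the discriminator is confidently wrong about a fake sample.
logit, target = 12.0, 0.0

p16 = to_fp16(sigmoid(logit))            # about 0.999994 in fp32, exactly 1.0 in fp16
naive_is_infinite = (1.0 - p16) == 0.0   # -log(1 - p) would be -log(0)
stable = bce_with_logits(logit, target)  # stays finite, roughly equal to the logit

print(p16, naive_is_infinite, stable)
```

The stable form never materializes the saturated probability, which is the property the logits-based loss relies on.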

Performance:

FP16: 50% less VRAM usage, 15% faster than FP32
Speedup is better than before, but still not close to 2x etc.

nvprof output (grep for 884):

0.17%  114.67ms       128  895.87us  883.23us  1.1464ms volta_fp16_s884cudnn_fp16_256x128_ldg8_dgrad_f2f_exp_small_nhwc2nchw_tt_v1
0.03%  17.058ms       192  88.841us  87.712us  114.53us  volta_fp16_s884cudnn_fp16_128x128_ldg8_dgrad_f2f_exp_small_nhwc2nchw_tt_v1
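To put a number on how little time is spent in these kernels, the percentages in the first column can be summed. This is a small parsing sketch, assuming the nvprof summary layout shown above (time share first, kernel name last) and assuming that "884" in a kernel name marks a Tensor Core kernel, as the grep suggests; the function name tensor_core_share is made up.

```python
NVPROF_LINES = """\
0.17%  114.67ms       128  895.87us  883.23us  1.1464ms volta_fp16_s884cudnn_fp16_256x128_ldg8_dgrad_f2f_exp_small_nhwc2nchw_tt_v1
0.03%  17.058ms       192  88.841us  87.712us  114.53us  volta_fp16_s884cudnn_fp16_128x128_ldg8_dgrad_f2f_exp_small_nhwc2nchw_tt_v1
"""


def tensor_core_share(text, marker="884"):
    """Sum the time percentages of all kernels whose name contains the marker."""
    total = 0.0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith("%"):
            continue  # skip headers and blank lines
        if marker in fields[-1]:
            total += float(fields[0].rstrip("%"))
    return total


share = tensor_core_share(NVPROF_LINES)
print(f"{share:.2f}% of GPU time in kernels matching '884'")
```

With only about 0.2% of GPU time in matching kernels, most of the run is evidently spent in kernels that do not match, which is consistent with the small speed-up reported.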

I've also tested amp, and with the code above I'm getting NaNs after 1 iteration, even though I'm using the recommended BCEWithLogitsLoss() and not BCELoss.

Questions:

What do I need to change to at least get the same convergence as FP32? For such a simple model, FP32 converges very stably after 17 epochs, whereas the FP16 model only sometimes converges, if at all. I'm a bit frustrated with FP16 currently, why won't it work :P.

toemm on 22 Nov 2018

@mcarilli Have you had the time to check this further?

I've done multiple runs with other architectures etc. on Windows, Ubuntu and the Docker Image but I couldn't get more than 15% speedup / +30% more VRAM for these DCGAN architectures. But the Docker Image is faster than Windows and Ubuntu 16.04 so that is something. :P

Also nvprof always shows below 1% usage for the 884 operations.

toemm on 6 Dec 2018

Sorry, I haven't yet...I have a lot of things to track. Can you post your repro as a gist instead of a pastebin? The pastebin link appears to be broken.

mcarilli on 7 Dec 2018

I just wrote a very simple program to benchmark the speed of resnet152 in fp32 and fp16:
https://gist.github.com/matthew-z/2c3067c69ae7835780af361fab6ac82f

I found that loss scaling lowers the speed of fp16 significantly (by about 30%-40%, which makes fp16 only a little quicker than fp32), especially with non-Adam optimizers (e.g., SGD).

I guess the reason is that SGD has to use the Python version of loss scaling, which is very slow, whereas FusedAdam can do the loss scaling in a CUDA kernel (amp replaces Adam with FusedAdam automatically, right?).

I wonder if the team has any plans to update the loss-scaling function or implement more kernels for each optimizer.

Thank you!

matthew-z on 7 Dec 2018

Yes, we do have a plan to include a general fused kernel for loss scaling which should speed it up.

However, in general, if the speedup is affected that strongly by loss scaling, it probably indicates that the network itself isn't achieving very good utilization of the device. This surprises me because our resnet50 example does achieve quite good speedup in FP16 over FP32. Have you tried running the imagenet example with resnet152? It should be as easy as supplying -a resnet152 instead of -a resnet50 to the script. You may have to reduce the batch size to avoid OOM.

mcarilli on 7 Dec 2018

@mcarilli Thank you! I just tried the imagenet example, and FP16 indeed doubles the speed with both static scaling and dynamic scaling.

Then, I wonder if it is possible that AMP is much slower than FP16_optimizer?

The model in my gist is the same as the one in the imagenet example (both are torchvision.models.resnet152), and I think the only difference is that I used the AMP context manager for loss scaling, whereas the example used FP16_optimizer.

Update: I just realized that the problem was caused by my incorrect data feeding. After fixing that, AMP is as fast as FP16_optimizer.

matthew-z on 8 Dec 2018

> Sorry, I haven't yet...I have a lot of things to track. Can you post your repro as a gist instead of a pastebin? The pastebin link appears to be broken.

No problem. Here is the gist list: https://gist.github.com/toemm/e5b49327f8ed52bb4ac69ffbfa5e843f

toemm on 13 Dec 2018

@toemm Thanks, I'll run that through the profiler. Ping me if I haven't replied by early next week.

mcarilli on 14 Dec 2018

@mcarilli I'm also seeing this performance issue with u-nets.

Here's a comparison of architectures.
Resnet is indeed 2x faster. Unet is 2x slower when converted to half.
The Unet arch is taken from the pix2pix repo:
https://gist.github.com/bearpelican/33828d56f4471ab034ab33114f2e7517

Digging a little more, one possible cause:
When number of filters is low (<32), convolutions are slower on half precision than full precision. However, when I change the number of filters/channels to something like 512, half precision becomes faster.
https://gist.github.com/bearpelican/bbd6f2f027e78c7888f9ff44031eb0ea

Any idea why this is the case? I was under the impression that N and C only needed to be multiples of 8 to use the tensor cores.
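As a quick way to audit an architecture against the multiple-of-8 rule mentioned here, the channel counts of each layer can be listed and checked. The sketch below does this in plain Python for the discriminator from the original post, with nc = 3 taken from the "512x512x3" comment and ndf = 64 as a hypothetical value; the helper names are made up. Satisfying the rule is only a precondition, and as the measurements above suggest, it does not guarantee a speed-up for small channel counts.

```python
def is_multiple_of(channels, multiple=8):
    """True if the channel count meets the multiple-of-8 rule of thumb."""
    return channels % multiple == 0


def round_up(channels, multiple=8):
    """Smallest multiple of `multiple` that is >= channels."""
    return -(-channels // multiple) * multiple


nc, ndf = 3, 64  # nc from the original post's comments; ndf is a hypothetical value

# Input channels, the seven hidden widths, and the single output channel.
disc_channels = [nc] + [ndf * m for m in (1, 2, 4, 8, 16, 32, 64)] + [1]

unfriendly = [c for c in disc_channels if not is_multiple_of(c)]
padded = [round_up(c) for c in unfriendly]

print(unfriendly)  # [3, 1]
print(padded)      # [8, 8]
```

Under these assumptions, only the first and last layers fall outside the rule (the RGB input and the single-channel output), so the bulk of the network already meets it.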

bearpelican on 28 Dec 2018

ngimel on 28 Dec 2018

Ahh you are totally right @ngimel. Looks like mine is a separate issue with upsampling. Thanks for pointing me in the right direction. PixelShuffle does get a speedup.

Does this PR help with atomicAdd performance?
Just wondering if this issue is fixed in the latest builds. Though I believe I am already using latest cuda/cudnn/pytorch versions.

bearpelican on 28 Dec 2018

Unfortunately no, it still maps to the same CAS emulation underneath, IIRC. What would help is rewriting upsampling backward w/o atomicAdds (I think average pooling forward can be tortured into computing what upsampling nearest backward computes, and it does not use atomics), or rewriting it in such a way so that atomicAdd is always called on half2. It's hard to guarantee necessary alignment for arbitrary image sizes, though.
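The average-pooling idea can be verified on a small example: for an integer scale factor s, the backward pass of nearest-neighbor upsampling sums each s x s block of the output gradient, which is s*s times what average pooling with kernel s computes. The sketch below checks this in plain Python with nested lists; the function names are made up and the 4x4 gradient is arbitrary.

```python
def upsample_nearest_backward(grad_out, scale):
    """Reference backward: every input pixel accumulates the gradients of its copies."""
    h, w = len(grad_out) // scale, len(grad_out[0]) // scale
    grad_in = [[0.0] * w for _ in range(h)]
    for r, row in enumerate(grad_out):
        for c, g in enumerate(row):
            # On a GPU, this scattered accumulation is where atomicAdd would be used.
            grad_in[r // scale][c // scale] += g
    return grad_in


def avg_pool2d(x, k):
    """Non-overlapping k x k average pooling (a gather, so no atomics are needed)."""
    h, w = len(x) // k, len(x[0]) // k
    return [
        [sum(x[r * k + i][c * k + j] for i in range(k) for j in range(k)) / (k * k)
         for c in range(w)]
        for r in range(h)
    ]


scale = 2
grad_out = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]

reference = upsample_nearest_backward(grad_out, scale)
via_pool = [[v * scale * scale for v in row] for row in avg_pool2d(grad_out, scale)]

print(reference)  # [[14.0, 22.0], [46.0, 54.0]]
print(via_pool)   # [[14.0, 22.0], [46.0, 54.0]]
```

The equivalence as written only covers integer scale factors with output sizes divisible by the factor; arbitrary sizes would need extra care.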

ngimel on 28 Dec 2018

I'm having the same convergence problem with apex on a GAN. Is there any further progress? Thanks.