I'm implementing a BERT-like model in FP16. Training works well for the first few thousand iterations, but with both dynamic loss scaling (the scale factor goes all the way down to 1) and a static scale factor of 128, I eventually hit a gradient overflow error on every iteration, and it never recovers. The loss up to that point is a reasonable, small scalar (around 4), and the overflow seems to come out of nowhere. I haven't been able to pin down the exact cause.
What are some good ways to track down the source of this error, and what are some possible causes you might be familiar with? I'm still on the old API but from what I understand switching to the new one shouldn't fix this problem one way or another.
Thanks!
Are you using the old API via an old version of master, or the old API via a new version of master? Either way, you should switch to the new API. Conceptually, there may be no difference between the old API and the new API with corresponding opt_levels+properties, but I have observed cases where the new API has resolved unexpected errors. I have more thorough test coverage of the new API, so the new API may offer robustness as well as performance benefits. I can't guarantee it will resolve your issue but it's worth a try. If that fails, we can dig deeper into what's causing the infs.
I played around with the epsilon values and it ended up working fine. Thanks!
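The thread doesn't say which epsilon was changed or to what value. As a hedged illustration of why an epsilon can matter in FP16, the sketch below assumes it is a small constant such as Adam's default `eps=1e-8` (or a normalization layer's epsilon) that ends up evaluated in half precision. The helper `to_fp16` is a hypothetical name, and it uses only Python's `struct` module to round a value to IEEE half precision.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    # The "e" format code is the 16-bit binary16 float.
    return struct.unpack("e", struct.pack("e", value))[0]


# Candidate epsilon values (illustrative, not taken from the thread).
for eps in (1e-8, 1e-6, 1e-4):
    print(f"eps={eps:g} -> stored in fp16 as {to_fp16(eps)!r}")
```

A value of 1e-8 is below the smallest FP16 subnormal (about 6e-8), so it rounds to zero, and a denominator such as `sqrt(v) + eps` loses its protection against division by zero. Larger epsilons survive the conversion.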
This actually is still happening on the new API. I'm using DDP for multiple GPUs. How do you suggest debugging this?
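A generic first step that doesn't depend on Amp's internals is to check, right after `backward()`, which parameters carry a missing or non-finite gradient. This is a minimal sketch in plain PyTorch. `report_nonfinite_grads` is a hypothetical helper, not part of Apex, and the toy `nn.Linear` model only stands in for the real network.

```python
from typing import List

import torch
from torch import nn


def report_nonfinite_grads(model: nn.Module) -> List[str]:
    """Return a description of every parameter whose grad is None or has inf/NaN."""
    problems = []
    for name, param in model.named_parameters():
        if param.grad is None:
            problems.append(f"{name}: grad is None")
        elif not torch.isfinite(param.grad).all():
            problems.append(f"{name}: non-finite grad")
    return problems


# Toy usage: feed an input containing inf so the weight gradient becomes non-finite.
model = nn.Linear(4, 2)
x = torch.ones(1, 4)
x[0, 0] = float("inf")
model(x).sum().backward()

problems = report_nonfinite_grads(model)
print(problems)
```

With dynamic loss scaling, an occasional overflow is expected and simply skips a step. What is worth investigating is the first layer that reports non-finite values on every iteration. Under DDP, running the check on each rank can show whether the problem is tied to a particular data shard.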
Is this caused by #224, or separate?
I'm honestly unsure, but I think I did something that fixed #224 - not sure what - and all model outputs are now non-NaN. I'm now using FusedAdam and running that through amp.initialize, but the loss scale stays at 65536 and never goes down. For some reason (related?) the gradients everywhere in the network are NaN (when I check on one of the parameters in model.parameters()), right from the beginning. I'm trying to figure out why the loss scale isn't being updated but is that even where I should be looking?
After some more debugging, I see that overflow is perhaps not being set because all gradients (both model and master weights - which by the way are all in all_fp32_from_fp32_params in the FP16_Optimizer) are None here: https://github.com/NVIDIA/apex/blob/master/apex/amp/scaler.py#L100
Also - I try manually setting a loss scale of 64 (this way, the gradients I observe on the model are not NaN) but still see no update to model weights after a call to optimizer.step(). I believe this is happening because both p.grad and grad are always None here: https://github.com/NVIDIA/apex/blob/master/apex/optimizers/fused_adam.py#L110
I wonder if the gradients are None as part of the flattening procedure that takes place? That is, the parameters stored in param_groups are just long (emphasis on long - one of my param groups has a flattened dimensionality of >160M) Tensors, so how can they have gradients? Many different params, each with its own gradient, are flattened into one Tensor. I could be way off here though.
I appreciate any assistance!
FusedAdam originated outside Amp and I had to kind of frankenstein-splice it in. There's a PR open that implements FusedAdam in a cleaner, more general way (https://github.com/NVIDIA/apex/pull/197). I'm also currently working on a branch to handle arbitrary combinations of optimizers/models/losses (https://github.com/NVIDIA/apex/tree/no_wrap_optimizers), which I hope will resolve https://github.com/NVIDIA/apex/issues/179 and related issues. As part of this refactor, I'll be integrating the new FusedAdam (working on this tomorrow).
My advice is, don't waste time debugging the current implementation. I'm not surprised it's flaky. Give me a few more days to fold in these updates, then we can revisit.
I'm training a CNN model (a modified hourglass network) and ran into a similar issue. After balancing the output losses of the multiple branches, I can train the model at _opt-level O0_, and the output tensors look normal and correct. But training fails once I switch to _opt-level O1_: the gradients overflow (the loss scale gets smaller and smaller until it reaches abnormal values like 1e-32) and some of my model's weights become NaN. If I clamp the output tensor to a fixed range, training no longer stops with an error, but the model doesn't seem to update, and the output tensors contain many NaN or inf values.
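A loss scale as small as 1e-32 is what a dynamic scaler produces when every single iteration is flagged as an overflow. The toy scaler below is a simplified sketch, not Apex's actual implementation: the class name, the halving factor, and the growth interval of 2000 steps are all assumptions chosen for illustration.

```python
class ToyDynamicLossScaler:
    """Simplified dynamic loss scaling (illustrative; not Apex's code)."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, growth_interval=2000):
        self.scale = init_scale
        self.factor = factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, overflow: bool) -> bool:
        """Adjust the scale; return True if the optimizer step should run."""
        if overflow:
            # Shrink the scale and skip this step.
            self.scale /= self.factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            # Grow the scale again after a run of clean steps.
            self.scale *= self.factor
        return True


scaler = ToyDynamicLossScaler()
# Simulate 120 consecutive iterations that all report inf/NaN gradients.
for _ in range(120):
    scaler.update(overflow=True)
print(scaler.scale)  # 2**16 / 2**120 = 2**-104, roughly 4.9e-32
```

No scale factor can repair an inf or NaN that already exists in the forward pass or in the weights, so a scale collapsing this far suggests looking at the activations and loss rather than at the scaler.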
@mcarilli Sure. It is training without a hitch using regular Adam. I think there should be an appreciable speedup switching to FusedAdam, though.
@mcarilli Has there been any progress with FusedAdam?
Cleaning up FusedAdam was gated on https://github.com/NVIDIA/apex/pull/232 which is now merged. FusedAdam is my top priority right now because we also need it for internal users. I'll comment when it's ready to try.
Hey @mcarilli, was the FusedAdam optimizer ever implemented? If so - do you know what kind of speedup I might expect to see when using it vs the regular Adam? Thanks!
Yes, but the better implementation is currently still in a side branch (https://github.com/NVIDIA/apex/tree/multi_tensor_sgd), and it still only works with opt level O2. This side branch also has a fused SGD optimizer that works for any opt level. I still need to make FusedAdam work with other opt levels, and add documentation for FusedSGD, before they can be merged (targeting next week).
I'm not sure what speedup is expected, it depends on the model and how much time is spent in optimizer.step(). The call to optimizer.step() itself should be 3-4X faster than a non-fused implementation.
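To turn the 3-4X figure for `optimizer.step()` into a rough whole-iteration estimate, Amdahl's law can be applied once the fraction of iteration time spent in the step is known. This is a back-of-the-envelope sketch. The fractions below are hypothetical, not measurements from any model in this thread.

```python
def overall_speedup(step_fraction: float, step_speedup: float) -> float:
    """Amdahl's law: iteration speedup when only optimizer.step() gets faster.

    step_fraction: share of iteration time spent in optimizer.step() (0..1).
    step_speedup:  how many times faster the fused step is (e.g. 3.0 or 4.0).
    """
    return 1.0 / ((1.0 - step_fraction) + step_fraction / step_speedup)


# Hypothetical shares of iteration time spent in optimizer.step().
for fraction in (0.05, 0.2, 0.5):
    for step_speedup in (3.0, 4.0):
        total = overall_speedup(fraction, step_speedup)
        print(f"step share {fraction:.0%}, step {step_speedup:.0f}x faster "
              f"-> iteration {total:.2f}x faster")
```

The share of time spent in `optimizer.step()` can be measured with a profiler or with timers around the call. On a GPU, synchronize before reading the clock so that asynchronous kernels are included.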
That sounds good. I'm excited to give it a try when it's all merged to the master branch! I can report some speedup metrics as well.