Detectron2: Support for mixed precision training

Created on 15 Oct 2019  ·  8 Comments  ·  Source: facebookresearch/detectron2

Is there any plan to support mixed precision training soon? It is already supported in the maskrcnn-benchmark code.

enhancement

All 8 comments

@ppwwyyxx I would like to send a PR for apex training. https://github.com/lbin/detectron2/blob/apex/projects/ApexTrainer/README.md

@lbin Hi, thanks for the great contribution.

I have also been working on FP16 training recently. From your code, it seems that only the amp.initialize and amp.scale_loss APIs are used. In my experience with image classification, DistributedDataParallel should also be replaced with from apex.parallel import DistributedDataParallel as DDP. However, I noticed that you commented out line 75 in apex_trainer.py. Is there a reason for not using the apex DDP?

BTW, I'm also confused by the numbers on the page https://github.com/lbin/detectron2/blob/apex/projects/ApexTrainer/README.md
The reported speed and memory figures are higher for the ApexTrainer case; I expected them to be lower than for the FP32 version.

I'd appreciate it if you could clarify. Thanks.

@blueardour I used batch size 32, not 16, so the speed and memory numbers are higher in my case.

For DDP, see:
https://github.com/NVIDIA/apex/tree/master/examples/imagenet#distributed-training

@lbin Thanks for the tips.

Given that PyTorch now natively supports mixed precision training, is there a plan to integrate that?

At least for RetinaNet, it was quite straightforward to use mixed precision training (using PyTorch 1.6's native support for it). You can see how I did it here: https://github.com/indigoviolet/detectron2/commit/d175f4a823cfdcce417e5ea38acbabc5215b9294
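
For readers unfamiliar with the native API, the general pattern looks roughly like the sketch below. It is not taken from the linked commit and is not detectron2 code: the toy model, the train_step helper, and the random data are assumptions made for illustration. Only torch.cuda.amp.autocast and torch.cuda.amp.GradScaler (available since PyTorch 1.6) are the real API. On a machine without a GPU, both are disabled and the loop runs in plain FP32.

```python
import torch
from torch import nn

# AMP is only enabled when a CUDA device is present; otherwise this is plain FP32.
use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

# Toy stand-in for a real detection model (assumption for illustration).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# GradScaler scales the loss so small FP16 gradients do not underflow.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(inputs, targets):
    """Run one optimization step with autocast and loss scaling (hypothetical helper)."""
    optimizer.zero_grad()
    # Inside autocast, eligible ops run in half precision; the weights stay FP32.
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, skips the step on inf/NaN
    scaler.update()                # adjusts the scale factor for the next step
    return loss.item()


torch.manual_seed(0)
inputs = torch.randn(8, 16, device=device)
targets = torch.randint(0, 4, (8,), device=device)
losses = [train_step(inputs, targets) for _ in range(3)]
print(losses)
```

Unlike apex, this approach leaves the model and optimizer unmodified, so the changes are confined to the forward pass and the backward/step calls in the training loop.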

I see some memory savings from this, but I'm not sure how best to check whether half-precision operations are happening everywhere possible or whether there is more work to do. Any pointers on that would be useful.
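
One possible way to approach this question, offered as a rough sketch rather than an established recipe, is to register forward hooks and record the output dtype of each leaf module during a forward pass under autocast. The audit_output_dtypes helper and the toy model are hypothetical. The sketch uses torch.autocast, which is a newer API than the PyTorch 1.6 release mentioned above, and falls back to bfloat16 on CPU so that it can run without a GPU.

```python
import torch
from torch import nn


def audit_output_dtypes(model, example, device_type, amp_dtype):
    """Return {module_name: output dtype} for leaf modules under autocast (hypothetical helper)."""
    seen = {}
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers, only leaf modules are recorded

        def hook(mod, inp, out, name=name):
            if isinstance(out, torch.Tensor):
                seen[name] = out.dtype

        handles.append(module.register_forward_hook(hook))
    try:
        with torch.no_grad(), torch.autocast(device_type=device_type, dtype=amp_dtype):
            model(example)
    finally:
        for handle in handles:
            handle.remove()  # always detach the hooks again
    return seen


device_type = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16

# Toy stand-in for a real model (assumption for illustration).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(device_type)
example = torch.randn(2, 16, device=device_type)

report = audit_output_dtypes(model, example, device_type, amp_dtype)
for name, dtype in report.items():
    print(name, dtype)
```

Modules whose outputs remain torch.float32 are either kept in full precision by autocast on purpose (for example, many reductions and loss functions) or are candidates for a closer look. This only inspects module outputs, so functional calls inside a forward method are not covered.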

We plan to add support this year. https://github.com/facebookresearch/detectron2/pull/925 is a good example of how to do it with apex.

@ppwwyyxx Hi, sorry to bring this up again. Can I force only fp16 for models and training, or is this still a work in progress?
