Yolov5: DALI and Horovod

Created on 19 Sep 2020 · 6 comments · Source: ultralytics/yolov5

🚀 Feature

Training could be sped up pretty significantly with the addition of Horovod in place of DistributedDataParallel, and DALI for data loading. I won't be able to submit a PR, but I'd like to submit the feature request for consideration.

Motivation

Increasing training speed makes hyperparameter tuning much more feasible. Benchmarking with GluonCV SSD-512 (batch size 64) on 4 V100 GPUs, I got:

  • Parameter server: ~100 samples/second
  • Horovod: ~200 samples/second
  • Horovod+AMP: ~280 samples/second
  • Horovod+AMP+DALI: ~400 samples/second

Pitch

Implement Horovod as the default multi-GPU training backend, and DALI as the default COCO dataloader.
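For context, here is a minimal sketch of how Horovod typically wraps a PyTorch training loop. This is not yolov5-specific; `build_model` and `train_loader` are placeholders standing in for the repo's model and dataloader.

```python
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU, launched via horovodrun
torch.cuda.set_device(hvd.local_rank())      # pin each process to its local GPU

model = build_model().cuda()                 # placeholder: any torch.nn.Module
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale LR by world size

# Wrap the optimizer so gradients are allreduced across workers each step
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Make sure all workers start from identical weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for images, targets in train_loader:         # placeholder dataloader, sharded per rank
    optimizer.zero_grad()
    loss = model(images.cuda(), targets.cuda())
    loss.backward()
    optimizer.step()
```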

enhancement

All 6 comments

Hello @austinmw, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom model or data training question, please note Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:

  • Cloud-based AI systems operating on hundreds of HD video streams in realtime.
  • Edge AI integrated into custom iOS and Android apps for realtime 30 FPS video inference.
  • Custom data training, hyperparameter evolution, and model exportation to any destination.

For more information please visit https://www.ultralytics.com.

@austinmw thanks for the suggestions and the numbers! We do already have two of these features in place, and have an excellent DDP implementation that results in near linear epoch time reductions as GPU count increases. See https://github.com/ultralytics/yolov5/issues/475

If you would like to contribute, a DALI dataloader implementation (in utils/datasets.py) might be very useful, and as you say it may help loading times for larger datasets (a rough sketch follows the list below). Two features we already use are:

  • Native PyTorch AMP is already utilized by default for all CUDA training in this repo.
  • Dataset loading times can actually be eliminated entirely by passing the --cache argument (i.e. python train.py --cache), which caches all train and test images in system RAM before training starts. This gives the fastest dataloader speeds possible, faster than DALI, since both reading data from disk and decompression on CPU are eliminated. It is suitable for small to medium sized datasets (perhaps <10k images, depending on your hardware constraints), which I believe covers most users.
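For illustration, a minimal sketch of what a DALI image pipeline could look like. This is not the repo's implementation: it only covers file reading, GPU decoding, and resizing, not yolov5's label handling or mosaic augmentation, and the directory path, image size, and batch size are placeholder assumptions (recent DALI releases assumed).

```python
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def
def image_pipeline(data_dir, img_size):
    # Read encoded JPEGs from disk, decode on GPU ("mixed"), resize to the training size
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=img_size, resize_y=img_size)
    images = fn.crop_mirror_normalize(images, dtype=types.FLOAT, output_layout="CHW",
                                      std=[255.0, 255.0, 255.0])  # 0-255 -> 0-1, NCHW
    return images, labels

# Placeholder path and hyperparameters
pipe = image_pipeline("coco/images/train2017", 640,
                      batch_size=64, num_threads=4, device_id=0)
pipe.build()
loader = DALIGenericIterator([pipe], ["images", "labels"], reader_name="Reader")

for batch in loader:
    images = batch[0]["images"]  # already a CUDA tensor, shape (N, 3, 640, 640)
```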

@glenn-jocher Thanks, it's great to hear that the DDP implementation is very efficient!

In my particular case I'd like to train on a dataset on the order of 10M high-res images. I wonder if you happen to have any optimization suggestions for very large datasets like this? (I'd like to use a shared file system and distribute across 3-4 machines each with 8x V100 and 488GB RAM)

@austinmw distributed DDP with multiple nodes should work correctly for your application.

In terms of the dataset, 10M is naturally too many images to cache. Your options for speeding up dataloading are to store the images in an uncompressed format such as BMP, which shifts some of the time cost from CPU decompression to SSD reads (assuming you are not read-speed constrained), to resize your images to the training size offline prior to training, or both.

For example if my dataset consists of 12MP iPhone images but I train at --img 640, I will incur an enormous disk read, decompression, and resizing overhead (for each image at each epoch), which can largely be mitigated with the steps above.
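As an illustration of the offline resize suggested above, here is a minimal sketch; the source/destination directories and glob pattern are placeholder assumptions.

```python
import cv2
from pathlib import Path

SRC, DST, MAX_SIDE = Path("images_full"), Path("images_640"), 640  # placeholder paths / size
DST.mkdir(exist_ok=True)

for p in SRC.glob("*.jpg"):
    im = cv2.imread(str(p))                      # BGR, HxWxC
    h, w = im.shape[:2]
    r = MAX_SIDE / max(h, w)                     # shrink so the long side is MAX_SIDE
    if r < 1.0:
        im = cv2.resize(im, (int(w * r), int(h * r)), interpolation=cv2.INTER_AREA)
    cv2.imwrite(str(DST / p.name), im)           # re-encode at the smaller size
```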

Thanks, very much appreciated!

One follow-up question if you happen to have the time: how does the time cost of common resizing methods scale with the downsample percentage? For example, if I resize images from 720x1280 to just 736x1280, is that similar or much less significant?

@austinmw the time depends on your local hardware and images, so there's no point in asking me; cv2.imread() different image sizes and types to find out for yourself.
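For example, a quick timing sketch along those lines; the file names are placeholders.

```python
import time
import cv2

def load_time(path, size, repeats=20):
    """Average seconds to read + resize one image."""
    t0 = time.perf_counter()
    for _ in range(repeats):
        im = cv2.imread(path)
        im = cv2.resize(im, (size, size), interpolation=cv2.INTER_LINEAR)
    return (time.perf_counter() - t0) / repeats

for f in ["sample_12mp.jpg", "sample_720p.jpg"]:   # placeholder file names
    print(f, f"{load_time(f, 640) * 1e3:.1f} ms")
```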
