Can you please add a num_workers option for the Dataloader to speed up the data loading process?
I tried it myself with this tutorial https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel, but I couldn't get it to work.
I don't really get what the part at dataset.py Line 210 - 212 does.
    img_all = np.stack(img_all)[:, :, :, ::-1].transpose(0, 3, 1, 2)  # BGR to RGB and cv2 to pytorch
    img_all = np.ascontiguousarray(img_all, dtype=np.float32)
    img_all /= 255.0
@Jaiczay if I had a dime for every time someone didn't know what some section does... I don't have time to be a teacher, and line 210 is already commented with an explanation, so I've simply added comments to the next two lines to avert the same question from someone else down the line.
You'd have to convert this implementation to use the PyTorch dataloader in order to access the num_workers argument. We ditched it in the past because it was too slow. I don't know if this is still the case or not, but what causes you to believe that the data loading process is a chokepoint?
Do you have profiler results to show?
Because my GPU runs at 30-40% and only one CPU thread runs at 95%.
Ah this is strange. Are you using a single GPU or multiple? Can you try something like https://github.com/spyder-ide/spyder-line-profiler to figure out exactly which areas are causing the slowdown?
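For anyone who wants to check this without an IDE plugin, here is a rough sketch using Python's built-in `cProfile` instead of the line profiler linked above. `load_batch` and `train_step` are hypothetical stand-ins for the data loading and forward/backward code, not functions from this repo, and the sleep times are made up.

```python
import cProfile
import io
import pstats
import time


def load_batch():
    """Hypothetical stand-in for reading and augmenting one batch of images."""
    time.sleep(0.02)


def train_step():
    """Hypothetical stand-in for the forward/backward pass on one batch."""
    time.sleep(0.005)


def profile_batches(n_batches=3):
    """Profile a few loop iterations and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n_batches):
        load_batch()
        train_step()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the slowest call paths are listed first
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


report = profile_batches()
print(report)
```

If the data loading function dominates the cumulative time column, that supports the idea that the dataloader is the chokepoint rather than the GPU work.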
@Jaiczay I don't have access to a local GPU, but if I run one batch of default COCO training on my MacBook Pro, I see the dataloader uses up about 240 ms. If this is true then yes, you might be correct that the dataloader is a chokepoint in the training process. As a reference, on a GCP instance with one P100 a batch nominally takes about 600 ms to process.

If I dig deeper into the dataloader, it seems the slowest part of the process is simply loading the jpegs (which are compressed of course, hence the slow speed). Not much to do about that unless you were to decompress all the jpegs (which might be a good idea if you plan to do lots of training). np.loadtxt() might also be replaced with some faster code. Ok, so I'll look into replacing np.loadtxt() and multithreading the dataloader. I should have some answers in the next day or two.

Good news. I was able to replace np.loadtxt() with python code which reduced the labels loading time from 19ms to 7ms. This shaves off 12 ms from the 240 ms batch times (5% faster). This update is now in commit https://github.com/ultralytics/yolov3/commit/9885903baf3435d8f3de1a0648d47e17ff05e241.
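The commit's actual code is not reproduced in this thread, so the following is only a minimal sketch of what replacing `np.loadtxt()` with plain Python parsing could look like. The function name `load_labels` and the row format (class, x, y, w, h separated by whitespace) are assumptions for illustration.

```python
import os
import tempfile


def load_labels(path):
    """Parse a whitespace-separated label file into a list of float rows."""
    with open(path, "r") as f:
        # Skip blank lines; every remaining line becomes one row of floats
        return [[float(v) for v in line.split()] for line in f if line.strip()]


# Write a small example file (assumed format: class x y w h) and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w") as f:
        f.write("0 0.5 0.5 0.2 0.3\n1 0.25 0.75 0.1 0.1\n")
    labels = load_labels(path)

print(labels)
```

The result can be passed to `np.array(labels, dtype=np.float32)` if an array is needed downstream.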

I'll try multithreading the dataloader next, though this will surely take longer to complete.
Thx for the quick answer! I got it running with the PyTorch dataloader, and at least I don't get an error any more, but the loss values are all nan. I will try out your fix first before I fix that.
Overall it's not that important, I just wanted to get a little bit more out of my new GPU.
@Jaiczay I've got wonderful news. I re-added support for the PyTorch DataLoader, including num_workers argument, and tested the data loading speeds in various configurations, with excellent speed improvements observed. Updates are in 70fe2204b4250c238be4c32e65f8038a297059cf.
IMPORTANT: Note that cv2.setNumThreads(0) must be set when using num_workers>0 in order to prevent OpenCV from trying to multithread on its own. train.py does this automatically now:
https://github.com/ultralytics/yolov3/blob/0fb2653c59dffe398e511fbdbca80ace2089753c/train.py#L44-L48
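The linked lines in train.py are not reproduced here. As a rough sketch of the general pattern, assuming a hypothetical `SyntheticImages` dataset and `make_loader` helper (neither is from this repo), disabling OpenCV threading before building a multi-worker `DataLoader` could look like this. The `cv2` import is guarded only so the sketch also runs where OpenCV is not installed.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

try:
    import cv2
    # Stop OpenCV from spawning its own threads inside each DataLoader worker
    cv2.setNumThreads(0)
except ImportError:
    cv2 = None  # assumption for this sketch: without OpenCV there is nothing to disable


class SyntheticImages(Dataset):
    """Hypothetical dataset returning a random CHW float image and its index."""

    def __init__(self, n=8, size=32):
        self.n = n
        self.size = size

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        rng = np.random.default_rng(index)
        img = rng.random((3, self.size, self.size), dtype=np.float32)
        return torch.from_numpy(img), index


def make_loader(dataset, batch_size=4, num_workers=0):
    """Build a DataLoader; num_workers > 0 loads batches in worker processes."""
    return DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers)
```

To try several worker counts as in the table below, call something like `make_loader(SyntheticImages(), batch_size=4, num_workers=4)` from inside an `if __name__ == "__main__":` block, since some platforms start workers by re-importing the main script.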
https://support.apple.com/kb/SP776?locale=en_US&viewlocale=en_US
Machine type: 2018 MacBook Pro (6 physical CPU cores, 12 vCPUs, 16 GB memory)
CPU platform: 2.2GHz 6-core Intel Core i7, Turbo Boost up to 4.1GHz, with 9MB shared L3 cache
GPUs: None
HDD: 256 GB SSD
num_workers | cv2.setNumThreads(0) | DataLoader speed (ms/batch)
--- |--- |---
0 | False | 206ms (this repo default)
0 | True | 291ms
1 | True | 252ms
2 | True | 131ms
4 | True | 75ms
6 | True | 57ms
8 | True | 54ms
10 | True | 52ms
12 | True | 51ms
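To put the table in relative terms, this small calculation divides the repo-default time (206 ms) by each multi-worker time. The numbers are copied from the table above; nothing new is measured.

```python
# DataLoader times from the table above, in ms per batch
baseline_ms = 206  # num_workers=0, cv2.setNumThreads(0) not set (repo default)
timings_ms = {1: 252, 2: 131, 4: 75, 6: 57, 8: 54, 10: 52, 12: 51}

# Speedup relative to the default; values below 1 mean slower than the default
speedups = {workers: round(baseline_ms / ms, 2) for workers, ms in timings_ms.items()}

for workers, factor in speedups.items():
    print(f"num_workers={workers}: {factor}x")
```

A single worker is slower than the default, and the gains flatten out past 6 workers, which matches the 6 physical cores of the test machine.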
Wow, that's great!
Thank you btw for the awesome repo, this is by far the best PyTorch implementation of YOLOv3!
@glenn-jocher I think you need to update test.py as well, because when I continue training I suddenly get a mAP of 0.5, and before it was around 0.94.
Hmmm yes I think there might be a problem in train.py, maybe in the target loading order. Since they are coming in asynchronously now there may be some sort of issue in assigning targets to images. So your resumed training may be bringing your mAP to zero eventually. I'll try and sort it out later today.
test.py currently works fine, for example with yolov3.weights. But yes I should migrate that over to the dataloader also for faster speed.
Current workaround is not to use MultiGPU.
@glenn-jocher I haven't looked into the details of the update. I just wanted to point out that if you are using the DataLoader from the PyTorch library, the worker processes might mess up the random seeds. Say you have 4 workers: you might end up with the same augmentation in all 4. It's also almost impossible to get deterministic behavior (if that's a concern) without modifying the DataLoader class or simply writing your own multi-processing dataloader. If all this is already considered, I guess it's all good :)
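One common way to address the seeding concern without writing a custom loader is the `worker_init_fn` hook that `DataLoader` already provides. The sketch below follows the pattern from PyTorch's reproducibility notes; `seed_worker` and `make_seeded_loader` are illustrative names, and whether this repo needs or uses anything like it is not stated in the thread.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id):
    """Give each DataLoader worker its own NumPy and `random` seed.

    Inside a worker, torch.initial_seed() already differs per worker,
    so a seed derived from it differs per worker as well.
    """
    worker_seed = torch.initial_seed() % 2**32  # NumPy seeds must fit in 32 bits
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_seeded_loader(dataset, batch_size, num_workers, seed=0):
    """Build a shuffling DataLoader whose order and worker seeds are reproducible."""
    generator = torch.Generator()
    generator.manual_seed(seed)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=generator,
    )


# Tiny example dataset; num_workers=0 keeps this demo in a single process
dataset = TensorDataset(torch.arange(8))
loader = make_seeded_loader(dataset, batch_size=4, num_workers=0, seed=0)
```

With this in place, augmentations that draw from `np.random` or `random` should no longer repeat across workers, and two loaders built with the same `seed` shuffle in the same order.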
@Jaiczay These operations just convert BGR to RGB and reorder the axes. After np.stack the array is batch_size x h x w x channel, and the ::-1 slice simply reverses the channel dimension. The transpose then reorders it to batch_size x channel x h x w.
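To make the shapes concrete, here is a small self-contained run of the three lines quoted in the question from dataset.py. The two 4x5 all-blue images are made-up test data standing in for what `cv2.imread` would return.

```python
import numpy as np

# Two tiny 4x5 images in OpenCV layout: height x width x channel, channel order B, G, R
img_all = [np.zeros((4, 5, 3), dtype=np.uint8) for _ in range(2)]
for img in img_all:
    img[..., 0] = 255  # fill only the blue channel (index 0 in BGR)

batch = np.stack(img_all)            # shape (2, 4, 5, 3): N x H x W x C
batch = batch[:, :, :, ::-1]         # reverse the channel axis: BGR -> RGB
batch = batch.transpose(0, 3, 1, 2)  # shape (2, 3, 4, 5): N x C x H x W
batch = np.ascontiguousarray(batch, dtype=np.float32)  # contiguous float32 copy
batch /= 255.0                       # scale pixel values from 0-255 to 0-1

print(batch.shape, batch.dtype)
```

After the conversion the blue values sit in channel index 2 (the last RGB channel) and equal 1.0, while the red channel at index 0 is all zeros.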