Is your feature request related to a problem? Please describe.
Lightning already handles a lot of the parallelization and best practices needed for speed, so image processing and augmentations often become the bottleneck.
Describe the solution you'd like
Support for, or even integration with, DALI.
For reference
https://devblogs.nvidia.com/fast-ai-data-preprocessing-with-nvidia-dali/
I am willing to help implement, but this is a new API for me as well.
Right now, lightning isn't meant to help with the data transformation layer, according to the image below. @williamFalcon, do you think that should change?
DALI is cool and we (Latent Space) plan on using it with lightning soon as well. What kind of integration would be useful?
Is it currently possible to use DALI with lightning?
DALI bypasses the PyTorch Dataset and DataLoader APIs and instead opts to use its own external data loading classes. Can train_dataloader accept these classes? Since DALI loads data onto specific GPUs, I assume there would need to be some integration with lightning's parallelization implementations as well.
In trainer.py -> evaluate(), you can see how we call the dataloaders returned from the user's val_dataloaders. It's pretty generic - just requires the dataloader to return a batch when iterated upon. I'm pretty sure it would be trivial to use a DALI (or any) dataloader with this.
If we can't directly return a DALI dataloader, we could return a simple generator function or iterator from train_dataloader that fits the format as well.
As for the GPU stuff, I don't actually think there needs to be any extra parallelization work, since we aren't using the model with DALI, so each can deal with its own GPU logic. Maybe there are some specific pain points that show up once it's actually being used that we could tackle, but I can't speculate.
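Something like this (untested) sketch is what I have in mind for the generator idea above; self.dali_iterator is just a placeholder for an already-built DALI iterator, and the "data"/"label" keys are example output names:

import pytorch_lightning as pl

class DaliWrappedModel(pl.LightningModule):
    def train_dataloader(self):
        def batch_generator():
            # The DALI PyTorch plugin yields a list with one dict per pipeline,
            # keyed by the names given in output_map.
            for batch in self.dali_iterator:  # placeholder attribute
                yield batch[0]["data"], batch[0]["label"]
        return batch_generator()

As long as whatever is returned yields batches when iterated, the training loop shouldn't care where the batches come from.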
Yeah, I'm going to need to test this at some point, but I assume DALI would need to distribute data to the correct GPUs itself, since it doesn't have a distributed data sampler and instead relies on its own sharding pipeline. I would also guess that having mini-batches processed by DALI on the same GPU where they are needed for training would cut down on transfer overhead between GPUs (if DALI manages data transfers between GPUs at all), hence the lightning integration.
This might provide a bit of insight:
https://github.com/NVIDIA/DALI/issues/1175
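For context, DALI's own sharding looks roughly like this. This is an untested sketch using the ops-based pipeline API (which may differ between DALI versions); the shard_id/num_shards reader arguments are the relevant part, the rest is placeholder:

from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types

class ShardedTrainPipeline(Pipeline):
    def __init__(self, data_dir, batch_size, device_id, shard_id, num_shards, num_threads=4):
        super().__init__(batch_size, num_threads, device_id)
        # Each shard reads a disjoint slice of the dataset, so DALI handles the
        # per-GPU split itself instead of a DistributedSampler.
        self.reader = ops.FileReader(file_root=data_dir,
                                     shard_id=shard_id,
                                     num_shards=num_shards,
                                     random_shuffle=True)
        self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)

    def define_graph(self):
        jpegs, labels = self.reader(name="Reader")
        images = self.decode(jpegs)
        return images, labels

So the open question is mostly how Lightning would pass the right device_id/shard_id into each process.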
I have noticed a few issues when trying to integrate Nvidia DALI with pl. Currently I am using the DALIGenericIterator class from nvidia.dali.plugin.pytorch, but since it is an iterator it doesn't have a length attribute, so I extended the DALI iterator like this:
from nvidia.dali.plugin.pytorch import DALIGenericIterator as PyTorchIterator
import math

class LightningDaliDataloader(PyTorchIterator):
    def __init__(self, pipe, size, batch_size, output_map=["data", "label"],
                 last_batch_padded=False, fill_last_batch=True):
        super().__init__(pipe, size=size, output_map=output_map,
                         last_batch_padded=last_batch_padded,
                         fill_last_batch=fill_last_batch)
        self.dataset_size, self.batch_size = size, batch_size
        self.last_batch_padded = last_batch_padded

    def __len__(self):
        # Lightning expects its dataloaders to have a length; DALI's iterator
        # doesn't, so derive the number of batches from the dataset size.
        if self.last_batch_padded:
            return math.ceil(self.dataset_size / self.batch_size)
        else:
            return self.dataset_size // self.batch_size
I also added the on_epoch_end method to my pl model:
def on_epoch_end(self):
    # The DALI iterator has to be reset manually after each full pass over the data
    self.train_dataloader().reset()
I have only tested it with 1 GPU (since I only have access to one), but the above changes fix some of the Nvidia DALI integration issues.
It is probably not the best approach, and I was wondering whether changes need to be made in pytorch-lightning, similar to the ones made for IterableDataset, so that the above class isn't required.
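To show how the pieces above could fit together, here is an untested sketch; build_train_pipe is a hypothetical factory for a built DALI pipeline, and caching the loader is just one way to make the reset easy to reach:

import pytorch_lightning as pl

class DaliModule(pl.LightningModule):
    def __init__(self, batch_size, dataset_size):
        super().__init__()
        self.batch_size = batch_size
        self.dataset_size = dataset_size

    def train_dataloader(self):
        pipe = build_train_pipe(batch_size=self.batch_size)  # hypothetical factory
        pipe.build()
        # The wrapper's __len__ lets Lightning compute batches per epoch
        self._train_loader = LightningDaliDataloader(pipe,
                                                     size=self.dataset_size,
                                                     batch_size=self.batch_size)
        return self._train_loader

    def on_epoch_end(self):
        # Reset the cached DALI iterator once the epoch has been consumed
        self._train_loader.reset()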
@ryanwongsa out of curiosity, did you see VRAM fluctuations between the GPU working on the DALI pipeline vs the one doing PyTorch training? Also good to know that single GPU is relatively straightforward, thanks for testing!
@s-rog For what I am working on with a DALI dataloader, I had to create a Python Operator, and currently the DALI library only supports CPU operations for it, so I can't really get a good comparison between the two, sorry. I would have to convert it to C++ and create a custom operator to get accurate comparisons, but that would take some time.
One thing I noticed that can't be supported with pytorch-lightning and DALI is setting train_percent_check to a value less than 1, since you cannot reset the DALIGenericIterator before it has gone through the whole dataset. But that seems to be a limitation of DALI rather than pytorch-lightning.
I'd love to make sure we support dali! I know a few people using it.
I haven't looked at this issue in-depth, but is there anything we need to do to support it or did we already get it for free?
If we do need to make changes, anyone want to submit a PR?
@luiscape