We use NFS to store our datasets. NFS is configured to cache files locally once they are accessed.
Nevertheless, the DALI pipeline can take up to 300 seconds to build.
What happens during the pipeline build that could explain this long build time?
After the build finishes, it runs fine and the dataset is successfully cached for subsequent epochs,
but that initial 300 seconds is a problem for us since we run a lot of small, quick experiments.
Hi,
I guess you are using FileReader. When the DALI pipeline is built, it lists all the files you have, which is slow over NFS. To speed things up you can use the file_list argument, which provides a list of all files and their labels. With it, DALI doesn't need to scan your filesystem to discover all the images you have there.
Br,
Janusz
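For illustration, a file list can be generated once and reused across experiments. The sketch below is not part of DALI. It assumes a class-per-subdirectory layout and the one "path label" pair per line format documented for file_list. The helper name write_file_list and the extension set are assumptions made for this example.

```python
import os

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp")


def write_file_list(root, out_path, extensions=IMAGE_EXTENSIONS):
    """Write one "relative/path label" line per image under root.

    Each immediate subdirectory of root is treated as one class, and
    labels are assigned in sorted order of the subdirectory names.
    Returns the number of entries written.
    """
    classes = sorted(e.name for e in os.scandir(root) if e.is_dir())
    count = 0
    with open(out_path, "w") as out:
        for label, cls in enumerate(classes):
            cls_dir = os.path.join(root, cls)
            for name in sorted(os.listdir(cls_dir)):
                # Filter by extension only; no per-file stat() is issued here.
                if name.lower().endswith(extensions):
                    out.write(f"{cls}/{name} {label}\n")
                    count += 1
    return count
```

The resulting file would then be passed as file_list, with the dataset directory as file_root. Paths that contain spaces would need extra care, since the label is separated from the path by a space.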
Could this listing be done asynchronously? This would allow the first batch to start while it continues to list the files.
When it lists all files, does it just do a listdir, or does it stat every file? If it stats every file, is it possible to remove that?
@nouiz - to answer your question: yes and no.
Yes, technically it is possible to rework the current iterator and reader to work asynchronously.
No, because you would get wrong results. Currently DALI reads the whole data set and shuffles it during setup. Thanks to that you won't end up getting batches with only label 0 data, then only label 1, etc.
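A toy illustration of the ordering problem, with made-up sample names and no DALI code involved: if batches are cut from a listing in directory order, the first batch holds a single label, whereas shuffling the complete list first mixes the labels.

```python
import random


def batches(items, batch_size):
    """Split items into consecutive chunks of batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical listing in directory order: all of class 0, then all of class 1.
samples = [(f"class0/img{i}.jpg", 0) for i in range(4)] + \
          [(f"class1/img{i}.jpg", 1) for i in range(4)]

# Streaming the listing as it arrives: the first batch only contains label 0.
ordered = batches(samples, 4)
labels_in_first_batch = {label for _, label in ordered[0]}

# Shuffling requires the full list up front, which is why the listing
# has to complete before the first batch can be produced.
rng = random.Random(0)
shuffled = samples[:]
rng.shuffle(shuffled)
shuffled_batches = batches(shuffled, 4)
```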
I think that PyTorch's torchvision.datasets.ImageFolder works the same way: it discovers classes and then traverses all the class directories looking for images - https://github.com/pytorch/vision/blob/master/torchvision/datasets/folder.py#L84.
Tracked as DALI-322.
So our recommendation is to use file_list. If it doesn't work for you, please reopen.
We tested file_list and it works. But I do not find this to be a satisfying solution. It requires more code from the user and makes DALI more complicated to use.
I looked at the code and I do not see a good reason for calling stat on each file. So here are 2 suggestions that would keep DALI fast and easy to use on NFS (which I think is a common case).
1) This is my preferred solution. It would work out of the box for all users. They won't need to learn more complicated options. I think we should just remove the stat() call in the function assemble_file_list() in the file dali/pipeline/operators/reader/loader/file_loader.cc
Currently the stat() result is only used to check whether a directory entry is a regular file. I do not think this is a valid reason, because we also check the file extension. If a file has the right extension, it should be a valid file. Otherwise there is something very strange in the user's setup, and that warrants an error that would appear later in the code. Skipping an entry that should be valid would under-report errors to the user. This could lead users to run invalid experiments.
2) If you believe the check should stay, what about adding an option to skip it? Like the option to list file names, but a bool. Something like skip_validation, that would default to False.
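To make the two suggestions concrete, here is a rough Python sketch of the behavior. It is not the DALI C++ code: the function below only borrows the name assemble_file_list for readability, and skip_validation is the proposed flag, not an existing option.

```python
import os
import stat


def assemble_file_list(directory, extensions, skip_validation=False):
    """List entries in directory whose names end with one of extensions.

    With skip_validation=False, every candidate is stat()-ed and kept only
    if it is a regular file (one extra syscall per entry, costly over NFS).
    With skip_validation=True, the extension check alone decides.
    """
    names = []
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(extensions):
            continue
        if not skip_validation:
            mode = os.stat(os.path.join(directory, name)).st_mode
            if not stat.S_ISREG(mode):
                continue  # e.g. a directory that happens to be named "x.jpg"
        names.append(name)
    return names
```

With skip_validation=True, an entry that has the right extension but is not a regular file stays in the list, so the problem would surface later as a read error instead of being skipped silently.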
I think the option to list file names should be for taking a subset of the dataset, not for speeding things up on NFS.
What do you think of that?
Hi,
Thank you for your thorough analysis. You are right, we may be able to avoid calling stat on every file. It looks like readdir could provide the same information without the cost of the additional syscall, at least for some filesystems. We will discuss how much it may affect users and whether it is going to work well in all use cases, including ones we may be missing in this discussion.
Br,
Janusz
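The readdir idea can be sketched in Python with os.scandir, which exposes the entry type that the directory read already returned. On Linux this corresponds to the d_type field of struct dirent. This is only an analogy to what the C++ loader might do, and the helper name list_regular_files is made up for the example.

```python
import os


def list_regular_files(directory, extensions):
    """Return sorted names of regular files in directory matching extensions.

    DirEntry.is_file() uses the type information cached from the directory
    read when it is available, so in most cases no extra system call is made
    per file. A stat() is still needed for symlinks, or when the filesystem
    reports an unknown entry type.
    """
    names = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Cheap extension check first, then the (usually cached) type check.
            if entry.name.lower().endswith(extensions) and entry.is_file():
                names.append(entry.name)
    return sorted(names)
```

This keeps the regular-file check while avoiding most of the per-file cost, which matches the "at least for some filesystems" caveat: filesystems that do not fill in the entry type would still fall back to a stat call.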
Thanks.