It's frustrating to see an entire training run stop because of one corrupt file in the dataset folder. This should be an easy fix using PIL's Image.verify(). This method attempts to determine if the file is broken, without actually decoding the image data. I did some checks to see if this can be added,
but after reading the file folder.py I noticed that the method is_image_file is never invoked by ImageFolder. If this can be changed, then making this update is easy.
@dhananjayraut, I see here
redirects to
where has_file_allowed_extension(fname, extensions) is called, which works similarly to is_image_file.
Adding the verify condition here could be a simple fix.
@surgan12 This code is for the parent class DatasetFolder, which is a more general class than ImageFolder. Adding verify there would not be correct, as this class can be used in other places.
@dhananjayraut how much slowdown would adding such a check to main torchvision bring? Parsing the file tree is already fairly slow on some filesystems.
Another solution is to inherit from ImageFolder, and perform an extra pre-processing step that removes invalid images.
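    class MyDataset(ImageFolder):
        def __init__(self, *args, **kwargs):
            super(MyDataset, self).__init__(*args, **kwargs)
            # Image.verify() is an instance method that raises on a broken file, so
            # is_valid(path) stands for a helper that wraps Image.open(path).verify()
            # in try/except and returns True or False
            valid_samples = [s for s in self.imgs if is_valid(s[0])]
            self.imgs = valid_samples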
or something like that
I know time can be a constraint. A quick script shows an average time of 3.08e-05 per image with the check, versus the normal 3.68e-07. This will be a one-time cost when we are creating a Dataset. The problem with your solution is that we need to modify self.samples with a list of tuples (path, class_to_idx[target]) and update a bunch of things like self.targets etc. This basically means the user has to implement ImageFolder on their own.
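A minimal sketch of the filtering step discussed above, assuming samples is a list of (path, class_index) tuples as in ImageFolder. The names pil_can_verify and filter_samples are hypothetical and not part of torchvision. Pillow is imported lazily, so the filtering logic can also be used with any other checker.

```python
def pil_can_verify(path):
    """Return True if PIL can open and verify the file at `path`."""
    from PIL import Image  # lazy import; requires Pillow to be installed
    try:
        with Image.open(path) as img:
            img.verify()  # integrity check without decoding the pixel data
        return True
    except Exception:
        return False


def filter_samples(samples, is_valid=pil_can_verify):
    """Keep valid (path, class_index) pairs and rebuild the targets list."""
    valid = [(path, target) for path, target in samples if is_valid(path)]
    targets = [target for _, target in valid]
    return valid, targets
```

In an ImageFolder subclass, the two returned lists could be assigned to self.samples (and self.imgs) and self.targets after calling the parent constructor, which keeps them consistent with each other.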
@dhananjayraut the user only needs to implement the aforementioned 4 lines, and use it instead.
I would be willing to accept a PR that adds greater flexibility in DatasetFolder to specify whether a file is valid or not.
This would probably mean letting the user pass a callable to DatasetFolder which returns True if the file is valid.
So the user would be able to customize the loading of the files however they want.
Thoughts?
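A rough sketch of that idea, independent of torchvision: the directory scan takes a user-supplied callable and keeps only the paths for which it returns True. The name scan_files is hypothetical and used only for illustration.

```python
import os


def scan_files(root, is_valid_file):
    """Return paths under `root` for which is_valid_file(path) is True."""
    kept = []
    # Sort directories and file names so the result order is deterministic.
    for dirpath, _, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if is_valid_file(path):
                kept.append(path)
    return kept
```

The callable could check the extension, try to open the file, or both, so the scanning code does not need to know anything about images.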
@fmassa That would be a nice addition, as an ImageFolder class needs to do more than just check the extension. I would like to work on that. Any points about the design would be great to discuss before actually writing
code for the PR.
Some points:
extensions become an optional argument in https://github.com/pytorch/vision/blob/ccbb3221b7f0637f1706df29d2c2995e9d5171bf/torchvision/datasets/folder.py#L81
is_valid_file or something like that in the constructor
extensions or is_valid_file is passed
which one will be better
The first one will be more natural, plus it will not make the user handle extensions if they decide to pass the optional callable.
The second will be more standard for a library.
Both will be backward compatible, of course.
I think we should make both optional, requiring that one and only one is passed and non-None
This is how it's done for batch_sampler / sampler / batch_size and drop_last in DataLoader
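A minimal sketch of that "one and only one" rule, assuming the two arguments are named extensions and is_valid_file as discussed above. The helper name resolve_file_filter is hypothetical, and this is an illustration of the check, not the code that ended up in torchvision.

```python
def resolve_file_filter(extensions=None, is_valid_file=None):
    """Return a path -> bool callable; exactly one argument must be given."""
    # Both None or both set are rejected, so exactly one is non-None below.
    if (extensions is None) == (is_valid_file is None):
        raise ValueError(
            "Exactly one of extensions and is_valid_file must be passed"
        )
    if is_valid_file is not None:
        return is_valid_file
    # Fall back to a case-insensitive extension check.
    exts = tuple(ext.lower() for ext in extensions)
    return lambda path: path.lower().endswith(exts)
```

Existing callers that pass only extensions keep working, while new callers can pass only a callable and never deal with extensions.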
@fmassa the official documentation for dataset folder (link) should say that is_valid_file is a callable. I don't know why it is not there (the Python file has it). Also, please remove the word Image; it's not correct.