Vision: corrupt file handling in ImageFolder

Created on 24 Apr 2019 · 10 Comments · Source: pytorch/vision

It's frustrating to see an entire training run stop because of one corrupt file in the dataset folder. This should be an easy fix using PIL's Image.verify(), a method that attempts to determine whether the file is broken without actually decoding the image data. I checked whether this could be added, but after reading folder.py I noticed that the method is_image_file is never invoked by ImageFolder. If that can be changed, making this update is easy.
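A minimal sketch of the kind of check described above, assuming Pillow is installed. The helper name is_intact_image is hypothetical and is not part of torchvision or PIL. Note that verify() is called on an opened image, raises an exception on a broken file, and returns None otherwise, so "valid" here simply means "nothing was raised".

```python
from PIL import Image


def is_intact_image(path):
    """Return True if PIL can open the file and verify() raises nothing.

    Hypothetical helper for illustration; not a torchvision or PIL API.
    """
    try:
        with Image.open(path) as img:
            # verify() is an instance method. It raises on a broken file
            # and returns None otherwise, so success means "no exception".
            img.verify()
        return True
    except Exception:
        # The exception type depends on the file format and on how the
        # file is damaged, so this sketch catches broadly.
        return False
```

Because verify() does not decode the pixel data, a file that passes this check could in principle still fail later when it is fully loaded.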

dhananjayraut

All 10 comments

@dhananjayraut, I see that

https://github.com/pytorch/vision/blob/ccbb3221b7f0637f1706df29d2c2995e9d5171bf/torchvision/datasets/folder.py#L86

redirects to

https://github.com/pytorch/vision/blob/ccbb3221b7f0637f1706df29d2c2995e9d5171bf/torchvision/datasets/folder.py#L45

where has_file_allowed_extension(fname, extensions) is called, which works similarly to is_image_file.

Adding the verify condition here could be a simple fix.

surgan12 on 24 Apr 2019

😕2

@surgan12 That code belongs to the parent class DatasetFolder, which is more general than ImageFolder. Adding verify there would not be correct, because that class can also be used in other contexts.

dhananjayraut on 24 Apr 2019

@dhananjayraut how much slowdown would adding such a check to main torchvision bring? Parsing the file tree is already fairly slow on some filesystems.

Another solution is to inherit from ImageFolder, and perform an extra pre-processing step that removes invalid images.

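from torchvision.datasets import ImageFolder

class MyDataset(ImageFolder):
    def __init__(self, *args, **kwargs):
        super(MyDataset, self).__init__(*args, **kwargs)
        # is_valid(path) is a user-written check (hypothetical name), e.g. call
        # Image.open(path).verify() and return False if it raises; verify() returns None
        valid_samples = [s for s in self.imgs if is_valid(s[0])]
        self.imgs = valid_samples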

or something like that

fmassa on 24 Apr 2019

I know time can be a constraint. A quick script shows an average time of 3.08e-05 per image with the check, versus 3.68e-07 normally. This is a one-time cost when the Dataset is created. The problem with your solution is that we need to rebuild self.samples as a list of (path, class_to_idx[target]) tuples and update a bunch of things like self.targets, which basically means the user has to reimplement ImageFolder on their own.
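To illustrate the bookkeeping concern raised here, the following is a rough sketch of filtering a list of (path, class_index) tuples while keeping a matching targets list in sync. The function name filter_samples and the toy paths are made up for illustration; a real subclass would apply the result to whatever attributes the dataset class actually exposes (the thread mentions self.samples, self.imgs and self.targets).

```python
def filter_samples(samples, is_valid):
    """Drop invalid files from a list of (path, class_index) tuples.

    Returns the filtered samples and the matching list of targets.
    """
    kept = [(path, target) for path, target in samples if is_valid(path)]
    targets = [target for _, target in kept]
    return kept, targets


# Toy data standing in for a dataset's sample list; the paths are made up.
samples = [("cat/1.jpg", 0), ("cat/broken.jpg", 0), ("dog/2.jpg", 1)]
kept, targets = filter_samples(samples, lambda p: "broken" not in p)
```

In a subclass, the two returned lists would be assigned back after the parent constructor has run, so that every attribute derived from the sample list stays consistent.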

dhananjayraut on 24 Apr 2019

@dhananjayraut the user only needs to implement the aforementioned 4 lines and use that class instead.

I would be willing to accept a PR that adds greater flexibility in DatasetFolder to specify whether a file is valid or not.

This would probably mean letting the user pass a callable to DatasetFolder which returns True if the file is valid.

So the user would be able to customize the loading of the files as they want.

Thoughts?

fmassa on 24 Apr 2019

👍1

@fmassa That would be a nice addition, as an ImageFolder class needs to do more than just check the extension. I would like to work on that. Any points about the design would be great to discuss before actually writing code for the PR.

dhananjayraut on 24 Apr 2019

Some points:

extensions becomes an optional argument in https://github.com/pytorch/vision/blob/ccbb3221b7f0637f1706df29d2c2995e9d5171bf/torchvision/datasets/folder.py#L81
add an is_valid_file argument (or something like that) to the constructor
assert that either extensions or is_valid_file is passed
replace https://github.com/pytorch/vision/blob/ccbb3221b7f0637f1706df29d2c2995e9d5171bf/torchvision/datasets/folder.py#L45 with the new function, making sure to keep backwards compatibility
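The points above could be sketched roughly as follows. This is an illustration under assumptions, not the code in folder.py: the names make_dataset and has_allowed_extension and their signatures are chosen here for the example. The idea is that the directory walk only ever calls a predicate, and the old extension check is expressed as one such predicate to keep backwards compatibility.

```python
import os


def make_dataset(root, class_to_idx, is_valid_file):
    """Collect (path, class_index) pairs for files accepted by is_valid_file."""
    samples = []
    for class_name in sorted(class_to_idx):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            continue
        for dirpath, _, filenames in sorted(os.walk(class_dir)):
            for fname in sorted(filenames):
                path = os.path.join(dirpath, fname)
                # The walk knows nothing about images; it only asks the predicate.
                if is_valid_file(path):
                    samples.append((path, class_to_idx[class_name]))
    return samples


def has_allowed_extension(extensions):
    """Build a predicate that reproduces an extension-only check."""
    lowered = tuple(ext.lower() for ext in extensions)
    return lambda path: path.lower().endswith(lowered)
```

A caller who only cares about extensions would pass has_allowed_extension([".png", ".jpg"]); a caller who wants to reject corrupt files would pass their own callable instead.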

fmassa on 24 Apr 2019

Which one would be better:

keep extensions compulsory and add an optional is_valid_file, which will only be called for files with a valid extension, or
make extensions optional and require one of is_valid_file or extensions.

The first is more natural, and the user does not have to handle extensions if they decide to pass the optional callable.
The second is more standard for a library.
Both are backward compatible, of course.

dhananjayraut on 24 Apr 2019

I think we should make both optional, requiring that one and only one is passed and non-None

This is how it's done for batch_sampler / sampler / batch_size and drop_last in DataLoader
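A small sketch of the "one and only one" rule, assuming both arguments default to None. The function name resolve_is_valid_file and the error message are illustrative, not taken from torchvision.

```python
def resolve_is_valid_file(extensions=None, is_valid_file=None):
    """Return a path predicate, requiring exactly one of the two arguments."""
    both_none = extensions is None and is_valid_file is None
    both_set = extensions is not None and is_valid_file is not None
    if both_none or both_set:
        raise ValueError("pass exactly one of extensions or is_valid_file")
    if is_valid_file is not None:
        return is_valid_file
    # Fall back to the extension check when only extensions is given.
    allowed = tuple(ext.lower() for ext in extensions)
    return lambda path: path.lower().endswith(allowed)
```

Rejecting the case where both are given avoids having to define how the two checks would combine.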

fmassa on 24 Apr 2019

👍1

@fmassa the official documentation for DatasetFolder (link) should say that is_valid_file is a callable. I don't know why it is not there (the Python file has it). Also, please remove the word "Image"; it's not correct.

dhananjayraut on 1 May 2019
