Vision: corrupt file handling in ImageFolder

Created on 24 Apr 2019  ·  10 Comments  ·  Source: pytorch/vision

It's frustrating to see an entire training run stop because of one corrupt file in the dataset folder. This should be an easy fix using PIL's Image.verify(), which attempts to determine whether the file is broken without actually decoding the image data. I did some checks to see whether this can be added,
but after reading folder.py I noticed that the method is_image_file is never invoked by ImageFolder. If this can be changed, then making this update is easy.
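
As a rough illustration of the check proposed above, the sketch below wraps PIL's verify() in a small helper. The name is_image_intact is hypothetical and not part of torchvision. Note that verify() is called on an opened image rather than on a path, and that it signals a problem by raising an exception instead of returning a value.

```python
from PIL import Image


def is_image_intact(path):
    """Return True if PIL can open the file and verify() does not raise.

    Hypothetical helper for illustration; not part of torchvision.
    """
    try:
        # verify() checks the file's integrity without decoding pixel data.
        with Image.open(path) as img:
            img.verify()
        return True
    except Exception:
        # Missing, unreadable, truncated, or non-image files all end up here.
        return False
```

After verify() the image object should not be reused; a file that passes the check has to be opened again before loading its pixels.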

All 10 comments

@dhananjayraut, I see that this line

https://github.com/pytorch/vision/blob/ccbb3221b7f0637f1706df29d2c2995e9d5171bf/torchvision/datasets/folder.py#L86

redirects to

https://github.com/pytorch/vision/blob/ccbb3221b7f0637f1706df29d2c2995e9d5171bf/torchvision/datasets/folder.py#L45

where has_file_allowed_extension(fname, extensions) is called, which works similarly to is_image_file.

Adding the verify condition here could be a simple fix.

@surgan12 This code belongs to the parent class DatasetFolder, which is more general than ImageFolder. Adding verify there would not be correct, as this class can be used in other places.

@dhananjayraut how much slowdown would adding such a check to main torchvision bring? Parsing the file tree is already fairly slow on some filesystems.

Another solution is to inherit from ImageFolder, and perform an extra pre-processing step that removes invalid images.

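class MyDataset(ImageFolder):
    def __init__(self, *args, **kwargs):
        super(MyDataset, self).__init__(*args, **kwargs)
        # is_valid(path): user-written helper that returns False when Image.open(path).verify() raises
        valid_samples = [s for s in self.imgs if is_valid(s[0])]
        self.imgs = valid_samples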

or something like that

I know time can be a constraint. A quick script shows an average time of 3.08e-05 per image with the check versus the normal 3.68e-07. This will be a one-time cost when we create a Dataset. The problem with your solution is that we need to modify self.samples, a list of (path, class_to_idx[target]) tuples, and update a bunch of other things like self.targets. This basically means the user has to implement ImageFolder on their own.

@dhananjayraut the user only needs to implement the aforementioned 4 lines and use that class instead.

I would be willing to accept a PR that adds greater flexibility in DatasetFolder to specify whether a file is valid or not.

This would probably mean letting the user pass a callable to DatasetFolder which returns True if the file is valid.

So the user would be able to customize the loading of the files as they want.

Thoughts?
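
A rough sketch of what such a hook could look like, written against the standard library only. collect_samples is a hypothetical stand-in for the file-gathering step and is not torchvision's actual code. It assumes one subdirectory per class under the root and calls the user's callable once per file.

```python
import os


def collect_samples(root, is_valid_file):
    """Gather (path, class_index) pairs for files accepted by is_valid_file.

    Hypothetical illustration: each direct subdirectory of root is a class.
    """
    classes = sorted(
        entry for entry in os.listdir(root)
        if os.path.isdir(os.path.join(root, entry))
    )
    class_to_idx = {name: idx for idx, name in enumerate(classes)}

    samples = []
    for name in classes:
        class_dir = os.path.join(root, name)
        for dirpath, _, filenames in sorted(os.walk(class_dir)):
            for fname in sorted(filenames):
                path = os.path.join(dirpath, fname)
                # The caller decides what counts as a valid file.
                if is_valid_file(path):
                    samples.append((path, class_to_idx[name]))
    return samples, class_to_idx
```

A caller could pass something as simple as an extension test, or a function that opens each file and checks it for corruption.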

@fmassa That would be a nice addition, as an ImageFolder class needs to do more than just check extensions. I would like to work on that. Any pointers about the design would be great to discuss before actually writing code for the PR.

Some points:

Which one would be better:

  • keep extensions compulsory and add an optional is_valid_file, which will only be called for files with a valid extension, or
  • make extensions optional and require one of is_valid_file or extensions.

The first one is more natural and does not make the user handle extensions if they decide to pass the optional callable.
The second is more standard for a library.
Both will be backward compatible, of course.

I think we should make both optional, requiring that exactly one of them is passed and non-None.

This is how it's done for batch_sampler / sampler / batch_size and drop_last in DataLoader
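
The "exactly one of the two" rule can be sketched as a small argument check. resolve_file_filter is a hypothetical helper; the error message and the choice of ValueError are assumptions, not torchvision's actual behaviour.

```python
def resolve_file_filter(extensions=None, is_valid_file=None):
    """Return a path -> bool predicate from exactly one of the two arguments.

    Hypothetical illustration of the proposed argument validation.
    """
    both_none = extensions is None and is_valid_file is None
    both_given = extensions is not None and is_valid_file is not None
    if both_none or both_given:
        raise ValueError(
            "Exactly one of extensions and is_valid_file must be passed."
        )

    if is_valid_file is not None:
        # The user-supplied callable is used as is.
        return is_valid_file

    # Otherwise fall back to a case-insensitive extension test.
    allowed = tuple(ext.lower() for ext in extensions)

    def has_allowed_extension(path):
        return path.lower().endswith(allowed)

    return has_allowed_extension
```

Existing callers that pass only extensions keep working under this rule, which is what makes the change backward compatible.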

@fmassa the official documentation for DatasetFolder (link) should say that is_valid_file is a callable. I don't know why it is not there (the Python file has it). Also, please remove the word "Image"; it's not correct.
