Datasets: Read CelebA pictures from archive directly.

Created on 3 Mar 2019  ·  11 Comments  ·  Source: tensorflow/datasets

The CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations, including:

  • 10,177 identities,
  • 202,599 face images, and
  • 5 landmark locations and 40 binary attribute annotations per image.

Currently, a ZIP file is extracted and these ~200k images are read from the zipped file. This is extremely slow on CNS (~3 hours), and can lead to timeouts.

This enhancement would allow TFDS to read from the archive directly. With this approach, processing would likely take only a few minutes and be less flaky.
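As a rough illustration of what "reading from the archive directly" means, here is a minimal sketch using only Python's standard zipfile module. It is not the TFDS implementation; the helper name iter_zip_members is hypothetical. It streams each member's bytes out of the zip without first unpacking the ~200k files to disk.

```python
import zipfile


def iter_zip_members(zip_path):
    """Yield (member_name, raw_bytes) for each file in a zip, in archive order.

    Hypothetical helper for illustration only. `zip_path` may be a filesystem
    path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only real files
            with archive.open(info) as member:
                yield info.filename, member.read()
```

Each image is decompressed on demand, so nothing is written to an extraction directory along the way.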

enhancement

All 11 comments

@dynamicwebpaige I am interested in working on this. How should I get started?
Thanks for the help.

Thanks for taking care of this. TFDS already has a feature to read directly from archives: dl_manager.iter_archive. The idea here would be to update CelebA to use this feature.
Have a look at https://github.com/tensorflow/datasets/blob/6d075f775bd9d415830b8a8bb5c6b71c38e005fd/tensorflow_datasets/image/horses_or_humans.py for an example.

Okay @Conchylicultor, I will go through it and make the changes accordingly. Thanks :)

@dynamicwebpaige @Conchylicultor The zip is hosted on Google Drive, and I think we have to download it before we can extract the images, since the zip is very large. We can't reference the contents of a zip stored on Drive.
In the horses_or_humans dataset, the data is stored on Google Cloud, which can be referenced directly for extraction.
Am I missing something? Please correct me if I am wrong anywhere.

One alternative is that I could ask the data owners to store the zip on the cloud, so that we can extract the images directly from there, which would help avoid the timeouts.

You're right, you first have to download the zip using dl_manager.download(), then read from the zip with dl_manager.iter_archive.
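The two-step flow described above could be sketched roughly as follows. This is a minimal sketch, not the actual CelebA builder: the function names, the ".jpg" filter, and the example fields are assumptions, and the functions are written as plain functions taking a dl_manager argument rather than as methods of a dataset builder. The only TFDS calls assumed are dl_manager.download() and dl_manager.iter_archive(), the latter yielding (path inside archive, file object) pairs.

```python
def split_generators(dl_manager, url):
    """Download the zip once and return a lazy iterator over its members.

    Plain-function stand-in for a builder's _split_generators; `url` is
    whatever location the archive is hosted at.
    """
    archive_path = dl_manager.download(url)       # download only, no extraction
    return dl_manager.iter_archive(archive_path)  # yields (path, file_obj) pairs


def generate_examples(archive):
    """Stand-in for _generate_examples: build examples straight from the stream."""
    for path, fobj in archive:
        if not path.endswith(".jpg"):
            continue  # ignore anything that is not an image (assumed layout)
        # Read the bytes while the member is still open in the archive stream.
        yield {"image": fobj.read()}
```

In a real builder the iterator would typically be passed from _split_generators to _generate_examples through the split's generator arguments; the exact return types depend on the TFDS version.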

Okay, got it. :+1:
Thanks for the quick reply.

@dynamicwebpaige @Conchylicultor I have made the changes in the _split_generators function, but I am still having trouble making the changes in the _generate_examples function. I have opened Pull Request #152 for this. Can you please review it and let me know what further changes I need to make?
Thanks

It would also be useful to store the original filename as one of the features,
so we can perform joins with the kaggle version of this dataset,
which is used in various examples online:
https://www.kaggle.com/jessicali9530/celeba-dataset

Kevin
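To make the suggestion concrete, here is a minimal sketch of carrying the original filename along with each example and then joining on it. The function names and the "filename" field are hypothetical, and the "image_id" key is an assumption about how the Kaggle CSV files identify each image; adjust it to the actual column name.

```python
import posixpath


def with_filename(archive):
    """Attach the member's base filename to each example.

    `archive` is any iterable of (path_in_archive, file_obj) pairs.
    """
    for path, fobj in archive:
        yield {
            "image": fobj.read(),
            # Archive paths use "/" separators, so posixpath is used here.
            "filename": posixpath.basename(path),
        }


def join_on_filename(examples, attr_rows, key="image_id"):
    """Inner-join examples with attribute rows whose `key` equals the filename.

    `key="image_id"` is an assumed column name, not confirmed by the thread.
    """
    rows_by_name = {row[key]: row for row in attr_rows}
    for example in examples:
        row = rows_by_name.get(example["filename"])
        if row is not None:
            yield {**example, **row}
```

Without the filename stored as a feature, there is no stable key to match a generated example back to a row in the external attribute tables.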

Hi @Conchylicultor @cyfra,
What if we compress the txt files into a zip file and use Drive links to those zip files for downloading?
Will iter_archive work, since extracted_dirs will contain only zip files?
Please reply, as I'd love to fix this issue! Thanks for the help!

@dynamicwebpaige @cyfra I am interested in working on this. I am new to open source; can you tell me how I should get started?
Thanks for the help.

@Conchylicultor @cyfra Please review #1706
Thank you.
