Datasets: Read CelebA pictures from archive directly.

Created on 3 Mar 2019 · 11 Comments · Source: tensorflow/datasets

The CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations, including:

10,177 identities,
202,599 face images, and
5 landmark locations and 40 binary attribute annotations per image.

Currently, the ZIP file is extracted and the ~200k images are then read from the extracted files. This is extremely slow on CNS (~3 hours) and can lead to timeouts.

This enhancement would allow TFDS to read from the archive directly. With this approach, processing would likely take only a few minutes and be less flaky.

enhancement

dynamicwebpaige

Most helpful comment

It would also be useful to store the original filename as one of the features,
so we can perform joins with the kaggle version of this dataset,
which is used in various examples online:
https://www.kaggle.com/jessicali9530/celeba-dataset
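
A minimal sketch of the kind of join this would enable, assuming each example carries a `filename` feature. The CSV layout, the `image_id` column name, and the helper names `load_attributes` and `join_on_filename` are assumptions made for illustration, not taken from TFDS or from the thread:

```python
import csv
import io
import os

# Hypothetical rows in the style of an attributes CSV keyed by image filename.
# The column names here are assumptions for illustration only.
ATTRIBUTES_CSV = """image_id,Smiling,Eyeglasses
000001.jpg,1,-1
000002.jpg,-1,1
"""


def load_attributes(csv_text):
    """Index attribute rows by filename so examples can be joined on it."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["image_id"]: row for row in reader}


def join_on_filename(examples, attributes):
    """Attach the matching attribute row to each example that has one."""
    joined = []
    for example in examples:
        # Compare on the bare filename, ignoring any directory prefix.
        key = os.path.basename(example["filename"])
        if key in attributes:
            joined.append({**example, "attributes": attributes[key]})
    return joined
```

Without the original filename stored on each example, there is no shared key to match on, which is why the feature matters for this use case.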

Kevin

murphyk on 5 Sep 2019

👍2

All 11 comments

@dynamicwebpaige I am interested in working on this. How should I get started?
Thanks for the help.

ParthS007 on 3 Mar 2019

Thanks for taking care of this. TFDS already has a feature for reading directly from archives: dl_manager.iter_archive. The idea here would be to update CelebA to use this feature.
For an example, have a look at https://github.com/tensorflow/datasets/blob/6d075f775bd9d415830b8a8bb5c6b71c38e005fd/tensorflow_datasets/image/horses_or_humans.py

Conchylicultor on 3 Mar 2019

👍1

Okay @Conchylicultor, I will go through it and make the changes accordingly. Thanks :)

ParthS007 on 4 Mar 2019

@dynamicwebpaige @Conchylicultor The zip lives on Google Drive, and I think we have to download it before we can extract the images, since the zip is very large. We can't reference the contents of a zip that is stored on Drive.
In the horses_or_humans dataset, the data is stored on Google Cloud, where it can be referenced directly for extraction.
Am I missing something? Please correct me if I am wrong anywhere.

One alternative is that I could ask the data owners to store the zip on the cloud, so that we can extract images directly from there, which would help avoid the timeouts.

ParthS007 on 4 Mar 2019

You're right, you first have to download the zip using dl_manager.download(), then read from the zip with dl_manager.iter_archive.
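
As a rough illustration of the second step (reading members out of an already-downloaded zip without unpacking it to disk), here is a sketch that uses only Python's standard `zipfile` module. It is not the TFDS implementation of `dl_manager.iter_archive`; the helper names `iter_zip_members`, `build_demo_archive`, and `read_images`, and the member paths in the demo archive, are hypothetical:

```python
import io
import zipfile


def iter_zip_members(archive_path):
    """Yield (member_name, file_object) pairs straight from a zip archive.

    Hypothetical stand-in for the idea behind dl_manager.iter_archive:
    nothing is extracted to disk; each member is opened on demand.
    """
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            with archive.open(info) as member:
                yield info.filename, member


def build_demo_archive():
    """Create a tiny in-memory zip that stands in for the downloaded file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("img_align_celeba/000001.jpg", b"fake-jpeg-bytes-1")
        archive.writestr("img_align_celeba/000002.jpg", b"fake-jpeg-bytes-2")
    buffer.seek(0)
    return buffer


def read_images(archive_path):
    """Collect {filename: raw bytes} by reading members directly from the zip."""
    return {name: member.read() for name, member in iter_zip_members(archive_path)}
```

Calling `read_images(build_demo_archive())` returns a dict keyed by member path. In a dataset builder, the same loop shape would yield one example per member instead of collecting everything in memory.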

Conchylicultor on 4 Mar 2019

Okay, got it. 👍
Thanks for the quick reply.

ParthS007 on 4 Mar 2019

@dynamicwebpaige @Conchylicultor I have made the changes in the _split_generators function, but I am still having trouble making the changes in the _generate_examples function. I have opened Pull Request #152 for this. Could you please review it and let me know what further changes I need to make?
Thanks

ParthS007 on 5 Mar 2019

Hi @Conchylicultor @cyfra,
What if we compress the txt files into a zip file and use Drive links to those zip files for downloading?
Will iter_archive work, since extracted_dirs will then contain only zip files?
Please reply, as I'd love to fix this issue! Thanks for the help!