The CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations.
Currently, the ZIP file is extracted and the ~200k images are read from the extracted archive. This is extremely slow on CNS (~3 hours) and can lead to timeouts.
The enhancement would allow TFDS to read from the archive directly. With this approach, processing would likely take only a few minutes and be less flaky.
@dynamicwebpaige I am interested in working on this. How should I get started?
Thanks for the help.
Thanks for taking care of this. TFDS already has a feature for reading directly from archives: dl_manager.iter_archive. The idea here would be to update CelebA to use this feature.
Have a look at https://github.com/tensorflow/datasets/blob/6d075f775bd9d415830b8a8bb5c6b71c38e005fd/tensorflow_datasets/image/horses_or_humans.py for an example.
Okay @Conchylicultor, I will go through it and make the changes accordingly. Thanks :)
@dynamicwebpaige @Conchylicultor The zip is stored on Google Drive, and since it is very large, I think we have to download it before we can extract the images. We can't reference the contents of a zip that is stored on Drive.
In the horses_or_humans dataset, the data is stored on Google Cloud, where it can be referenced directly for extraction.
Am I missing something? Please correct me if I am wrong anywhere.
One alternative is that I could ask the data owners to store the zip on the cloud, so that we can extract images directly from there, which would help remove the timeouts.
You're right, you first have to download the zip using dl_manager.download(), then read from the zip with dl_manager.iter_archive.
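A minimal sketch of the read-without-extracting step, using only Python's standard zipfile module as a stand-in for dl_manager.iter_archive, which yields (path, file object) pairs. The helper names iter_zip_members and build_demo_archive, the member paths, and the in-memory demo archive are hypothetical and only for illustration; the actual CelebA builder would call dl_manager.download() and dl_manager.iter_archive() instead.

```python
import io
import zipfile


def iter_zip_members(archive, suffix=".jpg"):
    """Yield (member name, raw bytes) pairs without extracting to disk.

    Hypothetical stand-in for dl_manager.iter_archive: members are read
    one at a time straight from the archive.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # Skip directory entries and anything that is not an image.
            if info.is_dir() or not info.filename.endswith(suffix):
                continue
            with zf.open(info) as fobj:
                yield info.filename, fobj.read()


def build_demo_archive():
    """Build a tiny in-memory zip; the folder layout is an assumption."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("img_align_celeba/000001.jpg", b"fake-jpeg-1")
        zf.writestr("img_align_celeba/000002.jpg", b"fake-jpeg-2")
        zf.writestr("img_align_celeba/README.txt", b"not an image")
    buf.seek(0)
    return buf


# Iterate the demo archive; only the two .jpg members are yielded.
demo_members = list(iter_zip_members(build_demo_archive()))
```

Because members are streamed one at a time, nothing is written to disk, which is likely where most of the time goes when ~200k small files are extracted first.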
Okay, got it. :+1:
Thanks for the quick reply.
@dynamicwebpaige @Conchylicultor I have made the changes in the _split_generators function, but I am still having trouble making the changes in the _generate_examples function. I have opened Pull Request #152 for this. Could you please review it and let me know what further changes I need to make?
Thanks
It would also be useful to store the original filename as one of the features,
so we can perform joins with the Kaggle version of this dataset,
which is used in various examples online:
https://www.kaggle.com/jessicali9530/celeba-dataset
Kevin
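To illustrate the kind of join this would enable, here is a rough sketch using only the standard csv module. Everything in it is hypothetical: the "filename" feature name, the "image_id" and "Smiling" column names, and the sample rows are assumptions for illustration, not the actual TFDS feature spec or the actual Kaggle file contents.

```python
import csv
import io

# Hypothetical stand-in for an attribute CSV keyed by image filename.
# The column names "image_id" and "Smiling" are assumptions.
KAGGLE_STYLE_CSV = """image_id,Smiling
000001.jpg,1
000002.jpg,-1
"""

# Hypothetical examples that carry the original filename as a feature.
examples = [
    {"filename": "000002.jpg", "image": b"fake-jpeg-2"},
    {"filename": "000001.jpg", "image": b"fake-jpeg-1"},
]


def join_on_filename(examples, csv_text, key_column="image_id"):
    """Attach the matching CSV row to each example via its filename."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = {row[key_column]: row for row in reader}
    # Keep only examples that have a matching row in the CSV.
    return [
        {**ex, "attributes": rows[ex["filename"]]}
        for ex in examples
        if ex["filename"] in rows
    ]


joined = join_on_filename(examples, KAGGLE_STYLE_CSV)
```

Without the filename stored as a feature, there would be no stable key to match an example against rows from another copy of the dataset.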
Hi @Conchylicultor @cyfra,
What if we compress the txt files into a zip file and use Drive links to those zip files for downloading?
Will iter_archive work, since extracted_dirs will contain only zip files?
Please reply, as I'd love to fix this issue! Thanks for the help!
@dynamicwebpaige @cyfra I am interested in working on this. I am new to open source; can you tell me how I should get started?
Thanks for the help.
@Conchylicultor @cyfra Please review #1706
Thank you.