Currently, just by looking at the list of datasets available in TFDS, there is no way to know the size of each dataset prior to downloading. Users may be operating under constrained disk space, and should be informed of the size of the dataset before requesting.
This feature enhancement would detail the download size of each dataset on the markdown file referenced above.
I can work on this issue. Can you please assign it to me?
Also, please tell me what to do. Should I add a new column to the md file?
@dynamicwebpaige I also want to work on this Issue.
@ChanchalKumarMaji Can we collaborate on this one?
@ParthS007 let's collaborate.
Hi,
I have never contributed to TensorFlow before and I would like to start.
Can I work on this issue?
Some pointers on this:
This should be a trivial change (probably fewer than 10 lines of code) to https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/scripts/document_datasets.py
It would just expose the builder.info.size_in_bytes field in the generated markdown.
The download size is automatically computed when downloading the dataset.
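As a rough illustration of the pointer above, here is a minimal sketch of how a size column could be rendered in markdown. This is not the actual code in document_datasets.py: the helper name render_size_table, the table layout, and the dataset names and byte counts are all hypothetical placeholders. In the real script, the numbers would presumably come from builder.info.size_in_bytes.

```python
def render_size_table(sizes, fmt=lambda n: "{:,} bytes".format(n)):
    """Render a markdown table of dataset names and download sizes.

    `sizes` maps a dataset name to its size in bytes. `fmt` turns a byte
    count into display text; any formatter can be swapped in.
    (Hypothetical helper, not part of TFDS.)
    """
    lines = ["| Dataset | Download size |", "| --- | --- |"]
    for name in sorted(sizes):
        lines.append("| {} | {} |".format(name, fmt(sizes[name])))
    return "\n".join(lines)


# Placeholder names and sizes, for illustration only.
example_sizes = {"dataset_b": 2048000, "dataset_a": 1024}
print(render_size_table(example_sizes))
```

Keeping the formatter as a parameter means the table code does not need to change if a human-readable formatter is preferred over raw byte counts.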
I would also like to contribute to this issue, warning users by displaying the size of the dataset they want to download. Let's collaborate!
@Anupam-tripathi Please note that the download size is already displayed when downloading a dataset: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/dataset_builder.py#L387
Without downloading, the user should already be able to get the info with:
import tensorflow_datasets as tfds
builder = tfds.builder('imagenet2012')
print(builder.info.size_in_bytes)
This issue is mostly to expose this information in the webpage doc.
@Conchylicultor,
I have generated new dataset documentation here. For some datasets I get a size of 0, and I cannot figure out why.
For example, when I run the following code in Colab
builder = tfds.builder('imagenet2012')
print(builder.info.size_in_bytes)
it prints 0.
Should I open a new pull request so that you can see the changes I made in document_datasets.py?
Oh, sorry about this. Yes, ImageNet was a bad example because it is not automatically downloaded (due to the ImageNet licence, it has to be manually downloaded by the user).
Otherwise, it is possible that the most recent datasets do not have size information yet. We pre-compute size_in_bytes internally at Google, and when a user calls tfds.load, it downloads the dataset information from the internet (size, statistics about the dataset, ...). So after a new dataset is added or updated, it may take some time before the info becomes available.
Yes, please generate a pull request with your changes.
Also note that there is tfds.units.size_str for human-readable formatting:
>>> tfds.units.size_str(12312312)
'11.74 MiB'
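To show where a figure like '11.74 MiB' comes from, here is a small self-contained sketch of binary-unit formatting. It is an illustration only, not the library's implementation: the name human_size and the choice to report a missing or zero size as "unknown" are assumptions made for this example.

```python
# Binary units, largest first. 1 MiB = 2**20 bytes, so
# 12312312 / 1048576 is roughly 11.74.
_UNITS = [
    ("PiB", 2**50),
    ("TiB", 2**40),
    ("GiB", 2**30),
    ("MiB", 2**20),
    ("KiB", 2**10),
]


def human_size(num_bytes):
    """Format a byte count using the largest binary unit that fits.

    Hypothetical helper; a size of 0 or None is reported as "unknown",
    since a zero size may simply mean the info is not available yet.
    """
    if not num_bytes:
        return "unknown"
    for name, factor in _UNITS:
        if num_bytes >= factor:
            return "{:.2f} {}".format(num_bytes / factor, name)
    return "{} bytes".format(num_bytes)


print(human_size(12312312))
```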
@Conchylicultor I have generated a pull request here. Please check.
@Conchylicultor, I have also added them as a table at the start of the docs. Please review my PR here. Thanks :)