Datasets: [Documentation] Display the download size for each dataset.

Created on 3 Mar 2019  ·  12 Comments  ·  Source: tensorflow/datasets

Currently, the list of datasets available in TFDS gives no way to know the size of each dataset before downloading it. Users may be working with limited disk space and should be told the size of a dataset before they request it.

This feature enhancement would detail the download size of each dataset on the markdown file referenced above.

enhancement

All 12 comments

I can work on this issue. Can you please assign it to me?

Also, please tell me what to do. Should I add a new column to the md file?

@dynamicwebpaige I also want to work on this Issue.
@ChanchalKumarMaji Can we collaborate on this one?

@ParthS007 let's collaborate.

Hi,
I have never contributed to TensorFlow before and I want to start contributing.
Can I work on this issue?

Some pointers on this:
This should be a trivial change (probably fewer than 10 lines of code) to https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/scripts/document_datasets.py
It would just expose the builder.info.size_in_bytes field in the generated markdown.
The download size is computed automatically when the dataset is downloaded.
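
As a rough sketch of what exposing that field could look like, the snippet below renders a markdown table with one row per dataset. It is not the code in document_datasets.py: `DatasetEntry`, `size_row`, and `size_table` are hypothetical names, and the sizes are plain integers standing in for builder.info.size_in_bytes.

```python
from typing import Iterable, NamedTuple


class DatasetEntry(NamedTuple):
    """Hypothetical stand-in for the fields a doc script would read."""
    name: str
    size_in_bytes: int  # assumed to mirror builder.info.size_in_bytes


def size_row(entry: DatasetEntry) -> str:
    """Render one markdown table row: dataset name and raw byte count."""
    return f"| `{entry.name}` | {entry.size_in_bytes} |"


def size_table(entries: Iterable[DatasetEntry]) -> str:
    """Render a markdown table with a header and one row per dataset."""
    lines = ["| Dataset | Download size (bytes) |", "| --- | --- |"]
    lines.extend(size_row(entry) for entry in entries)
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative values only, not real dataset sizes.
    demo = [DatasetEntry("example_a", 12312312), DatasetEntry("example_b", 2048)]
    print(size_table(demo))
```

In a real change, the entries would presumably be built from each registered builder rather than hard-coded.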

I would also like to contribute to this issue by warning users of the size of the dataset they want to download. Let's collaborate!

@Anupam-tripathi Please note that the download size is already displayed when downloading a dataset: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/dataset_builder.py#L387

Without downloading, the user should already be able to get the info with:

import tensorflow_datasets as tfds

builder = tfds.builder('imagenet2012')
print(builder.info.size_in_bytes)

This issue is mostly to expose this information in the webpage doc.

@Conchylicultor ,

I have generated a new dataset documentation here. For some datasets I get a size of 0, and I cannot work out why.

For example, when I run the following code in Colab

builder = tfds.builder('imagenet2012')
print(builder.info.size_in_bytes)

it prints 0.

Should I open a new pull request so that you can see the changes I made in document_datasets.py?

Oh, sorry about this. Yes, ImageNet was a bad example because it is not automatically downloaded (due to the ImageNet licence, it has to be manually downloaded by the user).

Otherwise, it is possible that the most recent datasets do not have size information yet. We pre-compute size_in_bytes internally at Google, and when a user calls tfds.load, it downloads the dataset information from the internet (size, statistics about the dataset, ...). So after a new dataset is added or updated, it may take some time before the info becomes available.

Yes, please generate a pull request with your changes.

Also note that tfds.units.size_str gives human-readable formatting:

>>> tfds.units.size_str(12312312)
'11.74 MiB'
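
For the generated docs, a size of 0 (as reported for some datasets earlier in this thread) probably should not be shown as a literal zero. The sketch below is an independent, standard-library illustration of that idea, not the tfds.units implementation: `human_size` and `UNITS` are hypothetical names, and reporting 0 as "unknown" is an assumption about the desired behaviour.

```python
UNITS = ("bytes", "KiB", "MiB", "GiB", "TiB")


def human_size(size_in_bytes: int) -> str:
    """Format a byte count with binary units; 0 is reported as unknown."""
    if size_in_bytes < 0:
        raise ValueError("size_in_bytes must be non-negative")
    if size_in_bytes == 0:
        # Assumption: 0 means the size has not been computed yet.
        return "unknown"
    size = float(size_in_bytes)
    index = 0
    # Divide by 1024 until the value fits the current unit or units run out.
    while size >= 1024 and index < len(UNITS) - 1:
        size /= 1024
        index += 1
    if index == 0:
        return f"{size_in_bytes} bytes"
    return f"{size:.2f} {UNITS[index]}"


if __name__ == "__main__":
    print(human_size(12312312))
    print(human_size(0))
```

In the actual script, tfds.units.size_str would be the natural choice for the formatting itself; only the zero check would need to be added.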

@Conchylicultor I have generated a pull request here. Please check.

@Conchylicultor, I have also added them in table form at the start of the docs. Please review my PR here. Thanks :)
