Datasets: [GSoC] Update documentation generation script to display a few examples

Created on 27 Feb 2020  ·  9 Comments  ·  Source: tensorflow/datasets

Task: It would be nice for our dataset catalog to display a snippet of the dataset content, similar to https://www.tensorflow.org/datasets/overview#visualization.

Instructions:

  • Assume that the datasets are already generated and that you can load a few examples with either ds.take() or the subsplit API (DO NOT generate all datasets yourself; only test on a few)
  • Create a separate script to generate all dataset figures/content and save them in a new folder in our repo.
  • Use tfds.show_examples() to generate the image/content to save and display.
  • You probably want to wrap the tfds.show_examples(ds_info, ds) call in a try/except block, as not all datasets support it
  • Then update our documentation template to add a figure section which displays the images saved previously.
  • Make sure to test with a variety of datasets (with/without config, with/without images)
  • You probably want to split the figure generation and the documentation update into separate steps.
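
The try/except advice above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script that was eventually merged: the helper name try_make_figure, the 'train' split, and the '<name>.png' file naming are all hypothetical, and the sketch assumes tfds.show_examples() returns a matplotlib figure that can be saved with savefig().

```python
import os


def try_make_figure(make_figure, ds_name):
  """Returns make_figure(), or None if the call raises."""
  try:
    return make_figure()
  except Exception as e:  # Broad on purpose: failure modes vary per dataset.
    print(f'Skipping {ds_name}: {e}')
    return None


def generate_visualization(ds_names, dst_dir):
  """Saves one PNG per dataset that tfds.show_examples() can render."""
  # Imported lazily so the helper above stays usable without TFDS installed.
  import tensorflow_datasets as tfds

  os.makedirs(dst_dir, exist_ok=True)
  saved = []
  for ds_name in ds_names:
    # Assumption: the dataset is already generated and has a 'train' split.
    ds, ds_info = tfds.load(ds_name, split='train', with_info=True)
    # Argument order follows the issue text; newer TFDS releases document
    # show_examples(ds, ds_info) instead.
    fig = try_make_figure(lambda: tfds.show_examples(ds_info, ds), ds_name)
    if fig is None:
      continue  # Dataset not supported by show_examples; nothing to save.
    path = os.path.join(dst_dir, f'{ds_name}.png')  # Hypothetical naming.
    fig.savefig(path)
    saved.append(path)
  return saved
```

Keeping the error handling in a small helper means one unsupported dataset is skipped without aborting the whole run, and the helper can be checked without any dataset on disk.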

To test the documentation generation, you can use the following code snippet:

import os

import tensorflow_datasets as tfds
from tensorflow_datasets.scripts import generate_visualization  # Your new script
from tensorflow_datasets.scripts import document_datasets

DATASET_TO_TESTS = ['mnist', 'cifar10', ...]  # Datasets you want to test the script on.
dst_dir = ...  # Destination directory

def main(_):
  for ds_name in DATASET_TO_TESTS:
    # 1) Generate the dataset (the script assumes the datasets are already generated)
    tfds.load(ds_name)

    # 2) Generate the figure for the dataset
    generate_visualization.generate_visualization([ds_name])

    # 3) Generate the documentation page which uses the figure generated in 2)
    # Open in write mode: the default read mode would make f.write() fail.
    with open(os.path.join(dst_dir, f'{ds_name}.md'), 'w') as f:
      doc_content = document_datasets.dataset_docs_str([ds_name])
      f.write(doc_content)
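
Step 3 of the snippet above depends on the documentation page referencing the figure saved in step 2. One way to picture that, as a plain-Python stand-in for the real documentation template: the function name figure_section, the '## Figure' heading, and the '<name>.png' naming are assumptions made for illustration only.

```python
import os


def figure_section(ds_name, figures_dir):
  """Returns a Markdown figure section, or '' if no figure was saved."""
  filename = f'{ds_name}.png'  # Hypothetical naming scheme.
  if not os.path.exists(os.path.join(figures_dir, filename)):
    # Datasets without a saved figure simply get no figure section.
    return ''
  return f'## Figure\n\n![{ds_name} examples]({figures_dir}/{filename})\n'
```

Returning an empty string for missing figures keeps the documentation step independent of the figure step, so pages still render for datasets that could not be visualized.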

Difficulty: Intermediate

contributions welcome enhancement

All 9 comments

I've started working on this; I will make a PR at the beginning of next week.

I would like to contribute to this issue.
Hey @vvkio, please share your status; if possible, we could work on this in parallel.

Hey @Nikhilnama18, sure we can work on this together. What's left is writing the logic for updating the documentation and testing.

@Conchylicultor why are we passing the dataset name in a list to the generate_visualization function?

@ManishAradwad To indicate which datasets to generate images for. This is similar to dataset_docs_str() in https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/scripts/document_datasets.py

@Conchylicultor should I update https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/dataset_info.py with the visualization location, so that the image can be pulled into the Mako template like the other sections? In this instance, with builder.info.visualization?

@vvkio You shouldn't have to modify dataset_info. To get the path, you can use tfds_dir() or get_tfds_path() joined with info.full_name, for instance.
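
A minimal sketch of that path construction, assuming the root directory has already been obtained from one of the helpers mentioned above (they are not called here, since their exact signatures are not shown in this thread). The function name visualization_path and the flattened '.png' layout are hypothetical.

```python
import os


def visualization_path(root_dir, full_name):
  """Joins a root directory with a dataset's full name to locate its figure."""
  # info.full_name looks like 'mnist/3.0.1' or 'name/config/version';
  # flatten it so each figure is a single file (a hypothetical layout).
  return os.path.join(root_dir, full_name.replace('/', '-') + '.png')
```

For example, visualization_path('figures', 'mnist/3.0.1') yields a path ending in 'mnist-3.0.1.png' under 'figures', so nothing needs to be stored on DatasetInfo itself.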

I think this issue can be closed, since PRs #1632 and #1909 have been merged.

Yes, thank you all for fixing this.

Created https://github.com/tensorflow/datasets/issues/1949 to track the visualisation issue.
