Datasets: Wikipedia Dataset HTTP 404

Created on 19 Oct 2019 · 8 Comments · Source: tensorflow/datasets

Short description
When using the standard DatasetBuilder Apache Beam "download_and_prepare" command provided here (https://www.tensorflow.org/datasets/beam_datasets), I get an HTTP 404 error.

Environment information

  • Operating System: MacOS
  • Python version: 3.6.8
  • tensorflow-datasets/tfds-nightly version: 1.0.2
  • tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: 1.14.0

Reproduction instructions

From this link: (https://www.tensorflow.org/datasets/beam_datasets)

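import apache_beam as beam
import tensorflow_datasets as tfds

# FLAGS.download_dir is assumed to be defined elsewhere (e.g. via absl flags).
dl_config = tfds.download.DownloadConfig(
    beam_options=beam.options.pipeline_options.PipelineOptions()
)

builder = tfds.builder('wikipedia')
builder.download_and_prepare(
    download_dir=FLAGS.download_dir,
    download_config=dl_config,
)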

Link to logs
If applicable,

Expected behavior
I'm expecting the dataset to be built and saved into the download_dir that I pass to it.

Additional context

Error message below:

tensorflow_datasets.core.download.downloader.DownloadError: Failed to get url https://dumps.wikimedia.your.org/aawiki/20190301/dumpstatus.json. HTTP code: 404.

Is there somewhere I'm supposed to provide a specific link, or is this something that's more ingrained within the API? I'm not sure whether I'm misinterpreting the documentation or whether this is an actual bug with the link embedded within this file (https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/wikipedia.py)

bug

All 8 comments

Thanks for reporting. We download the data from the original sources on the internet, but we have no control over when those external websites go down or change their URLs. It seems that the current link is broken and should be updated. Don't hesitate to send a PR with the new link if you know a mirror URL.

Since no one can control when those external websites go down or change their URLs, the TensorFlow team could add a parameter named url to the load module, and the documentation could list the current URL string for every dataset. If a URL changes, only the documentation would need updating, not the code.

Duplicate of #1125

FYI, the preprocessed wikipedia dataset is available on GCS for you to copy into your data_dir: gs://tfds-data/datasets/wikipedia
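
For anyone who wants to script the copy, here is a minimal sketch that wraps gsutil from Python. The helper names (build_copy_command, copy_preprocessed) are hypothetical and not part of TFDS; it assumes gsutil is installed and on your PATH, and that the destination is the data_dir you later pass to TFDS (~/tensorflow_datasets is the default).

```python
import os
import shutil
import subprocess

# GCS path quoted in the comment above.
GCS_SOURCE = "gs://tfds-data/datasets/wikipedia"


def build_copy_command(data_dir, source=GCS_SOURCE):
    """Return the gsutil argument list that copies `source` into `data_dir`."""
    dest = os.path.expanduser(data_dir)
    # -m runs the transfer in parallel; cp -r copies the directory tree.
    return ["gsutil", "-m", "cp", "-r", source, dest]


def copy_preprocessed(data_dir):
    """Copy the preprocessed dataset (hypothetical helper; needs gsutil)."""
    if shutil.which("gsutil") is None:
        raise RuntimeError("gsutil was not found on PATH")
    # Create the destination first so gsutil copies into it as a directory.
    os.makedirs(os.path.expanduser(data_dir), exist_ok=True)
    subprocess.run(build_copy_command(data_dir), check=True)
```

Calling copy_preprocessed("~/tensorflow_datasets") should leave the data under ~/tensorflow_datasets/wikipedia. Double-check the bucket name if you see a 403: it is tfds-data.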

When I try to copy the dataset from GCS, I get an access denied error:
"AccessDeniedException: 403 USER does not have storage.objects.list access to tfs-data."

@philqc are you using gsutil cp?
Does gsutil ls gs://tfds-data/datasets/wikipedia work?

It works! Thank you :)

Changing the dump date from 20190301 to 20200220 works.
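
To check whether a given dump date is still hosted before changing it, something like the sketch below may help. It rebuilds the dumpstatus.json URL in the same shape as the one in the error message and probes it with a HEAD request. The function names are hypothetical (they are not TFDS APIs), and the URL layout is inferred from the error message only.

```python
import urllib.error
import urllib.request

# Mirror host taken from the error message in this issue.
BASE_URL = "https://dumps.wikimedia.your.org"


def dump_status_url(language, date, base_url=BASE_URL):
    """Build the dumpstatus.json URL for a language code and a YYYYMMDD date."""
    return f"{base_url}/{language}wiki/{date}/dumpstatus.json"


def dump_exists(language, date, timeout=10.0):
    """Return True if the mirror answers 200 for that dump, False on an HTTP error.

    Network failures other than HTTP errors (DNS, timeouts) are not caught here.
    """
    request = urllib.request.Request(dump_status_url(language, date), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # A 404 like the one reported above ends up here.
        return False
```

For example, dump_status_url("aa", "20190301") reproduces the failing URL from the report, so dump_exists("aa", "20190301") is the call that would tell you whether that date has been rotated off the mirror.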
Ref: https://dumps.wikimedia.your.org/backup-index.html
