Short description
When I run the standard DatasetBuilder Apache Beam "download_and_prepare" example provided here (https://www.tensorflow.org/datasets/beam_datasets), I get an HTTP 404 error.
Environment information
tensorflow-datasets/tfds-nightly version: 1.0.2
tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: 1.14.0
Reproduction instructions
From this link: (https://www.tensorflow.org/datasets/beam_datasets)
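import apache_beam as beam
import tensorflow_datasets as tfds

# Default pipeline options; the Beam pipeline runs locally.
dl_config = tfds.download.DownloadConfig(
    beam_options=beam.options.pipeline_options.PipelineOptions()
)
builder = tfds.builder('wikipedia')
# FLAGS.download_dir is a flag defined elsewhere in the script.
builder.download_and_prepare(
    download_dir=FLAGS.download_dir,
    download_config=dl_config,
)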
Expected behavior
I'm expecting the dataset to be built and saved into the download_dir that I pass to it.
Additional context
Error message below:
tensorflow_datasets.core.download.downloader.DownloadError: Failed to get url https://dumps.wikimedia.your.org/aawiki/20190301/dumpstatus.json. HTTP code: 404.
Am I supposed to provide a specific link somewhere, or is the URL built into the API? I'm not sure whether I'm misinterpreting the documentation or whether this is an actual bug with the link embedded in this file (https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/wikipedia.py).
Thanks for reporting. We download the data from the original sources on the internet, but we can't control when those external websites go down or change their URLs. It seems that the current link is broken and should be updated. Don't hesitate to send a PR with the new link if you know a mirror URL.
Since no one can control when those external websites go down or change their URLs, the TensorFlow team could add a parameter named url to the load module, and the documentation could list the current URL string for every dataset. If a URL changes, only the documentation would need to change, not the code.
Duplicate of #1125
FYI, the preprocessed wikipedia dataset is available on GCS for you to copy into your data_dir: gs://tfds-data/datasets/wikipedia
When I try to copy the dataset from GCS, I get an access denied error:
"AccessDeniedException: 403 USER does not have storage.objects.list access to tfs-data."
@philqc are you using gsutil cp?
Does gsutil ls gs://tfds-data/datasets/wikipedia work?
It works! Thank you :)
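For anyone else copying the preprocessed dataset, here is a minimal sketch of the gsutil check and copy discussed above, wrapped in Python. Note that the 403 message quoted earlier names the bucket tfs-data while the path given by the maintainers is tfds-data, so the bucket spelling is worth checking first. The helper names build_list_command and build_copy_command are hypothetical, and ~/tensorflow_datasets is assumed to be the data_dir in use.

```python
import os
import subprocess

# Bucket path quoted from the maintainers' comment in this thread.
GCS_WIKIPEDIA = "gs://tfds-data/datasets/wikipedia"


def build_list_command(source=GCS_WIKIPEDIA):
    """Return the gsutil command that lists the prepared dataset (hypothetical helper)."""
    return ["gsutil", "ls", source]


def build_copy_command(data_dir, source=GCS_WIKIPEDIA):
    """Return the gsutil command that copies the dataset into data_dir (hypothetical helper)."""
    # -m copies in parallel; -r recurses into the dataset directory.
    return ["gsutil", "-m", "cp", "-r", source, data_dir]


def copy_wikipedia(data_dir="~/tensorflow_datasets"):
    """List the bucket first, then copy it. Requires gsutil on PATH and network access."""
    data_dir = os.path.expanduser(data_dir)  # assumed data_dir location
    os.makedirs(data_dir, exist_ok=True)
    subprocess.run(build_list_command(), check=True)
    subprocess.run(build_copy_command(data_dir), check=True)
```

Calling copy_wikipedia() should leave the dataset under the wikipedia subdirectory of data_dir, provided the listing step succeeds.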
Changing the dump date from 20190301 to 20200220 works.
Ref: https://dumps.wikimedia.your.org/backup-index.html
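To check whether a given dump date is still hosted before changing it, something like the following sketch may help. It rebuilds the dumpstatus.json URL using the pattern visible in the error message above and sends a HEAD request to the mirror. The function names dumpstatus_url and dump_exists are hypothetical, and the URL layout is assumed from that single error message rather than taken from the TFDS source.

```python
import urllib.request

# Mirror base URL taken from the error message in this thread.
BASE_URL = "https://dumps.wikimedia.your.org"


def dumpstatus_url(language, date):
    """Build the dumpstatus.json URL for one language code and dump date (hypothetical helper)."""
    return "{}/{}wiki/{}/dumpstatus.json".format(BASE_URL, language, date)


def dump_exists(language, date, timeout=10):
    """Return True if the mirror answers the dumpstatus request with HTTP 200."""
    request = urllib.request.Request(dumpstatus_url(language, date), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers HTTPError (e.g. 404), URLError, and socket timeouts.
        return False
```

For example, dump_exists("aa", "20190301") requests the same URL that failed in the report, so it can be compared against a newer date listed on the backup index page.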