Datasets: Download Manager Fails to Download File

Created on 8 Jan 2021 · 4 Comments · Source: tensorflow/datasets

Short description
The download manager should be able to download files.
However, for a certain file, I get an error.

Environment information

  • Operating System: CentOS 7
  • Python version: 3.7.6
  • tensorflow-datasets/tfds-nightly version: 4.1.0+nightly
  • tensorflow/tf-nightly version: 2.4.0
  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? Yes

Reproduction instructions
Create any dataset with the following _split_generators code (minimal reproduction):

def _split_generators(self, dl_manager: tfds.download.DownloadManager):
    """Returns SplitGenerators."""

    dl_manager.download(
        "https://aslsignbank.haskins.yale.edu/dictionary/protected_media/glossvideo/ASL/BO/BOOK-418.mp4")

Link to logs
It first says:

INFO[download_manager.py]: Downloading https://aslsignbank.haskins.yale.edu/dictionary/protected_media/glossvideo/ASL/BO/BOOK-418.mp4 into /home/nlp/amit/tensorflow_datasets/downloads/asls.hask.yale.edu_dict_prot_medi_glos_ASLfhvOgypj047EznvbC8apzfVHWR69qsI29Clf-twND88.mp4.tmp.7953dc7f67ab4d42a107981d38e45335...

and then:

tensorflow.python.framework.errors_impl.NotFoundError: /home/nlp/amit/tensorflow_datasets/downloads/asls.hask.yale.edu_dict_prot_medi_glos_ASLfhvOgypj047EznvbC8apzfVHWR69qsI29Clf-twND88.mp4.tmp.7953dc7f67ab4d42a107981d38e45335/glossvideo/ASL/BO/BOOK-418.mp4; No such file or directory

Expected behavior
The download should succeed.

bug contributions welcome

All 4 comments

Could you share the full logs, including the stack trace? Are you using the returned value of downloaded_path = dl_manager.download('http://...')?

I do use the downloaded path, but it is not required for the minimal reproduction example.

Here is the full log from the command tfds build

2021-01-08 21:29:59.552876: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-08 21:29:59.552935: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO[build.py]: Loading dataset  from path: /home/nlp/amit/datasets/wlasl/wlasl.py
2021-01-08 21:30:04.682786: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Failed precondition: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".
INFO[build.py]: download_and_prepare for dataset wlasl/default/0.3.0...
INFO[dataset_builder.py]: Generating dataset wlasl (/home/nlp/amit/tensorflow_datasets/wlasl/default/0.3.0)
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /home/nlp/amit/tensorflow_datasets/wlasl/default/0.3.0...
INFO[download_manager.py]: Downloading https://aslsignbank.haskins.yale.edu/dictionary/protected_media/glossvideo/ASL/BO/BOOK-418.mp4 into /home/nlp/amit/tensorflow_datasets/downloads/asls.hask.yale.edu_dict_prot_medi_glos_ASLfhvOgypj047EznvbC8apzfVHWR69qsI29Clf-twND88.mp4.tmp.b88962d8805149259ed48536c4d2e7bd...
Dl Size...: 0 MiB [00:01, ? MiB/s]                                                                                                                         | 0/1 [00:01<?, ? url/s]
Dl Completed...:   0%|                                                                                                                                     | 0/1 [00:01<?, ? url/s]
Traceback (most recent call last):
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/bin/tfds", line 8, in <module>
    sys.exit(launch_cli())
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow_datasets/scripts/cli/main.py", line 120, in launch_cli
    app.run(main, flags_parser=_parse_flags)
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow_datasets/scripts/cli/main.py", line 115, in main
    args.subparser_fn(args)
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow_datasets/scripts/cli/build.py", line 199, in _build_datasets
    _download_and_prepare(args, builder)
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow_datasets/scripts/cli/build.py", line 357, in _download_and_prepare
    download_config=dl_config,
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 434, in download_and_prepare
    download_config=download_config,
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1136, in _download_and_prepare
    dl_manager, **optional_pipeline_kwargs
  File "/home/nlp/amit/datasets/wlasl/wlasl.py", line 154, in _split_generators
    "https://aslsignbank.haskins.yale.edu/dictionary/protected_media/glossvideo/ASL/BO/BOOK-418.mp4")
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow_datasets/core/download/download_manager.py", line 549, in download
    return _map_promise(self._download, url_or_urls)
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow_datasets/core/download/download_manager.py", line 777, in _map_promise
    res = tf.nest.map_structure(lambda p: p.get(), all_promises)  # Wait promises
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow/python/util/nest.py", line 659, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow/python/util/nest.py", line 659, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow_datasets/core/download/download_manager.py", line 777, in <lambda>
    res = tf.nest.map_structure(lambda p: p.get(), all_promises)  # Wait promises
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/promise/promise.py", line 512, in get
    return self._target_settled_value(_raise=True)
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/promise/promise.py", line 516, in _target_settled_value
    return self._target()._settled_value(_raise)
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/promise/promise.py", line 226, in _settled_value
    reraise(type(raise_val), raise_val, self._traceback)
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/promise/promise.py", line 844, in handle_future_result
    resolve(future.result())
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow_datasets/core/download/downloader.py", line 224, in _sync_download
    file_.write(block)
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow/python/lib/io/file_io.py", line 102, in write
    self._prewrite_check()
  File "/home/nlp/amit/anaconda2/envs/meta-scholar/lib/python3.7/site-packages/tensorflow/python/lib/io/file_io.py", line 88, in _prewrite_check
    compat.path_to_bytes(self.__name), compat.as_bytes(self.__mode))
tensorflow.python.framework.errors_impl.NotFoundError: /home/nlp/amit/tensorflow_datasets/downloads/asls.hask.yale.edu_dict_prot_medi_glos_ASLfhvOgypj047EznvbC8apzfVHWR69qsI29Clf-twND88.mp4.tmp.b88962d8805149259ed48536c4d2e7bd/glossvideo/ASL/BO/BOOK-418.mp4; No such file or directory

We get the following response headers when we make a GET request to the URL:

{
   'headers': {
      'Content-Disposition': 'inline;filename=glossvideo/ASL/BO/BOOK-418.mp4;filename*=UTF-8',
      ...
   },
   'url': 'https://aslsignbank.haskins.yale.edu/dictionary/protected_media/glossvideo/ASL/BO/BOOK-418.mp4',
   ...
}

So, when we pass the above response to the _get_filename function:

https://github.com/tensorflow/datasets/blob/79047eef78a7d8da6af05df89a1a0a6e11c71ce8/tensorflow_datasets/core/download/downloader.py#L98-L105

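We get glossvideo/ASL/BO/BOOK-418.mp4 as the return value, but the expected value is BOOK-418.mp4. Because the returned name contains slashes, the downloader tries to write into glossvideo/ASL/BO/ inside the temporary download directory; those subdirectories do not exist, which matches the path in the NotFoundError above.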
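
To illustrate why the extra path components matter, here is a minimal sketch that uses only the Python standard library. It is an analogy, not the TFDS code path: it assumes the built-in open() behaves like the TensorFlow file API in the traceback when a parent directory is missing, and the variable names are made up for the example.

```python
import os
import tempfile

# Filename as it appears in the Content-Disposition header.
header_filename = "glossvideo/ASL/BO/BOOK-418.mp4"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Joining keeps the slashes, so the target lives in subdirectories
    # (glossvideo/ASL/BO) that were never created.
    bad_path = os.path.join(tmp_dir, header_filename)
    try:
        with open(bad_path, "wb") as f:
            f.write(b"data")
        failed = False
    except FileNotFoundError:
        failed = True

    # Keeping only the last path component writes directly into tmp_dir.
    good_path = os.path.join(tmp_dir, os.path.basename(header_filename))
    with open(good_path, "wb") as f:
        f.write(b"data")
    wrote_ok = os.path.exists(good_path)

print(failed, wrote_ok)
```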

So, we will have to make some changes here.
For Content-Disposition grammar rule reference, see https://tools.ietf.org/html/rfc6266#section-4.1
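
One possible direction, as a rough sketch only: take the filename parameter from the header, then keep just its last path component, falling back to the URL when the header gives nothing usable. The function name filename_from_headers and the regular expression are hypothetical and are not the actual TFDS implementation; the regex handles only the plain filename= parameter (quoted or unquoted) and ignores the filename*= extended form from RFC 6266.

```python
import posixpath
import re
from urllib.parse import urlparse

# Matches filename=value, quoted or unquoted. It does not match filename*=...
_FILENAME_RE = re.compile(r'filename=(?:"([^"]*)"|([^;]*))', re.IGNORECASE)


def filename_from_headers(headers, url):
    """Hypothetical helper: returns a bare filename with no directories.

    `headers` is assumed to be a mapping with a "Content-Disposition" key
    (a plain dict is case-sensitive, unlike some HTTP client header objects).
    """
    disposition = headers.get("Content-Disposition", "")
    match = _FILENAME_RE.search(disposition)
    name = ""
    if match:
        raw = match.group(1) if match.group(1) is not None else match.group(2)
        # Normalize backslashes, then drop any directory components.
        name = posixpath.basename(raw.strip().replace("\\", "/"))
    if not name:
        # Fall back to the last component of the URL path.
        name = posixpath.basename(urlparse(url).path)
    return name


headers = {
    "Content-Disposition":
        "inline;filename=glossvideo/ASL/BO/BOOK-418.mp4;filename*=UTF-8",
}
url = ("https://aslsignbank.haskins.yale.edu/dictionary/protected_media/"
       "glossvideo/ASL/BO/BOOK-418.mp4")
print(filename_from_headers(headers, url))  # BOOK-418.mp4
```

Stripping directory components also avoids writing outside the download directory if a server ever sends a name such as ../../x.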

Hi, I would like to work on this issue
