Datasets: SSL CA failures when using a dataset from gs://

Created on 31 Jul 2020 · 18 Comments · Source: tensorflow/datasets

Short description

When trying to load a dataset, I get the error "Problem with the SSL CA cert (path? access rights?)", followed by a second error when six.reraise is called.

Environment information

  • Operating System: RHEL 7
  • Python version: 3.7.4
  • tensorflow-datasets version: 3.2.1
  • tensorflow version: 2.2.0

Reproduction instructions

  • Run in Python: tfds.builder("imagenet2012").info

Link to logs

Traceback (most recent call last):
  File "/home/s3248973/.local/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 399, in try_reraise
    yield
  File "/home/s3248973/.local/lib/python3.7/site-packages/tensorflow_datasets/core/registered.py", line 244, in builder
    return builder_cls(name)(**builder_kwargs)
  File "/home/s3248973/.local/lib/python3.7/site-packages/tensorflow_datasets/core/api_utils.py", line 69, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "/home/s3248973/.local/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 206, in __init__
    self.info.initialize_from_bucket()
  File "/home/s3248973/.local/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_info.py", line 423, in initialize_from_bucket
    data_files = gcs_utils.gcs_dataset_info_files(self.full_name)
  File "/home/s3248973/.local/lib/python3.7/site-packages/tensorflow_datasets/core/utils/gcs_utils.py", line 71, in gcs_dataset_info_files
    return gcs_listdir(posixpath.join(GCS_DATASET_INFO_DIR, dataset_dir))
  File "/home/s3248973/.local/lib/python3.7/site-packages/tensorflow_datasets/core/utils/gcs_utils.py", line 64, in gcs_listdir
    if is_gcs_disabled() or not tf.io.gfile.exists(root_dir):
  File "/sw/installed/TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/tensorflow_core/python/lib/io/file_io.py", line 280, in file_exists_v2
    pywrap_tensorflow.FileExists(compat.as_bytes(path))
tensorflow.python.framework.errors_impl.AbortedError: All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 77 meaning 'Problem with the SSL CA cert (path? access rights?)', error details: error setting certificate verify locations:
  CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: none
     when reading metadata of gs://tfds-data/dataset_info/imagenet2012/5.0.0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "hvd_dnn_benchmark.py", line 231, in <module>
    run()  #pylint: disable=no-value-for-parameter
  File "/sw/installed/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/sw/installed/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/sw/installed/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/sw/installed/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "hvd_dnn_benchmark.py", line 105, in run
    dataset = get_dataset(dataset, synthetic=synthetic_data)
  File "/home/h3/s3248973/git/tensorflow_tests/benchmark/datasets.py", line 87, in get_dataset
    return _AVAIL[name](synthetic)
  File "/home/h3/s3248973/git/tensorflow_tests/benchmark/datasets.py", line 77, in _imagenet
    return TFDS_Dataset('imagenet2012', synthetic)
  File "/home/h3/s3248973/git/tensorflow_tests/benchmark/datasets.py", line 54, in __init__
    info = tfds.builder(name).info
  File "/home/s3248973/.local/lib/python3.7/site-packages/tensorflow_datasets/core/registered.py", line 244, in builder
    return builder_cls(name)(**builder_kwargs)
  File "/sw/installed/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/s3248973/.local/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 401, in try_reraise
    reraise(*args, **kwargs)
  File "/home/s3248973/.local/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 392, in reraise
    six.reraise(exc_type, exc_type(msg), exc_traceback)
TypeError: __init__() missing 2 required positional arguments: 'op' and 'message'

Expected behavior
No error

Labels: bug, contributions welcome

All 18 comments

Hi @Flamefire,

Can you run this code snippet on your machine

import tensorflow as tf
tf.io.gfile.exists("gs://tfds-data/dataset_info/mnist/3.0.1")

and share the results?

(Similar issue https://github.com/tensorflow/datasets/issues/2190)

Good idea, that fails with the first exception above. Found the cause here: https://github.com/tensorflow/tensorflow/issues/40065

So that's on TensorFlow.

However, it might still be worth looking into the second issue, where the reraise fails. That one looks like it is caused by TFDS.

Note that the fix seems to be for a different issue.

This issue is not on Windows but on RHEL, and the cause is that TensorFlow misconfigures the bundled libcurl, which leads to a certificate error at runtime. So I doubt that PR fixes anything related to this.

The reraise error message should be fixed by https://github.com/tensorflow/datasets/pull/2377. However, it seems my fix has a bug with pytest, and I haven't had time to investigate further.
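
For readers wondering why the second traceback ends in a TypeError: the failing line rebuilds the exception as exc_type(msg), but TensorFlow's error classes take more constructor arguments than a single message. The following is a minimal sketch of that failure and of one possible fallback. It is not the code from the PR above; OpError here is a simplified stand-in and naive_reraise / safe_reraise are hypothetical names.

```python
class OpError(Exception):
    """Simplified stand-in for a TF-style error whose constructor needs 3 args."""

    def __init__(self, node_def, op, message):
        super().__init__(message)
        self.node_def = node_def
        self.op = op
        self.message = message


def naive_reraise(exc, prefix):
    # Mirrors the failing pattern: rebuild the exception from one string.
    # For OpError this raises TypeError, because 'op' and 'message' are missing.
    raise type(exc)(prefix + str(exc))


def safe_reraise(exc, prefix):
    # Hypothetical fallback: if the original type cannot be built from a
    # single message, wrap it in a RuntimeError and chain the original.
    msg = prefix + str(exc)
    try:
        new_exc = type(exc)(msg)
    except TypeError:
        new_exc = RuntimeError(f"{type(exc).__name__}: {msg}")
    raise new_exc from exc
```

With this fallback the original exception stays reachable through __cause__, so the libcurl message is not hidden behind an unrelated TypeError.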

Here is a workaround for this problem: downgrade tensorflow-datasets.

pip install tensorflow-datasets==3.0.0

None of the other solutions worked for me; only this one did.

I'm not sure which solutions you are talking about, but here are two:

  • Create a symlink at /etc/ssl/certs/ca-certificates.crt pointing to your system's CA bundle (likely ca-bundle.crt; make sure it exists). TF looks only for /etc/ssl/certs/ca-certificates.crt.
  • When building TensorFlow from source, build with the system curl (see $TF_SYSTEM_LIBS). In this case, run export TF_SYSTEM_LIBS='curl' before building TF.
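
As a rough aid for the first option, here is a small sketch that checks whether the path TF expects exists and, if not, suggests a symlink command. The candidate bundle locations are assumptions (the first is the usual place on RHEL-like systems), and suggest_symlink is a hypothetical helper, not part of TF or TFDS.

```python
import os

# Path taken from the error message: the only file the bundled libcurl checks.
EXPECTED_CA_FILE = "/etc/ssl/certs/ca-certificates.crt"

# Assumed candidate locations of the system CA bundle; adjust for your distro.
CANDIDATE_BUNDLES = (
    "/etc/pki/tls/certs/ca-bundle.crt",
    "/etc/ssl/certs/ca-bundle.crt",
)


def suggest_symlink(expected=EXPECTED_CA_FILE,
                    candidates=CANDIDATE_BUNDLES,
                    exists=os.path.exists):
    """Return a shell command creating the symlink, or None if not needed."""
    if exists(expected):
        return None  # TF will already find a CA file here.
    for candidate in candidates:
        if exists(candidate):
            return f"ln -s {candidate} {expected}"
    raise FileNotFoundError("no CA bundle found in: " + ", ".join(candidates))
```

Calling print(suggest_symlink()) prints either None or a command to run as root. The function only suggests the command instead of creating the link itself, since writing to /etc/ssl/certs usually needs elevated rights.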

I would recommend trying tfds-nightly, as we may have pushed an update for this recently.

Would you be able to tell us more about that update? How do you work around the issue that the curl bundled with TF uses the wrong certificate path? Do you no longer use TF for downloading?

Reading on GCS is optional, so it should skip GCS rather than failing.

If it doesn't work in tfds-nightly, could you try adding tf.errors.AbortedError in https://github.com/tensorflow/datasets/blob/adb320ff04b6e93c561dacb2b647c8fcbfea92f3/tensorflow_datasets/core/utils/gcs_utils.py#L43
and send us a PR?
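
The idea of skipping GCS rather than failing can be sketched as follows. This is only an illustration of the pattern, not the actual contents of gcs_utils.py: the exception classes are local stand-ins for the tf.errors ones, and SKIPPABLE_ERRORS and exists_or_skip are hypothetical names.

```python
class NotFoundError(Exception):
    """Stand-in for a tf.errors-style exception (hypothetical)."""


class AbortedError(Exception):
    """Stand-in for tf.errors.AbortedError, the error seen in the traceback."""


# Errors that mean "GCS is not reachable", so the caller should fall back.
# Adding AbortedError here is the change suggested in the comment above.
SKIPPABLE_ERRORS = (NotFoundError, AbortedError)


def exists_or_skip(exists_fn, path):
    """Return exists_fn(path), treating an unreachable GCS as 'does not exist'."""
    try:
        return exists_fn(path)
    except SKIPPABLE_ERRORS:
        return False
```

In the real code, exists_fn would presumably be tf.io.gfile.exists, which is the call that fails in the traceback.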

First attempt installing it failed because it seems to require bazel. Are you going to make bazel an actual requirement of TFDS? If so please don't! Using Bazel on HPC systems is a nightmare.

After getting past this, it does indeed work without further changes.

TFDS does not require Bazel. Are you perhaps confusing it with TensorFlow?

Oh, you are right. This comes from a dependency added in https://github.com/tensorflow/datasets/commit/6e2540f85bbfd0312d5611c2612cc38de98084f3, which adds dm-tree. I mistook its output for TFDS's because I did not expect a new library to be installed. dm-tree requires Bazel, so TFDS requires Bazel indirectly. Example output:

  running build_ext
  bazel build //tree:_tree --symlink_prefix=build/temp.linux-ppc64le-3.7/bazel- --compilation_mode=opt
  unable to execute 'bazel': No such file or directory

As this is only required in 2 places and at least the "flatten" is rather trivial to implement, maybe it is possible to avoid it?
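
To illustrate how small such a function can be, here is a minimal sketch of a flatten for nested lists, tuples and dicts. It is not a drop-in replacement for dm-tree: it assumes only these three container types and that visiting dict values in sorted-key order is acceptable, and a real replacement would need to cover whatever other structures TFDS actually passes in.

```python
def flatten(structure):
    """Return the leaves of a nested list/tuple/dict structure as a flat list."""
    if isinstance(structure, dict):
        # Visit dict values in sorted-key order so the result is deterministic.
        return [leaf for key in sorted(structure)
                for leaf in flatten(structure[key])]
    if isinstance(structure, (list, tuple)):
        return [leaf for item in structure for leaf in flatten(item)]
    # Anything else is treated as a leaf.
    return [structure]
```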

When do you see this error (which command are you running)?

The only dependency tree requires is six (https://github.com/deepmind/tree/blob/master/requirements.txt).
You shouldn't need to compile tree from source. Why not pip install?

I did use pip install tfds-nightly, which in this case basically reduces to pip install dm-tree. I guess the reason I see this is that I ran it on POWER, where no prebuilt binary exists: https://pypi.org/project/dm-tree/#files

The same will happen on ARM clusters, or when this is to be installed as a module for all users of an HPC system, which requires (or at least strongly prefers) source builds.

It seems that TF provides the same set of pre-built packages as dm-tree: https://pypi.org/project/tensorflow/#files

So I'm a little confused why the problem exists with dm-tree but not with tensorflow. Or am I missing something?

We already build TensorFlow from source. This is required for our POWER (and likely ARM) nodes (no wheel) and is better in general, as performance really matters there: compiling with arch-specific optimizations and GPU-specific architectures enabled has advantages. However, TensorFlow's use of Bazel is a constant source of problems. With basically every TF release, new patches are required to make it work, and some of them take multiple days to come up with. Ultimately, their use of Bazel is what led to this issue: paths are hardcoded during the build to avoid using the native configure step or the CMake integration of the dependencies (in this case cURL).

So to summarize: using Bazel is a huge pain on HPC systems, and hence requiring it for building Python packages is a major disadvantage. I already opened an issue with dm-tree saying that the advantages of (their use of) Bazel are small compared to the trouble it causes, and the same applies here: the advantages of dm-tree (2 functions) are small compared to the cost of another dependency, especially one that is hard to get in some environments.

Thank you for the explanations.

The reason we added dm-tree is that we would like to gradually layer our dependencies better. Ultimately, we would like to have a "core" library which doesn't depend on TF at all, but instead uses smaller independent libs (dm-tree, gfile, ...).
This would allow building different front-ends (e.g. for Jax users) which don't need the full TF package.

But this is more of a long-term plan. We can remove the dm-tree dependency in the meantime. Please send a PR if you would like to.

I think the issues here have been fixed. Please open a new issue otherwise.
