Datasets: dataset_builder._prepare_split ResourceExhaustedError

Created on 1 Feb 2020  ·  10 Comments  ·  Source: tensorflow/datasets

Short description
dataset_builder._prepare_split raises tensorflow.python.framework.errors_impl.ResourceExhaustedError ("Too many open files") when preparing imagenet_resized with the 64x64 config.

Environment information

  • Operating System: Linux, with user privileges only, so the hard limit on the number of open files cannot be raised
  • Python version: 3.6.7
  • tensorflow-datasets version: 1.3.2, but the same function exists in 2.0.0 so I suspect the error is also there
  • tensorflow-gpu version: 1.15

Reproduction instructions

import tensorflow_datasets as tfds
data = tfds.load(name="imagenet_resized", split="train", builder_kwargs={'config':'64x64'})

Link to logs
```
Traceback (most recent call last):
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/file_format_adapter.py", line 199, in incomplete_dir
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 322, in download_and_prepare
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 969, in _download_and_prepare
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 837, in _download_and_prepare
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 996, in _prepare_split
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/tfrecords_writer.py", line 160, in write
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/shuffle.py", line 194, in add
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/shuffle.py", line 180, in _add_to_mem_buffer
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/shuffle.py", line 174, in _add_to_bucket
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/shuffle.py", line 109, in add
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 106, in write
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 92, in _prewrite_check
tensorflow.python.framework.errors_impl.ResourceExhaustedError: /home/pxd256/tensorflow_datasets/imagenet_resized/64x64/0.1.0.incomplete3CQMFQ/bucket_b6b3f07c-ee6c-4cab-aac4-b9dde31f24c6_717.tmp; Too many open files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/api_utils.py", line 52, in disallow_positional_args_dec
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/registered.py", line 302, in load
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/api_utils.py", line 52, in disallow_positional_args_dec
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 342, in download_and_prepare
File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow_datasets/core/file_format_adapter.py", line 203, in incomplete_dir
File "/home/pxd256/.local/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 569, in delete_recursively_v2
tensorflow.python.framework.errors_impl.ResourceExhaustedError: /home/pxd256/tensorflow_datasets/imagenet_resized/64x64/0.1.0.incomplete3CQMFQ; Too many open files
126393 examples [05:00, 472.55 examples/s]
```

Expected behavior
Not throwing exception.

bug contributions welcome


All 10 comments

Thanks for reporting. Could you check the maximum number of open files allowed on your system?
https://easyengine.io/tutorials/linux/increase-open-files-limit/

During generation, around 1000 temporary shards may be open simultaneously. This allows the data to be shuffled and sorted deterministically. It is done in: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/shuffle.py
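
To illustrate why this kind of shuffling keeps one file open per bucket, here is a minimal sketch. It is not the code in shuffle.py: the names bucket_shuffle, NUM_BUCKETS and salt, the use of MD5, and the tab-separated temp-file format are all assumptions made for the example. Each record is hashed, routed to a bucket file by hash range, and the buckets are then read back in order and sorted by hash.

```python
import hashlib
import os
import tempfile

NUM_BUCKETS = 8  # hypothetical small value; the thread mentions about 1000


def _hash_key(key, salt):
    """Map a key to a 128-bit integer (MD5 chosen only for illustration)."""
    digest = hashlib.md5((salt + str(key)).encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def bucket_shuffle(items, salt="seed", num_buckets=NUM_BUCKETS):
    """Return the values of (key, value) pairs in a deterministic shuffled order.

    Values must be strings without newlines in this simplified sketch.
    """
    max_hash = 2 ** 128
    with tempfile.TemporaryDirectory() as tmp_dir:
        # One open handle per bucket: this is what consumes file descriptors.
        files = [
            open(os.path.join(tmp_dir, "bucket_%d.tmp" % i), "w+")
            for i in range(num_buckets)
        ]
        try:
            for key, value in items:
                h = _hash_key(key, salt)
                bucket = h * num_buckets // max_hash  # bucket by hash range
                files[bucket].write("%d\t%s\n" % (h, value))
            result = []
            for f in files:  # buckets are already ordered by hash range
                f.seek(0)
                rows = [line.rstrip("\n").split("\t", 1) for line in f]
                rows.sort(key=lambda row: int(row[0]))  # sort within bucket
                result.extend(value for _, value in rows)
        finally:
            for f in files:
                f.close()
    return result


examples = [(i, "example_%d" % i) for i in range(20)]
shuffled = bucket_shuffle(examples)
```

Because the output order depends only on the hashes, the same keys and salt give the same order regardless of the order in which the examples arrive.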

Thanks for the reply. Two servers I have access to have a hard limit of about 130k open files, which cannot be modified with only user privileges. I will see if I can change some of the tfds source files, shuffle.py and imagenet_resized.py, to fix the issue, and will post an update if that works.

Regarding the system limit:

import resource
resource.getrlimit(resource.RLIMIT_NOFILE)
(1024, 131072)

Temporary fix:
Instead of

import tensorflow_datasets as tfds
data = tfds.load(name="imagenet_resized", split="train", builder_kwargs={'config':'64x64'})

Do

import resource
low, high = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (high, high))

import tensorflow_datasets as tfds
data = tfds.load(name="imagenet_resized", split="train", builder_kwargs={'config':'64x64'})

This should work provided that the Python process has a high enough hard limit on the number of open files (somewhat above 1000, which is the hard-coded number of shard files used by the shuffler).

The soft limit of the Python process (1024 on three of my servers, and I believe the same in all Anaconda environments) is above the number of temporary shard files for the shuffler buckets (1000), yet it is still too low for splitting the downloaded data into TFRecords with the shuffler. Besides the bucket files, 27 more files are open for reasons I have not identified, which brings the total to 1027, above the soft limit. User privileges are not enough to modify the hard limit, but raising the soft limit from within Python needs no root privileges and is sufficient, assuming the system hard limit is high enough. I will look into whether there is a useful heuristic for keeping the number of shard files within a safe range.
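
To check how much room a process has before hitting the soft limit, something like the following sketch may help. The helper names count_open_fds and has_fd_headroom are made up for this example, and it assumes that /proc/self/fd (Linux) or /dev/fd (macOS/BSD) lists the process's open descriptors.

```python
import os
import resource


def count_open_fds():
    """Count this process's open file descriptors, or return None if unknown.

    The count may include the descriptor used to list the directory itself.
    """
    for fd_dir in ("/proc/self/fd", "/dev/fd"):  # Linux first, then macOS/BSD
        try:
            return len(os.listdir(fd_dir))
        except OSError:
            continue
    return None


def has_fd_headroom(needed=1000):
    """Return True if `needed` more files fit under the soft limit.

    Returns None when the number of open descriptors cannot be determined.
    """
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    open_now = count_open_fds()
    if open_now is None:
        return None
    return soft - open_now >= needed


print(count_open_fds(), has_fd_headroom(1000))
```

With a soft limit of 1024 and a couple of dozen descriptors already open, has_fd_headroom(1000) would return False, which matches the failure described above.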

@Conchylicultor should I go on with the above fix?

Same issue on a MacBook Pro for the 'imagenette' dataset

import tensorflow_datasets as tfds
ds_train = tfds.load('imagenette/full-size-v2', split='train[:80%]', as_supervised=True)

The error:

Downloading and preparing dataset imagenette/full-size-v2/0.1.0 (download: 1.45 GiB, generated: 1.46 GiB, total: 2.91 GiB) to /tensorflow_datasets/imagenette/full-size-v2/0.1.0...
Dl Completed...: 0 url [00:00, ? url/s]          INFO:absl:URL https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz already downloaded: reusing /tensorflow_datasets/downloads/s3_fast-ai-imageclas_imagenette2s_llZjKkuLrzn8hoOF5IBPy4iCn1Rqq1SaRaYZXrNpw.tgz.
INFO:absl:Reusing extraction of /tensorflow_datasets/downloads/s3_fast-ai-imageclas_imagenette2s_llZjKkuLrzn8hoOF5IBPy4iCn1Rqq1SaRaYZXrNpw.tgz at /tensorflow_datasets/downloads/extracted/TAR_GZ.s3_fast-ai-imageclas_imagenette2s_llZjKkuLrzn8hoOF5IBPy4iCn1Rqq1SaRaYZXrNpw.tgz.

Extraction completed...: 0 file [00:00, ? file/s]

Dl Size...: 0 MiB [00:00, ? MiB/s]

Dl Completed...: 0 url [00:00, ? url/s]
INFO:absl:Generating split train
Traceback (most recent call last):       
  File "/opt/anaconda3/envs/alexnet/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 344, in incomplete_dir
  File "/opt/anaconda3/envs/alexnet/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 387, in download_and_prepare
  File "/opt/anaconda3/envs/alexnet/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1024, in _download_and_prepare
  File "/opt/anaconda3/envs/alexnet/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 974, in _download_and_prepare
  File "/opt/anaconda3/envs/alexnet/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1041, in _prepare_split
  File "/opt/anaconda3/envs/alexnet/lib/python3.7/site-packages/tensorflow_datasets/core/tfrecords_writer.py", line 202, in write
  File "/opt/anaconda3/envs/alexnet/lib/python3.7/site-packages/tensorflow_datasets/core/shuffle.py", line 221, in add
  File "/opt/anaconda3/envs/alexnet/lib/python3.7/site-packages/tensorflow_datasets/core/shuffle.py", line 207, in _add_to_mem_buffer
  File "/opt/anaconda3/envs/alexnet/lib/python3.7/site-packages/tensorflow_datasets/core/shuffle.py", line 201, in _add_to_bucket
  File "/opt/anaconda3/envs/alexnet/lib/python3.7/site-packages/tensorflow_datasets/core/shuffle.py", line 123, in add
  File "/opt/anaconda3/envs/alexnet/lib/python3.7/site-packages/tensorflow/python/lib/io/file_io.py", line 101, in write
  File "/opt/anaconda3/envs/alexnet/lib/python3.7/site-packages/tensorflow/python/lib/io/file_io.py", line 87, in _prewrite_check
tensorflow.python.framework.errors_impl.ResourceExhaustedError: /tensorflow_datasets/imagenette/full-size-v2/0.1.0.incomplete0B1ADV/bucket_5f0291c6-3ae7-468f-86e5-02b320b0d602_283.tmp; Too many open files

The above fix gives me this error:

    resource.setrlimit(resource.RLIMIT_NOFILE, (high, high))
ValueError: current limit exceeds maximum limit

When I check my system params I have a high limit:

sysctl kern.maxfiles    # -> 49152
sysctl kern.maxfilesperproc # -> 24576

I tried doubling those limits but I still got the same error.

sysctl -w kern.maxfiles=98304
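
A possible workaround for the ValueError above is sketched below. It assumes (this is not confirmed in the thread) that macOS reports an effectively unlimited hard limit and rejects a soft limit that large, so the sketch asks for a modest finite value instead of (high, high). The function name raise_nofile_soft_limit and the default target of 4096 are arbitrary choices for the example.

```python
import resource


def raise_nofile_soft_limit(target=4096):
    """Try to raise the soft open-file limit to `target`.

    Returns the soft limit in effect afterwards. `target` is an assumed
    value that only needs to be comfortably above the ~1000 bucket files.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough, nothing to do
    # Never request more than the hard limit, and never request "infinity".
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        return soft  # the system refused; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


print(raise_nofile_soft_limit())
```

Call this before importing tensorflow_datasets, as in the original fix. If the printed value is still below roughly 1030, the limit could not be raised far enough for the shuffler.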

import resource
low, high = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (high, high))

import tensorflow_datasets as tfds
train, validation = tfds.load('imagenet_resized/32x32', split=['train', 'validation'], as_supervised=True, shuffle_files = True)

works for me.

Hi @houseofai, can you check whether the code in the pull request above works on your system?

@PerryXDeng I just tried with tf-nightly==2.5.0.dev20201223 on Mac and I got the same issue.

@houseofai

The TensorFlow version isn't really relevant here, because the issue concerns tensorflow-datasets and not TensorFlow itself. Can you try with the dataset code in the pull request above? You can pull the code, set up a virtual environment, install the package and its dependencies with pip install -e . from the root directory of the code, and then run the imports and statements there as you normally would.

@PerryXDeng Indeed, I don't know where I was headed.
Tested on Mac, it works perfectly:

2020-12-23 21:06:06.052105: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Failed precondition: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".
Downloading and preparing dataset 1.45 GiB (download: 1.45 GiB, generated: 1.46 GiB, total: 2.91 GiB) to /Users/john/tensorflow_datasets/imagenette/full-size-v2/0.1.0...
Dl Completed...: 100%|█████████████████████████| 1/1 [05:50<00:00, 331.73s/ url]
Extraction completed...: 100%|████████████████| 1/1 [05:50<00:00, 350.17s/ file]
Extraction completed...: 100%|████████████████| 1/1 [05:50<00:00, 350.17s/ file]
Dl Size...: 100%|█████████████████████████| 1484/1484 [05:50<00:00,  4.24 MiB/s]

Dl Completed...: 100%|█████████████████████████| 1/1 [05:50<00:00, 350.18s/ url]
Dataset imagenette downloaded and prepared to /Users/john/tensorflow_datasets/imagenette/full-size-v2/0.1.0. Subsequent calls will reuse this data.

Many thanks
