Ray: [tune] OSError: File name too long

Created on 24 Jan 2018 · 9 Comments · Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): pip
  • Ray version: 0.3.0
  • Python version: 3.6.4
  • Exact command to reproduce: run ray as usual, but with a lot of parameters

Describe the problem

Passing a lot of parameters causes a "File name too long" OSError, which breaks Ray. This happens because Ray uses the config, serialized into a string, as part of the trial directory name. See the traceback below.

One solution could be to truncate the name and append a hash, so that the file name stays within a fixed maximum length.
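
The truncate-and-hash idea proposed above could look roughly like the following. This is only a sketch of the suggestion, not what Ray does: `shorten_name`, `max_len`, and `hash_len` are hypothetical names, and the limit of 200 is an arbitrary assumption chosen to leave room for any suffix added later.

```python
import hashlib


def shorten_name(name, max_len=200, hash_len=8):
    """Return `name` unchanged if it fits, else a truncated name plus a hash.

    Hypothetical helper: the hash is computed over the full original name, so
    two long names sharing the same leading characters still get different
    shortened names (barring a hash collision).
    """
    if len(name) <= max_len:
        return name
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:hash_len]
    keep = max_len - hash_len - 1  # reserve one character for the separator
    return name[:keep] + "_" + digest


long_name = "lab_trial_39_" + ",".join(
    "agent.0.param_{}=0.123456789".format(i) for i in range(20))
print(len(long_name), len(shorten_name(long_name)))
```

Note that this counts characters, while file systems usually limit bytes, so names containing non-ASCII characters would need a tighter limit.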

Source code / logs

Error starting runner, retrying: Traceback (most recent call last):
  File "/home/beast/miniconda3/envs/lab/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 143, in _launch_trial
    trial.start()
  File "/home/beast/miniconda3/envs/lab/lib/python3.6/site-packages/ray/tune/trial.py", line 112, in start
    self._setup_runner()
  File "/home/beast/miniconda3/envs/lab/lib/python3.6/site-packages/ray/tune/trial.py", line 298, in _setup_runner
    prefix=str(self), dir=self.local_dir)
  File "/home/beast/miniconda3/envs/lab/lib/python3.6/tempfile.py", line 368, in mkdtemp
    _os.mkdir(file, 0o700)
OSError: [Errno 36] File name too long: '/tmp/ray/dqn_cartpole/lab_trial_39_agent.0.algorithm.explore_anneal_epi=15.434090184839535,agent.0.algorithm.explore_var_end=0.2309529987089945,agent.0.algorithm.explore_var_start=2.558476661289705,agent.0.net.hid_layers=[16, 8],agent.0.net.hid_layers_activation=sigmoid7wl6xn1d'

All 9 comments

Thanks for posting this issue!

Can you post the config you're using to start the search? It would be very helpful.

Here you go:

import numpy as np
import random

config = {
  "agent.0.algorithm.explore_anneal_epi": lambda spec: np.random.randint(10, 60),
  "agent.0.algorithm.explore_var_start": lambda spec: np.random.uniform(1.0, 5.0),
  "agent.0.algorithm.explore_var_end": lambda spec: np.random.uniform(0.1, 1.0),
  "agent.0.net.hid_layers": lambda spec: random.choice([[16], [32], [64], [16, 8]]),
  "agent.0.net.hid_layers_activation": lambda spec: random.choice(["relu", "sigmoid"])
}

Thanks!

To give you an update, #1466, which fixes this issue, is now pending review, and our tests are currently broken due to a Gym version bump (to be addressed in #1471) - this should all hopefully be addressed by tomorrow.

Merged #1466 - the latest copy of master should address your issue. I'll close this for now; feel free to reopen if not resolved.

I'm seeing this same issue in 0.5.2@5eaf429c531e01c2956a2297ff0e5dd2e9660203 with the following setup:

import numpy as np
import ray
from ray import tune


def run_experiment_ray(variant, reporter):
    pass


def get_variant_spec():
    variant_spec = {
        'seed': lambda spec: np.random.randint(0, 1000),
        'distance_estimator_parameters': {
            'learning_rate': tune.grid_search(
                [1e-05, 1e-04, 1e-03, 1e-02, 1e-01]),
            'hidden_activation': tune.grid_search(['relu', 'tanh']),
        },
        'lambda_estimator_parameters': {
            'learning_rate': tune.grid_search(
                [1e-05, 1e-04, 1e-03, 1e-02, 1e-01]),
            'hidden_activation': tune.grid_search(['relu', 'tanh']),
        }
    }

    return variant_spec


def main():

    variant_spec = get_variant_spec()
    ray.init()

    tune.register_trainable(
        'experiment-runner', run_experiment_ray)
    tune.run_experiments({
        'experiment-name': {
            'run': 'experiment-runner',
            'config': variant_spec,
            'num_samples': 2,
            'trial_resources': {'cpu': 8},
            'local_dir': '~/ray_results',
            'upload_dir': 'gs://<bucket>/ray/results'
        }
    })


if __name__ == '__main__':
    main()

Here's the error which shows the filename causing the error:

Error starting runner, aborting!
Traceback (most recent call last):
  File "/home/kristian/github/hartikainen/ray/python/ray/tune/ray_trial_executor.py", line 116, in start_trial
    self._start_trial(trial, checkpoint_obj)
  File "/home/kristian/github/hartikainen/ray/python/ray/tune/ray_trial_executor.py", line 62, in _start_trial
    trial.runner = self._setup_runner(trial)
  File "/home/kristian/github/hartikainen/ray/python/ray/tune/ray_trial_executor.py", line 38, in _setup_runner
    trial.init_logger()
  File "/home/kristian/github/hartikainen/ray/python/ray/tune/trial.py", line 176, in init_logger
    dir=self.local_dir)
  File "/home/kristian/miniconda/envs/softlearning/lib/python3.6/tempfile.py", line 368, in mkdtemp
    _os.mkdir(file, 0o700)
OSError: [Errno 36] File name too long: '/home/kristian/ray_results/gcp-test-5/experiment-runner_12_hidden_activation=relu,learning_rate=0.0001,hidden_activation=tanh,learning_rate=1e-05,seed=575_2018-09-27_23-03-51vpjjm6b1'

I've verified that I can fix the problem for this particular case by limiting the MAX_LEN_IDENTIFIER here: https://github.com/ray-project/ray/blob/f372f48bf3b51bc4e6b51ad9691f71b4b9004462/python/ray/tune/trial.py#L175

Ultimately the max prefix length should be a function of the local_dir length.

Edit: Here's my system information.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Ray installed from (source or binary): Source
  • Ray version: 0.5.2@5eaf429c531e01c2956a2297ff0e5dd2e9660203
  • Python version: 3.6
  • Exact command to reproduce: See above

I looked into this a bit more carefully, and it turns out that the problem was due to my Ubuntu home directory being encrypted. The encryption restricts the max file name length to 143 characters. I guess the MAX_LEN_IDENTIFIER could somehow be adjusted based on the supported max file name length, but the problem would still persist when, e.g., running on a cloud machine without encryption and then syncing files to a local machine with encryption. The 256-character max file name length seems so standard that I don't think it's important to add an extra check to accommodate other cases.
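
For anyone who wants to check what their file system actually allows, here is a minimal sketch that asks the OS for the per-file-name limit of a directory and derives how long a `mkdtemp` prefix may be. `max_prefix_len` and the fallback value are hypothetical; `os.pathconf` is only available on Unix, and the 8-character suffix is a CPython implementation detail that matches the suffixes visible in the tracebacks above.

```python
import os

# CPython's tempfile appends 8 random characters to the prefix
# (implementation detail, not a documented guarantee).
MKDTEMP_SUFFIX_LEN = 8


def max_prefix_len(local_dir, fallback=255):
    """Return the longest mkdtemp prefix that should fit inside `local_dir`.

    Hypothetical helper. Falls back to `fallback` when the limit cannot be
    queried (missing directory, non-Unix platform, unsupported option).
    """
    try:
        name_max = os.pathconf(local_dir, "PC_NAME_MAX")
    except (AttributeError, ValueError, OSError):
        name_max = fallback
    if name_max <= 0:  # some systems report "no limit" as -1
        name_max = fallback
    return max(name_max - MKDTEMP_SUFFIX_LEN, 0)


print(max_prefix_len(os.path.expanduser("~")))
```

The limit reported here is in bytes and applies to the file system where the directory lives, so it says nothing about a different machine that the results are later synced to.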

@hartikainen How did you fix the issue? I saw that changing MAX_LEN_IDENTIFIER helps, but modifying the ray source on every installation seems impractical.

Hey @tom-doerr, it's been so long that I don't actually remember what I did exactly to fix this issue 😕 Maybe limiting the trial directory name with custom trial names could help? Let me know if that doesn't solve the issue for you!

Thank you so much! That solved it. :)
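
For later readers, a custom trial name along the lines suggested above might be built like this. It is a sketch under assumptions: `short_trial_name` is a hypothetical function, and the `trainable_name` and `trial_id` attributes are assumed to exist on the trial object. How the function is registered depends on the Tune version (some versions accept an argument such as `trial_name_creator`), so check the documentation for the version you have installed.

```python
from types import SimpleNamespace


def short_trial_name(trial, max_len=40):
    """Build a compact name that leaves the hyperparameter values out.

    Hypothetical helper: assumes `trial` exposes `trainable_name` and
    `trial_id`; the result is cut to at most `max_len` characters.
    """
    name = "{}_{}".format(trial.trainable_name, trial.trial_id)
    return name[:max_len]


# Stand-in object for illustration only; a real run would pass Tune's trial.
fake_trial = SimpleNamespace(
    trainable_name="experiment-runner", trial_id="12ab34cd")
print(short_trial_name(fake_trial))
```

Leaving the parameter values out of the name keeps its length independent of the size of the config; the values remain available in the files Tune writes for each trial.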
