Passing many parameters causes a "File name too long" OSError that breaks Ray. This happens because Ray serializes the config into a string and uses it in the trial directory name. See the traceback below.
A possible solution is to truncate the name and append a hash, so that the total length stays within the maximum file name length.
Error starting runner, retrying: Traceback (most recent call last):
File "/home/beast/miniconda3/envs/lab/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 143, in _launch_trial
trial.start()
File "/home/beast/miniconda3/envs/lab/lib/python3.6/site-packages/ray/tune/trial.py", line 112, in start
self._setup_runner()
File "/home/beast/miniconda3/envs/lab/lib/python3.6/site-packages/ray/tune/trial.py", line 298, in _setup_runner
prefix=str(self), dir=self.local_dir)
File "/home/beast/miniconda3/envs/lab/lib/python3.6/tempfile.py", line 368, in mkdtemp
_os.mkdir(file, 0o700)
OSError: [Errno 36] File name too long: '/tmp/ray/dqn_cartpole/lab_trial_39_agent.0.algorithm.explore_anneal_epi=15.434090184839535,agent.0.algorithm.explore_var_end=0.2309529987089945,agent.0.algorithm.explore_var_start=2.558476661289705,agent.0.net.hid_layers=[16, 8],agent.0.net.hid_layers_activation=sigmoid7wl6xn1d'
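The truncate-and-hash idea suggested above could look roughly like the sketch below. It is not the fix that was merged into Ray; `shorten_name`, `MAX_NAME_LEN`, and `hash_len` are hypothetical names, and the 255-byte limit is an assumption that does not hold on every filesystem.

```python
import hashlib

# Assumption: a 255-byte limit per path component (common, but not universal).
MAX_NAME_LEN = 255


def shorten_name(name: str, max_len: int = MAX_NAME_LEN, hash_len: int = 8) -> str:
    """Return `name` unchanged if it fits, else a truncated prefix plus a hash."""
    if max_len <= hash_len + 1:
        raise ValueError("max_len must leave room for the hash suffix")
    encoded = name.encode("utf-8")
    if len(encoded) <= max_len:
        return name
    # Hash the full original name so distinct long names stay distinguishable.
    digest = hashlib.sha1(encoded).hexdigest()[:hash_len]
    keep = max_len - hash_len - 1  # leave room for "_" plus the digest
    # Truncate on bytes; drop any character that was cut in half.
    prefix = encoded[:keep].decode("utf-8", errors="ignore")
    return f"{prefix}_{digest}"
```

Note that `tempfile.mkdtemp` appends its own random suffix (eight characters in the traceback above), so the budget passed as `max_len` for a prefix would need to be smaller than the filesystem limit by at least that much.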
Thanks for posting this issue!
Can you post the config you're using to start the search? It would be very helpful.
Here you go:
import numpy as np
import random
config = {
    "agent.0.algorithm.explore_anneal_epi": lambda spec: np.random.randint(10, 60),
    "agent.0.algorithm.explore_var_start": lambda spec: np.random.uniform(1.0, 5.0),
    "agent.0.algorithm.explore_var_end": lambda spec: np.random.uniform(0.1, 1.0),
    "agent.0.net.hid_layers": lambda spec: random.choice([[16], [32], [64], [16, 8]]),
    "agent.0.net.hid_layers_activation": lambda spec: random.choice(["relu", "sigmoid"])
}
Thanks!
To give you an update, #1466, which fixes this issue, is now pending review, and our tests are currently broken due to a Gym version bump (to be addressed in #1471) - this should all hopefully be addressed by tomorrow.
Merged #1466 - the latest copy of master should address your issue. I'll close this for now; feel free to reopen if not resolved.
I'm seeing this same issue in 0.5.2@5eaf429c531e01c2956a2297ff0e5dd2e9660203 with the following setup:
import numpy as np
import ray
from ray import tune

def run_experiment_ray(variant, reporter):
    pass


def get_variant_spec():
    variant_spec = {
        'seed': lambda spec: np.random.randint(0, 1000),
        'distance_estimator_parameters': {
            'learning_rate': tune.grid_search(
                [1e-05, 1e-04, 1e-03, 1e-02, 1e-01]),
            'hidden_activation': tune.grid_search(['relu', 'tanh']),
        },
        'lambda_estimator_parameters': {
            'learning_rate': tune.grid_search(
                [1e-05, 1e-04, 1e-03, 1e-02, 1e-01]),
            'hidden_activation': tune.grid_search(['relu', 'tanh']),
        }
    }
    return variant_spec


def main():
    variant_spec = get_variant_spec()
    ray.init()
    tune.register_trainable(
        'experiment-runner', run_experiment_ray)
    tune.run_experiments({
        'experiment-name': {
            'run': 'experiment-runner',
            'config': variant_spec,
            'num_samples': 2,
            'trial_resources': {'cpu': 8},
            'local_dir': '~/ray_results',
            'upload_dir': 'gs://<bucket>/ray/results'
        }
    })


if __name__ == '__main__':
    main()
Here's the error which shows the filename causing the error:
Error starting runner, aborting!
Traceback (most recent call last):
File "/home/kristian/github/hartikainen/ray/python/ray/tune/ray_trial_executor.py", line 116, in start_trial
self._start_trial(trial, checkpoint_obj)
File "/home/kristian/github/hartikainen/ray/python/ray/tune/ray_trial_executor.py", line 62, in _start_trial
trial.runner = self._setup_runner(trial)
File "/home/kristian/github/hartikainen/ray/python/ray/tune/ray_trial_executor.py", line 38, in _setup_runner
trial.init_logger()
File "/home/kristian/github/hartikainen/ray/python/ray/tune/trial.py", line 176, in init_logger
dir=self.local_dir)
File "/home/kristian/miniconda/envs/softlearning/lib/python3.6/tempfile.py", line 368, in mkdtemp
_os.mkdir(file, 0o700)
OSError: [Errno 36] File name too long: '/home/kristian/ray_results/gcp-test-5/experiment-runner_12_hidden_activation=relu,learning_rate=0.0001,hidden_activation=tanh,learning_rate=1e-05,seed=575_2018-09-27_23-03-51vpjjm6b1'
I've verified that I can fix the problem for this particular case by limiting the MAX_LEN_IDENTIFIER here: https://github.com/ray-project/ray/blob/f372f48bf3b51bc4e6b51ad9691f71b4b9004462/python/ray/tune/trial.py#L175
Ultimately the max prefix length should be a function of the local_dir length.
Edit: Here's my system information.
I looked into this more carefully, and it turns out the problem was caused by my Ubuntu home directory being encrypted. The encryption restricts the maximum file name length to 143 characters. MAX_LEN_IDENTIFIER could be adjusted based on the maximum file name length the filesystem supports, but the problem would still persist when, e.g., running on a cloud machine without encryption and then syncing files to a local machine with encryption. The 255-character maximum file name length is standard enough that I don't think it's important to add an extra check to accommodate other cases.
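For anyone who wants to check the limit on their own machine, here is a minimal sketch of querying the maximum file name length for the filesystem that holds a given directory. It assumes a POSIX system where `os.pathconf` is available; `max_name_length`, `prefix_budget`, and the fallback value are hypothetical, and this is not something Ray does.

```python
import os

FALLBACK_NAME_MAX = 255  # assumption: used when the limit cannot be queried
MKDTEMP_SUFFIX_LEN = 8   # random characters mkdtemp appends, as seen in the tracebacks


def max_name_length(directory: str) -> int:
    """Return the max file name length for `directory`, or a fallback value."""
    path = os.path.expanduser(directory)
    try:
        limit = os.pathconf(path, "PC_NAME_MAX")
    except (AttributeError, OSError, ValueError):
        # os.pathconf is missing (non-POSIX), the path is invalid,
        # or the option name is not supported on this platform.
        return FALLBACK_NAME_MAX
    # A non-positive result means no limit was reported.
    return limit if limit > 0 else FALLBACK_NAME_MAX


def prefix_budget(directory: str, reserved: int = MKDTEMP_SUFFIX_LEN) -> int:
    """Characters left for a directory-name prefix after reserving the suffix."""
    return max(0, max_name_length(directory) - reserved)
```

Calling `prefix_budget(os.path.expanduser('~/ray_results'))` would give an upper bound for the identifier on that filesystem. As noted above, this does not help when results are later synced to a machine with a stricter limit.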
@hartikainen How did you fix the issue? I saw that changing MAX_LEN_IDENTIFIER helps, but modifying the ray source on every installation seems impractical.
Hey @tom-doerr, it's been so long that I don't actually remember what I did exactly to fix this issue 😕 Maybe limiting the trial directory name with custom trial names could help? Let me know if that doesn't solve the issue for you!
Thank you so much! That solved it. :)
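For later readers, a rough sketch of the custom-name approach mentioned above: build a short directory name from the trial instead of the full serialized config. The function name is hypothetical, and it assumes the trial object exposes `trainable_name` and `trial_id` attributes. The exact hook for plugging it into Tune (shown only in a comment) varies between Ray versions, so treat that part as an assumption and check the documentation for your version.

```python
from types import SimpleNamespace


def short_trial_dirname(trial) -> str:
    """Build a short directory name that omits the serialized config.

    Assumes `trial` has `trainable_name` and `trial_id` attributes.
    """
    return f"{trial.trainable_name}_{trial.trial_id}"


# Stand-in object for illustration only; a real run would pass a Tune trial.
demo_trial = SimpleNamespace(trainable_name="experiment-runner", trial_id="a1b2c3")
print(short_trial_dirname(demo_trial))

# Assumed usage (parameter name may differ in your Ray version):
# tune.run(..., trial_dirname_creator=short_trial_dirname)
```

Because the name no longer includes the hyperparameters, the config for each trial has to be looked up from the files Tune writes inside the trial directory rather than read off the directory name.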