This is a feature request. Currently, Ray does not support experiments that request GPU resources if the head node doesn't have a GPU. A workaround is to put a GPU on the head node, but that is expensive because the head needs to run as a non-preemptible instance. I think https://github.com/ray-project/ray/pull/4600 was an attempt to fix this, but it doesn't work, and there's still no cheap workaround.
@hartikainen @ericl
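For concreteness, this is roughly the shape of the run that hits the problem (a minimal sketch; the trainable name, address, and config values are illustrative, not from our actual launcher):
import ray
from ray import tune

# Connect to the CPU-only head node (address is illustrative).
ray.init(redis_address="localhost:6379")

# Each trial asks for a slice of a GPU, but the head has none, so the
# autoscaler has to add a GPU worker before anything can be scheduled.
tune.run(
    "my_gpu_trainable",                          # hypothetical registered trainable
    resources_per_trial={"cpu": 2, "gpu": 0.5},
    num_samples=4,
    queue_trials=True,                           # queue trials instead of erroring while the cluster scales up
)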
Couple of other things I've tried:
ray start --head --num-gpus=0.01 --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
I get: Error: Invalid value for "--num-gpus": 0.01 is not a valid integer
--num-gpus=1 --num-cpus=1 doesn't work either. Tune reports:
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 2/1 CPUs, 0.5/1 GPUs
Memory usage on this node: 1.0/15.8 GB
and it looks like the autoscaler thinks nothing is running and therefore does not scale up.
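For what it's worth, a quick way to inspect what the scheduler has registered versus what it considers free (just a diagnostic sketch run from a driver on the head; the address is illustrative):
import ray

ray.init(redis_address="localhost:6379")

# Total resources the cluster has registered vs. what is currently free.
print("cluster resources:  ", ray.cluster_resources())
print("available resources:", ray.available_resources())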
Setting initial_workers > 1 together with ray start --num-gpus=1 --num-cpus=1 ... failed too. It's been a while since I ran these, so I don't remember exactly what the error was.
I tried --num-gpus=1 --num-cpus=1 with initial_workers > 0 as well and got this error:
2019-04-16 08:41:50,317 CRITICAL autoscaler.py:391 -- StandardAutoscaler: Too many errors, abort.
Traceback (most recent call last):
File "/env/lib/python3.5/site-packages/ray/monitor.py", line 389, in <module>
raise e
File "/env/lib/python3.5/site-packages/ray/monitor.py", line 379, in <module>
monitor.run()
File "/env/lib/python3.5/site-packages/ray/monitor.py", line 316, in run
self.autoscaler.update()
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 393, in update
raise e
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 385, in update
self._update()
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 406, in _update
self.log_info_string(nodes)
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 630, in log_info_string
logger.info("StandardAutoscaler: {}".format(self.info_string(nodes)))
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 646, in info_string
len(nodes), self.target_num_workers(), suffix)
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 518, in target_num_workers
cur_used = self.load_metrics.approx_workers_used()
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 193, in approx_workers_used
return self._info()["NumNodesUsed"]
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 211, in _info
used = amount - avail_resources[resource_id]
KeyError: b'GPU'
This was with the nightly whl from 4/16/19 I think.
OK I think this actually works.
Here's a gist of a YAML and script to run:
https://gist.github.com/richardliaw/a4b1641135bfa1c6f50918de14eb6414
Run the following commands:
ray up sgd.yaml -y
ray exec sgd.yaml 'tail -n 100 -f /tmp/ray/session_*/logs/monitor*'
(Instead of this step, one option would be to check programmatically for when the worker node connects in your ray program; see the sketch below.)
ray submit mnist_pytorch_trainable.py --args="--redis-address='localhost:6379'"
I'll test this when I get the chance. Thanks!
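For the programmatic check mentioned above, something like this should work (a rough sketch; the polling loop, timeout, and function name are mine, not from the gist):
import time
import ray

ray.init(redis_address="localhost:6379")

def wait_for_gpu_workers(num_gpus=1, timeout_s=600):
    # Block until at least `num_gpus` GPUs are registered with the cluster.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if ray.cluster_resources().get("GPU", 0) >= num_gpus:
            return
        time.sleep(10)
    raise TimeoutError("GPU worker(s) did not join the cluster in time")

wait_for_gpu_workers()
# ...then kick off tune.run(...) once the GPU node is attached.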
Your script worked! I (and I think Kristian too?) was trying to get the non-GPU head to work with a single exec_cluster call rather than a two-stage create_or_update followed by exec_cluster. This is a good fix, but I think it's a little misleading that passing --start to ray submit doesn't have the same functionality as ray up followed by ray submit.
This still doesn't work because if all worker nodes die, Tune dies.
With #5900, I think this should work now. We can avoid the two-stage approach suggested above.
@stevenlin1111 can you try this out tomorrow? (after the wheels finish building). If it does work as intended, can you also close this issue?
Thanks!
Sorry, I forgot to update this issue after our meeting! It works if the instances aren't pre-empted, but for some reason it fails to recover if the worker is terminated. The trials recover fine if I use CPUs only, so I'm inclined to say that it's an issue specifically with GPU trial recovery.
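For context, recovery here goes through the standard Trainable checkpoint hooks; roughly what the trainable looks like (a trimmed, illustrative skeleton, not the real SequentialRayExperiment):
import os
import pickle

from ray.tune import Trainable

class SequentialRayExperimentSketch(Trainable):
    def _setup(self, config):
        self.state = {"step": 0}   # stand-in for the real algorithm/env state

    def _train(self):
        self.state["step"] += 1
        return {"step": self.state["step"]}

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "exp.pkl")
        with open(path, "wb") as f:
            pickle.dump(self.state, f)
        return path                # Tune syncs this file and hands it back to _restore

    def _restore(self, checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            self.state = pickle.load(f)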
Here are the Tune logs after some of the worker nodes were pre-empted. The bottom two trials were pre-empted early on and were re-queued, and I just pre-empted the top two.
== Status ==
Memory usage on this node: 1.0/7.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 4/20 CPUs, 1.0/1 GPUs, 0.0/105.13 GiB heap, 0.0/14.31 GiB objects
Number of trials: 4 ({'RUNNING': 2, 'PENDING': 2})
Result logdir: /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
| Trial name | status | loc | failures | error file | algo_variant/env | iter | total time (s) |
|----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------|
| SequentialRayExperiment_63c8aaf6 | RUNNING | ip-172-31-43-193:203 | 0 | | pendulum | 59 | 88.4573 |
| SequentialRayExperiment_63c8e4c6 | RUNNING | ip-172-31-43-193:206 | 0 | | pendulum | 53 | 79.1381 |
| SequentialRayExperiment_63c83472 | PENDING | ip-172-31-32-185:198 | 1 | /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c83472_2019-11-02_10-41-55s0a4f5gm/error_2019-11-02_10-53-20.txt | pendulum | 41 | 62.5881 |
| SequentialRayExperiment_63c87220 | PENDING | ip-172-31-32-185:207 | 2 | /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c87220_2019-11-02_10-49-56xvvi32wg/error_2019-11-02_10-57-28.txt | pendulum | 37 | 54.9184 |
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
2019-11-02 10:59:26,201 WARNING worker.py:1328 -- The node with client ID 9b961e61e30c2143da61b6fe9dedbed0ae06c79b has been marked dead because the monitor has missed too many heartbeats from it.
2019-11-02 10:59:33,255 WARNING worker.py:1328 -- The actor or task with ID fffffffffffffa35491e01000000 is infeasible and cannot currently be scheduled. It requires {GPU: 0.500000}, {CPU: 2.000000} for execution and {GPU: 0.500000}, {CPU: 2.000000} for placement, however there are no nodes in the cluster that can provide the requested resources. To resolve this issue, consider reducing the resource requests of this task or add nodes that can fit the task.
2019-11-02 11:01:21,084 ERROR trial_runner.py:492 -- Error processing event.
Traceback (most recent call last):
File "/env/lib/python3.5/site-packages/ray/tune/trial_runner.py", line 438, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/env/lib/python3.5/site-packages/ray/tune/ray_trial_executor.py", line 354, in fetch_result
result = ray.get(trial_future[0])
File "/env/lib/python3.5/site-packages/ray/worker.py", line 1890, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2019-11-02 11:01:21,084 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 11:01:21,086 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c8e4c6_2019-11-02_10-57-28suffxasc/ /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c8e4c6_2019-11-02_10-57-28suffxasc/
2019-11-02 11:01:21,092 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_63c8e4c6.
2019-11-02 11:01:21,092 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_63c8e4c6.
2019-11-02 11:01:21,093 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 11:01:21,094 DEBUG trial_runner.py:544 -- Notifying Scheduler and requeueing trial.
2019-11-02 11:01:21,108 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 11:01:21,110 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c8aaf6_2019-11-02_10-49-563tsiqajc/ /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c8aaf6_2019-11-02_10-49-563tsiqajc/
2019-11-02 11:01:21,115 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_63c8aaf6.
2019-11-02 11:01:21,115 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_63c8aaf6.
2019-11-02 11:01:21,115 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 11:01:21,115 DEBUG trial_runner.py:544 -- Notifying Scheduler and requeueing trial.
== Status ==
Memory usage on this node: 1.0/7.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 2/4 CPUs, 0.5/0 GPUs, 0.0/4.2 GiB heap, 0.0/1.46 GiB objects
Number of trials: 4 ({'RUNNING': 1, 'PENDING': 3})
Result logdir: /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
| Trial name | status | loc | failures | error file | algo_variant/env | iter | total time (s) |
|----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------|
| SequentialRayExperiment_63c8aaf6 | RUNNING | ip-172-31-43-193:203 | 0 | | pendulum | 59 | 88.4573 |
| SequentialRayExperiment_63c83472 | PENDING | ip-172-31-32-185:198 | 1 | /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c83472_2019-11-02_10-41-55s0a4f5gm/error_2019-11-02_10-53-20.txt | pendulum | 41 | 62.5881 |
| SequentialRayExperiment_63c87220 | PENDING | ip-172-31-32-185:207 | 2 | /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c87220_2019-11-02_10-49-56xvvi32wg/error_2019-11-02_10-57-28.txt | pendulum | 37 | 54.9184 |
| SequentialRayExperiment_63c8e4c6 | PENDING | ip-172-31-43-193:206 | 1 | /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c8e4c6_2019-11-02_10-57-28suffxasc/error_2019-11-02_11-01-21.txt | pendulum | 54 | 80.604 |
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
Traceback (most recent call last):
File "/home/steven/res/grailrl/launchers/ray/local_launch.py", line 109, in <module>
launch_local_experiment(**local_launch_variant)
File "/home/steven/res/grailrl/launchers/ray/local_launch.py", line 97, in launch_local_experiment
queue_trials=True,
File "/env/lib/python3.5/site-packages/ray/tune/tune.py", line 286, in run
runner.step()
File "/env/lib/python3.5/site-packages/ray/tune/trial_runner.py", line 350, in step
trial.config)))
ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 2 CPUs, 0.5 GPUs but the cluster has only 4 CPUs, 0 GPUs, 4.2 GiB heap, 1.46 GiB objects (1.0 node:172.31.37.150). Pass `queue_trials=True` in ray.tune.run() or on the command line to queue trials until the cluster scales up.
ssh: connect to host 172.31.43.193 port 22: No route to host
ssh: connect to host 172.31.43.193 port 22: No route to host
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
2019-11-02 11:01:25,720 INFO commands.py:110 -- teardown_cluster: Shutting down 1 nodes...
2019-11-02 11:01:25,721 INFO node_provider.py:343 -- AWSNodeProvider: terminating nodes ['i-0acfd363e0b70d4e1'] (spot nodes cannot be stopped, only terminated)
2019-11-02 11:01:26,940 INFO log_timer.py:21 -- teardown_cluster: done. [LogTimer=1220ms]
Shared connection to 34.238.161.173 closed.
Connection to 34.238.161.173 closed by remote host.
NodeUpdater: i-0f8ece1adbc0c35e8: Command failed:
ssh -tt -i /home/steven/.ssh/ray-autoscaler_us-east-1.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_6ed61d4b80/25a171c8d1/%C -o ControlPersist=10s [email protected] bash --login -c -i ''"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker exec ray-docker /bin/sh -c '"'"'"'"'"'"'"'"'python /home/steven/res/grailrl/launchers/ray/local_launch.py'"'"'"'"'"'"'"'"' ; docker exec ray-docker /bin/sh -c '"'"'"'"'"'"'"'"'ray stop; ray teardown ~/ray_bootstrap_config.yaml --yes --workers-only'"'"'"'"'"'"'"'"' ; sudo shutdown -h now'"'"''
The last autoscaler logs showed this:
2019-11-02 11:01:24,552 INFO autoscaler.py:736 -- StandardAutoscaler: 1/1 target nodes (0 pending) (1 updating)
2019-11-02 11:01:24,552 INFO autoscaler.py:737 -- LoadMetrics: MostDelayedHeartbeats={'172.31.37.150': 0.18326258659362793}, NodeIdleSeconds=Min=1201 Mean=1201 Max=1201, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/4.0 CPU, 0.0 GiB/4.51 GiB memory, 0.0/1.0 node:172.31.37.150, 0.0 GiB/1.57 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-11-02 11:01:24,553 INFO autoscaler.py:534 -- Ending bringup phase
2019-11-02 11:01:24,553 INFO autoscaler.py:736 -- StandardAutoscaler: 1/1 target nodes (0 pending) (1 updating)
2019-11-02 11:01:24,553 INFO autoscaler.py:737 -- LoadMetrics: MostDelayedHeartbeats={'172.31.37.150': 0.18359017372131348}, NodeIdleSeconds=Min=1201 Mean=1201 Max=1201, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/4.0 CPU, 0.0 GiB/4.51 GiB memory, 0.0/1.0 node:172.31.37.150, 0.0 GiB/1.57 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
The first failure log is this:
Traceback (most recent call last):
File "/env/lib/python3.5/site-packages/ray/tune/trial_runner.py", line 438, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/env/lib/python3.5/site-packages/ray/tune/ray_trial_executor.py", line 354, in fetch_result
result = ray.get(trial_future[0])
File "/env/lib/python3.5/site-packages/ray/worker.py", line 1890, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
And the second failure log is this:
SequentialRayExperiment_63c87220 (ip: 172.31.32.185) detected as stale. This is likely because the node was lost
The autoscaler and tune logs seem to indicate that the experiment terminated while the worker node was still in the process of restarting.
Reran the experiment with fewer trials. This time, the worker restarted successfully, but the trials all crashed after restoration.
Tune logs:
2019-11-02 13:05:22,614 WARNING worker.py:1328 -- The node with client ID 5d08bfa5770efefdb39e24449ee53fe33186e599 has been marked dead because the monitor has missed too many heartbeats from it.
2019-11-02 13:07:17,871 ERROR trial_runner.py:492 -- Error processing event.
Traceback (most recent call last):
File "/env/lib/python3.5/site-packages/ray/tune/trial_runner.py", line 438, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/env/lib/python3.5/site-packages/ray/tune/ray_trial_executor.py", line 354, in fetch_result
result = ray.get(trial_future[0])
File "/env/lib/python3.5/site-packages/ray/worker.py", line 1890, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2019-11-02 13:07:17,872 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:07:17,874 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/
2019-11-02 13:07:17,879 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c4086.
2019-11-02 13:07:17,880 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c4086.
2019-11-02 13:07:17,881 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:07:17,882 WARNING ray_trial_executor.py:469 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2019-11-02 13:07:17,883 INFO trial_runner.py:538 -- Attempting to recover trial state from last checkpoint.
2019-11-02 13:07:17,884 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:07:17,886 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:07:17,902 WARNING worker.py:1328 -- The actor or task with ID ffffffffffff3724d75b01000000 is infeasible and cannot currently be scheduled. It requires {CPU: 2.000000}, {GPU: 0.500000} for execution and {CPU: 2.000000}, {GPU: 0.500000} for placement, however there are no nodes in the cluster that can provide the requested resources. To resolve this issue, consider reducing the resource requests of this task or add nodes that can fit the task.
ssh: connect to host 172.31.36.185 port 22: No route to host
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
2019-11-02 13:20:19,364 WARNING util.py:133 -- The `get_current_ip` operation took 781.4718012809753 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:19,364 INFO logger.py:389 -- Syncing (blocking) results to 172.31.47.107
2019-11-02 13:20:19,364 WARNING syncer.py:159 -- Sync process still running but resetting anyways.
2019-11-02 13:20:19,364 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/ [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/
2019-11-02 13:20:20,442 WARNING util.py:133 -- The `sync_to_new_location` operation took 1.0784053802490234 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:22,214 WARNING util.py:133 -- The `restore_from_disk` operation took 1.7714855670928955 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:22,214 WARNING util.py:133 -- The `process_trial` operation took 784.3441452980042 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:22,219 DEBUG syncer.py:152 -- Running sync: aws s3 sync /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/ s3://steven.railrl/ray/aws-autoscaler-gpu-again-2019-11-02
2019-11-02 13:20:22,229 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:22,230 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/
2019-11-02 13:20:22,236 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:22,236 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:22,236 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:22,237 INFO trial_runner.py:538 -- Attempting to recover trial state from last checkpoint.
2019-11-02 13:20:22,239 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:22,241 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
ssh: connect to host 172.31.36.185 port 22: No route to host
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
2019-11-02 13:20:25,508 WARNING util.py:133 -- The `get_current_ip` operation took 3.25955867767334 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:25,508 INFO logger.py:389 -- Syncing (blocking) results to 172.31.47.107
2019-11-02 13:20:25,508 WARNING syncer.py:159 -- Sync process still running but resetting anyways.
2019-11-02 13:20:25,508 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/ [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/
2019-11-02 13:20:26,449 WARNING util.py:133 -- The `sync_to_new_location` operation took 0.9409334659576416 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:26,566 WARNING util.py:133 -- The `process_failed_trial` operation took 4.337299823760986 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:26,568 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:26,570 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/
2019-11-02 13:20:26,576 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:26,576 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:26,577 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:26,577 INFO trial_runner.py:538 -- Attempting to recover trial state from last checkpoint.
2019-11-02 13:20:26,578 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:26,580 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:31,065 WARNING util.py:133 -- The `get_current_ip` operation took 4.478132724761963 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:31,177 WARNING util.py:133 -- The `process_failed_trial` operation took 4.609180688858032 seconds to complete, which may be a performance bottleneck.
(pid=206, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=206, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=206, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=206, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=206, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=206, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=206, ip=172.31.47.107) 2019-11-02 13:20:19,372 INFO trainable.py:102 -- _setup took 95.413 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=206, ip=172.31.47.107) 2019-11-02 13:20:19,375 WARNING trainable.py:131 -- Getting current IP.
== Status ==
Memory usage on this node: 1.0/7.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 4/4 CPUs, 1.0/0 GPUs, 0.0/4.2 GiB heap, 0.0/1.42 GiB objects
Number of trials: 2 ({'RUNNING': 2})
Result logdir: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
| Trial name | status | loc | failures | error file | algo_variant/env | iter | total time (s) |
|----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------|
| SequentialRayExperiment_150c4086 | RUNNING | ip-172-31-36-185:200 | 1 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/error_2019-11-02_13-07-17.txt | pendulum | 50 | 72.497 |
| SequentialRayExperiment_150c7b46 | RUNNING | ip-172-31-36-185:202 | 0 | | pendulum | 57 | 85.0134 |
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
(pid=206, ip=172.31.47.107) 2019-11-02 13:20:22,225 INFO trainable.py:358 -- Restored from checkpoint: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/checkpoint_50/exp.pkl
(pid=206, ip=172.31.47.107) 2019-11-02 13:20:22,225 INFO trainable.py:365 -- Current state after restoring: {'_iteration': 50, '_timesteps_total': None, '_episodes_total': None, '_time_total': 72.49702072143555}
(pid=197, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=197, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=197, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=197, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=197, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=197, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=197, ip=172.31.47.107) 2019-11-02 13:20:25,518 WARNING trainable.py:131 -- Getting current IP.
(pid=197, ip=172.31.47.107) 2019-11-02 13:20:26,576 INFO trainable.py:358 -- Restored from checkpoint: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/checkpoint_50/exp.pkl
(pid=197, ip=172.31.47.107) 2019-11-02 13:20:26,576 INFO trainable.py:365 -- Current state after restoring: {'_time_total': 74.83066844940186, '_timesteps_total': None, '_episodes_total': None, '_iteration': 50}
(pid=205, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=205, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=205, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=205, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=205, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=205, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=205, ip=172.31.47.107) 2019-11-02 13:20:31,076 WARNING trainable.py:131 -- Getting current IP.
== Status ==
Memory usage on this node: 1.0/7.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 4/20 CPUs, 1.0/1 GPUs, 0.0/105.13 GiB heap, 0.0/14.26 GiB objects
Number of trials: 2 ({'RUNNING': 2})
Result logdir: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
| Trial name | status | loc | failures | error file | algo_variant/env | iter | total time (s) |
|----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------|
| SequentialRayExperiment_150c4086 | RUNNING | ip-172-31-36-185:200 | 1 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/error_2019-11-02_13-07-17.txt | pendulum | 50 | 72.497 |
| SequentialRayExperiment_150c7b46 | RUNNING | ip-172-31-36-185:202 | 2 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/error_2019-11-02_13-20-26.txt | pendulum | 50 | 74.8307 |
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
2019-11-02 13:20:31,182 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:31,184 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/
2019-11-02 13:20:31,189 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:31,190 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:31,190 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:31,190 INFO trial_runner.py:538 -- Attempting to recover trial state from last checkpoint.
2019-11-02 13:20:31,192 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:31,194 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:35,718 WARNING util.py:133 -- The `get_current_ip` operation took 4.517314910888672 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:35,835 WARNING util.py:133 -- The `process_failed_trial` operation took 4.652915954589844 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:35,843 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/
2019-11-02 13:20:35,847 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:35,849 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:35,850 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:35,853 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:35,854 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/
2019-11-02 13:20:35,859 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c4086.
2019-11-02 13:20:35,859 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c4086.
2019-11-02 13:20:35,860 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:35,860 INFO trial_runner.py:538 -- Attempting to recover trial state from last checkpoint.
2019-11-02 13:20:35,861 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:35,863 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:39,278 WARNING util.py:133 -- The `get_current_ip` operation took 3.4083802700042725 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:39,389 WARNING util.py:133 -- The `process_failed_trial` operation took 3.536538600921631 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:39,394 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:39,396 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/
2019-11-02 13:20:39,400 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c4086.
2019-11-02 13:20:39,401 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c4086.
2019-11-02 13:20:39,401 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:39,401 INFO trial_runner.py:538 -- Attempting to recover trial state from last checkpoint.
2019-11-02 13:20:39,403 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:39,406 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:42,699 WARNING util.py:133 -- The `get_current_ip` operation took 3.2852749824523926 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:42,812 WARNING util.py:133 -- The `process_failed_trial` operation took 3.418482542037964 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:42,815 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/
2019-11-02 13:20:42,819 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:42,823 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c4086.
2019-11-02 13:20:42,824 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c4086.
2019-11-02 13:20:42,829 DEBUG syncer.py:152 -- Running sync: aws s3 sync /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/ s3://steven.railrl/ray/aws-autoscaler-gpu-again-2019-11-02
(pid=205, ip=172.31.47.107) 2019-11-02 13:20:31,188 INFO trainable.py:358 -- Restored from checkpoint: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/checkpoint_50/exp.pkl
(pid=205, ip=172.31.47.107) 2019-11-02 13:20:31,188 INFO trainable.py:365 -- Current state after restoring: {'_timesteps_total': None, '_episodes_total': None, '_iteration': 50, '_time_total': 74.83066844940186}
(pid=208, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=208, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=208, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=208, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=208, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=208, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=208, ip=172.31.47.107) 2019-11-02 13:20:35,728 WARNING trainable.py:131 -- Getting current IP.
(pid=208, ip=172.31.47.107) 2019-11-02 13:20:35,845 INFO trainable.py:358 -- Restored from checkpoint: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/checkpoint_50/exp.pkl
(pid=208, ip=172.31.47.107) 2019-11-02 13:20:35,845 INFO trainable.py:365 -- Current state after restoring: {'_episodes_total': None, '_time_total': 74.83066844940186, '_iteration': 50, '_timesteps_total': None}
(pid=204, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=204, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=204, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=204, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=204, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=204, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=204, ip=172.31.47.107) 2019-11-02 13:20:39,288 WARNING trainable.py:131 -- Getting current IP.
== Status ==
Memory usage on this node: 1.0/7.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 2/20 CPUs, 0.5/1 GPUs, 0.0/105.13 GiB heap, 0.0/14.26 GiB objects
Number of trials: 2 ({'RUNNING': 1, 'ERROR': 1})
Result logdir: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
| Trial name | status | loc | failures | error file | algo_variant/env | iter | total time (s) |
|----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------|
| SequentialRayExperiment_150c4086 | RUNNING | ip-172-31-36-185:200 | 2 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/error_2019-11-02_13-20-35.txt | pendulum | 50 | 72.497 |
| SequentialRayExperiment_150c7b46 | ERROR | ip-172-31-36-185:202 | 4 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/error_2019-11-02_13-20-35.txt | pendulum | 50 | 74.8307 |
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
(pid=204, ip=172.31.47.107) 2019-11-02 13:20:39,399 INFO trainable.py:358 -- Restored from checkpoint: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/checkpoint_50/exp.pkl
(pid=204, ip=172.31.47.107) 2019-11-02 13:20:39,400 INFO trainable.py:365 -- Current state after restoring: {'_timesteps_total': None, '_episodes_total': None, '_iteration': 50, '_time_total': 72.49702072143555}
(pid=209, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=209, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=209, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=209, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=209, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=209, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=209, ip=172.31.47.107) 2019-11-02 13:20:42,709 WARNING trainable.py:131 -- Getting current IP.
== Status ==
Memory usage on this node: 1.0/7.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/20 CPUs, 0.0/1 GPUs, 0.0/105.13 GiB heap, 0.0/14.26 GiB objects
Number of trials: 2 ({'ERROR': 2})
Result logdir: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
| Trial name | status | loc | failures | error file | algo_variant/env | iter | total time (s) |
|----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------|
| SequentialRayExperiment_150c4086 | ERROR | ip-172-31-36-185:200 | 4 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/error_2019-11-02_13-20-42.txt | pendulum | 50 | 72.497 |
| SequentialRayExperiment_150c7b46 | ERROR | ip-172-31-36-185:202 | 4 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/error_2019-11-02_13-20-35.txt | pendulum | 50 | 74.8307 |
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
(pid=209, ip=172.31.47.107) 2019-11-02 13:20:42,822 INFO trainable.py:358 -- Restored from checkpoint: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/checkpoint_50/exp.pkl
(pid=209, ip=172.31.47.107) 2019-11-02 13:20:42,822 INFO trainable.py:365 -- Current state after restoring: {'_time_total': 72.49702072143555, '_episodes_total': None, '_iteration': 50, '_timesteps_total': None}
Traceback (most recent call last):
File "/home/steven/res/grailrl/launchers/ray/local_launch.py", line 109, in <module>
launch_local_experiment(**local_launch_variant)
File "/home/steven/res/grailrl/launchers/ray/local_launch.py", line 97, in launch_local_experiment
queue_trials=True,
File "/env/lib/python3.5/site-packages/ray/tune/tune.py", line 309, in run
raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [SequentialRayExperiment_150c4086, SequentialRayExperiment_150c7b46])
2019-11-02 13:20:47,309 INFO commands.py:110 -- teardown_cluster: Shutting down 2 nodes...
2019-11-02 13:20:47,310 INFO node_provider.py:343 -- AWSNodeProvider: terminating nodes ['i-0e347900097c9b114', 'i-07319981fe4f47631'] (spot nodes cannot be stopped, only terminated)
2019-11-02 13:20:48,525 INFO log_timer.py:21 -- teardown_cluster: done. [LogTimer=1215ms]
Connection to 54.82.225.148 closed by remote host.
Shared connection to 54.82.225.148 closed.
NodeUpdater: i-0e837ccbb08c3630b: Command failed:
All failure logs after the first say the same thing:
SequentialRayExperiment_150c4086 (ip: 172.31.36.185) detected as stale. This is likely because the node was lost
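For anyone reproducing this: the trials only end up in ERROR after exhausting their retry budget, so the relevant Tune knobs on our side are roughly these (a hedged sketch; the values and the string trainable name are guesses, not our actual launch config):
from ray import tune

tune.run(
    "SequentialRayExperiment",                   # however the trainable is actually registered
    resources_per_trial={"cpu": 2, "gpu": 0.5},
    checkpoint_freq=10,   # checkpoint every N iterations (writes the exp.pkl files seen above; value is a guess)
    max_failures=3,       # how many times Tune retries a failed trial before marking it ERROR
    queue_trials=True,
)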