This is a feature request. Currently, Ray does not support experiments that request GPU resources if the head node doesn't have a GPU. A workaround is to put a GPU on the head node, but that is expensive because the head needs to run as a non-preemptible instance. I think https://github.com/ray-project/ray/pull/4600 was an attempt to fix this, but it doesn't work, and there's still no cheap workaround.
@hartikainen @ericl
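For concreteness, this is roughly the shape of the run that hits the problem (a minimal sketch; the trainable name, address, and config values are illustrative, not from our actual launcher):
import ray
from ray import tune

# Connect to the CPU-only head node (address is illustrative).
ray.init(redis_address="localhost:6379")

# Each trial asks for a slice of a GPU, but the head has none, so the
# autoscaler has to add a GPU worker before anything can be scheduled.
tune.run(
    "my_gpu_trainable",                          # hypothetical registered trainable
    resources_per_trial={"cpu": 2, "gpu": 0.5},
    num_samples=4,
    queue_trials=True,                           # queue trials instead of erroring while the cluster scales up
)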
Couple of other things I've tried:
ray start --head --num-gpus=0.01 --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
I get: Error: Invalid value for "--num-gpus": 0.01 is not a valid integer
--num-gpus=1 --num-cpus=1 doesn't work either. Tune reports:
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 2/1 CPUs, 0.5/1 GPUs
Memory usage on this node: 1.0/15.8 GB
and it looks like the autoscaler thinks nothing is running and therefore does not scale up.
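For what it's worth, a quick way to inspect what the scheduler has registered versus what it considers free (just a diagnostic sketch run from a driver on the head; the address is illustrative):
import ray

ray.init(redis_address="localhost:6379")

# Total resources the cluster has registered vs. what is currently free.
print("cluster resources:  ", ray.cluster_resources())
print("available resources:", ray.available_resources())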
Setting initial_workers > 1 together with ray start --num-gpus=1 --num-cpus=1 ... failed too. It's been a while since I ran these, so I don't remember exactly what the error was.
I tried --num-gpus=1 --num-cpus=1 with initial_workers > 0 as well and got this error:
2019-04-16 08:41:50,317 CRITICAL autoscaler.py:391 -- StandardAutoscaler: Too many errors, abort.
Traceback (most recent call last):
File "/env/lib/python3.5/site-packages/ray/monitor.py", line 389, in <module>
raise e
File "/env/lib/python3.5/site-packages/ray/monitor.py", line 379, in <module>
monitor.run()
File "/env/lib/python3.5/site-packages/ray/monitor.py", line 316, in run
self.autoscaler.update()
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 393, in update
raise e
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 385, in update
self._update()
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 406, in _update
self.log_info_string(nodes)
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 630, in log_info_string
logger.info("StandardAutoscaler: {}".format(self.info_string(nodes)))
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 646, in info_string
len(nodes), self.target_num_workers(), suffix)
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 518, in target_num_workers
cur_used = self.load_metrics.approx_workers_used()
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 193, in approx_workers_used
return self._info()["NumNodesUsed"]
File "/env/lib/python3.5/site-packages/ray/autoscaler/autoscaler.py", line 211, in _info
used = amount - avail_resources[resource_id]
KeyError: b'GPU'
This was with the nightly whl from 4/16/19 I think.
OK I think this actually works.
Here's a gist of a YAML and script to run:
https://gist.github.com/richardliaw/a4b1641135bfa1c6f50918de14eb6414
Run the following commands:
ray up sgd.yaml -y
ray exec sgd.yaml 'tail -n 100 -f /tmp/ray/session_*/logs/monitor*'
(Instead of this step, one option would be to check programmatically for when the worker node connects in your ray program; see the sketch below.)
ray submit mnist_pytorch_trainable.py --args="--redis-address='localhost:6379'"
I'll test this when I get the chance. Thanks!
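For the programmatic check mentioned above, something like this should work (a rough sketch; the polling loop, timeout, and function name are mine, not from the gist):
import time
import ray

ray.init(redis_address="localhost:6379")

def wait_for_gpu_workers(num_gpus=1, timeout_s=600):
    # Block until at least `num_gpus` GPUs are registered with the cluster.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if ray.cluster_resources().get("GPU", 0) >= num_gpus:
            return
        time.sleep(10)
    raise TimeoutError("GPU worker(s) did not join the cluster in time")

wait_for_gpu_workers()
# ...then kick off tune.run(...) once the GPU node is attached.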
Your script worked! I (and I think Kristian too?) was trying to get the non-GPU head to work with a single exec_cluster call rather than a two-stage create_or_update followed by exec_cluster. This is a good fix, but I think it's a little misleading that passing --start to ray submit doesn't have the same functionality as ray up followed by ray submit.
This still doesn't work because if all worker nodes die, Tune dies.
With #5900, I think this should work now. We can avoid the two-stage approach suggested above.
@stevenlin1111 can you try this out tomorrow? (after the wheels finish building). If it does work as intended, can you also close this issue?
Thanks!
Sorry, I forgot to update this issue after our meeting! It works if the instances aren't pre-empted, but for some reason it fails to recover if the worker is terminated. The trials recover fine if I use CPUs only, so I'm inclined to say that it's an issue specifically with GPU trial recovery.
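For context, recovery here goes through the standard Trainable checkpoint hooks; roughly what the trainable looks like (a trimmed, illustrative skeleton, not the real SequentialRayExperiment):
import os
import pickle

from ray.tune import Trainable

class SequentialRayExperimentSketch(Trainable):
    def _setup(self, config):
        self.state = {"step": 0}   # stand-in for the real algorithm/env state

    def _train(self):
        self.state["step"] += 1
        return {"step": self.state["step"]}

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "exp.pkl")
        with open(path, "wb") as f:
            pickle.dump(self.state, f)
        return path                # Tune syncs this file and hands it back to _restore

    def _restore(self, checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            self.state = pickle.load(f)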
Here are the Tune logs after some of the worker nodes were pre-empted. The bottom two trials were pre-empted early on and were re-queued, and I just pre-empted the top two.
== Status ==
Memory usage on this node: 1.0/7.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 4/20 CPUs, 1.0/1 GPUs, 0.0/105.13 GiB heap, 0.0/14.31 GiB objects
Number of trials: 4 ({'RUNNING': 2, 'PENDING': 2})
Result logdir: /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
| Trial name | status | loc | failures | error file | algo_variant/env | iter | total time (s) |
|----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------|
| SequentialRayExperiment_63c8aaf6 | RUNNING | ip-172-31-43-193:203 | 0 | | pendulum | 59 | 88.4573 |
| SequentialRayExperiment_63c8e4c6 | RUNNING | ip-172-31-43-193:206 | 0 | | pendulum | 53 | 79.1381 |
| SequentialRayExperiment_63c83472 | PENDING | ip-172-31-32-185:198 | 1 | /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c83472_2019-11-02_10-41-55s0a4f5gm/error_2019-11-02_10-53-20.txt | pendulum | 41 | 62.5881 |
| SequentialRayExperiment_63c87220 | PENDING | ip-172-31-32-185:207 | 2 | /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c87220_2019-11-02_10-49-56xvvi32wg/error_2019-11-02_10-57-28.txt | pendulum | 37 | 54.9184 |
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
2019-11-02 10:59:26,201 WARNING worker.py:1328 -- The node with client ID 9b961e61e30c2143da61b6fe9dedbed0ae06c79b has been marked dead because the monitor has missed too many heartbeats from it.
2019-11-02 10:59:33,255 WARNING worker.py:1328 -- The actor or task with ID fffffffffffffa35491e01000000 is infeasible and cannot currently be scheduled. It requires {GPU: 0.500000}, {CPU: 2.000000} for execution and {GPU: 0.500000}, {CPU: 2.000000} for placement, however there are no nodes in the cluster that can provide the requested resources. To resolve this issue, consider reducing the resource requests of this task or add nodes that can fit the task.
2019-11-02 11:01:21,084 ERROR trial_runner.py:492 -- Error processing event.
Traceback (most recent call last):
File "/env/lib/python3.5/site-packages/ray/tune/trial_runner.py", line 438, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/env/lib/python3.5/site-packages/ray/tune/ray_trial_executor.py", line 354, in fetch_result
result = ray.get(trial_future[0])
File "/env/lib/python3.5/site-packages/ray/worker.py", line 1890, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2019-11-02 11:01:21,084 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 11:01:21,086 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c8e4c6_2019-11-02_10-57-28suffxasc/ /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c8e4c6_2019-11-02_10-57-28suffxasc/
2019-11-02 11:01:21,092 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_63c8e4c6.
2019-11-02 11:01:21,092 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_63c8e4c6.
2019-11-02 11:01:21,093 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 11:01:21,094 DEBUG trial_runner.py:544 -- Notifying Scheduler and requeueing trial.
2019-11-02 11:01:21,108 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 11:01:21,110 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c8aaf6_2019-11-02_10-49-563tsiqajc/ /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c8aaf6_2019-11-02_10-49-563tsiqajc/
2019-11-02 11:01:21,115 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_63c8aaf6.
2019-11-02 11:01:21,115 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_63c8aaf6.
2019-11-02 11:01:21,115 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 11:01:21,115 DEBUG trial_runner.py:544 -- Notifying Scheduler and requeueing trial.
== Status ==
Memory usage on this node: 1.0/7.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 2/4 CPUs, 0.5/0 GPUs, 0.0/4.2 GiB heap, 0.0/1.46 GiB objects
Number of trials: 4 ({'RUNNING': 1, 'PENDING': 3})
Result logdir: /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
| Trial name | status | loc | failures | error file | algo_variant/env | iter | total time (s) |
|----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------|
| SequentialRayExperiment_63c8aaf6 | RUNNING | ip-172-31-43-193:203 | 0 | | pendulum | 59 | 88.4573 |
| SequentialRayExperiment_63c83472 | PENDING | ip-172-31-32-185:198 | 1 | /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c83472_2019-11-02_10-41-55s0a4f5gm/error_2019-11-02_10-53-20.txt | pendulum | 41 | 62.5881 |
| SequentialRayExperiment_63c87220 | PENDING | ip-172-31-32-185:207 | 2 | /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c87220_2019-11-02_10-49-56xvvi32wg/error_2019-11-02_10-57-28.txt | pendulum | 37 | 54.9184 |
| SequentialRayExperiment_63c8e4c6 | PENDING | ip-172-31-43-193:206 | 1 | /home/ubuntu/ray_results/aws-autoscaler-gpu-2019-11-02/SequentialRayExperiment_63c8e4c6_2019-11-02_10-57-28suffxasc/error_2019-11-02_11-01-21.txt | pendulum | 54 | 80.604 |
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
Traceback (most recent call last):
File "/home/steven/res/grailrl/launchers/ray/local_launch.py", line 109, in <module>
launch_local_experiment(**local_launch_variant)
File "/home/steven/res/grailrl/launchers/ray/local_launch.py", line 97, in launch_local_experiment
queue_trials=True,
File "/env/lib/python3.5/site-packages/ray/tune/tune.py", line 286, in run
runner.step()
File "/env/lib/python3.5/site-packages/ray/tune/trial_runner.py", line 350, in step
trial.config)))
ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 2 CPUs, 0.5 GPUs but the cluster has only 4 CPUs, 0 GPUs, 4.2 GiB heap, 1.46 GiB objects (1.0 node:172.31.37.150). Pass `queue_trials=True` in ray.tune.run() or on the command line to queue trials until the cluster scales up.
ssh: connect to host 172.31.43.193 port 22: No route to host
ssh: connect to host 172.31.43.193 port 22: No route to host
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
2019-11-02 11:01:25,720 INFO commands.py:110 -- teardown_cluster: Shutting down 1 nodes...
2019-11-02 11:01:25,721 INFO node_provider.py:343 -- AWSNodeProvider: terminating nodes ['i-0acfd363e0b70d4e1'] (spot nodes cannot be stopped, only terminated)
2019-11-02 11:01:26,940 INFO log_timer.py:21 -- teardown_cluster: done. [LogTimer=1220ms]
Shared connection to 34.238.161.173 closed.
Connection to 34.238.161.173 closed by remote host.
NodeUpdater: i-0f8ece1adbc0c35e8: Command failed:
ssh -tt -i /home/steven/.ssh/ray-autoscaler_us-east-1.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_6ed61d4b80/25a171c8d1/%C -o ControlPersist=10s [email protected] bash --login -c -i ''"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker exec ray-docker /bin/sh -c '"'"'"'"'"'"'"'"'python /home/steven/res/grailrl/launchers/ray/local_launch.py'"'"'"'"'"'"'"'"' ; docker exec ray-docker /bin/sh -c '"'"'"'"'"'"'"'"'ray stop; ray teardown ~/ray_bootstrap_config.yaml --yes --workers-only'"'"'"'"'"'"'"'"' ; sudo shutdown -h now'"'"''
The last autoscaler logs showed this:
2019-11-02 11:01:24,552 INFO autoscaler.py:736 -- StandardAutoscaler: 1/1 target nodes (0 pending) (1 updating)
2019-11-02 11:01:24,552 INFO autoscaler.py:737 -- LoadMetrics: MostDelayedHeartbeats={'172.31.37.150': 0.18326258659362793}, NodeIdleSeconds=Min=1201 Mean=1201 Max=1201, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/4.0 CPU, 0.0 GiB/4.51 GiB memory, 0.0/1.0 node:172.31.37.150, 0.0 GiB/1.57 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-11-02 11:01:24,553 INFO autoscaler.py:534 -- Ending bringup phase
2019-11-02 11:01:24,553 INFO autoscaler.py:736 -- StandardAutoscaler: 1/1 target nodes (0 pending) (1 updating)
2019-11-02 11:01:24,553 INFO autoscaler.py:737 -- LoadMetrics: MostDelayedHeartbeats={'172.31.37.150': 0.18359017372131348}, NodeIdleSeconds=Min=1201 Mean=1201 Max=1201, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/4.0 CPU, 0.0 GiB/4.51 GiB memory, 0.0/1.0 node:172.31.37.150, 0.0 GiB/1.57 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
The first failure log is this:
Traceback (most recent call last):
File "/env/lib/python3.5/site-packages/ray/tune/trial_runner.py", line 438, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/env/lib/python3.5/site-packages/ray/tune/ray_trial_executor.py", line 354, in fetch_result
result = ray.get(trial_future[0])
File "/env/lib/python3.5/site-packages/ray/worker.py", line 1890, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
And the second failure log is this:
SequentialRayExperiment_63c87220 (ip: 172.31.32.185) detected as stale. This is likely because the node was lost
The autoscaler and tune logs seem to indicate that the experiment terminated while the worker node was still in the process of restarting.
Reran the experiment with fewer trials. This time, the worker restarted successfully, but the trials all crashed after restoration.
Tune logs:
2019-11-02 13:05:22,614 WARNING worker.py:1328 -- The node with client ID 5d08bfa5770efefdb39e24449ee53fe33186e599 has been marked dead because the monitor has missed too many heartbeats from it.
2019-11-02 13:07:17,871 ERROR trial_runner.py:492 -- Error processing event.
Traceback (most recent call last):
File "/env/lib/python3.5/site-packages/ray/tune/trial_runner.py", line 438, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/env/lib/python3.5/site-packages/ray/tune/ray_trial_executor.py", line 354, in fetch_result
result = ray.get(trial_future[0])
File "/env/lib/python3.5/site-packages/ray/worker.py", line 1890, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2019-11-02 13:07:17,872 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:07:17,874 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/
2019-11-02 13:07:17,879 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c4086.
2019-11-02 13:07:17,880 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c4086.
2019-11-02 13:07:17,881 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:07:17,882 WARNING ray_trial_executor.py:469 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2019-11-02 13:07:17,883 INFO trial_runner.py:538 -- Attempting to recover trial state from last checkpoint.
2019-11-02 13:07:17,884 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:07:17,886 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:07:17,902 WARNING worker.py:1328 -- The actor or task with ID ffffffffffff3724d75b01000000 is infeasible and cannot currently be scheduled. It requires {CPU: 2.000000}, {GPU: 0.500000} for execution and {CPU: 2.000000}, {GPU: 0.500000} for placement, however there are no nodes in the cluster that can provide the requested resources. To resolve this issue, consider reducing the resource requests of this task or add nodes that can fit the task.
ssh: connect to host 172.31.36.185 port 22: No route to host
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
2019-11-02 13:20:19,364 WARNING util.py:133 -- The `get_current_ip` operation took 781.4718012809753 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:19,364 INFO logger.py:389 -- Syncing (blocking) results to 172.31.47.107
2019-11-02 13:20:19,364 WARNING syncer.py:159 -- Sync process still running but resetting anyways.
2019-11-02 13:20:19,364 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/ [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/
2019-11-02 13:20:20,442 WARNING util.py:133 -- The `sync_to_new_location` operation took 1.0784053802490234 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:22,214 WARNING util.py:133 -- The `restore_from_disk` operation took 1.7714855670928955 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:22,214 WARNING util.py:133 -- The `process_trial` operation took 784.3441452980042 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:22,219 DEBUG syncer.py:152 -- Running sync: aws s3 sync /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/ s3://steven.railrl/ray/aws-autoscaler-gpu-again-2019-11-02
2019-11-02 13:20:22,229 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:22,230 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/
2019-11-02 13:20:22,236 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:22,236 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:22,236 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:22,237 INFO trial_runner.py:538 -- Attempting to recover trial state from last checkpoint.
2019-11-02 13:20:22,239 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:22,241 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
ssh: connect to host 172.31.36.185 port 22: No route to host
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
2019-11-02 13:20:25,508 WARNING util.py:133 -- The `get_current_ip` operation took 3.25955867767334 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:25,508 INFO logger.py:389 -- Syncing (blocking) results to 172.31.47.107
2019-11-02 13:20:25,508 WARNING syncer.py:159 -- Sync process still running but resetting anyways.
2019-11-02 13:20:25,508 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/ [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/
2019-11-02 13:20:26,449 WARNING util.py:133 -- The `sync_to_new_location` operation took 0.9409334659576416 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:26,566 WARNING util.py:133 -- The `process_failed_trial` operation took 4.337299823760986 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:26,568 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:26,570 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/
2019-11-02 13:20:26,576 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:26,576 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:26,577 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:26,577 INFO trial_runner.py:538 -- Attempting to recover trial state from last checkpoint.
2019-11-02 13:20:26,578 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:26,580 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:31,065 WARNING util.py:133 -- The `get_current_ip` operation took 4.478132724761963 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:31,177 WARNING util.py:133 -- The `process_failed_trial` operation took 4.609180688858032 seconds to complete, which may be a performance bottleneck.
(pid=206, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=206, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=206, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=206, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=206, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=206, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=206, ip=172.31.47.107) 2019-11-02 13:20:19,372 INFO trainable.py:102 -- _setup took 95.413 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=206, ip=172.31.47.107) 2019-11-02 13:20:19,375 WARNING trainable.py:131 -- Getting current IP.
== Status ==
Memory usage on this node: 1.0/7.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 4/4 CPUs, 1.0/0 GPUs, 0.0/4.2 GiB heap, 0.0/1.42 GiB objects
Number of trials: 2 ({'RUNNING': 2})
Result logdir: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
| Trial name | status | loc | failures | error file | algo_variant/env | iter | total time (s) |
|----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------|
| SequentialRayExperiment_150c4086 | RUNNING | ip-172-31-36-185:200 | 1 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/error_2019-11-02_13-07-17.txt | pendulum | 50 | 72.497 |
| SequentialRayExperiment_150c7b46 | RUNNING | ip-172-31-36-185:202 | 0 | | pendulum | 57 | 85.0134 |
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
(pid=206, ip=172.31.47.107) 2019-11-02 13:20:22,225 INFO trainable.py:358 -- Restored from checkpoint: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/checkpoint_50/exp.pkl
(pid=206, ip=172.31.47.107) 2019-11-02 13:20:22,225 INFO trainable.py:365 -- Current state after restoring: {'_iteration': 50, '_timesteps_total': None, '_episodes_total': None, '_time_total': 72.49702072143555}
(pid=197, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=197, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=197, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=197, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=197, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=197, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=197, ip=172.31.47.107) 2019-11-02 13:20:25,518 WARNING trainable.py:131 -- Getting current IP.
(pid=197, ip=172.31.47.107) 2019-11-02 13:20:26,576 INFO trainable.py:358 -- Restored from checkpoint: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/checkpoint_50/exp.pkl
(pid=197, ip=172.31.47.107) 2019-11-02 13:20:26,576 INFO trainable.py:365 -- Current state after restoring: {'_time_total': 74.83066844940186, '_timesteps_total': None, '_episodes_total': None, '_iteration': 50}
(pid=205, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=205, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=205, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=205, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=205, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=205, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=205, ip=172.31.47.107) 2019-11-02 13:20:31,076 WARNING trainable.py:131 -- Getting current IP.
== Status ==
Memory usage on this node: 1.0/7.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 4/20 CPUs, 1.0/1 GPUs, 0.0/105.13 GiB heap, 0.0/14.26 GiB objects
Number of trials: 2 ({'RUNNING': 2})
Result logdir: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
| Trial name | status | loc | failures | error file | algo_variant/env | iter | total time (s) |
|----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------|
| SequentialRayExperiment_150c4086 | RUNNING | ip-172-31-36-185:200 | 1 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/error_2019-11-02_13-07-17.txt | pendulum | 50 | 72.497 |
| SequentialRayExperiment_150c7b46 | RUNNING | ip-172-31-36-185:202 | 2 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/error_2019-11-02_13-20-26.txt | pendulum | 50 | 74.8307 |
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
2019-11-02 13:20:31,182 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:31,184 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/
2019-11-02 13:20:31,189 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:31,190 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:31,190 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:31,190 INFO trial_runner.py:538 -- Attempting to recover trial state from last checkpoint.
2019-11-02 13:20:31,192 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:31,194 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:35,718 WARNING util.py:133 -- The `get_current_ip` operation took 4.517314910888672 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:35,835 WARNING util.py:133 -- The `process_failed_trial` operation took 4.652915954589844 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:35,843 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/
2019-11-02 13:20:35,847 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:35,849 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:35,850 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c7b46.
2019-11-02 13:20:35,853 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:35,854 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/
2019-11-02 13:20:35,859 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c4086.
2019-11-02 13:20:35,859 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c4086.
2019-11-02 13:20:35,860 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:35,860 INFO trial_runner.py:538 -- Attempting to recover trial state from last checkpoint.
2019-11-02 13:20:35,861 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:35,863 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:39,278 WARNING util.py:133 -- The `get_current_ip` operation took 3.4083802700042725 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:39,389 WARNING util.py:133 -- The `process_failed_trial` operation took 3.536538600921631 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:39,394 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:39,396 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/
2019-11-02 13:20:39,400 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c4086.
2019-11-02 13:20:39,401 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c4086.
2019-11-02 13:20:39,401 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:39,401 INFO trial_runner.py:538 -- Attempting to recover trial state from last checkpoint.
2019-11-02 13:20:39,403 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:39,406 WARNING syncer.py:148 -- Last sync is still in progress, skipping.
2019-11-02 13:20:42,699 WARNING util.py:133 -- The `get_current_ip` operation took 3.2852749824523926 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:42,812 WARNING util.py:133 -- The `process_failed_trial` operation took 3.418482542037964 seconds to complete, which may be a performance bottleneck.
2019-11-02 13:20:42,815 DEBUG syncer.py:152 -- Running sync: rsync -savz --rsync-path="sudo rsync" -e "ssh -i /home/ubuntu/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no" [email protected]:/home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/ /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/
2019-11-02 13:20:42,819 DEBUG trial_executor.py:55 -- Saving trial metadata.
2019-11-02 13:20:42,823 DEBUG ray_trial_executor.py:192 -- Destroying actor for trial SequentialRayExperiment_150c4086.
2019-11-02 13:20:42,824 DEBUG ray_trial_executor.py:248 -- Returning resources for Trial SequentialRayExperiment_150c4086.
2019-11-02 13:20:42,829 DEBUG syncer.py:152 -- Running sync: aws s3 sync /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/ s3://steven.railrl/ray/aws-autoscaler-gpu-again-2019-11-02
(pid=205, ip=172.31.47.107) 2019-11-02 13:20:31,188 INFO trainable.py:358 -- Restored from checkpoint: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/checkpoint_50/exp.pkl
(pid=205, ip=172.31.47.107) 2019-11-02 13:20:31,188 INFO trainable.py:365 -- Current state after restoring: {'_timesteps_total': None, '_episodes_total': None, '_iteration': 50, '_time_total': 74.83066844940186}
(pid=208, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=208, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=208, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=208, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=208, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=208, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=208, ip=172.31.47.107) 2019-11-02 13:20:35,728 WARNING trainable.py:131 -- Getting current IP.
(pid=208, ip=172.31.47.107) 2019-11-02 13:20:35,845 INFO trainable.py:358 -- Restored from checkpoint: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/checkpoint_50/exp.pkl
(pid=208, ip=172.31.47.107) 2019-11-02 13:20:35,845 INFO trainable.py:365 -- Current state after restoring: {'_episodes_total': None, '_time_total': 74.83066844940186, '_iteration': 50, '_timesteps_total': None}
(pid=204, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=204, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=204, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=204, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=204, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=204, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=204, ip=172.31.47.107) 2019-11-02 13:20:39,288 WARNING trainable.py:131 -- Getting current IP.
== Status ==
Memory usage on this node: 1.0/7.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 2/20 CPUs, 0.5/1 GPUs, 0.0/105.13 GiB heap, 0.0/14.26 GiB objects
Number of trials: 2 ({'RUNNING': 1, 'ERROR': 1})
Result logdir: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
| Trial name | status | loc | failures | error file | algo_variant/env | iter | total time (s) |
|----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------|
| SequentialRayExperiment_150c4086 | RUNNING | ip-172-31-36-185:200 | 2 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/error_2019-11-02_13-20-35.txt | pendulum | 50 | 72.497 |
| SequentialRayExperiment_150c7b46 | ERROR | ip-172-31-36-185:202 | 4 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/error_2019-11-02_13-20-35.txt | pendulum | 50 | 74.8307 |
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
(pid=204, ip=172.31.47.107) 2019-11-02 13:20:39,399 INFO trainable.py:358 -- Restored from checkpoint: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/checkpoint_50/exp.pkl
(pid=204, ip=172.31.47.107) 2019-11-02 13:20:39,400 INFO trainable.py:365 -- Current state after restoring: {'_timesteps_total': None, '_episodes_total': None, '_iteration': 50, '_time_total': 72.49702072143555}
(pid=209, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=209, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=209, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=209, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=209, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=209, ip=172.31.47.107) WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=209, ip=172.31.47.107) 2019-11-02 13:20:42,709 WARNING trainable.py:131 -- Getting current IP.
== Status ==
Memory usage on this node: 1.0/7.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/20 CPUs, 0.0/1 GPUs, 0.0/105.13 GiB heap, 0.0/14.26 GiB objects
Number of trials: 2 ({'ERROR': 2})
Result logdir: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
| Trial name | status | loc | failures | error file | algo_variant/env | iter | total time (s) |
|----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------|
| SequentialRayExperiment_150c4086 | ERROR | ip-172-31-36-185:200 | 4 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/error_2019-11-02_13-20-42.txt | pendulum | 50 | 72.497 |
| SequentialRayExperiment_150c7b46 | ERROR | ip-172-31-36-185:202 | 4 | /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c7b46_2019-11-02_13-03-22cy3louel/error_2019-11-02_13-20-35.txt | pendulum | 50 | 74.8307 |
+----------------------------------+----------+----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+--------+------------------+
(pid=209, ip=172.31.47.107) 2019-11-02 13:20:42,822 INFO trainable.py:358 -- Restored from checkpoint: /home/ubuntu/ray_results/aws-autoscaler-gpu-again-2019-11-02/SequentialRayExperiment_150c4086_2019-11-02_12-55-44mjtio_kt/checkpoint_50/exp.pkl
(pid=209, ip=172.31.47.107) 2019-11-02 13:20:42,822 INFO trainable.py:365 -- Current state after restoring: {'_time_total': 72.49702072143555, '_episodes_total': None, '_iteration': 50, '_timesteps_total': None}
Traceback (most recent call last):
File "/home/steven/res/grailrl/launchers/ray/local_launch.py", line 109, in <module>
launch_local_experiment(**local_launch_variant)
File "/home/steven/res/grailrl/launchers/ray/local_launch.py", line 97, in launch_local_experiment
queue_trials=True,
File "/env/lib/python3.5/site-packages/ray/tune/tune.py", line 309, in run
raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [SequentialRayExperiment_150c4086, SequentialRayExperiment_150c7b46])
2019-11-02 13:20:47,309 INFO commands.py:110 -- teardown_cluster: Shutting down 2 nodes...
2019-11-02 13:20:47,310 INFO node_provider.py:343 -- AWSNodeProvider: terminating nodes ['i-0e347900097c9b114', 'i-07319981fe4f47631'] (spot nodes cannot be stopped, only terminated)
2019-11-02 13:20:48,525 INFO log_timer.py:21 -- teardown_cluster: done. [LogTimer=1215ms]
Connection to 54.82.225.148 closed by remote host.
Shared connection to 54.82.225.148 closed.
NodeUpdater: i-0e837ccbb08c3630b: Command failed:
All failure logs after the first say the same thing:
SequentialRayExperiment_150c4086 (ip: 172.31.36.185) detected as stale. This is likely because the node was lost
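For anyone reproducing this: the trials only end up in ERROR after exhausting their retry budget, so the relevant Tune knobs on our side are roughly these (a hedged sketch; the values and the string trainable name are guesses, not our actual launch config):
from ray import tune

tune.run(
    "SequentialRayExperiment",                   # however the trainable is actually registered
    resources_per_trial={"cpu": 2, "gpu": 0.5},
    checkpoint_freq=10,   # checkpoint every N iterations (writes the exp.pkl files seen above; value is a guess)
    max_failures=3,       # how many times Tune retries a failed trial before marking it ERROR
    queue_trials=True,
)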