Reading docs on Automatic Cluster Setup, section "Docker":
This currently does not have GPU support
However, aws/example-gpu-docker.yaml, coupled with a merged PR, suggests GPU is supported.
Could you please clarify? If GPU is indeed supported in docker, would docker be a good choice from a stability perspective?
Hey @vtomenko - you might run into some bugs with the current state of the code, but @ijrsvt is hard at work on improving this right now.
@ijrsvt feel free to chime in
Hey @richardliaw - thanks for your reply, I'll give it a try and let you guys know what happened.
It is also great to hear you are working on this - I believe the combination of a stable autoscaler and docker with GPU support would be really beneficial for many ML scenarios because it cleanly separates concerns:
This way an ML engineer could just change the docker image reference in the cluster YAML and leave the remaining parts as is (as opposed to updating setup_commands for each different ML task). It would then be really easy to execute different training tasks on the same cluster, provided ray correctly handles the sequence of (a) updating the docker image reference in the cluster YAML and (b) ray up.
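To illustrate the workflow described above, here is a minimal Python sketch that swaps only the image reference in a cluster config held as a dict. The `docker.image` and `docker.container_name` keys follow the autoscaler YAML; the helper name, cluster name, and image names are hypothetical, and loading or saving the YAML file is left out.

```python
import copy


def with_docker_image(cluster_config, image):
    """Return a copy of the cluster config with only docker.image changed."""
    updated = copy.deepcopy(cluster_config)
    updated.setdefault("docker", {})["image"] = image
    return updated


# Hypothetical config: names and images are placeholders, not from the thread.
base = {
    "cluster_name": "ml-cluster",
    "docker": {
        "image": "my-registry/train-a:latest",
        "container_name": "ray_docker",
    },
    "setup_commands": [],
}

# Same cluster definition, different training image.
next_task = with_docker_image(base, "my-registry/train-b:latest")
```

After writing `next_task` back to the YAML file, the next step would be `ray up` on that file. Whether ray picks up the new image cleanly is exactly the open question raised above.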
@vtomenko Please do let us know what the result is for running it with GPUs! If you have any further questions or problems please let me know!
@vtomenko absolutely agree! Will keep you updated from our side too.
cc @anabranch @pcmoritz
I'm using ray version 0.8.5 and trying to test basic docker setup.
From autoscaler doc:
Docker: Specify docker image. This executes all commands on all nodes in the docker container, and opens all the necessary ports to support the Ray cluster. It will also automatically install Docker if Docker is not installed.
That does not seem to be the case. Creating the cluster with, say, the busybox docker image fails with this error:
Command 'docker' not found, but can be installed with:
Installing docker in the initialization_commands section also does not help, for the reasons described in #7519.
Here is what I have in initialization_commands:
[
"sudo apt update -y",
"sudo apt install docker.io -y",
"sudo usermod -aG docker $USER",
"sudo systemctl restart docker"
]
The commands above run fine. Next, ray tries to pull the image, and this is where it fails:
2020-06-18 06:48:14,835 INFO updater.py:264 -- NodeUpdater: i-025cd6bcda77a92aa: Running sudo usermod -aG docker $USER on 35.165.161.154...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-06-18 06:48:14,916 INFO updater.py:264 -- NodeUpdater: i-025cd6bcda77a92aa: Running sudo systemctl restart docker on 35.165.161.154...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-06-18 06:48:16,019 INFO updater.py:264 -- NodeUpdater: i-025cd6bcda77a92aa: Running docker pull busybox:latest on 35.165.161.154...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.40/images/create?fromImage=busybox&tag=latest: dial unix /var/run/docker.sock: connect: permission denied
2020-06-18 06:48:16,114 INFO log_timer.py:17 -- NodeUpdater: i-025cd6bcda77a92aa: Initialization commands completed [LogTimer=23330ms]
2020-06-18 06:48:16,114 INFO log_timer.py:17 -- NodeUpdater: i-025cd6bcda77a92aa: Applied config a8575940426af8a45f754a232f9474405ae13ee8 [LogTimer=51055ms]
2020-06-18 06:48:16,114 ERROR updater.py:359 -- NodeUpdater: i-025cd6bcda77a92aa: Error updating (Exit Status 1) ssh -i /home/ubuntu/.ssh/ray-autoscaler_1_us-west-2.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C -o ControlPersist=10s -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 [email protected] bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker pull busybox:latest'
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 362, in run
raise e
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 352, in run
self.do_update()
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 436, in do_update
self.cmd_runner.run(cmd)
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 274, in run
self.process_runner.check_call(final_cmd)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/ubuntu/.ssh/ray-autoscaler_1_us-west-2.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C', '-o', 'ControlPersist=10s', '-o', 'IdentitiesOnly=yes', '-o', 'ExitOnForwardFailure=yes', '-o', 'ServerAliveInterval=5', '-o', 'ServerAliveCountMax=3', '[email protected]', 'bash', '--login', '-c', '-i', "'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker pull busybox:latest'"]' returned non-zero exit status 1.
Note: when I manually attach to the cluster after this error and run docker pull busybox (as a non-root user), it works as expected.
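A possible explanation, consistent with the SSH session reuse mentioned later in this thread: group changes made by usermod only apply to sessions started afterwards, so a reused session may still lack the docker group while a fresh manual login has it. Below is a minimal Python sketch (not part of ray; the function names are hypothetical) that compares what the group database says with what the current process actually holds.

```python
import grp
import os
import pwd


def classify(configured, active):
    """Describe group membership from two booleans."""
    if configured and active:
        return "active"
    if configured:
        return "configured but not active in this session"
    return "not a member"


def docker_group_state(group_name="docker"):
    """Compare the group database with the current process's groups."""
    try:
        group = grp.getgrnam(group_name)
    except KeyError:
        return "group does not exist"
    user = pwd.getpwuid(os.getuid()).pw_name
    # Listed as a supplementary member in the group database?
    configured = user in group.gr_mem
    # Does this process (and so this login session) actually hold the group?
    active = group.gr_gid in os.getgroups()
    return classify(configured, active)
```

Calling `docker_group_state()` over the same session that ran usermod would be expected to report "configured but not active in this session", while a fresh login would report "active".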
Can you please point me to any currently working example on how to configure autoscaler with docker?
Hmm, let me try working on a solution to this--if you rerun after the first install, does it work?
Also what are you running on: a local cluster or public cloud?
I use AWS; I'm attaching the configuration I tried.
Rerun does not help:
2020-06-18 08:01:00,874 INFO updater.py:264 -- NodeUpdater: i-0374e72fcc8b8ea2a: Running docker inspect -f '{{.State.Running}}' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash on 54.69.9.155...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Template parsing error: template: :1:8: executing "" at <.State.Running>: map has no entry for key "State"
WARNING: Published ports are discarded when using host network mode
e2735eef81dcd0dd304fd131ab7c092c68556b8414e908449bd925fb08810bd7
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"bash\": executable file not found in $PATH": unknown.
2020-06-18 08:01:01,527 INFO log_timer.py:17 -- NodeUpdater: i-0374e72fcc8b8ea2a: Setup commands completed [LogTimer=653ms]
2020-06-18 08:01:01,527 INFO log_timer.py:17 -- NodeUpdater: i-0374e72fcc8b8ea2a: Applied config 9784c6630decbe7b8ad6cb2a34f12c3c48314063 [LogTimer=2745ms]
2020-06-18 08:01:01,527 ERROR updater.py:359 -- NodeUpdater: i-0374e72fcc8b8ea2a: Error updating (Exit Status 127) ssh -i /home/ubuntu/.ssh/ray-autoscaler_1_us-west-2.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C -o ControlPersist=10s -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 [email protected] bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f '"'"'{{.State.Running}}'"'"' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash'
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 362, in run
raise e
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 352, in run
self.do_update()
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 440, in do_update
self.cmd_runner.run(cmd)
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 274, in run
self.process_runner.check_call(final_cmd)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/ubuntu/.ssh/ray-autoscaler_1_us-west-2.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C', '-o', 'ControlPersist=10s', '-o', 'IdentitiesOnly=yes', '-o', 'ExitOnForwardFailure=yes', '-o', 'ServerAliveInterval=5', '-o', 'ServerAliveCountMax=3', '[email protected]', 'bash', '--login', '-c', '-i', '\'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f \'"\'"\'{{.State.Running}}\'"\'"\' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash\'']' returned non-zero exit status 127.
I'll try using that AMI! Thanks for sharing this!
The issue is that we reuse SSH sessions. I'm working on a PR now, but in the meantime there is a subpar workaround:
1) Run ray up, with the initialization commands you have above
2) Remove the initialization commands
3) Rerun ray up in about 15 seconds or so.
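Step 2 of the workaround can be sketched in Python on a config held as a dict. This is only an illustration: the helper name is hypothetical, and the two `ray up` runs and the wait between them are still done by hand.

```python
import copy


def without_initialization_commands(cluster_config):
    """Return a copy of the config with initialization_commands emptied."""
    updated = copy.deepcopy(cluster_config)
    updated["initialization_commands"] = []
    return updated


# First pass: the config with the docker install commands from this thread.
first_pass = {
    "initialization_commands": [
        "sudo apt update -y",
        "sudo apt install docker.io -y",
        "sudo usermod -aG docker $USER",
        "sudo systemctl restart docker",
    ],
}

# Second pass: the same config without them, for the rerun of ray up.
second_pass = without_initialization_commands(first_pass)
```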
Thanks @ijrsvt , I tried the workaround and it fails. The error seems to be similar to the one reported previously for rerun. Does the workaround work for you?
2020-06-18 22:45:57,181 INFO updater.py:264 -- NodeUpdater: i-0230e74237fc85cc4: Running docker inspect -f '{{.State.Running}}' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash on 54.187.139.187...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Template parsing error: template: :1:8: executing "" at <.State.Running>: map has no entry for key "State"
WARNING: Published ports are discarded when using host network mode
99e67b7ac1327a73ee5f81be226f96a3712d785d5475080c511c60bddbe40609
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"bash\": executable file not found in $PATH": unknown.
2020-06-18 22:45:57,860 INFO log_timer.py:17 -- NodeUpdater: i-0230e74237fc85cc4: Setup commands completed [LogTimer=679ms]
2020-06-18 22:45:57,861 INFO log_timer.py:17 -- NodeUpdater: i-0230e74237fc85cc4: Applied config 0318fe55a69a325f60716771d1a6f9f9a36457ec [LogTimer=3565ms]
2020-06-18 22:45:57,861 ERROR updater.py:359 -- NodeUpdater: i-0230e74237fc85cc4: Error updating (Exit Status 127) ssh -i /home/ubuntu/.ssh/ray-autoscaler.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C -o ControlPersist=10s -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 [email protected] bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f '"'"'{{.State.Running}}'"'"' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash'
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 362, in run
raise e
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 352, in run
self.do_update()
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 440, in do_update
self.cmd_runner.run(cmd)
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 274, in run
self.process_runner.check_call(final_cmd)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/ubuntu/.ssh/ray-autoscaler.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C', '-o', 'ControlPersist=10s', '-o', 'IdentitiesOnly=yes', '-o', 'ExitOnForwardFailure=yes', '-o', 'ServerAliveInterval=5', '-o', 'ServerAliveCountMax=3', '[email protected]', 'bash', '--login', '-c', '-i', '\'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f \'"\'"\'{{.State.Running}}\'"\'"\' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash\'']' returned non-zero exit status 127.
I think this error is because the docker image (busybox) in your YAML does not have bash installed.
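One way to check an image for bash before putting it in the YAML is sketched below in Python. It assumes the docker CLI is installed and that the image ships a POSIX `sh`; the function names are hypothetical.

```python
import subprocess


def bash_probe_command(image):
    """Build a docker command that exits 0 only if bash is on the image's PATH."""
    # Override the entrypoint so the probe does not depend on the image's default.
    return ["docker", "run", "--rm", "--entrypoint", "sh",
            image, "-c", "command -v bash"]


def image_has_bash(image):
    """Run the probe; raises FileNotFoundError if the docker CLI is missing."""
    result = subprocess.run(
        bash_probe_command(image),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For example, `image_has_bash("busybox:latest")` would be expected to return False, which matches the "exec: \"bash\": executable file not found" error in the log above.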
Makes sense, thank you @ijrsvt ! I will update the YAML and use the docker image I compiled to train ML models and see what happens.
Awesome--I have a PR out to automatically install docker if it is not preinstalled; this won't be a problem in the future!
Hey @ijrsvt , thank you for the PR - any chance it is going to be merged into master soon so that I can avoid the workaround?
@vtomenko I am closing it for the moment because it actually breaks parts of the autoscaler. It should be merged in about a week.
@vtomenko any updates on your end? How are things going?
@ijrsvt , I got it working with the following configuration for AWS:
Note: I also added openssh-client to the docker image because the autoscaler manages worker nodes via ssh from the docker container on the head node.
@vtomenko Great to hear!! The autoscaler _should_ work without needing to directly SSH into the containers on the worker nodes. Was this not happening for you?
@ijrsvt , my understanding is that the autoscaler SSHes into the worker nodes from the docker container on the head node. So if the container on the head node does not have ssh installed, the worker nodes are not configured properly. This looks like the same issue as #5496.
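A quick way to catch this before launching workers is to check, inside the head node container, that the needed executables are on PATH. A minimal Python sketch follows; the helper name is hypothetical, and the list of tools to check (here just `ssh`) is an assumption based on this thread rather than an official requirement list.

```python
import shutil


def missing_executables(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Run inside the head node container; an empty list means ssh is available.
missing = missing_executables(["ssh"])
```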
Oh--that totally makes sense--my bad. I misread that as openssh-server. The general requirements of the autoscaler are captured in this Dockerfile. (I'll make sure to add SSH to it.) Please let me know if you have any other problems/questions/feedback! I'm always happy to help!