I'm running the following command (on our Jenkins workers):
python test/jenkins_tests/multi_node_docker_test.py --num-nodes=5 --num-redis-shards=10 --test-script=/ray/test/jenkins_tests/multi_node_tests/test_0.py
taken from
Occasionally, it will hang, with output like the following:
Starting head node with command:['docker', 'run', '-d', '--shm-size=1G', 'f9636310fd09eed9bce58c174320749526b394816ee15b2f9c674d3c5124aee4', 'ray', 'start', '--head', '--block', '--redis-port=6379', '--num-redis-shards=10', '--num-cpus=10', '--num-gpus=0', '--no-ui']
Starting worker node with command:['docker', 'run', '-d', '--shm-size=1G', '--shm-size=1G', 'f9636310fd09eed9bce58c174320749526b394816ee15b2f9c674d3c5124aee4', 'ray', 'start', '--block', '--redis-address=172.17.0.2:6379', '--num-cpus=10', '--num-gpus=0']
Starting worker node with command:['docker', 'run', '-d', '--shm-size=1G', '--shm-size=1G', 'f9636310fd09eed9bce58c174320749526b394816ee15b2f9c674d3c5124aee4', 'ray', 'start', '--block', '--redis-address=172.17.0.2:6379', '--num-cpus=10', '--num-gpus=0']
Starting worker node with command:['docker', 'run', '-d', '--shm-size=1G', '--shm-size=1G', 'f9636310fd09eed9bce58c174320749526b394816ee15b2f9c674d3c5124aee4', 'ray', 'start', '--block', '--redis-address=172.17.0.2:6379', '--num-cpus=10', '--num-gpus=0']
Starting worker node with command:['docker', 'run', '-d', '--shm-size=1G', '--shm-size=1G', 'f9636310fd09eed9bce58c174320749526b394816ee15b2f9c674d3c5124aee4', 'ray', 'start', '--block', '--redis-address=172.17.0.2:6379', '--num-cpus=10', '--num-gpus=0']
Starting driver with command /ray/test/jenkins_tests/multi_node_tests/test_0.py.
Running docker ps shows:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b408155b0153 f9636310fd09 "ray start --block..." 21 minutes ago Up 21 minutes cranky_kowalevski
f295908861d7 f9636310fd09 "ray start --block..." 21 minutes ago Up 21 minutes elastic_franklin
a37bc1714877 f9636310fd09 "ray start --block..." 21 minutes ago Up 21 minutes optimistic_jones
a4891cbb597b f9636310fd09 "ray start --block..." 21 minutes ago Up 21 minutes objective_goodall
43b74949afb7 f9636310fd09 "ray start --head ..." 21 minutes ago Up 21 minutes awesome_lamarr
Attaching to the head node with docker exec -t -i 43b74949afb7 /bin/bash and inspecting /tmp/raylogs, the only interesting thing I see is
root@43b74949afb7:/tmp/raylogs# cat global_scheduler-0199444623.err
[ERROR] (/ray/src/global_scheduler/global_scheduler_algorithm.cc:190: errno: Operation now in progress) Infeasible task. No nodes satisfy hard constraints for task = 108d1007838a4873edc1a4a17868e092d6b0e950
[ERROR] (/ray/src/global_scheduler/global_scheduler_algorithm.cc:190: errno: Operation now in progress) Infeasible task. No nodes satisfy hard constraints for task = 81fcdb52b4ba5f6df6567c5fecf91e8a121958e6
[ERROR] (/ray/src/global_scheduler/global_scheduler_algorithm.cc:190: errno: Operation now in progress) Infeasible task. No nodes satisfy hard constraints for task = 01eed3d3720a92589b14b7bc9a0abb7d18a88551
That isn't obviously a problem. It just means that we submitted some tasks to the global scheduler before it was aware of any local schedulers with any available workers.
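(As a side check, one way to confirm that the local schedulers did eventually register is to poll the client table from a driver. This is only a minimal sketch; the "ClientType" key and the "local_scheduler" value are assumptions about the client_table() layout in this Ray version.)

import time
import ray

ray.init(redis_address="172.17.0.2:6379")

def num_local_schedulers():
    # Count registered local schedulers. The "ClientType" field name and the
    # "local_scheduler" value are assumed names for this Ray version.
    clients = ray.global_state.client_table()
    return sum(1 for entries in clients.values()
               for entry in entries
               if entry.get("ClientType") == "local_scheduler")

# Poll for up to a minute; with --num-nodes=5 we expect 5 local schedulers.
for _ in range(60):
    if num_local_schedulers() >= 5:
        break
    time.sleep(1)
print("local schedulers registered:", num_local_schedulers())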
To inspect the task table, I ran:
import ray
ray.init(redis_address="172.17.0.2:6379")
tt = ray.global_state.task_table()
states = [v['State'] for k, v in tt.items()]
[(s, states.count(s)) for s in set(states)]
The last command printed
[(16, 980), (1, 20), (8, 2)]
State 8 is RUNNING. The two tasks in that state are probably the two driver tasks (one for the actual test and one for the driver that I just attached). 980 tasks are in state 16, which is DONE, and only 20 are in state 1, which is WAITING, meaning that they have been pushed to the global scheduler and we are waiting for the global scheduler to assign them to a local scheduler.
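For readability, here is a small helper that prints the same breakdown with named states, continuing the session above; the numeric codes (1 = WAITING, 8 = RUNNING, 16 = DONE) are just the ones discussed here, not the full task state enum.

from collections import Counter

# Named versions of the state codes discussed above.
STATE_NAMES = {1: "WAITING", 8: "RUNNING", 16: "DONE"}

counts = Counter(v['State'] for v in tt.values())
print({STATE_NAMES.get(s, s): n for s, n in counts.items()})
# With the hang reproduced, this prints something like
# {'DONE': 980, 'WAITING': 20, 'RUNNING': 2}.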
Printing the parent counters of the waiting tasks:
missed_tasks = [v for k, v in tt.items() if v['State'] == 1]
parent_counters = [t['TaskSpec']['ParentCounter'] for t in missed_tasks]
parent_counters.sort()
print(parent_counters)
The last line prints
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
So this suggests that the first 20 tasks didn't get processed by the global scheduler.
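As a quick sanity check on that conclusion (continuing the same session), the counters of the waiting tasks should be exactly 0 through 19:

# The stuck tasks should be exactly the first 20 submitted by their parent.
assert sorted(parent_counters) == list(range(20))
print("%d tasks stuck, counters 0 through %d" % (len(parent_counters), parent_counters[-1]))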
I believe the issue is a case that we're not handling, described in this comment.
The solution could be to resubmit the task to the global scheduler if it is not received the first time (because the global scheduler hadn't started subscribing yet). I still need to verify that this is actually the problem.
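One way to verify it, independently of any fix in the scheduler itself, would be a small watchdog that reports tasks that stay in the WAITING state for a long time. A minimal sketch, reusing the task_table() access from above (the 30-second threshold is arbitrary):

import time
import ray

ray.init(redis_address="172.17.0.2:6379")

WAITING = 1  # state code observed above

def tasks_stuck_in_waiting(threshold_s=30):
    # Take two snapshots of the WAITING task IDs, threshold_s apart; any task
    # present in both has been waiting at least that long.
    before = {task_id for task_id, info in ray.global_state.task_table().items()
              if info['State'] == WAITING}
    time.sleep(threshold_s)
    after = {task_id for task_id, info in ray.global_state.task_table().items()
             if info['State'] == WAITING}
    return before & after

stuck = tasks_stuck_in_waiting()
print("tasks stuck in WAITING for at least 30 seconds:", len(stuck))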
cc @shaneknapp
How does this explain what we saw yesterday on this build:
https://amplab.cs.berkeley.edu/jenkins/job/Ray/1892
Points:
Running docker ps -a, we saw four worker nodes and no head node. (This behavior could be explained by stdout needing to be flushed.)
So, this doesn't explain why we never saw the console output, and it also doesn't explain why we didn't see the head node.
However, I could reproduce the problem (or a similar problem) fairly easily, and whenever I did, I saw the console output as well as the head node container.
The problem may be fixed by #1106, but we shall see :)
This should be solved by #1106.
There are several PRs in progress that are not rebased on top of #1106, so we may see the problem a few more times, but for anything rebased on top of it, the problem should be gone. If not, we'll need to open a new issue (or reopen this one).