Ray: In Jenkins tests, test_0.py hangs occasionally.

Created on 11 Oct 2017 · 3 Comments · Source: ray-project/ray

I'm running the following (on our Jenkins workers).

python test/jenkins_tests/multi_node_docker_test.py --num-nodes=5 --num-redis-shards=10 --test-script=/ray/test/jenkins_tests/multi_node_tests/test_0.py

taken from

https://github.com/ray-project/ray/blob/b1660c4edfcbfec5645103eaa0aad58a255015fa/test/jenkins_tests/run_multi_node_tests.sh#L14-L18

Occasionally, it will hang, with output

Starting head node with command:['docker', 'run', '-d', '--shm-size=1G', 'f9636310fd09eed9bce58c174320749526b394816ee15b2f9c674d3c5124aee4', 'ray', 'start', '--head', '--block', '--redis-port=6379', '--num-redis-shards=10', '--num-cpus=10', '--num-gpus=0', '--no-ui']
Starting worker node with command:['docker', 'run', '-d', '--shm-size=1G', '--shm-size=1G', 'f9636310fd09eed9bce58c174320749526b394816ee15b2f9c674d3c5124aee4', 'ray', 'start', '--block', '--redis-address=172.17.0.2:6379', '--num-cpus=10', '--num-gpus=0']
Starting worker node with command:['docker', 'run', '-d', '--shm-size=1G', '--shm-size=1G', 'f9636310fd09eed9bce58c174320749526b394816ee15b2f9c674d3c5124aee4', 'ray', 'start', '--block', '--redis-address=172.17.0.2:6379', '--num-cpus=10', '--num-gpus=0']
Starting worker node with command:['docker', 'run', '-d', '--shm-size=1G', '--shm-size=1G', 'f9636310fd09eed9bce58c174320749526b394816ee15b2f9c674d3c5124aee4', 'ray', 'start', '--block', '--redis-address=172.17.0.2:6379', '--num-cpus=10', '--num-gpus=0']
Starting worker node with command:['docker', 'run', '-d', '--shm-size=1G', '--shm-size=1G', 'f9636310fd09eed9bce58c174320749526b394816ee15b2f9c674d3c5124aee4', 'ray', 'start', '--block', '--redis-address=172.17.0.2:6379', '--num-cpus=10', '--num-gpus=0']
Starting driver with command /ray/test/jenkins_tests/multi_node_tests/test_0.py.

Running docker ps shows

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
b408155b0153        f9636310fd09        "ray start --block..."   21 minutes ago      Up 21 minutes                           cranky_kowalevski
f295908861d7        f9636310fd09        "ray start --block..."   21 minutes ago      Up 21 minutes                           elastic_franklin
a37bc1714877        f9636310fd09        "ray start --block..."   21 minutes ago      Up 21 minutes                           optimistic_jones
a4891cbb597b        f9636310fd09        "ray start --block..."   21 minutes ago      Up 21 minutes                           objective_goodall
43b74949afb7        f9636310fd09        "ray start --head ..."   21 minutes ago      Up 21 minutes                           awesome_lamarr

Attaching to the head node with docker exec -t -i 43b74949afb7 /bin/bash and inspecting /tmp/raylogs, the only interesting thing I see is

root@43b74949afb7:/tmp/raylogs# cat global_scheduler-0199444623.err
[ERROR] (/ray/src/global_scheduler/global_scheduler_algorithm.cc:190: errno: Operation now in progress) Infeasible task. No nodes satisfy hard constraints for task = 108d1007838a4873edc1a4a17868e092d6b0e950
[ERROR] (/ray/src/global_scheduler/global_scheduler_algorithm.cc:190: errno: Operation now in progress) Infeasible task. No nodes satisfy hard constraints for task = 81fcdb52b4ba5f6df6567c5fecf91e8a121958e6
[ERROR] (/ray/src/global_scheduler/global_scheduler_algorithm.cc:190: errno: Operation now in progress) Infeasible task. No nodes satisfy hard constraints for task = 01eed3d3720a92589b14b7bc9a0abb7d18a88551

That isn't obviously a problem. It just means that we submitted some tasks to the global scheduler before it was aware of any local schedulers with any available workers.
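
For completeness, one way to check what the GCS actually knows about the local schedulers that registered is to dump the client table from a driver attached to the cluster. This is only a sketch; it assumes ray.global_state.client_table() is available in this version of Ray and simply prints the raw table.

import ray

ray.init(redis_address="172.17.0.2:6379")

# Print the raw client table from Redis; the entries show which clients
# (e.g. local schedulers) registered with the head node.
print(ray.global_state.client_table())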

Inspecting the task table, I did

import ray

# Connect to the existing cluster through the head node's Redis server.
ray.init(redis_address="172.17.0.2:6379")

# Pull the task table and count how many tasks are in each scheduling state.
tt = ray.global_state.task_table()
states = [v['State'] for k, v in tt.items()]
[(s, states.count(s)) for s in set(states)]

The last command printed

[(16, 980), (1, 20), (8, 2)]

State 8 is RUNNING. The two tasks in that state are probably the two driver tasks (one for the actual test and one for the driver I just attached to inspect the cluster). 980 tasks are in state 16, which is DONE, but 20 are still in state 1, which is WAITING, meaning that they have been pushed to the global scheduler and we are waiting for the global scheduler to assign them to a local scheduler.
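
For readability, the numeric codes can be translated into names. This is a minimal sketch that continues from the snippet above; the mapping only covers the three codes discussed here, not the full set of task states.

from collections import Counter

# Subset of the task state codes, taken from the discussion above
# (not the full enum from the Ray source).
STATE_NAMES = {1: "WAITING", 8: "RUNNING", 16: "DONE"}

for state, count in Counter(states).most_common():
    print(STATE_NAMES.get(state, "UNKNOWN({})".format(state)), count)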

Next, I printed the parent counters of the waiting tasks:

# Collect the tasks that are still WAITING (state 1) and look at where
# they fall in the driver's submission order.
missed_tasks = [v for k, v in tt.items() if v['State'] == 1]
parent_counters = [t['TaskSpec']['ParentCounter'] for t in missed_tasks]
parent_counters.sort()
print(parent_counters)

The last line prints

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

So this suggests that the first 20 tasks didn't get processed by the global scheduler.
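
As a quick sanity check (building on the snippet above), the counters can be compared against the first 20 submission indices:

# The parent counter is the submission index within the parent task
# (here, the driver), so the driver's first 20 submissions should have
# counters 0 through 19 with no gaps.
if parent_counters == list(range(20)):
    print("The stuck tasks are exactly the driver's first 20 submissions.")
else:
    print("Unexpected pattern:", parent_counters)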

I believe the issue is a case that we're not handling, described in this comment:

https://github.com/ray-project/ray/blob/1e0ab3d3864f40486081a560649ed78e8d00611a/src/global_scheduler/global_scheduler.cc#L419-L423

One possible solution would be to resubmit the task to the global scheduler if it is not received the first time (because the global scheduler hadn't started subscribing yet). I still need to verify that this is actually the problem.
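
To make the idea concrete, here is a rough Python sketch of the resubmission logic only; the actual fix would live in the C++ scheduler code, and every name below (forward_with_retry, submit, is_acknowledged) is hypothetical.

import time

def forward_with_retry(task_id, submit, is_acknowledged,
                       timeout_s=0.1, max_attempts=10):
    """Keep re-forwarding a task until the global scheduler picks it up,
    so tasks sent before it started subscribing are not silently lost."""
    for _ in range(max_attempts):
        submit(task_id)  # push the task to the global scheduler
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            if is_acknowledged(task_id):  # e.g. the task left the WAITING state
                return True
            time.sleep(0.01)
    return False  # give up; the caller should surface an error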

cc @shaneknapp

bug

All 3 comments

How does this explain what we saw yesterday on this build:
https://amplab.cs.berkeley.edu/jenkins/job/Ray/1892

Points:

  • If you look at the console output (https://amplab.cs.berkeley.edu/jenkins/job/Ray/1892/console), you will see that it never actually told us that it had launched the docker containers.
  • However, after checking docker ps -a, we saw four worker nodes and no head node.

(This behavior could be explained by stdout needing to be flushed.)
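
If buffering is the suspect, an easy thing to try in multi_node_docker_test.py is flushing explicitly after each status line (or running Python with -u). This is only an illustration, not what the script currently does.

import sys

command = ["docker", "run", "-d", "--shm-size=1G"]  # placeholder for the real command list
print("Starting head node with command:{}".format(command))
sys.stdout.flush()  # push the line out immediately so Jenkins shows it even if we later hang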

So, this doesn't explain why we never saw the console output, and it also doesn't explain why we didn't see the head node.

However, I could reproduce the problem (or a similar problem) fairly easily, and whenever I did that I did see the console output as well as the head node container.

The problem may be fixed by #1106, but we shall see :)

This should be solved by #1106.

There are several PRs in progress that are not rebased on top of #1106, so we may see the problem a few more times, but for anything rebased on top of this, the problem should be gone. If not, we'll need to open a new issue (or reopen this one).
