Ray: Workload hangs when plasma_manager is killed

Created on 28 Nov 2017 · 7 comments · Source: ray-project/ray

System information

Describe the problem

Start a cluster on 10 m4.4xlarge as described here: http://ray.readthedocs.io/en/latest/using-ray-on-a-large-cluster.html

After the cluster is started, run this code:

# Assumes the driver is already connected to the cluster, e.g. via
# ray.init(redis_address=<head node's Redis address>).
import ray

@ray.remote
def f(x):
    i = 1
    return i

u = ray.put("hello")
# %time is an IPython magic; it times submitting and getting 100000 tasks.
%time sum(ray.get([f.remote(u) for i in range(100000)]))
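
As a point of reference, the kill step described below can be done on one of the worker nodes with something like the following sketch; the pkill pattern is an assumption based on the Ray 0.x process name.

import subprocess

# Run on a worker node: kill its plasma_manager process while the workload
# above is still running on the driver.
subprocess.run(["pkill", "-f", "plasma_manager"], check=False)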

While this is running, if we now kill a plasma_manager on one of the worker nodes, the workload hangs with the following message:

Remote function __main__.f failed with:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 756, in _process_task
    args)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 687, in _get_arguments_for_execution
    argument = self.get_object([arg])[0]
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 440, in get_object
    ray._config.worker_fetch_request_size())])
  File "pyarrow/plasma.pyx", line 552, in pyarrow.plasma.PlasmaClient.fetch (/ray/src/thirdparty/arrow/python/build/temp.linux-x86_64-3.6/plasma.cxx:6809)
  File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/ray/src/thirdparty/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8192)
pyarrow.lib.ArrowIOError: Broken pipe


  You can inspect errors by running

      ray.error_info()

  If this driver is hanging, start a new one with

      ray.init(redis_address="34.234.100.132:6379")
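
For reference, a minimal sketch of following those two suggestions from a fresh driver process; the Redis address above is specific to this cluster, so substitute your own head node's address.

import ray

# Reconnect a new driver to the running cluster; the address comes from the
# message above and will differ for other clusters.
ray.init(redis_address="34.234.100.132:6379")

# Inspect the errors that have been pushed to the cluster's error table.
print(ray.error_info())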
Label: bug

All 7 comments

I think the issue here is that the local scheduler attached to the killed plasma manager never detects that it's lost the connection. Normally, the local scheduler contacts the plasma manager at an interval to fetch any missing object dependencies. In this case, it doesn't contact the plasma manager because all task dependencies are fulfilled after the first task, and so the local scheduler never learns that its associated plasma manager is dead. Then, the task never gets reconstructed since that local scheduler still holds a lock on the task in the task table.

We should probably fix this by having the local scheduler detect whether a dead plasma manager is associated with itself during reconstruction.
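
Purely as an illustration of that proposal (the actual local scheduler is C code, and the names below are hypothetical stand-ins for the task table and manager liveness information), the reconstruction-time check might look roughly like this:

# Illustrative sketch only: task.lock_holder_manager_id, my_manager_id, and
# live_manager_ids are hypothetical, not real Ray APIs.
def should_reconstruct(task, my_manager_id, live_manager_ids):
    lock_holder = task.lock_holder_manager_id
    if lock_holder is not None and lock_holder not in live_manager_ids:
        # The lock on this task is held on behalf of a dead plasma manager,
        # so treat it as stale and allow reconstruction.
        return True
    if my_manager_id not in live_manager_ids:
        # Our own plasma manager has died; stop treating locally "fulfilled"
        # dependencies as available and trigger reconstruction.
        return True
    return False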

OK, I just ran the same experiment and also killed the local scheduler, and then everything worked as advertised. This very much supports your suspicion!

I'm running into this error frequently when spot instances get killed by AWS. It seems like some progress was made towards a fix -- what's the current status of this? Alternatively, any ideas for a workaround?

Hmm I'll look into this today and see if we can reopen that PR. I can't think of a foolproof workaround, unfortunately.

We've also been working on a rewrite of the backend that should make this problem go away, since the object manager and local schedulers will be in the same process. Fault tolerance won't be ready for that until around mid-July, though.

But if a spot instance dies, that will definitely kill both the local scheduler and plasma manager on the dead instance.

@AdamGleave do you know if you're running into a situation (e.g., on the head machine or a different machine) where the plasma manager has died but the local scheduler hasn't? E.g., if you look at the logs under /tmp/raylogs/, do you see any sign of dead processes? Or if you do ps aux | grep "plasma_manager " and ps aux | grep "local_scheduler ", do you find that one of them is dead?
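
For what it's worth, here is a small sketch of that check, assuming it runs on the node being inspected and that the Ray 0.x process names above are the ones to look for:

import subprocess

# Report whether the plasma_manager and local_scheduler processes are still
# alive on this node (the name patterns are an assumption).
for name in ("plasma_manager", "local_scheduler"):
    result = subprocess.run(["pgrep", "-f", name], stdout=subprocess.PIPE)
    status = "running" if result.returncode == 0 else "NOT running"
    print("{}: {}".format(name, status))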

My guess is that it's a race condition: the signal to kill the processes isn't simultaneous. I'll do further diagnostics next time I encounter this problem.

The code has changed enough that this is no longer relevant I think.
