Ray: [Core] Raylet breaks when many actor tasks are submitted

Created on 4 Sep 2020 · 15 comments · Source: ray-project/ray

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS):
Tested on a 16-core MacBook Pro.

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

This is caused by a low ulimit, but we should have a better error message.

Note that ActorPool ensures that there is at most one in-flight task per actor.

import ray
from ray.util import ActorPool

@ray.remote(num_cpus=0)
class DummyActor:

    def __init__(self):
        pass

    def do_stuff(self):
        pass

ray.init()

# 10000 dummy work items to feed through the pool.
things = [x for x in range(10000)]

# Start 4 actors per CPU core (64 on a 16-core machine).
nworkers = int(ray.cluster_resources()['CPU']) * 4
actors = [DummyActor.remote() for _ in range(nworkers)]

pool = ActorPool(actors)

# ActorPool.map keeps at most one in-flight task per actor.
res = pool.map(lambda a, v: a.do_stuff.remote(), things)

for i, x in enumerate(res):
    if i % 100 == 0:
        print(x)

Output:

(pid=49596) F0904 12:07:30.034916 49596 377699776 raylet_client.cc:108]  Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: No such file or directory
(pid=49596) *** Check failure stack trace: ***
(pid=raylet) F0904 12:07:30.037915 49403 281886144 worker_pool.cc:364] Failed to start worker with return value system:24: Too many open files
(pid=raylet) *** Check failure stack trace: ***
(pid=raylet)     @        0x1083e0112  google::LogMessage::~LogMessage()
(pid=raylet)     @        0x10837cdc5  ray::RayLog::~RayLog()
(pid=raylet)     @        0x107f6f96e  ray::raylet::WorkerPool::StartProcess()
(pid=raylet)     @        0x107f6d04f  ray::raylet::WorkerPool::StartWorkerProcess()
(pid=raylet)     @        0x107f73707  ray::raylet::WorkerPool::PopWorker()
(pid=raylet)     @        0x107ec6923  ray::raylet::NodeManager::DispatchTasks()
(pid=raylet)     @        0x107ed8b09  ray::raylet::NodeManager::HandleWorkerAvailable()
(pid=raylet)     @        0x107ed15cf  ray::raylet::NodeManager::HandleWorkerAvailable()
(pid=raylet)     @        0x107ecfa86  ray::raylet::NodeManager::ProcessClientMessage()
(pid=raylet)     @        0x107f3817a  std::__1::__function::__func<>::operator()()
(pid=raylet)     @        0x1083558ee  ray::ClientConnection::ProcessMessage()
(pid=raylet)     @        0x10835cdb0  boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(pid=raylet)     @        0x1087e830e  boost::asio::detail::scheduler::do_run_one()
(pid=raylet)     @        0x1087dbca1  boost::asio::detail::scheduler::run()
(pid=raylet)     @        0x1087dbb2c  boost::asio::io_context::run()
(pid=raylet)     @        0x107ea7d8a  main
(pid=raylet)     @     0x7fff6a420cc9  start
(pid=49566) F0904 12:07:30.043242 49566 287137216 raylet_client.cc:108]  Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: No such file or directory
(pid=49566) *** Check failure stack trace: ***
(pid=49566)     @        0x10c6f44e2  google::LogMessage::~LogMessage()
(pid=49566)     @        0x10c691745  ray::RayLog::~RayLog()
(pid=49566)     @        0x10c2a1b99  ray::raylet::RayletClient::RayletClient()
(pid=49566)     @        0x10c1d1e6a  ray::CoreWorker::CoreWorker()
(pid=49566)     @        0x10c1cfbdf  ray::CoreWorkerProcess::CreateWorker()
(pid=49566)     @        0x10c1ce913  ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=49566)     @        0x10c1cdab7  ray::CoreWorkerProcess::Initialize()
(pid=49566)     @        0x10c13c275  __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=49566)     @        0x10b85ca8f  type_call
(pid=49566)     @        0x10b7d14f3  _PyObject_FastCallKeywords
(pid=49566)     @        0x10b90ee75  call_function
(pid=49566)     @        0x10b90bb92  _PyEval_EvalFrameDefault
(pid=49566)     @        0x10b90046e  _PyEval_EvalCodeWithName
(pid=49566)     @        0x10b7d1a03  _PyFunction_FastCallKeywords
(pid=49566)     @        0x10b90ed67  call_function
(pid=49566)     @        0x10b90cb8d  _PyEval_EvalFrameDefault
(pid=49566)     @        0x10b90046e  _PyEval_EvalCodeWithName
(pid=49566)     @        0x10b963ce0  PyRun_FileExFlags
(pid=49566)     @        0x10b963157  PyRun_SimpleFileExFlags
(pid=49566)     @        0x10b990dc3  pymain_main
(pid=49566)     @        0x10b7a3f2d  main
(pid=49566)     @     0x7fff6a420cc9  start
(pid=49566)     @                0xb  (unknown)
(pid=49563) F0904 12:07:30.037027 49563 245689792 raylet_client.cc:108]  Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: No such file or directory
(pid=49563) *** Check failure stack trace: ***
(pid=49563)     @        0x10f06e4e2  google::LogMessage::~LogMessage()
(pid=49563)     @        0x10f00b745  ray::RayLog::~RayLog()
(pid=49563)     @        0x10ec1bb99  ray::raylet::RayletClient::RayletClient()
(pid=49563)     @        0x10eb4be6a  ray::CoreWorker::CoreWorker()
(pid=49563)     @        0x10eb49bdf  ray::CoreWorkerProcess::CreateWorker()
(pid=49563)     @        0x10eb48913  ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=49563)     @        0x10eb47ab7  ray::CoreWorkerProcess::Initialize()
(pid=49563)     @        0x10eab6275  __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=49563)     @        0x10ded3a8f  type_call
(pid=49563)     @        0x10de484f3  _PyObject_FastCallKeywords
(pid=49563)     @        0x10df85e75  call_function
(pid=49563)     @        0x10df82b92  _PyEval_EvalFrameDefault
(pid=49563)     @        0x10df7746e  _PyEval_EvalCodeWithName
(pid=49563)     @        0x10de48a03  _PyFunction_FastCallKeywords
(pid=49563)     @        0x10df85d67  call_function
(pid=49563)     @        0x10df83b8d  _PyEval_EvalFrameDefault
(pid=49563)     @        0x10df7746e  _PyEval_EvalCodeWithName
(pid=49563)     @        0x10dfdace0  PyRun_FileExFlags
(pid=49563)     @        0x10dfda157  PyRun_SimpleFileExFlags
(pid=49563)     @        0x10e007dc3  pymain_main
(pid=49563)     @        0x10de1af2d  main
(pid=49563)     @     0x7fff6a420cc9  start
(pid=49512) E0904 12:07:30.084020 49512 218574848 core_worker.cc:694] Raylet failed. Shutting down.
(pid=49519) E0904 12:07:30.083940 49519 101613568 core_worker.cc:694] Raylet failed. Shutting down.
(pid=49521) E0904 12:07:30.083511 49521 5890048 core_worker.cc:694] Raylet failed. Shutting down.
(pid=49562) F0904 12:07:30.097999 49562 180841920 core_worker.cc:330]  Check failed: _s.ok() Bad status: IOError: Broken pipe
(pid=49562) *** Check failure stack trace: ***
(pid=49562)     @        0x101a884e2  google::LogMessage::~LogMessage()
(pid=49562)     @        0x101a25745  ray::RayLog::~RayLog()
(pid=49562)     @        0x1015661df  ray::CoreWorker::CoreWorker()
(pid=49562)     @        0x101563bdf  ray::CoreWorkerProcess::CreateWorker()
(pid=49562)     @        0x101562913  ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=49562)     @        0x101561ab7  ray::CoreWorkerProcess::Initialize()
(pid=49562)     @        0x1014d0275  __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=49562)     @        0x100bf0a8f  type_call
(pid=49562)     @        0x100b654f3  _PyObject_FastCallKeywords
(pid=49562)     @        0x100ca2e75  call_function
(pid=49562)     @        0x100c9fb92  _PyEval_EvalFrameDefault
(pid=49562)     @        0x100c9446e  _PyEval_EvalCodeWithName
(pid=49562)     @        0x100b65a03  _PyFunction_FastCallKeywords
(pid=49562)     @        0x100ca2d67  call_function
(pid=49562)     @        0x100ca0b8d  _PyEval_EvalFrameDefault
(pid=49562)     @        0x100c9446e  _PyEval_EvalCodeWithName
(pid=49562)     @        0x100cf7ce0  PyRun_FileExFlags
(pid=49562)     @        0x100cf7157  PyRun_SimpleFileExFlags
(pid=49562)     @        0x100d24dc3  pymain_main
(pid=49562)     @        0x100b37f2d  main
(pid=49562)     @     0x7fff6a420cc9  start
(pid=49562)     @                0xb  (unknown)

If we cannot run your script, we cannot fix your issue.

  • [ ] I have verified my script runs in a clean environment and reproduces the issue.
  • [ ] I have verified the issue also occurs with the latest wheels.
P1 bug


All 15 comments

This seems to work fine for me on master. Can you provide the stacktrace from the raylet process (raylet.out and .err)?

Ah sorry, didn't see this part: (pid=raylet) F0904 12:07:30.037915 49403 281886144 worker_pool.cc:364] Failed to start worker with return value system:24: Too many open files

Have you tried increasing ulimit? Maybe you have some other processes running with too many open files?

funny, i'm also on latest master :/

raylet.out

I0904 12:07:26.417099 49403 281886144 io_service_pool.cc:36] IOServicePool is running with 1 io_service.
I0904 12:07:26.417464 49403 281886144 store_runner.cc:30] Allowing the Plasma store to use up to 2.69936GB of memory.
I0904 12:07:26.417488 49403 281886144 store_runner.cc:44] Starting object store with directory /tmp and huge page support disabled
I0904 12:07:26.418530 49403 281886144 grpc_server.cc:74] ObjectManager server started, listening on port 49698.
I0904 12:07:26.453881 49403 281886144 node_manager.cc:166] Initializing NodeManager with ID 63baeb290c4d199e64fab2fe91861f27f0248ebe
I0904 12:07:26.460155 49403 281886144 grpc_server.cc:74] NodeManager server started, listening on port 62164.
I0904 12:07:26.463145 49403 281886144 agent_manager.cc:40] Not starting agent, the agent command is empty.
I0904 12:07:26.474151 49403 281886144 service_based_accessor.cc:791] Received notification for node id = 63baeb290c4d199e64fab2fe91861f27f0248ebe, IsAlive = 1
I0904 12:07:26.474448 49403 281886144 service_based_accessor.cc:791] Received notification for node id = 63baeb290c4d199e64fab2fe91861f27f0248ebe, IsAlive = 1
F0904 12:07:30.037915 49403 281886144 worker_pool.cc:364] Failed to start worker with return value system:24: Too many open files
F0904 12:07:30.037915 49403 281886144 worker_pool.cc:364] Failed to start worker with return value system:24: Too many open files
F0904 12:07:30.037915 49403 281886144 worker_pool.cc:364] Failed to start worker with return value system:24: Too many open files
F0904 12:07:30.037915 49403 281886144 worker_pool.cc:364] Failed to start worker with return value system:24: Too many open files

raylet.err

F0904 12:07:30.037915 49403 281886144 worker_pool.cc:364] Failed to start worker with return value system:24: Too many open files
*** Check failure stack trace: ***
    @        0x1083e0112  google::LogMessage::~LogMessage()
    @        0x10837cdc5  ray::RayLog::~RayLog()
    @        0x107f6f96e  ray::raylet::WorkerPool::StartProcess()
    @        0x107f6d04f  ray::raylet::WorkerPool::StartWorkerProcess()
    @        0x107f73707  ray::raylet::WorkerPool::PopWorker()
    @        0x107ec6923  ray::raylet::NodeManager::DispatchTasks()
    @        0x107ed8b09  ray::raylet::NodeManager::HandleWorkerAvailable()
    @        0x107ed15cf  ray::raylet::NodeManager::HandleWorkerAvailable()
    @        0x107ecfa86  ray::raylet::NodeManager::ProcessClientMessage()
    @        0x107f3817a  std::__1::__function::__func<>::operator()()
    @        0x1083558ee  ray::ClientConnection::ProcessMessage()
    @        0x10835cdb0  boost::asio::detail::reactive_socket_recv_op<>::do_complete()
    @        0x1087e830e  boost::asio::detail::scheduler::do_run_one()
    @        0x1087dbca1  boost::asio::detail::scheduler::run()
    @        0x1087dbb2c  boost::asio::io_context::run()
    @        0x107ea7d8a  main
    @     0x7fff6a420cc9  start

(i believe these were properly redirected to the driver)

$ ulimit -n 
256

i should be starting 64 actor processes. let's see if someone else can reproduce this on a mac maybe?

Maybe check how many files you have open before starting Ray? Or increase ulimit -n.
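For reference, here is one way to check both the descriptor limit and a rough count of currently open files from Python before starting Ray (stdlib only; this sketch assumes lsof is available, as it is on macOS and most Linux systems):

import os
import resource
import subprocess

# Soft and hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

# Rough count of files this process has open (lsof also lists memory-mapped files).
out = subprocess.check_output(["lsof", "-p", str(os.getpid())])
print(f"open files: {len(out.splitlines()) - 1}")  # minus the header line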

I just tested it out on my laptop and the raylet crashed:

(pid=raylet) F0904 14:25:50.808766  6518 319638976 worker_pool.cc:364] Failed to start worker with return value system:24: Too many open files
(pid=raylet) *** Check failure stack trace: ***                                                                
(pid=raylet)     @        0x109cae112  google::LogMessage::~LogMessage()                                       
(pid=raylet)     @        0x109c4adc5  ray::RayLog::~RayLog()                                                  
(pid=raylet)     @        0x10983d96e  ray::raylet::WorkerPool::StartProcess()                                 
(pid=raylet)     @        0x10983b04f  ray::raylet::WorkerPool::StartWorkerProcess()
(pid=raylet)     @        0x109841707  ray::raylet::WorkerPool::PopWorker()
(pid=raylet)     @        0x109794923  ray::raylet::NodeManager::DispatchTasks()
(pid=raylet)     @        0x1097a6b09  ray::raylet::NodeManager::HandleWorkerAvailable()
(pid=raylet)     @        0x10979f5cf  ray::raylet::NodeManager::HandleWorkerAvailable()
(pid=raylet)     @        0x10979f49e  ray::raylet::NodeManager::ProcessAnnounceWorkerPortMessage()
(pid=raylet)     @        0x10979db30  ray::raylet::NodeManager::ProcessClientMessage()
(pid=raylet)     @        0x10980617a  std::__1::__function::__func<>::operator()()
(pid=raylet)     @        0x109c238ee  ray::ClientConnection::ProcessMessage()
(pid=raylet)     @        0x109c2adb0  boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(pid=raylet)     @        0x10a0b630e  boost::asio::detail::scheduler::do_run_one()
(pid=raylet)     @        0x10a0a9ca1  boost::asio::detail::scheduler::run()
(pid=raylet)     @        0x10a0a9b2c  boost::asio::io_context::run()
(pid=raylet)     @        0x109775d8a  main
(pid=raylet)     @     0x7fff6ce4ccc9  start

I also ran into a similar issue recently when trying to run some Serve tests, where the GCS service crashed due to too many open files. I was not able to resolve it, but it only happened locally, not on CI or Linux boxes.

Setting ulimit works for me, but it's definitely a scary stack trace on the initial execution.

I guess this is a ulimit issue, and we probably open only about 4 files per process (I only know of 3). I'll downgrade this to P1, but we should probably just catch this error and print a less scary error message.

On Ubuntu, ulimit -n defaults to 1024; on OS X it's 256. That probably explains it: 64 workers at roughly 4 open files each already adds up to 256.

I guess we should warn users that they should increase ulimit when they start a big number of workers?

yeah, actually why don't we just setrlimit ourselves when starting ray?

also, imo 4 * num_cpus isn't a large number of workers, i think it's actually fairly reasonable (think about large numbers of tasks that perform IO)
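For what it's worth, here is a minimal sketch of what raising the soft limit programmatically (as suggested above) could look like, using only the stdlib resource module; the target value is illustrative, and this is not something Ray actually does today:

import resource

def raise_nofile_soft_limit(target=8192):
    # Bump the soft RLIMIT_NOFILE toward `target`, never above the hard limit.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    cap = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if cap > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (cap, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)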

Hmm, it seems a bit scary to me to be overriding that limit as part of Ray's source code... It's also pretty awkward, since then we'd have to make the value configurable. We do it in the example cluster launcher scripts, but that's less sketchy since it's very clear and configurable, plus it's not going to be running on the user's laptop.

I would be in favor of just making the error message less scary (we probably don't need a stacktrace) and adding a tip to let the user know that they should manually increase the limit.
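To illustrate the kind of tip being proposed, a hypothetical driver-side preflight check might look roughly like this (the function name, threshold, and warning text are made up; the ~4 descriptors per worker figure is the guess from earlier in this thread):

import resource
import warnings

def warn_if_fd_limit_looks_low(num_workers, fds_per_worker=4):
    # Heuristic: the raylet needs a handful of descriptors per worker it manages.
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    if num_workers * fds_per_worker > soft:
        warnings.warn(
            f"ulimit -n is {soft}, which may be too low to start {num_workers} "
            "workers; consider raising it, e.g. with `ulimit -n 8192`.")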

@wuisawesome Can I close the issue?

whoops, yeah sorry
