Ray: Lost connection with local raylet. Error: IOError: 14: failed to connect to all addresses

Created on 1 Jan 2020  ·  23Comments  ·  Source: ray-project/ray

I've migrated RLlib training code from Ray 0.7.2 to 0.8.0. While it runs on local_mode=True well, it fails in the very beginning when running with local_mode=False, giving the following (not-very-informative) stack trace:

2020-01-01 19:45:33,310 INFO resource_spec.py:216 -- Starting Ray with 34.23 GiB memory available for workers and up to 17.12 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=).
2020-01-01 19:45:33,607 WARNING services.py:1354 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 17179865088 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2020-01-01 19:45:33,812 INFO trainer.py:371 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-01-01 19:45:33,834 INFO trainer.py:512 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
Loading configuration... done.
Loading configuration... done.
2020-01-01 19:45:38,914 WARNING util.py:45 -- Install gputil for GPU system monitoring.
Loading configuration... done.
F0101 19:45:39.223114 8665 direct_task_transport.cc:154] Lost connection with local raylet. Error: IOError: 14: failed to connect to all addresses
* Check failure stack trace: *
@ 0x7f1d2b42cecd google::LogMessage::Fail()
@ 0x7f1d2b42e33c google::LogMessage::SendToLog()
@ 0x7f1d2b42cba9 google::LogMessage::Flush()
@ 0x7f1d2b42cdc1 google::LogMessage::~LogMessage()
@ 0x7f1d2b201619 ray::RayLog::~RayLog()
@ 0x7f1d2b1225f9 ray::CoreWorkerDirectTaskSubmitter::RetryLeaseRequest()
@ 0x7f1d2b124d93 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc16WorkerLeaseReplyEEZNS0_29CoreWorkerDirectTaskSubmitter24RequestNewWorkerIfNeededERKSt4pairIiSt6vectorINS0_8ObjectIDESaISC_EEEPKNS4_7AddressEEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
@ 0x7f1d2b15c16f ray::rpc::ClientCallImpl<>::OnReplyReceived()
@ 0x7f1d2b0fef03 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
@ 0x7f1d2b0fd8d5 boost::asio::detail::scheduler::run()
@ 0x7f1d2b1019d3 ray::CoreWorker::RunIOService()
@ 0x7f1d2ac849e0 (unknown)
@ 0x7f1d2cee06db start_thread
@ 0x7f1d2d21988f clone
Error: tcpip::Socket::recvAndCheck @ recv: peer shutdown
Quitting (on error).
Error: tcpip::Socket::recvAndCheck @ recv: peer shutdown
Quitting (on error).

Is anyone familiar with this type of exception? It seems to be a low level exception, and I can't figure out a way to debug it.

bug

Most helpful comment

Hi @ericl, I found the problem caused by the http_proxy. It works if I set the proxy to empty.

All 23 comments

Can you provide reproduction details?

Got same issues:

[I 09:51:25.177 NotebookApp] Saving file at /workspace/jupyter/ray/ray_learn.ipynb
F0107 09:51:50.732267 22128 direct_task_transport.cc:154] Lost connection with local raylet. Error: IOError: 14: Socket closed
*** Check failure stack trace: ***
    @     0x7f308476598d  google::LogMessage::Fail()
    @     0x7f3084766dfc  google::LogMessage::SendToLog()
    @     0x7f3084765669  google::LogMessage::Flush()
    @     0x7f3084765881  google::LogMessage::~LogMessage()
    @     0x7f308453a0e9  ray::RayLog::~RayLog()
    @     0x7f308445b0c9  ray::CoreWorkerDirectTaskSubmitter::RetryLeaseRequest()
    @     0x7f308445d863  _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc16WorkerLeaseReplyEEZNS0_29CoreWorkerDirectTaskSubmitter24RequestNewWorkerIfNeededERKSt4pairIiSt6vectorINS0_8ObjectIDESaISC_EEEPKNS4_7AddressEEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
    @     0x7f3084494c3f  ray::rpc::ClientCallImpl<>::OnReplyReceived()
    @     0x7f30844379d3  _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
    @     0x7f30844363a5  boost::asio::detail::scheduler::run()
    @     0x7f308443a4a3  ray::CoreWorker::RunIOService()
    @     0x7f308908c421  execute_native_thread_routine_compat
    @     0x7f308c4876db  start_thread
    @     0x7f308c1b088f  clone

just run the follows in jupyter-notebook:

@ray.remote
def remote_function():
    return 1

id = remote_function.remote()

+1 getting the same error

I also get this when going from 0.7.7 to 0.8.0.

F0107 12:13:33.059109 10234 direct_task_transport.cc:154] Lost connection with local raylet. Error: IOError: 14: failed to connect to all addresses
*** Check failure stack trace: ***
    @     0x7f3d27cf398d  google::LogMessage::Fail()
    @     0x7f3d27cf4dfc  google::LogMessage::SendToLog()
    @     0x7f3d27cf3669  google::LogMessage::Flush()
    @     0x7f3d27cf3881  google::LogMessage::~LogMessage()
    @     0x7f3d27ac80e9  ray::RayLog::~RayLog()
    @     0x7f3d279e90c9  ray::CoreWorkerDirectTaskSubmitter::RetryLeaseRequest()
    @     0x7f3d279eb863  _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc16WorkerLeaseReplyEEZNS0_29CoreWorkerDirectTaskSubmitter24RequestNewWorkerIfNeededERKSt4pairIiSt6vectorINS0_8ObjectIDESaISC_EEEPKNS4_7AddressEEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
    @     0x7f3d27a22c3f  ray::rpc::ClientCallImpl<>::OnReplyReceived()
    @     0x7f3d279c59d3  _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
    @     0x7f3d279c43a5  boost::asio::detail::scheduler::run()
    @     0x7f3d279c84a3  ray::CoreWorker::RunIOService()
    @     0x7f3d2778d678  execute_native_thread_routine_compat
    @     0x7f3d37b176db  start_thread
    @     0x7f3d3784088f  clone
Aborted (core dumped)

Is this only in jupyter notebooks? Does it work fine outside?

By the way, it is likely this means something crashed. Please include the full jupyter logs of the application (and ray logs in /tmp/ray/logs).

@ericl I'm not using jupyter notebooks and still get the error.

Here are the logs:
logs.zip

I don't see anything in the logs. It must be some environment issue. Can you provide a reproduction environment as well (i.e., which cloud AMI, operating system version, python version)?

I'm on python 3.7.5, Ubuntu 18.04, running locally. Unfortunately the codebase is pretty complex and I haven't made a simple repro script.

Hmm, some usual suspects are hitting system limits (try setting OMP_NUM_THREADS=1), though it shouldn't have changed in 0.8.

Unfortunately we can't really help unless there is a reproduction script.

I see a similar (perhaps different) error from time to time on master branch from yesterday (python 3.7.5, redhat linux 7.6)

... direct_task_transport.cc:147] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses

spams the stderr until I kill the process. Restarting the cluster and rerunning the same script has been successful albeit not a good solution.

If this is a different problem happy to open a new issue.

I think these are two related but possibly different issues:

  1. The worker process cannot connect via gRPC to the raylet on the same node (due to the raylet dying or unknown networking issues?)
  2. The worker cannot connect via gRPC to some remote node repeatedly (due to some networking issue?)

cc @zhijunfu @raulchen have you ever seen this in production?

Hi @ericl, I have tested locally without jupyter as follows:

import ray 

ray.init(temp_dir='.', num_cpus=2)

@ray.remote
def remote_function():
    return 1

id = remote_function.remote()
print(ray.get(id))

ray.shutdown()

The out msg:

2020-01-08 09:35:17,620 INFO resource_spec.py:216 -- Starting Ray with 33.25 GiB memory available for workers and up to 16.63 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
F0108 09:35:17.983886  8952 direct_task_transport.cc:154] Lost connection with local raylet. Error: IOError: 14: Socket closed
*** Check failure stack trace: ***
    @     0x7fa0ea7ef98d  google::LogMessage::Fail()
    @     0x7fa0ea7f0dfc  google::LogMessage::SendToLog()
    @     0x7fa0ea7ef669  google::LogMessage::Flush()
    @     0x7fa0ea7ef881  google::LogMessage::~LogMessage()
    @     0x7fa0ea5c40e9  ray::RayLog::~RayLog()
    @     0x7fa0ea4e50c9  ray::CoreWorkerDirectTaskSubmitter::RetryLeaseRequest()
    @     0x7fa0ea4e7863  _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc16WorkerLeaseReplyEEZNS0_29CoreWorkerDirectTaskSubmitter24RequestNewWorkerIfNeededERKSt4pairIiSt6vectorINS0_8ObjectIDESaISC_EEEPKNS4_7AddressEEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
    @     0x7fa0ea51ec3f  ray::rpc::ClientCallImpl<>::OnReplyReceived()
    @     0x7fa0ea4c19d3  _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
    @     0x7fa0ea4c03a5  boost::asio::detail::scheduler::run()
    @     0x7fa0ea4c44a3  ray::CoreWorker::RunIOService()
    @     0x7fa0ea266421  execute_native_thread_routine_compat
    @     0x7fa0ebfb26db  start_thread
    @     0x7fa0ebcdb88f  clone
Aborted (core dumped)

The error msg in raylet.out:

  1 I0108 09:35:17.872679  8927 stats.h:48] Succeeded to initialize stats: exporter address is 127.0.0.1:8888
  2 I0108 09:35:17.881716  8927 redis_gcs_client.cc:156] RedisGcsClient Connected.
  3 I0108 09:35:17.882542  8927 grpc_server.cc:57] ObjectManager server started, listening on port 40815.
  4 I0108 09:35:17.883440  8927 grpc_server.cc:57] NodeManager server started, listening on port 57235.
  5 I0108 09:35:18.145543  8927 main.cc:171] Raylet received SIGTERM, shutting down...
  6 E0108 09:35:18.145653  8927 object_store_notification_manager.cc:49] Failed to process store length: IOError: No such file or directory, most likely plasma store is down, raylet will exit

in raylet.err:

E0108 09:35:18.145653  8927 object_store_notification_manager.cc:49] Failed to process store length: IOError: No such file or directory, most likely plasma store is down, raylet will exit

It seems the plasma failed first, and then raylet closed.

The env is:
python: Python 3.7.5
ray: '0.8.0'
os: ubuntu 18.04
kernel: Linux 4.15.0-72-generic

attach plasma log.
plasma_store.err:

  1 WARNING: Logging before InitGoogleLogging() is written to STDERR
  2 I0108 09:35:17.863186  8926 store.cc:1228] Allowing the Plasma store to use up to 17.8538GB of memory.
  3 I0108 09:35:17.863298  8926 store.cc:1255] Starting object store with directory /dev/shm and huge page support disabled
  4 I0108 09:35:18.145128  8926 store.cc:738] Disconnecting client on fd 10
  5 I0108 09:35:18.145280  8926 store.cc:1176] SIGTERM Signal received, closing Plasma Server...

plasma_store.out log is empty and raylet_monitor.err is

 *** Aborted at 1578447318 (unix time) try "date -d @1578447318" if you are using GNU date ***
  2 PC: @                0x0 (unknown)
  3 *** SIGTERM (@0x3e8000022c7) received by PID 8925 (TID 0x7f8687e25780) from PID 8903; stack trace: ***
  4     @     0x7f86872f4890 (unknown)
  5     @     0x7f8686bf2b77 epoll_wait
  6     @           0x418e1c boost::asio::detail::epoll_reactor::run()
  7     @           0x4194b9 boost::asio::detail::scheduler::run()
  8     @           0x409c5f main
  9     @     0x7f8686af2b97 __libc_start_main
 10     @           0x40ed81 (unknown)

I can't reproduce on binder: https://mybinder.org/v2/gh/ray-project/tutorial/master?urlpath=lab It would be helpful to get a reproduction on a publicly available environment (either some notebook, or starting from a VM image).

Is it possible you have some firewall config that is preventing local connections for some port range? I think gRPC picks ports randomly from the ephemeral port range right now.

One more thing you can try is setting the env var RAY_FORCE_DIRECT=0, to confirm the issue is gRPC.

The ray version in https://mybinder.org/v2/gh/ray-project/tutorial/master?urlpath=lab is '0.7.4'. Could it upgrade to '0.8.0' for a try?

Yeah, I upgraded it before trying: !pip install -U ray

Can you also try setting GRPC_VERBOSITY=DEBUG and see what the output is?

You can also try setting GRPC_TRACE=tcp to see the addresses, though this is quite verbose.

Hi @ericl, I found the problem caused by the http_proxy. It works if I set the proxy to empty.

You could reproduce the problem with the follows:

import ray 
import os

os.environ['GRPC_VERBOSITY']='DEBUG'
os.environ['http_proxy']='some proxy'
os.environ['https_proxy']='some proxy'

ray.init(temp_dir='.', num_cpus=2)

@ray.remote
def remote_function():
    return 1

id = remote_function.remote()
print(ray.get(id))

ray.shutdown()

Thanks for figuring this out @ConeyLiu , this should fix it: https://github.com/ray-project/ray/pull/6744

thanks for the quickly fix.

获取 Outlook for Androidhttps://aka.ms/ghei36


From: Eric Liang notifications@github.com
Sent: Wednesday, January 8, 2020 2:40:51 PM
To: ray-project/ray ray@noreply.github.com
Cc: Xianyang Liu liu-xianyang@hotmail.com; Mention mention@noreply.github.com
Subject: Re: [ray-project/ray] Lost connection with local raylet. Error: IOError: 14: failed to connect to all addresses (#6662)

Thanks for figuring this out @ConeyLiuhttps://github.com/ConeyLiu , this should fix it: #6744https://github.com/ray-project/ray/pull/6744


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHubhttps://github.com/ray-project/ray/issues/6662?email_source=notifications&email_token=ADBEWSFQ3527SORROCT74PTQ4VYPHA5CNFSM4KB4DP4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEILLFBA#issuecomment-571912836, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADBEWSE34RJVWQCYW7CDID3Q4VYPHANCNFSM4KB4DP4A.

@virtualluke , if you still see the issue, can you file a new bug?

Was this page helpful?
0 / 5 - 0 ratings