I've migrated RLlib training code from Ray 0.7.2 to 0.8.0. It runs well with local_mode=True, but it fails right at the start when running with local_mode=False.
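Roughly, the entry point looks like this (just a sketch of the shape of the code, with a placeholder algorithm, environment, and config rather than the actual project code):

import ray
from ray.rllib.agents.ppo import PPOTrainer  # placeholder; the actual algorithm differs

# Runs fine with local_mode=True; crashes almost immediately with local_mode=False.
ray.init(local_mode=False)

trainer = PPOTrainer(env="CartPole-v0", config={"num_workers": 2})  # placeholder env/config
trainer.train()

The failure produces the following (not very informative) stack trace: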
2020-01-01 19:45:33,310 INFO resource_spec.py:216 -- Starting Ray with 34.23 GiB memory available for workers and up to 17.12 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-01-01 19:45:33,607 WARNING services.py:1354 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 17179865088 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2020-01-01 19:45:33,812 INFO trainer.py:371 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-01-01 19:45:33,834 INFO trainer.py:512 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
Loading configuration... done.
Loading configuration... done.
2020-01-01 19:45:38,914 WARNING util.py:45 -- Install gputil for GPU system monitoring.
Loading configuration... done.
F0101 19:45:39.223114 8665 direct_task_transport.cc:154] Lost connection with local raylet. Error: IOError: 14: failed to connect to all addresses
* Check failure stack trace: *
@ 0x7f1d2b42cecd google::LogMessage::Fail()
@ 0x7f1d2b42e33c google::LogMessage::SendToLog()
@ 0x7f1d2b42cba9 google::LogMessage::Flush()
@ 0x7f1d2b42cdc1 google::LogMessage::~LogMessage()
@ 0x7f1d2b201619 ray::RayLog::~RayLog()
@ 0x7f1d2b1225f9 ray::CoreWorkerDirectTaskSubmitter::RetryLeaseRequest()
@ 0x7f1d2b124d93 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc16WorkerLeaseReplyEEZNS0_29CoreWorkerDirectTaskSubmitter24RequestNewWorkerIfNeededERKSt4pairIiSt6vectorINS0_8ObjectIDESaISC_EEEPKNS4_7AddressEEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
@ 0x7f1d2b15c16f ray::rpc::ClientCallImpl<>::OnReplyReceived()
@ 0x7f1d2b0fef03 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
@ 0x7f1d2b0fd8d5 boost::asio::detail::scheduler::run()
@ 0x7f1d2b1019d3 ray::CoreWorker::RunIOService()
@ 0x7f1d2ac849e0 (unknown)
@ 0x7f1d2cee06db start_thread
@ 0x7f1d2d21988f clone
Error: tcpip::Socket::recvAndCheck @ recv: peer shutdown
Quitting (on error).
Error: tcpip::Socket::recvAndCheck @ recv: peer shutdown
Quitting (on error).
Is anyone familiar with this type of exception? It seems to be a low-level failure, and I can't figure out a way to debug it.
Can you provide reproduction details?
I'm getting the same issue:
[I 09:51:25.177 NotebookApp] Saving file at /workspace/jupyter/ray/ray_learn.ipynb
F0107 09:51:50.732267 22128 direct_task_transport.cc:154] Lost connection with local raylet. Error: IOError: 14: Socket closed
*** Check failure stack trace: ***
@ 0x7f308476598d google::LogMessage::Fail()
@ 0x7f3084766dfc google::LogMessage::SendToLog()
@ 0x7f3084765669 google::LogMessage::Flush()
@ 0x7f3084765881 google::LogMessage::~LogMessage()
@ 0x7f308453a0e9 ray::RayLog::~RayLog()
@ 0x7f308445b0c9 ray::CoreWorkerDirectTaskSubmitter::RetryLeaseRequest()
@ 0x7f308445d863 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc16WorkerLeaseReplyEEZNS0_29CoreWorkerDirectTaskSubmitter24RequestNewWorkerIfNeededERKSt4pairIiSt6vectorINS0_8ObjectIDESaISC_EEEPKNS4_7AddressEEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
@ 0x7f3084494c3f ray::rpc::ClientCallImpl<>::OnReplyReceived()
@ 0x7f30844379d3 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
@ 0x7f30844363a5 boost::asio::detail::scheduler::run()
@ 0x7f308443a4a3 ray::CoreWorker::RunIOService()
@ 0x7f308908c421 execute_native_thread_routine_compat
@ 0x7f308c4876db start_thread
@ 0x7f308c1b088f clone
Just run the following in a Jupyter notebook (with ray already imported and ray.init() called earlier in the notebook):

@ray.remote
def remote_function():
    return 1

id = remote_function.remote()
+1 getting the same error
I also get this when going from 0.7.7 to 0.8.0.
F0107 12:13:33.059109 10234 direct_task_transport.cc:154] Lost connection with local raylet. Error: IOError: 14: failed to connect to all addresses
*** Check failure stack trace: ***
@ 0x7f3d27cf398d google::LogMessage::Fail()
@ 0x7f3d27cf4dfc google::LogMessage::SendToLog()
@ 0x7f3d27cf3669 google::LogMessage::Flush()
@ 0x7f3d27cf3881 google::LogMessage::~LogMessage()
@ 0x7f3d27ac80e9 ray::RayLog::~RayLog()
@ 0x7f3d279e90c9 ray::CoreWorkerDirectTaskSubmitter::RetryLeaseRequest()
@ 0x7f3d279eb863 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc16WorkerLeaseReplyEEZNS0_29CoreWorkerDirectTaskSubmitter24RequestNewWorkerIfNeededERKSt4pairIiSt6vectorINS0_8ObjectIDESaISC_EEEPKNS4_7AddressEEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
@ 0x7f3d27a22c3f ray::rpc::ClientCallImpl<>::OnReplyReceived()
@ 0x7f3d279c59d3 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
@ 0x7f3d279c43a5 boost::asio::detail::scheduler::run()
@ 0x7f3d279c84a3 ray::CoreWorker::RunIOService()
@ 0x7f3d2778d678 execute_native_thread_routine_compat
@ 0x7f3d37b176db start_thread
@ 0x7f3d3784088f clone
Aborted (core dumped)
Is this only in Jupyter notebooks? Does it work fine outside of them?
By the way, this likely means something crashed. Please include the full Jupyter logs of the application (and the Ray logs in /tmp/ray/logs).
@ericl I'm not using jupyter notebooks and still get the error.
Here are the logs:
logs.zip
I don't see anything in the logs. It must be some environment issue. Can you provide a reproduction environment as well (i.e., which cloud AMI, operating system version, python version)?
I'm on python 3.7.5, Ubuntu 18.04, running locally. Unfortunately the codebase is pretty complex and I haven't made a simple repro script.
Hmm, one of the usual suspects is hitting system limits (try setting OMP_NUM_THREADS=1), though that shouldn't have changed in 0.8.
Unfortunately we can't really help unless there is a reproduction script.
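For example, something like this (a minimal sketch; setting the variable in the shell before launching the script works just as well, as long as it happens before Ray starts its workers):

import os

# Limit per-worker thread pools to avoid hitting system thread/process limits.
os.environ["OMP_NUM_THREADS"] = "1"

import ray
ray.init()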
I see a similar (perhaps different) error from time to time on the master branch from yesterday (Python 3.7.5, Red Hat Linux 7.6):
... direct_task_transport.cc:147] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
This spams stderr until I kill the process. Restarting the cluster and rerunning the same script has worked, although that is not a good solution.
If this is a different problem happy to open a new issue.
I think these are two related but possibly different issues.
cc @zhijunfu @raulchen, have you ever seen this in production?
Hi @ericl, I have tested locally without jupyter as follows:
import ray

ray.init(temp_dir='.', num_cpus=2)

@ray.remote
def remote_function():
    return 1

id = remote_function.remote()
print(ray.get(id))
ray.shutdown()
The output message:
2020-01-08 09:35:17,620 INFO resource_spec.py:216 -- Starting Ray with 33.25 GiB memory available for workers and up to 16.63 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
F0108 09:35:17.983886 8952 direct_task_transport.cc:154] Lost connection with local raylet. Error: IOError: 14: Socket closed
*** Check failure stack trace: ***
@ 0x7fa0ea7ef98d google::LogMessage::Fail()
@ 0x7fa0ea7f0dfc google::LogMessage::SendToLog()
@ 0x7fa0ea7ef669 google::LogMessage::Flush()
@ 0x7fa0ea7ef881 google::LogMessage::~LogMessage()
@ 0x7fa0ea5c40e9 ray::RayLog::~RayLog()
@ 0x7fa0ea4e50c9 ray::CoreWorkerDirectTaskSubmitter::RetryLeaseRequest()
@ 0x7fa0ea4e7863 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc16WorkerLeaseReplyEEZNS0_29CoreWorkerDirectTaskSubmitter24RequestNewWorkerIfNeededERKSt4pairIiSt6vectorINS0_8ObjectIDESaISC_EEEPKNS4_7AddressEEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
@ 0x7fa0ea51ec3f ray::rpc::ClientCallImpl<>::OnReplyReceived()
@ 0x7fa0ea4c19d3 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
@ 0x7fa0ea4c03a5 boost::asio::detail::scheduler::run()
@ 0x7fa0ea4c44a3 ray::CoreWorker::RunIOService()
@ 0x7fa0ea266421 execute_native_thread_routine_compat
@ 0x7fa0ebfb26db start_thread
@ 0x7fa0ebcdb88f clone
Aborted (core dumped)
The error message in raylet.out:
I0108 09:35:17.872679 8927 stats.h:48] Succeeded to initialize stats: exporter address is 127.0.0.1:8888
I0108 09:35:17.881716 8927 redis_gcs_client.cc:156] RedisGcsClient Connected.
I0108 09:35:17.882542 8927 grpc_server.cc:57] ObjectManager server started, listening on port 40815.
I0108 09:35:17.883440 8927 grpc_server.cc:57] NodeManager server started, listening on port 57235.
I0108 09:35:18.145543 8927 main.cc:171] Raylet received SIGTERM, shutting down...
E0108 09:35:18.145653 8927 object_store_notification_manager.cc:49] Failed to process store length: IOError: No such file or directory, most likely plasma store is down, raylet will exit
in raylet.err:
E0108 09:35:18.145653 8927 object_store_notification_manager.cc:49] Failed to process store length: IOError: No such file or directory, most likely plasma store is down, raylet will exit
It seems the plasma store failed first, and then the raylet shut down.
The env is:
python: Python 3.7.5
ray: '0.8.0'
os: ubuntu 18.04
kernel: Linux 4.15.0-72-generic
Attaching the plasma logs.
plasma_store.err:
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0108 09:35:17.863186 8926 store.cc:1228] Allowing the Plasma store to use up to 17.8538GB of memory.
I0108 09:35:17.863298 8926 store.cc:1255] Starting object store with directory /dev/shm and huge page support disabled
I0108 09:35:18.145128 8926 store.cc:738] Disconnecting client on fd 10
I0108 09:35:18.145280 8926 store.cc:1176] SIGTERM Signal received, closing Plasma Server...
The plasma_store.out log is empty, and raylet_monitor.err contains:
*** Aborted at 1578447318 (unix time) try "date -d @1578447318" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGTERM (@0x3e8000022c7) received by PID 8925 (TID 0x7f8687e25780) from PID 8903; stack trace: ***
@ 0x7f86872f4890 (unknown)
@ 0x7f8686bf2b77 epoll_wait
@ 0x418e1c boost::asio::detail::epoll_reactor::run()
@ 0x4194b9 boost::asio::detail::scheduler::run()
@ 0x409c5f main
@ 0x7f8686af2b97 __libc_start_main
@ 0x40ed81 (unknown)
I can't reproduce on Binder (https://mybinder.org/v2/gh/ray-project/tutorial/master?urlpath=lab). It would be helpful to get a reproduction on a publicly available environment (either some notebook, or starting from a VM image).
Is it possible you have some firewall config that is preventing local connections for some port range? I think gRPC picks ports randomly from the ephemeral port range right now.
One more thing you can try is setting the env var RAY_FORCE_DIRECT=0, to confirm the issue is gRPC.
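For example (a rough sketch; the variable has to be set before ray.init so the workers inherit it):

import os

# Fall back to the non-direct task submission path to check whether the
# gRPC-based direct transport is what is failing.
os.environ["RAY_FORCE_DIRECT"] = "0"

import ray
ray.init()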
The Ray version in https://mybinder.org/v2/gh/ray-project/tutorial/master?urlpath=lab is '0.7.4'. Could it be upgraded to '0.8.0' for a try?
Yeah, I upgraded it before trying: !pip install -U ray
Can you also try setting GRPC_VERBOSITY=DEBUG and see what the output is?
You can also try setting GRPC_TRACE=tcp to see the addresses, though this is quite verbose.
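Something along these lines (a sketch; both variables need to be set before Ray starts):

import os

# Turn up gRPC's own logging to see connection attempts and failures.
os.environ["GRPC_VERBOSITY"] = "DEBUG"
# Optionally trace TCP-level activity to see which addresses are being dialed (very verbose).
os.environ["GRPC_TRACE"] = "tcp"

import ray
ray.init()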
Hi @ericl, I found that the problem is caused by http_proxy. It works if I set the proxy to empty.
You can reproduce the problem with the following:
import ray
import os

os.environ['GRPC_VERBOSITY'] = 'DEBUG'
os.environ['http_proxy'] = 'some proxy'
os.environ['https_proxy'] = 'some proxy'

ray.init(temp_dir='.', num_cpus=2)

@ray.remote
def remote_function():
    return 1

id = remote_function.remote()
print(ray.get(id))
ray.shutdown()
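As a workaround until a fix lands, clearing the proxy variables before starting Ray avoids the crash for me (a sketch, assuming the proxy is only needed for external traffic; exempting local addresses via no_proxy might also work, but I haven't verified it):

import os

# Make sure gRPC's local connections do not go through the HTTP proxy.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ.pop(var, None)
# Alternatively, keep the proxy but exempt local addresses (unverified):
# os.environ["no_proxy"] = "localhost,127.0.0.1"

import ray
ray.init(temp_dir='.', num_cpus=2)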
Thanks for figuring this out @ConeyLiu , this should fix it: https://github.com/ray-project/ray/pull/6744
Thanks for the quick fix.
@virtualluke , if you still see the issue, can you file a new bug?