My workstation can no longer initialize ray in a python3 environment, either with the default object memory allocation or with a modified allocation above an approximate limit. The system has 100+ GB of DDR3 RAM free, but ray crashes whenever more than ~10-12 GB is specified for object_store_memory. I have verified through several methods that sufficient RAM is free for the amounts ray reports it will use at initialization.
The issue persists regardless of the chosen ray version (downgrading all the way through the v0.6-0.8 releases, and also with Python v3.6.5), and I have verified it also occurs with the latest wheels. The issue did not occur as of a couple of weeks ago, and no system changes have occurred that I am aware of except upgrading ray from v0.8.2 to v0.8.4. I have performed multiple clean installations of the package and all of its listed dependencies, both through pip3 and manually, in conjunction with rebooting the system. Although the issue presents as a network error, running:
ray.init(object_store_memory=10000000000)
does function as expected. However, my code requires a significantly larger allocation to run. While I cannot test another Mac with similar resources, an identical installation with the same package versions works on a MacBook Air with 8 GB of RAM and on a Windows 10 box with 128 GB of RAM (using the Ubuntu subsystem) on the same network. My best guess is that one of the functions is timing out while waiting for the memory to be allocated. Any thoughts or directions on how to continue debugging or fix this issue would be greatly appreciated.
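For reference, a minimal probing sketch I put together to narrow down the threshold (each trial runs in a fresh subprocess, since the failure aborts the driver process outright; the sizes below are just examples around the observed boundary):
import subprocess
import sys

def init_ok(object_store_bytes):
    """Return True if ray.init() with this object store size comes up cleanly."""
    code = (
        "import ray; "
        f"ray.init(object_store_memory={object_store_bytes}); "
        "ray.shutdown()"
    )
    try:
        result = subprocess.run([sys.executable, "-c", code], timeout=180)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# Probe sizes around the observed ~10-12 GB boundary.
# If a failed trial leaves stray Ray processes behind, run "ray stop" between trials.
for gb in (10, 12, 15, 20, 30):
    print(f"{gb} GB object store: {'OK' if init_ok(gb * 10**9) else 'FAILED'}")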
Terminal Output
2020-04-07 23:15:16,053 INFO resource_spec.py:212 -- Starting Ray with 61.57 GiB memory available for workers and up to 30.38 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-04-07 23:15:16,506 INFO services.py:1148 -- View the Ray dashboard at localhost:8265
E0407 23:15:17.059361 2783282048 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-04-07_23-15-16_042185_13532/sockets/raylet (num_attempts = 1, num_retries = 5)
E0407 23:15:19.583269 2783282048 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-04-07_23-15-16_042185_13532/sockets/raylet (num_attempts = 2, num_retries = 5)
E0407 23:15:20.085359 2783282048 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-04-07_23-15-16_042185_13532/sockets/raylet (num_attempts = 3, num_retries = 5)
E0407 23:15:20.464176 2783282048 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-04-07_23-15-16_042185_13532/sockets/raylet (num_attempts = 4, num_retries = 5)
F0407 23:15:20.485735 2783282048 raylet_client.cc:83] Could not connect to socket /tmp/ray/session_2020-04-07_23-15-16_042185_13532/sockets/raylet
*** Check failure stack trace: ***
@ 0x106b25722 google::LogMessage::~LogMessage()
@ 0x106795b65 ray::RayLog::~RayLog()
@ 0x10652778b ray::raylet::RayletConnection::RayletConnection()
@ 0x106528b95 ray::raylet::RayletClient::RayletClient()
@ 0x10648ae34 ray::CoreWorker::CoreWorker()
@ 0x106427501 __pyx_tp_new_3ray_7_raylet_CoreWorker()
@ 0x105d4c1bb type_call
@ 0x105d10a89 _PyObject_FastCallKeywords
@ 0x105da7822 call_function
@ 0x105da0405 _PyEval_EvalFrameDefault
@ 0x105da80be _PyEval_EvalCodeWithName
@ 0x105d10be8 _PyFunction_FastCallKeywords
@ 0x105da7829 call_function
@ 0x105da0549 _PyEval_EvalFrameDefault
@ 0x105d10ffe function_code_fastcall
@ 0x105da7829 call_function
@ 0x105da0405 _PyEval_EvalFrameDefault
@ 0x105da80be _PyEval_EvalCodeWithName
@ 0x105d9e815 PyEval_EvalCode
@ 0x105dcd6c9 run_mod
@ 0x105dcc34c PyRun_InteractiveOneObjectEx
@ 0x105dcbb7b PyRun_InteractiveLoopFlags
@ 0x105dcbad4 PyRun_AnyFileExFlags
@ 0x105de48c5 pymain_main
@ 0x105de4fc7 _Py_UnixMain
@ 0x7fff6d758015 start
@ 0x1 (unknown)
Abort trap: 6
Raylet.err
E0407 20:06:53.051224 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 300 more times
E0407 20:06:53.170120 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 299 more times
E0407 20:06:53.271245 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 298 more times
E0407 20:06:53.373128 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 297 more times
E0407 20:06:53.473491 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 296 more times
E0407 20:06:53.578837 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 295 more times
E0407 20:06:53.684170 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 294 more times
E0407 20:06:53.789516 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 293 more times
E0407 20:06:53.894868 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 292 more times
E0407 20:06:54.000061 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 291 more times
E0407 20:06:54.100605 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 290 more times
E0407 20:06:54.201131 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 289 more times
E0407 20:06:54.306321 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 288 more times
E0407 20:06:54.409478 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 287 more times
E0407 20:06:54.510324 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 286 more times
E0407 20:06:54.615197 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 285 more times
E0407 20:06:54.720492 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 284 more times
E0407 20:06:54.823226 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 283 more times
E0407 20:06:54.924772 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 282 more times
E0407 20:06:55.026051 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 281 more times
E0407 20:06:55.130087 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 280 more times
E0407 20:06:55.231253 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 279 more times
E0407 20:06:55.333236 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 278 more times
E0407 20:06:55.433928 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 277 more times
E0407 20:06:55.537894 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 276 more times
E0407 20:06:55.639290 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 275 more times
E0407 20:06:55.743938 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 274 more times
E0407 20:06:55.847434 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 273 more times
E0407 20:06:55.952606 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 272 more times
E0407 20:06:56.057772 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 271 more times
E0407 20:06:56.158499 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 270 more times
E0407 20:06:56.262081 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 269 more times
E0407 20:06:56.366858 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 268 more times
E0407 20:06:56.468652 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 267 more times
E0407 20:06:56.573797 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 266 more times
E0407 20:06:56.676746 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 265 more times
E0407 20:06:56.777926 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 264 more times
E0407 20:06:56.881317 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 263 more times
E0407 20:06:56.984231 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 262 more times
E0407 20:06:57.087503 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 261 more times
E0407 20:06:57.192664 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 260 more times
E0407 20:06:57.297238 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 259 more times
E0407 20:06:57.399291 2783282048 io.cc:212] Connection to IPC socket failed for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/plasma_store, retrying 258 more times
*** Aborted at 1586308017 (unix time) try "date -d @1586308017" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGTERM (@0x7fff6d8a8d82) received by PID 1997 (TID 0x7fffa5e58380) stack trace: ***
@ 0x7fff6da66f5a _sigtramp
@ 0x3835320000000001 (unknown)
@ 0x7fff6d823618 usleep
@ 0x102def485 plasma::ConnectIpcSocketRetry()
@ 0x102de9734 plasma::PlasmaClient::Impl::Connect()
@ 0x102de9e71 plasma::PlasmaClient::Connect()
@ 0x1026efa1e ray::ObjectStoreNotificationManager::ObjectStoreNotificationManager()
@ 0x1026cdb19 ray::ObjectManager::ObjectManager()
@ 0x10267cdd3 ray::raylet::Raylet::Raylet()
@ 0x10267d438 ray::raylet::Raylet::Raylet()
@ 0x10260bebc main
@ 0x7fff6d758015 start
dashboard.err
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.7/site-packages/ray/dashboard/dashboard.py", line 764, in run
for x in p.listen():
File "/usr/local/lib/python3.7/site-packages/redis/client.py", line 3553, in listen
response = self.handle_message(self.parse_response(block=True))
File "/usr/local/lib/python3.7/site-packages/redis/client.py", line 3453, in parse_response
response = self._execute(conn, conn.read_response)
File "/usr/local/lib/python3.7/site-packages/redis/client.py", line 3427, in _execute
return command(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 734, in read_response
response = self._parser.read_response()
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 316, in read_response
response = self._buffer.readline()
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 248, in readline
self._read_from_socket()
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 193, in _read_from_socket
raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 1191, in get_connection
if connection.can_read():
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 729, in can_read
return self._parser.can_read(timeout)
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 313, in can_read
return self._buffer and self._buffer.can_read(timeout)
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 223, in can_read
raise_on_timeout=False)
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 193, in _read_from_socket
raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 552, in connect
sock = self._connect()
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 609, in _connect
raise err
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 597, in _connect
sock.connect(socket_address)
ConnectionRefusedError: [Errno 61] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.7/site-packages/ray/dashboard/dashboard.py", line 946, in run
self._update_nodes()
File "/usr/local/lib/python3.7/site-packages/ray/dashboard/dashboard.py", line 835, in _update_nodes
self.nodes = ray.nodes()
File "/usr/local/lib/python3.7/site-packages/ray/state.py", line 1088, in nodes
return state.client_table()
File "/usr/local/lib/python3.7/site-packages/ray/state.py", line 459, in client_table
client_table = _parse_client_table(self.redis_client)
File "/usr/local/lib/python3.7/site-packages/ray/state.py", line 31, in _parse_client_table
NIL_CLIENT_ID)
File "/usr/local/lib/python3.7/site-packages/redis/client.py", line 875, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 1195, in get_connection
connection.connect()
File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 557, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 61 connecting to 10.0.1.99:15040. Connection refused.
python-driver
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0407 20:06:52.969794 2783282048 redis_gcs_client.cc:171] RedisGcsClient Connected.
I0407 20:06:53.035861 2783282048 grpc_server.cc:64] driver server started, listening on port 55490.
E0407 20:06:53.540202 2783282048 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/raylet (num_attempts = 1, num_retries = 5)
E0407 20:06:55.671161 2783282048 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/raylet (num_attempts = 2, num_retries = 5)
E0407 20:06:56.171767 2783282048 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/raylet (num_attempts = 3, num_retries = 5)
E0407 20:06:56.675864 2783282048 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/raylet (num_attempts = 4, num_retries = 5)
F0407 20:06:57.180969 2783282048 raylet_client.cc:83] Could not connect to socket /tmp/ray/session_2020-04-07_20-06-52_516679_1897/sockets/raylet
raylet_monitor.err
*** Aborted at 1586308017 (unix time) try "date -d @1586308017" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGTERM (@0x7fff6d8a9bea) received by PID 1993 (TID 0x7fffa5e58380) stack trace: ***
@ 0x7fff6da66f5a _sigtramp
@ 0x7fffea2442ed (unknown)
@ 0x1063b0c66 boost::asio::detail::scheduler::do_run_one()
@ 0x1063a4c42 boost::asio::detail::scheduler::run()
@ 0x1063a4adb boost::asio::io_context::run()
@ 0x10618de4d main
@ 0x7fff6d758015 start
@ 0x5 (unknown)
redis-shard_0.out
1992:C 07 Apr 2020 20:06:52.678 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1992:C 07 Apr 2020 20:06:52.678 # Redis version=5.0.3, bits=64, commit=00000000, modified=0, pid=1992, just started
1992:C 07 Apr 2020 20:06:52.678 # Configuration loaded
1992:M 07 Apr 2020 20:06:52.681 # Server initialized
1992:signal-handler (1586308017) Received SIGTERM scheduling shutdown...
1992:M 07 Apr 2020 20:06:57.937 # User requested shutdown...
1992:M 07 Apr 2020 20:06:57.937 # Redis is now ready to exit, bye bye...
plasma_store.err
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0407 20:06:52.963517 2783282048 store.cc:1244] Allowing the Plasma store to use up to 17.3442GB of memory.
I0407 20:06:52.964265 2783282048 store.cc:1271] Starting object store with directory /tmp and huge page support disabled
I0407 20:06:57.876449 2783282048 store.cc:1192] SIGTERM Signal received, closing Plasma Server...
Ray version and other system information (Python version, TensorFlow version, OS):
System information
-------------------------------------
Mac OS X 10.13.6
2 x 2.8 GHz 8-Core Intel Xeon E5
128 GB 1866 MHz DDR3
Package/Program, Version
-------------------------------------
Python 3.7.7
aiohttp 3.6.2
appnope 0.1.0
async-timeout 3.0.1
attrs 19.3.0
backcall 0.1.0
beautifulsoup4 4.9.0
bleach 3.1.4
chardet 3.0.4
click 7.1.1
colorama 0.4.3
cycler 0.10.0
DateTime 4.3
decorator 4.4.2
defusedxml 0.6.0
dill 0.3.1.1
entrypoints 0.3
filelock 3.0.12
glob3 0.0.1
google 2.0.3
grpcio 1.28.1
idna 2.9
imageio 2.8.0
importlib-metadata 1.6.0
ipykernel 5.2.0
ipython 7.13.0
ipython-genutils 0.2.0
ipywidgets 7.5.1
jedi 0.16.0
Jinja2 2.11.1
joblib 0.14.1
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.2
jupyter-console 6.1.0
jupyter-core 4.6.3
kiwisolver 1.2.0
MarkupSafe 1.1.1
matplotlib 3.2.1
mistune 0.8.4
multidict 4.7.5
multiprocess 0.70.9
natsort 7.0.1
nbconvert 5.6.1
nbformat 5.0.5
networkx 2.4
notebook 6.0.3
numpy 1.18.2
opencv-python 4.2.0.34
pandas 1.0.3
pandocfilters 1.4.2
parso 0.6.2
pexpect 4.8.0
pickleshare 0.7.5
Pillow 7.1.1
pip 20.0.2
prometheus-client 0.7.1
prompt-toolkit 3.0.5
protobuf 3.11.3
psutil 5.7.0
ptyprocess 0.6.0
py-spy 0.3.3
Pygments 2.6.1
pyparsing 2.4.7
pyrsistent 0.16.0
python-dateutil 2.8.1
pytz 2019.3
PyWavelets 1.1.1
PyYAML 5.3.1
pyzmq 19.0.0
qtconsole 4.7.2
QtPy 1.9.0
ray 0.8.4
redis 3.4.1
scikit-image 0.16.2
scikit-learn 0.22.2.post1
scipy 1.4.1
Send2Trash 1.5.0
setuptools 46.0.0
six 1.14.0
sklearn 0.0
sobol 0.9
sobol-seq 0.1.2
soupsieve 2.0
terminado 0.8.3
testpath 0.4.4
tornado 6.0.4
tqdm 4.45.0
traitlets 4.3.3
wcwidth 0.1.9
webencodings 0.5.1
wheel 0.34.2
widgetsnbextension 3.5.1
yarl 1.4.2
zipp 3.1.0
zope.interface 5.0.2
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
import ray
ray.init()
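# Note: on the failing machine, a bare ray.init() defaults to ~30 GiB for the
# object store (see the resource_spec line in the terminal output above),
# which is enough to trigger the crash.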
If we cannot run your script, we cannot fix your issue.
It seems like the plasma store couldn't obtain the 30 GB memory assignment, which broke the raylet and, in turn, everything else. I assume ray.init(object_store_memory=15000000000) also works. Am I correct? Also, can you see any logs in plasma_store.out?
Also, https://github.com/ray-project/ray/issues/3670 looks relevant to me. It says the problem has to do with how the plasma store calculates available memory (it uses shared memory). So, I'd assume increasing the shared memory on your machine might resolve this issue.
Can you try increasing your shared memory to 35 GB following this procedure and see if it works? http://www.spy-hill.com/help/apple/SharedMemory.html
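To see where the limits currently stand, a quick check like the sketch below can be run before and after the change (just an illustration; the kern.sysv.* keys are the standard macOS sysctl names):
import subprocess

# Query the current System V shared-memory limits on macOS.
# shmmax is in bytes; shmall is typically counted in 4 KB pages.
for key in ("kern.sysv.shmmax", "kern.sysv.shmall", "kern.sysv.shmseg"):
    result = subprocess.run(["sysctl", key], capture_output=True, text=True)
    print((result.stdout or result.stderr).strip())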
Thank you for your quick response. You are indeed correct: ray.init(object_store_memory=15000000000) also works. The plasma_store.out log is blank. I will try increasing the shared memory with a reboot in the next day or so; making the change with sysctl alone has no immediate effect on the result. It would surprise me if this were the issue, though, given that ray was able to allocate the space a couple of weeks ago on this same system.
I have confirmed that increasing the shared memory according to the prescribed procedure does not fix the issue.
Okay. It seems this has been an issue before, and we had a fix in the past (https://github.com/ray-project/ray/issues/4193).
I assume your problem could be this one, based on the fact that you said "it worked in the past, but not anymore".
https://github.com/modin-project/modin/issues/468#issuecomment-486313223
Please try this and see if it works.
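For illustration, the linked workaround amounts to pointing the object store at a different directory, roughly like the sketch below (assuming ray.init in the 0.8.x releases accepts a plasma_directory argument, as the reply below suggests; the path is only an example):
import ray

# Hypothetical path on a drive with plenty of free space; adjust to your setup.
ray.init(
    object_store_memory=30 * 10**9,            # ~30 GB object store
    plasma_directory="/Volumes/BigDisk/ray_plasma",
)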
Fascinating... I can confirm that the /tmp directory is located on a disk with 600 GB+ of free space. I also tried setting plasma_directory to another location on the same disk, to no avail. However, setting it to another disk with 2 TB+ free allows it to work just fine. Is there a minimum amount of disk space required on a per-thread basis? It seems like this kind of error should produce an "Out of Space" notification in plasma_store.err.
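For reference, free space for candidate plasma_directory locations can be compared with something like the following (a quick sketch; the paths are examples):
import shutil

# Report free/total space for a few candidate plasma_directory locations.
for path in ("/tmp", "/Volumes/BigDisk"):   # example paths
    usage = shutil.disk_usage(path)
    print(f"{path}: {usage.free / 1e9:.1f} GB free of {usage.total / 1e9:.1f} GB")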
Glad the problem has been resolved! It seems that on macOS the /tmp directory is backed by the RAM space used to map shared memory (on Linux this is /dev/shm), and your original filesystem doesn't have enough of that RAM space (30 GB) available for /tmp; it only has 16 GB (https://superuser.com/a/227714).
About the disk space issue, you can probably raise an issue with Apache Arrow, as they are primarily maintaining the plasma store now. I will keep this issue open so that we can decide whether we should work on it from our side.
@pcmoritz Is there a way we can modify plasma code? Or should we create an issue there?
I visited this page many times while trying to figure out why I could not get my custom Docker stack to work. I was getting errors very similar to the OP's and noticed that my socket file was never being created in /tmp/ray. After hours of searching, I learned from studying https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/local/example-full.yaml that I needed to call ulimit -c unlimited in my start script before calling ray start --head.
I understand that there may be a larger issue here that you are trying to address, but in an effort to save other people time I wanted to post my findings here.
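For anyone applying the same workaround from Python rather than a shell start script, a rough equivalent is sketched below (my own sketch, not taken from example-full.yaml itself):
import resource
import subprocess

# Equivalent of "ulimit -c unlimited" (bounded by the hard limit), applied
# before launching Ray so the raylet and workers inherit it.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

subprocess.run(["ray", "start", "--head"], check=True)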
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
Will close this issue, as I think there's no action to be taken.
The issue appears to be resolved as of the 1.0.0 release.