Ray: [rllib] Invalid return value: likely worker died or was killed while executing the task.

Created on 18 Jul 2018 · 10 comments · Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Ray installed from (source or binary): source
  • Ray version: 0.5.0
  • Python version: 3.5.1
  • Exact command to reproduce:

Describe the problem

I am testing an algorithm that is very similar to the A3C example. I have a virtual machine with 4 sockets, each with 20 cores, so the machine has 80 physical cores. When I run the program with num_workers = 20, it works perfectly. However, when I increase num_workers to 64, I get

terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable 

and

File "custom_a3c.py", line 14, in <module>
    result = agent.train()
  File "/home/richardwlc93/.local/lib/python3.5/site-packages/ray/python/ray/tune/trainable.py", line 117, in train
    result = self._train()
  File "/home/richardwlc93/flocking-with-ray/ray for RL/ray-0712/a3c.py", line 102, in _train
    self.optimizer.step()
  File "/home/richardwlc93/.local/lib/python3.5/site-packages/ray/python/ray/rllib/optimizers/async_gradients_optimizer.py", line 43, in step
    gradient, _ = ray.get(fut)
  File "/home/richardwlc93/.local/lib/python3.5/site-packages/ray/python/ray/worker.py", line 2643, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(68962c51b79bb422c0705ee3de0f52268c2f693d). It was created by remote function <unknown> which failed with:
Remote function <unknown> failed with:
Invalid return value: likely worker died or was killed while executing the task.

Maybe I need to reserve some resources via ray.init(). Could you please give me some hints on how to fix this problem?

Maybe it has something to do with the sockets: whenever num_workers is no larger than the number of cores in a single socket (20 in this case), the program works.
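
Something like the sketch below is what I have in mind for reserving resources, assuming that just means capping the CPUs Ray schedules on via ray.init(num_cpus=...); the value 20 (one socket's worth of cores) is purely illustrative.

    import ray

    # Tell Ray to schedule at most 20 CPU-bound tasks/actors at a time,
    # regardless of how many cores the machine actually reports.
    ray.init(num_cpus=20)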

Here is the information of the machine

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:   0-159
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:              0
CPU MHz:               2199.998
BogoMIPS:              4399.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              56320K
NUMA node0 CPU(s):     0-19,80-99
NUMA node1 CPU(s):     20-39,100-119
NUMA node2 CPU(s):     40-59,120-139
NUMA node3 CPU(s):     60-79,140-159

Source code / logs


All 10 comments

It could be that you're exceeding some user limit (ulimit -a to list them), such as the maximum number of open files. You should be able to raise these with ulimit as well.
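
For example, something like the following sketch prints the two per-process limits that most often show up as "Resource temporarily unavailable" (max open files and max processes/threads); which limit, if any, is actually being hit here is just a guess.

    import resource

    # Print the soft/hard per-process limits for open files and processes/threads.
    for name, limit in [("open files (RLIMIT_NOFILE)", resource.RLIMIT_NOFILE),
                        ("processes/threads (RLIMIT_NPROC)", resource.RLIMIT_NPROC)]:
        soft, hard = resource.getrlimit(limit)
        print("%s: soft=%s, hard=%s" % (name, soft, hard))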

@ericl I tried increasing the max number of open files using ulimit -n, but it does not help.

I found another piece of information that may help pin down the cause. The env in the above example is PongDeterministic-v4. If I change the env to Pong-ram-v4, the program runs without any error.

Maybe you're running into memory usage issues? Did you check whether the node is low on memory or whether there are any crashes in the dmesg output?
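
As a quick check, something like this sketch (Linux-specific, reading /proc/meminfo) run on the node during training would show how much memory is actually available; it is only a diagnostic suggestion.

    # Print available memory from /proc/meminfo while the training run is active.
    with open("/proc/meminfo") as f:
        meminfo = dict(line.split(":", 1) for line in f)

    available_kb = int(meminfo["MemAvailable"].strip().split()[0])
    print("MemAvailable: %.1f GB" % (available_kb / 1024.0 / 1024.0))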

The last 5 lines of dmesg are:

[  204.163002] show_signal_msg: 4 callbacks suppressed
[  204.163005] python3[36500]: segfault at 7f86b7f399d0 ip 00007f86db42f8d9 sp 00007ffd00b14448 error 4 in libpthread-2.23.so[7f86db427000+18000]
[  204.166171] python3[36505]: segfault at 7f1dab4fe9d0 ip 00007f1de61eb8d9 sp 00007ffc330e6e28 error 4 in libpthread-2.23.so[7f1de61e3000+18000]
[  204.170551] python3[36496]: segfault at 7ff2983aa9d0 ip 00007ff2a386c8d9 sp 00007ffdecf588f8 error 4 in libpthread-2.23.so[7ff2a3864000+18000]
[  204.185826] python3[36495]: segfault at 7f015d3829d0 ip 00007f01680478d9 sp 00007ffdb51d0598 error 4 in libpthread-2.23.so[7f016803f000+18000]

I think there should be sufficient memory available in the virtual machine. It is a VM on Google Cloud and its type is n1-ultramem-160 (160 vCPUs, 3,844 GB memory).

Do you have a minimal script to reproduce the segfault?


@ericl Yes. You can create a virtual machine on Google Cloud, n1-standard-96 (96 vCPUs, 360 GB memory) or n1-ultramem-160 (160 vCPUs, 3,844 GB memory), and then run python/ray/rllib/train.py --env=PongDeterministic-v4 --run=A3C --config='{"num_workers": 64}'. I installed Ray from source on July 12; here is the tree: https://github.com/ray-project/ray/tree/d6af50785e2903d69c7809f019f7ea555f9f2688. After running the script, it raises

terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
The worker with ID 3b940af724283a3b7141244b6c5bd0f2c2277e63 died or was killed while executing the task with ID a3b6a0732a5df6bea1a6a735bc206504d505fe52
terminate called after throwing an instance of 'terminate called after throwing an instance of 'std::system_errorstd::system_error'
'
  what():    what():  Resource temporarily unavailableResource temporarily unavailable

The worker with ID 92b0d60bbeb0760bec00cbec3f8b719a2b45a76c died or was killed while executing the task with ID 5089cc2a17b0511054bf9a98d71a27c7472ff23c
The worker with ID 82235b5c1952ca6bae2dde6dd8b9f72828361a53 died or was killed while executing the task with ID b7e0398ae78b81be3990cf07f446d21e55996565
The worker with ID 961267fd722462630c4d27b59d2e30a304cafbee died or was killed while executing the task with ID 4338446f1902f21c565e2fb07f8ed0fc076551c3
The worker with ID ff37512e084ac37ae1e64e5848139927c3d18cc9 died or was killed while executing the task with ID 7c9379fbb537f467d81e6c6deeb291564490091a
Remote function train failed with:

Traceback (most recent call last):
  File "/home/richardwlc93/.local/lib/python3.5/site-packages/ray/python/ray/worker.py", line 892, in _process_task
    *arguments)
  File "/home/richardwlc93/.local/lib/python3.5/site-packages/ray/python/ray/actor.py", line 261, in actor_method_executor
    method_returns = method(actor, *args)
  File "/home/richardwlc93/.local/lib/python3.5/site-packages/ray/python/ray/tune/trainable.py", line 117, in train
    result = self._train()
  File "/home/richardwlc93/.local/lib/python3.5/site-packages/ray/python/ray/rllib/agents/a3c/a3c.py", line 101, in _train
    self.optimizer.step()
  File "/home/richardwlc93/.local/lib/python3.5/site-packages/ray/python/ray/rllib/optimizers/async_gradients_optimizer.py", line 43, in step
    gradient, _ = ray.get(fut)
  File "/home/richardwlc93/.local/lib/python3.5/site-packages/ray/python/ray/worker.py", line 2643, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(fb4646ba7cc05f275b2a505e1f8bdc6a6cac90e2). It was created by remote function <unknown> which failed with:

Remote function <unknown> failed with:

Invalid return value: likely worker died or was killed while executing the task.

If you change the env to Pong-ram-v4, or decrease num_workers, then everything is fine. I want to compare our algorithm with A3C and reuse most of the code from A3C, and my custom code has the same issue.

@ericl Can you reproduce the error? If not, you can use my account on Google Cloud Platform and run the program.

I've yet to try this out, but setting export OMP_NUM_THREADS=1 (you may have to do this during ray start?) fixed a similar segfault someone else observed. Apparently each process otherwise tries to spawn 144 OMP threads, and you run into an operating system limit.

This might only be triggered by TensorFlow doing specific things, which could explain why Pong-ram-v4 works but the convnet-based env does not.
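
Something along these lines should also work from Python, as long as the variable is set before TensorFlow (or any other OpenMP-using library) is imported; whether a value set in the driver propagates to Ray worker processes depends on how they are launched, so exporting it in the shell before ray start is the safer route. Just a sketch:

    import os
    # Must be set before any OpenMP-using library (e.g. TensorFlow) is imported,
    # so each process spawns a single OMP thread instead of one per CPU.
    os.environ["OMP_NUM_THREADS"] = "1"

    import ray
    ray.init()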

@ericl You are right. After setting export OMP_NUM_THREADS=1, it works now. I really appreciate your help!

Looks like the issue is resolved; closing for now. Feel free to reopen if other related issues arise.
