Ray: [tune] A worker dies after 1000 samples

Created on 9 Oct 2020 · 28 comments · Source: ray-project/ray

What is the problem?

Workers crash while testing the Hyperopt example script (https://docs.ray.io/en/latest/tune/examples/hyperopt_example.html) with a larger number of samples and fewer steps. The modified version of the script is attached below. The only changes are reducing the number of steps from 100 to 10 (to reach the issue faster) and increasing the number of samples from 1000 to 10000. The worker crashes happen just beyond 1000 samples, but at no fixed point.

The following error message appears when a worker crashes:

(pid=136534) F1009 07:02:26.344231 136534 136534 service_based_gcs_client.cc:207] Couldn't reconnect to GCS server. The last attempted GCS server address was 192.168.1.98:39953
(pid=136534) *** Check failure stack trace: ***
(pid=136534)     @     0x7f56afdb7f5d  google::LogMessage::Fail()
(pid=136534)     @     0x7f56afdb90bc  google::LogMessage::SendToLog()
(pid=136534)     @     0x7f56afdb7c39  google::LogMessage::Flush()
(pid=136534)     @     0x7f56afdb7e51  google::LogMessage::~LogMessage()
(pid=136534)     @     0x7f56afd6eff9  ray::RayLog::~RayLog()
(pid=136534)     @     0x7f56afab2a5a  ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
(pid=136534)     @     0x7f56afab2b5f  ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
(pid=136534)     @     0x7f56afab2d01  ray::gcs::ServiceBasedGcsClient::PeriodicallyCheckGcsServerAddress()
(pid=136534)     @     0x7f56afab5071  ray::gcs::ServiceBasedGcsClient::Connect()
(pid=136534)     @     0x7f56afa36746  ray::CoreWorker::CoreWorker()
(pid=136534)     @     0x7f56afa3a484  ray::CoreWorkerProcess::CreateWorker()
(pid=136534)     @     0x7f56afa3b6f2  ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=136534)     @     0x7f56afa3c0bb  ray::CoreWorkerProcess::Initialize()
(pid=136534)     @     0x7f56af975dce  __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
(pid=136534)     @     0x7f56af9775b5  __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=136534)     @     0x556d5bd4e909  _PyObject_FastCallKeywords
(pid=136534)     @     0x556d5bdb439e  _PyEval_EvalFrameDefault
(pid=136534)     @     0x556d5bcf7160  _PyEval_EvalCodeWithName
(pid=136534)     @     0x556d5bd47107  _PyFunction_FastCallKeywords
(pid=136534)     @     0x556d5bdb0585  _PyEval_EvalFrameDefault
(pid=136534)     @     0x556d5bcf6829  _PyEval_EvalCodeWithName
(pid=136534)     @     0x556d5bcf7714  PyEval_EvalCodeEx
(pid=136534)     @     0x556d5bcf773c  PyEval_EvalCode
(pid=136534)     @     0x556d5be0ef14  run_mod
(pid=136534)     @     0x556d5be19331  PyRun_FileExFlags
(pid=136534)     @     0x556d5be19523  PyRun_SimpleFileExFlags
(pid=136534)     @     0x556d5be1a655  pymain_main
(pid=136534)     @     0x556d5be1a77c  _Py_UnixMain
(pid=136534)     @     0x7f56b0fb90b3  __libc_start_main
(pid=136534)     @     0x556d5bdbeff0  (unknown)
A worker died or was killed while executing task ffffffffffffffff92c1835601000000.

The main job does not crash but gets stuck once all workers are dead. Killing the main job from the keyboard results in:

^CTraceback (most recent call last):
  File "test.py", line 72, in <module>
    **tune_kwargs)
  File "/home/demattia/miniconda3/envs/test_tune/lib/python3.7/site-packages/ray/tune/tune.py", line 405, in run
    runner.step()
  File "/home/demattia/miniconda3/envs/test_tune/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 375, in step
    self._process_events()  # blocking
  File "/home/demattia/miniconda3/envs/test_tune/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 475, in _process_events
    trial = self.trial_executor.get_next_available_trial()  # blocking
  File "/home/demattia/miniconda3/envs/test_tune/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 463, in get_next_available_trial
    [result_id], _ = ray.wait(shuffled_results)
  File "/home/demattia/miniconda3/envs/test_tune/lib/python3.7/site-packages/ray/worker.py", line 1558, in wait
    worker.current_task_id,
  File "python/ray/_raylet.pyx", line 939, in ray._raylet.CoreWorker.wait
  File "python/ray/_raylet.pyx", line 144, in ray._raylet.check_status
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/demattia/miniconda3/envs/test_tune/lib/python3.7/site-packages/ray/worker.py", line 784, in shutdown
    time.sleep(0.5)
KeyboardInterrupt

The file /tmp/ray/gcs_server.out is very large, so I only provide a snippet in case it is helpful. The following snippet is representative of the log while the workers are still alive:

I1009 07:02:24.829182 112317 112317 gcs_worker_manager.cc:31] Reporting worker failure, worker id = e9f747a7936a76818495506b3f11c238bda56539, node id = 2041b2a79d53a7b4776d1553e99f625ccd26c15b, address = 192.168.1.98
W1009 07:02:24.829221 112317 112317 gcs_actor_manager.cc:544] Actor 35557ddd01000000 is out of scope,, destroying actor child.
W1009 07:02:24.829221 112317 112317 gcs_actor_manager.cc:544] Actor 35557ddd01000000 is out of scope,, destroying actor child.
I1009 07:02:24.829226 112317 112317 gcs_actor_manager.cc:558] Destroying actor, actor id = 35557ddd01000000
I1009 07:02:24.829228 112317 112317 gcs_actor_manager.cc:1065] Erasing actor 35557ddd01000000 owned by 01000000ffffffffffffffffffffffffffffffff
W1009 07:02:24.829443 112317 112317 gcs_worker_manager.cc:67] Failed to report worker failure, the worker doesn't exist, worker id = e9f747a7936a76818495506b3f11c238bda56539, node id = 2041b2a79d53a7b4776d1553e99f625ccd26c15b, address = 192.168.1.98
W1009 07:02:24.829443 112317 112317 gcs_worker_manager.cc:67] Failed to report worker failure, the worker doesn't exist, worker id = e9f747a7936a76818495506b3f11c238bda56539, node id = 2041b2a79d53a7b4776d1553e99f625ccd26c15b, address = 192.168.1.98

After the workers crash and the main job is stuck, the content becomes:

I1009 07:03:00.041805 112317 112317 gcs_actor_scheduler.cc:387] Retry creating actor 92c1835601000000 on worker 7bb2509f320bd6f298c0f4a49bb9debecd2349e2 at node 2041b2a79d53a7b4776d1553e99f625ccd26c15b
I1009 07:03:00.041824 112317 112317 gcs_actor_scheduler.cc:322] Start creating actor 92c1835601000000 on worker 7bb2509f320bd6f298c0f4a49bb9debecd2349e2 at node 2041b2a79d53a7b4776d1553e99f625ccd26c15b
I1009 07:03:00.111589 112317 112317 gcs_actor_scheduler.cc:387] Retry creating actor 7bb0b57001000000 on worker 3a3c473491f3af24c7eb1796c06b1ed67a13690e at node 2041b2a79d53a7b4776d1553e99f625ccd26c15b
I1009 07:03:00.111609 112317 112317 gcs_actor_scheduler.cc:322] Start creating actor 7bb0b57001000000 on worker 3a3c473491f3af24c7eb1796c06b1ed67a13690e at node 2041b2a79d53a7b4776d1553e99f625ccd26c15b
I1009 07:03:00.128782 112317 112317 gcs_actor_scheduler.cc:387] Retry creating actor 4d95c66301000000 on worker af65091de52553f02a58a926af9ce296da7b45ac at node 2041b2a79d53a7b4776d1553e99f625ccd26c15b

Ray version and other system information (Python version, TensorFlow version, OS):

OS: Ubuntu 20.04.1 LTS (Focal Fossa)
Ray version = 1.0.0
Python version = 3.7.9
Hyperopt version = 0.2.5

Reproduction (REQUIRED)

Set up the conda environment:

conda create --name test_tune python=3.7
pip install 'ray[tune]' torch torchvision
(taken from the quickstart: https://docs.ray.io/en/latest/tune/index.html)
pip install hyperopt

Full pip freeze:

aiohttp==3.6.2
aiohttp-cors==0.7.0
aioredis==1.3.1
async-timeout==3.0.1
attrs==20.2.0
beautifulsoup4==4.9.3
blessings==1.7
cachetools==4.1.1
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
cloudpickle==1.6.0
colorama==0.4.3
colorful==0.5.4
decorator==4.4.2
filelock==3.0.12
flake8==3.8.4
future==0.18.2
google==3.0.0
google-api-core==1.22.4
google-auth==1.22.1
googleapis-common-protos==1.52.0
gpustat==0.6.0
grpcio==1.32.0
hiredis==1.1.0
hyperopt==0.2.5
idna==2.10
importlib-metadata==2.0.0
jsonschema==3.2.0
mccabe==0.6.1
msgpack==1.0.0
multidict==4.7.6
networkx==2.5
numpy==1.19.2
nvidia-ml-py3==7.352.0
opencensus==0.7.10
opencensus-context==0.1.1
pandas==1.1.3
Pillow==7.2.0
prometheus-client==0.8.0
protobuf==3.13.0
psutil==5.7.2
py-spy==0.3.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycodestyle==2.6.0
pyflakes==2.2.0
pyrsistent==0.17.3
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
ray==1.0.0
redis==3.4.1
requests==2.24.0
rsa==4.6
scipy==1.5.2
six==1.15.0
soupsieve==2.0.1
tabulate==0.8.7
tensorboardX==2.1
torch==1.6.0
torchvision==0.7.0
tqdm==4.50.2
typing-extensions==3.7.4.3
urllib3==1.25.10
yarl==1.6.0
zipp==3.3.0

Script to reproduce the issue:

"""This test checks that HyperOpt is functional.

It also checks that it is usable with a separate scheduler.
"""
import time

import ray
from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest.hyperopt import HyperOptSearch


def evaluation_fn(step, width, height):
    return (0.1 + width * step / 100)**(-1) + height * 0.1


def easy_objective(config):
    # Hyperparameters
    width, height = config["width"], config["height"]

    for step in range(config["steps"]):
        # Iterative training function - can be any arbitrary training procedure
        intermediate_score = evaluation_fn(step, width, height)
        # Feed the score back to Tune.
        tune.report(iterations=step, mean_loss=intermediate_score)
        time.sleep(0.1)


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test", action="store_true", help="Finish quickly for testing")
    args, _ = parser.parse_known_args()
    ray.init(configure_logging=False)

    current_best_params = [
        {
            "width": 1,
            "height": 2,
            "activation": 0  # Activation will be relu
        },
        {
            "width": 4,
            "height": 2,
            "activation": 1  # Activation will be tanh
        }
    ]

    tune_kwargs = {
        "num_samples": 10 if args.smoke_test else 10000,
        "config": {
            "steps": 10,
            "width": tune.uniform(0, 20),
            "height": tune.uniform(-100, 100),
            # This is an ignored parameter.
            "activation": tune.choice(["relu", "tanh"])
        }
    }
    algo = HyperOptSearch(points_to_evaluate=current_best_params)
    algo = ConcurrencyLimiter(algo, max_concurrent=4)

    scheduler = AsyncHyperBandScheduler()
    tune.run(
        easy_objective,
        search_alg=algo,
        scheduler=scheduler,
        metric="mean_loss",
        mode="min",
        **tune_kwargs)

The script is a small variation of the example at https://docs.ray.io/en/latest/tune/examples/hyperopt_example.html, with the only changes being the num_samples increase to 10000 and the steps decrease to 10.

Save the script as test.py and run python test.py. Expect the workers to start crashing after about 1000 samples. The point at which they crash is not exactly reproducible; however, the fact that they crash is. I was never able to complete a full run over 10000 samples, and on my setup the workers tend to crash between 1000 and 2000 samples (trials). For example, in the last attempt the workers started crashing just after I saw the message ... 1091 more trials not shown (1091 TERMINATED) on screen.

  • [x] I have verified my script runs in a clean environment and reproduces the issue.
  • [ ] I have verified the issue also occurs with the latest wheels.
Labels: P0, bug, core

All 28 comments

Wow @demattia thanks a bunch for this super comprehensive repro script!

cc @krfricke who has been looking into this.

This seems like a P0 since it's a Ray core crash. cc @rkooo567 @stephanie-wang

(assuming it's a new issue in 1.0)

@raulchen, looks like a GCS crash?

@demattia Did you see any logs from gcs_server.err?

I just tried running it again and the gcs_server.err file contains only one line:

$ more /tmp/ray/session_2020-10-16_17-13-46_362805_33697/logs/gcs_server.err
E1016 17:14:42.006647171   33721 tcp_server_posix.cc:213]    Failed accept4: Too many open files
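(As a side note, one way to watch how many file descriptors the gcs_server process has open while the job runs is a small psutil sketch like the one below; psutil is already in the pip freeze above, and matching the process by the name "gcs_server" is an assumption.)

import psutil

# Print the open file-descriptor count for every process whose name contains
# "gcs_server"; on Linux, num_fds() reads /proc/<pid>/fd.
for proc in psutil.process_iter(["name"]):
    if proc.info["name"] and "gcs_server" in proc.info["name"]:
        print(proc.pid, "open fds:", proc.num_fds())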

Interesting, maybe the GCS is not cleaning up connections to actors that have died? That would explain why it only happens after a large number of samples (unique actors created).

If this is true, it should be reproducible just by creating a bunch of actors and removing them.
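A minimal stress test along those lines might look like the sketch below (my own untested sketch, not the repro script): create and destroy many short-lived actors and watch whether the open-socket or file-descriptor count keeps growing.

import ray


@ray.remote
class Dummy:
    def ping(self):
        return "pong"


ray.init(configure_logging=False)

# Create and immediately destroy many short-lived actors; if connections are
# not cleaned up, the open-fd count should grow with the iteration count.
for i in range(5000):
    a = Dummy.remote()
    ray.get(a.ping.remote())
    del a  # drop the only handle so the actor goes out of scope and is destroyed
    if i % 500 == 0:
        print("created and destroyed", i, "actors")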

@demattia How do you start your cluster? When you start, can you run with this command? ulimit -c unlimited; ray start ...

or if you just use ray.init, ulimit -c unlimited; python ...

@ericl to test whether the lack of connection cleanup is the culprit, do you have any suggestions on a small test I could run?

@rkooo567 I just tried with ulimit -c unlimited; python test.py. The job still reaches a locked state, and I get the following output:

A worker died or was killed while executing task ffffffffffffffffa088417a01000000.
(pid=raylet) E1016 18:36:37.321611 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.321856 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.321877 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.321892 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.321905 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.321919 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
A worker died or was killed while executing task ffffffffffffffffb92a749101000000.
(pid=raylet) E1016 18:36:37.354800 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.354835 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.354849 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.354863 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.354887 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.354899 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.366017 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.366053 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.366080 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.366093 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.366106 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.366118 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.366132 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.382012 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.382083 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.382256 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.382285 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.382298 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.382318 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
A worker died or was killed while executing task ffffffffffffffffb205b1c001000000.
A worker died or was killed while executing task ffffffffffffffff4bc8365e01000000.
(pid=raylet) E1016 18:36:37.400481 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.400910 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.400934 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.400954 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.400971 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.400998 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.401015 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.415406 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.415614 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.415827 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.416010 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.416209 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer
(pid=raylet) E1016 18:36:37.416402 62950 62950 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Connection reset by peer

The job memory usage doubles at this point and it remains in that locked status until I kill it. Interestingly, the gcs_server.err file still shows the same:

$ cat /tmp/ray/session_2020-10-16_18-34-01_829650_62922/logs/gcs_server.err
E1016 18:34:57.824522905   62941 tcp_server_posix.cc:213]    Failed accept4: Too many open files

Maybe it has to do with the file descriptor limit? The ulimit -a output is as follows:

$ ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         unlimited
-m: resident set size (kbytes)      unlimited
-u: processes                       256847
-n: file descriptors                1024
-l: locked-in-memory size (kbytes)  65536
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 256847
-q: bytes in POSIX msg queues       819200
-e: max nice                        0
-r: max rt priority                 0
-N 15:                              unlimited

Yeah, I think ulimit -c unlimited should allow an unlimited number of open file descriptors (but based on your ulimit -a output, it is probably -n). Maybe you can instead try ulimit -n 65000; ray start...

Ah yeah, I can see you have a 1024 fd limit! I think the root cause is that we probably don't properly clean up connections, but this might help you mitigate the issue without fixing the root cause.
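If editing shell limits is inconvenient, the soft limit can also be raised from inside the driver script before ray.init(); here is a sketch (my own workaround idea, not an official Ray recommendation) that only affects the driver and the processes it starts.

import resource

# Raise the soft open-file limit up to the hard limit (or 65536, whichever is
# smaller); the hard limit itself cannot be raised without privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("current fd limits:", soft, hard)
target = 65536 if hard == resource.RLIM_INFINITY else min(65536, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))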

@rkooo567 I was able to get through the initial crash via ulimit -c unlimited -n 65536; python test.py, thank you for the suggestion.

The job slows down tremendously after the first few thousand samples. Is that expected from Hyperopt or could this be a consequence of the root cause?

I noticed this in the output:

Number of trials: 9119 (4 RUNNING, 9115 TERMINATED)
+-------------------------+------------+---------------------+--------------+----------+---------+---------+-----------+--------+------------------+--------------+-----------------+
| Trial name              | status     | loc                 | activation   |   height |   steps |   width |      loss |   iter |   total time (s) |   iterations |   neg_mean_loss |
|-------------------------+------------+---------------------+--------------+----------+---------+---------+-----------+--------+------------------+--------------+-----------------|
| easy_objective_baec28af | RUNNING    | 192.168.1.98:293301 | tanh         | -99.9262 |      10 | 19.9911 | -9.40414  |      9 |     29.8703      |            8 |        9.40414  |
| easy_objective_c6e4fff9 | RUNNING    | 192.168.1.98:293435 | tanh         | -99.97   |      10 | 19.9964 | -8.56821  |      4 |     10.0016      |            3 |        8.56821  |
| easy_objective_cce9da7f | RUNNING    |                     | tanh         | -64.901  |      10 | 19.9986 |           |        |                  |              |                 |
| easy_objective_cce9da8b | RUNNING    |                     | tanh         | -99.9644 |      10 | 18.5679 |           |        |                  |              |                 |
| easy_objective_00070a3a | TERMINATED |                     | tanh         | -96.2074 |      10 | 19.9978 |  0.379256 |      1 |      0.000170946 |            0 |       -0.379256 |
| easy_objective_00070a3e | TERMINATED |                     | relu         | -99.8693 |      10 | 19.9823 | -9.46017  |     10 |     40.2135      |            9 |        9.46017  |
| easy_objective_00070a41 | TERMINATED |                     | tanh         | -91.1667 |      10 | 19.6413 |  0.883334 |      1 |      0.000124931 |            0 |       -0.883334 |
| easy_objective_001768ba | TERMINATED |                     | relu         | -96.2062 |      10 | 17.9835 | -8.05691  |      4 |      0.316018    |            3 |        8.05691  |
| easy_objective_001768be | TERMINATED |                     | relu         | -84.7459 |      10 | 19.4491 |  1.52541  |      1 |      0.000139236 |            0 |       -1.52541  |
| easy_objective_001768c1 | TERMINATED |                     | relu         | -80.1204 |      10 | 18.8326 |  1.98796  |      1 |      0.000136137 |            0 |       -1.98796  |
| easy_objective_001768c7 | TERMINATED |                     | relu         | -87.1098 |      10 | 18.3095 |  1.28902  |      1 |      0.000151157 |            0 |       -1.28902  |
| easy_objective_001768ca | TERMINATED |                     | tanh         | -96.2964 |      10 | 14.0009 | -7.70666  |      4 |      0.470859    |            3 |        7.70666  |
| easy_objective_001768ce | TERMINATED |                     | relu         | -77.2134 |      10 | 19.5709 |  2.27866  |      1 |      0.000116348 |            0 |       -2.27866  |
| easy_objective_001768d2 | TERMINATED |                     | relu         | -90.0362 |      10 | 19.981  |  0.996375 |      1 |      0.000134468 |            0 |       -0.996375 |
| easy_objective_001768d6 | TERMINATED |                     | relu         | -66.5341 |      10 | 17.7863 |  3.34659  |      1 |      0.000136137 |            0 |       -3.34659  |
| easy_objective_001768d9 | TERMINATED |                     | relu         | -27.5854 |      10 | 18.6839 |  7.24146  |      1 |      0.000114441 |            0 |       -7.24146  |
| easy_objective_001768dd | TERMINATED |                     | relu         | -99.9664 |      10 | 19.9842 | -9.46993  |     10 |      0.928737    |            9 |        9.46993  |
| easy_objective_001768e0 | TERMINATED |                     | relu         | -93.1312 |      10 | 19.9989 |  0.686876 |      1 |      0.000126839 |            0 |       -0.686876 |
| easy_objective_001768e3 | TERMINATED |                     | relu         | -82.6744 |      10 | 19.1888 |  1.73256  |      1 |      0.000142097 |            0 |       -1.73256  |
| easy_objective_001768e8 | TERMINATED |                     | relu         | -99.7873 |      10 | 19.9984 | -9.45238  |     10 |      0.929037    |            9 |        9.45238  |
+-------------------------+------------+---------------------+--------------+----------+---------+---------+-----------+--------+------------------+--------------+-----------------+
... 9099 more trials not shown (9099 TERMINATED)

The port listed under loc keeps increasing. Is this normal, or is it related to the lack of cleanup? I wonder if this may cause a slowdown in the processing. The first 1000 samples run in about one minute; however, reaching 9000 took more than two hours.

A sample of the same output for the first 1000 samples is:

Number of trials: 976 (3 RUNNING, 973 TERMINATED)
+-------------------------+------------+-------+--------------+------------+---------+----------+-----------+--------+------------------+--------------+-----------------+
| Trial name              | status     | loc   | activation   |     height |   steps |    width |      loss |   iter |   total time (s) |   iterations |   neg_mean_loss |
|-------------------------+------------+-------+--------------+------------+---------+----------+-----------+--------+------------------+--------------+-----------------|
| easy_objective_fe442ed1 | RUNNING    |       | tanh         | -81.3322   |      10 | 18.3525  |           |        |                  |              |                 |
| easy_objective_fe442ed4 | RUNNING    |       | tanh         | -27.9333   |      10 | 19.9991  |           |        |                  |              |                 |
| easy_objective_fe442ed8 | RUNNING    |       | tanh         | -55.5845   |      10 |  1.15827 |           |        |                  |              |                 |
| easy_objective_b94e4362 | TERMINATED |       | tanh         |   2        |      10 |  4       |  2.37391  |     10 |      0.928054    |            9 |       -2.37391  |
| easy_objective_b94e4363 | TERMINATED |       | relu         |   2        |      10 |  1       |  5.46316  |     10 |      0.923865    |            9 |       -5.46316  |
| easy_objective_b94e4364 | TERMINATED |       | relu         | -98.1968   |      10 |  7.05509 | -8.45906  |     10 |      0.925836    |            9 |        8.45906  |
| easy_objective_b94e4365 | TERMINATED |       | relu         |  83.7363   |      10 | 12.3102  | 18.3736   |      1 |      0.000134706 |            0 |      -18.3736   |
| easy_objective_b94e436e | TERMINATED |       | relu         |  45.2408   |      10 |  4.04708 | 14.5241   |      1 |      0.000127077 |            0 |      -14.5241   |
| easy_objective_b94e4376 | TERMINATED |       | relu         | -69.3192   |      10 | 15.6438  | -6.26876  |     10 |      0.931887    |            9 |        6.26876  |
| easy_objective_b94e4396 | TERMINATED |       | tanh         |  66.1188   |      10 |  3.79745 | 16.6119   |      1 |      0.000133514 |            0 |      -16.6119   |
| easy_objective_b94e4399 | TERMINATED |       | relu         |  72.2568   |      10 |  1.50478 | 17.2257   |      1 |      0.000148296 |            0 |      -17.2257   |
| easy_objective_b94e439c | TERMINATED |       | tanh         | -77.7765   |      10 | 19.7163  | -7.24417  |     10 |      0.943723    |            9 |        7.24417  |
| easy_objective_b94e43a0 | TERMINATED |       | tanh         | -24.4784   |      10 |  5.22422 |  7.55216  |      1 |      0.000123262 |            0 |       -7.55216  |
| easy_objective_b94e43a4 | TERMINATED |       | relu         | -15.9875   |      10 | 16.4877  |  8.40125  |      1 |      0.000123501 |            0 |       -8.40125  |
| easy_objective_b94e43aa | TERMINATED |       | relu         | -81.956    |      10 |  9.55864 | -5.61001  |      4 |      0.309341    |            3 |        5.61001  |
| easy_objective_ba422cac | TERMINATED |       | relu         |  -0.577995 |      10 |  9.12862 |  9.9422   |      1 |      0.00012517  |            0 |       -9.9422   |
| easy_objective_ba422caf | TERMINATED |       | tanh         |  33.4712   |      10 | 17.3596  | 13.3471   |      1 |      9.82285e-05 |            0 |      -13.3471   |
| easy_objective_ba422cb5 | TERMINATED |       | relu         |  81.3015   |      10 | 14.5552  | 18.1301   |      1 |      0.000136852 |            0 |      -18.1301   |
| easy_objective_ba422cb8 | TERMINATED |       | relu         | -78.4471   |      10 |  1.45203 | -0.879026 |      4 |      0.310111    |            3 |        0.879026 |
| easy_objective_ba422cbd | TERMINATED |       | tanh         |   3.93785  |      10 | 14.8098  | 10.3938   |      1 |      0.000105858 |            0 |      -10.3938   |
+-------------------------+------------+-------+--------------+------------+---------+----------+-----------+--------+------------------+--------------+-----------------+
... 956 more trials not shown (956 TERMINATED)

In this latest run the gcs_server.err is empty, as expected. The gcs_server.out still shows the following:

I1016 20:48:00.966279 87641 87641 gcs_actor_manager.cc:859] Actor created successfully, actor id = cbefcd0901000000
I1016 20:48:00.966625 87641 87641 gcs_actor_manager.cc:131] Finished creating actor, job id = 01000000, actor id = cbefcd0901000000
W1016 20:48:00.995527 87641 87641 gcs_actor_manager.cc:544] Actor cbefcd0901000000 is out of scope,, destroying actor child.
W1016 20:48:00.995527 87641 87641 gcs_actor_manager.cc:544] Actor cbefcd0901000000 is out of scope,, destroying actor child.
I1016 20:48:00.995549 87641 87641 gcs_actor_manager.cc:558] Destroying actor, actor id = cbefcd0901000000
I1016 20:48:00.995558 87641 87641 gcs_actor_manager.cc:1065] Erasing actor cbefcd0901000000 owned by 01000000ffffffffffffffffffffffffffffffff
I1016 20:48:00.995904 87641 87641 gcs_worker_manager.cc:31] Reporting worker failure, worker id = dd7583c76aa690d413a0a695cd96fa091cc72585, node id = 4467d12e34ea89e26e03126ccbedd527b9d9f056, address = 192.168.1.98
I1016 20:48:00.996824 87641 87641 gcs_actor_manager.cc:680] Worker dd7583c76aa690d413a0a695cd96fa091cc72585 on node 4467d12e34ea89e26e03126ccbedd527b9d9f056 intentional exit.

The line

I1016 20:48:00.995904 87641 87641 gcs_worker_manager.cc:31] Reporting worker failure, worker id = dd7583c76aa690d413a0a695cd96fa091cc72585, node id = 4467d12e34ea89e26e03126ccbedd527b9d9f056, address = 192.168.1.98

talks about a worker failure and the next line about intentional exit. Is this expected?

That intentional exit message is misleading; I will fix it. It should be intentional exit: True / False. I think what's happening is that one of your nodes dies in the middle. Did you see any logs from gcs_server indicating that your node is dead?

Also cc @richardliaw: have you seen this error? Any possibility the issue is from Tune?

This actually seems really easy to reproduce. I haven't seen this issue from Tune before, but going to 10k samples can probably cause some problems.

@richardliaw Is it possible for the Tune team to take this over and confirm the issue is not related to Tune itself first?

I'm also suffering from this "Couldn't reconnect to GCS server. The last attempted GCS server address was ..." error. In my case, I'm using hyperopt as well. In some cases the errors start at just over 500 trials out of a planned 2,500 trials. The program gets to the first 500 trials in a few minutes and then keeps running forever without completing any other trials before I have to kill it. Also I am not using a cluster, everything is being run on a single machine.

@rkooo567 since this seems to be confirmed to be a GCS connection leak, is Kai still the right person to work on it?

@ag-tcm Can you also show us the log of /tmp/ray/session_latest/logs/gcs_server.err?

@ericl So, my understanding/hypothesis here is that the initial error was due to a connection leak, but the following errors (https://github.com/ray-project/ray/issues/11309#issuecomment-710726509) are just due to Tune? (Or it could be that the GCS server is not stable when there are so many actors, but I'd like to make sure it is not a Tune issue before I dig deeper into it.)

I'm not sure any errors after the initial one make sense to look into; once we run out of file descriptors, anything could happen.

@rkooo567 I re-ran another job since the error and that gcs_server.err file has been overwritten. I'll post that log when it happens again though.

I've managed to reproduce the error with this simplified script: https://gist.github.com/ericl/73dfd011a5c592e65bae66e6fab52052

with ulimit -n 2000

0.8.5 -> works
0.8.6 -> works
0.8.7 -> fails
1.0 -> fails

So it's pretty clearly tied to the GCS actor manager being on by default. Looking into how to fix it next.

After more investigation, I don't believe this is really a leak of gRPC clients (I added constructor/destructor logging hooks and the number of gRPC clients remains bounded, and we seem to be cleaning up properly in all cases).

I believe that what's going on is that there are a large number of connections stuck in the TIME_WAIT state while closing: https://github.com/grpc/grpc/issues/3855 This means that after an actor is destroyed, the sockets used for communicating with it may remain active for up to 120s: https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux There are some kernel options to recycle sockets, but I didn't have much success trying them.
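For anyone who wants to check whether TIME_WAIT sockets are piling up on their machine while the Tune job runs, here is a quick psutil sketch (my own diagnostic, not part of Ray; listing connections for all processes may require root).

import collections

import psutil

# Count TCP connections on this machine by state; a large TIME_WAIT bucket
# would be consistent with the explanation above.
states = collections.Counter(c.status for c in psutil.net_connections(kind="tcp"))
print(states)
print(states.get(psutil.CONN_TIME_WAIT, 0), "sockets in TIME_WAIT")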

Since 1.0 is managing actors from the GCS server, this at least doubles the number of connections per actor: you need one from the driver->actor, and another from the GCS->actor. We also seem to be opening more than one connection in the GCS, so the overall number of sockets is now 2-3x what it was before, bumping into FD limits.

I'm not sure the best solution for this, but probably increasing the ulimit is the way to go in the short term.

@demattia if you are on the latest master release, you can also use tune.run(reuse_actors=True) to avoid this issue.
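In the repro script above, that would just mean adding the flag to the existing tune.run call, roughly like this (untested here):

tune.run(
    easy_objective,
    search_alg=algo,
    scheduler=scheduler,
    metric="mean_loss",
    mode="min",
    reuse_actors=True,  # reuse trial actors instead of starting a new one per sample
    **tune_kwargs)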

Btw, I've also observed Tune getting pretty slow with a large number of finished trials (>2000). I guess this is a separate bug; please open another issue to track it.

Yep, the Tune slowdown is tracked in https://github.com/ray-project/ray/issues/11233

Thanks @ericl and everyone for your help. The ulimit increase indeed seems to solve the issue for now. @richardliaw I will try reuse_actors=True at the next chance, thanks for the suggestion. I will follow the slowdown on #11233, as that is the next bottleneck.
