Ray: Crash with message node --- has been marked dead because the monitor has missed too many heartbeats from it

Created on 24 Dec 2018 · 36 comments · Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Ray installed from (source or binary): from binary
  • Ray version: 0.6.0
  • Python version: 2.7.12
  • Exact command to reproduce: python mnist.py

Describe the problem


I have a 3-node cluster and 3 remote workers: two of them require a GPU and the third only a CPU. Two of the nodes have been started with 1 GPU and 1 CPU, and the node running Redis has only CPUs (4 of them). With the attached code, the setup works for a few epochs and then crashes with a message that the monitor missed too many heartbeats:
The node with client ID 581656f84659be9bba58da993729644cfb554836 has been marked dead because the monitor has missed too many heartbeats from it.

Traceback (most recent call last):
File "mnist_main.py", line 154, in
train()
File "mnist_main.py", line 138, in train
for actor in train_actors])
File "/home/anurag/.local/lib/python2.7/site-packages/ray/worker.py", line 2358, in get
raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(0100000008623a3fd1946a9132866d6f6113876e). It was created by remote function which failed with:

Remote function failed with:

Invalid return value: likely worker died or was killed while executing the task; check previous logs or dmesg for errors.

Source code / logs

from __future__ import print_function
import argparse
import numpy as np
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import ray

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

@ray.remote(num_gpus=1)
class MNISTTrainActor(object):
    """Simple actor for MNIST trainer."""
    def __init__(self, id):
        print("Initialize Actor environment gpu id: ", os.environ["CUDA_VISIBLE_DEVICES"])
        self.device = torch.device("cuda")
        self.model = Net().to(self.device)

        kwargs = {'num_workers': 1, 'pin_memory': True}
        self.train_loader = torch.utils.data.DataLoader(
            datasets.MNIST('/data/mnist', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
            batch_size=64, shuffle=True, **kwargs)

        momentum = 0.5
        lr = 0.01
        self.optimizer = optim.SGD(self.model.parameters(), lr=lr, momentum=momentum)

        self.id = id 
        print("ID: ", self.id) 

    def run_train(self, weights):
        self.model.load_state_dict(weights)
        self.model.cuda()
        print("starting run_train for actor.id = ", self.id)
        for batch_idx, (data, target) in enumerate(self.train_loader):
            #send even batches to id == 1 and odd to id == 0
            if ((self.id % 2 == 0 and batch_idx % 2 == 0) or
                (self.id % 2 == 1 and batch_idx % 2 == 1) ) : continue
            data, target = data.to(self.device), target.to(self.device)
            self.optimizer.zero_grad()
            output = self.model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            self.optimizer.step()
            #if batch_idx % 200 == 0 or batch_idx % 201 == 0:
            if False: #batch_idx % 200 == 0 or batch_idx % 201 == 0:
                print('Actor ID: {} batch_idx: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    self.id, batch_idx, batch_idx * len(data), len(self.train_loader.dataset),
                    100. * batch_idx / len(self.train_loader), loss.item()))
        weights = self.model.cpu().state_dict()
        return weights

    def get_weights(self):
        weights = self.model.cpu().state_dict()
        return weights

@ray.remote
class MNISTTestActor(object):
    def __init__(self):
        self.device = torch.device("cpu")
        self.model = Net().to(self.device)
        kwargs = {}
        self.test_loader = torch.utils.data.DataLoader(
            datasets.MNIST('/data/mnist', train=False, transform=transforms.Compose([
                               transforms.ToTensor(),
                               transforms.Normalize((0.1307,), (0.3081,))
                           ])),
            batch_size=64, shuffle=True, **kwargs)

    def accuracy(self, weights, step):
        self.model.load_state_dict(weights)
        test_loss = 0
        correct = 0
        with torch.no_grad():
            for data, target in self.test_loader:
                data, target = data.to(self.device), target.to(self.device)
                output = self.model(data)
                test_loss += F.nll_loss(output, target).item() # sum up batch loss
                pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
                correct += pred.eq(target.view_as(pred)).sum().item()

        test_loss /= len(self.test_loader.dataset)

        print('\nTest set: Step: {}, Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            step, test_loss, correct, len(self.test_loader.dataset),
            100. * correct / len(self.test_loader.dataset)))

def train():
    ray.init(redis_address="IP_address_of_head_removed:6379")
    num_actors = 2
    train_actors = [MNISTTrainActor.remote(i)
                   for i in range(num_actors)]
    test_actor = MNISTTestActor.remote()
    weight_id = train_actors[0].get_weights.remote() 
    step = 0
    acc_id = test_actor.accuracy.remote(weight_id, step)
    print("Starting training loop. Use Ctrl-C to exit.")
    try:
        while True:
            all_weights = ray.get([actor.run_train.remote(weight_id)
                                   for actor in train_actors])
            mean_weights = {k: (sum(weights[k] for weights in all_weights) /
                                num_actors)
                            for k in all_weights[0]}
            weight_id = ray.put(mean_weights)
            step += 10
            if step % 10 == 0:
                acc = ray.get(acc_id)
                acc_id = test_actor.accuracy.remote(weight_id, step)
    except KeyboardInterrupt:
        pass

if __name__ == "__main__":
    train()

Most helpful comment

@robertnishihara My guess is that the monitor (running on the node with redis?) is not getting heartbeat(s) (from one or both nodes) before timeout. Is there a way to make heartbeat timeout longer using Python APIs / configuration file? I can try that to find out if that helps.

All 36 comments

Not sure if this will fix the issue, but can you try ray==0.6.1 which was released yesterday?

With ray 0.6.1 I get another error -- trace copied below:

Traceback (most recent call last):
File "/home/anurag/.local/lib/python2.7/site-packages/ray/workers/default_worker.py", line 107, in
ray.worker.global_worker.main_loop()
File "/home/anurag/.local/lib/python2.7/site-packages/ray/worker.py", line 962, in main_loop
self._wait_for_and_process_task(task)
File "/home/anurag/.local/lib/python2.7/site-packages/ray/worker.py", line 919, in _wait_for_and_process_task
self._process_task(task, execution_info)
File "/home/anurag/.local/lib/python2.7/site-packages/ray/worker.py", line 844, in _process_task
ray.utils.format_error_message(traceback.format_exc()))
File "/home/anurag/.local/lib/python2.7/site-packages/ray/worker.py", line 854, in _handle_process_task_failure
self._store_outputs_in_object_store(return_object_ids, failure_objects)
File "/home/anurag/.local/lib/python2.7/site-packages/ray/worker.py", line 755, in _store_outputs_in_object_store
self.put_object(object_ids[i], outputs[i])
File "/home/anurag/.local/lib/python2.7/site-packages/ray/worker.py", line 356, in put_object
self.store_and_register(object_id, value)
File "/home/anurag/.local/lib/python2.7/site-packages/ray/worker.py", line 291, in store_and_register
self.task_driver_id))
File "/home/anurag/.local/lib/python2.7/site-packages/ray/utils.py", line 418, in _wrapper
return orig_attr(*args, **kwargs)
File "pyarrow/_plasma.pyx", line 491, in pyarrow._plasma.PlasmaClient.put
buffer = self.create(target_id, serialized.total_bytes)
File "pyarrow/_plasma.pyx", line 322, in pyarrow._plasma.PlasmaClient.create
check_status(self.client.get().Create(object_id.data, data_size,
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
raise ArrowIOError(message)
ArrowIOError: Broken pipe

This error is unexpected and should not have happened. Somehow a worker
crashed in an unanticipated way causing the main_loop to throw an exception,
which is being caught in "python/ray/workers/default_worker.py".

The node with client ID c6eab66b7b76515181472c5c4cf317265dfe733a has been marked dead because the monitor has missed too many heartbeats from it.
Traceback (most recent call last):
File "mnist_main.py", line 154, in
train()
File "mnist_main.py", line 138, in train
for actor in train_actors])
File "/home/anurag/.local/lib/python2.7/site-packages/ray/worker.py", line 2377, in get
raise value
ray.worker.RayTaskError: Invalid return value: likely worker died or was killed while executing the task; check previous logs or dmesg for errors.

@robertnishihara My guess is that the monitor (running on the node with redis?) is not getting heartbeat(s) (from one or both nodes) before timeout. Is there a way to make heartbeat timeout longer using Python APIs / configuration file? I can try that to find out if that helps.

It would be great if we could set the value of the heartbeat timeout manually.

The timeout is 30 seconds, which is pretty conservative as is. I'm guessing that this is a red herring and the underlying cause is actually something else.

Does this happen with any other workload you can provide?

Yes, 30 seconds is a reasonable timeout. The workload in my example is very simple -- it is a 3 node cluster with 3 actors, one of which averages MNIST parameters. Do you notice anything missing from the code that I provided in this issue? I am able to run my Impala implementation with 5 workers on a single machine without any problem. So I am guessing that this problem manifests only on a multi-node cluster.

The workload looks fine, but it's hard to tell since there are a lot of things going on. My question is whether you can reproduce this with a minimal example (preferably not using pytorch), e.g. does multi-node IMPALA work?

By the way, you can emulate multi-node on a single node with rllib train --ray-num-local-schedulers=N

I did not run multi-node IMPALA, as the DeepMind Lab simulator is configured on only one node in my setup and I am not sure how to emulate multi-node using RLlib. However, compared to the MNIST example that I copied here, IMPALA is not "minimal": it has many more operations and components, including a PS, a batched CNN, a sequential LSTM, importance sampling, and value iteration. In contrast, in this example I am training a 2-layer CNN on two separate nodes using GPUs, putting weights into the object store, and retrieving them. I would say this is a "hello world" example. One key difference is that in this example, every few seconds, thousands of parameters are pushed to and pulled from the object store by each worker. In IMPALA, it takes time to generate the trajectories, and there is only one-way traffic of the model parameters from the PS to the workers.

It crashes for a single machine configuration as well with the following trace:

Traceback (most recent call last):
File "mnist_main.py", line 150, in
train()
File "mnist_main.py", line 134, in train
for actor in train_actors])
File "/home/anurag/.local/lib/python2.7/site-packages/ray/worker.py", line 2373, in get
values = worker.get_object(object_ids)
File "/home/anurag/.local/lib/python2.7/site-packages/ray/worker.py", line 497, in get_object
int(0.01 * len(unready_ids))
File "/home/anurag/.local/lib/python2.7/site-packages/ray/worker.py", line 395, in retrieve_and_deserialize
self.get_serialization_context(self.task_driver_id))
File "/home/anurag/.local/lib/python2.7/site-packages/ray/utils.py", line 418, in _wrapper
return orig_attr(*args, **kwargs)
File "pyarrow/_plasma.pyx", line 523, in pyarrow._plasma.PlasmaClient.get
File "pyarrow/_plasma.pyx", line 383, in pyarrow._plasma.PlasmaClient.get_buffers
File "pyarrow/_plasma.pyx", line 280, in pyarrow._plasma.PlasmaClient._get_object_buffers
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Encountered unexpected EOF

I wonder if there's an out-of-memory issue on one of the machines. How much memory do you have on each machine? You might try setting --object-store-memory and --redis-max-memory to limit the total amount of memory used by the object store and by Redis. By default, we cap the object store memory at 20GB and Redis at 10GB in the current master (Redis is uncapped in 0.6.1). Also note that Redis is only used on the head machine.

RAM on each machine is 128G. The out-of-memory theory is plausible: I see the aforementioned script always running for a fixed number of epochs (5) before crashing, and it does not crash (on a single machine) when I comment out the line "weight_id = ray.put(mean_weights)". However, I also monitored the total memory consumption (on the head machine) while the program was executing, and I noticed that there was ample RAM available when it crashed.
What values of object-store-memory and redis-max-memory should I try? I am assuming these are set when Ray is started?

The same PlasmaClient crash was also reported in this issue.

The issue you mentioned has not happened recently. Is there any crash log from the Plasma store?

@guoyuhong Here is the plasma store log from one of the workers.

I0108 00:33:26.584295 17788 store.cc:994] Allowing the Plasma store to use up to 20GB of memory.
I0108 00:33:26.584645 17788 store.cc:1024] Starting object store with directory /dev/shm and huge page support disabled
F0108 00:35:09.252362 17788 store.cc:465] Check failed: RemoveFromClientObjectIds(object_id, entry, client) == 1
*** Check failure stack trace: ***
@ 0x43be2d google::LogMessage::Fail()
@ 0x43f8dc google::LogMessage::SendToLog()
@ 0x43b953 google::LogMessage::Flush()
@ 0x43bb59 google::LogMessage::~LogMessage()
@ 0x439a18 arrow::util::ArrowLog::~ArrowLog()
@ 0x418c6c plasma::PlasmaStore::ReleaseObject()
@ 0x41a757 plasma::PlasmaStore::ProcessMessage()
@ 0x41b59f _ZNSt17_Function_handlerIFviEZN6plasma11PlasmaStore13ConnectClientEiEUliE_E9_M_invokeERKSt9_Any_datai
@ 0x439357 aeProcessEvents
@ 0x43967b aeMain
@ 0x41f207 plasma::PlasmaStoreRunner::Start()
@ 0x4166d1 plasma::StartServer()
@ 0x413345 main
@ 0x7fd294104830 __libc_start_main
@ 0x414701 (unknown)

@bigdata2 Thanks for providing this crash log. I think the root cause is that the Plasma store crashes first, and then the Plasma store clients in both the workers and the raylet fail. The raylet will not survive when the Plasma store crashes, so the raylet crashes and the monitor marks it dead after 30 seconds.
We must first reproduce the crash in the Plasma store and fix it.

Thanks for the information @guoyuhong. Looking at store.cc here, it appears that if an object is not found in the store, the function returns 0. So the check on this line will fail if the object was not deleted/removed (because it was not found in the store). Two questions, then: first, when would an object not be found in the store, and second, if it is not found, should the check still fail?

From the code logic in plasma, it looks like the object is released twice: the first time, the object is removed from client->object_ids and the call returns 1; the second time, the object is no longer in the map, so it returns 0. The object is still in the store, otherwise the check here would trigger. From the logic in the plasma client, this shouldn't happen, though.

It could be related to this issue (plasma thread-safety around object removal): https://github.com/ray-project/ray/pull/3484
I don't think we ever got to the bottom of it.

I have a similar issue after training for a few epochs

The node with client ID 8241b3b7dbde95caff47ae7b98ec80150fd3e135 has been marked dead because the monitor has missed too many heartbeats from it.

I tried to figure out more about this issue by looking into the logs, only to realize that the log directory has been empty since the upgrade from Ray 0.6.3 to 0.6.4; see https://github.com/ray-project/ray/issues/4400

Can you recommend any other strategies for getting to the bottom of this?

I've recently updated from 0.6.2 to 0.6.6 and I am now seeing this error after a few hours of just running DQN with one of our slower environments. I'm able to reproduce it quite reliably.

Although I'll receive an error along the lines of:

ERROR worker.py:1672 -- The node with client ID f064cdc6134773576a45f71090e1e6253f5caa63 has been marked dead because the monitor has missed too many heartbeats from it.

If I check the /tmp/ray/session_*/logs directory, there will be no logs for a worker with that id and monitor.err contains just a single line:

Monitor: could not find ip for client f064cdc6134773576a45f71090e1e6253f5caa63

It's as if it's trying to monitor a worker that was never started.

In the raylet.err log, there is this error:

F0627 15:20:46.871592  4274 node_manager.cc:395]  Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.

Is there any further logging I can enable that might help root-cause this? Or are there any additional locations in the code where added logging would help?

I can now reliably reproduce this with CartPole-v0 and DQN on my desktop. I've wrapped CartPole-v0 in an environment that sleeps for 10 seconds when resetting and 1 second when stepping. This is to simulate the behaviour of a naturally slow environment.

I've attached a test script that can reproduce the issue. When running slowenv.py, it will fail within a couple of hours due to the node being marked dead.

slowenv.zip

Oh bizarre, I'll take a look into this. Could you also attach a full dump of the session logs from the crash?

I tried this out and it ran fine overnight:
DQN_slow-env_0: RUNNING, [1.0 CPUs, 0 GPUs], [pid=16727], 61584 s, 56 iter, 56000 ts, 156 rew

This is with the latest wheels. I think it's likely a backend issue that has already been fixed, since there have been many stability fixes since 0.6.

Yeah, that sounds fair. I've tried repeating the same experiment with 0.7.1 and everything seemed fine. Thanks for taking a look though.

Running with 0.7.2 and this crash is still happening. Sometimes after as many as 5 million steps (around 6 hours) and sometimes after only 2 hours. I'm using PPO.

What is a command that can be run to reproduce this issue? Also, does it still happen on the latest wheel?

Eric


Sorry for the stupid question, but how do I get the latest wheel? Is it different from pip install ray?
Regarding the command, I'm calling it from a Python script with a PPO trainer and a custom environment (name removed for proprietary reasons), with the number of workers set to the number of CPUs reported by multiprocessing.cpu_count().

import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print
from ray.tune.registry import register_env
import multiprocessing

def main():
    n_cpus = multiprocessing.cpu_count()
    ENV_NAME = 'blahblahblah'
    ray.init()
    config = ppo.DEFAULT_CONFIG.copy()
    config["num_gpus"] =  1
    config["num_workers"] = n_cpus

    register_env(ENV_NAME, lambda c: BlahEnv(visualize=False))

    trainer = ppo.PPOTrainer(config=config, env=ENV_NAME)
    CHECKPOINT_NUMBER = 1634
    CHECKPOINT = "./checkpoints/checkpoint_{}/checkpoint-{}".format(CHECKPOINT_NUMBER, CHECKPOINT_NUMBER)

    trainer.restore(CHECKPOINT)
    for i in range(1000):
        # Perform one iteration of training the policy with PPO
        result = trainer.train()
        checkpoint = trainer.save(checkpoint_dir="./checkpoints")
        print("[{}] Checkpoint saved in {}.".format(i, checkpoint))


if __name__ == '__main__':
    main()

I'm trying it with 0.7.1 now.

Here's the link to the latest wheels table:
https://ray.readthedocs.io/en/latest/installation.html#trying-snapshots-from-master

By the way, for the repro, how many workers did you use?


I used like 23 workers (my laptop has 24 cores).

By the way, now that I reduced my workers to half of that (12), it seemed to be running fine for more than 10 hours. NEVERMIND! It just crashed again... after about 12 hours a worker didn't respond...

Ok, I'm trying to reproduce this with 64 workers.

By the way, you can find the raylet logs in /tmp/ray/session... Most likely, raylet.err or raylet.out contains the actual error that occurred.

Eric


I've gotten past 12 million steps on 64 workers on latest. If it's failing with a larger number of workers, maybe it's hitting some resource limit like memory on the machine? I tested on a machine with 64 cores and 500GB of RAM, so memory wasn't an issue.

Eric


By the way, you can find the raylet logs in /tmp/ray/session... Most likely, raylet.err or raylet.out contains the actual error that occurred.

I will look into this.

I am running 0.7.0... This is the only error I see after my 3-node cluster's master node crashed:

raylet.err 
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0806 08:58:54.226088  2811 stats.h:50] Failed to create the Prometheus exporter. This doesn't affect anything except stats. Caused by: null context when constructing CivetServer. Possible problem binding to port. 

I'll run it again now.

I'm thinking maybe it has something to do with running with the maximum number of cores on the master node while also running TensorBoard? Perhaps there isn't enough CPU left for training itself, so the timeout occurs.

I'm going to try reducing the number of CPUs available on the master node, e.g. --num-cpus 40 instead of the maximum of 46.

Closing because stale; please open a new issue if you run into this again.
