Running ~8000 tasks on around 110 workers, we occasionally see raylet processes crash with the log below in raylet.err on the crashing node. The node in question has plenty of free memory at the time of the crash. Do you know what might be going wrong here?
W0501 14:26:21.914777 4048 task_dependency_manager.cc:258] Task lease to renew has already expired by -92145ms
W0501 14:26:21.914784 4048 task_dependency_manager.cc:258] Task lease to renew has already expired by -92044ms
W0501 14:26:21.914791 4048 task_dependency_manager.cc:258] Task lease to renew has already expired by -91944ms
W0501 14:26:21.914799 4048 node_manager.cc:244] Last heartbeat was sent 130991 ms ago
F0501 14:26:21.964082 4048 node_manager.cc:395] Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.
*** Check failure stack trace: ***
*** Aborted at 1556720781 (unix time) try "date -d @1556720781" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0x3e800000fd0) received by PID 4048 (TID 0x7f733dcd6740) from PID 4048; stack trace: ***
@ 0x7f733d8b0890 (unknown)
@ 0x7f733c9a4e97 gsignal
@ 0x7f733c9a6801 abort
@ 0x55d5dd8e9a59 google::logging_fail()
@ 0x55d5dd8ebafa google::LogMessage::Fail()
@ 0x55d5dd8ecf4f google::LogMessage::SendToLog()
@ 0x55d5dd8eb7ab google::LogMessage::Flush()
@ 0x55d5dd8eba01 google::LogMessage::~LogMessage()
@ 0x55d5dd8e8b44 ray::RayLog::~RayLog()
@ 0x55d5dd85f6ff ray::raylet::NodeManager::ClientRemoved()
@ 0x55d5dd8bddde ray::gcs::ClientTable::HandleNotification()
@ 0x55d5dd8c9720 _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_8ClientIDERKSt6vectorI16ClientTableDataTSaIS8_EEEZZNS1_11ClientTable7ConnectERKS8_ENKUlS3_RKNS0_8UniqueIDESG_E_clES3_SJ_SG_EUlS3_SJ_SC_E_E9_M_invokeERKSt9_Any_dataOS3_S6_SC_
@ 0x55d5dd8ce57d _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_8ClientIDE24GcsTableNotificationModeRKSt6vectorI16ClientTableDataTSaIS9_EEEZNS1_3LogIS4_15ClientTableDataE9SubscribeERKNS0_5JobIDES6_RKSt8functionIFvS3_S6_SD_EERKSL_IFvS3_EEEUlS3_S6_S7_SD_E_E9_M_invokeERKSt9_Any_dataOS3_S6_OS7_SD_
@ 0x55d5dd8cbbcb _ZZN3ray3gcs3LogINS_8ClientIDE15ClientTableDataE9SubscribeERKNS_5JobIDERKS2_RKSt8functionIFvPNS0_14AsyncGcsClientES9_24GcsTableNotificationModeRKSt6vectorI16ClientTableDataTSaISF_EEEERKSA_IFvSC_EEENKUlRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEE_clESZ_
@ 0x55d5dd8d0a50 (anonymous namespace)::ProcessCallback()
@ 0x55d5dd8d206d ray::gcs::SubscribeRedisCallback()
@ 0x55d5dd8d59fc redisProcessCallbacks
@ 0x55d5dd8d482c RedisAsioClient::handle_read()
@ 0x55d5dd8d2fda boost::asio::detail::reactive_null_buffers_op<>::do_complete()
@ 0x55d5dd81c7e9 boost::asio::detail::epoll_reactor::descriptor_state::do_complete()
@ 0x55d5dd81c4c9 boost::asio::detail::scheduler::run()
@ 0x55d5dd80c2dc main
@ 0x7f733c987b97 __libc_start_main
@ 0x55d5dd8141da _start
It looks like the monitor process is not receiving heartbeats from one of the raylets for long enough (10 seconds, I think) and so it thinks that the raylet has died. This could happen if some event handler in the raylet takes a really long time (e.g., longer than 10 seconds), so heartbeats don't get sent for a while.
In fact, this seems to be happening because we see the log statement
W0501 14:26:21.914799 4048 node_manager.cc:244] Last heartbeat was sent 130991 ms ago
The line
W0501 14:26:21.914791 4048 task_dependency_manager.cc:258] Task lease to renew has already expired by -91944ms
suggests that the raylet in question is a bit overloaded. I've seen this happen in settings where there are tons of dependencies between tasks (and it may be fixed when we fix https://github.com/ray-project/ray/issues/4359, which should be soon).
Can you share more about the specific workload?
@stephanie-wang do you have other thoughts about what could trigger the task lease issue?
@robertnishihara We've seen it in a number of different workloads, but the most common one we've run (and seen it on) is with Ray Tune, where we have say ~3-5 concurrent hyperparameter samples being run, and each sample itself spins off 3000 ray remote tasks (embarrassingly parallel, there are no dependencies between the 3k tasks) and then waits for all 3000 to finish. Whilst this means we have many thousands of simultaneous ray remote functions running at one time, only the 3-5 Ray Tune trainable function calls have any dependencies on any other tasks (the 3k sub-tasks).
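To make the shape of the workload concrete, here is a rough standalone sketch of the fan-out pattern described above (do_work and run_sample are hypothetical placeholder names, and Ray Tune is omitted for simplicity):

import ray

ray.init()  # or pass the cluster address to join an existing cluster

@ray.remote
def do_work(sample_id, i):
    # stand-in for a CPU-bound task that runs for several minutes
    return i

@ray.remote
def run_sample(sample_id):
    # each sample fans out ~3000 embarrassingly parallel sub-tasks...
    futures = [do_work.remote(sample_id, i) for i in range(3000)]
    # ...and then blocks until all of them have finished (the only task dependencies)
    return len(ray.get(futures))

# 3-5 of these run concurrently (as Ray Tune trainables in the real workload)
samples = [run_sample.remote(s) for s in range(5)]
print(ray.get(samples))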
Thanks @markgoodhead, we recently merged some profiling tools into the backend (https://github.com/ray-project/ray/pull/4602), so this will be a good test case to make sure we can figure out what is going wrong. cc @jovany-wang
@robertnishihara Is there a way we can view / activate those profiling tools? If so, we'd be happy to enable the profiler during our workload and share some stats with you if that helps.
@markgoodhead Sorry for the delayed reply.
I'm currently working on integrating the profiling views into the dashboard (webui). After that, the metrics will be viewable through the webui.
If you want to view them right now, you need to start the Prometheus server following the description in https://github.com/ray-project/ray/pull/4246.
For what it's worth, we have fairly long-running tasks - each ray.remote can take several minutes to complete. Is this sort of usage likely to cause heartbeat timeouts? Are there any possible workarounds that we can try in the meantime?
You can try changing a configuration parameter that should hopefully lower the overhead per task by passing in the following argument to ray start on all nodes, including the head:
ray start --internal-config='{"initial_reconstruction_timeout_milliseconds": 100000}'
The default value is 10,000; I've increased it to 100,000 here.
Also, can you check what the CPU/memory utilization is like for Redis (e.g., with top or htop) the next time you run this workload? If it is high (>50% CPU utilization), I would try increasing the number of Redis shards. You can do that by passing in a flag to the ray start command on the head node (the default is 1):
ray start --head --num-redis-shards 4
Thanks very much for your advice @stephanie-wang - we'll certainly give that timeout config setting a try tomorrow and let you know if it looks to help our issue.
As for the Redis sharding: generally the workloads we're running on the worker boxes are very much fully saturating them at close to all times (we run as many CPU-bound ray workers as cores, so we're definitely relying on the OS scheduler to swap in whatever background processes ray/redis need). Is that ill-advised for us to do / should we try to leave some cores unoccupied so ray core / redis have dedicated CPU time available to run? Do you still think adding more shards would help in such a contended environment?
Another thing we think we spotted a possible correlation with is that we started seeing this a lot more when we started using the request_cores feature we developed in the autoscaler (see https://github.com/ray-project/ray/pull/4754) - by its design this can cause many worker boxes to be added to the cluster in quick succession, and so probably puts extra strain on whatever work ray has to do to get them integrated and up-and-running. Do you think this is a likely candidate for causing performance issues?
As for the Redis sharding: generally the workloads we're running on the worker boxes are very much fully saturating them at close to all times (we run as many CPU-bound ray workers as cores, so we're definitely relying on the OS scheduler to swap in whatever background processes ray/redis need). Is that ill-advised for us to do / should we try to leave some cores unoccupied so ray core / redis have dedicated CPU time available to run? Do you still think adding more shards would help in such a contended environment?
I see, thanks for the info! It's definitely worth a try to give the ray core its own CPU, especially if your workers are running at close to full CPU utilization all of the time. On the worker nodes, Ray has one process to run the scheduler, so running with num CPUs - 1 workers per node should be enough. You can tell Ray to do that by passing in --num-cpus <number of CPUs - 1> into the ray start command (you may have to modify your program if it assumes a fixed number of cores per node). I would actually try this before the configuration change I suggested.
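On the last point, a rough sketch of how the program could avoid assuming a fixed number of cores per node (assuming ray.cluster_resources() is available in your Ray version; the address below is a placeholder):

import ray

ray.init(redis_address="<head-node-ip>:6379")  # placeholder cluster address

# Size the fan-out from the cluster's advertised CPU resources rather than
# assuming num_nodes * cores_per_node, so reserving a core per node via
# --num-cpus <cores - 1> doesn't require touching the application logic.
total_cpus = int(ray.cluster_resources().get("CPU", 0))
num_tasks_in_flight = 4 * total_cpus  # e.g. keep a few waves of tasks queued per CPU
print(total_cpus, num_tasks_in_flight)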
The Redis shards live only on the head node. Adding more shards usually helps, but as you said, it depends on what other work is happening there. If you're finding that there are a lot of compute-heavy tasks being scheduled on the head node, you can try to schedule them onto the worker nodes only, to avoid contention with Redis. It's a bit ugly, but one way to do that is to specify a custom resource requirement like {"worker_node": 1} for all tasks, then give all worker nodes a large number of those resources by passing in --resources='{"worker_node": 100}' to ray start. From the logs you've posted so far, it seems that the problem is more likely that the Ray processes on the worker nodes are getting overloaded, so I would only try this if (1) Redis CPU utilization is very high, and (2) there are many other CPU-heavy Python processes on the head node.
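To make the custom-resource idea concrete, a minimal sketch of the application side might look like the following (the resource name "worker_node" is just the example name from above, and the address is a placeholder):

import ray

ray.init(redis_address="<head-node-ip>:6379")  # placeholder cluster address

# Tasks that request the custom resource can only be scheduled on nodes that
# were started with something like --resources='{"worker_node": 100}', which
# keeps them off the head node.
@ray.remote(resources={"worker_node": 1})
def cpu_heavy_task(x):
    return x * x

results = ray.get([cpu_heavy_task.remote(i) for i in range(1000)])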
Another thing we think we spotted a possible correlation with is that we started seeing this a lot more when we started using the request_cores feature we developed in the autoscaler (see #4754) - by its design this can cause many worker boxes to be added to the cluster in quick succession, and so probably puts extra strain on whatever work ray has to do to get them integrated and up-and-running. Do you think this is a likely candidate for causing performance issues?
Interesting, that definitely seems possible. I've found that at least on EC2, machines can be pretty slow on startup, so it's possible that machines that just started up are getting overwhelmed. Do you have an idea of how long the raylet process is alive before it crashes with the logs you posted? Also, it may be worth a try to warm up new machines by submitting a small number of tasks and getting the results before the actual work. If this turns out to help, we could consider doing the warmup automatically through the autoscaler.
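If you do want to experiment with that, the warmup doesn't need anything special from Ray; a minimal driver-side sketch (the task count is arbitrary, not an official Ray feature) could be:

import ray

ray.init()  # or connect to the existing cluster

@ray.remote
def noop():
    return None

def warm_up(num_tasks=1000):
    # Submit a burst of trivial tasks and wait for them, so freshly started
    # nodes register their workers and exercise the scheduler before the real
    # (long-running) tasks arrive.
    ray.get([noop.remote() for _ in range(num_tasks)])

warm_up()
# ...then launch the actual workload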
Thanks again for the detailed feedback @stephanie-wang - actually our head node is very uncontended CPU-wise, as we figured that it would need extra CPU in order to manage the overall cluster co-ordination efficiently. For this reason we have manually set our head node to have a CPU resource of '1' when it actually has 8-16 cores available (this has led to us discovering other interesting autoscaling bugs, which we've reported in https://github.com/ray-project/ray/pull/4653, but it seems like a sensible thing to do regardless). Given we haven't been using any Redis sharding, it may well have just been getting heavily contended on the single-threaded Redis shard anyway (I assume each shard is single-threaded?), so perhaps enabling Redis sharding as you suggest will have a big impact on how much throughput the overall cluster can manage.
In general would you expect these kind of heartbeat / contention issues in the raylet to more likely lie with head node throughput or worker node resource starvation? We'll do as much investigation as we can to deduce this ourselves (and when the metrics are in the webui we can hopefully measure it) but if you have a prior inkling as to which is more likely it'd be good to know.
I've found that at least on EC2, machines can be pretty slow on startup, so it's possible that machines that just started up are getting overwhelmed
We hadn't come across this before we made this feature, when autoscaling itself would frequently add 5-10 machines at a time into the cluster (each of which would be sent a full set of CPU-bound tasks immediately), which would imply that it's not the workers themselves that are struggling on startup but perhaps some shared dependency that gets overloaded when 20+ start simultaneously (the head node, presumably?). We'll try to get some more examples to share with you and feed back on what appears to make the difference.
Thanks again for the detailed feedback @stephanie-wang - actually our head node is very uncontended CPU-wise, as we figured that it would need extra CPU in order to manage the overall cluster co-ordination efficiently. For this reason we have manually set our head node to have a CPU resource of '1' when it actually has 8-16 cores available (this has led to us discovering other interesting autoscaling bugs, which we've reported in #4653, but it seems like a sensible thing to do regardless). Given we haven't been using any Redis sharding, it may well have just been getting heavily contended on the single-threaded Redis shard anyway (I assume each shard is single-threaded?), so perhaps enabling Redis sharding as you suggest will have a big impact on how much throughput the overall cluster can manage.
Ah yeah, setting the number of CPUs to 1 on the head node is a good idea (much better than my suggestion :) ). Yes, in that case, I would definitely try sharding Redis. Each shard is indeed single-threaded.
In general would you expect these kind of heartbeat / contention issues in the raylet to more likely lie with head node throughput or worker node resource starvation? We'll do as much investigation as we can to deduce this ourselves (and when the metrics are in the webui we can hopefully measure it) but if you have a prior inkling as to which is more likely it'd be good to know.
Hmm, so the reason that I suspected that the issue lies with the worker node is because of the log messages that you posted. The crash itself can occur if either the worker node or the head node is becoming a bottleneck for too long, but the messages before the crash seem to indicate that the worker node is overloaded. So while you're waiting on the metrics, I would keep an eye out for similar messages in worker logs and CPU utilization of the raylet and redis processes on the head node.
I've found that at least on EC2, machines can be pretty slow on startup, so it's possible that machines that just started up are getting overwhelmed
We hadn't come across this before we made this feature, when autoscaling itself would frequently add 5-10 machines at a time into the cluster (each of which would be sent a full set of CPU-bound tasks immediately), which would imply that it's not the workers themselves that are struggling on startup but perhaps some shared dependency that gets overloaded when 20+ start simultaneously (the head node, presumably?). We'll try to get some more examples to share with you and feed back on what appears to make the difference.
I see, it could be that Redis is a bottleneck when new machines are getting added, but I'm not sure exactly why that would be.
Are there any other possible bottlenecks on the head node we should be aware of (especially if there's some config we can tweak to mitigate them)? Is whatever master process Ray is running on the head node also single-threaded?
Hmm, it's definitely possible, but off the top of my head, I'm not sure why it would be in this case. Ray's scheduler is single-threaded but should be able to sustain thousands of tasks/s with default configuration settings. If the tasks in your application are very long, then the task throughput is probably much lower than the capacity of the head node scheduler.
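If you want a rough sanity check of scheduler throughput on your own cluster, a crude no-op benchmark along these lines (the address is a placeholder and the task count is just illustrative) should give a ballpark tasks/s figure:

import time
import ray

ray.init(redis_address="<head-node-ip>:6379")  # placeholder cluster address

@ray.remote
def noop():
    return None

n = 10000
start = time.time()
ray.get([noop.remote() for _ in range(n)])
elapsed = time.time() - start
print("%.0f no-op tasks/s" % (n / elapsed))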
By the way, do you have some code that we could try running? And/or maybe the autoscaler config you've been using? Thanks!
It's quite tricky to share the exact workflow we're using as the actual code depends on a lot of our internal codebase - probably running Ray Tune as specified in my earlier post:
the most common one we've run (and seen it on) is with Ray Tune, where we have say ~3-5 concurrent hyperparameter samples being run, and each sample itself spins off 3000 ray remote tasks (embarrassingly parallel, there are no dependencies between the 3k tasks) and then waits for all 3000 to finish. Whilst this means we have many thousands of simultaneous ray remote functions running at one time, only the 3-5 Ray Tune trainable function calls have any dependencies on any other tasks (the 3k sub-tasks).
Running that with the final sub-task being some CPU-bound, single-threaded workload that takes ~5 mins at a time, on an m5.2xlarge head node with ~50-100 c5.18xlarge worker nodes, is the closest to what we're doing that's easy to replicate. You may also want to try requesting say 5k cores using the request_cores branch of the autoscaler, as that may be related to starting many worker boxes in a very short period of time.
Either way, we've modified our setup to use multiple Redis shards and set each box to have only 71 CPU resources when it has 72 cores, so we'll let you know if we still get any crashes and share logs when we do.
Some quick feedback from our investigations since:
The head node seems generally uncontended CPU-wise - definitely no Redis-related issues (the raylet/monitor/reporter processes seem to use the most CPU, and because our CPU resource on the head is 1 they have ample spare CPU available). The biggest improvement has definitely come from leaving a spare CPU resource on the workers - even with one 'spare', the load average is generally beyond full saturation with Ray core work / other background OS stuff etc. I think the issue then was just Ray core having heartbeating issues if it can't access enough CPU to keep up with the backlog?
Hi, I'm a bot from the Ray team :)
To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.