Pulsar: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Created on 14 Jan 2020 · 25 Comments · Source: apache/pulsar

Describe the bug
Topics randomly freeze, causing catastrophic topic outages on a weekly (or more frequent) basis. This has been an issue for as long as my team has used Pulsar, and it has been communicated to a number of people on the Pulsar PMC.

(I thought an issue was already created for this bug, but I couldn't find it anywhere.)

To Reproduce
We have not figured out how to reproduce the issue. It appears to be non-deterministic, and we have not found any clues in the broker logs.

Expected behavior
Topics should never randomly stop working to the point where the only resolution is restarting the problem broker.

Steps to Diagnose and Temporarily Resolve

Step 2: Check the rate out on the topic. (Click on the topic in the dashboard, or run stats on the topic and look at "msgRateOut".)

If the rate out is 0, this is likely a frozen topic. To verify, do the following:

In the Pulsar dashboard, click on the broker that the topic is living on. If multiple topics on that broker have a rate out of 0, proceed to the next step. If not, it could be another issue; investigate further.

Step 3: Stop the broker on the server that the topic is living on: `pulsar-broker stop`

Step 4: Wait for the backlog to be consumed and all the functions to be rescheduled. (This typically takes about 5-10 minutes.)
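
The check in Step 2 can be scripted. The following is a minimal sketch, not something from the original report. It assumes the broker's admin REST endpoint is reachable at `http://localhost:8080` and that topic stats are served at `/admin/v2/persistent/<tenant>/<namespace>/<topic>/stats`. The helper names are hypothetical, and the extra "backlog is greater than 0" condition is an added assumption to tell a frozen topic apart from one that is simply idle.

```python
import json
import urllib.request

# Assumption: admin REST endpoint of a broker; adjust host and port for your cluster.
ADMIN_URL = "http://localhost:8080"


def fetch_topic_stats(topic, admin_url=ADMIN_URL):
    """Fetch stats for a persistent topic given as 'tenant/namespace/name'."""
    url = f"{admin_url}/admin/v2/persistent/{topic}/stats"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def total_backlog(stats):
    """Sum msgBacklog over all subscriptions in a topic stats document."""
    subs = stats.get("subscriptions", {})
    return sum(sub.get("msgBacklog", 0) for sub in subs.values())


def looks_frozen(stats):
    """Heuristic: nothing is going out even though messages are waiting."""
    return stats.get("msgRateOut", 0) == 0 and total_backlog(stats) > 0


def frozen_topics(stats_by_topic):
    """Return the sorted names of topics that look frozen.

    stats_by_topic maps topic name -> stats dict for topics on one broker.
    More than one hit on the same broker matches the pattern in Step 2.
    """
    return sorted(t for t, s in stats_by_topic.items() if looks_frozen(s))
```

For example, `frozen_topics({t: fetch_topic_stats(t) for t in topics_on_broker})` would list the suspect topics, where `topics_on_broker` is a list you supply yourself.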

Environment:

Docker on bare metal running `apachepulsar/pulsar-all:2.4.0` on CentOS.
Brokers are the function workers.

This has been an issue with previous versions of Pulsar as well.

Additional context

Problem was MUCH worse with Pulsar 2.4.2, so our team needed to roll back to 2.4.0 (which has the problem, but it's less frequent).
This is preventing the team from progressing in the use of Pulsar, and it's causing SLA problems with those who use our service.

triage/week-3 type/bug

Source

devinbost

All 25 comments

FWIW, @devinbost, I am suspicious of the netty version in the 2.4.X stream. There is a known memory usage issue (https://github.com/netty/netty/issues/8814) in the 4.1.32 version of netty that is used in the stream. When I was testing, I would see issues like the ones you are describing. Then I patched in the latest version of netty, and things seemed better. I have been running with a patched 2.4.2 for a while and have not seen any issues. You may be experiencing something completely different, but it might be worth trying 2.4.2 + latest netty.

cdbartholomew on 14 Jan 2020

@devinbost have you looked into permits when this issue happens?

sijie on 15 Jan 2020

@sijie We have, but I can't remember exactly what our findings were.
Is there anything in particular that we should look for regarding permits?

devinbost on 21 Jan 2020

@cdbartholomew We actually started experiencing this issue before 2.4.0.

devinbost on 21 Jan 2020

I noticed that each topic lives on a single broker, which creates a single point of failure.
Is there any interest in making topics higher availability? (I suppose it would be a workaround, but it might help prevent other issues.)

devinbost on 21 Jan 2020

> I noticed that each topic lives on a single broker, which creates a single point of failure.
> Is there any interest in making topics higher availability?

We (StreamNative) have been helping folks from Tencent develop a feature called ReadOnly brokers. It allows a topic to have multiple owners (one writeable owner and multiple readonly owners). It has been running in production for a while. They will contribute it back soon.

sijie on 21 Jan 2020

> Is there anything in particular that we should look for regarding permits?

Incorrect permits have been one of the main causes of stalled consumers. You can use the "unload" command to unload a topic or a namespace bundle, which triggers a consumer reconnect. That resets the consumer state and mitigates the problem.

@codelipenghui and @jiazhai are working on a proposal to improve the permits protocol.

I am not sure if your problem is related to permits, but if the same problem occurs, the first thing you should do is use topic stats or topic stats-internal to get the stats or internal stats of the topic.
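
As an illustration of what to look for in those stats, here is a minimal sketch that is not from the thread. It assumes the JSON printed by `pulsar-admin topics stats <topic>` has been parsed into a dict, with per-subscription `msgBacklog` and per-consumer `availablePermits` fields. The function names and the "no permits while a backlog exists" rule are assumptions for illustration only.

```python
def permits_report(stats):
    """Return (subscription, consumer, availablePermits, msgBacklog) rows."""
    rows = []
    for sub_name, sub in stats.get("subscriptions", {}).items():
        backlog = sub.get("msgBacklog", 0)
        for consumer in sub.get("consumers", []):
            rows.append((
                sub_name,
                consumer.get("consumerName", ""),
                consumer.get("availablePermits", 0),
                backlog,
            ))
    return rows


def stalled_consumers(stats):
    """Rows for consumers with no permits left while a backlog is waiting."""
    return [row for row in permits_report(stats) if row[2] <= 0 and row[3] > 0]
```

A consumer that shows up in `stalled_consumers` is a candidate for the unload workaround. An empty result suggests looking beyond permits.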

sijie on 21 Jan 2020

I found an example involving a dead topic. (This was on a low-volume topic.)

devinbost on 21 Jan 2020

Here's another case where we saw the topics on a particular broker freeze.

Notice that both topics on broker 09 froze.

devinbost on 21 Jan 2020

Interestingly, the available permits seem to be fine (1000). Are you able to get some more stats so that we can help debug?

sijie on 22 Jan 2020

We will need to capture some stats when this happens next. After my team made some changes to improve the stability of the Zookeeper cluster, the frequency of this issue decreased on v2.4.0. So, we will need to update one of the clusters to use 2.4.2 to reproduce this issue again.

devinbost on 25 Jan 2020

I also noticed that some of the bookkeeper nodes are currently running:
apachepulsar/pulsar-all:2.3.0
and some of them are running:
streamlio/pulsar-all:2.4.0-streamlio-24

I don't know if that would have anything to do with the issue.

devinbost on 25 Jan 2020

Usually, this kind of problem is not related to BookKeeper. Are the brokers all running the same version, and what is that version?

sijie on 25 Jan 2020

The brokers are all running streamlio/pulsar-all:2.4.0-streamlio-24

devinbost on 28 Jan 2020

The brokers are also running as the function workers. (There aren't dedicated function worker instances.)

devinbost on 28 Jan 2020

Given that it is running a special version, it is hard for us to know whether that version contains any special changes. I think it is better to provide a jstack or heap dump when this problem happens. Otherwise, it is hard for the community to help here.
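
One possible way to capture both artifacts is with the standard JDK tools `jstack` and `jmap`. This is a minimal sketch under stated assumptions, not a procedure from the thread. It assumes the JDK tools are on the PATH inside the broker container, that you already know the broker's JVM process ID, and that you run as the same user as the broker. The file names are hypothetical.

```python
import subprocess


def dump_commands(pid, out_dir="."):
    """Build the jstack and jmap command lines for a JVM process ID."""
    heap_file = f"{out_dir}/broker-{pid}.hprof"  # hypothetical file name
    return [
        ["jstack", "-l", str(pid)],  # thread dump including lock information
        ["jmap", f"-dump:live,format=b,file={heap_file}", str(pid)],
    ]


def capture(pid, out_dir="."):
    """Write a thread dump to a text file, then a heap dump next to it."""
    jstack_cmd, jmap_cmd = dump_commands(pid, out_dir)
    with open(f"{out_dir}/broker-{pid}.jstack.txt", "w") as fh:
        subprocess.run(jstack_cmd, stdout=fh, check=True)
    subprocess.run(jmap_cmd, check=True)
```

Taking the thread dump first keeps it close to the moment of the freeze. The `live` option makes `jmap` dump only reachable objects, which forces a full GC and pauses the JVM, so it is best run on a broker that is about to be restarted anyway.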

sijie on 28 Jan 2020

@sijieg Thanks for the advice.
I'll talk with the team about moving to a standard Pulsar release.
We did, however, experience this issue on the standard Pulsar 2.4.2 release.

devinbost on 28 Jan 2020

@devinbost There is a fix, https://github.com/apache/pulsar/pull/5894, for the broker stopping message dispatch. It was released in 2.5.0.

codelipenghui on 29 Jan 2020

This issue is a duplicate of https://github.com/apache/pulsar/issues/5311

devinbost on 11 Feb 2020

We noticed that after one of these situations, our ZooKeeper nodes had gotten out of sync. We ran a diff after scraping the ZooKeeper data on each of the ZK instances, and we noticed that only one ZK instance was behind (although we weren't able to check the ledger data, since it's constantly changing). The ZK instance that was behind was missing several nodes in: /admin/policies
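
A comparison like this can be sketched as a set difference over the znode paths scraped from each instance. The sketch below is illustrative only and is not the script that was actually used. It assumes the paths have already been collected into one set per instance, and the instance names and paths in the usage note are hypothetical.

```python
def missing_znodes(paths_by_instance):
    """Map each ZK instance to the paths that other instances have but it lacks.

    paths_by_instance maps an instance name to an iterable of znode paths.
    Instances that are missing nothing are left out of the result.
    """
    all_paths = set().union(*(set(p) for p in paths_by_instance.values()))
    report = {}
    for name, paths in paths_by_instance.items():
        missing = all_paths - set(paths)
        if missing:
            report[name] = sorted(missing)
    return report
```

With three instances where only `zk3` lacks `/admin/policies/tenant-a/ns-2`, the result contains a single entry for `zk3` listing that path.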

devinbost on 10 Mar 2020

@addisonj Have you noticed anything similar?

devinbost on 10 Mar 2020

@sijie @codelipenghui We have confirmed that this is actually still an issue in Pulsar 2.5.2.

devinbost on 14 Jul 2020

If it's a clue, it only happens on topics involving functions.

devinbost on 14 Jul 2020

@devinbost did you happen to have a heap dump?

sijie on 15 Jul 2020

@sijie Unfortunately, I don't. What's the best way for us to get a heap dump when this happens again?

devinbost on 15 Jul 2020
