Describe the bug
Topics randomly freeze, causing catastrophic topic outages on a weekly (or more frequent) basis. This has been an issue for as long as my team has used Pulsar, and it has been communicated to a number of folks on the Pulsar PMC.
(I thought an issue was already created for this bug, but I couldn't find it anywhere.)
To Reproduce
We have not figured out how to reproduce the issue. It appears to be non-deterministic, and we have not found any clues in the broker logs.
Expected behavior
Topics should never randomly stop working to the point where the only resolution is restarting the problem broker.
Steps to Diagnose and Temporarily Resolve

Step 2: Check the rate out on the topic. (Click on the topic in the dashboard, or get the stats for the topic and look at "msgRateOut".)
If the rate out is 0, this is likely a frozen topic, but to verify, do the following:
In the Pulsar dashboard, click on the broker that the topic is living on. If you see multiple topics with a rate out of 0, proceed to the next step. If not, it could be a different issue; investigate further.


Step 3: Stop the broker on the server that the topic is living on: pulsar-broker stop
Step 4: Wait for the backlog to be consumed and all the functions to be rescheduled. (This typically takes about 5-10 minutes.)
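The check in Step 2 can be sketched in Python as follows. This is a minimal sketch, not a definitive tool: it assumes the broker exposes the v2 admin REST path `/admin/v2/persistent/{tenant}/{namespace}/{topic}/stats`, and the helper names and the `min_stalled` threshold are hypothetical. Note that an idle topic also reports a rate out of 0, so a match is only a hint.

```python
import json
import urllib.request


def fetch_topic_stats(admin_url, topic):
    """Fetch stats for a persistent topic given as 'tenant/namespace/name'.

    Assumes the v2 admin REST path; adjust for your deployment.
    """
    url = f"{admin_url}/admin/v2/persistent/{topic}/stats"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def stalled_topics(stats_by_topic):
    """Return the sorted names of topics whose msgRateOut is exactly 0."""
    return sorted(
        name
        for name, stats in stats_by_topic.items()
        if stats.get("msgRateOut") == 0
    )


def looks_like_frozen_broker(stats_by_topic, min_stalled=2):
    """True when several topics on the same broker have a rate out of 0.

    min_stalled is a hypothetical threshold standing in for "multiple topics".
    """
    return len(stalled_topics(stats_by_topic)) >= min_stalled
```

To use it, collect the stats for every topic living on the suspect broker into one dict keyed by topic name and pass that dict to `looks_like_frozen_broker`.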
Environment:
Docker on bare metal running: `apachepulsar/pulsar-all:2.4.0`
on CentOS.
Brokers are the function workers.
This has been an issue with previous versions of Pulsar as well.
Additional context
Problem was MUCH worse with Pulsar 2.4.2, so our team needed to roll back to 2.4.0 (which has the problem, but it's less frequent).
This is preventing the team from progressing in the use of Pulsar, and it's causing SLA problems with those who use our service.
FWIW, @devinbost I am suspicious of the netty version in the 2.4.X stream. There is a known memory usage issue (https://github.com/netty/netty/issues/8814) in the 4.1.32 version of netty that is used in the stream. When I was testing, I would see issues like the ones you are describing. Then I patched in the latest version of netty, and things seemed better. I have been running with a patched 2.4.2 for a while and have not seen any issues. You may be experiencing something completely different, but it might be worth trying 2.4.2 + latest netty.
@devinbost have you looked into permits when this issue happens?
@sijie We have, but I can't remember exactly what our findings were.
Is there anything in particular that we should look for regarding permits?
@cdbartholomew We actually started experiencing this issue before 2.4.0.
I noticed that each topic lives on a single broker, which creates a single point of failure.
Is there any interest in making topics higher availability? (I suppose it would be a workaround, but it might help prevent other issues.)
> I noticed that each topic lives on a single broker, which creates a single point of failure.
> Is there any interest in making topics higher availability?
We (StreamNative) have been helping folks from Tencent develop a feature called ReadOnly brokers. It allows a topic to have multiple owners (one writable owner and multiple read-only owners). It has been running in production for a while. They will contribute it back soon.
> Is there anything in particular that we should look for regarding permits?
Incorrect permits have been one of the main causes of stalled consumers. You can use the "unload" command to unload a topic or a namespace bundle, which triggers a consumer reconnect. That resets the consumer state and mitigates the problem.
@codelipenghui and @jiazhai are working on a proposal to improve the permits protocol.
I am not sure if your problem is related to permits, but if the same problem occurs, the first thing you should do is use topic stats or topic stats-internal to get the stats or internal stats of the topic.
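As a rough illustration of what to look for in the topic stats output, here is a minimal Python sketch that flags consumers with no permits left on a subscription that still has backlog. It assumes the stats JSON has a `subscriptions` map whose entries carry `msgBacklog` and a `consumers` list with `availablePermits` and `consumerName`; the function name is hypothetical, and field names may differ between versions.

```python
def consumers_without_permits(topic_stats):
    """Return (subscription, consumer name, permits) for suspicious consumers.

    A consumer is flagged when its subscription has backlog but the consumer
    reports zero or negative availablePermits. Field names are assumed from
    the topic stats JSON and should be checked against your broker version.
    """
    flagged = []
    for sub_name, sub in topic_stats.get("subscriptions", {}).items():
        if sub.get("msgBacklog", 0) <= 0:
            continue  # nothing waiting to be delivered on this subscription
        for consumer in sub.get("consumers", []):
            permits = consumer.get("availablePermits", 0)
            if permits <= 0:
                name = consumer.get("consumerName", "?")
                flagged.append((sub_name, name, permits))
    return flagged
```

An empty result does not rule out a permits problem; it only means the stats snapshot shows no consumer that is out of permits while backlog is pending.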
I found an example involving a dead topic. (This was on a low-volume topic.)

Here's another case where we saw the topics on a particular broker freeze.

Notice that both topics on broker 09 froze.
Interestingly, the available permits seem to be fine (1000). Are you able to get some more stats so that we can help debug?
We will need to capture some stats when this happens next. After my team made some changes to improve the stability of the Zookeeper cluster, the frequency of this issue decreased on v2.4.0. So, we will need to update one of the clusters to use 2.4.2 to reproduce this issue again.
I also noticed that some of the bookkeeper nodes are currently running:
apachepulsar/pulsar-all:2.3.0
and some of them are running:
streamlio/pulsar-all:2.4.0-streamlio-24
I don't know if that would have anything to do with the issue.
Usually, this kind of problem is not related to BookKeeper. Are the brokers all running the same version, and what version is it?
The brokers are all running streamlio/pulsar-all:2.4.0-streamlio-24
The brokers are also running as the function workers. (There aren't dedicated function worker instances.)
Given that it is running a special version, it is hard for us to know whether that version contains any special changes. I think it is better to provide a jstack or heap dump when this problem happens. Otherwise, it is hard for the community to help here.
@sijieg Thanks for the advice.
I'll talk with the team about moving to a standard Pulsar release.
We did, however, experience this issue on the standard Pulsar 2.4.2 release.
@devinbost There is a fix (https://github.com/apache/pulsar/pull/5894) for the broker stopping message dispatch, and it was released in 2.5.0.
This issue is a duplicate of https://github.com/apache/pulsar/issues/5311
We noticed that after one of these situations, our Zookeeper nodes had gotten out of sync. We ran a diff after scraping the Zookeeper data on each of the ZK instances, and we noticed that only one ZK instance was behind (although we weren't able to check the ledger data since it's constantly changing.) The difference we noticed was that the ZK instance that was behind was missing several nodes in: /admin/policies
@addisonj Have you noticed anything similar?
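The comparison described above can be sketched in Python roughly as follows. This is a minimal sketch that assumes the znode paths have already been scraped from each ZK instance into a list; the function name and the instance labels are hypothetical, and it compares paths only, not node data.

```python
def missing_znodes(paths_by_instance):
    """Map each ZK instance to the sorted znode paths it lacks.

    A path counts as missing from an instance when at least one other
    instance has it. Instances that have every known path are omitted.
    """
    all_paths = set().union(*paths_by_instance.values())
    report = {}
    for instance, paths in paths_by_instance.items():
        missing = all_paths - set(paths)
        if missing:
            report[instance] = sorted(missing)
    return report
```

Paths that change constantly (such as ledger data) would need to be filtered out before the comparison, or they will show up as noise.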
@sijie @codelipenghui We have confirmed that this is actually still an issue in Pulsar 2.5.2.
If it's a clue, it only happens on topics involving functions.
@devinbost did you happen to have a heap dump?
@sijie Unfortunately, I don't. What's the best way for us to get a heap dump when this happens again?
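For reference, one common way to capture both a thread dump and a heap dump from a running JVM is with the JDK's `jstack` and `jmap` tools. The Python sketch below wraps those two commands. It assumes both tools are on the PATH in the same container as the broker process and that the broker's pid is already known; the function names, the `/tmp` output directory, and the file names are hypothetical.

```python
import subprocess


def diagnostic_commands(pid, out_dir="/tmp"):
    """Build the jstack and jmap command lines for the given JVM pid."""
    heap_file = f"{out_dir}/broker-{pid}.hprof"
    return {
        # -l adds lock information to the thread dump
        "jstack": ["jstack", "-l", str(pid)],
        # format=b writes a binary heap dump that heap analyzers can open
        "jmap": ["jmap", f"-dump:format=b,file={heap_file}", str(pid)],
    }


def capture(pid, out_dir="/tmp"):
    """Run both tools, saving the thread dump next to the heap dump."""
    cmds = diagnostic_commands(pid, out_dir)
    threads = subprocess.run(
        cmds["jstack"], capture_output=True, text=True, check=True
    )
    with open(f"{out_dir}/broker-{pid}.jstack.txt", "w") as fh:
        fh.write(threads.stdout)
    subprocess.run(cmds["jmap"], check=True)
```

The dumps are only useful if they are taken while the topic is frozen and before the broker is restarted, and a heap dump pauses the JVM while it is being written.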