Thingsboard: Thingsboard crashes every couple of days. java.lang.OutOfMemoryException Java Heap Space

Created on 9 Jan 2018 · 52 Comments · Source: thingsboard/thingsboard

Hello. I have been working with thingsboard for a while. I publish data to thingsboard through a custom application that acts as a gateway (it currently collects data from three devices) and pushes data to thingsboard about every 8 seconds via MQTT. It works fine for a couple of days, but then thingsboard crashes. The thingsboard service shows as "Active (running)", but when I try to open the thingsboard UI in the browser it does not load. My application also stops publishing data because it can no longer establish an MQTT connection. I have to restart the thingsboard service every time this happens. Once the service is restarted, my application resumes publishing. I have tried thingsboard with the embedded HSQLDB, Cassandra and PostgreSQL, but the same "OutOfMemoryError: Java heap space" causes a crash in every case. The only difference is that with Cassandra the crash happens after about three days, whereas with PostgreSQL and the embedded HSQLDB it happens within a day or two.

My thingsboard instance is hosted on a Digital Ocean droplet with 4GB RAM / 2 CPUs running Ubuntu. So far I have tried increasing the memory to 1GB by adding this line to the /etc/thingsboard/thingsboard.conf file:
export JAVA_OPTS="$JAVA_OPTS -Dplatform=deb -Xms1024M -Xmx1024M"
The issue persists. I restarted the thingsboard service and monitored its memory usage with top, and found that within 24 hours the usage increased from 17.8% to 38.0%.

Here are the ERRORs from the thingsboard.log file that occurred during two different crashes. The errors are always Java heap space errors:

2018-01-08 02:12:21,932 [nioEventLoopGroup-4-1] ERROR i.n.u.c.D.rejectedExecution - Failed to submit a listener notification task. Event loop shut down?
2018-01-08 02:24:18,326 [pool-23-thread-1] ERROR o.t.s.a.p.PluginProcessingContext - **Critical error: Java heap space**

2018-01-09 01:08:00,507 [nioEventLoopGroup-5-9] ERROR o.t.s.t.mqtt.MqttTransportHandler - [mqtt11829] Unexpected Exception
2018-01-09 01:08:02,479 [http-nio-0.0.0.0-8080-exec-1] ERROR o.a.c.c.C.[.[.[.[dispatcherServlet] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Handler processing failed; nested exception is **java.lang.OutOfMemoryError: Java heap space**] with root cause

All 52 comments

  1. Could you please provide the full log file and a heap dump file?
    When the JVM crashes it generates a heap dump and a thread dump for troubleshooting. You can use these flags to specify the location of those files:
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath={your_dir}/log

We need the heap dump to find the memory leak, if there is one.
Also, an increase in memory usage from 17% to 38% is not in itself a problem for a JVM process, because of the way the garbage collector works.

  1. Another question: are there any other processes running on the same instance as Thingsboard? Maybe another process consumes all available memory, so Thingsboard cannot allocate the memory it needs for new objects.

  2. Please check for OS OOM killer messages, are there any processes that were killed in the same time window?
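One way to run the OOM killer check is to filter the kernel log for the messages the Linux OOM killer usually writes. The sketch below is only a heuristic: the exact wording differs between kernel versions, and the function name `find_oom_events` is made up for this illustration.

```python
import re
from typing import Iterable, List

# Phrases the Linux kernel typically logs when the OOM killer acts.
# Wording varies between kernel versions, so treat this as a heuristic.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill(ed)? process", re.IGNORECASE
)


def find_oom_events(lines: Iterable[str]) -> List[str]:
    """Return the log lines that look like OOM killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

The input can be the output of `dmesg` split into lines, or the lines of a kernel log file (on Ubuntu this is often /var/log/kern.log, but that path is an assumption about the host). An empty result for the time window of the crash suggests the OOM killer was not involved.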

Do you have any rules configured? Are you sure that your MQTT client receives acknowledgements?

@vparomskiy I am attaching the two log files for the days when the two crashes happened. The funny thing is that even though thingsboard crashed, the thingsboard.jar process was still running when I checked top after the crash occurred, so there were no OOM killer messages (I checked). So I am not sure if the JVM actually crashed. I'll run thingsboard again with the flags you mentioned to see if a heap dump file is generated.

There were no other processes running, just PostgreSQL, thingsboard and the usual background processes.

@ashvayka I don't have any rules configured currently. I can connect and publish data to thingsboard via MQTT without any issues, so that implies that my client's connection request is acknowledged.

thingsboard.2018-01-08.0.log
thingsboard.2018-01-09.0.log

UPDATE!!!
@vparomskiy I can confirm now that the JVM doesn't crash. There is no heap dump file generated (I used the flags you suggested). The thingsboard process and service are still shown as running, but the web UI doesn't load and data can't be pushed from the client side until the thingsboard service is restarted. I am attaching two more log files from the last two days. Hope they can help pinpoint the issue.
thingsboard.2018-01-10.0.log
thingsboard-2018-01-11.0.log

Have you figured this out at all? My logs look similar to yours and I've had to revert the VM running thingsboard a couple times now to backup copies to get things working. But this method obviously kills the database so I'd rather not keep doing it.

@cchaz003 Nope, the issue is still there. It is probably a memory leak somewhere in thingsboard. As a temporary fix, I created a cronjob that restarts the thingsboard service every 12 hours or so. Since then thingsboard has been running smoothly, because the service restart resets its memory usage. Note that in my case the JVM does not crash when the out of memory error occurs. The thingsboard process keeps running, but it is unresponsive when I try to access it through the browser, and my app also stops pushing data to it. With the cronjob in place I have not faced the issue since.

Hmm, I'm not sure if that fix would work in my situation. When my installation crashes it never gets running again, even after I restart the process (or even the whole VM). My current solution is to double the VM memory, which seemed to get it working again, but it's not exactly ideal since it leaves me a bit strapped for memory on the host machine. Then again, the host machine isn't the most ideal server in the world either, so that's on me I guess.

In your case you should prevent it from crashing in the first place. Make a note of how long it takes on average for thingsboard to crash on your machine, then set up a cronjob to restart the thingsboard service automatically before it actually crashes. So if it crashes after 5 hours of running, set up a cronjob to restart the service every 4 hours or so. It will mean that thingsboard is offline for a few seconds every 4 hours, but that is better than it crashing and being offline permanently. Hopefully the contributors of thingsboard can find a fix soon.
This is the cronjob I set up as my fix for now. It restarts thingsboard at 9am and 9pm every day. I am on Ubuntu, so my service command is located in /usr/sbin:
0 9,21 * * * /usr/sbin/service thingsboard restart > /dev/null 2>&1
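A variation on the scheduled restart is a small watchdog that restarts the service only when the web UI stops answering. The sketch below is a workaround in the same spirit as the cron entry, not a fix for the underlying leak. It assumes the UI is served on localhost port 8080 (the port that appears in the log lines earlier in the thread) and reuses the restart command from the cron entry; the function names are hypothetical.

```python
import http.client
import subprocess
import time
import urllib.request
from typing import Callable, Sequence

# Assumptions: UI on localhost:8080, service managed via /usr/sbin/service.
UI_URL = "http://localhost:8080/"
RESTART_CMD = ["/usr/sbin/service", "thingsboard", "restart"]


def ui_is_responsive(url: str = UI_URL, timeout: float = 10.0) -> bool:
    """Return True if the UI answers an HTTP request within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts and HTTP error statuses
        # (urllib raises HTTPError, an OSError subclass, for 4xx/5xx).
        return False


def restart_if_unresponsive(
    probe: Callable[[], bool] = ui_is_responsive,
    restart: Callable[[Sequence[str]], object] = subprocess.run,
    attempts: int = 3,
    pause: float = 5.0,
) -> bool:
    """Probe up to `attempts` times; restart the service if every probe fails.

    Returns True if a restart was triggered.
    """
    for attempt in range(attempts):
        if probe():
            return False
        if attempt < attempts - 1:
            time.sleep(pause)  # give a busy server a moment before retrying
    restart(RESTART_CMD)
    return True
```

Calling `restart_if_unresponsive()` from a cron job every few minutes would restart thingsboard only when it has actually stopped responding, instead of at fixed times.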

Unfortunately, when my installation goes off I can't get it to run at all. It just eats up 100% CPU and spits out log files that have the same "o.t.s.t.mqtt.MqttTransportHandler - [mqtt1029] Processing connect msg for client: N003!" error that you have. I don't have my logs in front of me right now, but I also recall complaints about memory.

I've had a similar situation, with ThingsBoard using up 100% CPU. The reason the service fails to start seems to be that the data file is too large. If you don't clean up the data file, the service will not start successfully.

If you don't clean up the data file

@ghost what do you mean by "data" file?

To add to that, _how_ would you clean up the data file?

Seeing something similar here. Everything was fine for the first week or so, but the UI has gotten progressively slower (despite occasional reboots/restarts of thingsboard). Data overload on too feeble a server? (rpi3)

I experience the same too. What can be done? Looks like this will severely jeopardize the viability of this platform.

May I know how you guys solved this issue?
I just posted a new issue at #861 https://github.com/thingsboard/thingsboard/issues/861
My error is java.lang.OutOfMemoryError: GC overhead limit exceeded
Please assist, thank you.

I'm running it on a 1GB Ubuntu droplet on DigitalOcean, and I think it just crashed after I sent some messages via MQTT. Is there a solution to this, or is the RAM just not enough? I'm using Postgres, not Cassandra.

I wish I could offer any solution to this... I really enjoyed using thingsboard but even when I increased my VM to 8GB RAM, it still ended up crashing - It just took longer.

Since this problem has not been resolved I've had to look elsewhere for IoT monitoring/control. My use was mostly monitoring so I have now switched to using Grafana for data visualization. Not ideal but gives me the overview that I want.

Would love to hear about a solution of some sort to this problem at some point. Thingsboard is a really great piece of software and I'd love to be able to move back to it.

@cchaz003 I am with you there. We just started evaluating TB. At first look it meets all our needs, but this could be a show stopper. I thought the problem was that the instance did not have enough memory to run TB, but 8GB should be sufficient. Out of memory problems are common for Java apps. Not a fan.

would love to hear any success story with TB.

Revisiting this after a long time. FINALLY FOUND A SOLUTION TO THIS! This issue seems to be persistent with earlier versions of thingsboard especially v.1.3.1. I upgraded to the latest version of thingsboard about two months ago (v 2.0.3 at the time) and VOILA! no more crashes or out of memory exceptions. Here is the upgrade guide: https://thingsboard.io/docs/user-guide/install/upgrade-instructions/
I hope it works out well for everyone else.
P.S. My thingsboard instance is still hosted on the same digital ocean droplet with 4GB RAM/ 2 CPUs running Ubuntu.

@mehranq running v2.1 here. Could the bug come back? We are also testing TB on DigitalOcean with Ubuntu 18.04, 1GB RAM and Oracle Java.

@koo9 I have been running thingsboard v2.0.3 for the past two months without any issues. I think they resolved the memory issues in one of the previous versions, so I don't think the bug will come back.

@mehranq it just happened today when I tried sending mqtt message to the box, it crashed after a minute or so. how are the devices communicating with your TB server?

I built an app that pushes my device data to thingsboard. The app uses an MQTT library to connect to thingsboard and push data to it. My issue was that, with this setup, thingsboard would run out of memory after a few days of continuous data pushing. I stopped having that issue once I upgraded to the latest version of thingsboard.

@mehranq not sure what to do here. I'm seeing similar behavior after sending MQTT messages from the client, and it's already running v2.1.

@koo9 Please describe the load that you are generating.

  1. How many devices submit data concurrently?
  2. What DB are you using?
  3. How much memory is allocated for the thingsboard process and database?
  4. If you are sending attributes\telemetry in batches - what is the batch size?
  5. What transport are you using?

Also, when the JVM process crashes because of an OutOfMemoryError, a heap dump file is generated. Could you please send it so we can find the cause of this error in your case?

We have a lot of cases when Thingsboard works in production with the average load 500-1500 requests per second and it is stable.

@vparomskiy I have restarted the service. right now there is only one device sending mqtt message every minute to TB. so far haven't seen the crash like it did in the first time. will keep it running for a little while and see how it goes. thx for getting back to this. great work btw.

@vparomskiy it turned out that the IoT device was opening too many connections because of an error, and that caused the server to crash. We have been testing it for a day or so, and it looks promising so far.
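The comment above does not say what the client-side error was. One common way a client ends up opening connections in a tight loop is reconnecting immediately after every failure, and a capped exponential backoff is a general guard against that. The sketch below shows only the delay schedule; `backoff_delays` and its parameters are illustrative names, not part of any MQTT library.

```python
import random
from typing import Iterator


def backoff_delays(
    base: float = 1.0, cap: float = 60.0, jitter: float = 0.0
) -> Iterator[float]:
    """Yield reconnect delays in seconds: base, 2*base, 4*base, ... up to cap.

    `jitter` (0..1) shortens each delay by a random fraction of at most that
    size, so that many devices do not reconnect in lockstep.
    """
    delay = base
    while True:
        yield delay * (1.0 - jitter * random.random())
        delay = min(cap, delay * 2)
```

A client would keep a single connection open, sleep for `next(delays)` seconds before each reconnect attempt, and create a fresh generator once a connection succeeds.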

Hi,

We hit the same issue three times this week. We are using Thingsboard PE 2.1 (running on AWS EC2, m5.large: 2 vCPUs, 8 GiB memory). We found that thingsboard writes its log files to the EC2 root file system (/var/log/thingsboard). Each log file is about 100 MB.

The EC2 root file system gets filled up by those log files. Once the free space drops to zero, we cannot connect to Thingsboard PE 2.1.

Does anyone know how to solve this issue ?

In my case it was caused by opening too many MQTT connections to the server. Please check how many connections there are when it crashes.
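One way to check the connection count on the server is to parse the output of `ss -tn` and count established connections per client address. The sketch below assumes the MQTT transport listens on port 1883 (the standard non-TLS MQTT port; adjust it if your installation differs), and `count_mqtt_clients` is a name chosen for this example.

```python
from collections import Counter
from typing import Iterable

MQTT_PORT = 1883  # assumption: default non-TLS MQTT port


def count_mqtt_clients(ss_lines: Iterable[str], port: int = MQTT_PORT) -> Counter:
    """Count established TCP connections to `port`, grouped by peer address.

    Expects lines in the column layout printed by `ss -tn`:
    State  Recv-Q  Send-Q  Local-Address:Port  Peer-Address:Port
    """
    per_peer: Counter = Counter()
    for line in ss_lines:
        fields = line.split()
        # Skip the header and every socket that is not established.
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        local, peer = fields[3], fields[4]
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        per_peer[peer.rsplit(":", 1)[0]] += 1
    return per_peer
```

The lines can come from `subprocess.run(["ss", "-tn"], capture_output=True, text=True).stdout.splitlines()`. A single device address with dozens of entries would point to the kind of connection leak described above.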

Hi Kevin,

We have 100 devices to connect with Thingsboard. In the future, we will have more than 100 devices to connect with Thingsboard. How can we avoid this issue ?

Does Thingsboard have any formula to size the capacity of Server or instance ?

hi Paul,

Not saying that's the only cause, but it might be worth looking into.

Hi Kevin,

Thanks for the reminder.

hi Paul,

@vparomskiy might be more helpful with the troubleshooting. please look at his messages above and have the log file handy.

@paulwang55 the root cause of your problem is that there is not enough free space left on the disk. We need to find out why the log files are so big, because a 100MB log file is definitely not OK. If you can provide the log file, I'll be able to give more details.
2vCPU and 8GB ram should handle 100+ online devices without problems.

PS: log file retention policy and the directory is configured in /etc/thingsboard/conf/logback.xml
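Before changing the retention policy, it can help to see how much space the logs occupy and how close the disk is to full. The sketch below lists the largest files in a directory and reports the free fraction of its file system; the helper names are hypothetical, and /var/log/thingsboard is the directory mentioned earlier in the thread.

```python
import os
import shutil
from typing import List, Tuple


def largest_files(directory: str, top: int = 5) -> List[Tuple[int, str]]:
    """Return up to `top` (size_in_bytes, name) pairs, largest first.

    Only regular files directly inside `directory` are considered.
    """
    sizes = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                sizes.append((entry.stat(follow_symlinks=False).st_size, entry.name))
    return sorted(sizes, reverse=True)[:top]


def free_fraction(path: str) -> float:
    """Return the fraction (0..1) of the file system holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total
```

For example, `largest_files("/var/log/thingsboard")` together with `free_fraction("/")` shows whether the log directory is what is filling the root file system.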

Hello, I encountered the same problem: CPU at 100% and Thingsboard does not work. I used 8GB, but after a week everything stopped. Maybe someone can suggest how to fix this problem. Thanks

  1. Please write what Thingsboard version you are using.
  2. Share log files

Hi vparomskiy ,
Please use this link to access the log files and logback.xml.
https://www.dropbox.com/sh/nl4jaapf6pqm0p6/AABHPYl9nTXcPGi1fos6ah-Oa?dl=0

We use Thingsboard 2.1.0 running on AWS EC2 (we order this service from AWS marketplace).

The server's CPU jumps to 80%, then log output is generated very quickly. It seems to happen at a fixed time (12:25 PM UTC to 08:33 PM UTC).

We don't have any scheduled jobs. We don't understand why it happens. Please help us solve this issue. Thanks.

It is a bug in Thingsboard 2.1.0. A fix will be available in the next release. The problem is that in some cases the Actor system is not able to initialize correctly and generates a lot of log messages. The issue is already resolved in master, and we are currently testing the next thingsboard release, which is expected at the end of October.

This issue can be identified by this log entry:

2018-10-09 13:29:21,677 [Akka-akka.actor.default-dispatcher-3] ERROR o.t.server.actors.tenant.TenantActor - Unknown failure
akka.actor.ActorInitializationException: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:172)
at akka.actor.ActorCell.create(ActorCell.scala:606)
at akka.actor.dungeon.FaultHandling$class.finishCreate(FaultHandling.scala:136)
at akka.actor.dungeon.FaultHandling$class.faultCreate(FaultHandling.scala:130)
at akka.actor.ActorCell.faultCreate(ActorCell.scala:374)
at akka.actor.dungeon.FaultHandling$class.faultResume(FaultHandling.scala:102)
at akka.actor.ActorCell.faultResume(ActorCell.scala:374)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:466)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)
at akka.dispatch.Mailbox.run(Mailbox.scala:223)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.NullPointerException: null
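To check whether a log file contains this signature, and which loggers produce most of the errors, the file can be scanned line by line. The sketch below assumes the line layout seen in the excerpts in this thread (timestamp, thread name in brackets, level, logger, message); `summarize_errors` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches lines such as:
# 2018-10-09 13:29:21,677 [thread] ERROR o.t.server.actors.tenant.TenantActor - Unknown failure
ERROR_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} \[[^\]]*\] ERROR (\S+) - "
)
ACTOR_INIT_MARKER = "akka.actor.ActorInitializationException"


def summarize_errors(lines: Iterable[str]) -> Tuple[Counter, int]:
    """Return (ERROR count per logger, count of ActorInitializationException lines)."""
    per_logger: Counter = Counter()
    actor_init_failures = 0
    for line in lines:
        match = ERROR_LINE.match(line)
        if match:
            per_logger[match.group(1)] += 1
        elif line.startswith(ACTOR_INIT_MARKER):
            # Only the exception header line starts with the class name;
            # stack frames start with "at" and are not counted.
            actor_init_failures += 1
    return per_logger, actor_init_failures
```

Passing an open log file to `summarize_errors` gives both numbers in one pass. A large count of actor initialization failures would match the bug described above.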

Hi vparomskiy,

May I ask you a question about the version of ThingsBoard PE? Thingsboard PE 2.1.3 is offered on the AWS marketplace now.

Is it the latest version of Thingsboard PE?

Thingsboard PE 2.1.3 from the AWS marketplace is NOT the latest version. It will take some time for the latest version to be approved in the marketplace.

But if you need the latest version, you can manually upgrade your instance using this guide:
https://thingsboard.io/docs/user-guide/install/aws-marketplace-pe-upgrade/#upgrading-to-thingsboard-pe-v213

Hi vparomskiy,

Thanks for your reply. I just want to make sure whether Thingsboard PE 2.1.3 fixes my issue or not. According to your previous reply (9 days ago), the next release will fix the bug in TB PE 2.1.0. So I still need to wait for the next release, right?

You can manually upgrade your instance using provided instruction.
It should resolve your problem.

Hi vparomskiy,

I know I can upgrade my instance manually using the instructions. But the version in the instructions is Thingsboard PE 2.1.3. That is why I asked about the version.

I just want to make sure whether the issue is solved by this version (Thingsboard PE 2.1.3) or not. If the answer is "yes", I will upgrade my instance. Otherwise, I still need to wait for the latest version you mentioned.

Please provide your advice. Thanks.

The answer is yes.
Bug with Actor system initialization was fixed in 2.1.3

Hi,

Thanks for your reply. We will upgrade it ASAP.

@paulwang55 any updates? Did upgrading to the latest version resolve your issue?

I have the same problem as described above (version 2.1.0). Is there a way to fix the problem without reinstalling thingsboard?
Thank you!

Hi vparomskiy,

Our issue was solved after we upgraded the thingsboard server to 2.1.3.

The issue can be closed for me.

Thanks for your assistance.

Hi @vparomskiy

I'm still running into the Java heap space error. I have the latest Thingsboard version running.
In the attachment I have a section of the logfile. Do you have any idea?

Logfile.txt

@dobrun

  1. what instance are you using (ram&cpu)?
  2. please provide full log file
  3. What database are you using?

@dobrun is this issue still valid for you?

I ran thingsboard on an AWS EC2 micro instance with a Cassandra database. I reinstalled the service from scratch with a PostgreSQL database.

Now it's working fine.

Thank you very much!

Hi @vparomskiy

I'm experiencing these issues on the latest version of ThingsBoard CE 2.4.1 and PostgreSQL.

I'm using an AWS t2.micro instance, which crashes every day or two; the CPU load hits 100% just before the crash. I only have 3 devices connected that upload every 2 seconds.

I'm hoping you can reopen this issue and support me. Thanks
