Thingsboard: Time out error while posting large amount of telemetries and using REST API

Created on 21 Jun 2018 · 15 Comments · Source: thingsboard/thingsboard

We are using ThingsBoard 2.0.3 with a Cassandra database.

How to reproduce the error:

Create a device by calling the REST API "/api/device" with the following JSON body: {'name': u"L'Isle-Jourdain", 'type': 'Vigicrues'}
Post telemetry to this device using the REST API "/api/v1/my_device_token/telemetry". We post a large amount of telemetry in a single request; the attached file contains the JSON document sent to this service:
telemetries.txt

Result: the service takes a long time to respond and ends with the following timeout error: 408 {"error":"Timeout during persistence of the message to the queue!"}
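
For readers who want to try the reproduction, here is a minimal sketch of the single large request described above. It assumes a ThingsBoard instance at http://localhost:8080 and a timestamped payload shape (a list of {"ts": ..., "values": {...}} objects). The host, the `build_telemetry_request` helper, the "level" key, and the generated readings are all hypothetical stand-ins, since the attached telemetries.txt is not reproduced here.

```python
import json
import urllib.request


def build_telemetry_request(base_url, device_token, points):
    """Build (but do not send) one POST carrying every point in a single body."""
    body = json.dumps(points).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/{device_token}/telemetry",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload: 500 readings, one per minute, from an arbitrary
# epoch-millisecond start time. The real attachment may look different.
points = [
    {"ts": 1529539200000 + i * 60000, "values": {"level": 1.0 + i * 0.01}}
    for i in range(500)
]

request = build_telemetry_request(
    "http://localhost:8080",  # assumed host and port
    "my_device_token",        # placeholder token from the report
    points,
)

# To actually send it against a running server:
# urllib.request.urlopen(request, timeout=60)
```

The request is only constructed here, so the sketch runs without a server. Uncommenting the last line sends it.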

jeansey

All 15 comments

Could you please share your log file?
Did you make any changes in the Rule Chains? Are there any custom nodes for data transformation or enrichment?
What instance is used for the Thingsboard and Cassandra (CPU/RAM)?

vparomskiy on 21 Jun 2018

Here are our log files:
thingsboard.log
thingsboard.out.log

We didn't change the Rule Chains. We had rules before the migration from v1.4 to v2.0.3, but they were deleted automatically, if we understood the migration process correctly.
Here are screenshots of our rule chains:
rulechains1
rulechains2

ThingsBoard and Cassandra are running on the same instance: a virtual machine with 8 CPUs and 16 GB RAM
serverressources

You can check the Java options we set for those processes:
javaOptions.txt

For information, we didn't have this issue when posting the same telemetry with ThingsBoard 1.4. Our database has around 1500 devices.

jeansey on 21 Jun 2018

We are facing a similar problem. In version 1.4.x, we already had a slowness problem that froze the whole web service. We migrated from Postgres to Cassandra, which solved the problem.

We have just migrated to version 2.x to take advantage of the latest advances in rule chains. Unfortunately, all insertions are now very slow (on the REST API, more than 10x slower) and ultimately cause a complete freeze of the web application.

Any solutions? Maybe it's just a configuration problem.

geonux on 26 Jun 2018

Same problem here (with 2.1.0 and maybe earlier), using MQTT instead of REST.

Similar problem here.
Devices store a large number of telemetry messages in the client's persistent storage while disconnected (using the Paho MQTT C client library). When the device client reconnects to TB over MQTT, the first few hundred persisted messages are sent successfully to the server and removed from the client's persistent storage. The remaining messages are answered by TB with

{"error":"Timeout during persistence of the message to the queue!"}

and will not be removed from the client's persistent storage (which is correct and good!).
During the next reconnect, the next couple of hundred messages are transferred successfully...
It can be quite a lengthy process to get rid of the messages, but at least no data is lost.

Any thoughts on how to avoid the timeout on the server side?

BatListener on 29 Aug 2018

Short answer:
Please increase the actors.rule.queue.max_size property in the thingsboard.yml file. For example, set it to 10000.

Long answer:
When a device submits messages to ThingsBoard (telemetry/attributes/RPC), all messages are saved in the Queue by default. After the messages are saved, they are passed to the Rule Engine for processing and a response is generated for the device. As you can see, processing inside the Rule Engine is asynchronous, and we need to guarantee that if a device receives '200 OK' from ThingsBoard, the message will be processed.
The queue is cleared after message processing is finished.
When a batch of messages is submitted, it is saved in the queue as N separate messages.
We limit the number of concurrent messages in the Queue for a single Tenant. The default max value is 100.
So when 2 batches of 60 messages each are submitted from a single tenant, the total of 120 exceeds the limit: the message queue will reject some of those messages and the device will receive an error.
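
The arithmetic in this explanation can be illustrated with a toy model. This is only a sketch of the per-tenant limit as described above, not ThingsBoard's actual queue code. The `submit_batches` function is hypothetical, and it assumes that no message finishes processing (and so leaves the queue) between the two batches.

```python
def submit_batches(batch_sizes, max_size=100):
    """Count accepted and rejected messages for one tenant.

    Assumption: batches arrive back to back, so nothing is removed
    from the queue while they are being enqueued.
    """
    in_queue = 0
    accepted = 0
    rejected = 0
    for size in batch_sizes:
        # Each batch is stored as `size` separate messages.
        for _ in range(size):
            if in_queue < max_size:
                in_queue += 1
                accepted += 1
            else:
                rejected += 1
    return accepted, rejected


# Default limit of 100: two batches of 60 overflow by 20 messages.
print(submit_batches([60, 60]))                  # (100, 20)

# With the limit raised to 10000, both batches fit.
print(submit_batches([60, 60], max_size=10000))  # (120, 0)
```

In a real deployment the queue drains as the Rule Engine finishes processing, so the number of rejected messages depends on how fast processing keeps up with the incoming load.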

vparomskiy on 11 Sep 2018

Thanks for the excellent explanation!
Appreciate it!

BatListener on 11 Sep 2018

Thank you for the guidance, but the problem persists!

m0o0 on 9 Oct 2018

More details would help with the investigation.

vparomskiy on 9 Oct 2018

@vparomskiy Is there any way to observe the Queue's state? Also, can we change the default max value for the concurrent messages per Tenant via configuration or is this not exposed?

chatper on 11 Oct 2018

In the current version, it is not possible.

vparomskiy on 11 Oct 2018

Contributions are welcome.

vparomskiy on 16 Oct 2018

Increasing the actors.rule.queue.max_size did not help. I get the following error...

2018-11-21 05:05:52,380 [pool-23-thread-6] WARN o.h.e.jdbc.spi.SqlExceptionHelper - SQL Error: -104, SQLState: 23505
2018-11-21 05:05:52,381 [pool-23-thread-6] ERROR o.h.e.jdbc.spi.SqlExceptionHelper - integrity constraint violation: unique constraint or index violation; TS_KV_UNQ_KEY table: TS_KV

Attached is the thingsboard.log file.

Any suggestions?

thingsboard.log

JPB22 on 21 Nov 2018

It looks like your system is not able to handle the generated load, so I have a couple of questions:

Which edition are you using (PE or Community Edition)?
I see that you are using SQL data storage. Is it HSQLDB or Postgres?
What instance is used (OS, CPU, RAM, system load)?

vparomskiy on 21 Nov 2018

This thread is no longer active. Closing the issue.

vparomskiy on 5 Feb 2019

I tried to find actors.rule.queue.max_size in thingsboard.yml but cannot find it. Please advise which parameter has replaced it in v3.1.1PE.