Tasmota: Connecting to load-balanced MQTT broker gives slow Sonoff Tasmota response ("Attempting Connection")

Created on 19 Feb 2018 · 9 comments · Source: arendst/Tasmota

Hello all,

Regarding issue 56 with SonWEB, I did another two days of investigation to pinpoint my issue.

A Sonoff device with the pre-compiled Tasmota 5.12.0b firmware installed connects to an MQTT broker (EMQTT or Mosquitto) through a load balancer / proxy (HAProxy), both of which run in a Docker swarm. In this setup the Sonoff connects to the broker and commands can be given over MQTT without any issues.

However ...

When this request (executed by SonWEB) is performed: http:///cm?cmnd=Status%200, the first response is fast, but all requests after it are slow. This is probably because, for some reason, right after the first request the device loses its connection to the broker and tries to reconnect. See the log below.
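For reference, the same command can be issued by hand against the device's web command API, e.g. with curl (the device IP below is the one shown in the log; adjust it to your device, and add web credentials if you have them configured):

# Full status, equivalent to the SonWEB request above
curl "http://192.168.2.201/cm?cmnd=Status%200"

# An individual status group works the same way
curl "http://192.168.2.201/cm?cmnd=Status%202"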

00:00:00 Project sonoff Sonoff (Topic sonoff-1, Fallback DVES_8BACCC, GroupTopic sonoffs) Version 5.12.0b
00:00:00 WIF: Connecting to AP1 indonesia23 in mode 11N as sonoff-1-3276...
00:00:04 WIF: Connected
00:00:04 DNS: Initialized
00:00:04 HTP: Web server active on sonoff-1-3276.local with IP address 192.168.2.201
00:00:06 MQT: Attempting connection...
00:00:06 MQT: Connected
00:00:06 MQT: tele/sonoff-1/LWT = Online (retained)
00:00:06 MQT: cmnd/sonoff-1/POWER = 
00:00:06 MQT: tele/sonoff-1/INFO1 = {"Module":"Sonoff Basic","Version":"5.12.0b","FallbackTopic":"DVES_8BACCC","GroupTopic":"sonoffs"}
00:00:06 MQT: tele/sonoff-1/INFO2 = {"WebServerMode":"Admin","Hostname":"sonoff-1-3276","IPAddress":"192.168.2.201"}
00:00:06 MQT: tele/sonoff-1/INFO3 = {"RestartReason":"Software/System restart"}
00:00:07 MQT: stat/sonoff-1/RESULT = {"POWER":"OFF"}
00:00:07 MQT: stat/sonoff-1/POWER = OFF
12:18:21 MQT: tele/sonoff-1/STATE = {"Time":"2018-02-19T12:18:21","Uptime":"0T00:00:14","Vcc":3.434,"POWER":"OFF","Wifi":{"AP":1,"SSId":"indonesia23","RSSI":54,"APMac":"30:D3:2D:70:05:4D"}}
12:19:17 MQT: stat/sonoff-1/STATUS = {"Status":{"Module":1,"FriendlyName":"Sonoff","Topic":"sonoff-1","ButtonTopic":"0","Power":0,"PowerOnState":3,"LedState":1,"SaveData":1,"SaveState":1,"ButtonRetain":0,"PowerRetain":0}}
12:19:17 MQT: stat/sonoff-1/STATUS1 = {"StatusPRM":{"Baudrate":115200,"GroupTopic":"sonoffs","OtaUrl":"http://jakarta.indonesia/tasmota/bin/sonoff_latest-minimal.bin","Uptime":"0T00:01:09","Sleep":0,"BootCount":26,"SaveCount":99,"SaveAddress":"F9000"}}
12:19:17 MQT: stat/sonoff-1/STATUS2 = {"StatusFWR":{"Version":"5.12.0b","BuildDateTime":"2018-02-16T23:29:31","Boot":31,"Core":"2_4_0","SDK":"2.1.0(deb1901)"}}
12:19:17 MQT: stat/sonoff-1/STATUS3 = {"StatusLOG":{"SerialLog":2,"WebLog":2,"SysLog":0,"LogHost":"","LogPort":514,"SSId1":"indonesia23","SSId2":"","TelePeriod":300,"SetOption":"00000009"}}
12:19:17 MQT: stat/sonoff-1/STATUS4 = {"StatusMEM":{"ProgramSize":487,"Free":516,"Heap":19,"ProgramFlashSize":1024,"FlashSize":4096,"FlashMode":3}}
12:19:17 MQT: stat/sonoff-1/STATUS5 = {"StatusNET":{"Hostname":"sonoff-1-3276","IPAddress":"192.168.2.201","Gateway":"192.168.2.254","Subnetmask":"255.255.255.0","DNSServer":"192.168.2.10","Mac":"68:C6:3A:8B:AC:CC","Webserver":2,"WifiConfig":0}}
12:19:17 MQT: stat/sonoff-1/STATUS6 = {"StatusMQT":{"MqttHost":"192.168.2.101","MqttPort":1883,"MqttClientMask":"DVES_%06X","MqttClient":"DVES_8BACCC","MqttUser":"DVES_USER","MAX_PACKET_SIZE":1000,"KEEPALIVE":15}}
12:19:17 MQT: stat/sonoff-1/STATUS7 = {"StatusTIM":{"UTC":"Mon Feb 19 11:19:17 2018","Local":"Mon Feb 19 12:19:17 2018","StartDST":"Sun Mar 25 02:00:00 2018","EndDST":"Sun Oct 28 03:00:00 2018","Timezone":1}}
12:19:17 MQT: stat/sonoff-1/STATUS10 = {"StatusSNS":{"Time":"2018-02-19T12:19:17"}}
12:19:17 MQT: stat/sonoff-1/STATUS11 = {"StatusSTS":{"Time":"2018-02-19T12:19:17","Uptime":"0T00:01:09","Vcc":3.437,"POWER":"OFF","Wifi":{"AP":1,"SSId":"indonesia23","RSSI":56,"APMac":"30:D3:2D:70:05:4D"}}}
12:19:39 MQT: Attempting connection...
12:19:39 MQT: Connected
12:19:39 MQT: tele/sonoff-1/LWT = Online (retained)
12:19:39 MQT: cmnd/sonoff-1/POWER = 

Here it takes about 22 seconds to reconnect, so every new request in the meantime fails. The device shouldn't have lost its connection in the first place, right? Note: this also happens when STATUS 0 is executed from the console.

I tested the broker via HAProxy using JMeter with a load of 250 requests/sec and the broker seems to work just fine.
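As a simpler sanity check than JMeter, the same path through HAProxy can also be exercised with the Mosquitto command-line clients; the broker address and user below are the ones from the STATUS6 output in the log, the password is a placeholder:

# Watch everything the device publishes, via the load balancer
mosquitto_sub -h 192.168.2.101 -p 1883 -u DVES_USER -P <password> -v -t 'tele/sonoff-1/#'

# Ask the device for its power state through the same path (-n sends an empty payload, which queries the state)
mosquitto_pub -h 192.168.2.101 -p 1883 -u DVES_USER -P <password> -t 'cmnd/sonoff-1/POWER' -n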

I also tested without the load balancer in between, i.e. a direct connection between the device and the broker. In this setup the device doesn't lose its connection to the broker and the average response time on STATUS 0 is about 200 ms, running for hours.

With the same Device -> LB/Proxy -> Broker setup, I've tested the individual STATUS requests ([none], 1-7, 10 and 11), which together return the same information as STATUS 0, but they do not show any "Attempting connection" messages and react perfectly, with a fast average response time.

So why does STATUS 0 through a load balancer / proxy give "Attempting connection" messages while the other status requests don't?

Hope someone can help me tackle this issue :-)

Most helpful comment

After spending lots of time investigating this issue, I finally solved it. Although I did not find the exact cause of the error, it is fixed now and I am running various Sonoff and Wemos devices, all managed by SonWEB.

My architecture consists of:

  • Docker cluster of Raspberry Pi 3s
  • EMQTT as MQTT broker in a cluster setup (Docker containers)
  • SonWEB to manage Tasmota-flashed devices (Docker container)
  • HAProxy for load balancing and proxying (Docker container)
  • Node-RED, Portainer, Grafana, InfluxDB & Chronograf, all running in Docker containers

It seems that reflashing the devices with the released tasmota-minimal build and, right after that, with a self-built (PlatformIO) image solved the issue, but I am not certain ...

Anyway, long story short: it's working fine now, which is reason enough to close this issue :)

Thanks for all the support @MatzeJoerling @ascillato @reloxx13 👍

All 9 comments

Hi.
What is a load balancer/proxy?

@ascillato a load balancer distributes incoming requests/connections to one or more back ends. In my case the idea is to have, for example, mqtt.mydomain:1883 pointing to the load balancer, which then distributes incoming requests to one or more MQTT brokers (nodes) configured as a cluster. This way more requests can be handled, but that's not the main reason.
Using multiple MQTT brokers (nodes) as back ends in a cluster gives high availability: if a broker (node) goes down, all functionality keeps working because the LB redirects incoming requests / connections to another broker (node).

[Image: mqtt-test-env-load-balancing diagram]
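Just to make the idea concrete (this is a sketch, not my actual config; names, addresses and timeouts are placeholders), a minimal HAProxy TCP front end for MQTT could look like this:

# haproxy.cfg (sketch): plain TCP pass-through for MQTT
listen mqtt
    bind *:1883
    mode tcp
    option clitcpka              # keep client TCP connections alive
    timeout client 3h            # idle timeouts well above the MQTT keepalive
    timeout server 3h
    balance leastconn            # send new connections to the least busy broker
    server emqtt1 10.0.0.11:1883 check
    server emqtt2 10.0.0.12:1883 check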

Nice, I didn't know that. Thanks.

@arendst Theo, any idea why status 0 fails?

Hi, are you sure the reconnect takes 22 sec?

12:19:39 MQT: Attempting connection...
12:19:39 MQT: Connected
12:19:39 MQT: tele/sonoff-1/LWT = Online (retained)

Looks like it connected instantly.

Is your connection sticky (on HAProxy)? Why do you count up the ports on the MQTT nodes? Docker?
Can you capture the network traffic on HAProxy with tcpdump or tshark and investigate it? Who drops the connection, sonoff-haproxy or haproxy-mqttnode? It doesn't look like a normal TCP timeout. You can post the dump and I will take a look, but be aware: your passwords will be in the dump!
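Something along these lines (the interface name is an example, adjust to your host):

# On the HAProxy host: capture both legs of the MQTT traffic
tcpdump -i eth0 -w mqtt.pcap 'tcp port 1883'

# Afterwards, look for who sends the FIN or RST that closes the connection
tshark -r mqtt.pcap -Y 'tcp.flags.fin == 1 or tcp.flags.reset == 1'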

According to Stack Overflow you can't cluster Mosquitto! What are you using?

regards M.

Thanks for your response @MatzeJoerling !!!

Well, to begin with: the image in my previous post is just to explain to @ascillato what load balancing means and does not represent my exact architecture.

In my architecture I use EMQTT, but I tested with Mosquitto as well. In the current test setup I have no cluster, just one Docker container with EMQTT installed. HAProxy is configured to direct incoming connections to the only existing EMQTT broker, so Device -> HAProxy -> EMQTT, but even without HAProxy and only using Docker's built-in load balancer the problem exists. So basically it's not about clustering, it's about having an LB between the device and the broker.

You are right that Attempting connection -> Connected is instant, but between the last STATUS request (STATUS 11) and the Attempting connection the device is still slow and almost unresponsive.

And the strange thing is, the individual status requests work fast.

I'll try your suggestion with tcpdump and/or tshark to see if I can get some extra info.

Hi,
So HAProxy is just a proxy in your setup, not a load balancer. I don't know the Docker LB, but I have been using HAProxy and F5 LTM for a long time.
Any probe timeout (HAProxy checking the backend) or ...?
The dump would be really helpful to determine where the connection is dropped and why there is a ~20 sec lag before reconnecting. Any sort of timeout, up/down probe.
Does your JMeter test send real MQTT traffic or just TCP connects?

Hi @MatzeJoerling, first of all thanks for thinking along with me, I really appreciate that!

Unfortunately, I didn't succeed in getting tshark to work on the HAProxy container, because of missing rights to execute dumpcap:
tshark: Couldn't run /usr/bin/dumpcap in child process: Operation not permitted
Based on this error I googled and applied the suggestions, but it still didn't work.

That made me decide to reduce the complexity of my architecture and do some more tests. I removed HAProxy and tested two setups, with two tests per setup:

Before going into the details, here is a brief summary:

  1. Setup 1 - Test 1 - Docker Network Bridge - STATUS 0 👍
  2. Setup 1 - Test 2 - Docker Network Bridge - STATUS [none, 1-7, 10 & 11] 👍
  3. Setup 2 - Test 3 - Docker Network Overlay - STATUS 0 👎
  4. Setup 2 - Test 4 - Docker Network Overlay - STATUS [none, 1-7, 10 & 11] 👍

Why does STATUS 0 give issues and show "Attempting connection..." when the overlay network is used, while the individual requests don't?

Setup 1 - Docker Network Bridge

One Docker container with the EMQTT broker, using the network bridge driver (explained here)

The bridge driver creates a private network internal to the host so containers on this network can communicate. External access is granted by exposing ports to containers. Docker secures the network by managing rules that block connectivity between different Docker networks.

docker network create -d bridge mybridge
docker run -it -p18083:18083 -p1883:1883 -e EMQ_WAIT_TIME=60 --name emqt-test raymondmm/rpi-emqt:test

This creates the private network (172.19.0.0/16) and deploys the EMQTT container on it (172.19.0.2). The external network is 192.168.2.x, and EMQTT is served on port 1883, with port 18083 for its web GUI.

Test 1 (passed):
The Sonoff Tasmota connects to the broker using 192.168.2.101:1883 and STATUS 0 responses are fast and stable.

  • http:///cm?cmnd=Status 0

Test 2 (passed):
Same goes for the individual commands STATUS (without parameter), STATUS 1-7 and 10 and 11:

  • http:///cm?cmnd=Status
  • http:///cm?cmnd=Status 1
  • http:///cm?cmnd=Status 2
  • http:///cm?cmnd=Status 3
  • http:///cm?cmnd=Status 4
  • http:///cm?cmnd=Status 5
  • http:///cm?cmnd=Status 6
  • http:///cm?cmnd=Status 7
  • http:///cm?cmnd=Status 10
  • http:///cm?cmnd=Status 11

During these tests the log message "Attempting connection..." never shows up.
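For anyone reproducing these tests, the response times can be compared with a simple curl loop (the device IP is the one from the log above; purely illustrative):

# Time each status variant via the device's web command API
for s in 0 1 2 3 4 5 6 7 10 11; do
  curl -s -o /dev/null -w "Status $s: %{time_total}s\n" "http://192.168.2.201/cm?cmnd=Status%20$s"
done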

Setup 2 - Docker Network Overlay

One Docker container with the EMQTT broker, using the network overlay driver (explained here)

With the overlay driver, multi-host networks are first-class citizens inside Docker without external provisioning or components. IPAM, service discovery, multi-host connectivity, encryption, and load balancing are built right in. For control, the overlay driver uses the encrypted Swarm control plane to manage large scale clusters at low convergence times.

docker network create -d overlay --opt encrypted my-overlay
docker service create --network my-overlay -p18083:18083 -p1883:1883 -e EMQ_WAIT_TIME=60 --constraint 'node.hostname==docker1' --name emqt-test raymondmm/rpi-emqt:test

This creates the overlay network and then deploys the EMQTT container (10.0.0.8/24). It is accessible via 192.168.2.101:1883.
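To double-check where the service actually ended up, the usual Docker commands can be used (purely a verification aid):

# Confirm the task is running on docker1 and inspect the overlay network
docker service ps emqt-test
docker network inspect my-overlay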

Test 3 (failed):
The Sonoff Tasmota connects to the broker using 192.168.2.101:1883, STATUS 0 responses are slow, the log message "Attempting connection..." shows up on almost every request, and the device is unresponsive.

  • http:///cm?cmnd=Status 0

Test 4 (passed):
The individual commands STATUS (without parameter), STATUS 1-7, 10 and 11 passed.
During this test the log message "Attempting connection..." never shows up.

  • http:///cm?cmnd=Status
  • http:///cm?cmnd=Status 1
  • http:///cm?cmnd=Status 2
  • http:///cm?cmnd=Status 3
  • http:///cm?cmnd=Status 4
  • http:///cm?cmnd=Status 5
  • http:///cm?cmnd=Status 6
  • http:///cm?cmnd=Status 7
  • http:///cm?cmnd=Status 10
  • http:///cm?cmnd=Status 11

PS: apologies for the long post ..

Nice debugging, but Docker is out of my scope. BTW, overlay looks like the most scalable but also the most complex Docker networking setup, with overlay/underlay/VXLAN stuff; why don't you use the bridge network?

The permission problem with tshark/dumpcap is some sort of "you are not root or not in the wireshark group". What kind of OS do you use as the Docker host? A container OS?
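The usual fixes look something like this (paths, user and group name may differ inside your container image):

# Option 1: allow non-root capture by giving dumpcap the needed capabilities
setcap cap_net_raw,cap_net_admin+eip /usr/bin/dumpcap

# Option 2: add your user to the wireshark group and log in again
usermod -aG wireshark youruser

# In Docker, the container itself may also need these capabilities
docker run --cap-add NET_RAW --cap-add NET_ADMIN ...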

Something is dropping the TCP connection to/from MQTT! I don't know what!
In the network dump you could perhaps see who sends a TCP FIN/reset to close the connection, or why it runs into a timeout. But 20 sec is an unusual timeout for TCP; it could also be some VXLAN / spanning tree / layer-2 network stuff.

You could also turn on EMQTT debugging, as described here: https://github.com/emqtt/emqttd/wiki/Get-Started

Hard to debug, sorry.

After spending lots of time investigating this issue, I finally solved it. Although I did not find the exact cause of the error, it is fixed now and I am running various Sonoff and Wemos devices, all managed by SonWEB.

My architecture consists of:

  • Docker cluster of Raspberry Pi 3s
  • EMQTT as MQTT broker in a cluster setup (Docker containers)
  • SonWEB to manage Tasmota-flashed devices (Docker container)
  • HAProxy for load balancing and proxying (Docker container)
  • Node-RED, Portainer, Grafana, InfluxDB & Chronograf, all running in Docker containers

It seems that reflashing the devices with the released tasmota-minimal build and, right after that, with a self-built (PlatformIO) image solved the issue, but I am not certain ...

Anyway, long story short: it's working fine now, which is reason enough to close this issue :)

Thanks for all the support @MatzeJoerling @ascillato @reloxx13 👍
