Home Assistant release with the issue:
0.85.0
Last working Home Assistant release (if known):
Operating environment (Hass.io/Docker/Windows/etc.):
FreeBSD 11.2, python 3.6.7
Component/platform:
tellduslive
Description of problem:
When there is a network problem between Tellstick and Home Assistant server, the whole home assistant installation becomes very unresponsive. There are messages like this in the log file:
2019-01-20 19:05:55 WARNING (SyncWorker_0) [tellduslive] Failed request: HTTPConnectionPool(host='tellstick.stonepile.fi', port=80): Read timed out. (read timeout=10)
2019-01-20 19:06:05 WARNING (SyncWorker_0) [tellduslive] Failed request: HTTPConnectionPool(host='tellstick.stonepile.fi', port=80): Read timed out. (read timeout=10)
Problem like this shouldn't affect the whole installation, just the component in question.
Problem-relevant configuration.yaml entries and (fill out even if it seems unimportant):
tellduslive:
  host: tellstick.stonepile.fi
  update_interval: 10
Traceback (if applicable):
Additional information:
CC @fredrike
@MartinHjelmare is this due to lack of async? Would it be OK to just patch the component or does the whole lib need to be changed?
Please assign me to this so I can keep track of all the things that I have broken..
We fixed one case of I/O in event loop, in last https://github.com/home-assistant/home-assistant/pull/20023, right? That hasn't been released yet.
My first thought was that we're eating threads somehow. Eg if the request timeout in the library is equal or longer than the scan interval, and we're not limiting concurrent updates. OP has update interval at 10 seconds which is quite short. What's the timeout on the requests in the library?
Please assign me to this so I can keep track of all the things that I have broken..
Only members can be assigned issues, it seems.
The error message contains "Read timed out. (read timeout=10)". So without checking the source, one might guestt that timeout is 10 seconds ?
I have the same symptoms and the same errors in the logs. My config of telldus is:
tellduslive:
  update_interval: 5
A possible workaround until a fix is in place, is to try a longer update interval, eg 15 seconds.
Still not working in 0.86.1. I have tested to change update inervall to >60s but the problem is still there. If I remove telldus from config the unresponsive problem disapers.
Se https://community.home-assistant.io/t/odd-behavior-with-version-0-84-6/88532 for more info and logs.
@MartinHjelmare Is this due to asyncio parts that are still pending?
I don't know about the community thread. It's hard to follow. But if we spawn a new thread in the executor on every update, and a previous update isn't done, we could after a while have consumed all the threads in the pool, if the threads aren't released within short enough time.
Check the logic that does the update. What happens if the request blocks for more than the update interval?
I don't know about the community thread. It's hard to follow. But if we spawn a new thread in the executor on every update, and a previous update isn't done, we could after a while have consumed all the threads in the pool, if the threads aren't released within short enough time.
Check the logic that does the update. What happens if the request blocks for more than the update interval?
The tellduslive is using a RLock from threading. And a timeout for each of the request of 10s..
Should we abort updates if they take too long?
eg.
            with async_timeout.timeout(self._update_interval):
                  await self._hass.async_add_executor_job(self._client.update)
That RLock is the problem.
With the threading RLock, and an update interval of 5 s, we will always block 2 threads if the endpoint doesn't respond within 10 s timeout. Then each time an actuator makes a service call, another thread will block for 10 s.
But the bigger problem is that the same RLock is controlling all access to the stored state cache in the tellduslive library. So now, we're not only blocking some threads in the thread pool, but also we're blocking the event loop, for 10 seconds, whenever we try to access the state cache in tellduslive. That's what's bringing all of home assistant down.
So only adding a timeout to the client update won't solve the problem. We have to remove all access to the tellduslive state cache from inside the event loop.
Since it's hard to keep track of what parts of the library are asyncio safe and what parts are not safe, I suggest first to try and move all access of the tellduslive client and tellduslive devices into the executor. Then also implement timeouts where needed.
Generally we frown upon locks inside our thread pool, cause they can wreak havoc.
For the future it would be good to move to a tellduslive library that supports asyncio.
It's this line where the lock is used while just accessing a dictionary that holds the state cache:
https://github.com/molobrakos/tellduslive/blob/30d7c19b2cf19d77c024124fe366626e8b50500d/tellduslive.py#L285
Accessing a dictionary is considered thread safe, so I'm not sure why that lock is placed there:
https://docs.python.org/3/glossary.html#term-global-interpreter-lock
Thank you @MartinHjelmare for clearing this up!
I'm sorry but I'm not interested in spending much more time on tellduslive. I would be happy to keep it bug free so if we could patch it up to a decent state, I would be happy with that.
I'm moving in the same direction as @molobrakos (the original author of tellduslive) with a mqtt approach for tellduslive (I'm planning to update the hass doc with info to corresponding libraries).
@molobrakos, can you answer why we have a look for the state at L285?
Just a wild guess, but the update method clears the cache before getting updated values here:
https://github.com/molobrakos/tellduslive/blob/30d7c19b2cf19d77c024124fe366626e8b50500d/tellduslive.py#L337
What if we would change update logic so that it collects new state into temporary dictionary and then when all is done it would just assign this temporary dictionary to self._state ? No locking would be needed because it is simple assignment in update and dictionary access in _device(). Or am I missing something ?
@AriZuu, you are correct about this. @fredrike exactly like you did here: https://github.com/molobrakos/tellduslive/pull/19/files/ffabe1bf08609174309597a13e3e7e2ebad95e71..ba1e30211cb85b758ba5759f06e5e529d3fa098e
And then the update scheduling-stuff we talked about.
@AriZuu, It would be great if you could test the changes I've made both on https://github.com/home-assistant/home-assistant/pull/20603 and on the tellduslive lib (you can just replace the lib/python3.7/site-packages/tellduslive.py file with the one from https://github.com/molobrakos/tellduslive/pull/19).
I put the changed files into place. It seems that my version didn't have CONF_UPDATE_INTERVAL in const.py yet, but after adding also that things seem to work ok. Cannot say much about unresponsiveness (yet), but at least the changes didn't break anything.
@fredrike I just had power failure, which helped testing the situation where telldus box is not responding :-)
It seems that now Home Assistant kept working normally, it didn't freeze as before.
Most helpful comment
@fredrike I just had power failure, which helped testing the situation where telldus box is not responding :-)
It seems that now Home Assistant kept working normally, it didn't freeze as before.