It is annoying that the whole platform won't load if one thing is down (this mainly sucks with things like TVs).
So I am considering the following official "guideline": add a way for platforms to "register" themselves to be retried every 5 minutes if their config is not valid right now. It would still report the error, but maybe only the first time?
I think a good way to implement this would be to introduce a new type of exception (PlatformNotReady?) that platforms can raise. The entity component can then schedule another setup attempt in 5 minutes. This could easily be added here: https://github.com/home-assistant/home-assistant/blob/dev/homeassistant/helpers/entity_component.py#L140-L157
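Something along these lines, as a rough sketch (the exception name is just the one floated above, and the threading-based scheduling is purely illustrative; the real implementation would live in the entity component):

```python
# A minimal sketch of the proposed flow, assuming the new exception type
# exists; PlatformNotReady, RETRY_INTERVAL and the threading.Timer based
# scheduling are illustrative only, not actual Home Assistant code.
import threading

RETRY_INTERVAL = 5 * 60  # seconds, the 5 minutes suggested above


class PlatformNotReady(Exception):
    """Raised by a platform whose dependency isn't reachable yet."""


def setup_with_retry(setup_platform, *args):
    """Run a platform's setup and reschedule it if it isn't ready yet."""
    try:
        setup_platform(*args)
    except PlatformNotReady:
        # Report the error once, then retry quietly in the background.
        print("Platform not ready, retrying in %s seconds" % RETRY_INTERVAL)
        timer = threading.Timer(
            RETRY_INTERVAL, setup_with_retry, (setup_platform,) + args)
        timer.daemon = True
        timer.start()
```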
If a KNX gateway is not reachable (for example because connections are still held by a crashed process), it would be nice if this healed itself by retrying, without having to restart HASS another time.
So I would prefer to retry at intervals. Something like:
5sec, 10sec, 30sec, 1min, 2min, 5min, 15min, 1/2hour, 1hour, 2hours, 4hours
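As a rough illustration of what I mean (the values are just the ones listed above and are all open for discussion):

```python
# Rough illustration of such a back-off schedule; purely a sketch.
RETRY_INTERVALS = [5, 10, 30, 60, 2 * 60, 5 * 60, 15 * 60,
                   30 * 60, 3600, 2 * 3600, 4 * 3600]  # seconds


def next_retry_delay(attempt):
    """Return the delay before retry number `attempt` (0-based).

    Once the list is exhausted, keep retrying every 4 hours.
    """
    return RETRY_INTERVALS[min(attempt, len(RETRY_INTERVALS) - 1)]
```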
It would be good to let platforms with a persistent connection to a service restart themselves if the connection gets lost. The pilight component, for example, seems to stop working as soon as the pilight daemon restarts (which I have to do now and then because the connected Arduino seems to lose its sync with the daemon). The only way to fix this at the moment is to restart hass (and lose a lot of state).
Maybe we should retry on any Exception? Instead of logging "not available" as we do today, we just retry up to a limited number of times.
Backoff sounds good, but let's stick to a limit of, say, 3 tries (e.g. 2min, 3min, 5min) and then stop. The platform might really be defective.
It might cause oodles of template errors for some users, though.
@cyberjunky my view is that healing would require some more work and is not in scope of retrying setup. A "reloading" service could be built to restart KNX, for example, but we don't have an internal way to monitor/trigger this today.
@kellerza I think retrying platform init is more suitable for solving connection problems than for restarting a platform after real config errors (syntax, etc.).
I don't expect them to be solved between retries. My KNX example is something I noticed when restarting HASS manually; if I detect it, I have to restart HASS again manually to fix it. It would be nice if HASS tried to reconnect/re-init the platform a few times to solve this, though. But I must admit that my knowledge of HASS's internals is limited, as I only started to look at it a few weeks ago. The loss of state mentioned by @janLo is also my greatest drawback in restarting HASS completely, along with automations triggering upon start (I have to build all kinds of checks to prevent this from happening).
@janLo restarting a service is not something we can easily do on a generic level, since you have to remove the entities, rerun setup_platform, and you might still leave behind some old state (take hub-type components, for example). That is why certain parts (automation, customize, groups) have special restart services.
@cyberjunky it should be quite easy to add such a restart service for KNX, since it seems the KNX "server/hub" part is misbehaving in some way by not closing connections.
I have a component for Electric Imp that adds new devices whenever they appear on the server (a custom component, available in my repo), but maybe such an approach can work for KNX? (I don't own any KNX devices.)
@kellerza I will have a look at your repo and see how to improve KNX connection lost behaviour, thanks.
@kellerza The problem is that for all devices that don't "report" a state and whose state is memorized in HA (like pilight switches), the actual state is gone after a restart.
@janLo maybe #4614 will help with such states (that is the idea at least)
I think that, at least for now, this should be limited to retrying after a failure inside setup_platform. Once a platform has successfully set itself up, it needs to manage the restart of connections on its own, as @kellerza suggested.
I would propose considering an "opt-out" approach. We would modify setup_platform to return True on success, matching the component setup. (This requires changing all the platforms, but it should be relatively straightforward.) Then, unless setup_platform throws a new type of exception, PlatformConfigInvalid(?), it is scheduled for a retry if it does not return True.
My reasoning is that if the platform didn't set up successfully, there really isn't any harm in retrying, especially if we can squelch subsequent log errors. Also, I'm hard-pressed to come up with scenarios where the platform would really know for certain that a retry won't work. Invalid file permissions could be fixed, network connections could be restored, cloud service credentials could be updated on the other end, etc. The biggest reason a platform could never be set up correctly is an actual config schema violation, which is captured before setup is even attempted.
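To make the proposal concrete, here is a hypothetical platform written against it (the PlatformConfigInvalid name, the port number, and the socket check are all made up for the example):

```python
# A hypothetical platform following the opt-out proposal above; the
# exception name and the connection check are illustrative only.
import socket


class PlatformConfigInvalid(Exception):
    """Raised when no amount of retrying can make this config work."""


def setup_platform(hass, config, add_devices, discovery_info=None):
    """Return True on success so the caller knows no retry is needed."""
    host = config.get("host")
    if host is None:
        # A real schema violation: retrying won't help.
        raise PlatformConfigInvalid("'host' is required")

    try:
        # Stand-in for whatever connection the platform actually needs.
        socket.create_connection((host, 1234), timeout=5).close()
    except OSError:
        # Temporary failure (network down, service restarting, ...).
        # Anything other than True asks the entity component to retry.
        return False

    # add_devices([...]) would be called here with the real entities.
    return True
```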
I agree with @kellerza and @armills about reconnecting. That should be handled by the component and its platforms.
I like the idea of a back off mechanism and making it opt-out.
So if we get False back from setup_platform, retry. If we get None back, it means the platform hasn't been updated; we might want to consider a warning with a link where people can report the platform (in case we missed some).
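Roughly this kind of check on the component side (the function name, logging, and report URL handling are illustrative only, not the actual entity_component code):

```python
# Sketch of the component-side handling of setup_platform's return value.
import logging

_LOGGER = logging.getLogger(__name__)
REPORT_URL = "https://github.com/home-assistant/home-assistant/issues"


def handle_setup_result(platform_name, result, schedule_retry):
    """Act on whatever setup_platform returned."""
    if result is True:
        return  # set up fine, nothing to do
    if result is False:
        # Platform follows the new contract and asked for a retry.
        schedule_retry(platform_name)
    else:
        # None: platform hasn't been updated to return True/False yet.
        _LOGGER.warning(
            "Platform %s did not return a setup result; please report it at %s",
            platform_name, REPORT_URL)
```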
There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment :+1:
Still something we should probably add.
Truly necessary, especially with components like the OctoPrint support, where polling a printer that isn't on 24/7 creates 20-25k log entries in a single day. My only option at the moment is to manually edit my config and restart each time I want to use HA's notifications with my printer.