Openhab-addons: [knx] Not reconnecting after send failure

Created on 29 Mar 2019 · 47 Comments · Source: openhab/openhab-addons




Expected Behavior

If a send failure occurs, the KNX binding should reconnect to the IP gateway and repair the connection and continue to work.

Current Behavior

When a send failure occurs, the link is closed and not opened again (or so it seems).




Steps to Reproduce (for Bugs)



Really not sure, it happened quite rarely until a few days ago when I added a new device to the bus that sends quite a bit of traffic (maybe 30 telegrams/minute).
Since then, it happened twice in a week, which I consider a lot.

Your Environment

  • Running openhab 2.4.0 with knx 2.4.0 in Docker using the official docker image.
  • KNX interface is an ABB IPX/S 3.1.1
  • KNX configuration for the knx bridge as follows:
    Bridge knx:ip:bridge [ ipAddress="10.13.37.33", portNumber=3671, type="TUNNEL", autoReconnectPeriod=10 ] { .. }

Logs

See the attachment for logs that shows that the binding closes the link due to a send error and then does not reopen it again.
knx.log

PR pending bug


All 47 comments

@cguedel,

there are more and more problems especially with ABB Interfaces and Tunneling Mode:
https://knx-user-forum.de/forum/supportforen/openhab/1282102-oh-2-3-knx2-ip-gateway-geht-immer-offline#post1282102
https://knx-user-forum.de/forum/supportforen/openhab/1282102-oh-2-3-knx2-ip-gateway-geht-immer-offline?p=1294737#post1294737
https://knx-user-forum.de/forum/supportforen/openhab/1282102-oh-2-3-knx2-ip-gateway-geht-immer-offline?p=1295404#post1295404
https://knx-user-forum.de/forum/supportforen/openhab/1282102-oh-2-3-knx2-ip-gateway-geht-immer-offline?p=1333273#post1333273

I tested two openHAB installations for weeks and could not reproduce the error. Sometimes the RAM seems to be too little for KNX binding. Could also be a memory overflow. But my analysis is very superficial and still inaccurate.

But for now:
Do you have
ABB IPR/S 3.1.1 IP-Router(ROUTER)/Interface(TUNNEL)
or
ABB IPS/S 3.1.1 Interface(TUNNEL)

If you have an ABB IPR/S 3.1.1, you can switch to the following ROUTER configuration, and your problems should be gone:

Bridge knx:ip:bridge [ 
    type="ROUTER",
    localSourceAddr="0.0.50", //does not matter
    portNumber=3671,
    readingPause=50, 
    responseTimeout=10, 
    readRetriesLimit=3, 
    autoReconnectPeriod=60
]{...}

@lewie: I have a similar problem to the one described by cguedel. However, I'm not 100% sure whether the root cause is the same as cguedel's.
However, my issue fits 100% to the issues described in the links of knx-forum.de, you referred to, especially when I look at the logs of the users there (a log of my system is attached for reference, i.e. "log.txt").
One comment to your statements, if I may: As I see it, the users in the knx-forum.de reporting this issue rather seem to have an MDT Interface (i.e. SCN-IP000.01), as I do have too (btw: latest version of the IP interface is SCN-IP000.02, i.e. the SCN-IP000.01 is an older device)

I hope this makes sense...

log.txt

In fact, there are others (like me) who use a Weinzierl 730 KNX/IP gateway, and I saw postings from users who use knxd (only the openHAB knx2 binding stops working, knxd is still online...)

To my understanding, Weinzierl 730 knx/IP Gateway and SCN-IP000.01 are the same devices (SCN-IP000.01 is a white labeled Weinzierl 730). I.e. mostly users with these devices seem to have this issue

One last comment from my side: I have only had this issue since adding a few more devices (i.e. 4 pcs) to the KNX bus (similar to the situation of cguedel). The system runs on OH 2.4.0 stable (with knx 2.4.0). Before adding the new devices, I had no breakdowns at all for months on 2.4.0. Now I have 1-2 breakdowns every day. This is really annoying since OH is hardly usable in this condition.

I hope the experts will find a solution soon to this issue.

Dear all

Short update on the above issue: After having

  1. Re-wired some parts of the KNX bus (potentially electromagnetic interference?)
  2. Replaced the SD card and set up the system from scratch (OH 2.4.0)

my system has been running stably for about 3 weeks now, i.e. no more permanent disconnects.

Maybe this helps!

Hi all,

I'm facing this issue as well. See also my post at https://community.openhab.org/t/knx-version-2-x-binding-losing-its-connection-almost-every-day/75292

I'm using a Weinzierl 771 interface

Also experiencing this problem - the openhab 1.x version was so stable and this is almost unusable.

I have found that the problem seems to occur when I enable the HomeKit add-on which I assume then creates more traffic.

The best solution for me would be if it was robust enough to reconnect after these problems.

"this is almost unusable."

I wouldn't say so - many people (including myself) are using it productively without issues.
So ideally this should be analysed and fixed in more depth by someone who experiences this issue...

@kaikreuzer sorry I will re-word - For me and the people with this problem it is almost unusable.

@grouchal ... which I am sorry to hear, but I'm afraid I personally won't be able to help...

I already said that before: I don't have this problem (with an Enertex router). But if there is someone who is willing to debug this with me I would give it a try. But I don't want a situation where I provide debug-bundles or fixes and wait for weeks for someone to test it.

I'd have to agree with @grouchal that the knx binding is almost unusable for people experiencing this bug. I've run the 1.x binding without any issue, but 2.x is terrible. What frightens me even more is that I've seen discussions on the openHAB 3.0 development talking about breaking the 1.x backwards compatibility. If that happens, and there is no solution for this bug, then I'm stuck at openHAB 2.x (or I'd have to replace my KNX IP router).

@J-N-K: Given the above, I'd be very happy to debug and test run. I'm not an IT developer so don't ask me to review or code, but more than happy to test run new test releases. Maybe @grouchal and @logos37 want to participate as well.

Also note to @grouchal: I'm not using the HomeKit add-on, so that's not the issue in my case.

I'm quite sure there will be a solution to both, running OH1 addons and also this issue. I'll provide a version with more debugging around the reconnects on Wednesday or Thursday.

I understand that this only happens with TUNNEL, is that correct?

Correct. (note that my Weinzierl does not even support ROUTER mode)

@J-N-K great to hear - I am happy to help debug, I currently have a rule that restarts the KNX bundle every hour so that I don't have this problem.

Please tell me when you have a build with more debug. - I assume it will be compatible with OH 2.4.0?

Also tell me which logging namespace I should enable to see the debug please.

Dear all

Please see my comments above (especially last one). My system runs now quite stable, so I’m afraid that I will not be of big help

@J-N-K, @grouchal, @sceppi, @cguedel,

in many versions, configurations and environments I use the KNX binding almost daily.

The KNX 2 binding is not yet perfect and has a few pitfalls, but it can run stable in almost all cases. In any case, it is similarly robust as the KNX 1.x binding was!

The following rules of conduct have led most constellations to success:

1.) As a rule, omit the querying of actuator parameters wherever possible. This feature is not required in any configuration, and it is currently not used at all to tune GAs with openHAB.

After many discussions and hundreds of tests, it turns out that some actuators do not like to be queried regularly. If such a device is on the KNX bus, its unruly behaviour disturbs the whole bus, and the openHAB KNX binding can be forced to reconnect or even give up after (usually 3) attempts. So comment out the parameters "address", "fetch" and "pingInterval" in the Thing configuration if possible! It is best to omit these parameters from the beginning.

For orientation, if you absolutely want to arrange your GAs by actuators instead of by functionality or place of use, the value can remain there. (It should be mentioned here, but basically the KNX bus is organized via the GAs and abstracted from the actuator. This is one of the core concepts of the KNX bus and of such a bus in general.)

My opinion is: querying the actuators is extremely confusing for most users, because it makes them think that the logic of the KNX bus is structured by actuator, which is wrong. Ordering by actuator is detrimental at this point because it is basically only a feature for very special applications, or something that may later be used in the background to make autodiscovery of the KNX bus possible.

2.) The connection problem is aggravated by the parameter "autoReconnectPeriod". If this parameter is set too short (<60 sec, depending on the number of addresses to be read out), the binding starts a new loop and gets tangled up, regardless of whether the previous read-out is still running or not. This unfavourable but frequently observed misconfiguration then leads to the binding breaking off and no longer connecting. (We may have to investigate this and catch it in newer versions.)

3.) Do not set "ipAddress" in ROUTER mode and only set "localIp" if two network cards are present and only if problems occur. Both are very special parameters which the normal user does not need.

I'm also writing this for the many users who might get a little too much respect for the KNX binding as a result of the discussion. The following configuration pattern should be sufficient for almost all users:

//PATTERN ROUTER DEFAULT
Bridge knx:ip:bridge [ 
    type="ROUTER",
    autoReconnectPeriod=60 //optional I would set always
] {
    Thing device knx_device "knx_device_name" @ "knx_device_group_in_paperui" [ 
        //readInterval=3600 //optional only used if reading values are present
    ] {
        ...
    }
}

//PATTERN TUNNEL DEFAULT
Bridge knx:ip:bridge [ 
    type="TUNNEL",
    ipAddress="192.168.0.111",
    autoReconnectPeriod=60 //optional I would set always
] {
    Thing device knx_device "knx_device_name" @ "knx_device_group_in_paperui" [ 
        //readInterval=3600 //optional only used if reading values are present
    ] {
        ...
    }
}

That's it, less is more!

Because the topic is close to my heart, I am of course debugging and working out improvements for further versions. If you want you can send me a private message and we will do a remote session.

@lewie I agree with what you said.

What I also found while browsing through the code: The reconnection is done with scheduleWithFixedDelay with the defined autoReconnectPeriod. I have seen ridiculously low values in some configurations (1s or so). This might lead to a situation where the next attempt is already scheduled before the old one is finished. IMO there should be a lower limit like 30s or so.
Another thing: I would schedule the next reconnect attempt only in case of failure of the first one.

WDYT?
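
To make the two ideas above concrete (a lower bound on autoReconnectPeriod, and scheduling the next attempt only after a failed one), here is a minimal Java sketch. It is not the binding's actual code: the class ReconnectSketch, its method names and the 30-second constant are hypothetical and only for illustration. The only real API used is the JDK's ScheduledExecutorService (execute and schedule) and BooleanSupplier.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper, NOT taken from the KNX binding sources.
class ReconnectSketch {
    // Assumed lower bound, based on the "30s or so" suggested above.
    static final long MIN_PERIOD_SECONDS = 30;

    // Raise unreasonably small configured values (e.g. 1s) to the minimum.
    static long clampPeriod(long configuredSeconds) {
        return Math.max(configuredSeconds, MIN_PERIOD_SECONDS);
    }

    private final ScheduledExecutorService scheduler;
    private final BooleanSupplier connectAttempt; // returns true when connected
    private final long delay;
    private final TimeUnit unit;
    private final AtomicInteger attempts = new AtomicInteger();

    ReconnectSketch(ScheduledExecutorService scheduler, BooleanSupplier connectAttempt,
            long delay, TimeUnit unit) {
        this.scheduler = scheduler;
        this.connectAttempt = connectAttempt;
        this.delay = delay;
        this.unit = unit;
    }

    // Run the first attempt right away on the scheduler thread.
    void start() {
        scheduler.execute(this::attempt);
    }

    private void attempt() {
        attempts.incrementAndGet();
        boolean connected;
        try {
            connected = connectAttempt.getAsBoolean();
        } catch (RuntimeException e) {
            connected = false; // treat any exception as a failed attempt
        }
        if (!connected) {
            // One-shot retry, queued only after this attempt has failed.
            scheduler.schedule(this::attempt, delay, unit);
        }
    }

    int attempts() {
        return attempts.get();
    }
}
```

A caller would pass something like clampPeriod(autoReconnectPeriod) as the delay with TimeUnit.SECONDS. Because each retry is queued from inside the attempt that just failed, a new attempt is only scheduled once the previous one has returned, whatever value is configured.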

@lewie: thank you for your comments, very useful! However, as a not very knowledgeable openHAB user I have one related question: As you say, the bus is organized along GAs, to which I clearly agree. But what is then the benefit of the “thing” layer introduced with OH2? Wouldn't it have been more straightforward for KNX to stay with the architecture of OH1 (with only the items, i.e. the GA layer)?

@lewie Thanks very much, this was exactly the feedback I was looking for. I had set fetch=true, pingInterval=600, etc... I've now removed all these parameters and configured my IP interface according to the 'less is more' principle.

I'll do some testing with the new configuration settings and report back on the results in a few days.

@J-N-K, I think you hit the nail on the head!
The problem is that it makes a difference whether you have a detached house (10-30s) or a concert hall with three event rooms and 1500 group addresses to read (ideally 120s).

I set these values sensibly because I am aware of the context. Yes, absolutely, these are the nuances we still have to settle in order to harden the binding. In my experience 30s is a reasonable minimum value! Everyone has to live with that. We have to accept one trade-off or the other.

As you have rightly noticed, catching the exact states is the even cleaner way. If we are able to handle them cleanly.

@logos37, no, the event core (Items) had to be separated from the external communication. The complexity at one end (Things-layer) reduces the complexity at the other end (hundreds of new coupling possibilities) by a multiple.

KNX and openHAB - These are two very different systems and concepts we want to communicate. And there will come some more things like reading ETS projects, autodiscovery and virtual actuators! :-))))

On the one hand there is the KNX bus with its relatively complex types of messages such as GroupValueWrite, GroupValueRead and its counterpart GroupValueResponse (see ETS logging). And the corresponding flags with C,R,T,W,U,I (in german K,L,Ü,S,A,I)

And on the other hand openHAB, which uses a more modern and simple way of communication to describe the event bus and its flexible universal application. openHAB brings together hundreds of different communication structures and lets them talk to each other. Essentially, openHAB describes command (something is received by openHAB) and state (an item/thing has a value). From command and state it is also possible to construct different events (which make sense, but at first make everything confusing): state event trigger, state update trigger, state change trigger and command event trigger (and much more for things etc.).

Now you try to connect the two abstractions of the communication via the KNX binding (or the Hue binding and a hundred others). The Thing is an additional abstraction layer to be able to do justice to the variety of devices and standards at all. Items alone could not achieve this, because they are too closely bound to the internal processes of the openHAB event bus - but they must remain absolutely inviolable. Things are absolutely logical, but of course not conducive to understanding at first.

The current step is to hide the complexity of abstractions like things and items from the "ordinary" user, e.g. via autodiscovery. What can I say: without Things, autodiscovery is almost unthinkable. And we want to use autodiscovery to reach more and more "normal", less knowledgeable users, not just specialists and programmers.

Believe me, it took me a long time to learn to love the Things. I loved the simple very original kind of openHAB1, only with simple text files very much - nostalgia alarm ;-) . An unsolvable dilemma, complexity versus flexibility... :-/

@sceppi, I'm looking forward to your feedback, whether you can confirm my experience or not - that would make us a bit better at experience again. :-)

Dear architects, feel free to correct and supplement what I wrote... ;-)

@lewie I'll prepare a PR for that.

@lewie: thank you very much for the detailed clarification, this is highly appreciated!

@lewie : First impression after running the knx2.0 binding for 24hrs is that this seems to have fixed it. Let's see what the next days show, but my feeling is that it's fully functional again now. Thanks a lot!
Note: would suggest to add this 'best practice' to the knx binding manual on https://www.openhab.org/addons/bindings/knx/

Update after a few days: I can confirm that this solution has fully stabilized my knx2 binding and disconnects no longer occur!

@lewie: Unfortunately I had a disconnect again this night: error is as always like that
2019-06-08 02:42:10.194 [ERROR] [net/IP Tunneling 192.168.178.26:3671] - establishing connection failed, null
And this happened although I configured the things file exactly according to your best practice guide. The only difference is that I still have my Things organized according to my KNX actuators, i.e. I have quite a few things (around 30), which typically look like this:

Thing device CO2_EG_WK [
// address="1.0.13", fetch=false, pingInterval=300, readInterval=0
]
{
Type number : EG_Wohnkueche_Heizung_Ist [ ga="<3/2/9" ]
Type number : EG_Wohnkueche_CO2 [ ga="<4/2/1" ]
Type number : EG_Wohnkueche_Feuchte [ ga="<4/2/2" ]
}

and the bridge like that:

Bridge knx:ip:bridge [
type="TUNNEL",
ipAddress="192.168.178.26",
// portNumber=3671,
// localSourceAddr="0.0.0",
// readingPause=50,
// responseTimeout=10,
// readRetriesLimit=3,
autoReconnectPeriod=60

Do you have any additional hints what still generates the disconnects?

The problem is not the disconnect (if it happens only once or twice a week), the problem is that the binding is not properly re-connecting. I have improved this process, but it's still waiting for merging.

@J-N-K: thanks for the response. As you say - i’m not the expert on this topic.
Is your improvement then available with the next snapshot (i.e. i will have to move away from my stable 2.4.0)?

@lewie: observation from my side: since commenting out most of the config parameters (see my bridge and thing config in my post above), I get more disconnects again (at least that is my impression). I just had another one this evening; the last one was on Saturday. Before commenting out, my system was stable for around three weeks (as I have written, I freshly set up the system with a new SD card, which seemed to have cured my issue). But now it is back! Very strange and counterintuitive given what you have written. I hope that the improved re-connect process from J-N-K will solve the issue. Any other ideas?

I have seen that the PR from J-N-K is merged now. Can anybody tell me how I can download it onto my system? I’m happy to test it. As i have already written my system is not useable anymore after having implemented the best practice configuration for the bridge / things, I have disconnects without reconnects every 1-2 days, really annoying...

Either use the latest snapshot of the distro or manually install the KNX binding from https://openhab.jfrog.io/openhab/libs-snapshot/org/openhab/addons/bundles/org.openhab.binding.knx/2.5.0-SNAPSHOT/ - the version from today includes the fix.

@kaikreuzer: thanks a lot! I just put org.openhab.binding.knx-2.5.0-20190616.135532-47.jar in the add-on folder, the other files are not needed. correct?

Yes, and make sure to uninstall your old version of the KNX bundle.

@kaikreuzer: thank you!

a brief status update: Since I have installed the fix I had no disconnect anymore so far, this is great (thank you J-N-K for the fix). However I still get sporadically the error
[ERROR] [net/IP Tunneling 192.168.178.26:3671] - establishing connection failed, null

Is that expected behavior, or still an issue?

The connection failed, the reason is unknown. I guess you can ignore that if it recovers by itself.

Should it be possible to use this bundle with a 2.4 openhab as well?
Or will i need a 2.5 installation?

getting the following error with 2.4:


2019-07-10 22:44:26.561 [WARN ] [org.apache.felix.fileinstall        ] - Error while starting bundle: file:/usr/share/openhab2/addons/org.openhab.binding.knx-2.5.0-20190703.181948-60.jar
org.osgi.framework.BundleException: Could not resolve module: org.openhab.binding.knx [246]
  Unresolved requirement: Import-Package: tuwien.auto.calimero; version="[2.4.0,3.0.0)"

        at org.eclipse.osgi.container.Module.start(Module.java:444) ~[?:?]
        at org.eclipse.osgi.internal.framework.EquinoxBundle.start(EquinoxBundle.java:383) ~[?:?]
        at org.apache.felix.fileinstall.internal.DirectoryWatcher.startBundle(DirectoryWatcher.java:1260) [10:org.apache.felix.fileinstall:3.6.4]
        at org.apache.felix.fileinstall.internal.DirectoryWatcher.startBundles(DirectoryWatcher.java:1233) [10:org.apache.felix.fileinstall:3.6.4]
        at org.apache.felix.fileinstall.internal.DirectoryWatcher.startAllBundles(DirectoryWatcher.java:1221) [10:org.apache.felix.fileinstall:3.6.4]
        at org.apache.felix.fileinstall.internal.DirectoryWatcher.doProcess(DirectoryWatcher.java:515) [10:org.apache.felix.fileinstall:3.6.4]
        at org.apache.felix.fileinstall.internal.DirectoryWatcher.process(DirectoryWatcher.java:365) [10:org.apache.felix.fileinstall:3.6.4]
        at org.apache.felix.fileinstall.internal.DirectoryWatcher.run(DirectoryWatcher.java:316) [10:org.apache.felix.fileinstall:3.6.4]

@shakalandy, the KNX bundle version 2.5 does not work in openHAB version 2.4.

The move of important system parts from org.eclipse.smarthome to org.openhab prevents the use of the newer bundle.

My 2.5 installation does not want to install the JAR either.

Just installing the jar won't work. Install 2.5M2, it's included there.

Since i have already installed it, that means reinstalling it?

Why would you want to re-install bundles that are already the newest version?

@J-N-K: i installed the 2.5M2 right after it was released. Since then i have this issue (i.e. no initial loading of item status at startup of OH). That’s why i’m asking

Different issue.

Sorry messed up the issues...

Just to report in: all is wonderful with this patch. I have had a 2.5 SNAPSHOT installed and running for a month without a single connection problem; previously it was happening every 4 to 12 hours.

Thanks to all who worked on this - openhab is really getting stable (again) for me.

We still have issues with missed KNX message confirmations on the Pi3+ ("response timeout waiting for confirmation"), which lead to a disconnect after some time. This is a separate issue that I have not yet figured out.
We have installed the latest 2.5 snapshot, and it does indeed reconnect (GIRA IP interface).
Thanks!
