Gluon: autoupdater manifest wget running "forever"

Created on 20 Dec 2016 · 11 Comments · Source: freifunk-gluon/gluon

A few times I have noticed that the wget process started by the autoupdater seems to keep running endlessly.
This was observed today on the v2016.2.x branch (some commit from November); I first observed it at the beginning of October.
According to @NeoRaider, there seems to be no possible bugfix other than switching to an autoupdater in C.

bug

All 11 comments

I found three nodes with uptimes of around 4 weeks; all of them stopped requesting manifests from the update servers around the end of November (± 7 days).

I don't have direct access to such a node yet, but it sounds likely that we in Darmstadt are also affected.

All affected nodes that I could find are running at least 2016.2.0.

While all Freifunk Bremen testing nodes migrated nicely from 2016.1.6 to 2016.2.1, not all did so with 2016.2.2.
~10 out of 85 did not get the update. All of them had been running for longer periods of time.
Unfortunately we don't have access to any of the nodes concerned, but I monitored them from afar. Two of them have restarted recently and are now up to date. It seems we've run into the same bug.

What is wget actually doing while hanging? Recently I observed wget hanging on MTU issues in a different setting.
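
Whatever the child process is stuck on, one general way to keep it from running forever is to give it a hard deadline. The following is a minimal Python sketch of that idea only; it is not the autoupdater's code and not necessarily what any fix in this thread does. The helper name `run_with_timeout` is hypothetical.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising,
        # so no stale process is left behind.
        return None
```

A caller would treat `None` as "download stalled" and retry on the next scheduled run instead of waiting indefinitely.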

I actually think that this new bug is a regression and that the previous bug #582 was not as bad. Devices not updating at all is far worse than them sometimes crashing. This way we totally lose control over the software of some nodes, even if autoupdate is set to true.

I propose reversing https://github.com/freifunk-gluon/packages/pull/147 until the C rewrite of the autoupdater (https://github.com/freifunk-gluon/packages/tree/c-autoupdater) is finished.

I've pushed an update that should fix the issue in https://github.com/freifunk-gluon/packages/commit/485186ace270564393b464bb4ff07686df31cd96; please test, and I'll backport it to v2016.2.x.

Any ideas on reproducing the issue intentionally?
Or should we just let some devices with the fix run for a longer period of time?
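
One possible way to provoke a client-side hang in a lab is a server that accepts the connection and then never answers. Whether that matches the real cause is an assumption, not something established here. A minimal Python sketch, where the function name `start_stalling_server` is hypothetical:

```python
import socket
import threading


def start_stalling_server(host="127.0.0.1", port=0):
    """Accept TCP connections on host:port and never send a byte back.

    Returns the listening socket and the port actually bound
    (port=0 lets the OS pick a free one).
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen()
    held = []  # keep accepted sockets referenced so they stay open

    def accept_forever():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return srv, srv.getsockname()[1]
```

Pointing an HTTP client without its own timeout at that port should leave it waiting for a response for as long as the server process lives.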

Has anyone tested this already?

The fix is now also backported to v2016.2.x. As nobody reported problems, I consider this fixed and will release v2016.2.3 soon.

I think the problem is that nobody was able to test if it fixes the problem because of its Heisenbug nature.

Could it be that this is the old problem of the autoupdater running twice on very slow connections?

@rubo77, yes, that is the issue. It is fixed now.
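
For illustration, a common guard against two overlapping runs of the same job is an exclusive lock file. This is a minimal Python sketch of that general technique, not the code from the fix linked above; `try_lock` and the lock path are hypothetical, and `fcntl` is only available on Unix-like systems.

```python
import fcntl
import os


def try_lock(path):
    """Return an open fd holding an exclusive lock on path, or None if it is already locked."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the holder.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A second run that gets `None` back would simply exit; the lock is released when the holder closes the descriptor or its process ends.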
