Gluon: Nondeterministic image breakage [observed on TL-WR841N v5]

Created on 29 Feb 2016 · 15 Comments · Source: freifunk-gluon/gluon

We've observed that, on rare occasions, Gluon produces images that almost always fail to boot.

Our analysis has yielded the following results so far:

  • The kernel will always hang after the message "console [ttyS0] disabled"
  • The production of broken images is nondeterministic: the same source code may produce both working and broken images when compiled multiple times
  • The only differences between working and broken kernels are timestamps in the uname line and the initcpio (compared uncompressed; after LZMA compression, the kernel images differ almost everywhere)
  • We suspect an issue in the LZMA loader, as the kernel image itself seems to be fine when uncompressed on the build machine
  • We are currently verifying whether a backport of http://git.openwrt.org/?p=openwrt.git;a=commit;h=4765fe077fef2b281ef8f4607be75793e8372f59 fixes the issue
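
The comparison of uncompressed working and broken kernels described above could be done along these lines. This is only a minimal sketch, not the tooling actually used for the analysis: the function names `differing_ranges` and `compare_files` are hypothetical, and both kernels are assumed to have been decompressed to plain files beforehand.

```python
from typing import List, Tuple


def differing_ranges(a: bytes, b: bytes) -> List[Tuple[int, int]]:
    """Return half-open (start, end) byte ranges in which a and b differ.

    If the inputs have different lengths, the extra tail of the longer
    one is reported as one additional range.
    """
    common = min(len(a), len(b))
    ranges = []
    start = None
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i  # a new differing run begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(a) != len(b):
        ranges.append((common, max(len(a), len(b))))
    return ranges


def compare_files(path_a: str, path_b: str) -> List[Tuple[int, int]]:
    """Read two files (hypothetical paths) and report where they differ."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return differing_ranges(fa.read(), fb.read())
```

If only a handful of short ranges come back, and they fall on the uname string and the embedded timestamps, the two kernels are functionally identical, which is what points the suspicion away from the kernel image itself.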
bug upstream issue

All 15 comments

That patch did not fix the issue.

If you have a TL-WR841N/ND v5, please test if https://home.universe-factory.net/neoraider/openwrt-ar71xx-generic-tl-wr841nd-v5-squashfs-patched2.bin boots reliably for you. If it does, the observed issue might just be a hardware defect and not a software bug...

This issue has been observed on two different TL-WR841N v5, so it isn't a hardware defect.

It is unknown if any other models are affected, but it seems likely that only a few devices are affected, as we haven't gotten any similar reports for other hardware.

Boot log when the router is stuck: https://paste.linuxlounge.net/#/NzbCcIZULj50bEhtLmi_0C7zYyM!JNNQMzN9Q-ELEMJv4dJ_ep7Wrg5RDh_zWLB2ynzyXu4

The same image may or may not boot after each power cycle, without reflashing in between.
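
When power-cycling a device repeatedly, captured serial logs can be classified automatically. The following is a rough sketch that assumes each boot attempt has been saved as a separate text log; the helper names are hypothetical, and the only string taken from this issue is the "console [ttyS0] disabled" message after which the kernel hangs.

```python
from typing import Iterable

# Message after which the stuck kernel prints nothing further (from this issue).
HANG_MARKER = "console [ttyS0] disabled"


def hung_after_console_disable(log_text: str) -> bool:
    """True if the last non-empty line of the log contains the hang marker."""
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]
    return bool(lines) and HANG_MARKER in lines[-1]


def count_hangs(logs: Iterable[str]) -> int:
    """Count how many of the captured boot logs end at the hang marker."""
    return sum(1 for log in logs if hung_after_console_disable(log))
```

A log that continues past the marker is counted as a successful boot, so the capture has to run long enough for a healthy kernel to print its next message.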

Updating the kernel to 3.18.24 or newer fixes this issue. master already has 3.18.27, and v2016.1.x has just been updated. I guess v2016.1.2 is around the corner...

See https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-3.18.y&id=80e5c4ddd2fcbf99edd31a7c1379b3907cfd4f38 for the upstream fix.

It seems we need to reopen this; we're observing the same issue with kernel 3.18.24 on a 841 v7 now...

3.18.27 is affected as well.

Does the usable flash size decrease over time (with write cycles)? And if so, could that be what breaks everything?

@viisauksena, the hang always occurs in the same place. I suspect a race condition in the serial/console driver, but it is very hard to debug, as minor timing differences (like adding debug code) make the issue disappear. I've just ordered a JTAG adapter which might help me debug this...

I've pushed a26f78140478c9c9bdf10874ecfcb13e988250cd, which should fix this issue. Please test; v2016.1.3 will be released soon if the fix is effective.

Make sure to clean your kernel tree (make target/linux/clean or make clean) after updating to get the fix.

I have tested this patch. It seems to work on the 841N/ND v5: 20 of 20 boots were successful. Maybe we should also test it on a few other routers, such as an 841 v7 or other models known to have this problem.
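
To put a figure like "20 of 20" into perspective, here is a back-of-the-envelope sketch. It assumes every boot fails independently with the same probability, which is a simplification and not something established in this thread; the function names are hypothetical.

```python
def prob_all_boots_succeed(p_fail: float, boots: int) -> float:
    """Probability that `boots` independent boots all succeed."""
    return (1.0 - p_fail) ** boots


def max_fail_rate_consistent(boots: int, confidence: float = 0.95) -> float:
    """Largest per-boot failure probability that still yields `boots`
    consecutive successes with probability of at least 1 - confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / boots)
```

Under that assumption, 20 clean boots rule out a per-boot failure rate above roughly 0.14 at 95% confidence. That is strong evidence against an image that almost always fails to boot, but it cannot exclude a rare residual failure, which is why testing on further devices remains worthwhile.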

Any updates on this issue? Is it solved?

We believe it is solved.

Does that mean .1.3 will follow soon?
