Zephyr: Bluetooth: <err> bt_att: ATT Timeout

Created on 4 Jun 2020 · 35Comments · Source: zephyrproject-rtos/zephyr

Describe the bug
Local device performing ATT write cmds, generates <wrn> bt_att: Ignoring unexpected request when peer device send an ATT request. The peer then gets <err> bt_att: ATT timeout

To Reproduce
Use the branch: https://github.com/cvinayak/zephyr/commit/b4939f526a9c1c86a88fd55810ed85e1ecaaf46d

Steps to reproduce the behavior:
Build and flash peripheral:

mkdir -p build/peripheral; cd build/peripheral
cmake -GNinja -DBOARD=nrf52dk_nrf52832 ../../samples/bluetooth/peripheral
ninja
ninja flash

Build and flash central_hr:

mkdir -p build/central_hr; cd build/central_hr
cmake -GNinja -DBOARD=nrf52dk_nrf52832 ../../samples/bluetooth/central_hr
ninja
ninja flash

Open two terminals, minicom -D /dev/ttyACMx
Observe both peripheral and central_hr fail with respective warning and error messages in the console.

Expected behavior
ATT request from peer device shall not be dropped. ATT bearer shall not timeout.

Impact
showstopper

Screenshots or console output

Environment (please complete the following information):

OS: Linux
- Toolchain: Zephyr SDK 0.11.2
- Commit SHA or Version used: https://github.com/cvinayak/zephyr/commit/b4939f526a9c1c86a88fd55810ed85e1ecaaf46d

Additional context
Add any other context about the problem here.

Regression Bluetooth bug high

Source

cvinayak

Most helpful comment

@mhx yep, that is exactly what I see as well with the latest commit.

carlescufi on 17 Jun 2020

👍2

All 35 comments

I'm seeing the same thing for our application. When using bt_gatt_unsubscribe() we get "bt_att: ATT timeout" on one side and "bt_att: Ignoring unexpected request" on the other. That also means we can not subscribe again, so we need to reset the kits before subscribing. This happened after updating to v2.3.0-rc1 from v2.2.0

alexsven on 4 Jun 2020

I am seeing what I believe is the same issue.
The problem occurs when two transmissions of different types (REQUEST, CONFIRMATION, etc) on the same ATT channel occur within a very short window.
In my case at least, this occurs when the GATT server publishes an Indication, which the host schedules an acknowledgement for. If the GATT client is also performing writes to the server, two writes can go through the host in a very short window.

https://github.com/zephyrproject-rtos/zephyr/blob/e77985a2f0531a8fbefdb9ae2e5ea7c5e3dbd70f/subsys/bluetooth/host/att.c#L152
This line sets the ATT callback, chan_cfm_sent, chan_rsp_sent, etc.
However there is no protection around this setting. If chan_send is called again before bt_att_sent runs, the original callback is lost, and the callback associated with the second transmission is instead run twice.

In my situation, because the callback is lost, chan_cfm_sent is never called and thus the ATT_PENDING_CFM bit is never cleared on the flags. The next indication from the server is therefore ignored at https://github.com/zephyrproject-rtos/zephyr/blob/e77985a2f0531a8fbefdb9ae2e5ea7c5e3dbd70f/subsys/bluetooth/host/att.c#L2309

JordanYates on 6 Jun 2020

We will probably need a list/queue so that we can process the sent event in order.

Vudentz on 8 Jun 2020

@Vudentz This is looking like a regression, no-one has reported it before. Any ideas on why it would need a queue/list now when we didn't before?

joerchan on 8 Jun 2020

@Vudentz This is looking like a regression, no-one has reported it before. Any ideas on why it would need a queue/list now when we didn't before?

This was introduced with EATT support since bt_l2cap_chan_send cannot rely on setting callbacks per buffer like we used to do with bt_l2cap_send_cb, anyway I working on a fix for this.

Vudentz on 9 Jun 2020

👍1

back to v2.2.0 works for me...
Fix 26063 on v2.3.0 works for me also...

Thanks!

Counterfeiter on 9 Jun 2020

With the propopsed fix from 26063 and the steps in this bug-report:

The warning is gone: <wrn> bt_att: Ignoring unexpected request when peer device send an ATT request.

The error is still there: <err> bt_att: ATT timeout

joerchan on 9 Jun 2020

@JordanYates and @Counterfeiter Thanks for reporting issue and status. I have created a new bug-report for the problem that you guys saw. Which appears to be a different issue than the ATT timeout seen here.

Did you guys see an ATT timeout, and has it been fixed? Remember to wait the ATT timeout, which is 30 seconds.

joerchan on 9 Jun 2020

@joerchan I may be wrong, but I believe it is the same issue, or rather, that this issue is caused by the lost callbacks.
When the <wrn> bt_att: Ignoring unexpected request occurs, no response is sent back to peer device. This will cause ATT timeouts.
Either way, the linked pull request will resolve my issue.

JordanYates on 9 Jun 2020

@JordanYates They are definitely two separate issues. Your issue may result in ATT timeout as well.
Can you check if your issue results in ATT timeout and update the new issue I created if that is the case?

joerchan on 9 Jun 2020

@Vudentz

The issue is caused by ATT not sending the responses to Service discovery requests.
In vinayaks setup the peripheral is sending write-requests as fast as possible, while the central is doing device-discovery and sending notifications as fast as possible.
In this case the service discovery requests by the central is what is timing out. This is because the peripheral fails to queue the response packets, and simply drop them.
This occurs here:

static u8_t att_find_type_rsp(struct bt_att_chan *chan, u16_t start_handle,
                  u16_t end_handle, const void *value,
                  u8_t value_len)
{
    /* ... */

    int send_err = bt_att_chan_send(chan, data.buf, chan_rsp_sent);
    if (send_err) {
        BT_WARN("%d send_err %d", __LINE__, send_err);
    }

    return 0;
}

The error code is -11, which is it appears is coming from the ATT_PENDING_SENT now.
But could also be the tx_sem.

joerchan on 9 Jun 2020

Not sure if relevant but I see similar behavior on 2.2.1 when connecting to a peripheral using nrfconnect:

[00:24:57.234,680] <wrn> bt_ctlr_hci: type: 0x14
[00:24:57.332,244] <wrn> bt_att: Unhandled ATT code 0x1d
[00:24:57.430,297] <wrn> bt_att: Ignoring unexpected request

Then after 30 seconds the peripheral disconnects (I suspect the ATT time-out).

mfiumara on 9 Jun 2020

FWIW, I've had a very similar issue after updating my repo yesterday (after about two months). I'm now getting "bt_att: Ignoring unexpected request" almost every time when sequentially enabling notifications for all attributes from within the nRF Connect app. I can't remember seeing that problem before the update, and now I can definitely reproduce it. Disabling CONFIG_BT_ATT_ENFORCE_FLOW seems to help. I've done a bisect given that it reproduces quite well and ended up with commit f4192bda26, which does look somewhat related (i.e. lots of changes to att.c and touches code around the warning).

mhx on 10 Jun 2020

I did apply @Vudentz fix (e9ef65c), and while it does make the warning disappear, it still causes timeouts in nRF Connect for me while enabling notifications. Happy to provide further information if you can let me know what you need.

mhx on 10 Jun 2020

@mhx Thank you for letting us know. Your bisecting matches what others have reported.
Did you try with both fixes applied? If you see in https://github.com/zephyrproject-rtos/zephyr/pull/26063 there are two commits, you should also try https://github.com/zephyrproject-rtos/zephyr/pull/26063/commits/e68cb89680f30c484903a540519a4776aa64a29c

If you still see issues with both commits the best would be have a precise description on how to reproduce the issue that you are seeing.

joerchan on 10 Jun 2020

@joerchan, yes, I do still see it with both fixes applied. I'm still pretty new to the world of Zephyr & BLE, but I'll try my best to describe what I'm doing as accurately as possible.
For context, I've built a very hacky proof-of-concept based on samples/bluetooth/peripheral, pulling in code from other examples. The peripheral exposes the battery service and several custom services. It includes the button and LED services from st_ble_sensor, and another service that represents a light meter. The latter has two attributes representing light intensity for two different spectral ranges. The update interval of these two attributes can be as low as 100ms.
I don't have any central application yet, so I'm simply using "nRF Connect" on MacOS with an nRF52840 dongle. I connect to the peripheral, provide the passkey, then update the connection interval from 7.5ms to 10ms. I assumed that 10ms would be low enough given the light meter attribute update interval. In the past, I've successfully used 50ms.
There are 4 attributes that support notifications: the battery level, the button state, and the two light meter readings. I enable notifications in the nRF Connect UI sequentially, and typically by the time I enable them for the final attribute, attribute updates just stop showing up in the UI. Sometimes it takes two rounds of enabling/disabling notifications, but at some point, update stop and after half a minute there's a GATT timeout error shown in nRF Connect. From the POV of the peripheral, everything seems to be fine, i.e. there are typically no warnings. Also, I can always see via the serial logging on the peripheral that the notifications for the last attribute have actually been enabled. It seems to me that it's the lack of a confirmation towards nRF Connect that's causing the problem.
I can attach logs from both nRF Connect and from the Zephyr peripheral if that helps. If there's any kernel config options that I can turn on that will make the logs more useful, please let me know.

mhx on 10 Jun 2020

Okay, I don't know what I did earlier, but I just tried to reproduce this again with the two fixes applied and using a clean instance of nRF Connect to get some log files. I couldn't reproduce the exact same issue so far. I still eventually managed to get the peripheral into a state where it stopped sending notification updates and from within that state, when disabling notifications, I managed to reproduce the GATT timeout. This was with the connection interval set to 50ms; with intervals of 10ms and 25ms I couldn't get it to fail at all.
I also tried to reproduce this with the repo reset to d148f8648b (just before the EATT commit) and wasn't able to (also with a connection interval of 50ms).
So the fixes definitely make some (the?) issue a lot harder to reproduce.
I've attached the log files for both Zephyr (with CONFIG_BT_DEBUG_ATT=y) and for nRF Connect from the reproduction run described above.

mhx on 10 Jun 2020

I've tried again, rebasing #26063 onto the latest master, just to be sure, and this time immediately got it to reproduce (albeit still with the 50ms connection interval, but I've been using that for the last month without any issues):

nRF Connect: 2020-06-11T07_04_33.079Z-log.txt
Zephyr: 2020-06-11T07_04_33.079Z-zephyr.log

Procedure was as decribed above:

Connect to peripheral
Enter passkey
Update connection interval to 50ms
Sequentially enable notifications for all 4 attributes
Upon enabling the last attribute's notifications (seen in the Zephyr log in line [00:00:45.571,411] <inf> main: Notification enabled), updates from the peripheral stop and after half a minute, there's a GATT timeout in nRF Connect
A disconnect seems to be the only way to interact with the peripheral

mhx on 11 Jun 2020

@mhx The procedure you described, is this with the peripheral sample without modifications? If you have modifications would you be able to share these so that we can reproduce the same scenario as you?

joerchan on 12 Jun 2020

I still facing problems with v2.3.0 and Fix 26063, if I disable the logging.

@joerchan The problem to reproduce this seems to be the timing. I have to disable the whole logging to see connection problems with some iOS Apps.

But I am very new to zephyr, so I haven't the deeper understanding to help you out.

Counterfeiter on 12 Jun 2020

@mhx and @Counterfeiter Yeah, I'm also seeing issues. Can you enable CONFIG_ASSERT=y, see if we all get ASSERTION FAIL [buf] @ WEST_TOPDIR/zephyr/subsys/net/buf.c:574

joerchan on 12 Jun 2020

@mfiumara If I'm reading that stack-trace right it appears the ASSERT came from zephyr_i2c_read and then got a usage-fault in the assert_post_action handler.
The fact that you get a similar ATT timeout on 2.2.0 must mean that it is a different problem, because this one is a regression since 2.2.0.

Can you create a separate bug-report for it?

joerchan on 12 Jun 2020

👍1

@mhx The procedure you described, is this with the peripheral sample without modifications? If you have modifications would you be able to share these so that we can reproduce the same scenario as you?

@joerchan No, the sample has quite a few modifications and it's rather specific to the custom board it's running on (e.g. attribute updates are triggered by the light sensor interrupts). I could try to narrow it down to a smaller, board independent piece of code, but it's unlikely that I'll find the time to do this in the next two weeks unfortunately.

I'll definitely try CONFIG_ASSERT=y, though.

mhx on 14 Jun 2020

No luck with CONFIG_ASSERT=y. I can still reproduce the problem, but there are no assertions in the log.

mhx on 14 Jun 2020

[00:02:03.218,658] [1;31m bt_conn: not connected![0m
[00:02:03.218,688] [0m bt_att.chan_send: code 0x1b[0m

This has nothing to do with the initial issue it seems, it appears to get disconnected in the meantime.

Vudentz on 15 Jun 2020

[00:02:03.218,658] �[1;31m bt_conn: not connected!�[0m
[00:02:03.218,688] �[0m bt_att.chan_send: code 0x1b�[0m

This has nothing to do with the initial issue it seems, it appears to get disconnected in the meantime.

That was a deliberate disconnect I triggered from the nRF Connect app. Before the disconnect, nothing happened for more than one minute:

[00:00:45.571,411] [0m<dbg> bt_att.chan_send: code 0x13[0m
[00:02:03.218,627] [0m<dbg> bt_att.att_sent: conn 0x2000135c chan 0x20003084[0m

mhx on 16 Jun 2020

Okay, I actually have a repro now with just very basic changes to the code in samples/bluetooth/peripheral that at least for me produces the issue 100% of the time. I've also confirmed that the code works fine with zephyr @ d148f8648b.

The change I've made is that, in order to simulate the interrupt-driven updates in my application, I've added a timer that triggers interrupt_worker. interrupt_worker then performs the attribute updates and re-arms the timer to trigger again after 100ms. This is pretty much what's happening in my application, only that instead of a timer trigger, I'm using an external interrupt that triggers after 100ms.

The issue appears to be that the code ends up in a state where interrupt_worker actually gets stuck while performing the attribute updates. I don't actually know where exactly it gets stuck, though, as I can't find a call stack that includes the interrupt_worker (probably gdb doesn't understand zephyr threads). But it can clearly be seen in the console that interrupt_worker is called, but never returns.

Here's the procedure to reproduce the failure:

apply this diff to samples/bluetooth/peripheral and flash to peripheral; I've repro'd this with an adafruit_feather_nrf52840 board
connect to peripheral from nRF Connect
update connection interval to 50ms
enable notifications for Battery Level, Current Time and Heart Rate Measurement
usually at this point, the console logging will stop and after half a minute, nRF Connect will show a Gatt timeout

mhx on 17 Jun 2020

Okay, I actually have a repro now with just very basic changes to the code in samples/bluetooth/peripheral that at least for me produces the issue 100% of the time. I've also confirmed that the code works fine with zephyr @ d148f86.

The change I've made is that, in order to simulate the interrupt-driven updates in my application, I've added a timer that triggers interrupt_worker. interrupt_worker then performs the attribute updates and re-arms the timer to trigger again after 100ms. This is pretty much what's happening in my application, only that instead of a timer trigger, I'm using an external interrupt that triggers after 100ms.

The issue appears to be that the code ends up in a state where interrupt_worker actually gets stuck while performing the attribute updates. I don't actually know where exactly it gets stuck, though, as I can't find a call stack that includes the interrupt_worker (probably gdb doesn't understand zephyr threads). But it can clearly be seen in the console that interrupt_worker is called, but never returns.

Here's the procedure to reproduce the failure:

apply this diff to samples/bluetooth/peripheral and flash to peripheral; I've repro'd this with an adafruit_feather_nrf52840 board

connect to peripheral from nRF Connect

update connection interval to 50ms

enable notifications for Battery Level, Current Time and Heart Rate Measurement

usually at this point, the console logging will stop and after half a minute, nRF Connect will show a Gatt timeout

Does https://github.com/zephyrproject-rtos/zephyr/pull/26063 fix the issue @mhx?

carlescufi on 17 Jun 2020

Does #26063 fix the issue @mhx?

Ah, I didn't see there was an additional change added to #26063 recently. No, this new change actually makes things a lot worse. When connecting to the peripheral from nRF Connect, it never gets to the point where I actually see the attribute tree. Without any interaction, I'm getting a Gatt timeout after half a minute. I'm seeing

[00:00:03.662,231] <wrn> bt_att: Ignoring unexpected request

in the log when attempting to connect.

mhx on 17 Jun 2020

@mhx yep, that is exactly what I see as well with the latest commit.

carlescufi on 17 Jun 2020

👍2

Ive update the set, it is now possible to flood the stack with write without response like before and as a bonus I adds support for doing it using the shell.

Vudentz on 18 Jun 2020

❤1

Ive update the set, it is now possible to flood the stack with write without response like before and as a bonus I adds support for doing it using the shell.

This looks really good! I'm unable to reproduce the problem with the latest changes. I saw a bus fault once, but can't reproduce it:

[00:00:38.866,516] <inf> hrs: HRS notifications enabled
[00:00:38.922,058] <err> bt_conn: Unable to allocate TX context
[00:00:38.924,499] <err> os: ***** BUS FAULT *****
[00:00:38.924,499] <err> os:   Imprecise data bus error
[00:00:38.924,499] <err> os: r0/a1:  0xfffffffc  r1/a2:  0x0000000c  r2/a3:  0x00000008
[00:00:38.924,499] <err> os: r3/a4:  0x20005a60 r12/ip:  0x00000000 r14/lr:  0x0001f11b
[00:00:38.924,530] <err> os:  xpsr:  0x21000000
[00:00:38.924,530] <err> os: Faulting instruction address (r15/pc): 0x0001f120
[00:00:38.924,530] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:38.924,530] <err> os: Current thread: 0x20002794 (unknown)
[00:00:38.994,171] <err> os: Halting system

mhx on 18 Jun 2020

@Vudentz It's looking good now. Throughput sample is working well. Also tested the fixes on top of vinayaks original bug-report branch (https://github.com/joerchan/zephyr/tree/att-timeout) and it is also working without issues now.

joerchan on 18 Jun 2020

This looks really good! I'm unable to reproduce the problem with the latest changes. I saw a bus fault once, but can't reproduce it:

No, that doesn't seem related. We have a couple of deadlocks that need fixing (they have PRs) which might fix the issue you are seeing.

carlescufi on 18 Jun 2020

👍1

This looks really good! I'm unable to reproduce the problem with the latest changes. I saw a bus fault once, but can't reproduce it:

No, that doesn't seem related. We have a couple of deadlocks that need fixing (they have PRs) which might fix the issue you are seeing.

Yep, don't think it's related. The new code has been running for hours now without any issues.

mhx on 18 Jun 2020

Was this page helpful?

0 / 5 - 0 ratings