Zephyr: Bluetooth: host: bt_conn_create_le sometimes fails to stop pre-scan before connecting

Created on 13 Nov 2019  路  26Comments  路  Source: zephyrproject-rtos/zephyr

Describe the bug
The host API function bt_conn_create_le uses the HCI_LE_Create_Connection command to make the controller scan for and connect to an advertiser. However, before doing that, it appears that bt_conn_create_le is doing a pre-scan to find the advertiser, and this pre-scan sometimes appears to be terminated too late, i.e. the host sends the HCI_LE_Create_Connection command _before_ it ends the scan. This causes the controller to reject the HCI_LE_Create_Connection command with a Command Disallowed error. This is seen quite frequently when using an nRF52840 dongle with the Zephyr host on top as a master device.

To Reproduce
Steps to reproduce the behavior:

  1. From the application layer scan for an advertiser and stop scanning when found.
  2. From the application layer call bt_conn_create_le to connect to the advertiser.
    At this point the controller will do the pre-scan (requested by the host) and sometimes report a Command Disallowed error when trying to connect.

Expected behavior
The pre-scan done by bt_conn_create_le should end before the connecting starts, so it doesn't cause the connect to fail.

Impact
We are currently using this BLE master functionality for internal testing only, and we can work around it by adding retries, so the impact is not big yet. But very soon we will start using it in real products where these additional connection latencies may annoy the end user.

Screenshots or console output
None.

Environment (please complete the following information):

  • Zephyr release 2.0 (host running on Linux machine, controller running on nRF52840 dongle), or 2.1 release candidate.

Additional context
None.

Bluetooth bug low

All 26 comments

the pre-scan done by the host sometimes appears to be terminated too late, i.e. the host sends the HCI_LE_Create_Connection command before it ends the scan

That sounds very strange. If you look at the implementation of the check_pending_conn() function in subsys/bluetooth/host/hci_core.c you'll see that it calls set_le_scan_enable(BT_HCI_LE_SCAN_DISABLE) before calling hci_le_create_conn(conn). These functions will queue up HCI commands in the order that they are called, so it should be impossible for the order to be the wrong way around. Do you have any HCI logs to corroborate this? If this is really happening then it is indeed a bug, but it'd be good to split it into a separate issue - the remaining things you describe are as designed currently so that's more of a feature request/enhancement than a bug.

This is seen quite frequently when using an nRF52840 dongle with the Zephyr host on top as a master device.

Can you also specify what your setup looks like? Is this a 2-chip solution or are you running both Host and Controller in the same nRF52840?

This pre-scan doesn't seem necessary, because the application layer will typically already have scanned for available advertisers, and, as mentioned, the controller will subsequently scan again to establish the actual connection.

Not completely unecessary. In cases with host-based privacy this stage is needed.
So making this stage optional in some cases would be an improvement. But it is wrong to call it a bug.

Edit:
Sorry I was a bit too quick here.
If it is a indeed an issue with stopping the scanner and starting the initiator. Then that would be a bug. But should be fixed independently. Can we maybe split this into two issues in that case?

@jhedberg "Zephyr release 2.0 (host running on Linux machine, controller running on nRF52840 dongle), or 2.1 release candidate."
Is this the USB issue where they are on different endpoints? Maybe we don't wait for the command complete for disabling the scanner?

@jhedberg, we'll try to get a log.
@carlescufi, I've added more details about the environment.
@joerchan, so if the extra scan is for host privacy, it should go away if controller privacy is enabled?

Debugging this issue I found that the order of bt_hci_cmd_send_sync... and atomic_set_bit_to...in set_le_scan_enable seems to be relevant. If I switch order, the problem seems to go away (pending more testing).

joerchan, so if the extra scan is for host privacy, it should go away if controller privacy is enabled?

@wdhpawa They way privacy is implemented now we switch to host-based privacy when the controller resolving list is full. So the extra scan is needed if the controller does not support controller privacy, or the controller resolving list is full and we switched to host based privacy.

@joerchan, I see, and I agree that skipping the scan in some cases should be a separate enhancement if the command disallowed problem can be solved independently.

@carlescufi, I've added more details about the environment.

@wdhpawa I still don't get exactly the setup you are using.

What Host are you using? Zephyr's? If so, how are you compiling it? with the native_posix board? With QEMU?

Regarding the controller, which board exactly are you building for, and what transport are you using, UART or USB?

@jhedberg "Zephyr release 2.0 (host running on Linux machine, controller running on nRF52840 dongle), or 2.1 release candidate."
Is this the USB issue where they are on different endpoints? Maybe we don't wait for the command complete for disabling the scanner?

That would be a data vs commands/events race, but not a sequence of HCI commands. I don't think this is the issue.

@carlescufi It is a Zephyr native posix application. We are using two different controller boards. One nRF52840 (PCA10059), and a PTS dongle from Blouetooth.org.

@carlescufi It is a Zephyr native posix application. We are using two different controller boards. One nRF52840 (PCA10059), and a PTS dongle from Blouetooth.org.

Can we please get a full HCI trace using btmon? With this setup it should work out of the box.

@ChristofferSchroeder
Using this combination you should be able to get a full HCI log using btmon. Please attach it to the issue so we can check it out.

I've split the pre-scan removal into a separate issue #20698 as suggested.

@carlescufi Please see this log.
cmd_disallowed.txt

@ChristofferSchroeder Thank you. Will look into it.

Please see this log. cmd_disallowed.txt

Thanks @ChristofferSchroeder . The log really does show HCI_LE_Create_Connection before HCI_LE_Set_Scan_Enable(Disabled), which is really remarkable. Is this really an unmodified Zephyr host stack you're using? Which version? In latest master the reacting to a connectable advertising event that we want to connect to goes through the following calls in hci_core.c before reaching the point where hci_le_create_conn(conn) is called:

le_adv_report() -> check_pending_conn() -> set_le_scan_enable(BT_HCI_LE_SCAN_DISABLE) -> bt_hci_cmd_send_sync()

Note that the _sync() variant of the command sending is used, i.e. the HCI_LE_Create_Connection shouldn't even be queued to be sent before the Command Complete for the HCI_LE_Set_Scan_Enable comes. So something is really strange here.

My theory is that the processing of le_adv_report handler is done before the
thread that called bt_conn_create_le was woken up to process the command_complete event
for the issued LE Set Scan Enable.
This means that we have an inconsistent state in the host where we don't know that
we are currently scanning, but is receiving advertising reports nonetheless.
This happens because we haven't set the BT_DEV_SCANNING bt_dev.flags yet, as
this is done by the thread that called bt_conn_create_le after having processed the Command Complete Event.

The host then runs event handlers for the advertising reports. Since the host
thinks scanning is not enabled it doesn't disable it, and instead issues the
LE Create Connection command directly, which leads to command disallowed.

It might appear from the log that the LE Set Scan Enable to disable the scanner
appears after the LE Create Connection command, but this is because the
error handling of check_pending_conn will disable the scanner.

Relevant code:

Scanning state not set, so we skip disabling the scanner.

bt_conn_create_le -> set_le_scan_enable:
    err = bt_hci_cmd_send_sync(BT_HCI_OP_LE_SET_SCAN_ENABLE, buf, NULL);
    if (err) {
        return err;
    }

    atomic_set_bit_to(bt_dev.flags, BT_DEV_SCANNING,
              enable == BT_HCI_LE_SCAN_ENABLE);

Starting initiator failed, so we disable scanner

le_adv_report (Event handler) -> check_pending_conn:

    if (atomic_test_bit(bt_dev.flags, BT_DEV_SCANNING) &&
        set_le_scan_enable(BT_HCI_LE_SCAN_DISABLE)) {
        goto failed;
    }

    bt_addr_le_copy(&conn->le.resp_addr, addr);
    if (hci_le_create_conn(conn)) {
        goto failed;
    }
...
failed:
    conn->err = BT_HCI_ERR_UNSPECIFIED;
    bt_conn_set_state(conn, BT_CONN_DISCONNECTED);
    bt_conn_unref(conn);
    bt_le_scan_update(false);

Given my theory then this comment makes sense:

Debugging this issue I found that the order of bt_hci_cmd_send_sync... and atomic_set_bit_to...in set_le_scan_enable seems to be relevant. If I switch order, the problem seems to go away (pending more testing).

@jhedberg I am on the master branch commit c68ae690b60dd7df4b34d4ec852c9b52567c8aa1 right now, with no changes. I have seen this happening on pre-2.1 rc1 as well.
@joerchan I am not knowledgeable enough about this code to positively state if your theory is correct or not, but it seems valid. Is my observation about switching the order only a symptom fix?

@ChristofferSchroeder It fixes the case you are seeing but it seems to open up new corner cases, so I will try to make a proper fix. Can I count on you to verify the fix?

@joerchan I will be happy to verify it.

@joerchan thanks for the analysis! It sounds about right to me. One thing regarding hci_cmd_done() that we should be careful with is to not introduce too much extra processing there. This function gets called through bt_recv_prio(), and with some drivers (like h4.c) it's actually in interrupt context. Setting the critical state flags in this context to avoid races does sound like the right thing, but it should probably be limited to just setting the flag and then letting any other logic stay where it is now. We probably want this for other states as well, like advertising, even though we can only see the race with scanning currently.

@ChristofferSchroeder @wdhpawa Can any of you give the proposed fix a try, see if it solves the issue?

Thanks @joerchan. @ChristofferSchroeder will give the fix a try.

@joerchan Sorry for the long wait. Most of my tests failed on Friday which luckily turned out to be other reasons than your change. I have been running tests for some hours today and they are passing with no incidents. So it looks like your change fixes the issue :)

Was this page helpful?
0 / 5 - 0 ratings