Describe the bug
If LE encryption is not supported (like the case for VEGA controllers), and host encryption is enabled, there will still be calls to the controller for obtaining the necessary 8 bytes of random data.
This is done in two places in subsys/bluetooth/host/crypto.c, in prng_reseed() and in prng_init().
The problem is that the sequence of calls is actually a kind of infinite recursion, as per below
prng_reseed() -> BT_HCI_OP_LE_RAND -> le_rand() -> bt_rand() (host) -> prng_reseed()
The call to bt_rand() in host is done because only one implementation of the bt_rand() function (either from controller or from host) is being linked at a time.
Although I have not tested this, I suppose that this could be reproducible on Nordic as well, if setting BT_CTLR_LE_ENC_SUPPORT to n and enabling host crypto. Will test later on, when I can get access to a Nordic board.
To Reproduce
Steps to reproduce the behavior:
[00:00:00.120,000] <err> os: Exception cause Illegal instruction (2)
[00:00:00.120,000] <err> os: Faulting instruction address = 0x0d8fe860
[00:00:00.120,000] <err> os: ra: 0x832f304a gp: 0xec7e3d7e tp: 0xe25f7e36 t0: 0x306eda80
[00:00:00.120,000] <err> os: t1: 0x2c6fde6f t2: 0x30a507e6 t3: 0x6a09e667 t4: 0xbb67ae85
[00:00:00.120,000] <err> os: t5: 0x3c6ef372 t6: 0xa54ff53a a0: 0x510e527f a1: 0x9b05688c
[00:00:00.120,000] <err> os: a2: 0xb78b2e79 a3: 0xb9a2e204 a4: 0x2fb44915 a5: 0xd90c5060
[00:00:00.120,000] <err> os: a6: 0xd4839fe3 a7: 0x579a9afc
This is due to the fact that eventually the stack will run out.
Expected behavior
The controller random function should not be called when BT_CTLR_LE_ENC_SUPPORT is not enabled. This is according the the BLE spec, vol 2, part E, 3.1 LE Controller Requirements: where the HCI_LE_Rand command is noted as C4; per the note "C4: Mandatory if LE Feature (LE Encryption) is supported, otherwise excluded.".
Environment (please complete the following information):
Additional context
One quick-fix would be to re-implement prng_reseed() and prng_init() to call locally the entropy device. Another dirty fix would be to just use the kernel ticks (k_uptime_get) as the 8 bytes of 'entropy'.
The correct way for the host to test LE Encryption support in the controller is:
if (BT_FEAT_LE_ENCR(bt_dev.le.features))) {
...
}
The reasons why the host is avoiding to use a local entropy source are mainly historical: in the early days of Zephyr (3-4 years ago) most boards did not have an implementation for it, which resulted in them easily falling back to using a "dummy" PRNG which was something similar to what you suggested with k_uptime_get(). Since this is extremely insecure and we didn't want to risk anyone ending up with a product with such a security vulnerability we always just used the controller-side entropy source since, until now, all controllers we've ever used have had this available.
I'm fine with falling back to the local entropy support if the controller indicates lack of encryption support, but we have to ensure that it's not a dummy one. In any other scenario the Bluetooth init (bt_enable) should fail. In fact, if there's a reliable and secure local entropy source it's probably more efficient to use that instead of using HCI.
On a combined build CONFIG_BT_CTLR_CRYPTO is selected by default and the corresponding crypto.c file that calls into ecb.c functions.
Switching to CONFIG_BT_HOST_CRYPTO (if there is no crypto hw acceleration in controller) on a combined build results in that deadlock
@George-Stefan The controller shouldn't do calls into to the host. And currently bt_rand is used by the controller. For example to create the access address of the connection (ull_master.c).
I think what we should do is to make sure that the controller never calls bt_rand itself, but provides bt_rand as a function that the host can call.
@cvinayak @jhedberg So I would rename the implementation and all uses of bt_rand in the controlelr to bt_ctlr_rand(), and then introduce bt_rand which calls this function. That would solve the recursion problem.
Also since it appears that the controller already has a dependency on having a bt_rand function available I don't see the problem with implement the LE Rand command even though encryption is disabled.
For the host we should also handle the case where the controller does not provide the LE Rand function. That would mean that the host would possibly do a call to the entropy device. But if there is no entropy device in the host it appears that we have a problem. The host would need these random numbers for the Privacy feature. Which relies on secure random numbers. Using k_uptime would not be good enough for this.
Is there a non-secure rand function implement somewhere? We appear to use bt_rand most places, not all of those need secure random numbers I guess?
I don't see the problem with implement the LE Rand command even though encryption is disabled.
We could kind of do that.. I suppose. The reason I hesitate is that the core spec states LE Encryption as the feature which makes this command available. That said, instead of checking for supported features the host could be checking for supported commands (which is what we do when the core spec doesn't clearly specify what feature enables support for an optional command)
...or then we just pretend ignorance and keep the host code presuming that this command is always available ;)
So the use case we need to consider is a privacy-enabled host build with a controller without encryption, and where the host does not have a access to secure random numbers. Using pseudo-random numbers with privacy would mean you could track it across RPA update simply because the rand part of the RPA would be known.
@jhedberg Yeah I meant that we could include the function and set the function as supported, in which case the host would have to read the supported commands instead of features. Is there any possible drawbacks to this?
Maybe the discussion I started should be taken to it's own issue?
In any case I think what should be done to solve this problem is what I already mentioned:
So I would rename the implementation and all uses of bt_rand in the controlelr to bt_ctlr_rand(), and then introduce bt_rand which calls this function.
in which case the host would have to read the supported commands instead of features
The host is already reading the supported commands and storing them into bt_dev.supported_commands. The only "drawback" would then just be an additional branch in the code.
prng_init checks for supported command, and returns -ENOTSUP in case it is not supported. prng_reseed does not do this, in this case we should get -EIO. The caller of bt_rand will have to handle this error in that case. The host will check this error code and should not start any roles that needed privacy. Same for other users of bt_rand.
In my opinion this is good enough. We just need to fix the recursion problem.
Although I have not tested this, I suppose that this could be reproducible on Nordic as well, if setting BT_CTLR_LE_ENC_SUPPORT to n and enabling host crypto. Will test later on, when I can get access to a Nordic board.
@alexandru-porosanu-nxp Although I don't get the exact same crash as you do, setting BT_CTLR_CRYPTO=n is definitively not working on Nordic either.