Describe the bug
When CONFIG_BT_SETTINGS is enabled and the user calls settings_load(), initially bt/hash is stored as 0x0000. Later, the OS calles bt_hash_gen() and re-stores the hash (this time calculated). In combination with a flash back-end this causes excessive flash wear.
To Reproduce
Steps to reproduce the behavior:
CONFIG_BT_SETTINGS in project, call settings_load in bt_ready callback (from `bt_init)nrfjprog --memrd flash_partition_offset --n 512Expected behavior
bt/hash should not be stored immediately, but calculated once and stored once.
Impact
In combination with a flash backend causes excessive flash wear on every reboot of the device.
Screenshots or console output
memrd output:
0x0007B000: C0FFEEEE 0007FF01 2F746218 68736168 |.........bt/hash|
0x0007B010: 9243D53D 3B7FBAD2 2DD486F0 2AD9F5CC |=.C....;...-...*|
0x0007B020: 6218905A 61682F74 003D6873 00000000 |Z..bt/hash=.....|
0x0007B030: 00000000 00000000 EF000000 2F746218 |.............bt/|
0x0007B040: 68736168 9243D53D 3B7FBAD2 2DD486F0 |hash=.C....;...-|
0x0007B050: 2AD9F5CC 6218905A 61682F74 003D6873 |...*Z..bt/hash=.|
0x0007B060: 00000000 00000000 00000000 EF000000 |................|
0x0007B070: 2F746218 68736168 9243D53D 3B7FBAD2 |.bt/hash=.C....;|
0x0007B080: 2DD486F0 2AD9F5CC 6218905A 61682F74 |...-...*Z..bt/ha|
0x0007B090: 003D6873 00000000 00000000 00000000 |sh=.............|
0x0007B0A0: EF000000 2F746218 68736168 9243D53D |.....bt/hash=.C.|
0x0007B0B0: 3B7FBAD2 2DD486F0 2AD9F5CC 6218905A |...;...-...*Z..b|
0x0007B0C0: 61682F74 003D6873 00000000 00000000 |t/hash=.........|
0x0007B0D0: 00000000 EF000000 2F746218 68736168 |.........bt/hash|
0x0007B0E0: 9243D53D 3B7FBAD2 2DD486F0 2AD9F5CC |=.C....;...-...*|
0x0007B0F0: 6218905A 61682F74 003D6873 00000000 |Z..bt/hash=.....|
0x0007B100: 00000000 00000000 EF000000 2F746218 |.............bt/|
0x0007B110: 68736168 9243D53D 3B7FBAD2 2DD486F0 |hash=.C....;...-|
0x0007B120: 2AD9F5CC 6218905A 61682F74 003D6873 |...*Z..bt/hash=.|
0x0007B130: 00000000 00000000 00000000 EF000000 |................|
0x0007B140: 2F746218 68736168 9243D53D 3B7FBAD2 |.bt/hash=.C....;|
0x0007B150: 2DD486F0 2AD9F5CC 6218905A 61682F74 |...-...*Z..bt/ha|
0x0007B160: 003D6873 00000000 00000000 00000000 |sh=.............|
0x0007B170: EF000000 2F746218 68736168 9243D53D |.....bt/hash=.C.|
0x0007B180: 3B7FBAD2 2DD486F0 2AD9F5CC 6218905A |...;...-...*Z..b|
0x0007B190: 61682F74 003D6873 00000000 00000000 |t/hash=.........|
0x0007B1A0: 00000000 EF000000 2F746218 68736168 |.........bt/hash|
0x0007B1B0: 9243D53D 3B7FBAD2 2DD486F0 2AD9F5CC |=.C....;...-...*|
0x0007B1C0: 6218905A 61682F74 003D6873 00000000 |Z..bt/hash=.....|
0x0007B1D0: 00000000 00000000 EF000000 2F746218 |.............bt/|
0x0007B1E0: 68736168 9243D53D 3B7FBAD2 2DD486F0 |hash=.C....;...-|
0x0007B1F0: 2AD9F5CC 6218905A 61682F74 003D6873 |...*Z..bt/hash=.|
0x0007B200: 00000000 00000000 00000000 EF000000 |................|
0x0007B210: 2F746218 68736168 9243D53D 3B7FBAD2 |.bt/hash=.C....;|
Environment (please complete the following information):
Additional context
Created from issue #20629
I have discussed this one on slack with @mfiumara.
The issue appears to be this check.
static int db_hash_commit(void)
{
/* Stop work and generate the hash */
if (k_delayed_work_remaining_get(&db_hash_work)) {
k_delayed_work_cancel(&db_hash_work);
db_hash_gen(false);
}
The check if (k_delayed_work_remaining) returns the amount of time until the work will execute, that timeout has run out. but the work has not been executed.
Since it is a single queue of work items, and settings_load is called from bt_ready which is a work-item then we get this situation, the bt_ready is executing, bt_hash_work should execute, but has no remaining time remaining.
I'm currently looking into reproducing this and fixing it.
Reproduce:
Configurations:
CONFIG_BT_GATT_DYNAMIC_DB=y
CONFIG_BT_GATT_CACHING=y
CONFIG_BT_DEBUG_GATT=y
Modified (extracted) peripheral sample:
static void bt_ready(int err)
{
k_busy_wait(10 * 1000); // DB hash delayed work is 10 MS
if (IS_ENABLED(CONFIG_SETTINGS)) {
settings_load();
}
}
void main(void)
{
int err;
err = bt_enable(bt_ready);
if (err) {
printk("Bluetooth init failed (err %d)\n", err);
return;
}
/* ... */
}
Then this will happen on every reset:
00:00:00.020,965] <dbg> bt_gatt: Stored Hash:
57 b5 7c 4a f9 e1 e6 08 23 21 0e 89 c1 21 fc 92 |W.|J.... #!...!..
[00:00:00.021,453] <dbg> bt_gatt: New Hash:
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |........ ........
[00:00:00.021,453] <dbg> bt_gatt.sc_indicate: start 0x0001 end 0xffff
[00:00:00.021,453] <dbg> bt_gatt.sc_indicate: indicate pending, waiting until complete...
[00:00:00.023,284] <dbg> bt_gatt.db_hash_store: Database Hash stored
....
[00:00:00.259,124] <dbg> bt_gatt: Hash:
57 b5 7c 4a f9 e1 e6 08 23 21 0e 89 c1 21 fc 92 |W.|J.... #!...!..
[00:00:00.260,589] <dbg> bt_gatt.db_hash_store: Database Hash stored
[00:00:00.260,589] <dbg> bt_gatt.sc_process: start 0x0001 end 0xffff
@mfiumara I posted a possible fix for this issue. It solves the issue when I was reproducing it locally. Can you verify that it also works in your setup?
I just tried this. It seems the hash is now not stored anymore as zero, but it is calculated differently and the problem still persists. I tried by erasing all memory, then running my application twice. My storage partition output after this:
nrfjprog --memrd 499712 --n 256
0x0007A000: C0FFEEEE 0000FF01 2F746218 68736168 |.........bt/hash|
0x0007A010: AF4FA13D 51B45CE5 73EF1DAA 45C27DD2 |=.O..\.Q...s.}.E|
0x0007A020: 621856E6 61682F74 D53D6873 BAD29243 |.V.bt/hash=.C...|
0x0007A030: 86F03B7F F5CC2DD4 905A2AD9 2F746218 |.;...-...*Z..bt/|
0x0007A040: 68736168 AF4FA13D 51B45CE5 73EF1DAA |hash=.O..\.Q...s|
0x0007A050: 45C27DD2 621856E6 61682F74 D53D6873 |.}.E.V.bt/hash=.|
0x0007A060: BAD29243 86F03B7F F5CC2DD4 905A2AD9 |C....;...-...*Z.|
0x0007A070: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
0x0007A080: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
0x0007A090: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
0x0007A0A0: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
0x0007A0B0: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
0x0007A0C0: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
0x0007A0D0: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
0x0007A0E0: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
0x0007A0F0: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
So it is still swithcing between two hashes on every boot.
@mfiumara Can you check if you have any calls to bt_gatt_service_register, and if they are executed before or after settings_load?
I just checked using gdb. I can confirm bt_gatt_service_register is called _after_ settings_load. The piece of application initialization code I have:
/* Enable Bluetooth. */
rc = bt_enable(bt_ready);
if (rc != 0) {
LOG_ERR("Bluetooth init failed (err %d)", rc);
return rc;
}
/* Register connection callbacks */
bt_conn_cb_register(&conn_callbacks);
/* Register Authentication callbacks */
bt_conn_auth_cb_register(&conn_auth_callbacks);
/* Initialize the Bluetooth mcumgr transport. */
smp_bt_register();
The gdb trace:
(gdb) b bt_gatt_service_register
Breakpoint 1 at 0x22d28: file /zephyr/subsys/bluetooth/host/gatt.c, line 1034.
(gdb) b settings_load
Breakpoint 2 at 0x36d66: file /zephyr/subsys/settings/src/settings_store.c, line 47.
(gdb) c
Continuing.
Breakpoint 2, settings_load () at /zephyr/subsys/settings/src/settings_store.c:47
47 return settings_load_subtree(NULL);
(gdb)
Continuing.
Breakpoint 1, bt_gatt_service_register (svc=svc@entry=0x200095fc <smp_bt_svc>) at /zephyr/subsys/bluetooth/host/gatt.c:1034
1034 bt_gatt_init();
The settings_load is in the bt_ready callback.
Ok. then that explains it I think. Could you move the call to smp_bt_register before settings_load for now?
I documented this current requirement here: https://github.com/zephyrproject-rtos/zephyr/pull/22756/commits/a9e0c6c47016a1a480157eff3af3631d21cdfbc9
Also I asked the question in the PR if this is really the behaviour that we want. So let's hear what the others think about this.
@joerchan I retried by re-ordering my initialisation routine (calling smp_bt_register before settings_load) and this seems to work. I redid the same test (restarting my device twice) and the storage partition now looks as follows:
nrfjprog --memrd 499712 --n 128
0x0007A000: C0FFEEEE 0000FF01 2F746218 68736168 |.........bt/hash|
0x0007A010: 9243D53D 3B7FBAD2 2DD486F0 2AD9F5CC |=.C....;...-...*|
0x0007A020: FFFF905A FFFFFFFF FFFFFFFF FFFFFFFF |Z...............|
0x0007A030: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
0x0007A040: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
0x0007A050: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
0x0007A060: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
0x0007A070: FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF |................|
So can confirm this fix works as intended.
Most helpful comment
I have discussed this one on slack with @mfiumara.
The issue appears to be this check.
The check if (k_delayed_work_remaining) returns the amount of time until the work will execute, that timeout has run out. but the work has not been executed.
Since it is a single queue of work items, and settings_load is called from bt_ready which is a work-item then we get this situation, the bt_ready is executing, bt_hash_work should execute, but has no remaining time remaining.
I'm currently looking into reproducing this and fixing it.