Littlefs: assertion "block < lfs->cfg->block_count" failed

Created on 4 Mar 2018 · 21 comments · Source: littlefs-project/littlefs

I'm running LittleFS on a 16mbyte SPI flash on ESP32 and it's been working well, but last night I ran into the above assert. Here's the stack trace:

assertion "block < lfs->cfg->block_count" failed: file "/Users/jason/Projects/x/x-esp32-firmware/components/littlefs/littlefs/lfs.c", line 27, function: lfs_cache_read
abort() was called at PC 0x400e801b on core 0
0x400e801b: __assert_func at /home/jeroen/esp8266/esp32/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdlib/../../../.././newlib/libc/stdlib/assert.c:63 (discriminator 8)


Backtrace: 0x40094614:0x3fff2950 0x4009480f:0x3fff2970 0x400e801b:0x3fff2990 0x401703f9:0x3fff29c0 0x40170daa:0x3fff29f0 0x40171472:0x3fff2a40 0x400dbb19:0x3fff2ac0 0x400dabe4:0x3fff2af0 0x401381ad:0x3fff2b70 0x400d99dd:0x3fff2b90 0x400d9a35:0x3fff2bc0
0x40094614: invoke_abort at /Users/jason/esp/esp-idf/components/esp32/./panic.c:648

0x4009480f: abort at /Users/jason/esp/esp-idf/components/esp32/./panic.c:648

0x400e801b: __assert_func at /home/jeroen/esp8266/esp32/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdlib/../../../.././newlib/libc/stdlib/assert.c:63 (discriminator 8)

0x401703f9: lfs_cache_read at /Users/jason/Projects/x/x-esp32-firmware/components/littlefs/littlefs/lfs.c:1069

0x40170daa: lfs_ctz_traverse at /Users/jason/Projects/x/x-esp32-firmware/components/littlefs/littlefs/lfs.c:1069

0x40171472: lfs_traverse at /Users/jason/Projects/x/x-esp32-firmware/components/littlefs/littlefs/lfs.c:1069

0x400dbb19: x_fs_get_usage at /Users/jason/Projects/x/x-esp32-firmware/main/./x_fs_littlefs.c:178

Note that I have changed the names of some symbols and files for privacy purposes.

This is happening when I try to use lfs_traverse to get filesystem information. Here's the code I use for that:

// Traversal callback: called for each block littlefs has in use; just count them.
static int lfs_traverse_count(void *p, lfs_block_t b) {
    *(lfs_size_t *)p += 1;
    return 0;
}

esp_err_t x_fs_get_usage(
        uint64_t* bytes_total,
        uint64_t* bytes_free,
        uint64_t* bytes_used) {

    lfs_size_t in_use = 0;
    int err = lfs_traverse(&lfs, lfs_traverse_count, &in_use);
    if (err) {
        return err;
    }

    *bytes_total = cfg.block_count * cfg.block_size;
    *bytes_used = cfg.block_size * in_use;
    *bytes_free = *bytes_total - *bytes_used;

    return ESP_OK;
}

Before this happened, I saw two bursts of bad block logging:

lfs debug:1229: Bad block at 3252
lfs debug:1229: Bad block at 3253
lfs debug:1229: Bad block at 3254
lfs debug:1229: Bad block at 3255
lfs debug:1229: Bad block at 3256


lfs debug:1229: Bad block at 436
lfs debug:1229: Bad block at 437
lfs debug:1229: Bad block at 438
lfs debug:1229: Bad block at 439
lfs debug:1229: Bad block at 440

I have since written a simple block checker and run it against the flash, and found no bad blocks. Here's that code:

    size_t s_size = 4096;
    uint8_t* buffer_work = malloc(s_size);
    uint8_t* buffer_compare = malloc(s_size);
    for (int i = 0; i < (16 * 1024 * 1024) / s_size; i++) {
        // erase block, read it, check all values are 0xff
        memset(buffer_compare, 0xff, s_size);
        memset(buffer_work, 0x00, s_size);
        x_hal_bulk_storage_erase(i * s_size, s_size);
        x_hal_bulk_storage_read(i * s_size, s_size, buffer_work);
        if (memcmp(buffer_compare, buffer_work, s_size) != 0) {
            printf("%08x: erase failed\n", i);
            continue;
        }

        // write bit pattern, read it, compare
        memset(buffer_work, 0b10101010, s_size);
        x_hal_bulk_storage_write(i * s_size, s_size, buffer_work);
        memset(buffer_work, 0xff, s_size);
        x_hal_bulk_storage_read(i * s_size, s_size, buffer_work);
        memset(buffer_compare, 0b10101010, s_size);
        if (memcmp(buffer_compare, buffer_work, s_size) != 0) {
            printf("%08x: bit pattern 1 failed\n", i);
            continue;
        }

        // write second bit pattern, read it, compare
        // (no erase in between, so programming ANDs with the first pattern
        // and every byte should read back as 0x00)
        memset(buffer_work, 0b01010101, s_size);
        x_hal_bulk_storage_write(i * s_size, s_size, buffer_work);
        memset(buffer_work, 0xff, s_size);
        x_hal_bulk_storage_read(i * s_size, s_size, buffer_work);
        memset(buffer_compare, 0x00, s_size);
        if (memcmp(buffer_compare, buffer_work, s_size) != 0) {
            printf("%08x: bit pattern 2 failed\n", i);
            continue;
        }
    }

Aside from not being able to call lfs_traverse, everything else seemed to be working fine. I was able to read and write files, and list directories, but any call to lfs_traverse caused the above assert, even after a reboot.

Please let me know if there's any other information I can provide that will help with this. Happy to help in any way I can.

Thanks,
Jason


All 21 comments

Hi Jason,
I have been using littlefs for several weeks now and I have seen this issue twice as well, when the flash is close to totally full.

Hi @guillaumerems, thank you for your response. In my case the flash was almost empty. About 200kbyte stored out of 16mbyte available.

Hi @vonnieda, thanks for raising an issue.

Quick questions:
What hash of LittleFS are you using?
What configuration are you using? (read_size, prog_size, block_size (4KiB?), lookahead)
What compiler are you using? (It looks like ESP32 is a rather unique architecture; it may be worthwhile to run the tests on it. Do you know of a qemu port?)

Actually this sounds like another issue I ran into. LittleFS wasn't tracking blocks correctly and would accidentally allocate a block that was already in use. The block could then be modified by two different filesystem mechanisms, making LittleFS think a write failed and the block had gone bad.

This was the fix: https://github.com/geky/littlefs/commit/e611cf5050fc44acb97057654b6885ec9b060915

If you do a git log and your hash is after that commit, we may have a different problem.

Unfortunately, at this point the filesystem is probably corrupted with pointers going the wrong places, so you'll need to reformat before a traversal will work again. I haven't put together a proper fsck program, so if you need the data on disk you may have to piece together the filesystem by hand. I can give you some pointers on this if you need them.

@guillaumerems, is this related to https://github.com/geky/littlefs/issues/12? The bug fixed by e611cf5 is more likely when there are fewer free blocks, so it sounds like it might be the source of your issue as well. Sorry I hadn't thought about whether your issue was related. Were you able to reproduce https://github.com/geky/littlefs/issues/12 easily?

Hi @geky, thank you for your response!

Answers:
Hash: b2124a5ae54717737f6d96cb24e6f31e02158d1e
Config:

static struct lfs_config cfg = {
    // block device operations
    .read  = x_read,
    .prog  = x_prog,
    .erase = x_erase,
    .sync  = x_sync,

    // block device configuration
    .lookahead = 256,
    .block_size = 4096,
    .block_count = 4096,
    .prog_size = 256,
    .read_size = 256,
};

Compiler: xtensa-esp32-elf-osx-1.22.0-80-g6c4433a-5.2.0.tar.gz via https://esp-idf.readthedocs.io/en/latest/get-started/macos-setup.html

I am not familiar with qemu, unfortunately. I did find this: https://github.com/Ebiroll/qemu_esp32

In running the block check code I mentioned above I've erased the data, which is no problem. It's all test data at the moment. Next time this happens I will attempt to capture an image of the filesystem.

It does look like maybe it's starting to happen again, on this fresh format. I just saw lfs debug:1385: Bad block at 785 and I appear to have a reboot from earlier with:

assertion "entry.d.type == LFS_TYPE_REG" failed: file "/Users/jason/Projects/x/x-esp32-firmware/components/littlefs/littlefs/lfs.c", line 1526, function: lfs_file_sync

One odd thing is that I've been running LittleFS on this same system for over a week now and these problems have just popped up over the past day.

A few more notes that might help:

  • The system is basically a data logger.
  • The system writes about 30-40 bytes per minute by appending to files.
  • I do quite a few directory scans of the root directory, but do not create any other directories.
  • I create new files, and append to existing files. I do linear reads and append writes.
  • When it failed, the root directory had about 20 files in it.

Just noticed one more thing, not sure if it's related or not. LFS seems to be doing a lot of erases of just two blocks. I added logging of the erase call and here's an excerpt:

I (11602) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 2)
I (16642) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 10)
I (16682) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 3)
I (21672) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 11)
I (21712) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 2)
I (26722) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 12)
I (26762) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 3)
I (31742) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 13)
I (31802) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 2)
I (36792) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 14)
I (36832) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 3)
I (41822) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 15)
I (41862) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 2)
I (46852) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 16)
I (46912) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 3)
I (51872) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 17)
I (51922) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 2)
I (56922) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 18)
I (56962) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 3)
I (61942) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 19)
I (61992) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 2)
I (66992) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 20)
I (67032) FB_FS_LITT (0000) bs_erase(0x3ffc1498, 3)

Note that blocks 2 and 3 are being erased over and over. My system appends to a file every 5 seconds, and it appears it's erasing one of those two blocks for every write to this file. So, each block is being erased every 10 seconds. At that rate I'll surpass 100k erase cycles on those blocks in just 12 days.

That's a good thing to notice! But it is intentional. LittleFS's strategy is to recover from blocks when they go bad, as an alternative to wearing the blocks evenly. I put together a little simulator to show how this still extends the life of flash devices. You can also play around with the configuration to see how different write behaviors may impact the filesystem:
http://littlefs.geky.net/demo.html

The bug is interesting. Did LittleFS get remounted during that time, or was this a week of continuous operation?

I have seen flash blocks go bad, but only fail to write intermittently. I'd be surprised if a whole run of consecutive blocks failed at once, like in your first log. Also, from your setup I would expect block 2 or 3 to be the first to go.

There has been one fix since your hash of LittleFS: https://github.com/geky/littlefs/commit/9ee112a7cbdbf15a6aee931527f87f587ca523e5 (related pr https://github.com/geky/littlefs/issues/33)

But that issue only occurs when needing to extend a directory. With only 20 files in your root you should be fine (unless your filenames are ~192 characters long).

I'll try that qemu fork and see if I can get the test cases running on the ESP32 architecture. Unfortunately it may take me a few days before I can really sit down and debug this. Though thanks for all the info.

It's a long shot, but there have been a couple issues with compiler intrinsics. It'll be hard to know if this is the issue without a repeatable test, but passing -DLFS_NO_INTRINSICS may help reduce what could be the issue.

Thanks @geky.

Lots of remounts, at least 10 per day.

I think I confirmed that there aren't "really" any bad blocks with my block test code up above. For each block it:

  1. erase the block
  2. read the block back and confirm all bytes are 0xff
  3. write 0b10101010 to each byte, read back, and confirm
  4. write 0b01010101 to each byte (no erase in between)
  5. read the block back and confirm all bytes are 0x00

Let me know if you think this test is not sufficient.

I will merge the fix you mentioned, and will try -DLFS_NO_INTRINSICS.

In the meantime, I'll do some testing of my own and see if I can find a way to easily reproduce this.

Thanks for all your help!

One case of flash exhaustion I've seen was on the internal flash of an MCU after it had been reprogrammed a large number of times (LittleFS wasn't involved): roughly 1 out of 10ish erases would fail to push all the bits to 0xff.

You could try running your block test ~100 times to make sure there are no blocks that fail intermittently. This will add to the wear on your block device, though, so choose a number of passes you're comfortable with.

Sidenote: Some quick math using your use case, correct me if I make a mistake:

1 file update = 2 erases (file data and metadata pair)
1 file update every 5 seconds = 1 erase every 2.5 seconds
100K erase cycles per block, for 4K blocks (16MiB / 4KiB) = 400M erases.
400M * 2.5 = 1G seconds = 31.7 years.

So, very roughly, you can expect your flash to last about 30 years with LittleFS and your use case.

Note, this doesn't take into account static files or manufacturing defects, so it's just a maximum estimate more than anything. But it should give you a ballpark of what to expect (at least when everything's working correctly).
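
The same estimate as a tiny C program, in case it's easier to poke at (a sketch only; every figure is an assumption carried over from above, not a measurement):

#include <stdio.h>

// Back-of-envelope lifetime estimate from the assumptions above.
int main(void) {
    double erases_per_update  = 2.0;       // file data + metadata pair
    double seconds_per_update = 5.0;       // one append every 5 seconds
    double erase_cycles       = 100e3;     // rated erase cycles per block
    double block_count        = 16.0 * 1024 * 1024 / 4096;   // 4096 blocks

    double total_erases = erase_cycles * block_count;            // ~409.6M
    double seconds      = total_erases * seconds_per_update / erases_per_update;
    double years        = seconds / (365.0 * 24 * 60 * 60);

    // Rounding 409.6M down to 400M, as above, gives the ~31.7 year figure.
    printf("roughly %.1f years of erase budget\n", years);
    return 0;
}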

Hi @geky, I agree with your math, although one caveat is that the capacity will go down over that time, whereas with universal wear leveling the capacity could be expected to stay the same until everything failed at around the same time. Anyway, I agree, it's not something I need to be concerned with. I didn't really understand the wear leveling scheme you were using, and now I do.

I've added code to my write and erase functions to verify the operation afterwards with a read and compare. So, next time I see a "bad block" from LFS I hope this will tell me if I really did have a write failure, or if something else is happening.
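
Roughly, the wrapper looks like this (this is just a sketch of the idea, not my exact code; x_prog and x_read are the callbacks from the config above, and reporting the mismatch as LFS_ERR_CORRUPT is one reasonable choice):

#include <stdint.h>
#include <string.h>
#include "lfs.h"

// The callbacks from the config above, declared so the sketch stands alone.
int x_prog(const struct lfs_config *c, lfs_block_t block,
           lfs_off_t off, const void *buffer, lfs_size_t size);
int x_read(const struct lfs_config *c, lfs_block_t block,
           lfs_off_t off, void *buffer, lfs_size_t size);

// Program, read back in prog_size chunks, and report any mismatch as
// LFS_ERR_CORRUPT so littlefs treats the block as bad.
static int x_prog_verified(const struct lfs_config *c, lfs_block_t block,
                           lfs_off_t off, const void *buffer, lfs_size_t size) {
    int err = x_prog(c, block, off, buffer, size);
    if (err) {
        return err;
    }

    static uint8_t check[256];                 // prog_size from the config above
    for (lfs_size_t i = 0; i < size; i += sizeof(check)) {
        lfs_size_t chunk = size - i < sizeof(check) ? size - i : sizeof(check);
        err = x_read(c, block, off + i, check, chunk);
        if (err) {
            return err;
        }
        if (memcmp(check, (const uint8_t *)buffer + i, chunk) != 0) {
            return LFS_ERR_CORRUPT;
        }
    }
    return 0;
}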

Will report back as soon as I have something.

Also, as an aside, I tried destroying a block to see how many erase/writes it could handle, and I got up to 337k erase/writes before I had to stop the test and do something else. No failures so far. I'll continue the test tonight.

@geky I just had a thought that maybe explains this issue: Would a previously formatted and used LFS volume, which is then reformatted but not fully erased, potentially cause this issue?

Here's the situation:

  1. Flash starts erased from factory.
  2. LittleFS formats, writing just a few blocks.
  3. I use LFS for a while, creating some files, allocating blocks, etc.
  4. I corrupt the superblock while doing some other tests.
  5. I attempt to remount, it fails, so I format. I note this only erases and writes a few blocks, not the whole flash.
  6. LFS starts allocating blocks and runs across a block that "looks" right, but it's actually from a previous format of the FS?

This might explain why I am seeing this error on my development board, which I reformat regularly, but not on other boards that are just in general use.

So, I guess the question comes down to: Do I need to do a chip erase before formatting to ensure the state of the FS?

Oh! The other tests you run, do they erase block 0? What about block 1? The first two erase blocks form the superblock; unfortunately, erasing only one of them can cause problems.

There's a recurring issue I've seen where users try to switch between two filesystems on a single piece of storage. If block 0 is erased, LittleFS may pick up block 1, thinking that a power failure occurred. Unfortunately, block 1 may contain significantly out-of-date data, leading to a corrupted filesystem.

So, I guess the question comes down to: Do I need to do a chip erase before formatting to ensure the state of the FS?

No. LittleFS makes no assumption about the state of memory after a format.

But! If your code tries to mount first (which, to be fair, is suggested for handling new flash) and there is a corrupted filesystem, LittleFS may not reformat: it may mount successfully but fail later.

So at the moment you do need a full erase (or at least erase the first 2 blocks) after running tests or using another filesystem.
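
A minimal sketch of that, reusing the erase callback and cfg from the config posted above (wipe_and_format is just a hypothetical helper, not part of the littlefs API):

#include "lfs.h"

extern struct lfs_config cfg;   // the config posted earlier in this thread

// Wipe both superblock copies (blocks 0 and 1) before reformatting, so a stale
// superblock in block 1 can't be picked up after a later failed mount.
static int wipe_and_format(lfs_t *lfs) {
    int err = cfg.erase(&cfg, 0);
    if (err) {
        return err;
    }
    err = cfg.erase(&cfg, 1);
    if (err) {
        return err;
    }
    return lfs_format(lfs, &cfg);
}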

I do have an idea on how to improve this (by checksumming the sum of revision counts on the filesystem). But this would come with a code and runtime cost.

@geky I'm not sure if I've erased the first two blocks in all cases. I was doing quite a bit of stuff, including trying different filesystems, with this particular flash. It's quite possible this came up after I had experimented with just erasing the first block. I've now done a full chip erase on this board and will see if it comes up again.

I'm now experiencing this on multiple devices that have had their flash fully erased before formatting, so I think that rules out my previous guess.

Unfortunately I have not seen my write verify code trigger yet, so I don't think I have any bad blocks. I also think it's unlikely that there are bad blocks as these flash chips are all less than two weeks old and have barely been written to at all. I fear there may be some kind of compiler or platform incompatibility.

@geky Is there any information I can provide, or tests I can run, to help figure this out? I really want to use LittleFS as it's hundreds of times faster than SPIFFS, which I was previously using, but this is starting to become a showstopper :(

I do have a corrupt filesystem now, so if it would help for me to post an image of that, I can.

That was a good idea to add the verify function. Sorry you're running into so many problems.

A disk image could help quite a bit! If you don't want to post it publically you can email it to me.

@geky Thanks for the response. I've emailed the file to the email you mentioned above.

Thanks to a ton of help from @geky I've found this to be a problem in my code. I was calling lfs_traverse without a mutex, while all other lfs_ calls were protected by a mutex. @geky explained that LFS uses a single directory read cache, which was likely getting corrupted by simultaneous reads.

I confirmed the issue by running lfs_traverse in a tight loop on another thread, which corrupted the filesystem in 8 seconds. I then added locking and was not able to reproduce the error.

So, @guillaumerems, make sure you have a mutex around all LFS calls!
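
For anyone else hitting this, the fix is essentially just this shape (a sketch; x_fs_blocks_in_use and lfs_mutex are placeholder names, and the mutex is the same one that already protects every other lfs_ call):

#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"
#include "lfs.h"

// Created once at startup with xSemaphoreCreateMutex(), and taken around every
// lfs_ call in the rest of the code as well.
static SemaphoreHandle_t lfs_mutex;

static int count_block(void *p, lfs_block_t b) {
    *(lfs_size_t *)p += 1;
    return 0;
}

lfs_size_t x_fs_blocks_in_use(lfs_t *lfs) {
    lfs_size_t in_use = 0;
    xSemaphoreTake(lfs_mutex, portMAX_DELAY);   // same lock as all other lfs_ calls
    lfs_traverse(lfs, count_block, &in_use);
    xSemaphoreGive(lfs_mutex);
    return in_use;
}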

I'm going to monitor my device for a few days to make sure this doesn't happen again, and then will close this issue.

Thank you to @geky for all his help, and for such a great filesystem!

I've seen no further instances of corruption on this, so calling it good. Thanks for all the help!
