I have a system with an AES encryption layer between the littlefs code and the actual flash back-end. That means that when flash is erased, it decrypts to a known, consistent value for a given offset (for a given key). I've seen lfs_format(...) fail depending on this key, which changes the "random" data coming back for erased flash. One of the times I hit this call:
returning LFS_ERR_CORRUPT. When this occurred, lfs_format(...) didn't work. I don't know enough about the inner workings of lfs, but it seems like when lfs_format(...) hits this it should compensate for the "corruption" and still attempt to write out to flash.
I've attached the output of the AES decryption on the erased values of the flash that hit this error which corresponds to the issue I had above. You can see this fail to format with the lfs_fuse project, too:
$ ./lfs /dev/loop0 --block_size=4096 --read_size=16 --prog_size=16 --cache_size=4096 --lookahead_size=128 --block_cycles=100 --format
littlefs/lfs.c:997:error: Corrupted dir pair at {0x1, 0x0}
Thanks,
Tim
I did a little more investigation today, and it looks like:
For a more normal flash erased value of all 0xff's or 0x00's, the initial revision gets set to 0, so the initial case always works.
It seems like lfs_dir_alloc(...) attempts to preserve the existing revision counter for a given block, so that lfs_dir_compact(...) can later decide whether it needs to move blocks. If it does this across the board, a filesystem on top of an encrypted layer will end up with random eviction counters across all of the blocks. That isn't strictly a problem, but in the format case it seems like it shouldn't do this at all, since there are "unknown" values in the superblock location.
Should the format case ignore the current revision counter in the superblock, or preserve it? One easy way to fix this particular case would be to do the following inside lfs_format(...):
```diff
     // create root dir
     lfs_mdir_t root;
     err = lfs_dir_alloc(lfs, &root);
     if (err) {
         goto cleanup;
     }
 
+    // Update revision to 0
+    root.rev = 0;
+
     // write one superblock
     lfs_superblock_t superblock = {
         .version     = LFS_DISK_VERSION,
         .block_size  = lfs->cfg->block_size,
         .block_count = lfs->cfg->block_count,
         .name_max    = lfs->name_max,
         .file_max    = lfs->file_max,
         .attr_max    = lfs->attr_max,
     };
```
Alternatively, during a lfs_format(...) only, we'd need to consider a "CORRUPT" case inside lfs_fs_size(...) to be grounds to continue formatting anyways. That'd likely need additional arguments trickled down into it though.
Oh interesting, in theory that should not be happening due to this line in lfs_dir_alloc:
https://github.com/littlefs-project/littlefs/blob/4c9146ea539f72749d6cc3ea076372a81b12cb11/lfs.c#L1378-L1379
I wonder if that logic isn't correct for the case you have found (and investigated well!). What are the revision count and the value of cfg->block_cycles you run into?
+ // Update revision to 0
+ root.rev = 0;
So, unfortunately that doesn't work since we use the revision count to decide which of two metadata-blocks is the most recent. The reason we read the uninitialized revision count is to set up a revision count in our new metadata-block that indicates the new metadata-block is newer.
If the on-disk revision count was 2, setting the new revision count to 0 would break this on future mounts.
I saw the comment "make sure we don't immediately evict", but I wasn't sure how it was supposed to work. In this case, it's setting dir->rev to the next even number. The check code inside the lfs_dir_compact(...) is:
dir->rev % ((lfs->cfg->block_cycles+1)|1) == 0
The different touch points of dir->rev and their values:
2742492087 % ((100+1)|1) = 0
2742492087 % ((101)|1) = 0
2742492087 % 101 = 0
Note that 101 is prime (and odd) for this field, so both even and odd numbers can make the modulus come out 0. If you really want to check whether it's going to evict inside lfs_dir_alloc(...), shouldn't it use the same modulus comparison that lfs_dir_compact(...) uses before deciding on a value? Alternatively, the ((lfs->cfg->block_cycles+1)|1) modulus would need the prime factor 2 if you want an odd number to never produce 0 from the modulus, e.g. if it were written as ((lfs->cfg->block_cycles+1)&~1) instead.
However, that might break this comment right above it which sounds like it's purposefully making it an odd modulus:
// If our revision count == n * block_cycles, we should force a relocation,
// this is how littlefs wear-levels at the metadata-pair level. Note that we
// actually use (block_cycles+1)|1, this is to avoid two corner cases:
// 1. block_cycles = 1, which would prevent relocations from terminating
// 2. block_cycles = 2n, which, due to aliasing, would only ever relocate
// one metadata block in the pair, effectively making this useless
Sounds like the modulus _needs_ to be odd (i.e. relatively prime to 2, because of the 2n corner case), so the modulus cannot be even here. And since an odd modulus can divide both even and odd inputs, lfs_dir_alloc(...) seems like it needs to change how it picks. Maybe something like this inside lfs_dir_alloc(...)?:
```
// make sure we don't immediately evict
dir->rev += dir->rev & 1;
// set defaults
dir->off = sizeof(dir->rev);
```
That would preserve the even/odd dir->rev value if that's important due to the even/odd directory revision blocks.
Regarding setting root.rev to 0, how would setting this to 0 within lfs_format(...) break future mounts unless someone formats the disk again?
Ah, excellent investigation! The check in lfs_dir_compact was made as a part of https://github.com/littlefs-project/littlefs/issues/369, it looks like it didn't consider the effect on lfs_dir_alloc.
You're right that the two comparisons are in conflict. Seeing as the only property we're looking for is that the next compact doesn't evict (even/odd doesn't matter, and I think it currently isn't guaranteed), I think we can simplify it into this:
```diff
     // make sure we don't immediately evict
+    if (lfs->cfg->block_cycles > 0 &&
+            (dir->rev + 1) % ((lfs->cfg->block_cycles+1)|1) == 0) {
+        dir->rev += 1;
+    }
 
     // set defaults
     dir->off = sizeof(dir->rev);
```
Regarding setting root.rev to 0, how would setting this to 0 within lfs_format(...) break future mounts unless someone formats the disk again?
So the basic metadata structure looks like this:
metadata pair
.--' '--.
v v
.-------. .-------.
| rev 4 | | rev 5 |
| | | |
| | | |
'-------' '-------'
The reason we have two metadata blocks is so we can erase one for new data (lfs_dir_compact) without always needing to update the filesystem tree. The main purpose of the revision count is to know which of the two metadata blocks contains the newest, up-to-date data.
Another thing to note is that erases are expensive, even if we just erase one block. Blocks on NAND can get up into the 8KiB range. So as an optimization we don't write out both blocks, even when we allocate a new pair. Instead we pretend whatever data was on disk was a valid revision count, and increment it as normal for the newer revision count.
So say this was our disk at format time:
metadata pair
.--' '--.
v v
.-------. .-------.
| rev 4 | | rev ? |
|garbage| |???????|
|garbage| |???????|
'-------' '-------'
If we zeroed the revision count, when we write the newest metadata block it would look like this:
metadata pair
.--' '--.
v v
.-------. .-------.
| rev 4 | | rev 0 |
|garbage| | lfs2 |
|garbage| | here |
'-------' '-------'
Which would lead a future mount to think the block with rev 4 is the most recent, ignoring the rev 0 block.
I went ahead and created a PR here to hopefully fix this: https://github.com/littlefs-project/littlefs/pull/490
Hi @geky -
It definitely works now without an error. However, note that with this code change, this particular case now causes the superblock to expand immediately, since the rev is bumped right after format (the format itself succeeds). Are there any long-term consequences to it immediately expanding the superblock? E.g. lfs-fuse gives (line numbers are off since I added some extra printouts):
littlefs/lfs.c:1547:debug: Expanding superblock at rev 2742492087
If we wanted to be slightly more definitive in how long a block is going to last, you could use this logic instead of just making sure it won't be evicted:
```c
// Ensure we bump the revision forward enough so we don't evict for the
// maximum count possible
if (lfs->cfg->block_cycles > 0) {
    const int modulus = ((lfs->cfg->block_cycles+1)|1);
    int delta = dir->rev % modulus;
    // Note: Check against 0 here so we don't increment by the modulus
    // when we don't need to.
    if (delta != 0) {
        dir->rev += modulus - delta;
    }
}
```
which would make sure that a given block is always used X times before being evicted upon allocation, regardless of what was there before. If a given block was exactly at the modulus interval when previously evicted, this is a no-op. Otherwise it bumps it forward.
Not sure how this impacts other wear leveling to do this though (and would probably have to be simulated to see if it's worth it), but maybe this is better for situations where the erased value is "random"?
Thanks for the discussion,
Tim
Edit: Removed "I suppose the other option would be to have a way of asking the underlying block layer "is this erased?", and doing this logic only then. That layer _should_ know if a given content is the erased state or not." Just realizing that with a "random" erased state, there is always the chance that your data matches the erased state. Might be okay for just the revision bump, but...
Hmm...
The only consequence of the superblock expanding is that an additional two blocks will be used by the superblock. However, if you write to the root directory often, it's likely the superblock would expand later anyways.
Basically for the COW wear leveling to work, there needs to be height to the filesystem tree. Expanding the superblock-chain adds height artificially to avoid wearing out the superblock-anchor before the rest of the filesystem.
So the superblock expanding is normal, though it could be a problem for extremely small filesystems (<16 blocks), since the superblock expansion is skipped if the filesystem is >1/2 full.
There's also value in the filesystem behaving more-or-less consistently despite the initial contents.
Edit: Removed "I suppose the other option would be to have a way of asking the underlying block layer "is this erased?", and doing this logic only then. That layer should know if a given content is the erased state or not." Just realizing that with a "random" erased state, there is always the chance that your data matches the erased state. Might be okay for just the revision bump, but...
Yeah, there's that possibility. It's also fairly expensive to check the erased state directly since it requires reading every byte in a block. That and assuming we can't test for "erasedness" makes littlefs more flexible + easier to port.
Humorously, I think we can use lfs_alignup for this:

```c
// Ensure we bump the revision forward enough so we don't evict for the maximum count possible
if (lfs->cfg->block_cycles > 0) {
    dir->rev = lfs_alignup(dir->rev, ((lfs->cfg->block_cycles+1)|1));
}
```
I went ahead and added this change to https://github.com/littlefs-project/littlefs/pull/490
Thanks for pointing these out.
Hi @geky -
No problem! A few other observations I was thinking about:
- The change you referenced in fe957de causes the pair of revisions to evict out of phase from each other. E.g. when allocating the root directory at blocks 0 and 1, block 1 will be evicted at roughly half of the revision count whereas block 0 will evict at roughly the full revision count. (It'd even out long term, but just that first eviction for a given directory would end up doing that.) I'm not sure if this was intentional or not; I think another approach would have been to do (rev >> 1) % block_cycles - e.g. discarding the lowest bit. The advantage of this approach is that it could be optimized to bit-wise operations in the static configuration case above. (Technically the current eviction logic is probably slightly better from a "lots of writes in a row" perspective, since you won't hit two erases in a row as easily for the directory pair.) (Of course, you'd need a minimum on the modulus here of 2 or something.) You _could_ maintain the out-of-phase behavior here (and allow it to be optimized to bit-wise operations when static configurations come along) by doing ((rev + ((rev & 1) ? block_cycles : 0)) >> 1) % block_cycles (untested math, but I think it checks out). In this case you'd still do the rev + 1 logic elsewhere. You'd need to avoid incrementing block_cycles by 1 to get the bit-wise optimization with static configs of course, so it might be better to implement a "minimum" block cycles value via a helper function getter or something?
- Technically when it "wraps" the revision counter (e.g. crossing 0xffffffff to 0x00000000), we'll end up with a partial period of the modulus since the modulus currently cannot directly divide into 2^32. I don't think this is important to fix, but was something that popped into my head over the past couple of days. (This wouldn't occur if the eviction modulus was a power of 2, but that's not possible with the current eviction logic.)
- When you get #491 in place the compiler could decide to optimize the various moduli in the system with bit-wise AND masks depending on the static values picked by the user of the filesystem (of course it can only do that if static values are used instead of dynamic ones). The current modulus for the revision counter could never be optimized into bitwise operations by the compiler. Depending on the microcontroller and other code usage by the user of lfs, allowing the compiler to optimize this may remove extra code from the compiled binary and reduce the overall size. (The modulus operation isn't always free from a code space perspective, or from a computation perspective.)

Anyhow... I don't know if any action needs to happen with the above thoughts - just random musings. :)
-- Tim
FWIW I was seeing similar issues, where lfs_format() would return LFS_ERR_CORRUPT when there was existing data on the flash storage device. It was not reliably possible to reproduce the error; it seems the existing data needed to be of a very specific form to trigger the issue.
I would have thought that lfs_format() would always succeed, regardless of any existing data on the underlying storage device. Is it recommended to first erase everything before running lfs_format()?
Hi @dpgeorge -
This was fixed in the commit referenced by @geky above - I had just posited some extra thoughts after the fact that probably should be elsewhere. See https://github.com/littlefs-project/littlefs/commit/5783eea0de0438f098424c9f30b865762da26b26 - it's in devel (just not in a release quite yet since it was just fixed).
As for the second half: from what I gathered in this thread, erasure isn't intended to be required for lfs_format(); it was intended to work regardless of the current erasure state. There was basically a regression for a while after the eviction logic had been changed. The eviction logic expects there to be a filesystem in the flash, the allocation for the root folder was supposed to purposefully avoid eviction so that code wasn't hit, and that avoidance was broken for a period of time for some values of flash.
-- Tim
This was fixed in the commit referenced by @geky above
Thanks, I did see that. But what I noticed in the message of #490 is that it says "Should help #489". Does that mean it fixes the issue here fully, or just makes it less likely for the problem to occur? I don't understand the code in that fix well enough to be able to answer that question myself.
As for the second half, from what I gathered in this thread erasure isn't intended to be required for lfs_format() and it was intended to work regardless of the current erasure state. There basically was a regression for a while after the eviction logic had been changed
Ok, that makes sense, thanks!
Hi @dpgeorge -
This was fixed in the commit referenced by @geky above
Thanks, I did see that. But what I noticed in the message of #490 is that it says "Should help #489". Does that mean it fixes the issue here fully, or just makes it less likely for the problem to occur?
It completely mitigates the issue I observed and captured in the original ticket here. There's a sample case as an attachment, with the LFS mount parameters to replicate the issue I saw. The commit in #490 fixes the issue with that attachment, and I'm pretty sure it'll fix all cases with the same root cause (i.e. eviction, which expects a valid filesystem, happening during lfs_format() before there is a valid filesystem). The eviction logic is just a counter, and it runs the eviction code when that counter modulo X is 0. So if the existing contents caused that to occur, the eviction logic said "corrupt filesystem" and this trickled out of the lfs_format() call as the LFS_ERR_CORRUPT error code.
I should probably just mark this as closed as the fix is now in the "devel" branch. It's just not in any formal release yet.
Should help #489
I'll word that better in the future, if I put "should fix/resolve" it triggers GitHub's automatic issue closing, which wasn't what I wanted.
Anyhow... I don't know if any action needs to happen with the above thoughts - just random musings. :)
I enjoy the random musings, thanks for those. Having more brains looking at things is always good, it means we're more likely to find the best solution :)
Technically when it "wraps" the revision counter (e.g. crossing 0xffffffff to 0x00000000), we'll end up with a partial period of the modulus since the modulus currently cannot directly divide into 2^32. I don't think this is important to fix, but was something that popped into my head over the past couple of days. (This wouldn't occur if the eviction modulus was a power of 2, but that's not possible with the current eviction logic.)
Yeah, I noticed this corner case too. The fact that the number of blocks in a pair is 2 rules out multiples of 2 as the modulus, as far as I can tell. Another option would be to artificially overflow the revision count on a large multiple of our modulus (rev %= lfs_aligndown(UINT32_MAX, modulus)), but I don't know if it's worth the extra complexity/code-cost.
Hmm, maybe it would be for consistent eviction timing?
When you get #491 in place the compiler could decide to optimize the various moduli in the system with bit-wise AND masks depending on the static values picked by the user of the filesystem (of course it can only do that if static values are used instead of dynamic ones). The current modulus for the revision counter could never be optimized into bitwise operations by the compiler. Depending on the microcontroller and other code usages by the user of lfs, allowing the compiler to optimize this may remove extra code from the compiled binary and reduce the overall size. (The modulus operation isn't always free from a code space perspective, or from a computation perspective.)
Yeah, it's unfortunate especially since you can still find chips without a hardware divider.
At the very least there are a few other types of constants compilers can optimize into bitshifts/multiplies:
https://zneak.github.io/fcd/2017/02/19/divisions.html
The change you referenced in fe957de causes the pair of revisions to evict out of phase from each other. E.g. when allocating the root directory at blocks 0 and 1, block 1 will be evicted at roughly half of the revision count whereas block 0 will evict at roughly the full revision count. (It'd even out long term, but just that first eviction for a given directory would end up doing that.) I'm not sure if this was intentional or not; I think another approach would have been to do (rev >> 1) % block_cycles - e.g. discarding the lowest bit. The advantage of this approach is that it could be optimized to bit-wise operations in the static configuration case above. (Technically the current eviction logic is probably slightly better from a "lots of writes in a row perspective" since you won't hit two erases in a row as easily for the directory pair.) (Of course, you'd need a minimum on the modulus here of 2 or something.) You could maintain the out of phase thing here (and allow it to be optimized to bit-wise operations when static configurations come along) by doing ((rev + ((rev & 1) ? block_cycles : 0)) >> 1) % block_cycles (untested math, but I think it checks out). In this case you'd still do the rev + 1 logic elsewhere. You'd need to avoid incrementing block_cycles by 1 to get the bit-wise optimization with static configs of course, so it might be better to implement a "minimum" block cycles value via a helper function getter or something?
You're right about the original intention, the goal is to avoid two evictions in a row.
Hmmm, dividing the revision count by 2 is a good idea. There may also be a trick such as adding 2 when we evict to keep staggered evictions.
I'm going to go ahead and reopen this if that's fine with you, since there are good ideas here. Otherwise it would probably get lost in the issue pile.
I'll word that better in the future, if I put "should fix/resolve" it triggers GitHub's automatic issue closing, which wasn't what I wanted.
Ah, I see, thanks for the clarification!
Humorously it also means sentences like "this doesn't fix #489" will cause GitHub to close those issues as fixed. Unfortunately it looks like the only way to avoid this is to change the sentence itself.
Hi @geky -
I enjoy the random musings, thanks for those. Having more brains looking at things is always good, it means we're more likely to find the best solution :)
:)
Technically when it "wraps" the revision counter (e.g. crossing 0xffffffff to 0x00000000), we'll end up with a partial period of the modulus since the modulus currently cannot directly divide into 2^32. I don't think this is important to fix, but was something that popped into my head over the past couple of days. (This wouldn't occur if the eviction modulus was a power of 2, but that's not possible with the current eviction logic.)
Yeah, I noticed this corner case too. The fact that the number of blocks in a pair is 2 rules out multiples of 2 as the modulus, as far as I can tell.
With the divide by 2 example presented here, we can use a power of 2 modulus and avoid this corner case.
Yeah, it's unfortunate especially since you can still find chips without a hardware divider.
Note that the Cortex-M* processors do not have a modulo instruction; they have to use a divide instruction and a multiply instruction instead. And the Cortex-M0/M0+/M1 processors don't have a hardware divider at all.
At the very least there are a few other types of constants compilers can optimize into bitshifts/multiplies:
https://zneak.github.io/fcd/2017/02/19/divisions.html
True. Powers of 2 are always easier though (and typically faster from the processor side - although, maybe it doesn't truly matter since it's probably going to be dwarfed by the I/O cost to the storage backend).
You're right about the original intention, the goal is to avoid two evictions in a row.
Hmmm, dividing the block count by 2 is a good idea. There may also be a trick such as adding 2 when we evict to keep staggered evictions.
The presented formula already does the staggering, since it adds in half of block_cycles for odd values; there's no need to add "2" to keep the staggered evictions with this formula. ((rev + ((rev & 1) ? block_cycles : 0)) >> 1) % block_cycles could have been re-written as (rev / 2 + ((rev & 1) ? block_cycles / 2 : 0)) % block_cycles, where it's a little easier to see what's going on. However, we only need to do the divide by 2 once, which is why I presented it in the other form. Anyhow, you can see that the base revision going into the modulus is always rev / 2, so neighboring pairs get the same value out, e.g. {[0,1], [2,3], [4,5]} -> {0, 1, 2}. Adding in half of block_cycles for odd revisions restores the original staggered evictions, but with power-of-2-friendly math - e.g. if your block_cycles was 4:
Even revs (not adding in the half block_cycles before the modulus):
rev=x: (x / 2 + 0) % 4
rev=0: (0 / 2 + 0) % 4 = 0 % 4 = 0; evict
rev=8: (8 / 2 + 0) % 4 = 8 % 4 = 0; evict
Odd revs (adding in the half block_cycles before the modulus - additive of 2 in this example):
rev=x: (x / 2 + 2) % 4
rev=5: (5 / 2 + 2) % 4 = 4 % 4 = 0; evict
rev=13: (13 / 2 + 2) % 4 = 8 % 4 = 0; evict
I'm going to go ahead and reopen this if that's fine with you
No worries! I just wasn't sure if it should've been moved to a different issue #.
Thanks,
Tim
((rev + ((rev & 1) ? block_cycles : 0)) >> 1) % block_cycles could have been re-written as (rev / 2 + ((rev & 1) ? block_cycles / 2 : 0)) % block_cycles
Ah! I see, so you're effectively using the lowest bit to offset one block by block_cycles/2. I initially didn't understand what was going on.
You could even get rid of the conditional:
// mul/div should optimize to bit-shifts if block_cycles is power-of-2
((rev + ((rev & 1)*block_cycles)) / 2) % block_cycles
I was initially thinking +2 at compact time if it made the staggering formula simpler, but then again there may be future benefit for the revision count to be strictly monotonic (debugability?)
Ah! I see, so you're effectively using the lowest bit to offset one block by block_cycles/2. I initially didn't understand what was going on.
Sorry, I probably should have explained it better. :) But yes, since that lowest bit dictates even/odd, it can be used to decide whether to offset the eviction modulus result, and then that bit gets thrown away in the overall divide by 2 before the modulus is checked. Same overall result as making the modulus not contain the prime factor 2, but it allows for better optimizations by the compiler. :)
You could even get rid of the conditional:
// mul/div should optimize to bit-shifts if block_cycles is power-of-2
((rev + ((rev & 1)*block_cycles)) / 2) % block_cycles
Hopefully the compiler recognizes you're multiplying by either 0 or 1 here and reduces it to a conditional add instruction in the underlying architecture. I suppose for code readability you might actually split it onto separate lines:
```c
uint32_t tmp_rev = dir->rev;
// Check to see if it's odd, and add in block_cycles if so, in order
// to stagger the eviction of the odd directory revisions
if (tmp_rev & 1) {
    tmp_rev += block_cycles;
}
if ((tmp_rev / 2) % block_cycles == 0) {
    ...
}
```
Sorry, I probably should have explained it better. :)
I also read through perhaps a bit too quickly
Hopefully the compiler would recognize you're either multiplying by 0 or 1 here and just reduce it to a conditional add instruction
Ah, I was more hoping for it to optimize into a bit-shift:
((rev + ((rev & 1)*block_cycles)) / 2) % block_cycles
Becomes (assuming block_cycles is power-of-two):
.--- constant ---. .----- constant -----.
((rev + ((rev & 1) << log2(block_cycles))) >> 1) & (log2(block_cycles)-1)
Actually, now that I think about it, you don't even need the first bitmask, since we remove all higher bits anyways. At the risk of maybe going too far:
((rev ^ (rev*block_cycles)) / 2) % block_cycles
Becomes (assuming block_cycles is power-of-two):
.--- constant ---. .----- constant -----.
((rev ^ (rev << log2(block_cycles))) >> 1) & (log2(block_cycles)-1)
Though I wonder if the original would be better to take advantage of bit-level instructions...
Becomes (assuming block_cycles is power-of-two):
.--- constant ---. .----- constant -----.
((rev ^ (rev << log2(block_cycles))) >> 1) & (log2(block_cycles)-1)
You are probably overthinking this a bit (and accidentally changed the + to a ^). The compiler will recognize that you're using % with a power of 2 (when it's a compile-time constant) and automatically convert it to an AND-mask. The only place you get some benefit here is if the configuration parameter is the log2(block_cycles) parameter, since you can avoid using the integer division (or the divide + multiply on a Cortex-M4) in a dynamic usage case.
Also, it should've been:
((rev + ((rev & 1) ? block_cycles : 0)) >> 1) & ((1 << log2(block_cycles))-1)
Here are the four methods' disassembly on ARM when compiling with -Os: https://godbolt.org/z/9sKP9r
You are probably overthinking this a bit
This is probably true, but it is a bit fun
(and accidentally changed the + to a ^)
Ah you're right, I initially thought ^/+ were interchangeable here, but I don't know if that's the case when block_cycles is not a power of two
((rev + (rev*block_cycles)) / 2) % block_cycles
This works because with modular arithmetic we don't care about any multiples > block_cycles, and (rev*block_cycles)/2 is always a multiple of block_cycles/2.
(Humorously, ((rev*(block_cycles+1)) / 2) % block_cycles also works, but there's no real benefit)