Not an issue as such, but some questions and possible improvements.
I'm using littlefs with a relatively large FRAM, with the read, write, and erase block sizes currently all set to 128 bytes (this was the smallest size that worked). It works well, except that repetitive open-read-close operations get rather slow. I'm reading 52-byte records from a fixed-size file, so many reads span two blocks. I'm about to analyse the timings further, but wondered if you have any comments on the current settings and usage, and whether there are obvious tweaks to improve performance.
Also, FRAM is a use case you quite possibly haven't considered; would it be easy to support such media (which are basically byte-writable, with no erase needed) in a more efficient way?
And a small issue: in many of the LFS_DEBUG() statements, uint32_t values are printed with the '%d' format specifier, and GCC 6.3 throws a warning. I changed all of these to '%lu'.
> And a small issue: in many of the LFS_DEBUG() statements, uint32_t values are printed with the '%d' format specifier, and GCC 6.3 throws a warning. I changed all of these to '%lu'.
Already fixed in master - see https://github.com/ARMmbed/littlefs/commit/7e67f9324e1fb0a910244baac60121e945566639
Interesting question, as I was planning to try littlefs on an EEPROM, which would be very similar to your case. Generally this is also similar to SD/MMC cards, which (usually) don't really need a separate erase step.
One possible issue is that littlefs places some reliance on unused (blank) memory being set to 0xff. I actually implemented a FRAM erase routine on this basis; I'm not sure how necessary it was, since I had other problems at the time.
My main thought was that the copy-on-write code could maybe just overwrite in place if the data position and size were the same, saving the overhead of updating block references etc. This is particularly relevant for my current use case.
Ah! Interesting!
> Also, FRAM is a use case you quite possibly haven't considered; would it be easy to support such media (which are basically byte-writable, with no erase needed) in a more efficient way?
Hmmm. At its core, littlefs is a block-based filesystem, so the erase blocks are very much a part of the filesystem's logical structure. At the very least, littlefs should be able to take advantage of byte-level reads/writes.
I'm not going to lie to you though, littlefs is block-based, so there may be more efficient approaches to putting a filesystem on FRAM.
> I'm using littlefs with a relatively large FRAM, with the read, write, and erase block sizes currently all set to 128 bytes (this was the smallest size that worked).
Ah! 128 bytes is the smallest erase block (anything smaller and internal structures start to overflow the erase block). But since you have byte-level read/writes you can set the "read_size" and "prog_size" as low as 1 byte.
Though you may want to play around with these values. Unfortunately, right now, several of the operations aren't very smart about small values and will painfully read 1 byte at a time over your bus (https://github.com/ARMmbed/littlefs/issues/41). I'm currently working on fixing this with an additional "cache_size" parameter, but it will have to wait for a major version change.
With a smaller read_size you should see better performance reading records across block boundaries.
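As a concrete illustration, a configuration along those lines might look roughly like this (the fram_* callbacks and the geometry numbers are placeholders for your own driver and parts, and the exact set of lfs_config fields depends on the littlefs version):

```c
#include "lfs.h"

// read/prog/erase/sync callbacks provided by your FRAM driver (placeholders)
int fram_read(const struct lfs_config *c, lfs_block_t block,
              lfs_off_t off, void *buffer, lfs_size_t size);
int fram_prog(const struct lfs_config *c, lfs_block_t block,
              lfs_off_t off, const void *buffer, lfs_size_t size);
int fram_erase(const struct lfs_config *c, lfs_block_t block);
int fram_sync(const struct lfs_config *c);

const struct lfs_config cfg = {
    .read  = fram_read,
    .prog  = fram_prog,
    .erase = fram_erase,
    .sync  = fram_sync,

    .read_size   = 1,    // FRAM supports byte-level reads
    .prog_size   = 1,    // and byte-level writes
    .block_size  = 128,  // smallest erase block littlefs supports
    .block_count = 2048, // e.g. a 256KB FRAM split into 128-byte blocks
    .lookahead   = 128,  // remaining fields/names vary by littlefs version
};
```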
> One possible issue is that littlefs places some reliance on unused (blank) memory being set to 0xff.
Nope! Your erase function can just be a noop. For littlefs, the idea of an "erase" is just an optional step to prepare a write.
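So for FRAM the erase (and sync) callbacks in a setup like the sketch above can be trivial, e.g.:

```c
// FRAM needs no erase-before-write, so "preparing" a block costs nothing
int fram_erase(const struct lfs_config *c, lfs_block_t block) {
    (void)c;
    (void)block;
    return 0;   // report success without touching the chip
}

// writes go straight out over the bus, so there is nothing to flush either
int fram_sync(const struct lfs_config *c) {
    (void)c;
    return 0;
}
```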
> My main thought was that the copy-on-write code could maybe just overwrite in place if the data position and size were the same, saving the overhead of updating block references etc. This is particularly relevant for my current use case.
This still poses a problem with power loss: not because of an erase, but because a power loss could catch you with half-written data.
Thanks for the explanations; a great help in avoiding going up blind alleys (and eliminating some code). I'll have a play; currently I've got a few layers between the application code and littlefs, so I need to pin down the time-consuming areas better. And I don't see any significant disadvantage to littlefs being block-based in my application. Most of the time the file system is doing very little; there's just one situation where the time taken is an issue. The important features for me are the resilience and the low overhead of directory structures and the like.
A little detail that you may or may not be able to accommodate in your enhanced cache handler. I'm actually using two FRAMs, which together appear as a contiguous block of memory provided a read or write operation doesn't cross the chip boundary. If you cross the boundary, the internal address pointer of the chip addressed at the start of the operation just wraps round to the start of the same chip. So there could be a problem filling the cache, assuming it can start on any read/write block boundary. A possible solution might be to constrain the cache size to be an exact sub-multiple of the chip size, and have the cache contents always start on a multiple of the cache size. This might fall out of the way you're doing things anyway. Certainly not a show-stopper if it would adversely affect performance; probably one chip will have enough capacity for me anyway.
> A little detail that you may or may not be able to accommodate in your enhanced cache handler. I'm actually using two FRAMs, which together appear as a contiguous block of memory provided a read or write operation doesn't cross the chip boundary. If you cross the boundary, the internal address pointer of the chip addressed at the start of the operation just wraps round to the start of the same chip. So there could be a problem filling the cache, assuming it can start on any read/write block boundary. A possible solution might be to constrain the cache size to be an exact sub-multiple of the chip size, and have the cache contents always start on a multiple of the cache size. This might fall out of the way you're doing things anyway. Certainly not a show-stopper if it would adversely affect performance; probably one chip will have enough capacity for me anyway.
Isn't this something you should handle in your abstraction? mbed has code like this; I guess it properly divides all reads/writes/erases when they cross the boundary. I haven't checked it, but I would be surprised if it didn't handle this.
https://github.com/ARMmbed/mbed-os/blob/master/features/filesystem/bd/ChainingBlockDevice.h
Current plans are for cache_size to be a multiple of prog_size, which is a multiple of read_size. I hadn't thought about caches larger than a block_size; the logic to keep caches in sync may end up making it not worthwhile.
Or, in more mathy terms:
block_size = A*cache_size = B*prog_size = C*read_size
for integers A, B, C
Also, as @FreddieChopin notes, this is a good case where you can hide the separation between the two chips by using the higher bits of the address to choose the exact chip (the ChainingBlockDevice class does this). The block device API is surprisingly flexible in this way.
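For illustration, the splitting could live right in the block device's read callback, something along these lines (CHIP_SIZE and fram_chip_read are hypothetical stand-ins for your parts and single-chip driver):

```c
#include <stdint.h>
#include "lfs.h"

#define CHIP_SIZE 0x20000  // bytes per chip -- adjust to the real parts

// hypothetical single-chip driver call: reads 'size' bytes starting at
// 'addr' within one chip, never crossing that chip's end
int fram_chip_read(int chip, uint32_t addr, void *buffer, uint32_t size);

int fram_read(const struct lfs_config *c, lfs_block_t block,
              lfs_off_t off, void *buffer, lfs_size_t size) {
    uint32_t addr = block * c->block_size + off;
    uint8_t *p = buffer;

    while (size > 0) {
        int chip = addr / CHIP_SIZE;             // high bits select the chip
        uint32_t chip_off = addr % CHIP_SIZE;    // offset within that chip
        lfs_size_t chunk = CHIP_SIZE - chip_off; // bytes left before boundary
        if (chunk > size) {
            chunk = size;
        }

        int err = fram_chip_read(chip, chip_off, p, chunk);
        if (err) {
            return err;
        }

        addr += chunk;
        p    += chunk;
        size -= chunk;
    }

    return 0;
}
```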
I've already handled this for the obvious case of block sizes being a sub-multiple of chip capacity (it basically just falls out naturally, with values as powers of two). My concern was if cache_size could be a multiple of block size, in which case it would be possible to cross the chip boundary on a single read.
You can implement a greedy function like in the chaining block device:
https://github.com/ARMmbed/mbed-os/blob/master/features/filesystem/bd/ChainingBlockDevice.cpp#L144-L165
Which handles larger-than-block-size cache sizes, by effectively breaking them up into blocks.
But there are no plans at the moment to cache larger than a block size, so I don't think you have to worry about it.
Just to feed back now that we've done some tests to establish the source of the delays.
It seems we have a similar problem to #41, aggravated by slow hardware.
To recap, we're reading a file comprising 100 52-byte records, with a 128-byte block size, one block at a time. (Changing to a 256-byte block size didn't make any significant difference to overall speed.) It's a pretty vanilla configuration of littlefs, with littlefs doing the buffer management.
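For reference, the access pattern is essentially the following (names and error handling are just illustrative):

```c
#include <stdint.h>
#include "lfs.h"

#define RECORD_SIZE 52

// fetch record 'idx' from the fixed-size file; each call is a full
// open-seek-read-close cycle, which is where the time goes
int read_record(lfs_t *lfs, const char *path, uint32_t idx,
                uint8_t record[RECORD_SIZE]) {
    lfs_file_t file;
    int err = lfs_file_open(lfs, &file, path, LFS_O_RDONLY);
    if (err) {
        return err;
    }

    lfs_soff_t off = lfs_file_seek(lfs, &file, idx * RECORD_SIZE, LFS_SEEK_SET);
    if (off < 0) {
        lfs_file_close(lfs, &file);
        return (int)off;
    }

    lfs_ssize_t res = lfs_file_read(lfs, &file, record, RECORD_SIZE);
    lfs_file_close(lfs, &file);
    return (res < 0) ? (int)res : 0;
}
```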
Part of our speed problem is that we're accessing the FRAM over I2C at 400kHz, and can't go any faster.
The other issue is the number of chip reads needed to find a block. These happen for every read, even when the file is being read sequentially: something like four or five chip reads to find the place, followed by one or two chip reads to get the data.
Most of the time the speed isn't an issue for us, and we've worked round the one situation where it matters.
We'll most likely revisit this if we can hook the FRAM up via SPI, or if a solution to #41 hits the repository.
From the comments in #41, some form of directory information caching could make a big difference in this type of situation. I may be reiterating existing thoughts, but the following might be a solution:
When the file is opened (or maybe on first access), save 'useful information' in a cache buffer. This could be, for example:
1. The locations of a number of consecutive blocks, starting at some value - this would have to be refreshed when the file pointer moved out of range
2. Pointers to elements of the CTZ skip list. If the number of pointers cached is 2^n-1 (I'm assuming the pointer to the last record is already available in RAM), n chip reads would be saved on each access.
I rather like the second option since:
a) It is more deterministic than the first - no "random" pauses while the next batch of location information is read in
b) It is readily configurable with a single number from 0 (no directory caching) upwards, giving a simple speed vs RAM tradeoff.
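Purely for illustration, the per-file cache in option 2 might be nothing more than a small structure like this (entirely hypothetical; nothing like it exists in littlefs today):

```c
#include "lfs.h"

// hypothetical per-file cache of CTZ skip-list pointers, filled when the
// file is opened; with 2^n - 1 entries, roughly n chip reads are saved
// on each access
#define LFS_PTR_CACHE 7   // 2^3 - 1 entries

struct lfs_ctz_cache {
    struct {
        lfs_off_t   pos;    // file offset at which this block starts
        lfs_block_t block;  // block address taken from the skip list
    } entry[LFS_PTR_CACHE];
};
```

A seek would then start its skip-list walk from the cached entry nearest below the target offset, rather than from the end of the file.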