Nodemcu-firmware: File writes slow and fail well before SPIFFS filesystem space is exhausted

Created on 9 Apr 2019  ·  24 Comments  ·  Source: nodemcu/nodemcu-firmware

Expected behavior

Periodically appending chunks of data to a file on the SPIFFS filesystem is possible until the free space is exhausted, or nearly exhausted.

Actual behavior

There are increasing problems with writing chunks to a file as the filesystem fills up. It does not seem to be a problem with file size: having multiple smaller files still leads to the same problem.

  • Write speed slows considerably as the filesystem is filled to > 75% capacity.
  • File writes eventually fail completely with ~8% of the space left on the filesystem (230 KB+ on a 4 MB module)
  • The last few successful writes take 6-77 seconds instead of the normal ~15-75 milliseconds.
  • The points, in terms of space remaining, at which file writes slow or fail are repeatable between runs.

This problem has been present since at least v2.2.0, a year or more. Once file writes fail, it is usually not possible to successfully download any file from SPIFFS using ESPlorer or nodemcu-uploader until large files are deleted.

In actual use cases, the same problem happens with an ESP8266 that turns on periodically, collects some data, and appends it to a CSV file. The problems seem to happen at roughly similar file sizes across different individual ESP8266 modules.

Abbreviated function output, with write time for 1024 byte chunks in milliseconds:

For 4MB module:
Used: 1506 | Remain: 3445728 | Time: 180.279
Used: 2510 | Remain: 3444724 | Time: 12.285
Used: 3514 | Remain: 3443720 | Time: 12.366
Used: 4518 | Remain: 3442716 | Time: 12.362
...
Used: 291411 | Remain: 3155823 | Time: 71.817
Used: 292415 | Remain: 3154819 | Time: 71.83
Used: 293419 | Remain: 3153815 | Time: 71.957
...
Used: 2741924 | Remain: 705310 | Time: 77.094
Used: 2742928 | Remain: 704306 | Time: 77.077
Used: 2743932 | Remain: 703302 | Time: 77.078
Used: 2744936 | Remain: 702298 | Time: 262.307
Used: 2745940 | Remain: 701294 | Time: 320.053
Used: 2747195 | Remain: 700039 | Time: 321.103
...
Used: 3210541 | Remain: 236693 | Time: 326.398
Used: 3211545 | Remain: 235689 | Time: 327.534
Used: 3212549 | Remain: 234685 | Time: 318.57
Used: 3213804 | Remain: 233430 | Time: 1690.133
Used: 3214808 | Remain: 232426 | Time: 8580.37
Used: 3215812 | Remain: 231422 | Time: 6809.032
Used: 3216816 | Remain: 230418 | Time: 6137.508
Used: 3217820 | Remain: 229414 | Time: 7142.574
Writing file failed. f:write() output:  nil

~~~~~~~~~~~~~~~~~~~~~~~~~

For 1MB module: 
Used: 1506 | Remain: 457824 | Time: 43.308
Used: 2510 | Remain: 456820 | Time: 16.082
Used: 3514 | Remain: 455816 | Time: 16.178
...
Used: 353408 | Remain: 105922 | Time: 26.187
Used: 354412 | Remain: 104918 | Time: 26.173
Used: 355416 | Remain: 103914 | Time: 26.377
Used: 356420 | Remain: 102910 | Time: 301.693
Used: 357424 | Remain: 101906 | Time: 302.709
Used: 358428 | Remain: 100902 | Time: 301.955
...
Used: 419421 | Remain: 39909 | Time: 333.119
Used: 420425 | Remain: 38905 | Time: 333.52
Used: 421429 | Remain: 37901 | Time: 331.577
Used: 422433 | Remain: 36897 | Time: 3329.875
Used: 423437 | Remain: 35893 | Time: 76869.023
Used: 424441 | Remain: 34889 | Time: 77039.751
Writing file failed. f:write() output:  nil


Test code

function test_filesystem(mult)

    -- Append chunks of (16 * 2^mult) bytes to test.txt until less than
    -- 500 bytes of filesystem space remain, printing the time per write.
    local chunk = string.rep("0123456789ABCDEF", 2 ^ mult)

    local remain, used = file.fsinfo()

    while remain > 500 do

        local start = tmr.now()
        local f = file.open("test.txt", "a")

        if f then

            local success = f:write(chunk)

            if success then
                f:close()
                local time = (tmr.now() - start) / 1000
                remain, used = file.fsinfo()
                print("Used: " .. used .. " | Remain: " .. remain .. " | Time: " .. time)
            else
                f:close()
                print("Writing file failed. f:write() output: ", success)
                return
            end

        else
            print("Opening file failed")
            return
        end

    end

end

-- mult: 0 = 16-byte chunks, 1 = 32, 4 = 256, 5 = 512, 6 = 1024 bytes, etc.
test_filesystem(6)

NodeMCU version

2.2.1 master, commit: 4905381c004bdeaf744a60bb940c1906c2e052d4

Hardware

WeMos D1 mini (4MB), and WeMos D1 mini Lite (1MB)

wontfix

All 24 comments

Sorry but what you describe is expected behaviour for SPIFFS. See SPIFFS Performance and Optimizing. SPIFFS is not a high performance FS; it is designed to balance safety, wear levelling and performance for bare flash devices. It does not maintain any form of linked structures, so attempting to open a non-existent file will end up doing a full scan of the FS. When it needs a new logical page, it scans the FS until it finds one free, and these scan times rise considerably as the used % of the FS approaches 100%. Ditto fseek() times, giving an O(N²) characteristic.

If you want good performance from SPIFFS:

  • Keep the % utilisation low
  • Keep the partition size low as well.

If you want to do a lot of local logging then the best thing to do is to use FatFS on an SD card. Use a decent branded SD card, as these embed a proper wear-levelling controller on the device, so FatFS gives a balance of performance and life.
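For anyone who has not used the FatFS option before, mounting an SD card looks roughly like this (an untested sketch; it assumes a firmware build with FatFS enabled, the default SS pin, and the "/SD0" volume name):

-- Sketch only: requires a firmware image built with FatFS support and an
-- SD card on the HSPI pins; "/SD0" and the default SS pin are assumptions.
local vol = file.mount("/SD0")        -- mount the first FAT volume on the SD card
if vol then
    file.chdir("/SD0/")               -- make it the default drive for file.open()
    local f = file.open("log.csv", "a+")
    if f then
        f:write("timestamp,value\n")
        f:close()
    end
    vol:umount()                      -- unmount cleanly before cutting power
else
    print("SD card mount failed")
end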

@TerryE That's reasonable, at least in terms of write times increasing I think.

Having file writes fail completely at an arbitrary point (and leave the filesystem unstable) seems to be bad form to me though. At a minimum, this should be documented, so that it can be mitigated.

What's the point of having a filesystem with x amount of space if you cannot use it all, and don't even know how much you can use?

SD cards are not always an option either. They have a high power draw and add significant cost. At the scale at which I'm using NodeMCU, it would amount to $1000+.

See the link and the related repository. We import SPIFFS and are not the maintainers. There may be a GC issue giving a false high. What SPIFFS is all about is giving a usable basic FS on the < 5 cent flash device that is on most ESP modules. If you don't want the additional cost / power draw then you need to work within the H/W limitations. There's no free lunch. Sorry.

Your comments and attitude are unbelievably "helpful", for what is a legitimate bug report. If you have nothing good to say, don't say it. And if you don't want bug reports, don't solicit them.

There are all kinds of ways to mitigate this, some (e.g. documentation) would take 5 minutes.

This is a community project. You are using the Lua firmware because of the hundreds, possibly by now thousands, of hours that I and others have contributed for nothing. If you want to suggest a clarification to the documentation, then please read Contributing for how you can do that 5 minutes of work yourself.

Your work is appreciated. Your rudeness is not. I'd be happy to contribute that bit if that's the way to do it. I think there are probably better programmatic ways that actually fix the bug though.

What is the point of simply ignoring bugs? This is an undocumented and unpredictable failure of a fairly basic function, and "don't use the filesystem" is not at all a good way to mitigate that.

Submit the PR then. If you think that there is an issue with SPIFFS, I've given you the project link. Raise the issue yourself with the maintainers.

Actually, I'm raising an issue with file.write(). Is the file module not part of nodemcu?

file.write() calls SPIFFS to do the work. It is essentially a wrapper. Your issue isn't with what the file module does; it is with the underlying SPIFFS functionality.

Thanks, that's useful to know.

@marcelstoer @devsaurus @pjsg @jmattsson

I don't think this should be simply closed. It's still clearly a bug and NOT the way things are intended to work, as @TerryE is saying. There's even a bug already posted on the spiffs repository.

At the very least, if some space is simply unusable, the nodemcu file.write() wrapper could prevent writes when only a certain amount of space remains on SPIFFS. This would at least prevent the entire filesystem from becoming unstable.
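For illustration, an interim Lua-level guard along these lines would have the same effect (a rough, untested sketch; the MIN_FREE threshold is an arbitrary safety margin, not a value derived from SPIFFS internals):

-- Refuse to append once the reported free space drops below a safety margin.
local MIN_FREE = 64 * 1024   -- arbitrary soft limit, tune per module

local function safe_append(name, data)
    local remaining = file.fsinfo()
    if remaining < MIN_FREE then
        return nil, "filesystem nearly full, refusing to write"
    end
    local f = file.open(name, "a")
    if not f then
        return nil, "could not open " .. name
    end
    local ok = f:write(data)
    f:close()
    if not ok then
        return nil, "write failed"
    end
    return true
end

-- Usage: local ok, err = safe_append("log.csv", line); if not ok then print(err) end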

More broadly, I'm disappointed in your approach to bug handling. I've spent some time building a bug report. I didn't ask for it to be fixed, I simply submitted the bug. I might even be willing to work on fixes myself. Unilaterally putting that down, closing the bug report, and complaining that there is no free lunch is not community-oriented; it's rather ham-fisted, even juvenile behavior. I would think this is not at all in the interest of collaboration, and I would hope that you all reconsider how you are interacting.

Hi @mickley, thanks for the detailed bug report. Apologies for the rough ride here, and that you've felt your effort go unappreciated.

While Terry is right in that we don't pretend to grok the SPIFFS internals, and it's up to its maintainer to fix the underlying issue, I do agree it's not a good thing for us to knowingly ship broken things! Your suggestion of making the wrapper itself report an error early has merit. It's been quite some time since I touched the SPIFFS side of things though. @pjsg what are your thoughts? Do you think that would work, or would the free-size check overly pessimise all writes?
Between this bug report and the one on the SPIFFS repo we do have a reasonable idea on where we'd have to aim for the cut-off I think. Oh, and didn't I see something about Espressif taking on SPIFFS too? Or was I imagining that?

To clarify, my comment about "rough ride" was in relation to the SPIFFS behaviour and should not be taken as me denigrating @TerryE in any way here. Apologies for that potential misreading!

Thank you @jmattsson, much appreciated.

From my fiddling, it seems that the last ~25% of the filesystem space is problematic to some extent, but more testing is probably wise; it may be somewhat situation-dependent. I haven't extensively explored the effects of writing different-sized chunks, or of creating a file rather than appending to one. I've shown there's a difference in where the writes fail on a 1MB vs a 4MB module, but I haven't tested 16MB. I think I've noticed similar problems with a 512 KB module before though.

In our actual use case, writes failed with ~8 files, of which the large ones were 400 KB and 2 MB, leaving ~800 KB of unusable space. Usually, when the filesystem becomes unstable after writes fail, there's a single small block in one or both of those large files that becomes corrupted and unreadable, and not at the end of the file.

@mickley It's been years since I last looked into SPIFFS, but as Terry said, it does come with a lot of trade-offs and you end up with a certain amount of wastage (dependent on the usage pattern IIRC). My big concern is that horrendous time increase that you've reported a while before everything fails.

It would be very interesting to see if this behaviour can be replicated using the SPIFFS test suite. If you can, have a play with it (grab it from the main SPIFFS repo), and configure the filesystem for:

  • 256 bytes logical page size
  • 8k logical block size
  • set SPIFFS_ALIGNED_OBJECT_INDEX_TABLES to 1
  • set SPIFFS_MAX_GC_RUNS 256
  • maybe disable SPIFFS_CACHE_STATS and SPIFFS_GC_STATS (don't know if this impacts anything important other than adding diagnostics overheads)

to match the NodeMCU settings.

Also, I'm now wondering whether we've overtuned the GC runs and that's why we're getting those horrendous times towards the end. If you're building NodeMCU locally, maybe try twiddling the SPIFFS_MAX_GC_RUNS setting in app/spiffs/nodemcu_spiffs.h and see how/if that impacts things?

@mickley, I have referred you now twice to the Wiki that the SPIFFS maintainers have written. Have you read it yet? Have you reviewed the SPIFFS source code?

If you want to benchmark write performance then don't open and close the file for each write chunk, and don't query the filesystem info for each chunk either. Also note their observation under Checking filesystem that the filesystem info call only gives an _indication_ of the remaining space, as this also includes metadata content.

The per-write close causes the FS to flush partly-used logical pages to flash, which in turn causes a high GC load.
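For comparison, a loop along these lines keeps the file open and moves the fsinfo() call outside the timed section (an untested sketch, not a drop-in replacement for the test code above; the chunk size and count are arbitrary):

-- Open once, time a batch of writes, then query fsinfo() a single time.
local chunk = string.rep("0123456789ABCDEF", 64)   -- 1024 bytes
local f = file.open("test.txt", "a")
if f then
    local start = tmr.now()
    for i = 1, 100 do
        if not f:write(chunk) then
            print("write failed at chunk " .. i)
            break
        end
    end
    local elapsed = (tmr.now() - start) / 1000      -- milliseconds
    f:close()
    local remaining, used = file.fsinfo()
    print("Wrote up to 100 chunks in " .. elapsed .. " ms | Used: " .. used .. " | Remain: " .. remaining)
end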

Also please avoid using ad hominem criticism of me or the other maintainers. This is quite unnecessary.

@TerryE I read the wiki the first time you linked to it. Comments such as "Have you read it yet" are rude and unhelpful. I will stand by my contention that behavior such as this does not belong here. I'm disappointed that you continue to display it. As a project collaborator especially, you're setting a bad example.

You seem entirely hung up on the write performance aspect of this. I'm not. I'm much more concerned with writes failing completely and potentially leaving the filesystem unstable. You've yet to even acknowledge this aspect of the problem. I have no interest in benchmarking write performance, I'm only including that aspect because it seems to be an interacting factor.

The referenced SPIFFS material already answers some of the points that you are raising as issues, yet you don't seem to be aware of this; that seems to me to be a valid reason for asking whether you had read the material. Why does doing this constitute rudeness on my part?

  • The SPIFFS documentation advises that SPIFFS_info() is inaccurate because of metadata overhead, so why not just set your soft limit for capacity at, say, 80% of the nominal FS size? Then you will have a good balance of performance and stability.
  • Don't close the file after each write, as this defeats the write-through caching system, which both hits performance and causes bad logical page fragmentation, reducing the life of the flash memory chip.
  • If you want to measure bare write performance then don't embed instrumentation within the write loop that materially impacts on this performance.

These issues are known to the SPIFFS authors. If you want to lobby for architectural or functional changes, then raise them as SPIFFS issues and we will pick them up in due course once they are incorporated into a new SPIFFS release. We are not planning any such changes in the interim.


Lastly, on a point of list governance: you are seeking help here gratis, and I am here trying to provide this help, also gratis. I am not here to be accused of having an unbelievable attitude; having nothing good to say; being rude; being ham-fisted and juvenile -- all _your_ ad hominems directed at me. I don't understand why you use language that is going to drop your chance of getting help to zero. I provide my help and my contributions for free, so that developers like yourself can make free use of them. Whatever you feel about my conduct, I am trying to engage constructively, and I make my points and ask my questions for what I view as valid reasons.

This is the _first time_ on GitHub that I have found it necessary to give the following advisory. _Please engage constructively and refrain from ad hominem comments. Any further breaches will result in your account being blocked from submitting to this list and your content here will be deleted._

Thanks @TerryE

I am currently already using the workaround you suggest of setting a soft limit on the filesystem capacity; that's a sensible and easy way of mitigating the problem. If the consensus is that this is unfixable, then I'd be willing to write a note in the documentation to that end, or hack in a similar limit to file.write().

As for closing the file after each write, in my use case that's a requirement as power is cut between each write, there's no way to keep a file open. I'm not surprised that would reduce the life of the flash chip. As I mentioned before though, write performance is secondary to the problems with writes themselves failing and filesystem instability.

To clarify, I'm not asking for your help, I'm just filing a bug report. A proper fix would be nice, but I don't assume any such entitlement.

@jmattsson I'll try to run some more tests when I have some time.

As for closing the file after each write, in my use case that's a requirement as power is cut between each write, there's no way to keep a file open. I'm not surprised that would reduce the life of the flash chip.

Do you mean deep sleep or a true power cut? (See ESP8266 Low-Power Solutions.) In most cases the deep sleep option is preferable, since it uses an average 20 μA substrate current and is therefore suitable for battery operation. This current is enough to keep the RTC, the specified wake-up GPIO, and the RTC R/W memory alive. The last is particularly useful for retaining volatile state info (at least unless the battery fails) and avoids needing to write such content to SPIFFS.
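For example, a minimal sketch of that pattern with the rtcmem and node modules (the slot number and sleep interval are arbitrary choices; rtcmem must be compiled into the firmware and GPIO16 wired to RST for deep-sleep wake-up):

-- Keep a wake-up counter in RTC user memory across deep-sleep cycles
-- instead of appending it to SPIFFS on every wake-up.
local SLOT = 10                                      -- arbitrary RTC memory slot
local count = rtcmem.read32(SLOT)
if count < 0 or count > 1000000 then count = 0 end   -- first boot / garbage value
print("wake-up number " .. count)
-- ... collect and buffer data here ...
rtcmem.write32(SLOT, count + 1)
node.dsleep(60 * 1000000)                            -- sleep 60 s, then reset and re-run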

Have a look at app/spiffs/spiffs_hydrogen.h for an overview of how the FS is actually laid out. We currently use 32 × 256-byte logical pages per 8 KB block. The first page of any block is always used for object lookup metadata. The remaining 31 pages are a mix of object index and object data pages. Each file requires at least one object index page, and closing the file causes a flush which generates a churn of object index pages -- hence the metadata percentage is not determinate.
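As a rough back-of-the-envelope illustration of the fixed part of that overhead (my arithmetic from the figures above, ignoring the per-file object index pages):

-- Fixed metadata overhead implied by the layout described above.
local page_size       = 256
local block_size      = 8 * 1024
local pages_per_block = block_size / page_size    -- 32
local data_pages      = pages_per_block - 1       -- 31 (one lookup page per block)
print(1 - data_pages / pages_per_block)           -- ~0.031, i.e. ~3% before any index pages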

@TerryE

We're cutting the power completely, using an RTC alarm to drive a P-MOSFET to turn on. Even low deep-sleep currents become substantial over longer periods, especially when peripherals and a voltage regulator are involved. I believe I got down to ~60 μA using a custom-designed ESP8266 module and a regulator with a very low quiescent current. The WeMos D1 Minis are ~100 μA in deep sleep. For our uses, that's roughly equal to the average drain from when the chip is on, halving battery life.

If you're interested in the use case, you can read about it here, though it's rather tangential to this bug report.
https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13128
https://github.com/mickley/EMU

Will poke around more in SPIFFS myself when I have some more time. Appreciate the suggestions.

@TerryE Long before the Internet was a thing, I wrote in its entirety a project which ended up as 45,000 lines of code and 8,000 lines of docs. It was the work of 18 months, when I was unemployed; on the other hand, it did help me get a job.

If you are going to lead an open-source project of that size, users are sometimes irritating; but if you don't want users (in fact, I did), why publish the project?

Saying that you are working for nothing doesn't cut it; if you publish code you have a responsibility to your users. Reject that responsibility, and threaten users, and the project soon becomes toxic.

C W Rose

@cwrseck it looks to me like you are fundamentally misunderstanding a couple of things about open-source(ing) projects. However, I don't want to discuss this any further here as it would be OT. It would only further distract from the actual issue reported.
