Mbed-os: Filesystem bugs, including volume corruption

Created on 3 Jan 2018 · 62 Comments · Source: ARMmbed/mbed-os

Description

  • Type: Bugs
  • Priority: Major?

Bug

Target
LPC4088

Toolchain:
GCC_ARM - yes
ARM|IAR - untested

Toolchain version:

mbed-cli version:
1.2.2

mbed-os sha:
2b4ff78ab0a52ef1dc3f2998908453c595e2b2c0

Expected behavior

  1. readdir() returns NULL to terminate a directory listing.
  2. FATFileSystem writes files to SD Card at a relatively consistent speed.
  3. FATFileSystem never corrupts the file system on the SD Card.

Actual behavior

  1. When operating on a directory on a FAT32 volume containing the maximum number of files per directory (65534), readdir() never returns NULL.
  2. Increasing the number of files in a directory slows down FATFileSystem writes.
  3. Under certain circumstances, FATFileSystem (or SDBlockDevice?) corrupts the file system on the SD Card. (i.e. It can no longer be read by FATFileSystem, nor by a Windows PC.)

Steps to reproduce

I forked ARMmbed/mbed-os-example-filesystem to create demo code for items 1 and 2. There are tags (linked below) for each of these. I don't have code to share for 3 yet, but I give details below.

All testing was done with an Embedded Artists' LPC4088 QuickStart Board connected to an LPC4088 Experiment Base Board (EBB). The EBB is used for its SD Card slot. Jumpers are configured to use the SPI interface to the SD Card (per Figure 7 in the EBB user guide (PDF)).

I made a small patch to mbed OS so that GCC_ARM will malloc() from the LPC4088's external SDRAM like the mbed online compiler does. Demo programs 1 & 2 use megabytes of memory for a HeapBlockDevice. You should be able to replace this with an SDBlockDevice to avoid the large memory use and test on another target device, but of course this is slower to run.

Of course, add your choice of Serial device to output printf() text.

  1. bug01-mbed-os-5.7.1-65534-files-dirent-ne-null (1fa0b0b)
  2. bug02a-mbed-os.5.7.1-write-slowdown-empty-files (6ab378b)
    bug02b-mbed-os.5.7.1-write-slowdown-small-files (d891ad3)
  3. Using an 8 GB SD Card, write many files to a directory. After each file write, append text via fprintf() to a metadata file in the directory, and a log file in the parent directory.

    • Circumstance 1: Files are exactly 512 KiB each, written via fwrite(). Filesystem is corrupted after writing to the 8160th file. I observed this particular failure mode on at least two units.

    • Circumstance 2: Files are approximately 7.5 MB each, written via fprintf(). Filesystem is corrupted after filling up the volume. Observed on one unit so far.

closed_in_jira storage mirrored

All 62 comments

@deepikabhavnani @geky Please review

Note, #3 takes about two days to run one scenario on the dev board. I haven't yet tried pre-filling the SD Card, though - probably should. 😄

3 may have something to do with me having a logfile open, the volume fills up from another file, then I try to write to the logfile again. Further investigation required.

@bmcdonnell-ionx - Thanks for reporting the issues.
I have looked into 1. and it seems logic at https://github.com/ARMmbed/mbed-os/blob/master/features/filesystem/fat/FATFileSystem.cpp#L638 is incorrect

ChanFS reports FR_OK for end of directory, hence the check should be just if condition and not else if

-- else if (finfo.fname[0] == 0) {
++ if (finfo.fname[0] == 0) {
            break;
        } 

I do not have target device with that big HEAP memory, and testing with SD card will take long. Will confirm once I have verified.
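The end-of-directory convention discussed above can be sketched on a host PC. FatFs documents that `f_readdir()` signals the end of a directory by returning `FR_OK` with an empty name, not by returning an error. The types and the `mock_readdir()` function below are hypothetical stand-ins, not the FatFs headers and not the Mbed OS code; they only illustrate a loop that checks for errors first and for the empty name second.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the FatFs types (NOT the real headers).
enum FRESULT { FR_OK = 0, FR_DISK_ERR = 1 };
struct FILINFO { char fname[13]; };   // room for an 8.3 name plus NUL

// Mock directory: a list of names and a read position.
struct MockDir {
    std::vector<std::string> names;
    std::size_t pos;
};

// Mimics the documented convention: at end of directory, return FR_OK
// and store an empty string in fname.
FRESULT mock_readdir(MockDir &dir, FILINFO &info)
{
    if (dir.pos < dir.names.size()) {
        const std::string &n = dir.names[dir.pos++];
        assert(n.size() < sizeof info.fname);
        n.copy(info.fname, n.size());
        info.fname[n.size()] = '\0';
    } else {
        info.fname[0] = '\0';         // end-of-directory marker
    }
    return FR_OK;
}

// Counts entries; stops on an error or on the empty-name marker.
std::size_t count_entries(MockDir &dir)
{
    std::size_t count = 0;
    FILINFO info;
    for (;;) {
        FRESULT res = mock_readdir(dir, info);
        if (res != FR_OK) {
            break;                    // real code would report the error
        }
        if (info.fname[0] == '\0') {
            break;                    // normal termination
        }
        ++count;
    }
    return count;
}
```

Calling `count_entries()` on a mock directory holding three names returns 3, and on an empty one returns 0. Whether this ordering of checks is sufficient for the 65534-entry case is not established by this sketch.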

  2. Increasing the number of files in a directory slows down FATFileSystem writes.

Yes, it will. In your example, all files are created in the same directory, and on each open call every entry in that directory is scanned before the new file is created. See https://github.com/ARMmbed/mbed-os/blob/master/features/filesystem/fat/ChaN/ff.cpp#L3046

You can split files in multiple directories to get better performance.
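To make the suggestion concrete, here is a minimal sketch under two assumptions: that creating a file scans every existing entry in its directory (as described above), and that files are grouped into numbered subdirectories. The function names, the subdirectory naming scheme, and the `files_per_dir` parameter are all hypothetical and are not part of Mbed OS or FatFs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Total entries examined when n files are created one by one in a single
// directory, assuming each create scans all entries already present.
std::uint64_t scan_cost(std::uint64_t n)
{
    return n * (n - 1) / 2;
}

// Hypothetical layout: file `index` goes into subdirectory
// index / files_per_dir, with both names written as 8 hex digits.
std::string sharded_path(const std::string &root, std::uint32_t index,
                         std::uint32_t files_per_dir)
{
    assert(files_per_dir > 0);
    char buf[32];
    std::snprintf(buf, sizeof buf, "/%08x/%08x.bin",
                  static_cast<unsigned>(index / files_per_dir),
                  static_cast<unsigned>(index));
    return root + buf;
}
```

Under this model, 8192 files in one directory cost `scan_cost(8192)` = 33,550,336 entry checks, while 32 subdirectories of 256 files each cost 32 × `scan_cost(256)` = 1,044,480, ignoring the lookup of the subdirectory itself. For example, `sharded_path("/fs/fs-test", 0x1fe0, 256)` yields `/fs/fs-test/0000001f/00001fe0.bin`.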

We would appreciate a PR to improve performance here, else we might pick it up in future based on priorities.

OK, so @deepikabhavnani is working on #1, and has explained #2.

We would appreciate a PR to improve performance here [#2], else we might pick it up in future based on priorities.

I don't anticipate doing so myself, but for reference, how could one improve the performance here?

@bmcdonnell-ionx - @geky might help you with that, else you can add query in ChanFs community (http://elm-chan.org/fsw/ff/bd/) and see if you can get some help.

I've published test cases for the following. mbed team (@deepikabhavnani, @geky): Is this enough information for you to troubleshoot the issue?

  3. Under certain circumstances, FATFileSystem (or SDBlockDevice?) corrupts the file system on the SD Card. (i.e. It can no longer be read by FATFileSystem, nor by a Windows PC.)
  3. Using an 8 GB SD Card, write many files to a directory. After each file write, append text via fprintf() to a metadata file in the directory, and a log file in the parent directory.

    • Circumstance 1: Files are exactly 512 KiB each, written via fwrite(). Filesystem is corrupted after writing to the 8160th file. I observed this particular failure mode on at least two units.

bug03a-mbed-os-5.7.2-filesystem-corruption (650c265)
bug03b-mbed-os-5.7.2-filesystem-corruption (167f1e1)

Notes:

  • I'm using malloc() to allocate a 512 KiB block of memory. With my mbed-os patch, malloc() goes to external memory with GCC_ARM.
  • This is imitating an unpublished program of mine where the failure occurred, which is why version a includes faux log and checksum files.

    • Results of some test runs are summarized in the table below. I used identical board pairs with 8 GB SD cards. (Each board pair is denoted as a numbered "unit" in the table.)

  • As before, add your Serial device for console output.

| Run# | program | Date | Unit# | mbed OS ver# | Result |
| --- | --- | --- | --- | --- | --- |
| 1 | (unpublished) | 12/22 | 2 | 5.7.0 (maybe 5.7.1) | Failed to re-open the log after writing the file 0x00001fe0.bin (8160 decimal). Writing that file may have failed too. There was also a checksum error when reading back the file two loop iterations before the failure. Filesystem corrupted. |
| 2 | (unpublished) | 12/22 | 3 | 5.7.0 (maybe 5.7.1) | Same as run 1. |
| 3 | (unpublished) | 1/6 | 1 | 5.7.2 | Same as run 1, except it failed one iteration earlier (00001fdf.bin). |
| 4 | (unpublished) | 1/6 | 2 | 5.7.2 | Same as run 1, except it failed much earlier (0000191f.bin). |
| 5 | bug03a (650c265) | 1/6 | 2 | 5.7.2 | Same as run 1. |
| 6 | bug03b (167f1e1) | 1/7 | 2 | 5.7.2 | Failed to create 00001fe2.bin. Filesystem corrupted. |
| 7 | bug03b (167f1e1) | 1/7 | 3 | 5.7.2 | Same as run 6. |

I have another test run executing now without the faux log and checksum files; I'll report results when I have them. If I have time I may try smaller fwrite()s to reduce RAM usage, and see if the results are the same.

Test runs of my unpublished program did not fail or corrupt the filesystem on 2 GB or 4 GB SD Cards, which can't fit the number of files it took to get to the failure demonstrated above.
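The capacity remark above can be checked with a little arithmetic. This sketch assumes card sizes are nominal decimal gigabytes (10^9 bytes) and ignores filesystem overhead, so it gives an upper bound only; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on how many files of file_bytes fit in capacity_bytes,
// ignoring FAT tables, directory entries, and reserved sectors.
std::uint64_t max_files(std::uint64_t capacity_bytes, std::uint64_t file_bytes)
{
    assert(file_bytes > 0);
    return capacity_bytes / file_bytes;
}

const std::uint64_t kFileBytes = 512ULL * 1024ULL;   // 512 KiB per test file
const std::uint64_t kGB        = 1000000000ULL;      // nominal (decimal) GB
```

With these assumptions, `max_files(2 * kGB, kFileBytes)` is 3814 and `max_files(4 * kGB, kFileBytes)` is 7629, both below the roughly 8160 files at which the failures in the table appeared, whereas `max_files(8 * kGB, kFileBytes)` is 15258.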

I edited my comment above, including the table. I added the version (03b) without the faux log and checksum files.

I'm experimenting with pre-filling the test directory. I'll update later.

OK. Finally I published:

mbed team (@deepikabhavnani, @geky) - is this test good enough for you to use to troubleshoot?

This version doesn't require the external memory. And it runs much more quickly - around 15 minutes on my board. But you must pre-populate the files.

To pre-populate the SD Card, with your PC, create a folder fs-test. Inside that folder, create 8128 files of 512 KiB each. To follow my naming convention, they are named with hex numbers, 00000000.bin through 00001fbf.bin. The data doesn't matter; I just used /dev/zero as the source.

A bash script like this should do the trick:

#!/usr/bin/env bash

mkdir -p fs-test
cd fs-test || exit 1

# create the first file (512 KiB of zeros)
dd if=/dev/zero of=00000000.bin bs=512K count=1

# make copies: 00000001.bin through 00001fbf.bin (8128 files in total)
for f in $(printf '%08x.bin\n' $(seq 1 8127))
do
   echo "$f"
   cp 00000000.bin "$f"
done

I recommend you create the folder on your hard drive, so you can just copy it repeatedly for multiple test runs. If using Windows, I recommend using Windows Explorer to copy/paste the folder to the SD Card. (I had an issue where the mbed device isn't seeing all the files under certain circumstances, such as when things were moved from one location to another within the SD Card, or when files were copied to the SD Card using Cygwin. If I can nail it down, I'll report here or in another issue. But Ctrl+C/Ctrl+V on the fs-test folder in Windows Explorer to copy it from the hard drive to the SD Card is working reliably for me on Windows 10.)

I did three test runs, on three different units, and on all three it failed to create file 00001fe2.bin, and then corrupted the filesystem. (i.e. You then put it in your PC, and it tells you to format it.)

(Sorry it took so long. I had tried changing several things at once, and I couldn't reproduce the errors, so I made one change at a time. With test runs taking 1+ days, it took a while...)

When operating on a directory on a FAT32 volume containing the maximum number of files per directory (65534), readdir() never returns NULL.

@bmcdonnell-ionx Were you able to create 65534 files successfully on HEAP partition/SD card? I was trying with 8GB SD card formatted using FATFileSystem::format() API and using host machine for file creation, below is my observation.

  1. Linux system created 32765 files in any directory and then failed with error (No space left on device). Creation of files in another directory was allowed. Read was successful with application on mbed board.
  2. Windows created 37,767 files and then failed with error. readdir does not return NULL. Using the same SD card on Linux machine, ls -l hangs and does not list the filenames.

Oops - somehow the call to sdram_init() remained in my bug03d release. Not sure how that happened. Skip that one; use this one:

@bmcdonnell-ionx - I already did those changes and tried once. Below is the observation, I will be trying again with format using FATFileSystem::format()

Steps:

  1. Format on windows
  2. 8,128 - 512KB each created on linux
  3. Insert in windows and check - Windows complains after linux dump of files as well. Did scan and repair - no error found. (Windows pop error might be because of *.bin file name. TODO: Rename all files to .txt and try later)
  4. Edited test to create files from 0x1fc0 (8128) - https://gist.github.com/deepikabhavnani/cdfc66003b5f213b31551bd02338b378

Output:
Open logfile /fs/fs-test/log.txt.
Create test directory /fs/fs-test/00000000.
Create checksums file /fs/fs-test/00000000/checks.txt.
/fs/fs-test/00001fc0.bin 13053 ms - Because overwrites
/fs/fs-test/00001fc1.bin 12666 ms
|
|
/fs/fs-test/00002200.bin 12666 ms and still going on (Killed)

Next Step: Try format with FATFileSystem::format() and repeat steps

@deepikabhavnani, you bring up a good point - I did not track which methods I used to format the SD Cards as it related to the tests. I think I mostly used Windows 10 to format them (with "default allocation size"), but occasionally used the FATFileSystem::format() function on my mbed device.

I already did those changes and tried once.

Which changes?

Looks like your gist was based on my Rev c. I think you'll find my latest Rev e easier to work with.

Windows complains after linux dump of files as well.

I don't have a Linux machine to test with, so I can't duplicate that test. I'm using Cygwin.

Windows pop error might be because of *.bin file name.

Why would that matter? I don't recall it giving me any trouble (Windows 10).

@deepikabhavnani,

Were you able to create 65534 files successfully on HEAP partition/SD card?

Yes, this demo code does that with empty files on the HeapBlockDevice.

  1. bug01-mbed-os-5.7.1-65534-files-dirent-ne-null (1fa0b0b)

@bmcdonnell-ionx I was able to reproduce issue 3 with FATFileSystem::format(), more on that once we find out actual root cause of corruption. Thanks.

@bmcdonnell-ionx - Can you please verify the fix in PR#5829 for issue 3?

@deepikabhavnani - Does it matter if the card was formatted by the mbed device or the PC?

No, it should not matter. Different format utilities lay out the blocks differently. With the FAT filesystem format, we quickly hit the case where a write to block 0x800000 was done. With other methods, it might take more time or some extra file writes.

I'm running some other testing on my devices now. I'll probably try test 3e tomorrow (abbreviated with prepopulation), and maybe run an extended test (without prepopulation) over the weekend.

Thanks.

@bmcdonnell-ionx - Thanks for helping us find this issue

@deepikabhavnani - You're welcome. Thanks for being responsive and fixing things. :)

@deepikabhavnani - Until now, I've been testing with mbed-os-5.7.2. Should I merge or cherry-pick your PR to test it?

The merge brings in a lot of other stuff, as shown below.

$ git merge fat_issue_5780_3
Auto-merging targets/TARGET_STM/rtc_api.c
Auto-merging features/FEATURE_BLE/ble/BLE.h
Merge made by the 'recursive' strategy.
 TESTS/host_tests/rtc_calc_auto.py                  |  138 ++++++++++++++
 TESTS/mbed_hal/rtc_time/main.cpp                   |  329 ++++++++++++++------------------
 TESTS/mbed_hal/rtc_time_conv/main.cpp              |  214 +++++++++++++++++++++
 TESTS/mbedmicro-rtos-mbed/CircularBuffer/main.cpp  |  469 +++++++++++++++++++++++++++++++++++++++++++++
 features/FEATURE_BLE/ble/BLE.h                     |    9 +
 features/FEATURE_BLE/ble/generic/GenericGap.h      |  297 +++++++++++++++++++++++++++++
 features/FEATURE_BLE/source/BLE.cpp                |   24 +++
 features/FEATURE_BLE/source/generic/GenericGap.cpp | 1091 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 features/FEATURE_LWIP/lwip-interface/lwip_stack.c  |   43 +++++
 features/filesystem/fat/FATFileSystem.cpp          |   14 +-
 platform/CircularBuffer.h                          |   30 ++-
 platform/CriticalSectionLock.h                     |   49 ++++-
 platform/mbed_mktime.c                             |  117 +++++++-----
 platform/mbed_mktime.h                             |   54 ++++--
 targets/TARGET_Atmel/TARGET_SAM_CortexM4/rtc_api.c |    9 +-
 targets/TARGET_NUVOTON/TARGET_M451/rtc_api.c       |    9 +-
 targets/TARGET_NUVOTON/TARGET_M480/rtc_api.c       |    7 +-
 targets/TARGET_NUVOTON/TARGET_NANO100/rtc_api.c    |    9 +-
 targets/TARGET_NUVOTON/TARGET_NUC472/rtc_api.c     |    9 +-
 targets/TARGET_NXP/TARGET_LPC176X/rtc_api.c        |    9 +-
 targets/TARGET_NXP/TARGET_LPC408X/rtc_api.c        |    9 +-
 targets/TARGET_NXP/TARGET_LPC43XX/rtc_api.c        |    9 +-
 targets/TARGET_RENESAS/TARGET_RZ_A1H/rtc_api.c     |    9 +-
 targets/TARGET_RENESAS/TARGET_VK_RZ_A1H/rtc_api.c  |    9 +-
 targets/TARGET_STM/rtc_api.c                       |    7 +-
 tools/config/__init__.py                           |   26 ++-
 tools/test.py                                      |    2 +-
 tools/test_configs/__init__.py                     |    7 +-
 28 files changed, 2703 insertions(+), 305 deletions(-)
 create mode 100644 TESTS/host_tests/rtc_calc_auto.py
 create mode 100644 TESTS/mbed_hal/rtc_time_conv/main.cpp
 create mode 100644 TESTS/mbedmicro-rtos-mbed/CircularBuffer/main.cpp
 create mode 100644 features/FEATURE_BLE/ble/generic/GenericGap.h
 create mode 100644 features/FEATURE_BLE/source/generic/GenericGap.cpp

The fix did not work when I re-ran the test. Let me look into why I am still getting addr 0x0 for block 0x800000. Maybe some compiler optimization.
@bmcdonnell-ionx - You can cherry-pick. But please wait till I confirm
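The "addr 0x0 for block 0x800000" symptom is consistent with a 32-bit overflow when a block number is converted to a byte address. The sketch below assumes 512-byte blocks and a multiplication carried out in 32 bits. It is an illustration of that arithmetic only, not the code from the PR, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Byte address computed entirely in 32 bits. With 512-byte blocks,
// block 0x800000 maps to 2^23 * 2^9 = 2^32, which wraps around to 0.
std::uint32_t byte_addr_32(std::uint32_t block, std::uint32_t block_size)
{
    assert(block_size > 0);
    return block * block_size;        // unsigned wraparound
}

// Widening one operand before multiplying keeps the full 64-bit product.
std::uint64_t byte_addr_64(std::uint32_t block, std::uint32_t block_size)
{
    assert(block_size > 0);
    return static_cast<std::uint64_t>(block) * block_size;
}
```

Here `byte_addr_32(0x800000, 512)` is 0, while `byte_addr_64(0x800000, 512)` is 0x100000000 (4 GiB). A write that wraps to address 0 would land at the start of the card, where the boot sector normally lives, which would fit the "PC asks to format the card" symptom. Note also that 4 GiB corresponds to 8192 files of 512 KiB, close to the 8160th file where the failures were reported; the gap being filesystem overhead is a plausible reading, not something confirmed in this thread.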

@bmcdonnell-ionx - Fix in place, you can verify as per your convenience. Thanks

@deepikabhavnani

Fix in place

Where?

where?

I updated the PR with fix https://github.com/ARMmbed/mbed-os/pull/5829

The PR looks promising. I ran a short test on two units using 3e. They are up to

/fs/fs-test/0000207b.bin  23119 ms

and

/fs/fs-test/000020ba.bin  16940 ms

so far. (Previously it would fail at 00001fe2.bin.)

Before prepopulating the files, I formatted one SD Card using the Windows PC (default allocation size), and the other on the device using FATFileSystem::reformat().

I have to interrupt them now to take them home for the weekend. I'll start over and let it run for longer there.

Status update, by bug#:

  1. @deepikabhavnani, what is the status?
  2. Issue inherent in implementation.
  3. I'm testing the PR #5829.

The good news

The two units completed the test (bug03e) on 8 GB micro SD cards without error. And I see that you've merged it into master, so that's good. I tentatively think we can consider bug03 resolved - unless you think the following is somehow related.

The questionable news

I cherry-picked PR #5829 into my unpublished test program, ran that on an mbed target device with a 2 GB SD card, and got some weird (bad) results. I don't think the following problematic results have anything to do with the PR. The questions here are:

  • Is this just a bad SD card, or is there potentially another filesystem (or block device) software issue?
  • If the latter, can/should code be added to mitigate data loss on bad SD cards?

Program description

My unpublished test program creates or reuses the main test directory sd-test/, and creates the next available numbered subdir (e.g. 00000000/) under that. It creates 512-KiB binary files containing test patterns in the directory, appends their MD5 sums to a text file, reads back each test file and verifies its MD5 sum. Anytime a write operation fails, the program creates the next numbered subdir (e.g. 00000001/), and continues, until the volume is nearly full. Results are output to the console, and captured in sd-test/log.txt.
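For readers who want to try a similar write-then-verify cycle, here is a minimal host-side sketch using standard C stdio. It is not the author's unpublished program: it uses a 32-bit FNV-1a checksum in place of MD5 to stay self-contained, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// 32-bit FNV-1a checksum (a simple stand-in for MD5 in this sketch).
std::uint32_t fnv1a(const std::vector<unsigned char> &data)
{
    std::uint32_t h = 2166136261u;
    for (unsigned char b : data) {
        h ^= b;
        h *= 16777619u;
    }
    return h;
}

// Writes the whole buffer; returns false if open, write, or close fails.
bool write_file(const char *path, const std::vector<unsigned char> &data)
{
    assert(path != nullptr);
    std::FILE *f = std::fopen(path, "wb");
    if (!f) {
        return false;
    }
    bool ok = std::fwrite(data.data(), 1, data.size(), f) == data.size();
    ok = (std::fclose(f) == 0) && ok;   // close errors count as failures
    return ok;
}

// Reads `size` bytes back and compares their checksum with `expected`.
bool verify_file(const char *path, std::size_t size, std::uint32_t expected)
{
    assert(path != nullptr);
    std::FILE *f = std::fopen(path, "rb");
    if (!f) {
        return false;
    }
    std::vector<unsigned char> buf(size);
    bool ok = std::fread(buf.data(), 1, size, f) == size;
    std::fclose(f);
    return ok && fnv1a(buf) == expected;
}
```

A test loop would call `write_file()` for each numbered file, record `fnv1a()` of the pattern, then call `verify_file()` and move to the next subdirectory on any failure, as the description above outlines.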

Results observed

Summary

Some failures to write, some failures to read, some verification failures after read-back, and some files/dirs appear to have been destroyed by later writes. Obviously that last part is the most concerning.

Details

The test program reported on the console:

  • sd-test/00000000/

    • Successfully created and verified files 00000000.bin through 0000069b.bin.

    • Failed to open file 0000069c.bin for read-back.

    • Failed to create 0000069d.bin (so try next subdir).

  • sd-test/00000001/ through 00000005/

    • Created dir, could not create the first file 00000000.bin.

  • sd-test/00000006/

    • Created file 00000000.bin; read it back - MD5 sum mismatch.

    • Successfully created & verified files 00000001.bin through 000007c0.bin.

    • Could not create file 000007c1.bin.

After-test SD card examination in Windows PC:

  • Dirs sd-test/00000000/ through 00000005/ don't exist. (Recall that sd-test/00000000/ contained a bunch of files.)
  • sd-test/log.txt only contains results from files in sd-test/00000006/.
  • There are several unreadable files in the root of the SD card, as shown below.

2018-01-14--2gb-run01-windows-listing-01-sd-test

me@pc /cygdrive/e
$ ls -alF -R
.:
total 0
drwxr-xr-x 1 me  Domain Users 0 Jan  1  1980  ./
dr-xr-xr-x 1 me  Domain Users 0 Jan 15 11:24  ../
drwxr-xr-x 1 me  Domain Users 0 Jan  2  2098  sd-test/
drwxr-xr-x 1 me  Domain Users 0 Jan 14 12:12 'System Volume Information'/

./sd-test:
ls: cannot access './sd-test/µ': No such file or directory
ls: cannot access './sd-test/µ': No such file or directory
ls: cannot access './sd-test/µ': No such file or directory
ls: cannot access './sd-test/µ': No such file or directory
ls: cannot access './sd-test/µ': No such file or directory
ls: cannot access './sd-test/¬': No such file or directory
ls: cannot access './sd-test/µ': No such file or directory
ls: cannot access './sd-test/µ': No such file or directory
ls: cannot access './sd-test/µ': No such file or directory
ls: cannot access './sd-test/µ': No such file or directory
ls: cannot access './sd-test/µ': No such file or directory
ls: cannot access './sd-test/µ': No such file or directory
ls: cannot access './sd-test/µ': No such file or directory
ls: cannot access './sd-test/µ': No such file or directory
ls: cannot access './sd-test/╛': No such file or directory
ls: cannot access './sd-test/╛': No such file or directory
total 96
drwxr-xr-x 1 me  Domain Users     0 Jan  2  2098 ./
drwxr-xr-x 1 me  Domain Users     0 Jan  1  1980 ../
-????????? ? ?   ?                ?            ? ╛
-????????? ? ?   ?                ?            ? ╛
-????????? ? ?   ?                ?            ? ¬
-????????? ? ?   ?                ?            ? µ
-????????? ? ?   ?                ?            ? µ
-????????? ? ?   ?                ?            ? µ
-????????? ? ?   ?                ?            ? µ
-????????? ? ?   ?                ?            ? µ
-????????? ? ?   ?                ?            ? µ
-????????? ? ?   ?                ?            ? µ
-????????? ? ?   ?                ?            ? µ
-????????? ? ?   ?                ?            ? µ
-????????? ? ?   ?                ?            ? µ
-????????? ? ?   ?                ?            ? µ
-????????? ? ?   ?                ?            ? µ
-????????? ? ?   ?                ?            ? µ
drwxr-xr-x 1 me  Domain Users     0 Jan  2  2098 00000006/
-rw-r--r-- 1 me  Domain Users 77555 Jan  3  2098 log.txt

./sd-test/00000006:
total 1015424
drwxr-xr-x 1 me  Domain Users      0 Jan  2  2098 ./
drwxr-xr-x 1 me  Domain Users      0 Jan  2  2098 ../
-rw-r--r-- 1 me  Domain Users 524288 Jan  2  2098 00000000.bin
-rw-r--r-- 1 me  Domain Users 524288 Jan  2  2098 00000001.bin
-rw-r--r-- 1 me  Domain Users 524288 Jan  2  2098 00000002.bin

[snip]

-rw-r--r-- 1 me  Domain Users 524288 Jan  3  2098 000007c0.bin
-rw-r--r-- 1 me  Domain Users 458752 Jan  3  2098 000007c1.bin
-rw-r--r-- 1 me  Domain Users  93295 Jan  3  2098 md5sums.txt

'./System Volume Information':
total 32
drwxr-xr-x 1 me  Domain Users  0 Jan 14 12:12 ./
drwxr-xr-x 1 me  Domain Users  0 Jan  1  1980 ../
-rw-r--r-- 1 me  Domain Users 76 Jan 14 12:12 IndexerVolumeGuid
-rw-r--r-- 1 me  Domain Users  0 Jan 15 10:52 WPSettings.dat
  1. Did you try this unpublished program on 8GB card with the fix?
  2. From log it seems the Boot block and fat table are intact, corruption is now in root directory.

I ran the unpublished program on another 2 GB card - passed, no problems.

  1. Did you try this unpublished program on 8GB card with the fix?

I just started it this morning. It'll be a couple days.

  1. From log it seems the Boot block and fat table are intact, corruption is now in root directory.

Might one argue that the FAT table is not "intact" on account of those untouchable, unreadable files?

Regardless, is there any implication or anything that follows from your comment here?

I just ran a test using F3 of the suspect SD card on my PC in Cygwin. It passed, as shown below.

I will re-run tests with the suspect card on an mbed device when I get a chance.

````
$ f3write /cygdrive/e/
F3 write 7.0
Copyright (C) 2010 Digirati Internet LTDA.
This is free software; see the source for copying conditions.

Free space: 1.82 GB
Creating file 1.h2w ... OK!
Creating file 2.h2w ... OK!
Free space: 1.69 MB
Average writing speed: 9.44 MB/s
````

````
$ f3read /cygdrive/e/
F3 read 7.0
Copyright (C) 2010 Digirati Internet LTDA.
This is free software; see the source for copying conditions.

SECTORS ok/corrupted/changed/overwritten
Validating file 1.h2w ... 2097152/ 0/ 0/ 0
Validating file 2.h2w ... 1715990/ 0/ 0/ 0

Data OK: 1.82 GB (3813142 sectors)
Data LOST: 0.00 Byte (0 sectors)
Corrupted: 0.00 Byte (0 sectors)
Slightly changed: 0.00 Byte (0 sectors)
Overwritten: 0.00 Byte (0 sectors)
Average reading speed: 21.91 MB/s
````

$ ls -alF /cygdrive/e
total 1906592
drwxr-xr-x 1 me Domain Users          0 Jan  1  1980 ./
dr-xr-xr-x 1 me Domain Users          0 Jan 16 13:30 ../
-rw-r--r-- 1 me Domain Users 1073741824 Jan 16 13:25 1.h2w
-rw-r--r-- 1 me Domain Users  878586880 Jan 16 13:27 2.h2w
drwxr-xr-x 1 me Domain Users          0 Jan 15 12:38 'System Volume Information'/

I will re-run tests with the suspect card on an mbed device when I get a chance.

I re-ran the unpublished program on the suspect 2 GB card. It passed, no problems. (!)

I have a hypothesis: the suspect SD card had some blocks that were _just_ failing. Those failures induced data loss as they occurred, and the SD card (itself) detected these failures, marked those blocks unusable internally, and remapped some other blocks to those logical addresses internally.

Can anyone comment on the plausibility?

During the first run (failure), do you remember the initial state of card? Was it formatted or you had ran previous example executed? (say published -1/2/3). Could that be because of some combination testing?
First run shows write was successful for ~800MB, failed retried (may be something got deleted here?) and then write again till ~800MB. Does that mean card was half filled?

Assuming second run was after format and was successful. During format, yes SD cards do detect bad blocks and mark them unusable. But same happens during write as well, if write fails on SD card, card will write the data to other block or send failure status response in next command.

During the first run (failure), do you remember the initial state of card? Was it formatted or you had ran previous example executed? (say published -1/2/3).

I may have used it as-formatted by the factory. I may have done some other test runs with it beforehand (probably the same unpublished program if so), but deleted all the files and folders before starting the test run that failed.

Could that be because of some combination testing?

I don't see how.

First run shows write was successful for ~800MB, failed retried (may be something got deleted here?)

I think that's when all the metadata for the previously-written stuff got corrupted - or when the corruption was observed.

and then write again till ~800MB. Does that mean card was half filled?

I don't recall the reported free space. I think it was saying it was full or close to it.

Could that be because of some combination testing?

This could be the case, only if previous data was present on card.

I'm pretty sure I had removed it all.

Maybe I should just keep testing, and wait and see if it happens again. (Based on my hypothesis.)

@deepikabhavnani, what do you think of my suggested course of action above (wait & see on #3)?

Also, what's the status of issue #1? Did you merge a fix?

For ease of reference, here's our conversation so far on that.

I said:

  1. When operating on a directory on a FAT32 volume containing the maximum number of files per directory (65534), readdir() never returns NULL.

You said:

I have looked into 1. and it seems logic at https://github.com/ARMmbed/mbed-os/blob/master/features/filesystem/fat/FATFileSystem.cpp#L638 is incorrect

ChanFS reports FR_OK for end of directory, hence the check should be just if condition and not else if

-- else if (finfo.fname[0] == 0) {
++ if (finfo.fname[0] == 0) {
            break;
   }

I do not have target device with that big HEAP memory, and testing with SD card will take long. Will confirm once I have verified.

You said:

Were you able to create 65534 files successfully on HEAP partition/SD card? I was trying with 8GB SD card formatted using FATFileSystem::format() API and using host machine for file creation, below is my observation.

  1. Linux system created 32765 files in any directory and then failed with error (No space left on device). Creation of files in another directory was allowed. Read was successful with application on mbed board.
  2. Windows created 37,767 files and then failed with error. readdir does not return NULL. Using the same SD card on Linux machine, ls -l hangs and does not list the filenames.

I said:

Yes, this demo code does that with empty files on the HeapBlockDevice.

  1. bug01-mbed-os-5.7.1-65534-files-dirent-ne-null (1fa0b0b)

I am pretty sure file creation in SD card failed beyond 32765 files on Linux system, hence was unable to reproduce that issue. Also, I don’t have any system with huge heap to try this. Will let you know if I am working back on that issue.

For issue #3, I would say leave it if you are not able to reproduce it.

Well, you showed the code here for your proposed fix. I can test it.

I am pretty sure file creation in SD card failed beyond 32765 files on Linux system, hence was unable to reproduce that issue.

Was the card formatted by an mbed device, without your PR?

Yes

Were you creating the files directly on the SD card? Could the buggy mbed formatting be the reason why you couldn't create all the files?

Bug was not in format, still I will give it a try sometime next week.

"Buggy mbed formatting" was poor phrasing on my part. I meant that f_mkfs() calls the functions you fixed (disk_write() and disk_read()), so I figured maybe the bug would manifest in formatting too - not necessarily that it would be noticed during the format, but maybe it would be a problem when you try to write to certain areas on a large enough volume.

Anyway, if you want to decouple questions, you could try creating the empty files on your HDD, and copy them to the SD card later.

@deepikabhavnani, re #1, could you either make a commit in a branch somewhere and leave a link here, or answer the following?

I see two lines in FATFileSystem.cpp containing else if (finfo.fname[0] == 0). Should they both be changed to if (finfo.fname[0] == 0)? Is that all that needs to be changed?

@bmcdonnell-ionx - I tried creation after new format and same result.

Should they both be changed

Yes, but that change might not help, I missed the return statement for failure case, so checking fname[0] only in case of success makes sense.

@deepikabhavnani, if you want me to test, can you just put the commit in a branch somewhere, and point me to it?

@deepikabhavnani, can you give a status update? Do you think you know what needs to be fixed for bug01? If so, can you please commit in a branch or tag somewhere, and point me to it so I can test?

I'm not clear on the required changes from what you've said here. I'd love to wrap this up and get the changes into Mbed OS 5.7.4.

@bmcdonnell-ionx As mentioned earlier, the proposed change might not work, as I missed the return statement in the failure case during analysis.
Also, sorry, but I am not working on filesystem issues at present, so I won't be able to help much.

Re bug01, I think the problem is with FatFs. I made a post on their forum about it.

f_readdir() bug - bad termination when dir full

ARM Internal Ref: MBOTRIAGE-296

@bmcdonnell-ionx IMHO issues 2 and 3 (slow performance on big directories and corruption of FATFS) are well known issues with FAT file system.
Regarding issue #1, looks like a bug in CHAN.
Any concerns for closing this issue?

@dannybenor,

Apparently [2] is a known inherent issue. [3] was fixed.

Agreed, [1] appears to be an issue w/ ChaN FatFs.

Any concerns for closing this issue?

Can you document [1] in Mbed OS as a known issue, until ChaN fixes it?

I reported it to them (link above), but there's been no response, other than one other user confirming the issue. Do you know if that's the right place, or any other way to encourage them to take action on it?

@bmcdonnell-ionx
I found out that the problem was addressed by ChaN on May 23 2018 for FatFs R0.13b, and a workaround has been supplied.
http://elm-chan.org/fsw/ff/patches.html
You may either upgrade the ChaN FatFs implementation in mbed-os and then deploy the patch or in my opinion (however not tested) use the current FatFs R0.13a and deploy the patch at line 1742 instead of 1728.

Do you still have any concerns regarding closing this bug?

@yossi2le, thanks for your input. I hadn't realized that ChaN had taken any action on my bug report, since there was no mention in my thread on their forum.

You may either [option 1] upgrade the ChaN FatFs implementation in mbed-os and then deploy the patch or in my opinion (however not tested) [option 2] use the current FatFs R0.13a and deploy the patch at line 1742 instead of 1728.

On your [option 1], it doesn't look like the patch has been included in any FatFs release yet.

I will try your [option 2] when I get a chance. Then hopefully I can close this with a PR. 😃

Then hopefully I can close this with a PR.

PR is merged, hence closing this. Please reopen if anything is missed out.

(Oops, I forgot to close it. Thanks!)
