ZFS: File size < on-disk size - currently unexplained, size on disk is 2-3x file size

Created on 17 Jan 2020 · 4 comments · Source: openzfs/zfs

Attachment: zdb -b filesystem022-OST21.txt
Problem description:
File size < on-disk size: currently unexplained; the size on disk is 2-3x the file size.

Observations:

  1. Possible ZFS corruption of filesystem022 across the RaidInc storage in London?
  2. When zdb checks for leaks, it walks the entire block tree, constructs the space maps in memory, and then compares them to the ones stored on disk. If they differ, it reports a leak.
    a. Presuming from the investigation below that the "space leaks" mean the pool is somehow corrupted; zdb (the ZFS debugger) has detected a large number of them.
  3. zdb did not report space leaks on the ZFS Houston SIs.
  4. Does zdb-reported leaked space mean trouble with the pool, and could it explain the file size < disk size discrepancy?
  5. Could the errors have been introduced by failovers or hardware faults?
  6. The pool at least appears inconsistent, which is supposed to never happen with ZFS. Is this indicative of a larger problem (numerous lockups, etc.)?

Investigation:
For troubleshooting, the following files, located in WEY, were selected. There are no snapshots, reservations, or quotas involved here.

[server01]$ du -h --apparent-size /lus/filesystem022/project/file_name/*
33K /lus/filesystem022/project/file_name/aux_data
19K /lus/filesystem022/project/file_name/descriptor.yaml
104G /lus/filesystem022/project/file_name/trace_data.bin
14G /lus/filesystem022/project/file_name/trace_header.bin

[server01]$ du -h /lus/filesystem022/project/file_name/*
33K /lus/filesystem022/project/file_name/aux_data
56K /lus/filesystem022/project/file_name/descriptor.yaml
237G /lus/filesystem022/project/file_name/trace_data.bin
31G /lus/filesystem022/project/file_name/trace_header.bin
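
As a cross-check of the du figures, GNU stat can report both numbers straight from the inode. A minimal example against the largest file above (%s is the apparent size, %b the allocated block count, %B the size of each such block):

[server01]$ stat -c 'apparent=%s bytes, allocated=%b blocks of %B bytes' /lus/filesystem022/project/file_name/trace_data.bin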

  1. Copied the dataset onto the same storage:
    • The disk sizes differ.
    • The checksums match.

[server01]$ cp -rp /lus/filesystem022/project/file_name /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC

[server01]$ md5sum /lus/filesystem022/project/file_name/*
md5sum: /lus/filesystem022/project/file_name/aux_data: Is a directory
f861b60d2b1b844e5ae252345aa20497 /lus/filesystem022/project/file_name/descriptor.yaml
e8ac57c241e52b38b60907e4e767b451 /lus/filesystem022/project/file_name/trace_data.bin
0826bc74e525697d769248aabcb195cd /lus/filesystem022/project/file_name/trace_header.bin

[server01]$ md5sum /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/*
md5sum: /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/aux_data: Is a directory
f861b60d2b1b844e5ae252345aa20497 /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/descriptor.yaml
e8ac57c241e52b38b60907e4e767b451 /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/trace_data.bin
0826bc74e525697d769248aabcb195cd /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/trace_header.bin

[server01]$ du -h /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/*
33K /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/aux_data
56K /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/descriptor.yaml
99G /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/trace_data.bin
13G /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/trace_header.bin

[server01]$ du -h --apparent-size /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/*
33K /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/aux_data
19K /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/descriptor.yaml
104G /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/trace_data.bin
14G /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/trace_header.bin

  2. Printed the OST name hosting the given file:
    [server01]$ ./lustre-find-ost-for-file /lus/filesystem022/project/file_name/trace_data.bin
    15
    /lus/filesystem022/project/file_name/trace_data.bin: ['filesystem022-OST000f'] (filesystem022-oss6)
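
    (lustre-find-ost-for-file is a site-local helper; on a stock Lustre client the same mapping can be obtained with lfs getstripe, which prints the file's layout including the index of each backing OST:)

    [server01]$ lfs getstripe /lus/filesystem022/project/file_name/trace_data.bin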

  3. Ran zdb to check for leaks:
    [root@filesystem022-oss6 ~]# zfs list
    NAME                                        USED  AVAIL  REFER  MOUNTPOINT
    filesystem022-OST17                        48.3T  18.5T   219K  none
    filesystem022-OST17/filesystem022-OST0005  48.3T  18.5T  48.3T  none
    filesystem022-OST19                        49.5T  17.3T   219K  none
    filesystem022-OST19/filesystem022-OST0009  49.5T  17.3T  49.5T  none
    filesystem022-OST21                        47.3T  19.5T   219K  none
    filesystem022-OST21/filesystem022-OST000f  47.3T  19.5T  47.3T  none
    filesystem022-OST23                        51.1T  15.7T   219K  none
    filesystem022-OST23/filesystem022-OST0013  51.1T  15.7T  51.1T  none

[root@filesystem022-oss6 ~]# zdb -b filesystem022-OST21
Traversing all blocks to verify nothing leaked ...

loading space map for vdev 0 of 1, metaslab 180 of 181 ...
62.0T completed (12801MB/s) estimated time remaining: 0hr 00min 07sec
leaked space: vdev 0, offset 0x1d80003de000, size 1081344
[…]
See attachment.

Please could someone advise?

Thanks
Nick

Label: Question

All 4 comments

Would you mind not destroying the preformatted issue form?
Please restore it and include the requested info.


Apologies. Please see the reformatted information below:

System information


Type | Version/Name
--- | ---
Distribution Name | CentOS
Distribution Version | 7.4.1708
Linux Kernel | 3.10.0-693.11.6.el7_lustre.x86_64
Architecture | Intel x86_64
ZFS Version | 0.7.5-1
SPL Version | 0.7.5-1

Describe the problem you're observing

File size < on-disk size: currently unexplained; the size on disk is 2-3x the file size.

Observations:

- Possible ZFS corruption across the Lustre filesystems?

- When zdb checks for leaks, it walks the entire block tree, constructs the space maps in memory, and then compares them to the ones stored on disk. If they differ, it reports a leak.

- Presuming from the investigation below that the "space leaks" mean the pool is somehow corrupted; zdb (the ZFS debugger) has detected a large number of them.

- Does zdb-reported leaked space mean trouble with the pool, and could it explain the file size < disk size discrepancy?

- Could the errors have been introduced by failovers or hardware faults?

- The pool at least appears inconsistent, which is supposed to never happen with ZFS. Is this indicative of a larger problem (numerous lockups, etc.)?

Investigation:
For troubleshooting, the following files, located in WEY, were selected. There are no snapshots, reservations, or quotas involved here.

[server01]</users/user001>$ du -h --apparent-size /lus/filesystem022/project/file_name/*
33K /lus/filesystem022/project/file_name/aux_data
19K /lus/filesystem022/project/file_name/descriptor.yaml
104G /lus/filesystem022/project/file_name/trace_data.bin
14G /lus/filesystem022/project/file_name/trace_header.bin

[server01]</users/user001>$ du -h /lus/filesystem022/project/file_name/*
33K /lus/filesystem022/project/file_name/aux_data
56K /lus/filesystem022/project/file_name/descriptor.yaml
237G /lus/filesystem022/project/file_name/trace_data.bin
31G /lus/filesystem022/project/file_name/trace_header.bin

  1. Copied the dataset onto the same storage:

    • The disk sizes differ.

    • The checksums match.

[server01]</users/user001>$ cp -rp /lus/filesystem022/project/file_name /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC

[server01]</users/user001>$ md5sum /lus/filesystem022/project/file_name/*
md5sum: /lus/filesystem022/project/file_name/aux_data: Is a directory
f861b60d2b1b844e5ae252345aa20497 /lus/filesystem022/project/file_name/descriptor.yaml
e8ac57c241e52b38b60907e4e767b451 /lus/filesystem022/project/file_name/trace_data.bin
0826bc74e525697d769248aabcb195cd /lus/filesystem022/project/file_name/trace_header.bin

[server01]</users/user001>$ md5sum /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/*
md5sum: /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/aux_data: Is a directory
f861b60d2b1b844e5ae252345aa20497 /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/descriptor.yaml
e8ac57c241e52b38b60907e4e767b451 /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/trace_data.bin
0826bc74e525697d769248aabcb195cd /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/trace_header.bin

[server01]</users/user001>$ du -h /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/*
33K /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/aux_data
56K /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/descriptor.yaml
99G /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/trace_data.bin
13G /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/trace_header.bin

[server01]</users/user001>$ du -h --apparent-size /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/*
33K /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/aux_data
19K /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/descriptor.yaml
104G /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/trace_data.bin
14G /lus/filesystem022/project/p005j02_2010_SRME_1238A018_JC/trace_header.bin

  2. Printed the OST name hosting the given file:
[server01]</users/user001>$ ./lustre-find-ost-for-file /lus/filesystem022/project/file_name/trace_data.bin
15
/lus/filesystem022/project/file_name/trace_data.bin: ['filesystem022-OST000f'] (filesystem022-oss6)

  3. Ran zdb to check for leaks:
[root@filesystem022-oss6 ~]# zfs list
NAME                                        USED  AVAIL  REFER  MOUNTPOINT
filesystem022-OST17                        48.3T  18.5T   219K  none
filesystem022-OST17/filesystem022-OST0005  48.3T  18.5T  48.3T  none
filesystem022-OST19                        49.5T  17.3T   219K  none
filesystem022-OST19/filesystem022-OST0009  49.5T  17.3T  49.5T  none
filesystem022-OST21                        47.3T  19.5T   219K  none
filesystem022-OST21/filesystem022-OST000f  47.3T  19.5T  47.3T  none
filesystem022-OST23                        51.1T  15.7T   219K  none
filesystem022-OST23/filesystem022-OST0013  51.1T  15.7T  51.1T  none

[root@filesystem022-oss6 ~]# zdb -b filesystem022-OST21
Traversing all blocks to verify nothing leaked ...

loading space map for vdev 0 of 1, metaslab 180 of 181 ...
62.0T completed (12801MB/s) estimated time remaining: 0hr 00min 07sec
leaked space: vdev 0, offset 0x1d80003de000, size 1081344
[…]

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

  • When zdb checks for leaks, it walks the entire block tree, constructs the space maps in memory, and then compares them to the ones stored on disk. If they differ, it reports a leak.

The main thing to be aware of when using zdb to check for leaks is the pool must not be online when running zdb. If the pool is imported and active zdb will incorrectly report leaked space. This is because zdb effectively imports the pool read-only in user space, so changes made to the imported pool by the kernel module will not be correctly accounted for by zdb.

This means the warnings you've seen are probably not indicative of a problem if you ran zdb against the imported pool. Unfortunately, to be absolutely certain you would need to stop the OSS and then use zdb -e to verify the space maps.
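
A minimal sketch of that offline check, with names taken from the transcripts above (zdb -e operates on an exported pool, reading its on-disk state directly):

# with the OSS stopped, so nothing is writing to the pool:
[root@filesystem022-oss6 ~]# zpool export filesystem022-OST21
[root@filesystem022-oss6 ~]# zdb -e -b filesystem022-OST21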

File Size < on disk size

As for the original issue, is this something you're seeing on every file? Since you're using Lustre, I assume the du output posted is what's returned via the Lustre mount point? Is it possible that you've set the ZFS copies property to a value other than 1?

Hi Brian,

Thanks for the info. I believe the zdb check was probably run with the zpools imported. We will look for an opportunity to run it offline.

File size < on-disk size: we do not see this on every file, and the data appears to be OK. The du output is returned from the Lustre mount point.

ZFS copies are set to 1:

zfs get copies

NAME                         PROPERTY  VALUE  SOURCE
lsi0xx-OST17                 copies    1      default
lsi0xx-OST17/lsi0xx-OST0005  copies    1      default
lsi0xx-OST19                 copies    1      default
lsi0xx-OST19/lsi0xx-OST0009  copies    1      default
lsi0xx-OST21                 copies    1      default
lsi0xx-OST21/lsi0xx-OST000f  copies    1      default
lsi0xx-OST23                 copies    1      default
lsi0xx-OST23/lsi0xx-OST0013  copies    1      default

We are currently running a zfs scrub on one pool that has exhibited the issue on some files, with 0B repaired so far:
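
(For reference, a scrub of this pool is started with the standard command:)

zpool scrub lsi0xx-OST21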

zpool status lsi0xx-OST21

pool: lsi0xx-OST21
state: ONLINE
scan: scrub in progress since Thu Jan 16 13:42:33 2020
55.8T scanned out of 67.0T at 97.2M/s, 33h37m to go
0B repaired, 83.27% done
config:

    NAME                  STATE     READ WRITE CKSUM
    lsi0xx-OST21          ONLINE       0     0     0
      raidz2-0            ONLINE       0     0     0
        JBOD_5_6_SLOT_24  ONLINE       0     0     0
        JBOD_5_6_SLOT_25  ONLINE       0     0     0
        JBOD_5_6_SLOT_26  ONLINE       0     0     0
        JBOD_5_6_SLOT_27  ONLINE       0     0     0
        JBOD_5_6_SLOT_28  ONLINE       0     0     0
        JBOD_5_6_SLOT_36  ONLINE       0     0     0
        JBOD_5_6_SLOT_37  ONLINE       0     0     0
        JBOD_5_6_SLOT_38  ONLINE       0     0     0
        JBOD_5_6_SLOT_39  ONLINE       0     0     0
        JBOD_5_6_SLOT_40  ONLINE       0     0     0

errors: No known data errors

We have several Lustre storage systems built on ZFS and, to a degree, they all show some files with the same symptoms.

Any further suggestions would be appreciated.

Regards
Nick

