ZFS: Buffer I/O error - lost async page write when writing to zvol

Created on 16 Sep 2016 · 14 comments · Source: openzfs/zfs

Hi, I have two different systems with different disk controllers and disks. System A is running the latest 0.6.5.7 and system B is running the latest clone from the repository.
The kernel we are using is based on 4.1.2.

Running a simple generated load (with dd) against the zvol devices on both systems (writing directly to /dev/zvol..) causes identical errors to be logged on both systems.

[491412.318544] Buffer I/O error on dev zd16, logical block 26213724462, lost async page write

Here the zd device represents the zvol the writes are directed at.

I have not yet validated whether I'm actually losing writes here or what is happening, as I was unable to find any more information about the reason for this behaviour.

All 14 comments

After closer investigation, it seems the pools are 96% full as a result of the generated data.
The higher-level application, dd in this case, has not received any errors (e.g. ENOSPC). I think this is a bug and should be sorted out. If this is the case, the application is having its writes discarded silently, which can lead to very dangerous situations.

Check what 'sync' is set to and test with sync=always
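For example (just a sketch; dg1/foo1 is a placeholder for one of your zvols):

# show the current setting, then force synchronous semantics for testing
zfs get sync dg1/foo1
zfs set sync=always dg1/foo1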

sync=disabled on both.

Setting sync=always did not remedy the issue; after restarting the dd sessions, two out of six instances received out-of-space errors while the others continued.

Volume consumers know nothing about space; the expected errors are EIO or ENXIO.

The "space" for volumes is reserved, so you'll need to check that the consumed space is less than the reservation. The easiest way to do this is

zfs list -o space

Note: the reservation is estimated when the volume is created. The actual allocation size is often different and expected to always be less than the reservation. However, the estimation is not omniscient. The biggest source of discrepancy is using raidz on drives with a physical block size larger than 512 bytes.
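A per-volume check along those lines might look like this (a sketch; dg1/foo1 stands in for one of your volumes):

# compare the nominal size and reservation against what is actually allocated
zfs get volsize,volblocksize,refreservation,usedbydataset dg1/foo1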

I have created the volumes with -o refreservation=none

zfs list -o space
NAME       AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
dg1            0  7.32T         0    279K              0      7.32T
dg1/foo        0   186K         0    186K              0          0
dg1/foo1       0  1.14T         0   1.14T              0          0
dg1/foo2       0  1.29T         0   1.29T              0          0
dg1/foo3       0  1.34T         0   1.34T              0          0
dg1/foo4       0  1.13T         0   1.13T              0          0
dg1/foo5       0  1.30T         0   1.30T              0          0
dg1/foo6       0  1.12T         0   1.12T              0          0

zpool list
NAME   SIZE  ALLOC  FREE  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
dg1   10.4T  10.1T  333G         -   54%  96%  1.00x  ONLINE        -

The pool has been at 96% for 9 days now; the writer processes are still active and not receiving errors.
Only the buffer I/O error shows up in dmesg.

Your pool is full, from a logical perspective (zfs list says there is 0 bytes available). Pools themselves have a small (often 4%) reserve for ZIL and other housekeeping. This space is not available for datasets, which is why zfs list says 0 bytes are available.

The reason refreservations exist is to avoid the situation you've achieved. By purposefully removing the refreservations, you open up the possibility of running out of space in the volumes. As you've discovered, this is not a best practice.

Therefore the system is working as designed.
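On ZFS on Linux the size of that internal reserve is governed by the spa_slop_shift module parameter (roughly 1/2^spa_slop_shift of the pool). A quick way to see the gap between the pool-level and dataset-level numbers (a sketch, using the dg1 pool from the listing above):

# raw pool space, including the reserve ZFS keeps for itself
zpool list -o name,size,allocated,free dg1
# space actually usable by datasets and volumes
zfs list -o name,avail,used dg1
# the reserve fraction: 1/2^N of the pool
cat /sys/module/zfs/parameters/spa_slop_shift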

AFAIK the behaviour differs from the Solaris implementation, and I'd say the inconsistency (only 2 out of 6 writers got any errors, on an identical setup) is an issue; you should be able to predict how it behaves in situations like this. I'd say the missing refreservation is not a valid reason for this issue, as zfs list says there are 0 bytes available, which is the space available for writing data, something a refreservation will not and cannot change. If there is no space to write to, the application should not be allowed to write.

Hence, I think the system does not work as designed. :)

Quite simply, you are out of space and you didn't reserve enough for the volumes. Solaris behaves the same way. No bug here.
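For comparison, a zvol created without -o refreservation=none reserves its space up front, so writes inside the volume cannot be starved by other datasets filling the pool (a sketch; dg1/baz is a hypothetical volume name):

# default (non-sparse) zvol: refreservation is set to cover the volume size
zfs create -V 1g dg1/baz
zfs get volsize,refreservation dg1/baz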

So why the inconsistency?

And more importantly, what happens to the writes? Where are they going?

 89093 root      20   0 10080 1456 1324 R   106  0.0   2257:52 dd                                                                                                                                                                             
123986 root      20   0     0    0    0 R   100  0.0  40:46.05 kworker/u289:3                                                                                                                                                                 
 89088 root      20   0 10080 1508 1376 R    94  0.0   2262:11 dd                                                                                                                                                                             
 89085 root      20   0 10080 1492 1360 R    94  0.0   2248:16 dd                                                                                                                                                                             
 89091 root      20   0 10080 1524 1396 R    93  0.0   2247:16 dd                                                                                                                                                                             
 89095 root      20   0 10080 1628 1496 R    92  0.0   2247:58 dd                       

zd0               0.00     0.00 50259.00 105207.00 201036.00 420824.00     8.00     1.35    0.01    0.01    0.01   0.01  91.50
zd16              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
zd32              0.00     0.00 46895.00 100189.00 187580.00 400756.00     8.00     1.39    0.01    0.01    0.01   0.01  94.20
zd48              0.00     0.00 51397.00 104302.00 205588.00 417208.00     8.00     1.33    0.01    0.01    0.01   0.01  90.90
zd64              0.00     0.00 51885.00 105779.00 207544.00 423116.00     8.00     1.38    0.01    0.01    0.01   0.01  93.40
zd80              0.00     0.00 52975.00    0.00 211896.00     0.00     8.00     0.47    0.01    0.01    0.00   0.01  47.40

zfs create -V 1g -o refreservation=none dg1/bar
cannot create 'dg1/bar': out of space

dd if=/dev/zero of=/dev/zvol/dg1/foo2 count=100
100+0 records in
100+0 records out
51200 bytes (51 kB) copied, 0.000127589 s, 401 MB/s

dd if=/dev/zero of=/dev/zvol/dg1/foo2 count=100000
dd: writing to `/dev/zvol/dg1/foo2': No space left on device
20481+0 records in
20480+0 records out
10485760 bytes (10 MB) copied, 0.0231473 s, 453 MB/s

The top/iostat output is from the live system, which has been writing to nothing for several days now.
Compression is not enabled.
The zfs create and dd commands in the example were tried in sequence.
I have never witnessed this in the Solaris world. I might be wrong, but this seems broken to me.

Overwriting existing data does not increase the allocation, unless snapshots exist.

Setting refreservation=none on volumes is a bad idea.
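A sketch of why the snapshot distinction matters (the snapshot name @before is hypothetical): once a snapshot pins the old blocks, an overwrite has to allocate fresh space and can fail in a full pool, whereas without snapshots the space freed by the old copies offsets the new writes.

# take a snapshot, then rewrite part of the volume; on a full pool the second
# step now needs new allocations and can return ENOSPC
zfs snapshot dg1/foo2@before
dd if=/dev/urandom of=/dev/zvol/dg1/foo2 bs=1M count=100 oflag=direct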

Overwriting not increasing the allocation only explains why 4 out of 6 writers continued like "they should"; the inconsistency remains.
Setting refreservation=none is actually a common thing to do when relying on external monitoring and dynamically allocating space from pools based on actual usage (i.e. thin provisioning).
Thanks anyway, at least I tried. :)
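For what it's worth, the external monitoring mentioned above usually boils down to watching pool capacity and per-volume usage; a minimal sketch (dg1 and the 80% threshold are just example values):

# warn when the pool crosses a capacity threshold, e.g. from cron
test "$(zpool list -H -o capacity dg1 | tr -d '%')" -lt 80 || echo "dg1 is above 80%, grow the pool or trim volumes"
# per-volume: compare the advertised size to what is actually referenced
zfs list -t volume -r -o name,volsize,refer,used dg1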

Hi there,

Got the same issue on a volume that has refreservation=10.1G, on a 20 TB partition. I made an ext4 partition on it as large as would fit; after a backup system ran amok and filled the partition up, we got perpetual errors from fsck, and I'd like to salvage some of the information from there. Is it a good idea to try to resize the partition after deleting files? How much margin should be kept? What ratio of refreservation to volume size is safe?

G
