ZFS: Cannot trim L2ARC on SSD

Created on 11 Dec 2019  ·  24 comments  ·  Source: openzfs/zfs

System information


Type | Version/Name
--- | ---
Distribution Name | Debian (Proxmox VE 6.1)
Distribution Version | 10
Linux Kernel | 5.3.10-1-pve
Architecture | x86_64
ZFS Version | 0.8.2-pve2
SPL Version | 0.8.2-pve2

Describe the problem you're observing

I believe that trim currently does not work on L2ARC devices.

Describe how to reproduce the problem

  • Use an L2ARC on SSDs
  • Trim the pool (zpool trim rpool)
  • Check the trim status with zpool status -t
root@server:~# zpool status -t
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 1 days 10:42:22 with 0 errors on Mon Dec  9 11:06:24 2019
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          mirror-0                        ONLINE       0     0     0
            wwn-0x5000cca22bdaa077-part1  ONLINE       0     0     0  (trim unsupported)
            wwn-0x5000cca22bd9d2e5-part1  ONLINE       0     0     0  (trim unsupported)
        logs
          mirror-1                        ONLINE       0     0     0
            wwn-0x500253850016023e-part2  ONLINE       0     0     0  (100% trimmed, completed at Sat 07 Dec 2019 04:48:15 PM CET)
            wwn-0x5002538c403f69a6-part2  ONLINE       0     0     0  (100% trimmed, completed at Sat 07 Dec 2019 04:48:15 PM CET)
        cache
          sda1                            ONLINE       0     0     0  (untrimmed)
          sdb1                            ONLINE       0     0     0  (untrimmed)

errors: No known data errors

As you can see in the output, the L2ARC devices are reported as untrimmed. The HDDs correctly show trim unsupported.

Trying to force a trim directly on the L2ARC device does not work either:

zpool trim rpool sda1
cannot trim 'sda1': device is in use as a cache

Taking the device offline beforehand and issuing the trim command does not change the output.

Include any warning/errors/backtraces from the system logs

There are no warnings/errors/backtraces in the logs. I can only find trim_start & trim_finish for the corresponding devices that are getting trimmed. The cache devices (sda1/sdb1) are not mentioned in the logs.

Labels: Feature, Performance

Most helpful comment

@lukeb2e today the l2arc device is always overwritten, it does not get trimmed. This optimization was left as follow up work to the initial trim feature, but it is something we'd like to eventually implement.

All 24 comments

Why would it?

If you have a long-running server with regular reboot intervals and lose your L2ARC, you encounter performance issues over time.
This is caused by the missing trim support for the cache devices.

From wikipedia:

SSD write performance is significantly impacted by the availability of free, 
programmable blocks. Previously written data blocks no longer in use can 
be reclaimed by TRIM; however, even with TRIM, fewer free blocks cause 
slower performance.

Therefore, in this case, trimming the L2ARC does increase L2ARC performance, which is a good thing. :)

Persistent L2ARC would only minimize the performance hit over time. Data in the L2ARC would still be overwritten from time to time, which would result in the same issue as after a reboot, only more slowly.

The impact of missing trim is of course drive-dependent. While not datacenter caliber, the Samsung 840, 850, etc. are very popular and all have issues.

Simply filling a disk with dd and removing the file also works in cases where trim isn't available. It's something I've had to do with every Samsung I've ever encountered. Anyone who routinely runs scrub may not notice it.
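The fill-and-delete workaround mentioned above might be sketched roughly as follows. This is a hedged sketch: the mountpoint and the fill_free_space helper name are placeholders, and note that on a dataset with compression enabled, zeros compress to almost nothing, so a random data source may be needed instead of /dev/zero.

```shell
# Hypothetical helper for the fill-and-delete workaround. The optional
# second argument caps the number of 1 MiB blocks written (useful for a
# test run instead of filling the whole filesystem).
fill_free_space() {
    mnt="$1"
    blocks="${2:-}"
    # Write until the filesystem is full (dd exits non-zero on ENOSPC).
    dd if=/dev/zero of="$mnt/fill.delme" bs=1M ${blocks:+count=$blocks} 2>/dev/null || true
    sync
    # Deleting the file frees the logical space again, giving the drive's
    # garbage collector a chance to reclaim blocks.
    rm -f "$mnt/fill.delme"
    sync
}
```

Usage would be something like fill_free_space /tank on the mountpoint of the affected pool.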

The server has a routine scrub configured, but I still encountered slow L2ARC performance anyway. After removing the SSDs from the pool and manually reformatting and trimming them, performance went back to normal.

In our case the performance was so bad that the system was actually faster with the L2ARC removed than with our default setup.

So I still believe that the issue we encountered can be traced back to the missing trim support for L2ARC.

Sorry, not clear to me how lack of trim could inhibit read performance.

Sadly we don't have a choice in the SSDs used. :( We also do not have autotrim enabled, as it can affect performance negatively, as @kpande mentioned.

It did not corrupt any data; the general performance was just reduced so much that the whole system became nearly unusable. The issue was fixed when the L2ARC was removed. After trimming the disks manually and re-adding the SSDs as L2ARC, our performance did not degrade anymore.

The initial reason I thought of trim was that I knew of an issue with MacBooks, which did not support trim in older versions. This resulted in SSDs that slowed down over time.

In the end it comes down to the question: does ZFS ever perform a "delete" operation (which would definitely require a trim), or is it always "just" overwriting data? And do some (maybe cheap) SSDs require trim even if you "overwrite" data? From what I have read so far about trim, I don't believe this can be answered easily, as trim seems to behave differently for every combination of controller, manufacturer, firmware, ...

@lukeb2e today the l2arc device is always overwritten, it does not get trimmed. This optimization was left as follow up work to the initial trim feature, but it is something we'd like to eventually implement.

I really wonder why trim on L2ARC SSDs is required. During normal operation I would expect the SSD to be fully (or close to fully) utilized. As a result there is nothing to trim, as the data is always overwritten and never deleted.

After a reboot (and without persistent L2ARC), trimming the L2ARC SSD would help warm up the L2ARC faster, as you can write faster. If I remember correctly, the L2ARC is fed at 30 MB/s by default. This should be doable even without a prior trim.

True, but the performance of SSDs can degrade quite a bit. The worst case I found right now is someone mentioning degradation of performance to 8 MB/s. Sadly we worked around the issue for now, so I can't tell you what our performance was before our workaround.

Default value for l2arc_write_max is, coincidentally, 8MB/sec.
https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#l2arc_write_max
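For reference, the current feed-rate tunables can be inspected at runtime. A small sketch, assuming the ZFS on Linux 0.8.x layout where module parameters are exposed under /sys/module/zfs/parameters:

```shell
# Print the L2ARC feed-rate tunables if the zfs module is loaded;
# otherwise note their absence. Values are in bytes per feed interval.
for p in l2arc_write_max l2arc_write_boost; do
    f="/sys/module/zfs/parameters/$p"
    if [ -r "$f" ]; then
        printf '%s = %s\n' "$p" "$(cat "$f")"
    else
        printf '%s not present (zfs module not loaded?)\n' "$p"
    fi
done
```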

I don't think it is as easy as that.

After doing some more digging on Read/Write Performance for SSDs without trim I found this:
https://www.bit-tech.net/reviews/tech/storage/windows-7-ssd-performance-and-trim/13/

It's random write speeds where there's real cause for concern though, with the P128 in its heavily used condition recording just 1.11MB/s random write speed alongside write latencies that peaked at a whopping 1410ms.

One can now argue whether the P128 should be used in "server applications/hardware". But since this is just an example, I think it supports the general point about trim on L2ARC.

Can't speak for everyone of course, but it is a reality I see. Had a drive today do exactly what has been reported. These will degrade to sub-1 MB/s if not conditioned. While I absolutely agree I wouldn't put critical VMs on them, they are still useful once you understand the problem and that Samsung has nfi how to make firmware.

# dd if=/dev/zero of=test.delme bs=4096 &                                       
[1] 2994                                                                        
# while kill -USR1 $! ; do sleep 1 ; done                                       
323716+0 records in                                                             
323716+0 records out                                                            
1325940736 bytes (1.3 GB) copied, 10.1787 s, 130 MB/s                           
408706+0 records in                                                             
408706+0 records out                                                            
1674059776 bytes (1.7 GB) copied, 13.7129 s, 122 MB/s                           
525787+0 records in                                                             
525787+0 records out                                                            
2153623552 bytes (2.2 GB) copied, 18.2957 s, 118 MB/s                           
606019+0 records in                                                             
606019+0 records out                                                            
2482253824 bytes (2.5 GB) copied, 18.5913 s, 134 MB/s                           
^C                                                                              
# fg                                                   
dd if=/dev/zero of=test.delme bs=4096                                           
^C^C                                                                            
645825+0 records in                                                             
645825+0 records out                                                            
2645299200 bytes (2.6 GB) copied, 22.1985 s, 119 MB/s                           
# zpool trim testvol2                                  
# zpool status testvol2 -t                             
  pool: testvol2                                                                
 state: ONLINE                                                                  
status: Some supported features are not enabled on the pool. The pool can       
  still be used, but some features are unavailable.                             
action: Enable all features using 'zpool upgrade'. Once this is done,           
  the pool may no longer be accessible by software that does not support        
  the features. See zpool-features(5) for details.                              
  scan: scrub repaired 0B in 0 days 00:04:11 with 0 errors on Sat Nov 16 04:30:01 2019
config:                                                                         

  NAME                      STATE     READ WRITE CKSUM                          
  testvol2                  ONLINE       0     0     0                          
    wwn-0x50025385a00XXXXX  ONLINE       0     0     0  (100% trimmed, completed at Wed 18 Dec 2019 02:42:21 AM EST)

errors: No known data errors                                                    
# dd if=/dev/zero of=test.delme bs=4096 &              
[1] 11303                                                                       
# while kill -USR1 $! ; do sleep 1 ; done              
244673+0 records in                                                             
244673+0 records out                                                            
1002180608 bytes (1.0 GB) copied, 3.31499 s, 302 MB/s                           
368957+0 records in                                                             
368957+0 records out                                                            
1511247872 bytes (1.5 GB) copied, 5.02608 s, 301 MB/s                           
411965+0 records in                                                             
411964+0 records out                                                            
1687404544 bytes (1.7 GB) copied, 5.18055 s, 326 MB/s                           
494398+0 records in                                                             
494398+0 records out                                                            
2025054208 bytes (2.0 GB) copied, 6.73668 s, 301 MB/s                           
594850+0 records in                                                             
594850+0 records out                                                            
2436505600 bytes (2.4 GB) copied, 8.07323 s, 302 MB/s                           
625569+0 records in                                                             
625569+0 records out                                                            
2562330624 bytes (2.6 GB) copied, 8.18257 s, 313 MB/s                           
695330+0 records in                                                             
695330+0 records out                                                            
2848071680 bytes (2.8 GB) copied, 9.40134 s, 303 MB/s                           
795810+0 records in                                                             
795810+0 records out                                                            
3259637760 bytes (3.3 GB) copied, 10.7216 s, 304 MB/s                           
920545+0 records in                                                             
920545+0 records out                                                            
3770552320 bytes (3.8 GB) copied, 11.1921 s, 337 MB/s                           
921867+0 records in                                                             
921867+0 records out                                                            
3775967232 bytes (3.8 GB) copied, 12.4199 s, 304 MB/s                           
1006722+0 records in                                                            
1006722+0 records out                                                           
4123533312 bytes (4.1 GB) copied, 13.5586 s, 304 MB/s                           
1091554+0 records in                                                            
1091554+0 records out                                                           
4471005184 bytes (4.5 GB) copied, 14.6854 s, 304 MB/s                           

etc. Drive itself is fine

# smartctl -a /dev/sdb
smartctl 6.2 2017-02-27 r4394 [x86_64-linux-4.20.0+] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 PRO Series
Serial Number:    S1ANNSXXXXXXXXX
LU WWN Device Id: 5 002538 5a00XXXXX
Firmware Version: DXM05B0Q
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Dec 18 02:59:55 2019 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (    0) seconds.
Offline data collection
capabilities:            (0x53) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (  15) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       48575
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       147
177 Wear_Leveling_Count     0x0013   090   090   000    Pre-fail  Always       -       199
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   099   099   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   099   099   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   099   099   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   073   051   000    Old_age   Always       -       23
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       98053916085

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Eventually the firmware will complete whatever cleanup it needs to and we're back at "normal"

 while kill -USR1 $! ; do sleep 1 ; done
414301+0 records in
414300+0 records out
1696972800 bytes (1.7 GB) copied, 3.53948 s, 479 MB/s
533437+0 records in
533436+0 records out
2184953856 bytes (2.2 GB) copied, 4.53862 s, 481 MB/s
650397+0 records in
650396+0 records out
2664022016 bytes (2.7 GB) copied, 5.53923 s, 481 MB/s
766301+0 records in
766300+0 records out
3138764800 bytes (3.1 GB) copied, 6.53996 s, 480 MB/s
882257+0 records in
882257+0 records out
3613724672 bytes (3.6 GB) copied, 7.54068 s, 479 MB/s
996605+0 records in
996604+0 records out
4082089984 bytes (4.1 GB) copied, 8.54161 s, 478 MB/s
1111412+0 records in
1111411+0 records out
4552339456 bytes (4.6 GB) copied, 9.54211 s, 477 MB/s
1226173+0 records in
1226172+0 records out
5022400512 bytes (5.0 GB) copied, 10.5429 s, 476 MB/s
1339965+0 records in
1339964+0 records out
5488492544 bytes (5.5 GB) copied, 11.5437 s, 475 MB/s
1452893+0 records in
1452892+0 records out
5951045632 bytes (6.0 GB) copied, 12.5442 s, 474 MB/s
1567453+0 records in
1567452+0 records out
6420283392 bytes (6.4 GB) copied, 13.5451 s, 474 MB/s
^C
# fg
dd if=/dev/sdb of=/dev/null bs=4096
^C1824029+0 records in
1824028+0 records out
7471218688 bytes (7.5 GB) copied, 15.7939 s, 473 MB/s

What does that have to do with the L2ARC, which will be constantly full anyway (assuming the headers fit in memory)? If the fill rate is still too slow, then create a partition for some overprovisioning.

The entire drive slows to a crawl.

For anyone interested in working on this, there may be some relatively low hanging fruit to be had. The l2arc_evict() function is responsible for evicting headers which reference the next N bytes of the l2arc device to be overwritten. If this function were updated to additionally TRIM that vdev space before it's overwritten that _may_ help performance.

As @richardelling mentioned, currently it's only overwritten in l2arc_write_max (8M) chunks. This default value hasn't changed since at least 2008; since today's SSDs are so much more capable than those of a decade ago, I'd be surprised if increasing the default weren't beneficial. It really should be at least as large as the maximum block size (16M). 64M would retain the original scaling factor of 64 times the average block size, now that 1M blocks are very common. It would be interesting to do a scaling study with l2arc_evict() trimming ahead.
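A back-of-the-envelope version of that scaling argument, as a sketch (the 64× factor and block sizes come from the comment above; the sysfs write is left commented out, since it needs root on a live system and the path assumes ZFS on Linux):

```shell
# 64 times a 1 MiB average block size, per the original scaling factor.
AVG_BLOCK=$((1024 * 1024))
NEW_MAX=$((64 * AVG_BLOCK))
echo "proposed l2arc_write_max: $NEW_MAX bytes"
# On a live system (as root), the tunable could then be raised with:
# echo "$NEW_MAX" > /sys/module/zfs/parameters/l2arc_write_max
```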

Thought everyone tweaked those already? :) There are a few defaults that could use revisiting. They are sane but not necessarily ideal.

I can confirm this issue occurs even on underprovisioned (i.e. not the full capacity of the SSD dedicated to the L2ARC, some left unformatted as an empty partition) but low-end SSDs, which tend to write data slower and slower as capacity utilization increases.

The issue became most obvious when we used some of the L2ARC SSDs' free capacity for the SLOG and the L2ARC partitions on the same SSDs became fully occupied: if the utilization of the L2ARC device later decreases, the write performance of sync writes to the SLOG is not restored to the level it was at before the L2ARC partition became fully utilized.

I noticed this in one of my ZFS boxes with two 120GB SSDs, with ~70% of their space used for the L2ARC and 1GB of each used for the mirrored SLOG device, but I'm able to reproduce it in almost all of my ZFS setups where I'm using the same SSDs for the L2ARC and SLOG mirror, with the following steps:

  1. Running openssl enc -aes-128-ecb -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero | dd iflag=fullblock of=/dev/sdb bs=2M oflag=direct where /dev/sdb is one of my SLOG-backed ZVOLs, to put a highly random sync write workload on it.
    I'm using openssl instead of reading from /dev/urandom to make sure that the random data generation will outperform the write speed of the SLOG.
  2. Running iostat to monitor write latencies for some time
  3. Starting some IO-intensive read workload on ZFS/ZVOL devices backed with L2ARC on the same SSDs used for SLOG at step 1
  4. Watching the utilization of the L2ARC grow while the write latencies of the dd from step 1 also increase
  5. Waiting until the utilization of the L2ARC becomes nearly 100% and noticing that the write latencies are much worse than at step 2
  6. From this moment on, it doesn't matter what I do: the write latencies can only be restored if I destroy my L2ARC and then use blkdiscard on the SSD partitions dedicated to the L2ARC

So it would be best to have the write latencies of the SLOG restored automatically whenever the utilization of the L2ARC goes down (BTW, this isn't such a rare case in my workloads), without destroying the L2ARC and discarding all of its data.
In light of #9582 this issue becomes more crucial: previously we would lose the L2ARC on any reboot anyway, so I have a 'destroy L2ARC, discard partition data, create new L2ARC' procedure executed on each reboot, but once persistent L2ARC goes live, it will no longer be acceptable for me to trade write performance for read performance using this procedure.
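The per-reboot procedure described above might be scripted roughly like this. This is a sketch only: the pool name, device name, and the refresh_l2arc helper are placeholders, and setting RUN=echo gives a dry run instead of touching real devices.

```shell
# Remove the cache vdev, discard its blocks so the SSD firmware sees the
# space as free, then re-add it as a fresh (empty) L2ARC.
# RUN=echo turns every command into a dry-run print.
refresh_l2arc() {
    pool="$1"
    dev="$2"
    ${RUN:-} zpool remove "$pool" "$dev"
    ${RUN:-} blkdiscard "/dev/$dev"
    ${RUN:-} zpool add "$pool" cache "$dev"
}
```

Usage would be refresh_l2arc rpool sda1 (or with RUN=echo set, to preview the commands first).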

Depends on the workload, but most SSDs are still orders of magnitude faster than a platter drive.

why are you sharing a device with L2ARC and SLOG? that runs contradictory to the purpose of a SLOG device, low latency I/O.

Depends on the workload, but most SSDs are still orders of magnitude faster than a platter drive.

This ^

Well, most probably I will wait for the persistent L2ARC to go live and then provide a PR for this.

well, no, latency introduced by sharing the device is not workload dependent, it is hardware dependent. optane might (might) not exhibit the problem, but all drives do.

I don't understand the analogy you're trying to make. Does it really matter if the hardware is to blame or the workload when the end result is the same?

you're using a setup that is explicitly mentioned as a bad idea in documentation everywhere and you want others to put in effort to make sure you can continue using bad hardware in subpar configurations. please, can't you just fix your setup?

There are different cases above. Regardless of what some think about sharing services between SLOG and ARC, it is a valid deployment. The slowness comes from lack of trim on some drives, and yes that too happens on all hardware. As the example I gave above, it's a fairly serious degradation with some drives.

One case where trim does not make sense is security, but you're not arguing that?

For those interested I created #9789.
@Vlad1mir-D you already have the testing setup. Would you mind giving this a try?

Sure I will, and thank you again for all your hard work making ZFS more suitable for cheap tiered storage setups!
