Type | Version/Name
--- | ---
Distribution Name | Debian (Proxmox VE 6.1)
Distribution Version | 10
Linux Kernel | 5.3.10-1-pve
Architecture | x86_64
ZFS Version | 0.8.2-pve2
SPL Version | 0.8.2-pve2
I believe that trim currently does not work on L2ARC devices.
```
root@server:~# zpool trim rpool
root@server:~# zpool status -t
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 1 days 10:42:22 with 0 errors on Mon Dec  9 11:06:24 2019
config:

	NAME                              STATE     READ WRITE CKSUM
	rpool                             ONLINE       0     0     0
	  mirror-0                        ONLINE       0     0     0
	    wwn-0x5000cca22bdaa077-part1  ONLINE       0     0     0  (trim unsupported)
	    wwn-0x5000cca22bd9d2e5-part1  ONLINE       0     0     0  (trim unsupported)
	logs
	  mirror-1                        ONLINE       0     0     0
	    wwn-0x500253850016023e-part2  ONLINE       0     0     0  (100% trimmed, completed at Sat 07 Dec 2019 04:48:15 PM CET)
	    wwn-0x5002538c403f69a6-part2  ONLINE       0     0     0  (100% trimmed, completed at Sat 07 Dec 2019 04:48:15 PM CET)
	cache
	  sda1                            ONLINE       0     0     0  (untrimmed)
	  sdb1                            ONLINE       0     0     0  (untrimmed)

errors: No known data errors
```
As you can see in the output, the L2ARC is reported as untrimmed, while the HDDs correctly show trim unsupported.
Trying to force a trim on the L2ARC device directly does not work either:

```
root@server:~# zpool trim rpool sda1
cannot trim 'sda1': device is in use as a cache
```
Taking the device offline beforehand and issuing the trim command does not change the output.
There are no warnings/errors/backtraces in the logs. I can only find trim_start & trim_finish for the corresponding devices that are getting trimmed. The cache devices (sda1/sdb1) are not mentioned in the logs.
Why would it?
If you have a long-running server with regular reboot intervals and lose your L2ARC, you encounter performance issues over time. This is caused by the missing trim support for the cache devices.
From Wikipedia:

> SSD write performance is significantly impacted by the availability of free, programmable blocks. Previously written data blocks no longer in use can be reclaimed by TRIM; however, even with TRIM, fewer free blocks cause slower performance.
Therefore, in this case, trimming the L2ARC does increase L2ARC performance, which is a good thing. :)
Persistent L2ARC would only minimize the performance hit over time. Data in the L2ARC would still be overwritten from time to time, which would result in the same issue as with the reboot, just more slowly.
Lack of trim is of course drive-dependent. While not datacenter caliber, the Samsung 840, 850, etc. are very popular and all have issues.
Simply filling a disk with dd and removing the file also works in cases where trim isn't available. It's something I've had to do with every Samsung I've ever encountered. Anyone who routinely runs scrub may not notice it.
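The fill-and-delete trick mentioned above can be sketched as follows. This is a scaled-down illustration only: the path and sizes are made up, and on a real pool you would write a file on a mounted dataset until the pool is nearly full (so the drive's free blocks all get touched) before deleting it.

```shell
# Sketch of the "fill with dd, then delete" workaround (illustrative paths/sizes).
workdir=$(mktemp -d)
dd if=/dev/zero of="$workdir/fill.delme" bs=1M count=8 status=none  # real use: run until the pool is nearly full
ls -lh "$workdir/fill.delme"
rm "$workdir/fill.delme"   # freeing the file lets the firmware reclaim the blocks over time
```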
The server has a routine scrub configured, and I still encountered slow L2ARC performance anyway. After removing the SSDs from the pool and manually reformatting and trimming the disks, performance went back to normal.
In our case the performance was so bad that the system was actually faster with the L2ARC removed than with our default setup.
So I still believe that the issue we encountered can be traced back to the missing trim support for L2ARC.
Sorry, not clear to me how lack of trim could inhibit read performance.
Sadly we don't have a choice in the SSDs used. :( We also do not have autotrim enabled, as it can affect performance negatively, as @kpande mentioned.
It did not corrupt any data; the general performance was just reduced so much that the whole system became nearly unusable. The issue was fixed when the L2ARC was removed. After trimming the disks manually and re-adding the SSDs as L2ARC, our performance did not degrade anymore.
The initial reason I thought of trim was that I knew of an issue with MacBooks, which did not support trim in older versions. This resulted in SSDs that slowed down over time.
In the end it comes down to the question: does ZFS ever perform a "delete" operation (which would definitely require a trim), or is it always "just" overwriting data? And do some (maybe cheap) SSDs require trim even if you "overwrite" data? From what I have read so far about trim, I don't believe this can be answered easily, as it seems to me that trim behaves differently for every combination of controller, manufacturer, firmware, ...
@lukeb2e today the l2arc device is always overwritten, it does not get trimmed. This optimization was left as follow up work to the initial trim feature, but it is something we'd like to eventually implement.
I really wonder why trim on L2ARC SSDs is required. During normal operation I would expect the SSD to be fully (or close to fully) utilized. As a result there is nothing to trim, since the data is always overwritten and never deleted.
After a reboot (and without persistent L2ARC), trimming the L2ARC SSD would be helpful to warm up the L2ARC faster, as you can write faster. If I remember correctly, the L2ARC is fed at 30 MB/s by default. This should be doable even without a prior trim.
True, but the performance of SSDs can degrade quite a bit. The worst case I found right now is someone mentioning degradation of performance down to 8 MB/s. Sadly we worked around the issue for now, so I can't tell you what our performance was before our workaround.
Default value for l2arc_write_max is, coincidentally, 8MB/sec.
https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#l2arc_write_max
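For anyone experimenting with this tunable, it can be raised persistently via a modprobe options file. This is just a sketch, assuming the OpenZFS `l2arc_write_max` module parameter (value in bytes) and a hypothetical 64M target:

```
# /etc/modprobe.d/zfs.conf -- sketch: raise l2arc_write_max from the 8M default to 64M
options zfs l2arc_write_max=67108864
```

The running value can also be changed without a reboot by writing to `/sys/module/zfs/parameters/l2arc_write_max`.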
I don't think it is as easy as that.
After doing some more digging on read/write performance for SSDs without trim, I found this:
https://www.bit-tech.net/reviews/tech/storage/windows-7-ssd-performance-and-trim/13/
> It's random write speeds where there's real cause for concern though, with the P128 in its heavily used condition recording just 1.11MB/s random write speed alongside write latencies that peaked at a whopping 1410ms.
One can now argue whether the P128 should be used in "server applications/hardware". But since this is just an example, I think it supports the general issue about trim on L2ARC.
Can't speak for everyone of course, but it is a reality I see. Had a drive today do exactly what has been reported. These will degrade to sub-1 MB/s if not conditioned. While I absolutely agree I wouldn't put critical VMs on them, they are still useful once you understand the problem, and Samsung has nfi how to make firmware.
```
# dd if=/dev/zero of=test.delme bs=4096 &
[1] 2994
# while kill -USR1 $! ; do sleep 1 ; done
323716+0 records in
323716+0 records out
1325940736 bytes (1.3 GB) copied, 10.1787 s, 130 MB/s
408706+0 records in
408706+0 records out
1674059776 bytes (1.7 GB) copied, 13.7129 s, 122 MB/s
525787+0 records in
525787+0 records out
2153623552 bytes (2.2 GB) copied, 18.2957 s, 118 MB/s
606019+0 records in
606019+0 records out
2482253824 bytes (2.5 GB) copied, 18.5913 s, 134 MB/s
^C
# fg
dd if=/dev/zero of=test.delme bs=4096
^C^C
645825+0 records in
645825+0 records out
2645299200 bytes (2.6 GB) copied, 22.1985 s, 119 MB/s
```
```
# zpool trim testvol2
# zpool status testvol2 -t
  pool: testvol2
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 0 days 00:04:11 with 0 errors on Sat Nov 16 04:30:01 2019
config:

	NAME                      STATE     READ WRITE CKSUM
	testvol2                  ONLINE       0     0     0
	  wwn-0x50025385a00XXXXX  ONLINE       0     0     0  (100% trimmed, completed at Wed 18 Dec 2019 02:42:21 AM EST)

errors: No known data errors
# zpool status testvol2 -t
  pool: testvol2
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
#
```
```
# dd if=/dev/zero of=test.delme bs=4096 &
[1] 11303
# while kill -USR1 $! ; do sleep 1 ; done
244673+0 records in
244673+0 records out
1002180608 bytes (1.0 GB) copied, 3.31499 s, 302 MB/s
368957+0 records in
368957+0 records out
1511247872 bytes (1.5 GB) copied, 5.02608 s, 301 MB/s
411965+0 records in
411964+0 records out
1687404544 bytes (1.7 GB) copied, 5.18055 s, 326 MB/s
494398+0 records in
494398+0 records out
2025054208 bytes (2.0 GB) copied, 6.73668 s, 301 MB/s
594850+0 records in
594850+0 records out
2436505600 bytes (2.4 GB) copied, 8.07323 s, 302 MB/s
625569+0 records in
625569+0 records out
2562330624 bytes (2.6 GB) copied, 8.18257 s, 313 MB/s
695330+0 records in
695330+0 records out
2848071680 bytes (2.8 GB) copied, 9.40134 s, 303 MB/s
795810+0 records in
795810+0 records out
3259637760 bytes (3.3 GB) copied, 10.7216 s, 304 MB/s
920545+0 records in
920545+0 records out
3770552320 bytes (3.8 GB) copied, 11.1921 s, 337 MB/s
921867+0 records in
921867+0 records out
3775967232 bytes (3.8 GB) copied, 12.4199 s, 304 MB/s
1006722+0 records in
1006722+0 records out
4123533312 bytes (4.1 GB) copied, 13.5586 s, 304 MB/s
1091554+0 records in
1091554+0 records out
4471005184 bytes (4.5 GB) copied, 14.6854 s, 304 MB/s
```
etc. The drive itself is fine:
```
# smartctl -a /dev/sdb
smartctl 6.2 2017-02-27 r4394 [x86_64-linux-4.20.0+] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 PRO Series
Serial Number:    S1ANNSXXXXXXXXX
LU WWN Device Id: 5 002538 5a00XXXXX
Firmware Version: DXM05B0Q
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Dec 18 02:59:55 2019 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  15) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       48575
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       147
177 Wear_Leveling_Count     0x0013   090   090   000    Pre-fail  Always       -       199
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   099   099   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   099   099   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   099   099   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   073   051   000    Old_age   Always       -       23
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       98053916085

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```
Eventually the firmware will complete whatever cleanup it needs to, and we're back at "normal":
```
# while kill -USR1 $! ; do sleep 1 ; done
414301+0 records in
414300+0 records out
1696972800 bytes (1.7 GB) copied, 3.53948 s, 479 MB/s
533437+0 records in
533436+0 records out
2184953856 bytes (2.2 GB) copied, 4.53862 s, 481 MB/s
650397+0 records in
650396+0 records out
2664022016 bytes (2.7 GB) copied, 5.53923 s, 481 MB/s
766301+0 records in
766300+0 records out
3138764800 bytes (3.1 GB) copied, 6.53996 s, 480 MB/s
882257+0 records in
882257+0 records out
3613724672 bytes (3.6 GB) copied, 7.54068 s, 479 MB/s
996605+0 records in
996604+0 records out
4082089984 bytes (4.1 GB) copied, 8.54161 s, 478 MB/s
1111412+0 records in
1111411+0 records out
4552339456 bytes (4.6 GB) copied, 9.54211 s, 477 MB/s
1226173+0 records in
1226172+0 records out
5022400512 bytes (5.0 GB) copied, 10.5429 s, 476 MB/s
1339965+0 records in
1339964+0 records out
5488492544 bytes (5.5 GB) copied, 11.5437 s, 475 MB/s
1452893+0 records in
1452892+0 records out
5951045632 bytes (6.0 GB) copied, 12.5442 s, 474 MB/s
1567453+0 records in
1567452+0 records out
6420283392 bytes (6.4 GB) copied, 13.5451 s, 474 MB/s
^C
# fg
dd if=/dev/sdb of=/dev/null bs=4096
^C1824029+0 records in
1824028+0 records out
7471218688 bytes (7.5 GB) copied, 15.7939 s, 473 MB/s
```
What does that have to do with the L2ARC, which will be constantly full anyway (assuming the headers fit in memory)? If the fill rate is still too slow, then make a partition for some overprovisioning.
The entire drive slows to a crawl.
For anyone interested in working on this, there may be some relatively low hanging fruit to be had. The l2arc_evict() function is responsible for evicting headers which reference the next N bytes of the l2arc device to be overwritten. If this function were updated to additionally TRIM that vdev space before it's overwritten that _may_ help performance.
As @richardelling mentioned, currently it's only overwritten in l2arc_write_max (8M) chunks. This default value hasn't changed since at least 2008; since today's SSDs are so much more capable than a decade ago, I'd be surprised if increasing the default wasn't beneficial. It really should be at least as large as the maximum block size (16M). 64M would retain the original scaling factor of 64 times the average block size now that 1M blocks are very common. It would be interesting to do a scaling study with l2arc_evict() trimming ahead.
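A quick back-of-the-envelope check of that scaling factor: the old 8M default is 64 times a 128K average block, so keeping the same factor with today's common 1M blocks gives 64M.

```shell
# Scaling-factor arithmetic for l2arc_write_max (values in bytes)
old_block=$((128 * 1024))            # 128K average block
echo $((64 * old_block))             # original default: 8388608 = 8M
new_block=$((1024 * 1024))           # 1M average block, common today
echo $((64 * new_block))             # same factor: 67108864 = 64M
```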
Thought everyone tweaked those already? :) There are a few defaults that could use revisiting. They are sane but not necessarily ideal.
I can confirm this issue can occur even on underprovisioned (i.e. not the full capacity of the SSD dedicated to the L2ARC, with some left unformatted as an empty partition) but low-end SSDs, which tend to write data slower and slower as capacity utilization increases.
The issue became most obvious when using some of the L2ARC SSDs' free capacity for the SLOG: once the L2ARC partition on the same SSD devices became fully occupied, and the utilization of the L2ARC device later decreased for some reason, the write performance of sync writes to the SLOG was not restored to the level it had before the L2ARC partition became fully utilized.
I noticed this on one of my ZFS boxes with two 120GB SSDs, ~70% of whose space is used for the L2ARC and 1GB of each for the mirrored SLOG device, but I'm able to reproduce it in almost all of my ZFS setups where I'm using the same SSDs for the L2ARC and SLOG mirror, with the following steps:
1. Run `openssl enc -aes-128-ecb -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero | dd iflag=fullblock of=/dev/sdb bs=2M oflag=direct`, where `/dev/sdb` is one of my SLOG-backed ZVOLs, to put some highly random sync write workload on it. `openssl` is used instead of reading from `/dev/urandom` to make sure the random data outperforms the write speed of the SLOG.
2. Use `iostat` to monitor write latencies for some time; the write latency of the `dd` from step 1 also increases.
3. Run `blkdiscard` on the SSD partitions dedicated to the L2ARC.

So it would be best to have the write latencies of the SLOG restored automatically whenever utilization of the L2ARC goes down (BTW this isn't such a rare case in my workloads), without destroying the L2ARC and discarding all of its data.
In light of #9582 this issue becomes more crucial: previously we were going to lose the L2ARC on any reboot anyway, so I have a "destroy L2ARC, discard partition data, create new L2ARC" procedure executed on each reboot. But once persistent L2ARC goes live, it will no longer be acceptable for me to trade write performance for read performance using this procedure.
Depends on the workload; most SSDs are still orders of magnitude faster than a platter drive though.
why are you sharing a device between L2ARC and SLOG? That runs contrary to the purpose of a SLOG device: low-latency I/O.
> Depends on the workload; most SSDs are still orders of magnitude faster than a platter drive though.

This ^
Well, most probably I will wait for persistent L2ARC to become live and then provide a PR for this.
well, no, the latency introduced by sharing the device is not workload-dependent, it is hardware-dependent. Optane might (might) not exhibit the problem, but all drives do.
I don't understand the analogy you're trying to make. Does it really matter if the hardware is to blame or the workload when the end result is the same?
you're using a setup that is explicitly mentioned as a bad idea in documentation everywhere, and you want others to put in effort to make sure you can continue using bad hardware in subpar configurations. Please, can't you just fix your setup?
There are different cases above. Regardless of what some think about sharing a device between SLOG and L2ARC, it is a valid deployment. The slowness comes from lack of trim on some drives, and yes, that too happens on all hardware. As the example I gave above shows, it's a fairly serious degradation with some drives.
One case where trim does not make sense is security, but you're not arguing that?
For those interested I created #9789.
@Vlad1mir-D you already have the testing setup. Would you mind giving this a try?
Sure I will, thank you again for all your hard work making ZFS more suitable for the cheap tiered storage setups!