I've been running into many memory-related issues since switching to ZFS, including instances where the OOM killer triggered despite plenty of free memory being available, and instances where programs fail due to out-of-memory conditions.
For an example of a failure, when trying to recreate my initramfs:
* Gentoo Linux Genkernel; Version 64
* Running with options: --install initramfs
* Using genkernel.conf from /etc/genkernel.conf
* Sourcing arch-specific config.sh from /usr/share/genkernel/arch/x86_64/config.sh ..
* Sourcing arch-specific modules_load from /usr/share/genkernel/arch/x86_64/modules_load ..
* Linux Kernel 4.7.2-hardened-gnu for x86_64...
* .. with config file /usr/share/genkernel/arch/x86_64/kernel-config
* busybox: >> Using cache
* initramfs: >> Initializing...
* >> Appending base_layout cpio data...
* >> Appending udev cpio data...
cp: cannot stat '/etc/modprobe.d/blacklist.conf': No such file or directory
* cannot copy /etc/modprobe.d/blacklist.conf from udev
cp: cannot stat '/lib/systemd/network/99-default.link': No such file or directory
* cannot copy /lib/systemd/network/99-default.link from udev
* >> Appending auxilary cpio data...
* >> Copying keymaps
* >> Appending busybox cpio data...
* >> Appending modules cpio data...
* >> Appending zfs cpio data...
* >> Including zpool.cache
* >> Appending blkid cpio data...
* >> Appending ld_so_conf cpio data...
* ldconfig: adding /sbin/ldconfig...
* ld.so.conf: adding /etc/ld.so.conf{.d/*,}...
cpio: lib64 not created: newer or same age version exists
cpio: lib64 not created: newer or same age version exists
cpio: lib64/ld-linux-x86-64.so.2 not created: newer or same age version exists
cpio: lib64/librt.so.1 not created: newer or same age version exists
cpio: lib64/libpthread.so.0 not created: newer or same age version exists
cpio: lib64/libuuid.so.1 not created: newer or same age version exists
cpio: lib64/libz.so.1 not created: newer or same age version exists
cpio: lib64/libblkid.so.1 not created: newer or same age version exists
cpio: lib64/libc.so.6 not created: newer or same age version exists
cpio: usr/lib64 not created: newer or same age version exists
cpio: lib64 not created: newer or same age version exists
cpio: lib64/ld-linux-x86-64.so.2 not created: newer or same age version exists
cpio: lib64/libuuid.so.1 not created: newer or same age version exists
cpio: lib64/libc.so.6 not created: newer or same age version exists
cpio: lib64/libblkid.so.1 not created: newer or same age version exists
* >> Finalizing cpio...
* >> Compressing cpio data (.xz)...
/usr/bin/xz: /var/tmp/genkernel/initramfs-4.7.2-hardened-gnu: Cannot allocate memory
* ERROR: Compression (/usr/bin/xz -e --check=none -z -f -9) failed
I've experienced this failure multiple times already. In every case, reducing the ARC size (i.e. temporarily reducing zfs_arc_max) solves it, as does e.g. echo 2 > /proc/sys/vm/drop_caches.
At the time of this failure, my system reported about 80% of memory in use, and after the echo 2 command described above it went down to about 60%.
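For completeness, the workaround boils down to these two commands as root (the 4 GiB value here is just an example, not the value I actually use):
echo $((4 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max   # temporarily cap the ARC at 4 GiB
echo 2 > /proc/sys/vm/drop_caches                                           # reclaim dentries/inodes and other reclaimable slab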
The weird thing is that I can't explain this high memory usage. This is my current output of arc_summary.py: https://0x0.st/MFr.txt
As you can see, ARC claims to be using about 11 GiB. Tallying together the top memory-hungry processes in top gives me about 3 GiB at most. There are no significant amounts of data in tmpfs either.
Together, this means my system should be consuming 11 + 3 = 14 GiB of memory, i.e. about 14/32 ≈ 43% usage rather than 60%. Why does free -m report almost 19 GiB used? Where are the missing 5 GiB accounted for? I've never had this weird issue before ZFS, nor have I ever “run out” of memory before ZFS.
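For reference, the tally itself is nothing fancy, roughly this (a sketch using the standard kstat path):
awk '/^size / {printf "ARC size: %.1f GiB\n", $3 / 2^30}' /proc/spl/kmem/arcstats
free -m | awk '/^Mem:/ {printf "used per free: %.1f GiB\n", $3 / 1024}'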
I'm considering drastically reducing the ARC size as a temporary measure at least until this issue can be tracked down and fixed, wherever it comes from.
referencing:
https://github.com/zfsonlinux/zfs/issues/2298 Document or distribute memory fragmentation workarounds in 0.6.3
https://github.com/zfsonlinux/zfs/issues/4953 External Fragmentation leading to Out of memory Condition
https://github.com/zfsonlinux/zfs/issues/466 Memory usage keeps going up
https://github.com/zfsonlinux/zfs/pull/3441 ABD: linear/scatter dual typed buffer for ARC (ver 2)
@haasn xz -9 allocates a lot of memory, even by today's standards. Memory fragmentation can cause it to fail even if there are free pages around. You can get a sense of just how fragmented your memory is by looking at /proc/buddyinfo (you want large numbers on the right).
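Something like this lets you keep an eye on it over time (just a sketch):
# columns are free blocks of order 0 (4 KiB) up to order 10 (4 MiB); the right-hand columns matter for large allocations
watch -n 5 cat /proc/buddyinfo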
A small update: After setting the ARC size limit to 8 GiB max, I've done a reboot. I'm now running a clean system with virtually no memory consumption from programs. (about 700 MiB from the browser I'm typing this in and basically nothing else)
I've read a bunch of data from disk (tar cf /dev/null) to fill up the ARC.
After doing this, arc_summary.py reports the current usage as 6 GiB (75% of the 8 GiB max), yet free -m considers my total used memory to be 10 GiB (32% used). Again there is about a 4 GiB deficit between what ZFS claims to consume and what it actually consumes.
Note: As an experiment, I tried removing my L2ARC devices, because I read that zfs needs to store tables of some sort to support them. However, this did not affect memory usage at all. (Unless I need to reboot for the change to take effect?)
You can find my current arc_summary.py output here: https://0x0.st/MFJ.txt
This is my current: /proc/buddyinfo:
Node 0, zone DMA 1 1 0 1 1 1 1 0 1 1 3
Node 0, zone DMA32 25784 4161 216 389 146 78 32 15 1 1 257
Node 0, zone Normal 175248 31387 1710 2261 780 395 236 158 28 8 1959
Node 1, zone Normal 97675 18760 574 323 304 539 404 266 29 20 2618
Also worth noting is that I have two pools imported, although the second pool has primarycache=metadata. Exporting the second pool did not affect memory usage.
After running echo 2 > /proc/sys/vm/drop_caches, my memory usage went down by about 2 GiB, to 8 GiB (26% used), and arc_summary.py reports 4.5 GiB for the ARC size. My total usage is still consistently about 4 GiB higher than what ZFS claims.
(To work around this temporarily, and since the number seems to be fairly constant, I'm going to subtract 4 GiB from my normal zfs_arc_max setting, giving it 16-4 = 12 GiB total.)
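For the record, the persistent equivalent would be a module option along these lines (a sketch, with the 12 GiB expressed in bytes):
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=12884901888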
P.s. I forgot to mention, I am on kernel 4.7.2 and spl/zfs git master.
I decided to re-investigate this after fixing #5036 to eliminate that as the cause. Additionally, I am now testing on a stock kernel (not hardened) to eliminate more potential issues.
Long story short: the problem persists; the difference between the RAM I can account for and what free reports is again almost exactly 4 GiB.
(1.7 GiB is the total sum of all resident+shared memory currently in use, 6.91 GiB is what ARC reports → totals to 8.61 GiB, but free reports 12.6 GiB in use)
It _seems_ like this memory usage is slowly growing over time, while the node 0 memory fragmentation also grew (according to /proc/buddyinfo). I would give you more details, but when I tried reducing the ARC size followed by echo 2 > /proc/sys/vm/drop_caches and echo 1 > /proc/sys/vm/compact_memory, my system hard-froze shortly thereafter. (Completely unresponsive to input and networking, didn't even respond to magic sysrq)
I slightly suspect that there may be some sort of fragmentation-inducing memory leak in some SPL/ZFS component on my machine, since I haven't had these problems while running btrfs on the same hardware, with the same kernel version, doing the same things.
Further update: I had a look through /proc/spl/kmem/slab and noticed the allocation of many zio_buf/zio_data_buf slabs, about 3GB in total (and currently shrinking). (Full output here: https://0x0.st/uX8.txt)
With this extra data accounted for, I'm only “missing” about 2 GiB currently, which probably has some similar explanation. It seems I was under the misguided assumption that ZFS would only use about as much RAM as I had configured for the ARC. Is it normal for ZFS to have several extra GB of slabs allocated for other purposes?
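For anyone wanting to reproduce that tally, this is roughly what I did; a sketch that assumes the third column of /proc/spl/kmem/slab is the per-cache size in bytes and that the first two lines are headers, which may not hold for every SPL version:
head -n 2 /proc/spl/kmem/slab       # check the column layout first
awk 'NR > 2 {total += $3} END {printf "SPL slab total: %.1f GiB\n", total / 2^30}' /proc/spl/kmem/slab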
Perhaps worth noting is that I am using SLAB instead of SLUB as my kernel slab allocator, because I have previously observed it performing better for me under certain workloads, but it might be worth re-evaluating that assumption for ZFS.
Edit: Unfortunately, it seems like the /proc/spl/kmem/slab output also includes objects associated with the ARC, so I was doing a bit of double-counting. (I'm also not entirely sure which size fields to go by; the output of this file seems a bit cryptic.) So my above reasoning is probably invalid.
referencing:
https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSonLinuxMemoryWhere
Where your memory can be going with ZFS on Linux
https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/tXHQPBE6uHg Where ZFS on Linux's memory usage goes and gets reported (I think)
Interesting, that first article in particular pretty much completely answers all of the doubts and questions I had, and also helps me understand why ZFS is causing me so many out-of-memory conditions when I have other programs running at the same time (i.e. I can't entirely dedicate my RAM to the ARC and SPL slab objects, which is what the defaults seem to be tuned for).
I'm considering this issue resolved unless I run into more troubles. Thanks for the pointers.
For reference, PR #5009, which is actively being worked on, is a big step towards addressing these issues.
@haasn take a look at: http://list.zfsonlinux.org/pipermail/zfs-discuss/2013-November/012336.html
Disabling transparent hugepages might also lessen memory fragmentation and consumption.
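For reference, the runtime toggle looks like this (run as root; it only lasts until the next reboot):
cat /sys/kernel/mm/transparent_hugepage/enabled     # current setting, e.g. [always] madvise never
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag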
@kernelOfTruth I gave it a try. (I also got around to setting up monitoring/graphs so I could observe this over time)

I tried out your suggestion by using echo never > /sys/kernel/mm/transparent_hugepage/enabled when I read it, which was shortly after 20:40 local time. As you can see, there is a dramatic drop-off in the number of free pages (i.e. increase in fragmentation) corresponding almost exactly with the change.
Just now (at around 4:00 local time) I saw this graph and decided to re-enable transparent_hugepage/enabled=always, as well as transparent_hugepage/defrag=always for good measure. (Since I wasn't sure whether just enabling transparent_hugepage would have changed anything)
As you can see, memory fragmentation immediately went down rather dramatically, at least for smaller fragments, despite practically no change in the amount of consumed memory (or its distribution). The number of free pages for large chunks still seems rather low compared to a fresh boot, but it's higher than it was before.
Update: Seems to have been a fluke caused by switching the setting more than anything. Not an hour after enabling hugepages again, memory fragmentation has gotten even worse than before (free page count dropped to basically nothing).
If I had to guess, I think what I'm seeing would be explained by changes to this variable only taking effect for new allocations rather than existing ones - and enabling defragmentation caused a spike in the available free pages because it defragmented all of the existing ones.
Edit: That being said, I disabled it again and not much later my free page count skyrocketed again, so now I'm not really sure. These values are probably pretty unreliable at the moment either way, since I'm rsyncing some data off old disks. I'll comment again when I can provide more concrete data.
Edit 2: Confirmed that it was the rsync causing heavy memory pressure that increased my fragmentation; I cross-confirmed this by looking at the overall memory usage and noticing that nearly all available memory was being used for internal caches. It seems the “memory fragmentation” graph really only considers free memory rather than available memory. (Which is somewhat odd IMO, but oh well.) Everything's fine again now.
It seems this problem won't leave me alone. On my machine right now, 75% of my RAM is being used. (It was at 90% before I reduced my ARC size)
The current ARC size is 8 GiB (25%). About 3 GiB is used by applications, and another 3 GiB was in tmpfs. This memory (which I can directly account for) adds up to 14 GiB (43%), leaving 10 GiB of memory in use by ZFS's various slabs.
Any tips on how I can track down _why_ exactly they are being used, and ideally, limit them? I wonder what would have happened if I had less available RAM to begin with. Would zfs have exploded, or would it have self-limited? If the latter, can I do this manually?
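So far the only introspection I know how to do is the following (a sketch), which tells me what the slabs are but not why they are held:
slabtop -o -s c | head -n 25        # largest kernel slab caches by size (zio_*, dnode_t, zfs_znode_cache, ...)
cat /proc/spl/kmem/slab             # SPL's own view of its caches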
@haasn is it possible for you to test out current master?
I've pre-set a 10 GB ARC limit, but the following is what's currently in use while transferring 2.5 TB of data from one ZFS pool to another (via rsync):
p 4 5368709120
c 4 10737418240
c_min 4 1073741824
c_max 4 10737418240
size 4 3173034432
meaning it hovered between 2.5 and 3.7 GB; adding the other memory consumption, the total is probably still less than 10 GB, so either the compressed ARC makes usage really efficient and/or it is now able to stick much more closely to the preset limits.
Since you mentioned rsync:
linking
https://github.com/Feh/nocache
and
http://insights.oetiker.ch/linux/fadvise/
here
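Usage is basically just prefixing the transfer command, roughly like this (a sketch; the paths are made up):
nocache rsync -a /mnt/old-disks/ /tank/backup/      # keeps the copied data from lingering in the page cache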
@kernelOfTruth I can upgrade. Right now I'm on commit 9907cc1cc8c16fa2c7c03aa33264153ca2bdad6c (and zfsonlinux/spl@aeb9baa618beea1458ab3ab22cbc0f39213da6cf), is there a commit in particular that you think will help?
One thing I noticed while inspecting /proc/spl/kmem/slab is that the zio_data_buf_131072 usage is extremely high: https://0x0.st/SOb.txt (6 GiB in total). Also, everything about this is 0. At first I thought that was because of spl_kmem_cache_slab_limit, but it seems like that is set to 16384 - not 163840 (off by a factor of 10).
Edit: I just realized, spl_kmem_cache_slab_limit is a _lower_ limit, not an upper limit.
@haasn the changes since September 7th, specifically:
https://github.com/zfsonlinux/zfs/pull/5078 Compressed ARC, Compressed Send/Recv, ARC refactoring (changes from yesterday, September 13th)
for spl, basically tag 0.7.0-rc1 (September 7th).
Make sure to have recent backups of your data (just in case, which is actually always a good idea & practice)
As an example:
current arc stats (after several hundred GB of data transferred):
p 4 5368709120
c 4 10737418240
c_min 4 1073741824
c_max 4 10737418240
size 4 3332885640
compressed_size 4 1000249856
uncompressed_size 4 4317090816
overhead_size 4 794254848
hdr_size 4 120223344
data_size 4 113368576
metadata_size 4 1681136128
dbuf_size 4 304738112
dnode_size 4 807605400
bonus_size 4 305814080
anon_size 4 2877952
plus
1147872 1033816 90% 0.50K 35871 32 573936K zio_buf_512
1125514 994428 88% 0.30K 43289 26 346312K dmu_buf_impl_t
1082484 960373 88% 0.82K 27756 39 888192K dnode_t
939872 937707 99% 0.99K 29371 32 939872K zfs_znode_cache
939455 937174 99% 0.27K 32395 29 259160K sa_cache
774976 246379 31% 0.06K 12109 64 48436K kmalloc-64
707994 689363 97% 0.19K 33714 21 134856K dentry
370944 345673 93% 0.32K 15456 24 123648K arc_buf_hdr_t_full
219648 87379 39% 0.01K 429 512 1716K kmalloc-8
188118 142108 75% 0.09K 4479 42 17916K kmalloc-96
164240 156315 95% 4.00K 20530 8 656960K zio_buf_4096
146304 127594 87% 0.03K 1143 128 4572K kmalloc-32
78387 42943 54% 0.08K 1537 51 6148K arc_buf_t
49344 41727 84% 0.06K 771 64 3084K range_seg_cache
40256 40256 100% 0.12K 1184 34 4736K kernfs_node_cache
39936 39139 98% 1.00K 1248 32 39936K zio_buf_1024
38690 38538 99% 16.00K 19345 2 619040K zio_buf_16384
35272 35241 99% 8.00K 8818 4 282176K kmalloc-8192
21632 20889 96% 0.03K 169 128 676K nvidia_pte_cache
20820 20317 97% 2.50K 1735 12 55520K zio_buf_2560
which adds around 2-3 GB to the ~3 GB already accounted for in arcstats; that comes to roughly 6 GB, still significantly lower than 10 GB (the set limit).
@haasn it seems that the conservative memory usage after the compressed ARC patches was a regression rather than a fundamental change in behavior:
https://github.com/zfsonlinux/zfs/issues/5128 Poor cache performance
https://github.com/zfsonlinux/zfs/pull/5129 Fix arc_adjust_meta_balanced()
So the only solution right now is to set the ARC size to approx. 40% of RAM when you want it to occupy, for example, 50%.
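Concretely, on a 32 GiB machine that would look something like this (a sketch):
echo $((32 * 1024 * 1024 * 1024 * 40 / 100)) > /sys/module/zfs/parameters/zfs_arc_max   # cap at ~40% (about 12.8 GiB) to end up near 50% total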
The netdata graphs shed new light on the situation:

According to this graph, which does not seem to be displaying my ARC size (16 GB at the time of writing), I have 9-10 GB of memory spent on unreclaimable slab objects. What are these 10 GB currently doing? Is there any way I can introspect this figure further?
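The only cross-check I've found so far is the kernel's own slab accounting (a sketch):
grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo    # the netdata figure should roughly match SUnreclaim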
Hi @haasn, we encountered the same problem on several servers - did you manage to find a solution?
@meteozond here is my solution to the problem: http://brockmann-consult.de/peter2/bc-zfsonlinux-memorymanagement2.tgz
it will loop forever and keep tuning and dropping caches if used RAM gets too high.
Both of my large 36-disk ZoL machines hang if I don't run this. I wrote it in bash years ago and recently redid it in python3 to fix the float handling and exceptions.
@meteozond @haasn any update on this problem? I use nmon to get info about current slab usage.
When I run echo 3 > /proc/sys/vm/drop_caches on a system where slab usage is around 6 GB, it drops to 1.6 GB almost instantaneously, and after about 15 minutes it dropped further to ~400 MB.
@petermaloney is your script doing anything fancier than that?
PS: I'm on Ubuntu 16.04 + zfsutils-linux 0.6.5.6-0ubuntu17
@lechup The source is there for you to read what it does (not sure if you need to know Python to understand it). The script is fancier than that, yes. What it does is constantly manage the zfs_arc_meta_limit and zfs_arc_max module parameters to keep the total system used RAM within a specified range, like the 89-93% used I set as the default (works well for me, but it's configurable).
I found that setting those module parameters just once doesn't work well... a low value might still end up using all your RAM, and a not-so-high value might not use enough RAM for best performance. Or a value that works well sometimes might not work well at other times. So this script keeps your free RAM around 10%. The machine I originally wrote it for was very slow if I didn't use enough RAM, so it was very important to use lots when available, and this strategy was very effective.
The script will also use drop_caches (and zfs set primarycache=metadata) if it panics because setting the lowest value wasn't enough to keep RAM low, and then it sets primarycache=all again later. Even if it runs fine for weeks, there's still a chance that zfsonlinux eats way more RAM than usual for a short time, causing this to happen.
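Stripped down to a sketch, the core idea is a loop like the following - the real script does far more sanity checking, and the thresholds, step size and 2 GiB floor below are just example values:
#!/bin/bash
# Sketch only: shrink zfs_arc_max and drop caches whenever used RAM crosses a threshold.
ARC_PARAM=/sys/module/zfs/parameters/zfs_arc_max
MIN_ARC=$((2 * 1024 * 1024 * 1024))     # example floor: never shrink the ARC cap below 2 GiB
STEP=$((1 * 1024 * 1024 * 1024))        # shrink in 1 GiB steps

while sleep 30; do
    # percent of MemTotal that is not MemAvailable, i.e. genuinely in use
    used_pct=$(awk '/^MemTotal/ {t=$2} /^MemAvailable/ {a=$2} END {printf "%d", (t - a) * 100 / t}' /proc/meminfo)
    if [ "$used_pct" -gt 90 ]; then
        cur=$(cat "$ARC_PARAM")                 # note: reads 0 if no limit was ever set
        new=$((cur - STEP))
        [ "$new" -lt "$MIN_ARC" ] && new=$MIN_ARC
        echo "$new" > "$ARC_PARAM"
        echo 2 > /proc/sys/vm/drop_caches       # reclaim dentries/inodes and other reclaimable slab
    fi
done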
@petermaloney thanks for sharing your code and explaining how it works - I'll give it a shot!
@petermaloney your link seems to be dead - do you have it on GitHub somewhere?
@gdevenyi the path changed slightly https://www.brockmann-consult.de/peter2/zfs/bc-zfsonlinux-memorymanagement2.tgz
and BTW there's a hang bug in the zfs version shipped with Ubuntu 16.04 which I think might not get triggered if you lower the meta limit (my script sets it very generously)... (see https://github.com/zfsonlinux/zfs/commit/25458cbef9e59ef9ee6a7e729ab2522ed308f88f ), so to maybe prevent that (still testing...), patch the script like:
-meta_limit = int((limit_gb-2)*1024*1024*1024)
+meta_limit = int((limit_gb*2/3)*1024*1024*1024)
Thanks @petermaloney. I'm still on 0.6.5.11. I just started having these exploding memory usages and OOMs on a fileserver that had been running fine for years. Right now I'm seeing huge zfs slabs for unknown reasons. The only thing that's saved me is greatly relaxing arc_min so the ARC can shrink. Interestingly, your script does absolutely nothing for me. I guess I'll wait another month to see if the 0.7.x series is finally stable, since new features are only being added to 0.8.x now.