Type | Version/Name
--- | ---
Distribution Name | Proxmox VE
Distribution Version | 5.1-41
Linux Kernel | 4.13.13-6-pve
Architecture | amd64
ZFS Version | 0.7.6-1
SPL Version | 0.7.6-1
"qemu-img convert" will use all available memory as buffered pages if the source image is bigger than total memory and is in ZFS ZVOL image format. This causes high swapping activity and overall system slow down.
This issue only happens on "qemu-img convert" with ZFS ZVOL as the source image format. Is this some kind of memory leak or just a bug from qemu-img or bug ZFS?
On a host with 32 GB RAM and a ZFS zpool mirror on 2 spinning drives, the problem is reproducible every time:
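A minimal sketch of such a reproduction, assuming a pool named rpool and zvols larger than the host's RAM (pool, zvol names, and sizes are illustrative):

```sh
# Create a source zvol bigger than total RAM and fill it with data,
# plus a matching destination zvol.
zfs create -V 64G rpool/test-src
zfs create -V 64G rpool/test-dst
dd if=/dev/urandom of=/dev/zvol/rpool/test-src bs=1M status=progress

# zvol-to-zvol copy with qemu-img's default cache mode; watch "Buffers"
# in /proc/meminfo climb until the system starts swapping.
qemu-img convert -p -f raw -O raw /dev/zvol/rpool/test-src /dev/zvol/rpool/test-dst
```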
I didn't notice any errors in syslog while the overall system slowed down and heavy swap usage occurred.

Could you try to reproduce it with https://github.com/zfsonlinux/zfs/commit/e9a77290081b578144e9a911ad734f67274df82f ?
I wonder if 03b60eee78b0bf1125878dbad0fcffd717def61f would also help.
edit: my theory is that qemu is using the Linux buffer/cache and not returning it when finished. This would normally be fine on non-ZFS systems, but on ZFS it forces an ARC eviction, and the cache then sits around forever doing nothing.
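One way to check that theory during a convert (assuming the ZFS module's /proc/spl/kstat/zfs/arcstats interface is available) is to watch the page-cache counters grow while the ARC shrinks:

```sh
# Print buffer/cache and swap counters alongside the current ARC size, once per second.
watch -n1 "grep -E '^(Buffers|Cached|SwapFree):' /proc/meminfo; grep '^size' /proc/spl/kstat/zfs/arcstats"
```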
@chrone: if you need a test PVE kernel with either (or both) of these commits cherry-picked, feel free to ping me.
@gmelikov @bunder2015 Thanks for the tips; unfortunately, I'm no good at compiling ZFS from scratch, especially against the PVE kernel.
@Fabian-Gruenbichler Whoa, this is great. Sure, I could run a test on the PVE kernel. Are you Fabian from the Proxmox staff? If so, I could leave a message on the Proxmox forum.
@chrone81 I believe you should already have 03b60ee as part of 0.7.x; however, it's disabled by default. Setting zfs_arc_pc_percent to something like 100-500 may help block a mass ARC eviction due to the Linux buffer/cache. (That is, unless I'm misinterpreting how that option works.)
edit:
```sh
echo 3 > /proc/sys/vm/drop_caches
echo 100 > /sys/module/zfs/parameters/zfs_arc_pc_percent
# (re-prime the ARC)
# (test again to see if buffers cause ARC eviction and/or massive buffer leftovers)
```
Hi @bunder2015, thanks for pointing that out. The buffered pages are removed when 'qemu-img convert' finishes or is cancelled, though.
I just tested with 100 and 500, and neither helped limit the buffered pages used by qemu-img convert. The buffered pages keep growing and spill over to swap over time.
My apologies, it must only work on cache then (rather than both cache and buffers). :frowning_face:
PVE 5 kernel with #7170 backported:
http://download.proxmox.com/temp/pve-kernel-4.13.13-6-pve_4.13.13-42~test1_amd64.deb
md5sum / sha256sum:
```
e6b0f499110093121a7d9a84922010b0  pve-kernel-4.13.13-6-pve_4.13.13-42~test1_amd64.deb
984786973c94b4583252c40f73cc1d35ae5f9e482bc10e117703498e16169838  pve-kernel-4.13.13-6-pve_4.13.13-42~test1_amd64.deb
```
I think this is only tangentially ZFS related, as I can still reproduce it on a test system with #7170 included.
@chrone81: do you see an improvement if you manually re-try the qemu-img command (you might have to create the target zvol first if it does not exist already), but add "-t none -T none" before the zvol paths? I think qemu-img just has a very bad choice of default caching mode, which should be fixable on the calling side...
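For reference, a re-run along those lines might look like this (pool and disk names are made up):

```sh
# -T none / -t none open the source/target with O_DIRECT,
# bypassing the host page cache on both sides of the copy.
qemu-img convert -p -f raw -O raw -t none -T none \
    /dev/zvol/rpool/vm-100-disk-1 /dev/zvol/rpool/vm-100-disk-2
```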
Try again with memcg (the memory controller) enabled, maybe like this:
```sh
cgcreate -g memory:/sandbox
cgset -r memory.limit_in_bytes=100M sandbox
cgexec -g memory:/sandbox (your actual work)
```
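For example, wrapping the conversion itself (hypothetical zvol paths):

```sh
# Run qemu-img inside the 100M memcg so its page-cache usage is capped.
cgexec -g memory:/sandbox qemu-img convert -p -f raw -O raw \
    /dev/zvol/rpool/vm-100-disk-1 /dev/zvol/rpool/vm-100-disk-2
```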
@Fabian-Gruenbichler
I just tested without the patched PVE kernel, adding only the -t none and -T none options to qemu-img convert. RAM usage stays normal, as shown in the attached screenshot.

I'll test with the patched kernel and report back later.
@fcicq Unfortunately, I haven't tested this with Linux containers yet.
@Fabian-Gruenbichler Using the ZFS-patched PVE kernel didn't help. Only the qemu-img convert -t none and -T none options fixed this issue.