ZFS: Possible memory leak with qemu-img convert when the source image is a ZFS ZVOL.

Created on 26 Feb 2018 · 12 comments · Source: openzfs/zfs

System information

Type | Version/Name
Distribution Name | Proxmox VE
Distribution Version | 5.1-41
Linux Kernel | 4.13.13-6-pve
Architecture | amd64
ZFS Version | 0.7.6-1
SPL Version | 0.7.6-1

Describe the problem you're observing

"qemu-img convert" will use all available memory as buffered pages if the source image is bigger than total memory and is in ZFS ZVOL image format. This causes high swapping activity and overall system slow down.

This issue only happens on "qemu-img convert" with ZFS ZVOL as the source image format. Is this some kind of memory leak or just a bug from qemu-img or bug ZFS?

Describe how to reproduce the problem

On a host with 32GB RAM and a ZFS zpool mirror on 2 spinning drives, the problem is reproducible every time (a command sketch follows the list):

  • Full clone of a 100GB VM from a ZFS ZVOL source image to a ZFS ZVOL target image uses all available memory as buffered pages.
  • Full clone of a 100GB VM from a ZFS ZVOL source image to a qcow2-on-ZFS target image uses all available memory as buffered pages.
  • Full clone of a 100GB VM from a qcow2-on-ZFS source image to a ZFS ZVOL target image uses more or less 20% buffered pages.
  • Full clone of a 100GB VM from a qcow2-on-ZFS source image to a qcow2 target image does not use buffered pages or increase memory usage at all.
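
For reference, a full clone here boils down to a qemu-img convert of the VM disk. A minimal sketch of an invocation that triggers the first two cases, assuming hypothetical pool names, VM IDs and paths:

qemu-img convert -p -f raw -O qcow2 \
    /dev/zvol/rpool/data/vm-100-disk-0 \
    /var/lib/vz/images/100/vm-100-disk-0.qcow2   # default caching on both sides
# while this runs, the buff/cache column of free -m keeps climbing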

Include any warning/errors/backtraces from the system logs

Didn't notice any errors in syslog when the overall system slowed down and heavy swap usage occurred.

(screenshot: zfs zvol qemu-img convert issue)

All 12 comments

I wonder if 03b60eee78b0bf1125878dbad0fcffd717def61f would also help.

edit: my theory is that qemu is using the Linux buffer/cache and not returning it when finished. This would normally be okay on non-ZFS systems, but on ZFS it forces an ARC eviction, and the cache sits around forever doing nothing.
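
One rough way to watch the page cache and the ARC compete while the convert runs (a sketch; the kstat path is standard for ZFS on Linux, the awk filter is my own):

free -m   # the buff/cache column shows the page-cache growth
awk '/^size/ {print "ARC size (bytes):", $3}' /proc/spl/kstat/zfs/arcstats   # shrinks as buffers grow, if the theory holds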

@chrone: if you need a test PVE kernel with either (or both) of these commits cherry-picked, feel free to ping me.

@gmelikov @bunder2015 Thanks for the tips. Unfortunately, I'm no good at compiling ZFS from scratch, especially against the PVE kernel.

@Fabian-Gruenbichler Whoa, this is great. Sure, I could run a test on the PVE kernel. Are you Fabian from the Proxmox staff? If so, I could leave a message on the Proxmox forum.

@chrone81 I believe you should already have 03b60ee as part of 0.7.x; however, it's disabled by default. Setting zfs_arc_pc_percent to something like 100-500 may help block a mass ARC eviction due to the Linux buffer/cache (that is, unless I'm misinterpreting how that option works).

edit:

echo 3 > /proc/sys/vm/drop_caches
echo 100 > /sys/module/zfs/parameters/zfs_arc_pc_percent
# re-prime the ARC
# test again to see if buffers cause ARC eviction and/or massive buffer leftovers

Hi @bunder2015, thanks for pointing that out. The buffered pages are removed when 'qemu-img convert' finishes or is cancelled, though.

I just tested with 100 and 500, and it didn't help limit the buffered pages used by qemu-img convert. The buffered pages keep growing and spill over into swap over time.

My apologies, it must only work on cache then (rather than both cache and buffers). :frowning_face:

PVE 5 kernel with #7170 backported:
http://download.proxmox.com/temp/pve-kernel-4.13.13-6-pve_4.13.13-42~test1_amd64.deb

md5/sha256sum:

e6b0f499110093121a7d9a84922010b0  pve-kernel-4.13.13-6-pve_4.13.13-42~test1_amd64.deb
984786973c94b4583252c40f73cc1d35ae5f9e482bc10e117703498e16169838  pve-kernel-4.13.13-6-pve_4.13.13-42~test1_amd64.deb

I think this is only tangentially ZFS related, as I can still reproduce it on a test system with #7170 included.

@chrone81: do you see an improvement if you manually re-try the qemu-img command (you might have to create the target zvol first if it does not exist already), but add "-t none -T none" before the zvol paths? I think qemu-img just has a very bad choice of default caching mode, which should be fixable on the calling side...
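
In case a concrete command helps, a sketch of that retry with made-up pool names and VM IDs (-t sets the destination cache mode, -T the source cache mode; "none" means O_DIRECT, bypassing the page cache):

zfs create -V 100G rpool/data/vm-101-disk-0   # create the target zvol first if it does not exist
qemu-img convert -p -f raw -O raw -t none -T none \
    /dev/zvol/rpool/data/vm-100-disk-0 \
    /dev/zvol/rpool/data/vm-101-disk-0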

try again with the memcg / memory controller enabled, maybe like this:
cgcreate -g memory:/sandbox
cgset -r memory.limit_in_bytes=100M sandbox
cgexec -g memory:/sandbox <your actual work>
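
For example, with the placeholder filled in using the hypothetical zvol paths from earlier, so the page cache used on behalf of the convert is charged against the 100M limit:

cgexec -g memory:/sandbox qemu-img convert -p -f raw -O raw \
    /dev/zvol/rpool/data/vm-100-disk-0 \
    /dev/zvol/rpool/data/vm-101-disk-0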

@Fabian-Gruenbichler

Just tested without the patched PVE kernel but with the qemu-img convert -t none and -T none options. The RAM usage is normal, as in the attached screenshot.

(screenshot: zfs zvol qemu-img convert issue fix)

I'll test with the patched kernel and report back later.

@fcicq Unfortunately, I haven't tested this with Linux containers yet.

@Fabian-Gruenbichler using the ZFS-patched PVE kernel didn't help. Only the qemu-img convert -t none and -T none options fixed this issue.
