Type | Version/Name
--- | ---
Distribution Name | Ubuntu
Distribution Version | 18.04
Linux Kernel | 4.15.0-33-generic
Architecture | x86_64
ZFS Version | https://github.com/delphix/zfs/commit/0ef8832
SPL Version | https://github.com/delphix/zfs/commit/0ef8832
When running an application that heavily uses zfs send/recv, we'll sometimes see lzc_receive_resumable return EBUSY. Using trace-bpfcc we've determined that EBUSY is coming from the dsl_dataset_handoff_check function (I think, via dsl_dataset_clone_swap_check_impl). Additionally, we've only seen this failure occur when the application is receiving a ZVOL.
Before I continue, I want to make it clear that I'm not sure whether the root cause of the EBUSY is ZFS or something else (e.g. an application bug), and unfortunately I haven't yet reproduced the problem without using this proprietary application to trigger the EBUSY condition. Regardless, I wanted to (perhaps prematurely) open a ZFS bug, since the evidence I have so far leads me to think this might end up being a problem in ZFS. If it's determined this isn't due to a ZFS defect, feel free to close this out.
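For context, here's roughly where I believe the EBUSY originates. This is a paraphrased sketch of dsl_dataset_handoff_check based on my reading of the code (not a verbatim copy of the source at the commit above, so details may differ): any outstanding long hold on the dataset, other than the prospective owner's, causes the check to fail with EBUSY.
/*
 * Paraphrase of dsl_dataset_handoff_check(), which is reached from
 * dmu_recv_end_check() via dsl_dataset_clone_swap_check_impl().
 */
static int
handoff_check_sketch(dsl_dataset_t *ds, void *owner)
{
	boolean_t held;

	if (owner != NULL) {
		/* Temporarily drop the prospective owner's own long hold... */
		dsl_dataset_long_rele(ds, owner);
	}

	/* ...so any remaining long hold must belong to someone else. */
	held = dsl_dataset_long_held(ds);

	if (owner != NULL)
		dsl_dataset_long_hold(ds, owner);

	if (held)
		return (SET_ERROR(EBUSY));	/* the error we're seeing */

	return (0);
}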
Here's what my reproducer looks like:
$ cat ~/reproducer.sh
#!/bin/bash -x
while true; do
    sudo trace-bpfcc -t -K \
        'r::dsl_dataset_handoff_check (retval == 16)' \
        'p::dsl_dataset_handoff_check "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
        'p::dsl_dataset_long_hold "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
        'p::dsl_dataset_long_rele "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
        > ~/LX-1302.out &
    # Let BPF have a few seconds to start up before starting the test
    sleep 10
    # Run proprietary application, which may fail due to EBUSY
    sudo ./gradlew --daemon -x jFR :server:replication:dxosTest --tests '*sourceLaunchesNewJobOnStartup'
    rc=$?
    sudo pkill trace-bpfcc
    [[ $rc -eq 0 ]] || break
done
After a few iterations of the loop, it fails, and looking at the output from trace-bpfcc I see the following:
... snip ...
203.5538 12849 12849 txg_sync dsl_dataset_handoff_check ds = 0xffff88ffe1c8c800, owner = 0x0
dsl_dataset_handoff_check+0x1 [kernel]
dmu_recv_end_check+0x19e [kernel]
dsl_sync_task_sync+0x7a [kernel]
dsl_pool_sync+0x2af [kernel]
spa_sync+0x4ef [kernel]
txg_sync_thread+0x223 [kernel]
thread_generic_wrapper+0x74 [kernel]
kthread+0x121 [kernel]
ret_from_fork+0x35 [kernel]
... snip ...
203.5948 12834 12834 z_zvol dsl_dataset_long_hold ds = 0xffff88ffe1c8c800, owner = 0xffff89004de44400
dsl_dataset_long_hold+0x1 [kernel]
dsl_dataset_own+0x3b [kernel]
dmu_objset_own+0x92 [kernel]
zvol_first_open+0x98 [kernel]
zvol_open+0x187 [kernel]
__blkdev_get+0xda [kernel]
blkdev_get+0x129 [kernel]
device_add_disk+0x3f9 [kernel]
add_disk+0xe [kernel]
zvol_create_minor_impl+0x45b [kernel]
zvol_create_minors_impl+0x12d [kernel]
zvol_task_cb+0x6d [kernel]
taskq_thread+0x210 [kernel]
kthread+0x121 [kernel]
ret_from_fork+0x35 [kernel]
... snip ...
204.2233 12834 12834 z_zvol dsl_dataset_long_rele ds = 0xffff88ffe1c8c800, owner = 0xffff89004de44400
dsl_dataset_long_rele+0x1 [kernel]
dmu_objset_disown+0x30 [kernel]
zvol_last_close+0x26 [kernel]
zvol_release+0xeb [kernel]
__blkdev_put+0x1dc [kernel]
blkdev_put+0x4c [kernel]
device_add_disk+0x40e [kernel]
add_disk+0xe [kernel]
zvol_create_minor_impl+0x45b [kernel]
zvol_create_minors_impl+0x12d [kernel]
zvol_task_cb+0x6d [kernel]
taskq_thread+0x210 [kernel]
kthread+0x121 [kernel]
ret_from_fork+0x35 [kernel]
... snip ...
206.2021 12834 12834 z_zvol dsl_dataset_long_hold ds = 0xffff88ffe1c8c800, owner = 0xffffffffc052e7b0
dsl_dataset_long_hold+0x1 [kernel]
dsl_dataset_own+0x3b [kernel]
dmu_objset_own+0x92 [kernel]
zvol_create_minor_impl+0x139 [kernel]
zvol_create_minors_impl+0x12d [kernel]
zvol_task_cb+0x6d [kernel]
taskq_thread+0x210 [kernel]
kthread+0x121 [kernel]
ret_from_fork+0x35 [kernel]
... snip ...
206.2097 12849 12849 txg_sync dsl_dataset_handoff_check ds = 0xffff88ffe1c8c800, owner = 0x0
dsl_dataset_handoff_check+0x1 [kernel]
dmu_recv_end_check+0x19e [kernel]
dsl_sync_task_sync+0x7a [kernel]
dsl_pool_sync+0x2af [kernel]
spa_sync+0x4ef [kernel]
txg_sync_thread+0x223 [kernel]
thread_generic_wrapper+0x74 [kernel]
kthread+0x121 [kernel]
ret_from_fork+0x35 [kernel]
206.2097 12849 12849 txg_sync dsl_dataset_handoff_check
kretprobe_trampoline+0x0 [kernel]
dmu_recv_end_check+0x19e [kernel]
dsl_sync_task_sync+0x7a [kernel]
dsl_pool_sync+0x2af [kernel]
spa_sync+0x4ef [kernel]
txg_sync_thread+0x223 [kernel]
thread_generic_wrapper+0x74 [kernel]
kthread+0x121 [kernel]
ret_from_fork+0x35 [kernel]
... snip ...
206.2230 12834 12834 z_zvol dsl_dataset_long_rele ds = 0xffff88ffe1c8c800, owner = 0xffffffffc052e7b0
dsl_dataset_long_rele+0x1 [kernel]
dmu_objset_disown+0x30 [kernel]
zvol_create_minor_impl+0x411 [kernel]
zvol_create_minors_impl+0x12d [kernel]
zvol_task_cb+0x6d [kernel]
taskq_thread+0x210 [kernel]
kthread+0x121 [kernel]
ret_from_fork+0x35 [kernel]
... snip ...
From this, we can see the call to dsl_dataset_handoff_check by txg_sync at time 206.2097 failed with EBUSY. Looking just above that call, we can see z_zvol call dsl_dataset_long_hold for this same dataset, at time 206.2021. And then z_zvol doesn't drop this hold until time 206.2230, when it calls dsl_dataset_long_rele.
Since I don't fully understand how/when the ZVOL minors are created, I could be mistaken, but I think it's this hold by z_zvol that's causing the EBUSY condition and the resultant failure in txg_sync (and, in turn, the application failure).
With help from @jgallag88, we've managed to reproduce what we think to be this same issue, but this time we're not relying on our closed source application to trigger it...
First, we have to make the following change to the ZFS kernel module:
--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -733,6 +733,9 @@ dsl_dataset_long_hold(dsl_dataset_t *ds, void *tag)
void
dsl_dataset_long_rele(dsl_dataset_t *ds, void *tag)
{
+#ifdef _KERNEL
+ mdelay(200);
+#endif
(void) refcount_remove(&ds->ds_longholds, tag);
}
With this change in place, the following script will quickly cause a failure:
#!/bin/bash -ex
set -o pipefail
function cleanup
{
    if pgrep trace-bpfcc &>/dev/null; then
        sudo pkill trace-bpfcc
    fi
    if zfs list rpool/zvol-recv &>/dev/null; then
        while ! sudo zfs destroy -r rpool/zvol-recv; do
            sleep 1
        done
    fi
    if zfs list rpool/zvol-send &>/dev/null; then
        sudo zfs destroy -r rpool/zvol-send
    fi
    if [[ -d "$DIR" ]]; then
        rm -rf "$DIR"
    fi
}
trap cleanup EXIT
DIR=$(mktemp -d -p /tmp zfs-recv-test.XXXXXXX)
sudo stdbuf -oL -eL trace-bpfcc -t -K \
'r::dsl_dataset_handoff_check (retval == 16)' \
'p::dsl_dataset_handoff_check "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
'p::dsl_dataset_long_hold "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
'p::dsl_dataset_long_rele "ds = 0x%llx, owner = 0x%llx", arg1, arg2' \
> ~/LX-1302.out &
sleep 5
sudo zfs create -V 1G rpool/zvol-send
sudo zfs snapshot rpool/zvol-send@snap1
sudo zfs snapshot rpool/zvol-send@snap2
sudo zfs send rpool/zvol-send@snap1 \
>"$DIR/zvol-send.snap1"
sudo zfs send -i rpool/zvol-send@snap1 rpool/zvol-send@snap2 \
>"$DIR/zvol-send.snap2"
while true; do
    sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"
    sleep 0.1
    sudo zfs recv -F rpool/zvol-recv@snap2 < "$DIR/zvol-send.snap2"
    while [[ ! -e /dev/zvol/rpool/zvol-recv ]]; do
        sleep 0.1
    done
    sudo udevadm settle
    sudo zfs destroy -r rpool/zvol-recv
done
In case it's not obvious, the actual reproducer is the following lines from that script:
while true; do
    sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"
    sleep 0.1
    sudo zfs recv -F rpool/zvol-recv@snap2 < "$DIR/zvol-send.snap2"
done
The rest is just setup/teardown required to run this loop.
When running this, the second zfs recv will fail with EBUSY; e.g.
+ sudo zfs recv -F rpool/zvol-recv@snap1
+ sleep 0.1
+ sudo zfs recv -F rpool/zvol-recv@snap2
cannot receive incremental stream: dataset is busy
Looking at the trace-bpfcc output, we can see the following:
8.858035 362 362 z_zvol dsl_dataset_long_hold ds = 0xffff95c6f03ae000, owner = 0xffff95c6aa2a5400
dsl_dataset_long_hold+0x1 [kernel]
dsl_dataset_own+0x3b [kernel]
dmu_objset_own+0x92 [kernel]
zvol_first_open+0xc5 [kernel]
zvol_open+0x3fd [kernel]
__blkdev_get+0xda [kernel]
blkdev_get+0x129 [kernel]
device_add_disk+0x3f9 [kernel]
add_disk+0xe [kernel]
zvol_create_minor_impl+0x4ce [kernel]
zvol_create_minors_impl+0x131 [kernel]
zvol_task_cb+0x6d [kernel]
taskq_thread+0x22d [kernel]
kthread+0x121 [kernel]
ret_from_fork+0x35 [kernel]
8.860954 407 407 txg_sync dsl_dataset_handoff_check ds = 0xffff95c6f03ae000, owner = 0x0
dsl_dataset_handoff_check+0x1 [kernel]
dmu_recv_end_check+0x1d0 [kernel]
dsl_sync_task_sync+0x88 [kernel]
dsl_pool_sync+0x3da [kernel]
spa_sync+0x62f [kernel]
txg_sync_thread+0x28c [kernel]
thread_generic_wrapper+0x7d [kernel]
kthread+0x121 [kernel]
ret_from_fork+0x35 [kernel]
8.860994 407 407 txg_sync dsl_dataset_handoff_check
kretprobe_trampoline+0x0 [kernel]
dmu_recv_end_check+0x1d0 [kernel]
dsl_sync_task_sync+0x88 [kernel]
dsl_pool_sync+0x3da [kernel]
spa_sync+0x62f [kernel]
txg_sync_thread+0x28c [kernel]
thread_generic_wrapper+0x7d [kernel]
kthread+0x121 [kernel]
ret_from_fork+0x35 [kernel]
8.983752 362 362 z_zvol dsl_dataset_long_rele ds = 0xffff95c6f03ae000, owner = 0xffff95c6aa2a5400
dsl_dataset_long_rele+0x1 [kernel]
dmu_objset_disown+0x30 [kernel]
zvol_last_close+0x4e [kernel]
zvol_release+0x16a [kernel]
__blkdev_put+0x1dc [kernel]
blkdev_put+0x4c [kernel]
device_add_disk+0x40e [kernel]
add_disk+0xe [kernel]
zvol_create_minor_impl+0x4ce [kernel]
zvol_create_minors_impl+0x131 [kernel]
zvol_task_cb+0x6d [kernel]
taskq_thread+0x22d [kernel]
kthread+0x121 [kernel]
ret_from_fork+0x35 [kernel]
Just like when using our closed source application to trigger this, it looks like the hold taken by the z_zvol thread is causing the dsl_dataset_handoff_check function to fail, resulting in the zfs recv operation failing.
Also worth noting: if I modify the reproducer to sleep for 10 seconds between the zfs recv calls, instead of 0.1 seconds, the failure no longer occurs (at least, I haven't seen it occur yet); i.e.
sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"
sleep 10
sudo zfs recv -F rpool/zvol-recv@snap2 < "$DIR/zvol-send.snap2"
@behlendorf @bprotopopov if either of you have a chance to look at this, I'd appreciate any insight you might have. So far, I think this might be related to the change that introduced the z_zvol thread, here: https://github.com/zfsonlinux/zfs/commit/a0bd735adb1b1eb81fef10b4db102ee051c4d4ff.
I tried to test this theory by making the ZVOL minor creation synchronous with this change:
--- a/module/zfs/dmu_send.c
+++ b/module/zfs/dmu_send.c
@@ -4217,7 +4217,7 @@ dmu_recv_end_sync(void *arg, dmu_tx_t *tx)
drc->drc_newsnapobj =
dsl_dataset_phys(drc->drc_ds)->ds_prev_snap_obj;
}
- zvol_create_minors(dp->dp_spa, drc->drc_tofs, B_TRUE);
+ zvol_create_minors(dp->dp_spa, drc->drc_tofs, B_FALSE);
/*
* Release the hold from dmu_recv_begin. This must be done before
but now I think I've reintroduced the deadlock that the taskq was intending to fix.
@prakashsurya, are the snapshots being automatically exposed as read-only devices?
@bprotopopov what is the default behavior? I don't think they're being exposed read-only, but I'd have to go back and double check to be sure. I'm not doing anything specific w.r.t. that, so it should just be doing whatever is the default behavior.
I'm not 100% sure of this, since I'm still learning how this part of the code works, but I think the minor that's being created and causing the EBUSY is for the rpool/zvol-recv ZVOL, not for either of the snapshots of that dataset.
Specifically, when we do the first zfs recv, I believe it will create the rpool/zvol-recv dataset and start the ZVOL minor creation. Then, when the second zfs recv calls into dsl_dataset_handoff_check (via dsl_dataset_clone_swap_check_impl) at the end of the zfs recv ioctl, it detects the hold from the z_zvol thread (that's still attempting to create the minor from the first zfs recv) and returns EBUSY.
I'm speculating at this point, since I haven't actually tested this, but if my analysis so far is correct, I think this same issue can be triggered if we were to change the loop to something like this:
sudo zfs create -V 1G rpool/zvol-recv
sleep .1
sudo zfs recv -F rpool/zvol-recv@snap1 < "$DIR/zvol-send.snap1"
i.e. we just need to do an operation that'll start the asynchronous ZVOL minor creation, and then do another operation that'll call dsl_dataset_clone_swap_check_impl on that same dataset. The dsl_dataset_clone_swap_check_impl call may race with the ZVOL minor creation (since that creation code path will take a hold on the dataset), sometimes resulting in the EBUSY error.
I also wouldn't be surprised if this could fail:
sudo zfs create -V 1G rpool/zvol-recv
sleep .1
sudo zfs destroy rpool/zvol-recv
where the zfs destroy fails with EBUSY because of the hold from the z_zvol thread.
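A loop along these lines could be used to test that hypothesis (untested sketch; the dataset name and sleep duration are arbitrary):
#!/bin/bash
# Untested: try to catch zfs destroy racing with the asynchronous
# minor creation kicked off by zfs create.
while true; do
    sudo zfs create -V 1G rpool/zvol-recv
    sleep 0.1
    # If the z_zvol thread still has its long hold on the dataset,
    # this destroy may fail with "dataset is busy" (EBUSY).
    sudo zfs destroy rpool/zvol-recv || break
done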
@prakashsurya, there is a zvol property that controls exposing snapshots as devices. It is not set by default, but I think it could be inherited.
It's been a while since I looked at this, but I think your analysis sounds reasonable. The code in question is designed to allocate and deallocate device minor numbers and to create and destroy device nodes for zvols. The latter happens via udev, which works asynchronously in Linux. So for instance, a script that creates a zvol and then writes to it via the device node will fail intermittently, because the device might or might not be created in time for the write.
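For example, something like this hypothetical snippet (the dataset name is made up) can fail intermittently, depending on whether udev has created the device node by the time the write runs:
sudo zfs create -V 1G rpool/example-zvol
# May race ahead of udev and fail with "No such file or directory".
sudo dd if=/dev/zero of=/dev/zvol/rpool/example-zvol bs=1M count=1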
I would expect that retrying the operation that fails with EBUSY should succeed; the retries could be automated with back-off up to a configurable max time (in the application code or zfs code).
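As a rough illustration of that suggestion (a hypothetical helper, not existing ZFS code; a real version might inspect the error output for "dataset is busy" rather than retrying on any failure):
# Usage: retry_with_backoff 10 sudo zfs destroy -r rpool/zvol-recv
retry_with_backoff() {
    local max_tries=$1; shift
    local i
    for ((i = 1; i <= max_tries; i++)); do
        "$@" && return 0
        echo "attempt $i of $max_tries failed; retrying..." >&2
        sleep "$i"    # back off a little longer on each attempt
    done
    return 1
}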
Making minor creation synchronous is risky because of possible deadlocks. At the time I looked at this, I thought that, because of the complexity of dealing with the deadlocks, the asynchronous solution was a good alternative.
The code in question is designed to allocate and deallocate device minor numbers and to create and destroy device nodes for zvols. The latter happens via udev, which works asynchronously in Linux. So for instance, a script that creates a zvol and then writes to it via the device node will fail intermittently, because the device might or might not be created in time for the write.
Yeah, this is consistent with the behavior we see; udev's asynchrony is another source of the same EBUSY issues for us.
I would expect that retrying the operation that fails with EBUSY should succeed; the retries could be automated with back-off up to a configurable max time (in the application code or zfs code).
Right. We've worked around this limitation so far by adding logic that will do the following each time we create a ZVOL:
1. wait for the ZVOL's device link to appear under /dev/zvol, and
2. run udevadm settle
This way, by the time these two operations finish, we know we can safely use the ZVOL without hitting an EBUSY error due to holds from z_zvol or udev.
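Concretely, the workaround looks roughly like the following (same pattern as in the reproducer script above; the dataset name is just an example):
sudo zfs create -V 1G rpool/example-zvol
# 1. wait for the udev-created device link to show up
while [[ ! -e /dev/zvol/rpool/example-zvol ]]; do
    sleep 0.1
done
# 2. wait for udev to finish processing its event queue
sudo udevadm settle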
IMO, the fact that one can't use a ZVOL immediately after it's created (e.g. can't even immediately destroy it) without risking an EBUSY error is broken behavior, even if this has been the existing behavior for a while.
Further, even if we made the creation of the minor node synchronous, I think we still have the problem of udev taking a hold asynchronously, and that hold causing EBUSY errors.
Part of me thinks that the ZFS userspace command/libraries should not return after creating a ZVOL, until the ZVOL can be reliably used/destroyed/etc. E.g. libzfs could create the ZVOL, then wait for its device link to appear, and then call udevadm settle; and only return after all of those steps complete. This way, after calling zfs create, one could immediately call zfs destroy without risking the EBUSY issue.
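To make that idea a bit more concrete, here's a rough userspace sketch of blocking until the udev-managed device link exists, using libudev to wake up on block-device events instead of busy-polling. This is only an illustration (not actual libzfs code); a real implementation would need a timeout and proper error handling:
#include <libudev.h>
#include <poll.h>
#include <unistd.h>

static int
wait_for_zvol_link(const char *link)	/* e.g. "/dev/zvol/rpool/zvol-recv" */
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");

	/* Only wake up for block-device events. */
	udev_monitor_filter_add_match_subsystem_devtype(mon, "block", NULL);
	udev_monitor_enable_receiving(mon);

	while (access(link, F_OK) != 0) {
		struct pollfd pfd = {
			.fd = udev_monitor_get_fd(mon),
			.events = POLLIN,
		};

		/* Wait for the next block uevent (or 1s), then recheck. */
		if (poll(&pfd, 1, 1000) > 0) {
			struct udev_device *dev =
			    udev_monitor_receive_device(mon);
			if (dev != NULL)
				udev_device_unref(dev);
		}
	}

	udev_monitor_unref(mon);
	udev_unref(udev);
	return (0);
}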
Anyways, @bprotopopov thanks for chiming in, I really appreciate it! In the short term, we're working around this problem by retrying/waiting as you suggest. In the long term, it'd be great not to have to do that, but I'll have to think more about this to try and come up with a better solution.
Thanks for digging in to this, sorry I'm a little late to the discussion. Your analysis looks entirely reasonable to me as the root cause. And we have definitely noticed exactly these kinds of issues, even in the ZTS, due to the asynchronous nature of udev.
Part of me thinks that the ZFS userspace command/libraries should not return after creating a ZVOL.
This. We should do this. The good news is that these days virtually all the required code to handle this is already in place. Previously doing this would have been awkward, but now the libzfs library already links against libudev, so it should be straightforward to block waiting on the event, or retry in the EBUSY case. Making this the default behavior would be a nice improvement, though I would like to see an option to allow it to be async for those systems which can take advantage of it.
I concur - this would definitely be a usability improvement.