Sorry for not following the template.
This _potential_ issue was found through a code review.
The code in question:
```c
if (scan_ds_queue_contains(scn, ds1->ds_object, &mintxg)) {
scan_ds_queue_remove(scn, ds1->ds_object);
scan_ds_queue_insert(scn, ds2->ds_object, mintxg);
}
if (scan_ds_queue_contains(scn, ds2->ds_object, &mintxg)) {
scan_ds_queue_remove(scn, ds2->ds_object);
scan_ds_queue_insert(scn, ds1->ds_object, mintxg);
}
```
I see two potential problems.
First, if the first condition is true, then the second condition will always be true as well, because the first block replaces ds1->ds_object with ds2->ds_object in the queue. The second block then performs the reverse replacement, undoing the first. It seems unlikely that this is the intended behavior.
Second, I think it is possible that both ds1->ds_object and ds2->ds_object are already on the queue. In that case, the code will attempt a duplicate insertion.
Maybe the code should first check whether each object ID is on the queue and remember the results.
Then do the necessary removals (if either ID is present, then both should be removed?).
And then do the necessary insertions (insert the other ID for each originally present ID).
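To illustrate that ordering, here is a minimal, self-contained sketch. The array-based queue, and the `queue_contains`/`queue_remove`/`queue_insert`/`clone_swapped` names, are stand-ins invented for this example; the real `scan_ds_queue_*` functions operate on the `scn->scn_queue` AVL tree. The `assert` in `queue_insert` models the `VERIFY` that fires in the panic below.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for the scan queue: an array of (object, mintxg)
 * pairs. The real code uses an AVL tree in scn->scn_queue. */
#define QUEUE_MAX 16
static struct { uint64_t obj; uint64_t mintxg; int used; } queue[QUEUE_MAX];

static int
queue_contains(uint64_t obj, uint64_t *mintxg)
{
	for (int i = 0; i < QUEUE_MAX; i++) {
		if (queue[i].used && queue[i].obj == obj) {
			*mintxg = queue[i].mintxg;
			return (1);
		}
	}
	return (0);
}

static void
queue_remove(uint64_t obj)
{
	for (int i = 0; i < QUEUE_MAX; i++)
		if (queue[i].used && queue[i].obj == obj)
			queue[i].used = 0;
}

static void
queue_insert(uint64_t obj, uint64_t mintxg)
{
	uint64_t tmp;
	/* Models the VERIFY that panics on a duplicate insertion. */
	assert(!queue_contains(obj, &tmp));
	for (int i = 0; i < QUEUE_MAX; i++) {
		if (!queue[i].used) {
			queue[i].obj = obj;
			queue[i].mintxg = mintxg;
			queue[i].used = 1;
			return;
		}
	}
}

/*
 * Corrected swap: record both memberships first, then do all removals,
 * then all insertions, so neither branch sees the other's insertion.
 */
static void
clone_swapped(uint64_t ds1_obj, uint64_t ds2_obj)
{
	uint64_t mintxg1 = 0, mintxg2 = 0;
	int ds1_queued = queue_contains(ds1_obj, &mintxg1);
	int ds2_queued = queue_contains(ds2_obj, &mintxg2);

	if (ds1_queued)
		queue_remove(ds1_obj);
	if (ds2_queued)
		queue_remove(ds2_obj);

	/* Insert the other ID for each originally present ID. */
	if (ds1_queued)
		queue_insert(ds2_obj, mintxg1);
	if (ds2_queued)
		queue_insert(ds1_obj, mintxg2);
}
```

With both objects already queued (the crash scenario), this ordering simply swaps the two entries instead of tripping the duplicate-insert check.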
CC: @tcaputi @ahrens
I believe that FreeBSD bug 239566 is the same issue. That crash happens when both ds1 and ds2 are already on the queue when dsl_scan_ds_clone_swapped() is called.
I believe you are correct. We should probably just have separate if / else statements for each case.
Hi all - just a follow up. This is not a theoretical issue. It happens without the fix. So thanks for taking care of it!
```
VERIFY3(avl_find(&scn->scn_queue, sds, &where) == ((void *)0)) failed (00000000185a2639 == (null))
2020-05-26T11:59:00.867230+00:00 kern.emerg usw2-bfyii-307 kernel: [6444694.715839] PANIC at dsl_scan.c:1101:scan_ds_queue_insert()
1 7ddba5456e65ef9403f3e1c6d4ae2eb9 /proc/23127/stack
[<0>] spl_panic+0xfa/0x110 [spl]
[<0>] scan_ds_queue_insert+0x8e/0xc0 [zfs]
[<0>] dsl_scan_ds_clone_swapped+0x2dc/0x3c0 [zfs]
[<0>] dsl_dataset_clone_swap_sync_impl+0x9a2/0xa50 [zfs]
[<0>] dmu_recv_end_sync+0xd7/0x480 [zfs]
[<0>] dsl_sync_task_sync+0x11c/0x120 [zfs]
[<0>] dsl_pool_sync+0x295/0x340 [zfs]
[<0>] spa_sync+0x3d6/0x8c0 [zfs]
[<0>] txg_sync_thread+0x29b/0x340 [zfs]
[<0>] thread_generic_wrapper+0x74/0x90 [spl]
[<0>] kthread+0x105/0x140
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
```