I'm really not sure where the problem lies. There are no symlinks anywhere in the path here, and the error does not always occur. It also goes away if I "split up" the chdir, as demonstrated below.
Linux dc 3.4.1-vs2.3.3.4 #2 SMP Sat Jun 23 16:39:09 MST 2012 x86_64 Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz GenuineIntel GNU/Linux
zfs/spl 0.6.0_rc9
-[root@dc]-[5.92/10.60/10.65]-66%-0d19h15m-2012-07-09T14:30:03-
-[/backup/1/minecraft501/.zfs/snapshot/20120701-2302/home/craft/bukkit/world:#]- cd /backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world
bash: cd: /backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world: Too many levels of symbolic links
-[root@dc]-[5.81/7.28/9.21]-64%-0d19h23m-2012-07-09T14:37:35-
-[/backup/1/minecraft501/.zfs/snapshot/20120701-2302/home/craft/bukkit/world:#]- cd /backup/1/minecraft501/.zfs/snapshot/
-[root@dc]-[24.16/11.15/10.45]-64%-0d19h23m-2012-07-09T14:37:45-
-[/backup/1/minecraft501/.zfs/snapshot:#]- cd 20120703-1202/home/craft/bukkit/world
-[root@dc]-[22.95/11.12/10.44]-64%-0d19h23m-2012-07-09T14:37:51-
-[/backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world:#]-
Thanks for filing the bug, I've seen this once before and we didn't manage to run it down then.
Not sure how helpful this is, but "me too".
I have a pool which I have been using under CentOS 6.3 x86_64 (2.6.32), and there I have issues with a system hang when running find inside the .zfs subdirectory (with a load of snapshots present). I just thought I'd try the same pool under Ubuntu 12.04 x86_64 (3.2.0-23), and although I see no system hang, instead I get intermittent errors like this:
find: ‘/tank1/data1/.zfs/snapshot/2012.0608.0113.Fri.snapadm.weekly’: Too many levels of symbolic links
Command exited with non-zero status 1
There are no symbolic links though. Even without this error, I don't think the find command is finding everything it should.
ZFS on Linux 0.6.0 rc9.
Andy
Hi, same problem here.
When trying to cd into a snapshot, I get intermittent "Too many levels of symbolic links" errors; if I try again two minutes later, it works.
Using Ubuntu 12.04 64-bit, ZOL rc10.
Hi, in my case I found several symlink files in the folder (but the same folder under ext4 doesn't cause any problem).
To find all symlink files you can use: sudo find /zpool/dataset -type l -exec ls -l {} \;
For me, I get this randomly when accessing files (via Python) over an NFS-mounted zvol: OSError(40, 'Too many levels of symbolic links')
If you're at all able to reproduce this, it would be very helpful to get an strace of the failing command to see if this error is coming back from the kernel, and if so, which system call is responsible.
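Something along these lines would do it (a sketch; the path is hypothetical, so point it at a snapshot that fails for you):
strace -f -tt -o /tmp/eloop.strace sh -c 'cd /tank/fs/.zfs/snapshot/20120703-1202'
# Then look for the failing system call:
grep ELOOP /tmp/eloop.strace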
I'll attempt to get an strace
Running in zsh:
evil:~/backup/vol/.zfs/snapshot ls
20100401/ 20100513/ 20100825/ 20110302/ 20110907/ 20120330/
20100414/ 20100608/ 20101007/ 20110329/ 20111020/ 20120722/
20100417/ 20100609/ 20101122/ 20110501/ 20111021/ 20120801/
20100501/ 20100714/ 20101213/ 20110613/ 20111218/ 20120808/
evil:~/backup/vol/.zfs/snapshot cd 20110329
cd: too many levels of symbolic links: 20110329
zsh: exit 1
evil:~/backup/vol/.zfs/snapshot
This gives the following strace output for the failing step:
7118 09:53:55.903028 stat(".", {st_mode=S_IFDIR|0555, st_size=2, ...}) = 0
7118 09:53:55.903150 chdir("/home/matt/backup/vol/.zfs/snapshot/20110329") = -1 ELOOP (Too many levels of symbolic links)
7118 09:53:55.918222 stat(".", {st_mode=S_IFDIR|0555, st_size=3, ...}) = 0
7118 09:53:55.918346 chdir("20110329") = -1 ELOOP (Too many levels of symbolic links)
Running the "cd" again works OK. This is on Ubuntu 12.04, 64 bit, 3.2.0-27-generic, ZFS v0.6.0.65-rc9.
It seems directly related to the number of files in a folder. If there are 300 files in a folder, there are never any errors. If there are 7K files, then I get the error quite often.
I am seeing the same symptom: I get the same backtrace as mkj, and notice the same pattern as msmitherdc in that it affects filesystems containing lots of files. Likewise, it only happens the first time I try entering a snapshot subdir within a certain time window. The second attempt right afterwards always seems to succeed.
I did notice a difference in the reported ownership of the snapshot subdir. Before a failing attempt it is listed as belonging to root:root, with some generic permissions. Before the second, succeeding attempt, the ownership as well as the permissions actually match the existing ones on the top level of the filesystem in question. Some cached metadata from the first attempt, making all the difference the second time around?
root@halleck:/home/andreas/.zfs/snapshot# ls -ld H21
dr-xr-xr-x 1 root root 0 Sep 9 14:01 H21
root@halleck:/home/andreas/.zfs/snapshot# cd H21/
-bash: cd: H21/: Too many levels of symbolic links
root@halleck:/home/andreas/.zfs/snapshot# ls -ld H21
drwxr-x--x 31 andreas andreas 49 Sep 8 22:29 H21
root@halleck:/home/andreas/.zfs/snapshot# cd H21/
root@halleck:/home/andreas/.zfs/snapshot/H21#
Seeing this running a 64-bit Ubuntu 12.04, on the 3.2.0-30-generic kernel, with zfs 0.6.0.71.
I wonder if this is simply due to the snapshot being slow to mount. The subsequent attempt would work because the snapshot was then successfully mounted. It would continue to work until the snapshot gets automatically unmounted due to inactivity.
The .zfs/snapshot directory is implemented by mounting the required snapshot on demand. Basically, traversal into the snapshot triggers the mount, and the process blocks in the system call until it completes. This makes the process transparent to the user and greatly simplifies the kernel code, since each snapshot can be treated as an individual mount point. However, perhaps some races remain.
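You can watch this happen from the shell (a sketch; pool and snapshot names are hypothetical):
# The snapshot is not mounted yet:
grep snapshot /proc/mounts
# Traversing into it triggers the automount; the syscall blocks until it completes:
ls /tank/fs/.zfs/snapshot/20120703-1202 >/dev/null
# Now it appears as its own mount point:
grep snapshot /proc/mounts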
Incidentally, the permission issue you reported is just how the mount point is permissioned before the snapshot gets mounted on top. So that's to be expected.
The above strace output is valuable, but the ideal bit of debugging to have would be a call trace using ftrace or systemtap. We'd be able to see exactly where that ELOOP was returned in the kernel to chdir().
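With ftrace, something along these lines would work (a sketch, assuming debugfs is mounted; the function name must appear in available_filter_functions and may be inlined on some kernels):
cd /sys/kernel/debug/tracing
echo function_graph > current_tracer
echo follow_managed > set_graph_function
echo 1 > tracing_on
# ...reproduce the failing cd/ls in another terminal...
echo 0 > tracing_on
cat trace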
I agree. I have this same problem and it's absolutely consistent: the first access gives the error (it doesn't require "cd"; "ls", for example, gives the same message). The second and subsequent accesses are fine. In my case, it's not related to the number of files in the directory. Once it works, it works for "a while" (what is the inactivity timeout?) and then after some period the error occurs again.
This is on Ubuntu 12.04, kernel 3.2.0-31, ZFS v0.6.0.80-rc11.
That "awhile" would be 5 minutes. By default that's the timeout to expire idle snapshots which were automounted. If you want to mitigate the issue for now you could crank this use by increasing the zfs_expire_snapshot module option.
$ modinfo module/zfs/zfs.ko | grep expire
parm:           zfs_expire_snapshot:Seconds to expire .zfs/snapshot (int)
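For example (a sketch; 3600 is an arbitrary choice):
# Check the current timeout (seconds):
cat /sys/module/zfs/parameters/zfs_expire_snapshot
# Raise it to an hour on the running module:
echo 3600 > /sys/module/zfs/parameters/zfs_expire_snapshot
# Or make it persistent across module reloads:
echo 'options zfs zfs_expire_snapshot=3600' >> /etc/modprobe.d/zfs.conf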
I'm being affected by this problem too. Is there anything I can do to help debug? Ubuntu 12.10; kernel 3.5.0-18-generic; ZOL 0.6.0-rc12.
I've been digging into this problem. The process loops in follow_managed(), calling follow_automount() each time until it hits the 40-level limit, as shown by the following output from a custom systemtap script.
1355336265 ls(63225) kernel.function("follow_managed@/build/buildd/linux-3.2.0/fs/namei.c:797") zfs-auto-snap_daily-2012-12-08-0747 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
...
The follow_automount probe shows the dentry->d_flags and the path structure.
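Roughly the kind of probe that produces output like the above (a sketch, not the exact script; it needs the kernel debuginfo installed):
stap -e 'probe kernel.function("follow_automount@fs/namei.c") {
    printf("%d %s(%d) %s\n", gettimeofday_s(), execname(), pid(), pp())
}'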
Notice that the dentry and mnt pointers never change. I think that in order to exit the while loop (shown below) the path->dentry pointer needs to point to the dentry for the root of the newly-mounted filesystem after the call to follow_automount(). This is taken care of in follow_automount() for the non-mount-collision case.
I'm thinking that if zfsctl_mount_snapshot() could get a pointer to the struct vfsmount for the newly-mounted snapshot, then it could update the struct path. But I'm not sure how to do that; it looks like lookup_mnt() is what we need, but it's not exported by the kernel.
	/* Given that we're not holding a lock here, we retain the value in a
	 * local variable for each dentry as we look at it so that we don't see
	 * the components of that value change under us */
	while (managed = ACCESS_ONCE(path->dentry->d_flags),
	       managed &= DCACHE_MANAGED_DENTRY,
	       unlikely(managed != 0)) {
		/* Allow the filesystem to manage the transit without i_mutex
		 * being held. */
		if (managed & DCACHE_MANAGE_TRANSIT) {
			BUG_ON(!path->dentry->d_op);
			BUG_ON(!path->dentry->d_op->d_manage);
			ret = path->dentry->d_op->d_manage(path->dentry, false);
			if (ret < 0)
				break;
		}

		/* Transit to a mounted filesystem. */
		if (managed & DCACHE_MOUNTED) {
			struct vfsmount *mounted = lookup_mnt(path);
			if (mounted) {
				dput(path->dentry);
				if (need_mntput)
					mntput(path->mnt);
				path->mnt = mounted;
				path->dentry = dget(mounted->mnt_root);
				need_mntput = true;
				continue;
			}

			/* Something is mounted on this dentry in another
			 * namespace and/or whatever was mounted there in this
			 * namespace got unmounted before we managed to get the
			 * vfsmount_lock */
		}

		/* Handle an automount point */
		if (managed & DCACHE_NEED_AUTOMOUNT) {
			ret = follow_automount(path, flags, &need_mntput);
			if (ret < 0)
				break;
			continue;
		}

		/* We didn't change the current path point */
		break;
	}
It's still not clear to me why this only sometimes fails.
> I'm thinking that if zfsctl_mount_snapshot() could get a pointer to the struct vfsmount for the newly-mounted snapshot, then it could update the struct path. But I'm not sure how to do that; it looks like lookup_mnt() is what we need, but it's not exported by the kernel.
You should be able to do this with follow_down_one.
> It's still not clear to me why this only sometimes fails.
Me neither. If the problem is as I described it seems like it should always fail. Unless the path pointer is shared and I just always "win" the race on my desktop. For me it always fails on my workstation, but I haven't reproduced it in a VM running the same kernel and ZFS versions.
> You should be able to do this with follow_down_one.
Cool, I'll give that a try. Thanks
Adding follow_up(path) to zpl_snapdir_automount() fixes it for me.
diff --git a/module/zfs/zpl_ctldir.c b/module/zfs/zpl_ctldir.c
index 7dfaf6e..09585c4 100644
--- a/module/zfs/zpl_ctldir.c
+++ b/module/zfs/zpl_ctldir.c
@@ -356,6 +356,8 @@ zpl_snapdir_automount(struct path *path)
 	if (error)
 		return ERR_PTR(error);
 
+	follow_up(path);
+
 	/*
 	 * Rather than returning the new vfsmount for the snapshot we must
 	 * return NULL to indicate a mount collision. This is done because
@cronnelly It would be great if anyone else having this issue could test the above patch before I submit a pull request. Thanks
Up... down... I always get those confused. Based on your analysis it does look like this should resolve the issue. It would be great if some of the folks watching this issue could verify that the proposed one-line fix resolves the problem for them as well.
Seems to do the trick.
I have a server with a set of snapshots on which I could trigger the "Too many levels of symbolic links" response pretty much every time. Now, with this patch, I haven't been able to reproduce the bug.
Thanks!
Works great for me. I was hitting this issue 100% of the time. Snapshot mounts are immediate now with no initial error.
Likewise, the problem was completely repeatable and reproducible, and now it works perfectly. The only thing is, when I run "ls -l /xxxx/.zfs/snapshot/*" we end up with a very large number of "mount" commands running for a few minutes. Not that this is a normal operation, mind you, but it's not exactly scalable to huge numbers of snapshots.
Thanks for the fix!!
Thank you everyone, this fix was merged.
Reopening this issue since the fix introduced a regression which wasn't initially caught.
The real root cause for the racy behavior was identified and fixed. Thanks Ned.
761394b call_usermodehelper() should wait for process
For some reason this is an issue for me: cd and other commands fail on snapshots. Funnily enough, it's to do with Minecraft too:
# cp -R /home/.zfs/snapshot/backup-20140619-195402/gaming/minecraft ~/minecraft-backup-snapshot
cp: cannot stat ‘/home/.zfs/snapshot/backup-20140619-195402/gaming/minecraft’: Too many levels of symbolic links
Not sure if this issue has regressed or if it's a new issue.
I was just trying to do an strace and got a kernel panic. See the attached image.
Here is the kernel panic, running zfs 0.6.4.2_r0_g44b5ec8_4.1.4_1-1.
@behlendorf Sorry man just bumping this in case you didn't see it.
That kernel panic is a duplicate. I saw that months ago and reported it.
@drescherjm perhaps you can link to the other issue?
I'm using ZFS on Ubuntu Server 16.04.1 and I had the same issue with the symlink error when accessing the snapshots. I got the error after sending incremental snapshots from another Ubuntu server (running Ubuntu Server 14.04).
After updating the affected server and trying everything I could think of (atime off, compression off, mountpoints, etc.), it still did not work. I did a reboot and suddenly everything worked again, until I transferred new incremental snapshots.
This led me to try unmounting and remounting the filesystem each time after I transferred snapshots, and that seemed to do the trick! Now I just put the remount commands into my script, and I am no longer bothered by this bug.
This is not a fix, it is only a workaround. But in case you cannot get it working, even with the newest versions of everything, try this! :)
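For reference, the workaround amounts to something like this (dataset name is hypothetical; substitute your own):
# After each incremental receive, cycle the mount:
zfs unmount tank/data
zfs mount tank/data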
I thought this was fixed long ago.
Me too. I realize this is an old issue, but thought it could be nice to post my solution here too.
It's the same issue as this: https://github.com/zfsonlinux/zfs/issues/4514
Just some additional info:
The ubuntu server sending snapshots (14.04) has the ubuntu-zfs package installed.
[ 1.570547] ZFS: Loaded module v0.6.5.7-1~trusty, ZFS pool version 5000, ZFS filesystem version 5
The ubuntu server receiving snapshots (16.04.1) has zfs native
[ 17.440504] ZFS: Loaded module v0.6.5.6-0ubuntu10, ZFS pool version 5000, ZFS filesystem version 5
I am experiencing this issue, unmount/mount workaround did it for me.
This issue is getting long in the tooth, but it still exists on a newly installed, fully updated Ubuntu 16.04 with incrementally received snapshots. Normal snapshots work fine. The unmount/mount workaround does work, so it's certainly a cache issue.
I'm sending my snaps using http://www.bolthole.com/solaris/zrep/ if that matters. It's easy to make a test case to reproduce it using this configuration method: http://www.bolthole.com/solaris/zrep/zrep.documentation.html#backupserver
@chrwei are you able to reproduce this with Ubuntu 18.04? It's likely this was resolved in a newer version of ZFS. Can you check exactly which version you're running (cat /sys/module/zfs/version)? If you're still able to reproduce it with 0.7.x or newer, it would be helpful if you could put together a small script which reproduces the issue.
I am on 0.6.5.6-0ubuntu20.
I don't have any 18.04 and don't plan on it for some time.
To add to this mystery, I have also found issues with this error when mounting my ZFS pool via sshfs, with an unmount and remount fixing it as well. It only seems to affect ZFS pools on my system, even with the same data. ZFS is running on a fully updated, latest Proxmox, and the sshfs client is a fully updated Manjaro machine.
EDIT: ZFS Version 0.7.12-1