I'm really not sure where the problem lies. There are no symlinks anywhere in the path here, and the error does not always occur. It also goes away if I "split up" the chdir, as demonstrated below.
Linux dc 3.4.1-vs2.3.3.4 #2 SMP Sat Jun 23 16:39:09 MST 2012 x86_64 Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz GenuineIntel GNU/Linux
zfs/spl 0.6.0_rc9
-[root@dc]-[5.92/10.60/10.65]-66%-0d19h15m-2012-07-09T14:30:03-
-[/backup/1/minecraft501/.zfs/snapshot/20120701-2302/home/craft/bukkit/world:#]- cd /backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world
bash: cd: /backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world: Too many levels of symbolic links
-[root@dc]-[5.81/7.28/9.21]-64%-0d19h23m-2012-07-09T14:37:35-
-[/backup/1/minecraft501/.zfs/snapshot/20120701-2302/home/craft/bukkit/world:#]- cd /backup/1/minecraft501/.zfs/snapshot/
-[root@dc]-[24.16/11.15/10.45]-64%-0d19h23m-2012-07-09T14:37:45-
-[/backup/1/minecraft501/.zfs/snapshot:#]- cd 20120703-1202/home/craft/bukkit/world
-[root@dc]-[22.95/11.12/10.44]-64%-0d19h23m-2012-07-09T14:37:51-
-[/backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world:#]-
Thanks for filing the bug, I've seen this once before and we didn't manage to run it down then.
Not sure how helpful this is, but "me too".
I have a pool which I have been using under CentOS 6.3 x86_64 (2.6.32), and there I have issues with a system hang when running find inside the .zfs subdirectory (with a load of snapshots present). I just thought I'd try the same pool under Ubuntu 12.04 x86_64 (3.2.0-23), and although I see no system hang, instead I get intermittent errors like this:
find: ‘/tank1/data1/.zfs/snapshot/2012.0608.0113.Fri.snapadm.weekly’: Too many levels of symbolic links
Command exited with non-zero status 1
There are no symbolic links though. Even without this error, I don't think the find command is finding everything it should.
ZFS on Linux 0.6.0 rc9.
Andy
Hi, same problem here.
When trying to cd into a snapshot, I get intermittent "Too many levels of symbolic links" errors; if I try again two minutes later, it works.
Using Ubuntu 12.04 64-bit, ZOL rc10.
Hi, in my case I found several symlink files in the folder (but the same folder under ext4 doesn't cause any problem).
To find all symlink files you can use: sudo find /zpool/dataset -type l -exec ls -l {} \;
For me, I get this randomly when accessing files (via Python) over an NFS-mounted zvol: OSError(40, 'Too many levels of symbolic links')
If you're at all able to reproduce this, it would be very helpful to get an strace of the failing command to see if this error is coming back from the kernel, and if so, which system call is responsible.
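Something along these lines would do it (a sketch; the path is hypothetical, so point it at a snapshot that fails for you):
strace -f -tt -o /tmp/eloop.strace sh -c 'cd /tank/fs/.zfs/snapshot/20120703-1202'
# Then look for the failing system call:
grep ELOOP /tmp/eloop.strace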
I'll attempt to get an strace
Running in zsh:
evil:~/backup/vol/.zfs/snapshot ls
20100401/ 20100513/ 20100825/ 20110302/ 20110907/ 20120330/
20100414/ 20100608/ 20101007/ 20110329/ 20111020/ 20120722/
20100417/ 20100609/ 20101122/ 20110501/ 20111021/ 20120801/
20100501/ 20100714/ 20101213/ 20110613/ 20111218/ 20120808/
evil:~/backup/vol/.zfs/snapshot cd 20110329
cd: too many levels of symbolic links: 20110329
zsh: exit 1
evil:~/backup/vol/.zfs/snapshot
This gives the following strace output for the failing step:
7118 09:53:55.903028 stat(".", {st_mode=S_IFDIR|0555, st_size=2, ...}) = 0
7118 09:53:55.903150 chdir("/home/matt/backup/vol/.zfs/snapshot/20110329") = -1 ELOOP (Too many levels of symbolic links)
7118 09:53:55.918222 stat(".", {st_mode=S_IFDIR|0555, st_size=3, ...}) = 0
7118 09:53:55.918346 chdir("20110329") = -1 ELOOP (Too many levels of symbolic links)
Running the "cd" again works OK. This is on Ubuntu 12.04, 64 bit, 3.2.0-27-generic, ZFS v0.6.0.65-rc9.
It seems directly related to the number of files in a folder. If there are 300 files in a folder, there are never any errors. If there are 7K files, then I get the error quite often.
I am seeing the same symptom: I get the same backtrace as mkj, and notice the same pattern as msmitherdc in that it affects filesystems containing lots of files. Likewise, it only happens the first time I try entering a snapshot subdir within a certain time window. The second attempt right afterwards always seems to succeed.
I did notice a difference in the reported ownership of the snapshot subdir. Before a failing attempt it is listed as belonging to root:root, with some generic permissions. Before the second, succeeding attempt, the ownership as well as the permissions actually match the existing ones on the top level of the filesystem in question. Some cached metadata from the first attempt, making all the difference the second time around?
root@halleck:/home/andreas/.zfs/snapshot# ls -ld H21
dr-xr-xr-x 1 root root 0 Sep 9 14:01 H21
root@halleck:/home/andreas/.zfs/snapshot# cd H21/
-bash: cd: H21/: Too many levels of symbolic links
root@halleck:/home/andreas/.zfs/snapshot# ls -ld H21
drwxr-x--x 31 andreas andreas 49 Sep 8 22:29 H21
root@halleck:/home/andreas/.zfs/snapshot# cd H21/
root@halleck:/home/andreas/.zfs/snapshot/H21#
Seeing this running a 64-bit Ubuntu 12.04, on the 3.2.0-30-generic kernel, with zfs 0.6.0.71.
I wonder if this is simply due to the snapshot being slow to mount. The subsequent attempt would work because the snapshot was then successfully mounted. It would continue to work until the snapshot gets automatically unmounted due to inactivity.
The .zfs/snapshot directory is implemented by mounting the required snapshot on demand. Basically, traversal into the snapshot triggers the mount, and the process blocks in the system call until it completes. This makes the process transparent to the user and greatly simplifies the kernel code, since each snapshot can be treated as an individual mount point. However, perhaps some races remain.
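You can watch this happen from the shell (a sketch; pool and snapshot names are hypothetical):
# The snapshot is not mounted yet:
grep snapshot /proc/mounts
# Traversing into it triggers the automount; the syscall blocks until it completes:
ls /tank/fs/.zfs/snapshot/20120703-1202 >/dev/null
# Now it appears as its own mount point:
grep snapshot /proc/mounts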
Incidentally, the permission issue you reported is just how the mount point is permissioned before the snapshot gets mounted on top. So that's to be expected.
The above strace output is valuable, but the ideal bit of debugging to have would be a call trace using ftrace or systemtap. We'd be able to see exactly where that ELOOP was returned in the kernel to chdir().
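With ftrace, something along these lines would work (a sketch, assuming debugfs is mounted; the function name must appear in available_filter_functions and may be inlined on some kernels):
cd /sys/kernel/debug/tracing
echo function_graph > current_tracer
echo follow_managed > set_graph_function
echo 1 > tracing_on
# ...reproduce the failing cd/ls in another terminal...
echo 0 > tracing_on
cat trace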
I agree. I have this same problem and it's absolutely consistent: the first access gives the error (it doesn't require "cd"; "ls", for example, gives the same message). The second and subsequent accesses are fine. In my case, it's not related to the number of files in the directory. Once it works, it works for "a while" (what is the inactivity timeout?) and then after some period the error occurs again.
This is on Ubuntu 12.04, kernel 3.2.0-31, ZFS v0.6.0.80-rc11.
That "awhile" would be 5 minutes. By default that's the timeout to expire idle snapshots which were automounted. If you want to mitigate the issue for now you could crank this use by increasing the zfs_expire_snapshot module option.
$ modinfo module/zfs/zfs.ko | grep expire
parm:           zfs_expire_snapshot:Seconds to expire .zfs/snapshot (int)
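For example (a sketch; 3600 is an arbitrary choice):
# Check the current timeout (seconds):
cat /sys/module/zfs/parameters/zfs_expire_snapshot
# Raise it to an hour on the running module:
echo 3600 > /sys/module/zfs/parameters/zfs_expire_snapshot
# Or make it persistent across module reloads:
echo 'options zfs zfs_expire_snapshot=3600' >> /etc/modprobe.d/zfs.conf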
I'm being affected by this problem too. Is there anything I can do to help debug? Ubuntu 12.10; kernel 3.5.0-18-generic; ZOL 0.6.0-rc12.
I've been digging into this problem. The process loops in follow_managed(), calling follow_automount() each time until it hits the 40-level limit, as shown by the following output from a custom systemtap script.
1355336265 ls(63225) kernel.function("follow_managed@/build/buildd/linux-3.2.0/fs/namei.c:797") zfs-auto-snap_daily-2012-12-08-0747 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
...
The follow_automount probe shows the dentry->d_flags and the path structure.
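Roughly the kind of probe that produces output like the above (a sketch, not the exact script; it needs the kernel debuginfo installed):
stap -e 'probe kernel.function("follow_automount@fs/namei.c") {
    printf("%d %s(%d) %s\n", gettimeofday_s(), execname(), pid(), pp())
}'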
Notice that the dentry and mnt pointers never change. I think that in order to exit the while loop (shown below) the path->dentry pointer needs to point to the dentry for the root of the newly-mounted filesystem after the call to follow_automount(). This is taken care of in follow_automount() for the non-mount-collision case.
I'm thinking that if zfsctl_mount_snapshot() could get a pointer to the struct vfsmount for the newly-mounted snapshot, then it could update the struct path. But I'm not sure how to do that; it looks like lookup_mnt() is what we need, but it's not exported by the kernel.
	/* Given that we're not holding a lock here, we retain the value in a
	 * local variable for each dentry as we look at it so that we don't see
	 * the components of that value change under us */
	while (managed = ACCESS_ONCE(path->dentry->d_flags),
	       managed &= DCACHE_MANAGED_DENTRY,
	       unlikely(managed != 0)) {
		/* Allow the filesystem to manage the transit without i_mutex
		 * being held. */
		if (managed & DCACHE_MANAGE_TRANSIT) {
			BUG_ON(!path->dentry->d_op);
			BUG_ON(!path->dentry->d_op->d_manage);
			ret = path->dentry->d_op->d_manage(path->dentry, false);
			if (ret < 0)
				break;
		}

		/* Transit to a mounted filesystem. */
		if (managed & DCACHE_MOUNTED) {
			struct vfsmount *mounted = lookup_mnt(path);
			if (mounted) {
				dput(path->dentry);
				if (need_mntput)
					mntput(path->mnt);
				path->mnt = mounted;
				path->dentry = dget(mounted->mnt_root);
				need_mntput = true;
				continue;
			}

			/* Something is mounted on this dentry in another
			 * namespace and/or whatever was mounted there in this
			 * namespace got unmounted before we managed to get the
			 * vfsmount_lock */
		}

		/* Handle an automount point */
		if (managed & DCACHE_NEED_AUTOMOUNT) {
			ret = follow_automount(path, flags, &need_mntput);
			if (ret < 0)
				break;
			continue;
		}

		/* We didn't change the current path point */
		break;
	}
It's still not clear to me why this only sometimes fails.
> I'm thinking that if zfsctl_mount_snapshot() could get a pointer to the struct vfsmount for the newly-mounted snapshot, then it could update the struct path. But I'm not sure how to do that; it looks like lookup_mnt() is what we need, but it's not exported by the kernel.
You should be able to do this with follow_down_one.
> It's still not clear to me why this only sometimes fails.
Me neither. If the problem is as I described it seems like it should always fail. Unless the path pointer is shared and I just always "win" the race on my desktop. For me it always fails on my workstation, but I haven't reproduced it in a VM running the same kernel and ZFS versions.
> You should be able to do this with follow_down_one.
Cool, I'll give that a try. Thanks
Adding follow_up(path) to zpl_snapdir_automount() fixes it for me.
diff --git a/module/zfs/zpl_ctldir.c b/module/zfs/zpl_ctldir.c
index 7dfaf6e..09585c4 100644
--- a/module/zfs/zpl_ctldir.c
+++ b/module/zfs/zpl_ctldir.c
@@ -356,6 +356,8 @@ zpl_snapdir_automount(struct path *path)
 	if (error)
 		return ERR_PTR(error);
 
+	follow_up(path);
+
 	/*
 	 * Rather than returning the new vfsmount for the snapshot we must
 	 * return NULL to indicate a mount collision. This is done because
@cronnelly It would be great if anyone else having this issue could test the above patch before I submit a pull request. Thanks
Up... down... I always get those confused. Based on your analysis it does look like this should resolve the issue. It would be great if some of the folks watching this issue could verify that the proposed one-line fix resolves the problem for them as well.
Seems to do the trick.
I have a server with a set of snapshots on which I could trigger the "Too many levels of symbolic links" response pretty much every time. Now, with this patch, I haven't been able to reproduce the bug.
Thanks!
Works great for me. I was hitting this issue 100% of the time. Snapshot mounts are immediate now with no initial error.
Likewise, the problem was completely repeatable and reproducible, and now it works perfectly. The only thing is, when I run "ls -l /xxxx/.zfs/snapshot/*" we end up with a very large number of "mount" commands running for a few minutes. Not that this is a normal operation, mind you, but it's not exactly scalable to huge numbers of snapshots.
Thanks for the fix!!
Thank you everyone, this fix was merged.
Reopening this issue since the fix introduced a regression which wasn't initially caught.
The real root cause for the racy behavior was identified and fixed. Thanks Ned.
761394b call_usermodehelper() should wait for process
For some reason this is an issue for me: cd and other commands fail on snapshots. Funnily enough, it's to do with Minecraft too:
# cp -R /home/.zfs/snapshot/backup-20140619-195402/gaming/minecraft ~/minecraft-backup-snapshot
cp: cannot stat ‘/home/.zfs/snapshot/backup-20140619-195402/gaming/minecraft’: Too many levels of symbolic links
Not sure if this issue has regressed or if it's a new issue.
I was just trying to do an strace and got a kernel panic. See the attached image.
Here is the kernel panic, running zfs 0.6.4.2_r0_g44b5ec8_4.1.4_1-1.
@behlendorf Sorry man just bumping this in case you didn't see it.
That kernel panic is a duplicate. I saw that months ago and reported it.
@drescherjm perhaps you can link to the other issue?
I'm using ZFS on Ubuntu Server 16.04.1 and I had the same issue with the symlink error when accessing the snapshots. I got the error after sending incremental snapshots from another Ubuntu server (running Ubuntu Server 14.04).
After updating the affected server and trying everything I could think of (atime off, compression off, mountpoints, etc.), it still did not work. I did a reboot and suddenly everything worked again, until I transferred new incremental snapshots.
This led me to try unmounting and remounting the filesystem each time after I transferred snapshots, and that seemed to do the trick! Now I just put the remount commands into my script, and I am no longer bothered by this bug.
This is not a fix, it is only a workaround. But in case you cannot get it working, even with the newest versions of everything, try this! :)
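For reference, the workaround amounts to something like this (dataset name is hypothetical; substitute your own):
# After each incremental receive, cycle the mount:
zfs unmount tank/data
zfs mount tank/data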
I thought this was fixed long ago.
Me too. I realize this is an old issue, but thought it could be nice to post my solution here too.
It's the same issue as this: https://github.com/zfsonlinux/zfs/issues/4514
Just some additional info:
The ubuntu server sending snapshots (14.04) has the ubuntu-zfs package installed.
[ 1.570547] ZFS: Loaded module v0.6.5.7-1~trusty, ZFS pool version 5000, ZFS filesystem version 5
The ubuntu server receiving snapshots (16.04.1) has zfs native
[ 17.440504] ZFS: Loaded module v0.6.5.6-0ubuntu10, ZFS pool version 5000, ZFS filesystem version 5
I am experiencing this issue, unmount/mount workaround did it for me.
This issue is getting long in the tooth, but it still exists on a newly installed, fully updated Ubuntu 16.04 with incrementally received snapshots. Normal snapshots work fine. The unmount/mount workaround does work, so it's certainly a cache issue.
I'm sending my snaps using http://www.bolthole.com/solaris/zrep/ if that matters. It's easy to make a test case to reproduce it using this configuration method: http://www.bolthole.com/solaris/zrep/zrep.documentation.html#backupserver
@chrwei are you able to reproduce this with Ubuntu 18.04? It's likely this was resolved in a newer version of ZFS. Can you check exactly which version you're running (cat /sys/module/zfs/version)? If you're still able to reproduce it with 0.7.x or newer, it would be helpful if you could put together a small script which reproduces the issue.
I am on 0.6.5.6-0ubuntu20.
I don't have any 18.04 and don't plan on it for some time.
To add to this mystery, I have also found issues with this error when mounting my ZFS pool via sshfs, with an unmount and remount fixing it as well. It only seems to affect ZFS pools on my system, even with the same data. ZFS is running on a fully updated, latest Proxmox, and the sshfs client is a fully updated Manjaro machine.
EDIT: ZFS Version 0.7.12-1