Singularity: Failed to pull the image from docker registry if the cache dir is on NFS

Created on 18 May 2019 · 11 Comments · Source: hpcng/singularity

Version of Singularity:

$ singularity version
3.2.0

Singularity was installed by https://github.com/abims-sbr/ansible-singularity.

Expected behavior

I want to pull an image from DockerHub or my private registry and create *.sif at my home directory. Home directory is exported by NFSv4 (QNAP NAS). It is mounted on my compute node (singularity installed) by AutoFS.

$ singularity pull ubuntu.sif docker://ubuntu:18.04
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:6abc03819f3e00a67ed5adc1132cfec041d5f7ec3c29d5416ba0433877547b6f
 27.52 MiB / 27.52 MiB [====================================================] 3s
Copying blob sha256:05731e63f21105725a5c062a725b33a54ad8c697f9c810870c6aa3e3cd9fb6a2
 844 B / 844 B [============================================================] 0s
Copying blob sha256:0bd67c50d6beeb55108476f72bea3b4b29a9f48832d6e045ec66b7ac4bf712a0
 164 B / 164 B [============================================================] 0s
Copying config sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031
 2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
INFO:    Creating SIF file...
INFO:    Build complete: ubuntu.sif

The pull should succeed.

Actual behavior

It always fails on the last layer of the specified image. I tried pulling ubuntu:18.04, alpine:latest, nvidia/cuda:9.2-cudnn7-devel-ubuntu18.04, and others, but all of them failed.

$ singularity --debug pull ubuntu.sif docker://ubuntu:18.04
DEBUG   [U=15119,P=1752]   NewBundle()                   Created temporary directory for bundle /tmp/sbuild-339142860
INFO    [U=15119,P=1752]   Full()                        Starting build...
DEBUG   [U=15119,P=1752]   Get()                         Reference: ubuntu:18.04
DEBUG   [U=15119,P=1752]   updateCacheSubdir()           Caching directory set to /home/share/myname/.singularity/cache/oci
Getting image source signatures
Copying blob sha256:6abc03819f3e00a67ed5adc1132cfec041d5f7ec3c29d5416ba0433877547b6f
 27.52 MiB / 27.52 MiB [====================================================] 3s
Copying blob sha256:05731e63f21105725a5c062a725b33a54ad8c697f9c810870c6aa3e3cd9fb6a2
 844 B / 844 B [============================================================] 0s
Copying blob sha256:0bd67c50d6beeb55108476f72bea3b4b29a9f48832d6e045ec66b7ac4bf712a0
 164 B / 164 B [============================================================] 0s
Copying config sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031
 2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
DEBUG   [U=15119,P=1752]   cleanUp()                     Build bundle(s) cleaned: [/tmp/sbuild-339142860]
FATAL   [U=15119,P=1752]   PullOciImage()                Unable to pull docker://ubuntu:18.04: conveyor failed to get: Error reading config blob sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031: open /home/myname/.singularity/cache/oci/blobs/sha256/68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031: permission denied

Interestingly, the error message says "permission denied", even though I definitely have read and write permission on that file.

$ ls -l /home/share/myname/.singularity/cache/oci/blobs/sha256/ | grep 68eb5e932
-rw-r--r-- 1 myname    2420 May 18 14:01 68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031

Next, I suspected NFS (or possibly AutoFS) and tried moving Singularity's cache directory to a local path.

$ SINGULARITY_CACHEDIR=/tmp/singularity singularity --debug pull ubuntu.sif docker://ubuntu:18.04
DEBUG   [U=15119,P=1826]   NewBundle()                   Created temporary directory for bundle /tmp/sbuild-446641628
INFO    [U=15119,P=1826]   Full()                        Starting build...
DEBUG   [U=15119,P=1826]   Get()                         Reference: ubuntu:18.04
DEBUG   [U=15119,P=1826]   updateCacheSubdir()           Caching directory set to /tmp/singularity/oci
Getting image source signatures
Copying blob sha256:6abc03819f3e00a67ed5adc1132cfec041d5f7ec3c29d5416ba0433877547b6f
 27.52 MiB / 27.52 MiB [====================================================] 0s
Copying blob sha256:05731e63f21105725a5c062a725b33a54ad8c697f9c810870c6aa3e3cd9fb6a2
 844 B / 844 B [============================================================] 0s
Copying blob sha256:0bd67c50d6beeb55108476f72bea3b4b29a9f48832d6e045ec66b7ac4bf712a0
 164 B / 164 B [============================================================] 0s
Copying config sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031
 2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination

Storing signatures
DEBUG   [U=15119,P=1826]   Full()                        Inserting Metadata
DEBUG   [U=15119,P=1826]   Full()                        Calling assembler
INFO    [U=15119,P=1826]   Assemble()                    Creating SIF file...
INFO    [U=15119,P=1826]   Full()                        Build complete: ubuntu.sif
DEBUG   [U=15119,P=1826]   cleanUp()                     Build bundle(s) cleaned: [/tmp/sbuild-446641628]

This time the image was pulled successfully.
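
The workaround above can be applied to a whole shell session rather than to a single command. The following is only a minimal sketch: it assumes /tmp is a local (non-NFS) filesystem on the compute node, and the per-user directory name `singularity-<user>` is an illustrative choice, not something Singularity requires.

```shell
#!/bin/sh
# Sketch: keep the Singularity cache on node-local storage instead of the NFS home.
# Assumption: /tmp is local to the compute node.
# The "singularity-<user>" directory name is illustrative only.

cache_dir="/tmp/singularity-$(id -un)"

# Create the directory and restrict it to the current user.
mkdir -p "$cache_dir"
chmod 700 "$cache_dir"

# SINGULARITY_CACHEDIR is the variable used in the successful pull above.
export SINGULARITY_CACHEDIR="$cache_dir"
echo "Cache directory: $SINGULARITY_CACHEDIR"
```

After running these lines in the current shell (or placing them in a shell profile), a plain `singularity pull ubuntu.sif docker://ubuntu:18.04` should use the local cache. The resulting ubuntu.sif is still written to the current directory, so only the cache moves off NFS.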

The shared home directory is mounted with the following parameters. A symbolic link is created from /am/home to /home/share, and my home directory is /home/share/myname.

$ nfsstat -m
/am/home from example.com:/homes
 Flags: rw,sync,relatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.xx.yy,local_lock=none,addr=192.168.xx.zz

Question

  1. Is this a bug in Singularity?
  2. Are there any workarounds?
  3. Can I share the cache with other users if I set the cache directory to /tmp/singularity?
Question Release 3.2.0

All 11 comments

I'm running into almost the same error message.
However, I can only reproduce it when running two pulls in parallel: I open two SSH sessions and run the pull in both at the same time.

In my case it doesn't matter whether the cache directory is local or on an NFS mount.

Writing manifest to image destination
Storing signatures
DEBUG [U=2094,P=121126] cleanUp() Build bundle cleanup: /tmp/sbuild-857664395
FATAL [U=2094,P=121126] PullOciImage() Unable to pull docker://ubuntu:18.04: conveyor failed to get: no descriptor found for reference "f08638ec7ddc90065187e7eabdfac3c96e5ff0f6b2f1762cf31a4f49b53000a5"

singularity version 3.1.1-1.el7
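
The parallel reproduction described in this comment used two SSH sessions. A single-session approximation might look like the sketch below; `run_twice` is a hypothetical helper written for illustration, not part of Singularity, and the `RUN_INDEX` variable is likewise an assumption introduced here so that each run can write to its own output file.

```shell
#!/bin/sh
# Sketch: run the same command twice at the same time and count failures.
# run_twice and RUN_INDEX are hypothetical names used for illustration.
# Usage: run_twice command [args...]
run_twice() (
    # Each copy sees its own RUN_INDEX (1 or 2) in its environment.
    RUN_INDEX=1 "$@" &
    pid1=$!
    RUN_INDEX=2 "$@" &
    pid2=$!

    failed=0
    wait "$pid1" || failed=$((failed + 1))
    wait "$pid2" || failed=$((failed + 1))

    echo "$failed of 2 runs failed"
    [ "$failed" -eq 0 ]
)
```

For example, `run_twice sh -c 'singularity pull "ubuntu-$RUN_INDEX.sif" docker://ubuntu:18.04'` starts two pulls that share one cache directory but write separate .sif files. Whether this triggers the same "no descriptor found" error as two separate SSH sessions is not confirmed in this thread.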

@jscook2345 Should I open a different issue for the parallel execution?

@pddg ...
Okay, you have NFS v4.2 ... It's been a while for me, but I know NFS v4 used to store the files as user nobody until fully flushed, then the idmapping would kick in and either set the UID/GID correctly ... or keep it at nobody.

I don't know how to test this... but it's _possible_ we're trying to access the file on NFS before it has been fully written out to NFS and all the bits are set. I say it's possible, as we just attempt this sequentially here... and you mentioned it's _always_ the last layer file of an image that this errors on.

@ikaneshiro ... Would it be possible to add a sleep in here somewhere after the download to wait a few seconds, before starting the next step?
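
The "wait before the next step" idea can be illustrated outside Singularity with a small retry loop. This is only a sketch of the hypothesis raised in this comment (that the blob is read before NFS has finished settling it), not the actual Singularity code path or the fix that was eventually shipped; `read_with_retry` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: retry reading a file a few times instead of failing on the first error.
# read_with_retry is a hypothetical helper, not Singularity code.
# Usage: read_with_retry FILE [ATTEMPTS] [DELAY_SECONDS]
read_with_retry() (
    file=$1
    attempts=${2:-5}
    delay=${3:-1}

    i=1
    while [ "$i" -le "$attempts" ]; do
        # cat fails if the file is missing or cannot be opened yet
        # (for example with "permission denied").
        if cat "$file" 2>/dev/null; then
            return 0
        fi
        # Pause between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done

    echo "giving up on $file after $attempts attempts" >&2
    return 1
)
```

Calling `read_with_retry /path/to/blob 5 2 >/dev/null` right after a failed pull would show whether the blob becomes readable a few seconds later, which is one way to check the hypothesis by hand.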

@pddg Could you check whether it still fails after setting SINGULARITY_DISABLE_CACHE=yes?

In my case this "fixes" or hides the error.

I also tried using SINGULARITY_CACHEDIR=/tmp/singularity. The directory was cleaned up before running again.
In my case this does not fix the issue, so the error message is still:

FATAL:   Unable to pull docker://NNNNNNNNNN: conveyor failed to get: no descriptor found for reference "50c1dc36867d3caf13f3c07456b40c57b3e6a4dcda20d05feac2c15e357353d4"

test1.strace.365457.txt

strace output of the process that ended with FATAL.

test2.strace.365430.parallel.running.success.txt

This is the strace output of a process that ran in parallel with it but finished successfully.

@pddg I hope I don't hijack your issue.
@jscook2345 @jmstover What do you think: should this be one issue, or should I create a dedicated one? IMHO this is a bug rather than a question.

@tbugfinder In my case, SINGULARITY_DISABLE_CACHE=yes has no effect.

$ SINGULARITY_DISABLE_CACHE=yes singularity --debug pull ubuntu.sif docker://ubuntu:18.04
DEBUG   [U=15119,P=1122]   NewBundle()                   Created temporary directory for bundle /tmp/sbuild-412835060
INFO    [U=15119,P=1122]   Full()                        Starting build...
DEBUG   [U=15119,P=1122]   Get()                         Reference: ubuntu:18.04
DEBUG   [U=15119,P=1122]   initCacheDir()                Creating cache directory: /home/share/myname/.singularity/cache/oci
DEBUG   [U=15119,P=1122]   updateCacheSubdir()           Caching directory set to /home/share/myname/.singularity/cache/oci
Getting image source signatures
Copying blob sha256:6abc03819f3e00a67ed5adc1132cfec041d5f7ec3c29d5416ba0433877547b6f
 27.52 MiB / 27.52 MiB [====================================================] 2s
Copying blob sha256:05731e63f21105725a5c062a725b33a54ad8c697f9c810870c6aa3e3cd9fb6a2
 844 B / 844 B [============================================================] 0s
Copying blob sha256:0bd67c50d6beeb55108476f72bea3b4b29a9f48832d6e045ec66b7ac4bf712a0
 164 B / 164 B [============================================================] 0s
Copying config sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031
 2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
DEBUG   [U=15119,P=1122]   cleanUp()                     Build bundle(s) cleaned: [/tmp/sbuild-412835060]
FATAL   [U=15119,P=1122]   PullOciImage()                Unable to pull docker://ubuntu:18.04: conveyor failed to get: Error reading config blob sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031: open /home/share/myname/.singularity/cache/oci/blobs/sha256/68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031: permission denied

Why do the debug messages say "Creating cache directory ..." and "Caching directory set to ..."? Is SINGULARITY_DISABLE_CACHE a valid environment variable? I cannot find it in the Singularity 3.2 documentation (https://www.sylabs.io/guides/3.2/user-guide/build_env.html).

I've run through it again using SINGULARITY_DISABLE_CACHE=yes, and indeed the variable is not in the code and it is not a solution.
Sorry for the confusion.

Sounds like #3634 is a match, too.

In my environment, this issue was fixed after upgrading to Singularity v3.3.0. However, if the implementation of Singularity has not changed, it is unclear why it has been fixed.

3.3.0 did include caching changes fixing some issues with network cache locations. We still don't guarantee no race conditions / shared cache usage, so #3634 is still open.
