dvc init got stuck in the MatrixDS environment.
One guess is that it is related to NFS.
That's all I know for today :(
@FavioVazquez might have more details.
Thanks @dmpetrov. Tagging @isaacfab here too
That's a good guess about issues related to the NFS. We have had to tweak the user privileges for many of the containers.
Hi @isaacfab ! For how long did you leave it running?
It ran for a long time, but nothing happened. I couldn't even use Ctrl+C @efiop
@FavioVazquez Right, if you were not able to CTRL+C, then the program was still in kernel space, meaning that it is probably an NFS issue. Is it still running? If not, mind running it again and then:
1) find its PID with something like ps xaf | grep dvc;
2) show us cat /proc/$PID/stack;
Thanks,
Ruslan
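The two steps above can also be sketched in Python for anyone who prefers a script. This is a minimal sketch, assuming a Linux host where /proc is mounted; the helper name read_kernel_stack is hypothetical and not part of DVC. Reading /proc/<pid>/stack usually needs elevated privileges, so the sketch reports the error instead of raising.

```python
import os


def read_kernel_stack(pid):
    """Return the contents of /proc/<pid>/stack, or a short explanation if unreadable."""
    path = f"/proc/{pid}/stack"
    try:
        with open(path) as f:
            return f.read()
    except PermissionError:
        # Typical for unprivileged users and for default Docker containers.
        return f"{path}: permission denied (elevated privileges are needed)"
    except OSError as exc:
        # Covers a missing PID or a system without /proc.
        return f"{path}: {exc}"


# Example: inspect the current process; replace with the PID of the stuck dvc process.
print(read_kernel_stack(os.getpid()))
```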
@efiop :
matrix@tool-5ca6d6e08f48fd83c6cc6069-mcn6f:/home/test$ cat /proc/248/stack
cat: /proc/248/stack: Operation not permitted
trace is disabled in docker by default.
@isaacfab is there a way to run one of your docker images with NFS with SYS_PTRACE enabled? We need to see the stack trace at least to understand what's happening. Otherwise it's extremely hard to debug this.
I've also tried running NFS on my machine (server and client), and dvc worked fine.
@shcheklein Is that docker image running on a Linux host?
@efiop most likely, yes. @isaacfab do you have more details?
The container is running on GCP GKE so yes. All of our containers (with docker files) can be found here. https://github.com/matrixds/tools
You can use custom containers on the site (see the repo for instructions). Feel free to enable whatever variables you think would be useful. Have you checked the logs on the environments from the webui?
@isaacfab I've checked them, don't see anything interesting.
Re custom tool - I'm not sure I understand which field I should be using to pass docker run arguments. Is there any documentation on how those options translate to the CLI and how I could adjust them for docker run?
@isaacfab to be more specific, we need something like docker run --cap-add SYS_PTRACE ...
in the UI for the custom tool, you can add a specific docker run command. here is an example.
You could also build the container from scratch and change the run command in the entrypoint script. If you put a test container on docker hub you can build it from the custom UI (see the example above; I've just referenced our current jupyterlab image).
We had a similar issue with our shiny tool. The system runs the applications as a different user (shiny), so we had to specifically add privileges to that user or NFS would block any updates. Maybe something similar is going on here? You can see in the shiny docker file how we handled it (lines ~30-40).
https://github.com/matrixds/tools/blob/master/shiny/Dockerfile
@isaacfab as far as I understand, the entrypoint script does not change the docker run arguments; it changes the script the container runs first when you start it. We need to run the container in a slightly different mode, which most likely means modifying the GKE settings. I'm not sure if we can do it for a single container though.
Okay, I see. Maybe we should jump on a call for this.
@isaacfab hi! We were able to pinpoint the source of this problem: file locks over NFS. We use file locking in two places: one prevents the user from running two commands simultaneously, and the second is used indirectly through SQLite. For example, if you try to run the following snippet, you can reproduce the problem very easily:
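import sqlite3

# Run this from a directory on the affected NFS mount.
# Creating a table forces SQLite to take file locks on 'db'.
db = sqlite3.connect('db')
cursor = db.cursor()
cmd = "CREATE TABLE IF NOT EXISTS 'state' (count INTEGER)"
cursor.execute(cmd)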
It looks like a pretty general problem and I would expect some other tools that use SQLite to hang.
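Since the snippet above hangs rather than fails, it can be safer to run it in a child process with a timeout. The following is a rough sketch of that idea, not something DVC ships: the function name sqlite_locking_works, the probe file name lock_probe.db and the 10 second timeout are all assumptions made for illustration.

```python
import os
import subprocess
import sys

# The same kind of statement as the snippet above, run in a separate interpreter.
PROBE = """
import sqlite3, sys
db = sqlite3.connect(sys.argv[1])
db.execute("CREATE TABLE IF NOT EXISTS state (count INTEGER)")
db.commit()
db.close()
"""


def sqlite_locking_works(directory, timeout=10.0):
    """Return True if the SQLite probe finishes in `directory` within `timeout` seconds."""
    db_path = os.path.join(directory, "lock_probe.db")  # hypothetical probe file name
    proc = subprocess.Popen([sys.executable, "-c", PROBE, db_path])
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck in kernel space may ignore the kill,
        # so we deliberately do not wait for it again.
        proc.kill()
        return False
```

Calling sqlite_locking_works on the NFS-backed project directory and on a local directory such as /tmp should show whether the hang is specific to the mount.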
The first thing to try that comes to my mind (before we consider possible workarounds on the DVC side) - check and adjust the NFS settings or even NFS version. It looks like you are using NFS 3 which is pretty old. Was there a specific reason for that? Could you provide more information - server/client versions, settings, commands you run to mount it, etc? That would help us to brainstorm this.
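To collect the mount details asked for above, one option is to read them from /proc/mounts on the client. This is a minimal sketch assuming a Linux client; the helper name nfs_mounts and the sample text (server address, export path, options) are made up for illustration and are not taken from the MatrixDS setup.

```python
def nfs_mounts(mounts_text):
    """Return (mount_point, fs_type, options) for every NFS entry in /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in ("nfs", "nfs4"):
            found.append((mount_point, fs_type, options.split(",")))
    return found


# Hypothetical sample in the /proc/mounts format, for illustration only.
SAMPLE = """\
overlay / overlay rw,relatime 0 0
10.0.0.2:/export /home nfs rw,vers=3,hard,proto=tcp 0 0
"""

for mount_point, fs_type, options in nfs_mounts(SAMPLE):
    print(mount_point, fs_type, options)
```

On a real client, pass the contents of /proc/mounts instead of SAMPLE; the vers= option shows the protocol version in use.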
@shcheklein the other tools that use SQLite have it installed in another volume (or on the container itself), so we don't have issues like this. The problem here is that the matrixds project directory is special. It is a public collection of files that any user on the project can use by creating a new tool. This directory is intended to be shared and to contain only project artifacts (data, code and dependencies). Because of this, we have some specific (permissive) rules set on that volume to ensure there are no problems with sharing. If we allow some files to be locked here, it causes some issues. This might need some deeper thinking on our end.
I'm not sure about the NFS 3 version...
One (hacky) workaround is to put the dvc project somewhere other than the home directory for now (like /home/dvc_project or something like that).
I'm not sure I completely understand the setup you have. Does it mean that I can share the same directory (tool-<id>) on NFS between two different docker containers (two different tools)? I thought that this directory was unique to the container (thus it has a unique id in its path). Could you give me an example of how I can share it?
Is this directory intended to run git in it, btw?
That said, I also have another question. You mentioned that "If we allow some files to be locked here it causes some issues". First, what kind of issues do you mean? My thinking was that locking files is the mechanism specifically made to prevent data (data in general, not in the data science sense) corruption in a shared environment. For example, git also uses locking (through symlinks, probably because they had a lot of trouble with NFS :)) when you run a command, specifically to prevent multiple users from executing something simultaneously.
We can definitely force users to put the DVC project outside NFS. It's hard for me to "feel" how hacky it is for your environment.
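For context on the locking question above, here is a sketch of one way to keep two commands from running at once without relying on fcntl or flock style locks: atomically creating a lock file. This is an illustration only, not how DVC or git implement their locks; the function names try_acquire and release are hypothetical, and a crashed process would leave a stale lock file that needs manual cleanup.

```python
import os


def try_acquire(lock_path):
    """Atomically create lock_path; return True if this process now holds the lock."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner to help with debugging
    return True


def release(lock_path):
    """Give up the lock by removing the lock file."""
    os.remove(lock_path)
```

A command would call try_acquire at startup, exit with a message if it returns False, and call release when it finishes.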
@isaacfab hey, any updates on your end? ) could you clarify those things I asked above ^^
@shcheklein got busy :). The workaround referenced above worked so we are good right now (we have a number of users doing this). We are working on a more permanent NFS solution that will take a bit more time to implement. I'll ping you once we have it done.
@isaacfab glad to hear that, Isaac! I'm going to close the issue for now since we have a workaround and looks like no action items from the DVC end for now. Please, feel free to reopen it when you get to the NFS redesign and if you need our help. Thanks.