As it turns out, regular locks can't be relied on when dvc is running on NFS or CIFS. See https://github.com/iterative/dvc/issues/1823 and https://discordapp.com/channels/485586884165107732/485596304961962003/570270243964846081 .
There are workarounds, such as setting the proper mnt options or moving the dvc project off the share while keeping an external cache directory on the NFS/CIFS mount, but it would be great to mitigate such issues in general. Here are a few ways to go about it:
One way to go about it is to use git-like locks (which, if I recall correctly, use symlinks as an atomic way to lock/unlock a file). In that case, we would have to use an unlocked sqlite db and rely on our .dvc/lock (or maybe introduce a separate special lock specifically for the db).
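For illustration only, here is a minimal sketch of what such a lock could look like, built on hardlinks (hardlink creation is atomic even on NFS, which is what libraries like flufl.lock mentioned later in this thread rely on). The helper names and the timeout handling are made up, not dvc's API; a real implementation would also verify link counts, handle stale locks, etc.:
import os
import time


def acquire_lock(lock_path, timeout=10.0):
    """Hypothetical hardlink-based lock: os.link() either succeeds atomically
    or fails, even on network filesystems."""
    claim = "{}.{}".format(lock_path, os.getpid())  # per-process claim file
    with open(claim, "w") as fobj:
        fobj.write(str(os.getpid()))

    deadline = time.time() + timeout
    try:
        while True:
            try:
                os.link(claim, lock_path)  # only one process wins this race
                return
            except FileExistsError:
                if time.time() > deadline:
                    raise TimeoutError("could not acquire {}".format(lock_path))
                time.sleep(0.1)
    finally:
        os.remove(claim)  # the lock itself lives on as lock_path


def release_lock(lock_path):
    os.remove(lock_path)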
Running
import sqlite3
db = sqlite3.connect('db')
cursor = db.cursor()
cmd = "CREATE TABLE IF NOT EXISTS 'state' (count INTEGER)"
cursor.execute(cmd)
on CIFS results in:
Traceback (most recent call last):
File "azureml-setup/context_manager_injector.py", line 161, in <module>
execute_with_context(cm_objects, options.invocation)
File "azureml-setup/context_manager_injector.py", line 90, in execute_with_context
runpy.run_path(sys.argv[0], globals(), run_name="__main__")
File "/azureml-envs/azureml_f46203ca27ee37bd5932e64f3549ae1c/lib/python3.6/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "/azureml-envs/azureml_f46203ca27ee37bd5932e64f3549ae1c/lib/python3.6/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/azureml-envs/azureml_f46203ca27ee37bd5932e64f3549ae1c/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "code/training_scripts/train_azure_test.py", line 108, in <module>
cursor.execute(cmd)
sqlite3.OperationalError: database is locked
For the record: another user ran into NFS problems: https://discordapp.com/channels/485586884165107732/485596304961962003/575535709490905101
Another user might be hitting the same issue, but with Lustre FS: https://discordapp.com/channels/485586884165107732/485596304961962003/582672973660684291
Hello, this is also an issue for me working on a CIFS file system. It would be great to get a workaround. Thanks!
@brbarkley , as @efiop wrote in the description:
There are workarounds, such as setting the proper mnt options or moving the dvc project off the share while keeping an external cache directory on the NFS/CIFS mount...
Although, I'm not sure about the proper mnt options. Would you mind explaining them, @efiop? I couldn't find anything on https://linux.die.net/man/8/mount.cifs
Maybe you need to enable Unix Extensions for CIFS :thinking:
https://askubuntu.com/questions/982266/how-to-mount-cifs-with-unix-extensions#995142
Thanks @mroutis.
Unfortunately, I don't have sudo rights to change the mount options as described in the link you provided. But I can keep this in mind to discuss with my admins to see if it would solve the problem.
I think one of the workarounds @efiop refers to is described in the discord conversation here, which proposes pulling into a local drive from a dvc remote located on an NFS or CIFS share and using an external cache to optimize space on the local drive. While this helps to some extent, the storage limit on the local drive of the linux server I'm working on is so strict (~5GB) that @efiop's workaround only works for very small projects in my case.
Currently, I don't see a workaround described that would allow a user to dvc pull into a git repo located on an NFS or CIFS location.
@brbarkley
While this helps to some extent, the storage limit on the local drive of the linux server I'm working on is so strict (~5GB) that @efiop's workaround only works for very small projects in my case.
I suppose you are not using symlinks for it, right? I mean dvc config cache.type symlink. Could you describe your scenario a little more? Is it a deployment or something? If it is, then using ^ would solve the problem, because it would not use any space at all on your local drive. If, however, you run the risk that someone or something would try to modify those files in place in the workspace, then you might also consider using dvc config cache.protected true, so dvc makes those files read-only, preventing accidental edits which might result in cache corruption.
@efiop
I suppose I am currently either using one of the default options, reflink or copy, as I have not modified the cache.type. I can view .dvc/config but it does not seem to show all default values; is there a way to view current config values?
My workflow in general is estimating a model, e.g., python -m src.predict_model, which takes dvc-tracked deps, e.g., data/df_processed_data.pkl, and outputs dvc-tracked outs, e.g., output/model_output.pkl. A workspace state thus contains a specific version of the model specification (tracked by git) and its associated data input(s) and model output(s) (both tracked by dvc). When I change the model specification, I then execute dvc repro based on updated pipeline and it produces an updated output/model_output.pkl file given the new model specification.
Based on the description in the dvc docs, I'm not sure if my use case is compatible with using symlinks or hardlinks, as my pipeline entails more than just "adding files to a directory input dataset" or "deleting files from a directory dataset" (in the above example, it seems I'm overwriting output/model_output.pkl with each new specification):
dvc unprotect can be an expensive operation (involves copying data). Check first whether your task matches one of the cases that are considered safe, even when cache protected mode is enabled:
- Adding more files to a directory input dataset (say, images or videos)
- Deleting files from a directory dataset
Please advise if I'm misreading the guidance on dvc unprotect. I'm open to using symlinks if it would be safe/appropriate for my use case.
Thanks!
(Aside: To date, I have been developing and running the pipeline locally on a Windows PC with the dvc remote in a CIFS location. However, I would like to set up a deployment configuration on a remote linux machine by which I would develop locally but run the pipeline remotely, along the lines of this and/or this. As mentioned above, though, the storage capacity in my local user space on the remote linux machine is severely limited.)
I suppose I am currently either using one of the default options, reflink or copy, as I have not modified the cache.type. I can view .dvc/config but it does not seem to show all default values; is there a way to view current config values?
Yes, by default it is set to reflink, copy (if reflink is not available, it will use copy).
My workflow in general is estimating a model, e.g., python -m src.predict_model, which takes dvc-tracked deps, e.g., data/df_processed_data.pkl, and outputs dvc-tracked outs, e.g., output/model_output.pkl. A workspace state thus contains a specific version of the model specification (tracked by git) and its associated data input(s) and model output(s) (both tracked by dvc). When I change the model specification, I then execute dvc repro based on updated pipeline and it produces an updated output/model_output.pkl file given the new model specification.
Is the output big too? Will it fit on your drive? If it won't, then the workaround I proposed won't work, because initially those files will lie as-is in the workspace, and only then will dvc move them to the cache and create symlinks, freeing space in the workspace.
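Roughly, the mechanism described above looks like the sketch below. The helper name and paths are illustrative only; dvc's real implementation also hashes the file, falls back between link types, protects the cache entry, and so on:
import os
import shutil


def move_to_cache_and_link(workspace_path, cache_path):
    """Illustrative sketch: the output first exists as a regular file in the
    workspace, then gets moved into the cache and replaced with a symlink,
    which is what frees the space on the local drive."""
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    shutil.move(workspace_path, cache_path)
    os.symlink(cache_path, workspace_path)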
Please advise if I'm misreading the guidance on dvc unprotect. I'm open to using symlinks if it would be safe/appropriate for my use case.
In your case, I suppose you are not modifying the dependency files, right? So you don't need to unprotect those. Outputs are newly created, so you also don't need to unprotect those if you don't plan on modifying them in-place. So unless your output won't temporarily fit onto your drive as explained above, using symlinks with protected mode (or even without it if you are sure that you won't try to edit those files in-place) should be suitable for you.
If that workaround doesn't really work for you, another option would be to modify dvc to support moving temporary files out of the repository. E.g. dvc config tmp_files_dir /path/to/dir/not/on/nfs, that would make dvc put its locks and stuff there, while allowing you to comfortably work on NFS.
In your case, I suppose you are not modifying the dependency files, right? So you don't need to unprotect those.
I actually do refresh the underlying data on a periodic basis.
Outputs are newly created, so you also don't need to unprotect those if you don't plan on modifying them in-place.
Outputs are newly created but they're written to the same file name (i.e., model_output.pkl) each time. Model output for current project isn't too large. But I would like my workflow to scale to larger projects.
If that workaround doesn't really work for you, another option would be to modify dvc to support moving temporary files out of the repository. E.g. dvc config tmp_files_dir /path/to/dir/not/on/nfs, that would make dvc put its locks and stuff there, while allowing you to comfortably work on NFS.
So, in this case would the model_output.pkl be temporarily saved to tmp_files_dir then moved to cache and symlinks created in workspace?
Outputs are newly created but they're written to the same file name (i.e., model_output.pkl) each time. Model output for current project isn't too large. But I would like my workflow to scale to larger projects.
Same file name is fine, as long as you don't edit the existing file in-place. Think of it this way: the symlink points directly to the cache file, so if you try to edit it, the cache file itself gets edited and thus corrupted. As long as you remove the existing file before creating a new one, you should be fine.
So, in this case would the model_output.pkl be temporarily saved to tmp_files_dir then moved to cache and symlinks created in workspace?
No, it would stay on your NFS mount. That tmp dir is only for temporary untracked files like .dvc/lock, .dvc/state, and .dvc/updater.lock. So it won't require additional space on your drive, no matter how big your data is.
@brbarkley , if I remember correctly, dvc run / dvc repro unlinks the previous output file (model_output.pkl) before executing the cmd; the only problem with symlinks is modifications by the user (like opening the file with an editor and modifying the content). That's why @efiop mentioned "As long as you remove the existing file before creating a new one, you should be fine".
Thanks @efiop and @mroutis
if I remember correctly, dvc run / dvc repro unlinks the previous output file (model_output.pkl) before executing the cmd; the only problem with symlinks is modifications by the user (like opening the file with an editor and modifying the content).
Got it. While I do use dvc repro to update my pipeline and generally do not open data and output to edit files manually in place, I do use a debugger to debug my python files. If the debug runs successfully, outputs like model_output.pkl would be saved to disk outside of dvc repro execution and without unlinking my files. Any solutions here? Not sure if I could setup my debugger to execute the python file via dvc?
In your case, I suppose you are not modifying the dependency files, right? So you don't need to unprotect those.
Reading your comment more closely, I do modify git-tracked dependencies in place, like code files (e.g., predict_model.py). But they are not stored in the dvc cache, correct, so that would not risk cache corruption? Or would it?
Got it. While I do use dvc repro to update my pipeline and generally do not open data and output to edit files manually in place, I do use a debugger to debug my python files. If the debug runs successfully, outputs like model_output.pkl would be saved to disk outside of dvc repro execution and without unlinking my files. Any solutions here? Not sure if I could setup my debugger to execute the python file via dvc?
Not quite sure how it looks using a debugger (pdb?) with pkls. Are they modified in-place? Is that a part of a regular pipeline? Like creating a pkl and then another stage to debug it and modify it? In that case I think you could indeed run dvc run ... pdb .... But I can't tell for certain, you would need to give it a try and see for yourself.
Reading your comment more closely, I do modify git-tracked dependencies in place, like code files (e.g., predict_model.py). But they are not stored in the dvc cache, correct, so that would not risk cache corruption? Or would it?
It wouldn't. Cache corruption risk is only about modifying files that are cached with dvc, correct.
Not quite sure how it looks using a debugger (pdb?) with pkls. Are they modified in-place? Is that a part of a regular pipeline? Like creating a pkl and then another stage to debug it and modify it? In that case I think you could indeed run dvc run ... pdb .... But I can't tell for certain, you would need to give it a try and see for yourself.
Well, I'm not really debugging the pkl files. I'm debugging the python files that create the pkl files to test that they run properly (or at all), for example, using the interactive debugger features in PyCharm.
The debug process is not really incorporated/captured by any of the dvc stages. I'll have to think about it some more. If there was a way to configure the debugger to not actually write to any files, that could work. Writing to files is a trivial task and I don't really care about that in the debug process.
Actually, I suppose if I had dvc lock symlink files, the debugger would fail every time the python script tries to save to a file, which would be good for protecting my dvc cache but not so much for the debug process...
ohh that's an interesting case, @brbarkley. As a work around, if your scripts are writing instead of appending to the pkl file, you can os.remove the file before writing into it, this way, even if you run the script without DVC (in this case, for debugging purposes) it won't affect the cache.
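A minimal sketch of that workaround, with placeholder paths and a placeholder result object (none of this is dvc API, just plain Python):
import os
import pickle

OUT = "output/model_output.pkl"
model_output = {"example": 1}  # placeholder for whatever the script really produces

os.makedirs("output", exist_ok=True)

# Remove the symlink (or stale file) first, so the write below creates a
# fresh regular file in the workspace instead of writing through the link
# into the dvc cache.
if os.path.lexists(OUT):
    os.remove(OUT)

with open(OUT, "wb") as fobj:
    pickle.dump(model_output, fobj)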
Actually, I suppose if I had dvc lock symlink files, the debugger would fail every time the python script tries to save to a file, which would be good for protecting my dvc cache but not so much for the debug process...
:thinking: DVC locks are more like an option for stage files to prevent repro from running the command.
def lock(self, target, unlock=False):
    from dvc.stage import Stage
    stage = Stage.load(self, target)
    stage.locked = False if unlock else True
    stage.dump()
    return stage
I'm not sure if it is a good idea to chmod -w the cache; this will prevent any accidental write on symlinks pointing to the cache. What do you think, @efiop?
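For what it's worth, a small self-contained sketch of that idea with made-up file names: a read-only cache entry turns an accidental in-place write through a symlink into a loud error (assuming a non-root user), which is essentially what cache.protected does:
import os
import stat

cache_file = "cache_entry.bin"   # stand-in for a .dvc/cache object
link = "model_output.pkl"        # stand-in for the workspace symlink

with open(cache_file, "wb") as fobj:
    fobj.write(b"cached bytes")
os.chmod(cache_file, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # i.e. chmod -w

if os.path.lexists(link):
    os.remove(link)
os.symlink(cache_file, link)

try:
    with open(link, "wb") as fobj:  # accidental in-place edit attempt
        fobj.write(b"oops")
except PermissionError:
    print("cache entry is read-only; remove the link (or dvc unprotect) first")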
As a work around, if your scripts are writing instead of appending to the pkl file, you can os.remove the file before writing into it, this way, even if you run the script without DVC (in this case, for debugging purposes) it won't affect the cache.
Would this work if dvc config cache.protected true?
And if I os.remove, for example, model_output.pkl and save an updated model_output.pkl in my workspace during debugging, will that save the actual file in my workspace or just a symlink to the file?
Sounds like it would save the actual file in my workspace, then when I actually run dvc repro, the file saved during debugging would get replaced by a symlink to the newly updated file in cache. Just want to make sure I'm understanding correctly.
DVC locks are more like an option for stage files to prevent repro from running the command.
I meant to refer to the dvc config cache.protected true option. That's what I meant by "locking" the files; I should have been more specific. So, for example, if dvc config cache.protected true, I wouldn't be able to edit them or write over them, correct?
Would this work if dvc config cache.protected true?
@brbarkley , it will! Forgot about that configuration option :sweat_smile:
if dvc config cache.protected true, I wouldn't be able to edit them or write over them, correct?
Correct :smiley: . You'd need to unprotect your file before editing it.
And if I os.remove, for example, model_output.pkl and save an updated model_output.pkl in my workspace during debugging, will that save the actual file in my workspace or just a symlink to the file?
It will save the actual file in the workspace (as you stated later).
Just want to make sure I'm understanding correctly.
You do, @brbarkley :)
So, you can use dvc config cache.protected true :smile:
Ok, thank you @mroutis. I will experiment.
The questions need to be answered:
.dvc ?
@Suor You might find this one useful to look at: https://gitlab.com/warsaw/flufl.lock/blob/master/flufl/lock/_lockfile.py (no py2 though :slightly_frowning_face: ).
As far as I can see this is for locks within a single process.
Are there some key-value, file-backed DBs we could switch to from sqlite that work on NFS/CIFS/...?
Not sure if they have the same lock mechanisms.
UPDATE:
UnQLite shares some SQLite low-level components such as the VFS (Virtual File System), the pager layer, and the locking mechanism.
https://www.unqlite.org/faq.html
:disappointed:
@Suor , another question would be around the idea of disabling the lock mechanisms in sqlite.
@Suor Yeah, I'm just saying that the implementation is hardlink-based, which might be useful. Just throwing that one out there :slightly_smiling_face: Ran into it accidentally.
@mroutis there is a nolock option, but using it will lead to possible corruption.
Some takeouts from the discussion with @efiop:
- use sqlite's nolock option, use those locks around sqlite and in place of any current locks (a rough sketch of this combination follows below).
- open(O_EXCL | O_CREAT | ...) locks don't work on pre-v3 NFS.
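A rough sketch of how those two takeouts could fit together, assuming the hypothetical hardlink-based acquire_lock/release_lock helpers sketched earlier in this thread; the database path is arbitrary, and whether this is safe enough in practice is exactly the open question:
import sqlite3

acquire_lock(".dvc/state.lock")  # NFS-friendly external lock (hypothetical helper)
try:
    # nolock=1 disables sqlite's own POSIX file locking, so correctness now
    # relies entirely on the external lock held above.
    db = sqlite3.connect("file:.dvc/state?nolock=1", uri=True)
    cursor = db.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS 'state' (count INTEGER)")
    db.commit()
    db.close()
finally:
    release_lock(".dvc/state.lock")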