Background:
We used to do `aws s3 sync` for pulling our reference data set. Now we are trying to use DVC and are looking for a similar `dvc sync` or equivalent behaviour.
Issue:
We run our jobs in an AWS Batch deployment environment, and there were situations where two jobs or two processes happened to do `dvc pull` in the same checked-out DVC project location, hence hitting "Unable to acquire lock" and the job script failing/exiting. We have tried `dvc get`, but it misses the cache download, and the reference dataset is generally big. As a workaround, we have to check `.dvc/tmp/lock` to peek at the PID and run `ps -p $PID` to block one process while the other's `dvc pull` is in progress; see PR https://github.com/umccr/infrastructure/pull/90. However, this workaround is a bit cumbersome and not foolproof.
Use Case:
We quite like the way `dvc` works, and we can switch between `dvc pull` or `dvc get` for pulling a dataset at a specific tag or version, etc. At the moment, our use case is mainly versioning reference blob datasets and pulling them into the deployment environment to run report/analysis jobs.
Maybe I should explain our use case in a bit more detail. We found `dvc` and started exploring/using it in place of the previous method. The problem is that one job fails at the `dvc pull` step while the other process is still downloading the dataset.

Desirable:
What I am referring to is something similar to the `aws s3 sync` behaviour described in this article: no matter how many times you run, say, `dvc sync`, it will do a diff check and cache check like `dvc pull`, etc., and sync up the dataset in an idempotent way. The whole use case would be implicitly read-only, though. For a start, the `dvc sync` feature could be very similar to `dvc pull`, perhaps without locking, for dataset sync-up or downloading/staging purposes.
Hi @victorskl!
Do I understand correctly that the only problem for you with `dvc get` is that it doesn't preserve the cache between runs, so it has to download stuff each time it runs, even if the cache for that data was downloaded before? But otherwise `dvc get` suits you best in your scenario?
> Hi @victorskl!
> Do I understand correctly that the only problem for you with `dvc get` is that it doesn't preserve the cache between runs, so it has to download stuff each time it runs, even if the cache for that data was downloaded before? But otherwise `dvc get` suits you best in your scenario?
Kind of, yes. However, AFAIK, `dvc get` does not have some options, compared to `dvc pull`:

- no multi-threads / parallel download option (i.e. `-j`, `--jobs`)

The missing `--jobs` option may be OK, such that the user could invoke multiple `dvc get` processes, if `dvc get` can handle the cache check, diff check, and deciding which files to sync or skip when file checksums match, etc. Hence, proposing `dvc sync`.
Please correct me if I understood it wrong. Thanks.
@victorskl Thanks for clarifying! A few more questions to make sure I understand your use case:

> no multi-threads / parallel download option (i.e. `-j`, `--jobs`)

You mean that the defaults it has don't suit you, right? By default, we use 4*NCPU jobs for S3 remotes. Is that too little?
Btw, do you want to download all the data used in the project (e.g. how `dvc pull` works without targets) or particular datasets?
> no multi-threads / parallel download option (i.e. `-j`, `--jobs`)
>
> You mean that the defaults it has don't suit you, right? By default, we use 4*NCPU jobs for S3 remotes. Is that too little?

Sorry for the confusion. Here I was referring to the `dvc get` command, which has no `--jobs` option. And I am not sure whether `dvc get` downloads data in multiple threads; so far, I observe it is single-threaded, yes?
> Btw, do you want to download all the data used in the project (e.g. how `dvc pull` works without targets) or particular datasets?

We normally pull all reference data in, so yes, we download all the data at that point in time, or at a particular commit version. Our use case is more pronounced on checking out a revision or tag, then getting everything as fast as possible: no lock or any blocking; multi-threads, multi-processes, etc. Suffice to say, at this point we are pulling data for the run, not for modification.
@victorskl `dvc get` does use multiple threads when `pull`ing the data internally. The number of those threads is the same as the default `--jobs` in `dvc pull`.
> We normally pull all reference data in, so yes, we download all the data at that point in time, or at a particular commit version. Our use case is more pronounced on checking out a revision or tag, then getting everything as fast as possible: no lock or any blocking; multi-threads, multi-processes, etc. Suffice to say, at this point we are pulling data for the run, not for modification.
Makes sense. I suppose you are also using some git files (e.g. scripts/code) from that same git revision? If so, even more reason to stick with `dvc pull`. A simple workaround might be to make your workers create separate instances of the git repo, but make them all use one shared cache (see the `dvc cache dir` command). That would avoid the repo lock conflicts between workers and would preserve the cache for any worker to use (`checkout`s will be really quick; see https://dvc.org/doc/user-guide/large-dataset-optimization if you don't use reflinks/hardlinks/symlinks already). Would that work for you?
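That setup might look something like the following sketch. The clone URL and cache path are hypothetical, and the two `dvc config` lines are optional tuning rather than required steps:

```shell
# Each worker gets its own repo clone, but all clones point at one shared cache.
# The URL and paths below are placeholders; adjust to your deployment.
git clone https://github.com/example/reference-data.git worker-1
cd worker-1
dvc cache dir /mnt/shared/dvc-cache   # shared cache location for all workers
dvc config cache.shared group         # group-writable cache for multiple users
dvc config cache.type symlink         # link out of the cache instead of copying
dvc pull
```

With the cache on a shared volume, only the first worker actually downloads each file; the rest link it out of the cache.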
@efiop thanks for the suggestion.
I can see how that would solve the lock conflicts on the repo level. However, would that not just push them to the cache level?
I imagine two processes started at the same time. They start a `dvc pull` in their respective repos, but since the cache is (initially) empty, they would proceed to fill it, potentially running into contention pulling and writing the same files?
Or is that handled gracefully at this point? I.e. can the cache be filled by two independent parallel processes?
Sorry, quite new to DVC, so this may be a stupid question.
@reisingerf Very good question! The cache is designed to be filled by multiple processes at the same time. The way we do it is: when downloading a file, we first download it to a temporary file right beside the final location (e.g. `.dvc/cache/12/3456.tmp`) and then atomically `rename` it into place. That being said, there is a potential for one process to discover that a particular file was already downloaded, not be able to `rename` it, and fail the `dvc pull`, but the chances are pretty low in regular use cases. Though it would be good for us to handle those rare cases more gracefully, by just recognizing that the file already exists and that we are safe to skip it.
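The download-into-temp-then-rename pattern described above can be sketched in shell. This is illustrative only: the cache path mirrors the example in the comment, and the `printf` stands in for the real download, not DVC internals:

```shell
# Sketch of downloading into a temp file and atomically renaming into place.
# The cache path and the `printf` "download" are placeholders.
cache_file=".dvc/cache/12/3456"
tmp_file="${cache_file}.$$.tmp"        # unique temp name per process (PID)
mkdir -p "$(dirname "$cache_file")"
printf 'blob contents' > "$tmp_file"   # stand-in for the real download
mv "$tmp_file" "$cache_file"           # rename: atomic on the same filesystem
```

Because the temp file lives beside the final location, the `mv` stays on one filesystem and is a single atomic rename, so readers never observe a half-written cache file.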
@efiop thanks a lot for the detailed explanation!
We are working with some fairly large files which may take a little while to download. So I guess there is a chance that two parallel processes, especially when they start at pretty much the same time, try to fetch the same file at the same time.
In any case, I think we'll try your suggestion and see if we run into issues.
Thanks again for your help!
@reisingerf Makes sense! Well, for now, the worst-case scenario is that you might have to run `dvc pull` once again for it to discover the file that was downloaded by another process after it fails to `move`, so even then data won't be lost and you won't have to download it again. You could even put a `dvc pull` loop in your bootstrapping code that retries N more times if the command fails. We will definitely look into handling this more gracefully on our side.
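A retry loop like that might look as follows. This is a sketch: the retry count, the sleep interval, and the `retry` helper name are arbitrary choices, and the commented-out last line shows how it would be invoked in a bootstrap script:

```shell
#!/bin/sh
# Retry a command up to N times, sleeping a second between attempts.
# Usage: retry <attempts> <command> [args...]
retry() {
    attempts=$1; shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $i of $attempts failed" >&2
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# In the bootstrap script this would be:  retry 3 dvc pull
```

The loop returns as soon as one attempt succeeds, so a pull that fails only on the rare rename race simply gets re-run and finds the file already in the cache.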
Just to update: we are now using a common DVC cache directory, as per the suggested workaround, to avoid the `dvc pull` lock issue. It is working so far.
@victorskl Glad to hear it works! I guess the only thing left in this ticket is to check for that potential recoverable bug I described before? Or is there anything else you would like to add here?
> I guess the only thing left in this ticket is to check for that potential recoverable bug I described before? Or is there anything else you would like to add here?

Not much to add, yes. We are okay if it works this way for now. We are still concerned that we might encounter the potential recoverable edge case that @reisingerf and you described, yes. It works well so far in the DEV environment. Happy to report back after we start putting the pipeline into production and have made a couple of production runs.
@victorskl Sounds good! Created https://github.com/iterative/dvc/issues/4992 for it. Closing this issue for now. Thanks again for the feedback!