Dvc: DVC Sync

Created on 11 Nov 2020 · 13 Comments · Source: iterative/dvc

Background:
We used to do aws s3 sync for pulling our reference data set. Now we are trying to use dvc and are looking for a similar dvc sync or equivalent behaviour.

Issue:
We run our jobs in an AWS Batch deployment environment, and there are situations where two jobs or two processes happen to do dvc pull in the same checked-out DVC project location, hence hitting "Unable to acquire lock" and the job script fails/exits. We have tried dvc get, but it misses the cache between downloads and the reference dataset is generally big. As a workaround, we have to check .dvc/tmp/lock to peek at the PID and use ps -p $PID to make one process wait while the other process has a dvc pull in progress; see PR https://github.com/umccr/infrastructure/pull/90. However, this workaround is a bit cumbersome and not foolproof.
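
Roughly, the current workaround looks like the sketch below (illustrative only; it assumes the PID of the lock holder can be read out of .dvc/tmp/lock as described above):

```bash
#!/usr/bin/env bash
# Sketch of the wait-on-lock workaround described above (illustrative only).
LOCK_FILE=".dvc/tmp/lock"

while [ -f "$LOCK_FILE" ]; do
    PID=$(tr -dc '0-9' < "$LOCK_FILE")              # strip everything but digits
    if [ -n "$PID" ] && ps -p "$PID" > /dev/null 2>&1; then
        echo "Another dvc pull (pid $PID) is in progress, waiting..."
        sleep 30
    else
        break                                        # stale or unreadable lock; proceed
    fi
done

dvc pull
```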

Use Case:
We quite like the way dvc works, and we can switch between dvc pull and dvc get for pulling a dataset at a specific tag or version, etc. At the moment, our use case is mainly versioning reference blob datasets and pulling them into the deployment environment to run the report/analysis job.

Maybe I should explain our use case in a bit more detail.

  • we have a post-analysis cancer report pipeline called umccrise
  • it needs a reference dataset in order to run
  • we used to take a simple approach: we created a manifest checksum file of the blob dataset and put this manifest file under git to track changes
  • recently we found dvc and started exploring/using it in place of the previous method
  • so far so good, and it is in a CI/CD CodeBuild pipeline where everything runs in a single build process; so no issue with dvc pull there!
  • we also run this in an AWS Batch environment, where we have a wrapper script to stage the reference dataset
  • AWS Batch under the hood is an ECS cluster whose queue/scheduler dispatches jobs onto available instances for processing and scales dynamically based on the number of samples (workload)
  • at some point, the AWS Batch scheduler may allocate, say, two samples onto the same instance; hence, there are two invocations of the wrapper script, i.e. two processes. Therefore, one job ends up failing at the dvc pull step while the other process is still downloading the dataset.

Desirable:
What I am referring to is similar to the aws s3 sync behaviour described in this article: no matter how many times you run dvc sync, it would do a diff check and a cache check like dvc pull, etc., and sync up the dataset in an idempotent way. The whole use case would be implicitly read-only, though. For a start, a dvc sync feature could be very similar to dvc pull, perhaps without locking, intended purely for dataset sync-up or downloading/staging purposes.
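
To illustrate the kind of invocation we have in mind (the dvc sync command below is purely hypothetical; bucket names, tags and flags are made up):

```bash
# Today, with S3 -- idempotent and safe to re-run any number of times:
aws s3 sync s3://example-bucket/refdata /refdata

# Roughly what a hypothetical dvc sync might look like (not a real command):
# diff/cache checks like dvc pull, read-only staging, no repo lock to fight over.
dvc sync --rev v1.0.0 --jobs 16
```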

feature request p3-nice-to-have

Most helpful comment

Just to update: we are now using a common DVC cache directory, as per the suggested workaround, to avoid the dvc pull lock issue. It is working so far.

All 13 comments

Hi @victorskl !

Do I understand correctly that the only problem for you with dvc get is that it doesn't preserve cache between runs so it has to download stuff each time it runs even if cache for that data was downloaded before? But otherwise dvc get suits you best in your scenario?

Kind of, yes. However, AFAIK, dvc get lacks two things compared to dvc pull:

  • yes, it doesn't preserve/check the downloaded cache between two runs (it always checks out a new temp folder and then downloads)
  • no multi-thread / parallel download option (i.e. -j / --jobs)

The missing --jobs option might be okay if the user can simply invoke multiple dvc get processes, say, provided dvc get could handle the cache check, the diff check, and deciding which files to sync or skip when their checksums match, etc. Hence, proposing dvc sync.

Please correct me if I understood it wrong. Thanks.

@victorskl Thanks for clarifying! A few more questions to make sure I understand your use case:

no multi-thread / parallel download option (i.e. -j / --jobs)

You mean that the defaults it has don't suit you, right? By default, we use 4*NCPU jobs for s3 remotes. Is that too little?
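
For context, the parallelism on dvc pull can also be set explicitly if the defaults ever need tuning (the numbers below are only examples):

```bash
# Mirror the documented default of roughly 4 * NCPU explicitly:
dvc pull --jobs "$((4 * $(nproc)))"

# Or simply raise it if bandwidth allows:
dvc pull -j 64
```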

Btw, do you want to download all the data used in the project (e.g. how dvc pull works without targets) or particular datasets?

no multi-thread / parallel download option (i.e. -j / --jobs)

You mean that the defaults it has don't suit you, right? By default, we use 4*NCPU jobs for s3 remotes. Is that too little?

Sorry for the confusion. Here I was referring to the dvc get command, which has no --jobs option. And I am not sure whether dvc get downloads data with multiple threads; so far, I have observed it using a single thread, yes?

Btw, do you want to download all the data used in the project (e.g. how dvc pull works without targets) or particular datasets?

We normally pull all the reference data in, so yes, we download all the data at that point in time or for a particular commit version. Our use case is more about checking out a revision or tag and then getting everything as fast as possible -- no lock or any blocking, multiple threads, multiple processes, etc. -- suffice to say, at this point we are pulling data for the run, not for modification.

@victorskl dvc get does use multiple threads when pulling the data internally. The number of those threads is the same as default --jobs in dvc pull.

We normally pull all the reference data in, so yes, we download all the data at that point in time or for a particular commit version. Our use case is more about checking out a revision or tag and then getting everything as fast as possible -- no lock or any blocking, multiple threads, multiple processes, etc. -- suffice to say, at this point we are pulling data for the run, not for modification.

Makes sense. I suppose you are also using some git files (e.g. scripts/code) from that same git revision? If so, that's even more reason to stick with dvc pull. A simple workaround might be to make your workers create separate instances of the git repo, but make them all use one shared cache (see the dvc cache dir command). That would avoid the repo lock conflicts between workers and would preserve the cache for any worker to use (checkouts will be really quick; see https://dvc.org/doc/user-guide/large-dataset-optimization if you aren't using reflinks/hardlinks/symlinks already). Would that work for you?
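
A minimal sketch of that setup, assuming a shared filesystem path for the cache (the repo URL, tag and paths below are made up):

```bash
SHARED_CACHE=/mnt/dvc-cache                     # one cache shared by all workers on the host

# Each worker gets its own throwaway checkout of the same repo:
WORKDIR=$(mktemp -d)
git clone https://github.com/example/refdata.git "$WORKDIR/refdata"   # hypothetical repo
cd "$WORKDIR/refdata"
git checkout "$REF_TAG"                          # revision/tag to stage

# Point this checkout at the shared cache and prefer links over copies,
# so data already in the cache is checked out almost instantly:
dvc cache dir "$SHARED_CACHE"
dvc config cache.type "reflink,hardlink,symlink,copy"

dvc pull
```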

@efiop thanks for the suggestion.

I can see how that would solve the lock conflicts on the repo level. However, would that not just push them to the cache level?

I imagine two processes started at the same time. They start a dvc pull in their respective repos, but since the cache is (initially) empty they would proceed to fill it, potentially running into the same contention pulling and writing the same files?
Or is that handled gracefully at this point? I.e. can the cache be filled by two independent parallel processes?

Sorry, quite new to DVC, so this may be a stupid question.

@reisingerf Very good question! The cache is designed to be filled by multiple processes at the same time. The way we do it is that when downloading a file, we first download it to a temporary file right beside its final location (e.g. .dvc/cache/12/3456.tmp) and then atomically rename it into place. That being said, there is a potential for one process to discover that that particular file was already downloaded by another process, fail to rename its copy into place, and so fail the dvc pull, but the chances of that are pretty low in regular use cases. It would be good for us to handle those rare cases more gracefully, i.e. just recognize that the file already exists and that we are safe to skip it.
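
In other words, it is the classic write-to-a-temp-file-then-rename pattern; an illustration in shell terms (not DVC's actual code; the file names are made up):

```bash
DEST=".dvc/cache/12/3456"
TMP="${DEST}.tmp"

# Download next to the final location, on the same filesystem...
aws s3 cp "s3://example-remote/12/3456" "$TMP"

# ...then rename into place; a rename within one filesystem is atomic,
# so other processes never see a partially written cache file.
mv "$TMP" "$DEST"
```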

@efiop thanks a lot for the detailed explanation!

We are working with some fairly large files which may take a little while to download. So I guess there is a chance that two parallel processes, especially when they start at pretty much the same time, try to fetch the same file at the same time.

In any case, I think we'll try your suggestion and see if we run into issues.

Thanks again for your help!

@reisingerf Makes sense! Well, for now, the worst-case scenario is that you might have to run dvc pull once more so that it discovers the file that was downloaded by another process after the first run fails to move it; no data will be lost and you won't have to download it again. You could even put a dvc pull loop in your bootstrapping code that retries N more times if the command fails. We will definitely look into handling this more gracefully on our side.
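
Something along these lines in the bootstrapping/wrapper script would do it (the attempt count and sleep interval are arbitrary):

```bash
# Retry dvc pull a few times before giving up.
attempts=0
max_attempts=5
until dvc pull; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge "$max_attempts" ]; then
        echo "dvc pull still failing after $attempts attempts" >&2
        exit 1
    fi
    echo "dvc pull failed (attempt $attempts/$max_attempts), retrying..."
    sleep 10
done
```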

Just to update: we are now using a common DVC cache directory, as per the suggested workaround, to avoid the dvc pull lock issue. It is working so far.

@victorskl Glad to hear it works! I guess the only thing left in this ticket is to check for that potential recoverable bug I described before? Or is there anything else you would like to add here?

I guess the only thing left in this ticket is to check for that potential recoverable bug I described before? Or is there anything else you would like to add here?

Not much to add, yes. We are okay if it works this way for now. We are still a bit concerned that we might hit the potential recoverable edge case that @reisingerf and you described, yes. It works well so far in our DEV environment. Happy to report back after we put the pipeline into production and have made a couple of production runs.

@victorskl Sounds good! Created https://github.com/iterative/dvc/issues/4992 for it. Closing this issue for now. Thanks again for the feedback!
