For projects created with dvc init --no-scm, since there's no Git repo to version all the files NOT tracked by DVC (code, DVC-files), it could be useful to have a dvc export <external-location> command to easily create a lightweight copy of the project (for backup). It's "lightweight" because it wouldn't include any of the data tracked by DVC.
Similar to git archive. Could even include an --archive flag to make a tar/zip bundle of the export.
In the future, similar bundling/compressing functionality for actual data sets could be reused.
Just a random idea! (It came from reading some conversations about non-Git projects on Discord.)
UPDATE: To see the latest discussion go to https://github.com/iterative/dvc.org/issues/1521#issuecomment-558726115, but in summary:
Don't need a new command for now, just document this to archive a snapshot:
git archive -o code.zip HEAD
dvc list . -R --dvc-only | zip -@ data.zip # if `zip` available
dvc list . -R --dvc-only | xargs python -m zipfile -c data.zip # alternative for windows (assuming `xargs` available)
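Where neither `zip` nor `xargs` is available, the same step could be done with a few lines of Python. This is only a sketch: the function name `zip_listed_paths` is hypothetical (not part of DVC), and it assumes the input is one relative file path per line, as printed by the `dvc list . -R --dvc-only` command above.

```python
import zipfile
from pathlib import Path


def zip_listed_paths(lines, archive_path, base_dir="."):
    """Add each path named in `lines` (one per line) to a new zip archive.

    `lines` can be any iterable of strings, e.g. sys.stdin when the output
    of a listing command is piped in. Paths are taken relative to base_dir.
    """
    base = Path(base_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for line in lines:
            rel = line.strip()
            if not rel:
                continue  # ignore blank lines
            # Store the file under the same relative name it was listed with.
            zf.write(base / rel, rel)
            added.append(rel)
    return added
```

Saved in a script that ends with `zip_listed_paths(sys.stdin, "data.zip")` (after `import sys`), it could sit at the end of the pipe in place of `zip -@ data.zip`.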
@jorgeorpinel User can do that with tar or zip no problem. Wouldn't bother with this until someone asks for this functionality and has good reasons why he can't use tar or zip :slightly_smiling_face: There is a special reason for git archive -- it excludes .git directory, which is somewhat useful. For us there is nothing we should exclude, so it would be the same as running tar or zip on the whole directory. Unless I'm missing something here.
For us there is nothing we should exclude, so it would be the same as running tar or zip...
What about huge data files? I'm talking about a lightweight copy of the repo as if you just cloned it with Git (but for --no-scm projects). A backup of the DVC project without data files.
@jorgeorpinel Ah, got it. Yes, I can now see the value there. Thanks for clarifying! :slightly_smiling_face: Please feel free to raise the priority if you need this feature, otherwise I would probably wait until there is a clear useful scenario in which someone will actually use this.
No problem. Yes, I agree maybe no one really needs this haha. p3 seems correct, maybe no p at all for now. Let's wait and see. Thanks!
p.s. another variant of this feature could be something like dvc clear to delete all linked data files from the workspace, thus producing a lightweight project (except for the cache dir.) This could easily be reverted again by dvc fetch. Let's see if anyone ever wants this in the future. 😋
another variant of this feature could be something like dvc clear to delete all linked data files from the workspace
Another alternative solution might also be provided by dvc list (when it is implemented).
If we can list all the files that are managed by DVC, then it is possible to exclude them while making the archive.
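The exclusion idea can be sketched in Python as follows. Everything here is an assumption for illustration: the helper name `archive_without_tracked` is hypothetical, the list of tracked paths is expected to come from something like `dvc list`, and `.dvc/cache` is skipped on the assumption that the cache sits in its default location.

```python
import os
import zipfile
from pathlib import Path


def archive_without_tracked(project_dir, archive_path, tracked_paths,
                            skip_dirs=(".dvc/cache",)):
    """Zip project_dir, leaving out DVC-tracked files and skipped directories."""
    root = Path(project_dir).resolve()
    archive = Path(archive_path).resolve()
    # Normalize everything to POSIX-style paths relative to the project root.
    tracked = {Path(p).as_posix() for p in tracked_paths}
    skipped = {Path(d).as_posix() for d in skip_dirs}

    written = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, dirnames, filenames in os.walk(root):
            rel_dir = Path(dirpath).relative_to(root)
            # Prune skipped directories in place so os.walk does not descend.
            dirnames[:] = [
                d for d in dirnames if (rel_dir / d).as_posix() not in skipped
            ]
            for name in filenames:
                full = Path(dirpath) / name
                rel = full.relative_to(root).as_posix()
                # Leave out tracked data and the archive being written.
                if rel in tracked or full == archive:
                    continue
                zf.write(full, rel)
                written.append(rel)
    return sorted(written)
```

The result keeps code, the .dvc/ directory and DVC-files, but none of the data, which matches the "lightweight copy" described at the top of this issue.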
We've received a very similar question from a user https://opendatascience.slack.com/archives/CGGLZJ119/p1574762045023000?thread_ts=1574761369.020000&cid=CGGLZJ119 (Russian-only, sorry 🙁). Long story short, the guy is creating an archive with code and data to send to the customer, which is very similar to the idea from @jorgeorpinel described above. I've asked him to leave a comment here too.
Hi,
I'm the guy from the previous comment. I think dvc archive is a bad idea. I have a lot of ignored files in my repo and I think it is not DVC's responsibility to clear them. For example, I have .git, .dvc, .idea in my local repo folder. So, if dvc exports all artifacts from my local repo, it will not remove all those ignored files. The second disadvantage of this approach is that I would have to sort out which artifacts I need in the archive and which I don't.
So, I would like to have something like dvc clear, which replaces all links with the original binaries stored in DVC and removes all DVC-files. I mean DVC clears only what it is responsible for. It would be good to have it this way.
@RomanSteinberg Thanks for your comment!
I'm pretty sure git archive will remove .git and gitignored files, leaving only the ones that are actually tracked by it. We were thinking we could work the same as that, plus also remove .dvc/, *.dvc files and replace them with actual data.
Speaking about dvc clear, I think that we should make dvc destroy do those actions (we actually have a ticket for it with that precise proposed functionality 🙂)
Thanks guys! There's some confusion though, my original idea here is for DVC projects that DO NOT use a Git repository as base. There would be no .git/ dir or .gitignore file. dvc export would create a copy without the cache directory or any of the outputs linked in DVC-files, leaving everything else as is including .dvc/ dir and DVC-files. (Also keeping any hidden stuff like .idea/, etc. which is not DVC's responsibility indeed.)
git archive will remove .git and gitignore files, leaving only the ones that are actually tracked by it. We were thinking we could work the same as that, plus also remove .dvc/, *.dvc files and replace them with actual data.
@efiop actually I was not thinking to include the tracked data in the export. So maybe dvc export is not the best name as the Git analogy breaks... In fact I don't mean to remove DVC from the export, just to make a lightweight copy! So perhaps closer to a dvc clone?
Maybe the idea of dvc clear can replace all this though, and work both for Git and non-Git DVC projects. Let's see: So basically it would move all cached objects to the workspace and delete .dvc/ and the cache dir, as well as all DVC-files @RomanSteinberg? Kind of like DVC removing itself from the project, the opposite of dvc init?
Seems a bit risky to me, TBH. You would lose any outputs from other Git versions (not linked in checked out DVC-files) and if there's no other copy of the project, it's gone forever.
So to summarize, we're basically talking about 2 different things:
dvc init): I like this also but it should probably be a separate issue and I suggest it's done in an exported copy by default, not to risk losing possibly the only DVC project copy.
@jorgeorpinel I couldn't imagine that someone would use dvc without git. I don't understand this case at all. How can one version data and not version code? So, I can't give any feedback about your idea.
In fact without Git you could not version the data either. But still we offer dvc init --no-scm to provide the pipeline management functionality (without versioning). I think. I didn't decide this, but the fact is we have the option.
Just to clarify, we could have dvc list which could either list:
Then dvc export/archive would actually bundle either one of those into a folder/zip.
Use case: deploy releases without SCM/DVC: git archive -o code.zip && dvc archive -o data.zip for upload to customer/Zenodo/publication etc.
I'd agree that dvc list would be a first step. Writing an archive script would be easy after that.
For the record: dvc list is already implemented.
iterative/dvc#4108 I see :)
Right so to archive a snapshot:
git archive -o code.zip HEAD
dvc list . -R --dvc-only | zip -@ data.zip
Perfect! Maybe we just need to put a note about this in the dvc list cmd ref. and link from a few more places? If so please move this issue to the docs repo. Thanks
Closed by #2075.