We need a way to share experiments between remote executors and a local environment.
Scenarios:
An "experiment" can be:
dvc exp
What is important:
EDIT: It extends #4821
I've been thinking about this a bit more, and I think custom git refs may work for us here rather than storing patches in DVC run-cache or exp-cache.
We've discussed the potential of using a custom git ref namespace for experiments before, but it was tabled since github actions cannot be triggered when pushing a custom (non-branch or tag) ref.
However, in this case even though we have to push an actual branch to trigger a github actions (or gitlab) CI/CD build, I think we can still leverage custom refs as a cleaner way to store experiment patches.
Internally on the DVC side, we will need to support `git push`'ing and `git fetch`'ing from our experiments workspace (`.dvc/experiments`) to and from a custom refs namespace (like `refs/exp/...` or `refs/dvcexp/...`). This would also facilitate sharing experiments between team members even without taking the CI/CD use case into consideration.
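A minimal sketch of the kind of git plumbing this could map to (the `refs/exp` namespace and the `my-exp` experiment name here are just placeholders, not settled naming):

```
# Push a local experiment ref to the same custom namespace on the git remote
# ("my-exp" is a hypothetical experiment name)
git push origin refs/exp/my-exp:refs/exp/my-exp

# Fetch it back on another machine, without touching refs/heads or refs/tags
git fetch origin refs/exp/my-exp:refs/exp/my-exp
```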
- `dvc exp push` (or maybe `dvc push --exp`) would push all "local" experiment branches to an upstream git remote (plus the run cache for whatever we have already run locally).
- `dvc exp pull` would pull everything from the upstream `refs/exp` into our local experiments workspace (plus the associated run cache).

On the CI/CD side, we are still limited by needing to push actual branches or PRs in order to trigger a build, so either `dvc exp deploy`
The user's CML workflow would look similar to the existing DVC workflow except that we use `exp run` instead of `repro`. This would result in a single new experiment branch generated in `.dvc/experiments` on the CI runner (including checkpoint commits for checkpoint experiments). We can then push that experiment to our custom ref from the CI runner, the same way we would do it from a "local" machine.

In the end, on github/gitlab there will be a branch (named appropriately for the experiment) containing a single commit (required to trigger the build). The result of the
Hypothetical workflow:

```yaml
name: train-test
on: [push]
jobs:
  run:
    ...
    steps:
      - uses: actions/checkout@v2
      - name: cml_run_exp
        ...
        run: |
          ...
          # Pull data & run-cache
          dvc pull data --run-cache
          # Run as experiment instead of via repro
          # Note that here we just want a normal "local" run (since in this case the "local" machine is the user's chosen CI runner)
          dvc exp run
          # Push experiment branch to custom ref (and associated run-cache) so it can be pulled and reviewed locally
          dvc exp push
          # Report metrics & params
          echo "## Experiment" >> report.md
          dvc exp show --show-md >> report.md
          # Publish other reports (plots, etc)
          ...
          cml-send-comment report.md
```
The main benefits of doing it this way would be that we can continue leveraging git and avoid needing to manage patch sets ourselves in some `.dvc/cache/exp/...` structure. And it is safe for us to make "auto-commits" in this case, since the commits are generated in `.dvc/experiments` on the CI runner and will only be pushed to our custom ref namespace (which will not recursively trigger more CI builds).
@pmrowla that's a very smart idea 🧠👍

A few questions to clarify:
`dvc exp push` should just assign a branch to a custom ref.

@dmpetrov
> So, you suggest synchronization through a central Git repo. Right?

Yes, this is correct.
> It is easy to fetch custom refs by names. But how to get the whole list of custom refs without knowing the names? I was unable to find it - see my discussion in stackoverflow.

In this case I've pushed 2 `refs/exp/...` experiments to https://github.com/pmrowla/checkpoints-test (pushing experiment branches from `.dvc/experiments` in your test project):
```
scratch py:dvc ❯ git clone git@github.com:pmrowla/checkpoints-test.git exp_clone
Cloning into 'exp_clone'...
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 29 (delta 7), reused 29 (delta 7), pack-reused 0
Receiving objects: 100% (29/29), 4.35 KiB | 1.45 MiB/s, done.
Resolving deltas: 100% (7/7), done.
scratch py:dvc ❯ cd exp_clone
exp_clone git:master py:dvc ❯ git ls-remote --refs origin "refs/exp/*"
402356d4d7b349095218486dccd7765d26990a0b refs/exp/exp1
a14475fb7aac33025d298246548e6cfe1348a32d refs/exp/exp2
```
Note that this is a fresh clone, and this does not require fetching anything from the upstream `refs/exp`:

```
exp_clone git:master py:dvc ❯ ls .git/refs
heads remotes tags
```
To make fetching everything work you need something along the lines of:

```
exp_clone git:master py:dvc ❯ cat .git/config
...
[remote "origin"]
    url = git@github.com:pmrowla/checkpoints-test.git
    fetch = +refs/heads/*:refs/remotes/origin/*
    fetch = +refs/exp/*:refs/remote-exp/origin/*
...
exp_clone git:master py:dvc ❯ git fetch origin
exp_clone git:master py:dvc ❯ ls .git/refs
heads remote-exp remotes tags
exp_clone git:master py:dvc ❯ ls .git/refs/remote-exp
origin
exp_clone git:master py:dvc ❯ ls .git/refs/remote-exp/origin
exp1 exp2
```
And then we would handle mapping things between the local `refs/exp` and the remote `refs/remote-exp/` namespaces internally in DVC.
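For example, the mapping could be done with a one-off refspec at fetch time, without persisting anything in the user's `.git/config` (same hypothetical namespaces as above):

```
# Map all remote refs/exp/* refs into a local remote-tracking-style namespace
# in a single command; no [remote "origin"] config changes required
git fetch origin "+refs/exp/*:refs/remote-exp/origin/*"

# See what was fetched
git for-each-ref refs/remote-exp/origin
```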
> What prevents us from applying the same for local experiments? 😃 Custom refs can be used instead of branches in temp repos (if no parallel execution)? `dvc exp push` should just assign a branch to a custom ref.

This is definitely something we could consider now. It would require refactoring some things internally, but we could potentially drop the need for the separate experiments clone.
> To make fetching everything work you need something along the lines of
> `exp_clone git:master py:dvc ❯ cat .git/config`

Is it only about local settings that we never push to a server?
> What prevents us from applying the same for local experiments? 😃
>
> This is definitely something we could consider now.

It would be amazing to have a single abstraction for experiments and work with local parallel execution (if it's needed) the same way as remote execution.
In general, this approach looks elegant. We will be working with the same branch-like paradigm (custom refs) under the hood while Git does all the heavy lifting. At the same time, we won't over-pollute the branch namespace, and this complicated concept of custom refs will be hidden behind the `dvc exp` command.

The only concern is the performance of repos with many (thousands of) experiments. Most of the experiments are short-lived - but it still might create a lot of pressure on the Git garbage collector.

I'd try this approach as the most elegant one and potentially with the smallest amount of code. We can keep in mind the other approaches in case some optimization might be needed.
> Is it only about local settings that we never push to a server?

Adding this type of line to `.git/config` will make custom refs be automatically fetched during a regular `git fetch origin`; similar things can be configured for automatically pushing custom refs as well. It's just a modification to what git automatically generates when configuring a git remote (either via clone or `git remote add ...`):

```
fetch = +refs/exp/*:refs/remote-exp/origin/*
```

But in practice, we likely will not need this type of configuration at all; we can handle everything internally during `dvc exp push`/`dvc exp pull` without needing to modify a user's git repo config.
> The only concern is the performance of repos with many (thousands of) experiments. Most of the experiments are short-lived - but it still might create a lot of pressure on the Git garbage collector.

Can we do some simple tests? Like generate 1000 refs and see the performance impact?
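Something like this throwaway loop would probably be enough (purely illustrative; it just points a pile of dummy experiment refs at HEAD):

```
# Create 1000 dummy experiment refs pointing at the current commit
for i in $(seq 1 1000); do
  git update-ref "refs/exp/exp-$i" HEAD
done

# Local listing performance with that many refs
time git for-each-ref refs/exp >/dev/null

# Push them all to the remote and see how listing over the wire behaves
git push origin "refs/exp/*:refs/exp/*"
time git ls-remote --refs origin "refs/exp/*" >/dev/null
```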
> `dvc exp push` (or maybe `dvc push --exp`) would push all "local" experiment branches to an upstream git remote (plus the run cache for whatever we have already run locally)

sounds a bit too aggressive? it might be fine for the CI case, but might create a lot of issues in the local env?

> run cache

do we need run cache in the CI/CD scenario (you mention it a few times)? To some extent this proposed storage with custom refs already serves that purpose? so, run cache can be optional in this case (if needed at all?)
+1 on unification if possible.
I think there should be a way to have access to these refs from the Viewer, btw? At least to show them.
It might also mean that we could use experiments from the Viewer for CML run? Trigger an action that takes a ref id via API?
Performance-wise it would be the same as having a repo with thousands of git tags. Git itself can handle this without any problems. The normal issue with large numbers of tags is that some git UIs don't handle it particularly well, but we shouldn't be affecting those apps anyways since we're using custom refs and not `refs/tags`.
On the DVC side, we can restrict pushing/pulling to specific experiments (or glob patterns) instead of doing everything by default.
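Since these are plain refspecs, the restriction could just be passed through to git (hypothetical naming scheme again):

```
# "Everything" mode: one wildcard refspec
git push origin "refs/exp/*:refs/exp/*"

# Selective mode: explicit per-experiment refspecs
git push origin refs/exp/exp1:refs/exp/exp1 refs/exp/exp2:refs/exp/exp2
```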
> so, run cache can be optional in this case (if needed at all?)

Run cache would still speed things up on CI/CD if there's a pipeline stage that someone else has already run themselves locally, but yes, it's optional.
> I think there should be a way to have access to these refs from the Viewer, btw? At least to show them.
> It might also mean that we could use experiments from the Viewer for CML run? Trigger an action that takes a ref id via API?

If the viewer can access tags/branches in a repo it should also be able to access custom refs. The only thing I'm not sure about is how github/gitlab's oauth permissions work for reading custom refs, but I'm guessing the branches/tags read permission grant should also apply to custom refs.

And yes, triggering CML builds from the viewer should also be possible. GitHub Actions doesn't support automatic triggering of builds upon custom ref push, but manually triggering an action to checkout + build a custom ref should still be possible.
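The runner side of that manual trigger could be as small as fetching the custom ref directly, and the trigger itself could be a `workflow_dispatch`-style run that receives the ref name as an input (the workflow file name and the `exp_ref` input below are made up for illustration):

```
# On the CI runner: fetch and check out a specific experiment ref
git fetch origin refs/exp/exp1
git checkout FETCH_HEAD

# From a local machine or the viewer backend: kick off the workflow manually,
# passing the ref name as an input (requires a workflow_dispatch trigger)
gh workflow run train-test.yml -f exp_ref=refs/exp/exp1
```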
@pmrowla one more question - since it's a bit of a grey area for GH/GL and other servers - are there any docs regarding this? Are we 100% sure that they support and guarantee the safety of those refs?
> since it's a bit of a grey area for GH/GL and other servers

It is not about GH/GL, it is core Git functionality. The only question for GH/GL is whether they can trigger CI on custom ref changes. I've checked GH - it does not trigger, there is no such thing on the GH roadmap, and it would be a potential breaking change in their API. We should check GL and BB.

Ideally, the exp command should abstract users away from the underlying exp sharing technology. So, it should not be risky to align on one of the technologies.
> It is not about GH/GL, it is core Git functionality.

Yep, I know. But GH/GL wraps Git for you. Nothing prevents them from running some GC? I would double-check that they support everything and provide guarantees to store everything.
These responses from GH staff imply that they have git configured to disable even the regular automatic git garbage collection, and generally only gc anything if it's explicitly requested by a user:

https://github.community/t/does-github-ever-purge-commits-or-files-that-were-visible-at-some-time/1944/2
https://github.community/t/submodule-commit-recoverability/2593/2

They try to avoid deleting any orphaned git objects in case a user wants to recover them later. And in our case, our experiments are not considered orphaned by git, since they are explicitly referenced in `refs/`. So it looks like for GH the only way experiments would be purged is if a DVC user explicitly contacted GitHub support to delete things inside our custom refs namespace.

Gitlab does regularly run `git gc` to remove unreferenced objects, but the interval is user-configurable: https://docs.gitlab.com/ee/administration/housekeeping.html#automatic-housekeeping. We should be safe here as well.
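For what it's worth, the "objects referenced from a custom ref survive gc" assumption is easy to sanity-check locally (this says nothing about the hosted services themselves, just about stock git behaviour):

```
# Create a throwaway commit and point only a custom ref at it
commit=$(git commit-tree -p HEAD -m "gc test" "HEAD^{tree}")
git update-ref refs/exp/gc-test "$commit"

# Aggressively gc; the commit survives because refs/exp/gc-test still holds it
git gc --prune=now --aggressive
git cat-file -e "$commit" && echo "commit still present"
```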
Thanks @pmrowla for doing this research!! 🙏
This is an amazing proposal and it would be really cool if everything works as expected!
Btw, I've invited you to Iterative's Bitbucket; it would be great to check it when you have time.
This is what I've found, whatever it means. I hope that custom git refs would not be considered garbage.

> Bitbucket implements its own garbage collection logic without relying on `git gc` anymore (this is achieved by setting the `[gc] auto = 0` on all repositories). When a fork is created, the `pruneexpire=never` is added to the git configuration and this is removed when the last fork is deleted.
@shcheklein it sounds to me like they essentially do the same thing as GitHub. Based on their steps for forcing/triggering the Bitbucket garbage collection, it still looks like they will use the standard git commands (like `gc`/`prune`/etc) for doing the actual garbage collection, so they should still not be touching objects from any custom ref namespaces.