We need a way to share experiments between remote executors and a local environment.
Scenarios:
An "experiment" can be:
dvc exp
What is important:
EDIT: It extends #4821
I've been thinking about this a bit more, and I think custom git refs may work for us here rather than storing patches in DVC run-cache or exp-cache.
We've discussed the potential of using a custom git ref namespace for experiments before, but it was tabled since github actions cannot be triggered when pushing a custom (non-branch or tag) ref.
However, in this case even though we have to push an actual branch to trigger a github actions (or gitlab) CI/CD build, I think we can still leverage custom refs as a cleaner way to store experiment patches.
Internally on the DVC side, we will need to support `git push`'ing and `git fetch`'ing from our experiments workspace (`.dvc/experiments`) to and from a custom refs namespace (like `refs/exp/...` or `refs/dvcexp/...`). This would also facilitate sharing experiments between team members even without taking the CI/CD use case into consideration.
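A minimal sketch of the kind of git plumbing this could map to (the `refs/exp` namespace and the `my-exp` experiment name here are just placeholders, not settled naming):

```
# Push a local experiment ref to the same custom namespace on the git remote
# ("my-exp" is a hypothetical experiment name)
git push origin refs/exp/my-exp:refs/exp/my-exp

# Fetch it back on another machine, without touching refs/heads or refs/tags
git fetch origin refs/exp/my-exp:refs/exp/my-exp
```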
- `dvc exp push` (or maybe `dvc push --exp`) would push all "local" experiment branches to an upstream git remote (plus the run cache for whatever we have already run locally).
- `dvc exp pull` would pull everything from the upstream `refs/exp` into our local experiments workspace (plus the associated run cache).

On the CI/CD side, we are still limited by needing to push actual branches or PRs in order to trigger a build, so either `dvc exp deploy`
The user's CML workflow would look similar to the existing DVC workflow except that we use `exp run` instead of `repro`. This would result in a single new experiment branch generated in `.dvc/experiments` on the CI runner (including checkpoint commits for checkpoint experiments). We can then push that experiment to our custom ref from the CI runner, the same way we would do it from a "local" machine.

In the end, on github/gitlab there will be a branch (named appropriately for the experiment) containing a single commit (required to trigger the build). The result of the
Hypothetical workflow:

```yaml
name: train-test
on: [push]
jobs:
  run:
    ...
    steps:
      - uses: actions/checkout@v2
      - name: cml_run_exp
        ...
        run: |
          ...
          # Pull data & run-cache
          dvc pull data --run-cache
          # Run as experiment instead of via repro
          # Note that here we just want a normal "local" run (since in this case the "local" machine is the user's chosen CI runner)
          dvc exp run
          # Push experiment branch to custom ref (and associated run-cache) so it can be pulled and reviewed locally
          dvc exp push
          # Report metrics & params
          echo "## Experiment" >> report.md
          dvc exp show --show-md >> report.md
          # Publish other reports (plots, etc)
          ...
          cml-send-comment report.md
```
The main benefits of doing it this way would be that we can continue leveraging git and avoid needing to manage patch sets ourselves in some `.dvc/cache/exp/...` structure. And it is safe for us to make "auto-commits" in this case, since the commits are generated in `.dvc/experiments` on the CI runner and will only be pushed to our custom ref namespace (which will not recursively trigger more CI builds).
@pmrowla that's a very smart idea 🧠👍

A few questions to clarify:
`dvc exp push` should just assign a branch to a custom ref.

@dmpetrov
> So, you suggest synchronization through a central Git repo. Right?

Yes, this is correct.
> It is easy to fetch custom refs by names. But how to get the whole list of custom refs without knowing the names? I was unable to find it - see my discussion in stackoverflow.

In this case I've pushed 2 `refs/exp/...` experiments to https://github.com/pmrowla/checkpoints-test (pushing experiment branches from `.dvc/experiments` in your test project):
```
scratch py:dvc ❯ git clone git@github.com:pmrowla/checkpoints-test.git exp_clone
Cloning into 'exp_clone'...
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 29 (delta 7), reused 29 (delta 7), pack-reused 0
Receiving objects: 100% (29/29), 4.35 KiB | 1.45 MiB/s, done.
Resolving deltas: 100% (7/7), done.
scratch py:dvc ❯ cd exp_clone
exp_clone git:master py:dvc ❯ git ls-remote --refs origin "refs/exp/*"
402356d4d7b349095218486dccd7765d26990a0b refs/exp/exp1
a14475fb7aac33025d298246548e6cfe1348a32d refs/exp/exp2
```
Note that this is a fresh clone, and this does not require fetching anything from the upstream `refs/exp`:

```
exp_clone git:master py:dvc ❯ ls .git/refs
heads remotes tags
```
To make fetching everything work you need something along the lines of:

```
exp_clone git:master py:dvc ❯ cat .git/config
...
[remote "origin"]
    url = git@github.com:pmrowla/checkpoints-test.git
    fetch = +refs/heads/*:refs/remotes/origin/*
    fetch = +refs/exp/*:refs/remote-exp/origin/*
...
exp_clone git:master py:dvc ❯ git fetch origin
exp_clone git:master py:dvc ❯ ls .git/refs
heads remote-exp remotes tags
exp_clone git:master py:dvc ❯ ls .git/refs/remote-exp
origin
exp_clone git:master py:dvc ❯ ls .git/refs/remote-exp/origin
exp1 exp2
```
And then we would handle mapping things between the local `refs/exp` and the remote `refs/remote-exp/` namespaces internally in DVC.
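For example, the mapping could be done with a one-off refspec at fetch time, without persisting anything in the user's `.git/config` (same hypothetical namespaces as above):

```
# Map all remote refs/exp/* refs into a local remote-tracking-style namespace
# in a single command; no [remote "origin"] config changes required
git fetch origin "+refs/exp/*:refs/remote-exp/origin/*"

# See what was fetched
git for-each-ref refs/remote-exp/origin
```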
> What prevents us from applying the same for local experiments? 😃 Custom refs can be used instead of branches in temp repos (if no parallel execution)? `dvc exp push` should just assign a branch to a custom ref.

This is definitely something we could consider now. It would require refactoring some things internally, but we could potentially drop the need for the separate experiments clone.
> To make fetching everything work you need something along the lines of
> `exp_clone git:master py:dvc ❯ cat .git/config`

Is it only about local settings that we never push to a server?
> What prevents us from applying the same for local experiments? 😃
>
> This is definitely something we could consider now.

It would be amazing to have a single abstraction for experiments and work with local parallel execution (if it's needed) the same way as remote execution.
In general, this approach looks elegant. We will be working with the same branch-like paradigm (custom refs) under the hood while Git does all the heavy lifting. At the same time, we won't over-pollute the branch namespace, and this complicated concept of custom refs will be hidden behind the `dvc exp` command.

The only concern is the performance of repos with many (thousands of) experiments. Most of the experiments are short-lived - but it still might create a lot of pressure on the Git garbage collector.

I'd try this approach as the most elegant one and potentially with the smallest amount of code. We can keep in mind the other approaches in case some optimization might be needed.
> Is it only about local settings that we never push to a server?

Adding this type of line to `.git/config` will make custom refs be automatically fetched during a regular `git fetch origin`; similar things can be configured for automatically pushing custom refs as well. It's just a modification to what git automatically generates when configuring a git remote (either via clone or `git remote add ...`):

```
fetch = +refs/exp/*:refs/remote-exp/origin/*
```

But in practice, we likely will not need this type of configuration at all; we can handle everything internally during `dvc exp push`/`dvc exp pull` without needing to modify a user's git repo config.
> The only concern is the performance of repos with many (thousands of) experiments. Most of the experiments are short-lived - but it still might create a lot of pressure on the Git garbage collector.

Can we do some simple tests? Like generate 1000 refs and see the performance impact?
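Something like this throwaway loop would probably be enough (purely illustrative; it just points a pile of dummy experiment refs at HEAD):

```
# Create 1000 dummy experiment refs pointing at the current commit
for i in $(seq 1 1000); do
  git update-ref "refs/exp/exp-$i" HEAD
done

# Local listing performance with that many refs
time git for-each-ref refs/exp >/dev/null

# Push them all to the remote and see how listing over the wire behaves
git push origin "refs/exp/*:refs/exp/*"
time git ls-remote --refs origin "refs/exp/*" >/dev/null
```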
> `dvc exp push` (or maybe `dvc push --exp`) would push all "local" experiment branches to an upstream git remote (plus the run cache for whatever we have already run locally)

sounds a bit too aggressive? it might be fine for the CI case, but might create a lot of issues in the local env?

> run cache

do we need run cache in the CI/CD scenario (you mention it a few times)? To some extent this proposed storage with custom refs already serves that purpose? so, run cache can be optional in this case (if needed at all?)
+1 on unification if possible.
I think there should be a way to have access to these refs from the Viewer, btw? At least to show them.
It might also mean that we could use experiments from the Viewer for CML run? Trigger an action that takes a ref id via API?
Performance-wise it would be the same as having a repo with thousands of git tags. Git itself can handle this without any problems. The normal issue with large numbers of tags is that some git UIs don't handle it particularly well, but we shouldn't be affecting those apps anyways since we're using custom refs and not `refs/tags`.
On the DVC side, we can restrict pushing/pulling to specific experiments (or glob patterns) instead of doing everything by default.
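Since these are plain refspecs, the restriction could just be passed through to git (hypothetical naming scheme again):

```
# "Everything" mode: one wildcard refspec
git push origin "refs/exp/*:refs/exp/*"

# Selective mode: explicit per-experiment refspecs
git push origin refs/exp/exp1:refs/exp/exp1 refs/exp/exp2:refs/exp/exp2
```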
> so, run cache can be optional in this case (if needed at all?)

Run cache would still speed things up on CI/CD if there's a pipeline stage that someone else has already run themselves locally, but yes, it's optional.
> I think there should be a way to have access to these refs from the Viewer, btw? At least to show them.
> It might also mean that we could use experiments from the Viewer for CML run? Trigger an action that takes a ref id via API?

If the viewer can access tags/branches in a repo it should also be able to access custom refs. The only thing I'm not sure about is how github/gitlab's oauth permissions work for reading custom refs, but I'm guessing the branches/tags read permission grant should also apply to custom refs.

And yes, triggering CML builds from the viewer should also be possible. GitHub Actions doesn't support automatic triggering of builds upon custom ref push, but manually triggering an action to checkout + build a custom ref should still be possible.
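The runner side of that manual trigger could be as small as fetching the custom ref directly, and the trigger itself could be a `workflow_dispatch`-style run that receives the ref name as an input (the workflow file name and the `exp_ref` input below are made up for illustration):

```
# On the CI runner: fetch and check out a specific experiment ref
git fetch origin refs/exp/exp1
git checkout FETCH_HEAD

# From a local machine or the viewer backend: kick off the workflow manually,
# passing the ref name as an input (requires a workflow_dispatch trigger)
gh workflow run train-test.yml -f exp_ref=refs/exp/exp1
```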
@pmrowla one more question - since it's a bit of a grey area for GH/GL and other servers - are there any docs regarding this? Are we 100% sure that they support and guarantee the safety of those refs?
> since it's a bit of a grey area for GH/GL and other servers

It is not about GH/GL, it is core Git functionality. The only question for GH/GL is whether they can trigger CI on custom ref changes. I've checked GH - it does not trigger, there is no such thing on the GH roadmap, and it would be a potential breaking change in their API. We should check GL and BB.

Ideally, the exp command should abstract users away from the underlying exp sharing technology. So, it should not be risky to align on one of the technologies.
> It is not about GH/GL, it is core Git functionality.

Yep, I know. But GH/GL wraps Git for you. Nothing prevents them from running some GC? I would double-check that they support everything and provide guarantees to store everything.
These responses from GH staff imply that they have git configured to disable even the regular automatic git garbage collection, and generally only gc anything if it's explicitly requested by a user:

https://github.community/t/does-github-ever-purge-commits-or-files-that-were-visible-at-some-time/1944/2
https://github.community/t/submodule-commit-recoverability/2593/2

They try to avoid deleting any orphaned git objects in case a user wants to recover them later. And in our case, our experiments are not considered orphaned by git, since they are explicitly referenced in `refs/`. So it looks like for GH the only way experiments would be purged is if a DVC user explicitly contacted GitHub support to delete things inside our custom refs namespace.

Gitlab does regularly run `git gc` to remove unreferenced objects, but the interval is user-configurable: https://docs.gitlab.com/ee/administration/housekeeping.html#automatic-housekeeping. We should be safe here as well.
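For what it's worth, the "objects referenced from a custom ref survive gc" assumption is easy to sanity-check locally (this says nothing about the hosted services themselves, just about stock git behaviour):

```
# Create a throwaway commit and point only a custom ref at it
commit=$(git commit-tree -p HEAD -m "gc test" "HEAD^{tree}")
git update-ref refs/exp/gc-test "$commit"

# Aggressively gc; the commit survives because refs/exp/gc-test still holds it
git gc --prune=now --aggressive
git cat-file -e "$commit" && echo "commit still present"
```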
Thanks @pmrowla for doing this research!! 🙏
This is an amazing proposal and it would be really cool if everything works as expected!
Btw, I've invited you to Iterative's Bitbucket; it would be great to check it when you have time.
This is what I've found, whatever it means. I hope that custom git refs would not be considered garbage.

> Bitbucket implements its own garbage collection logic without relying on `git gc` anymore (this is achieved by setting the `[gc] auto = 0` on all repositories). When a fork is created, the `pruneexpire=never` is added to the git configuration and this is removed when the last fork is deleted.
@shcheklein it sounds to me like they essentially do the same thing as GitHub. Based on their steps for forcing/triggering the Bitbucket garbage collection, it still looks like they will use the standard git commands (like `gc`/`prune`/etc) for doing the actual garbage collection, so they should still not be touching objects from any custom ref namespaces.