Dear DVC folk,
you mention it yourself in your documentation: fully versioned hyperparameter optimization comes to mind when using DVC.
I did some quick research, and it soon became apparent that this needs a DVC-specific implementation.
All the existing hyperparameter optimizers like python's hyperopt
It seems to me the following is needed for hyperparameter optimization to be a natural addition to DVC:
dvc metrics already work.
dvc repro
The principal idea is simple: instead of running a concrete algorithm with the specific framework, you run a wrapper which
1. checks out a new hyperoptimization branch
1. grabs the hyperparameters from the framework-specific API (e.g. as command-line args) and writes them into the new JSON file format
1. runs ``dvc repro myalgorithm.dvc`` on a previously specified routine ``myalgorithm.dvc``
1. commits everything on the branch
1. somehow finds out the winner of the hyperoptimization, creates a specific branch for it, and commits everything nicely.
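
To make steps 1 to 4 concrete, here is a minimal sketch of what one trial of such a wrapper might look like. It is not an existing DVC feature: the function name `run_trial`, the file name `hyperparams.json`, and the `runner` argument (there only so the commands can be swapped out for a dry run) are assumptions for illustration. Picking the winner (step 5) is left out.

```python
import json
import subprocess
from pathlib import Path


def run_trial(params, branch, dvc_file="myalgorithm.dvc",
              params_file="hyperparams.json", runner=subprocess.run):
    """Run one hyperparameter trial on its own git branch.

    `hyperparams.json` and the branch naming are illustrative assumptions,
    not a format that DVC defines.
    """
    # 1. check out a new branch for this trial
    runner(["git", "checkout", "-b", branch], check=True)
    # 2. write the hyperparameters chosen by the optimizer to a JSON file
    Path(params_file).write_text(json.dumps(params, indent=2, sort_keys=True))
    # 3. reproduce the previously specified stage with these parameters
    runner(["dvc", "repro", dvc_file], check=True)
    # 4. commit everything on the branch
    runner(["git", "add", "--all"], check=True)
    runner(["git", "commit", "-m", f"hyperopt trial {branch}"], check=True)
```

An optimizer such as hyperopt would then call `run_trial` once per suggested parameter set, for example `run_trial({"lr": 0.1}, "hyperopt-trial-1")`, and read the resulting metric back afterwards.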
Wrapping existing optimization frameworks has several advantages:
* less code to maintain, and only against stable APIs
* a monitoring web UI and other tools for evaluating or live-inspecting the hyperoptimization may already be available
* the community could build new wrappers
Of course, more details will pop up while actually implementing this, e.g. how to integrate hyperoptimization with .dvc pipeline files as neatly as possible (for instance, we may want to commit both the single run.dvc and a hyperopt.dvc to the same repository, and these need to interact seamlessly).
What do you think about this suggested approach?
Hi @schlichtanders, thank you for the detailed description. It is an interesting perspective on hyperparameters, with good ideas!
We are discussing experimentation scenarios in DVC, and it looks like DVC needs special support for some cases. A recent discussion example is #2379. I'd love to discuss this from the perspective of the hyperparameter tuning case and hyperparameter optimization packages.
Could you please clarify a few things:
The major question I have - Why do we need two abstractions: branches AND subfolders? Additional questions:
Q1. Can we use only branches? Experiment as a branch is well supported in DVC.
Q2. Can we use only subfolders? I know that this experiment-as-a-folder is not supported in DVC yet.
Q3. Which abstraction would you prefer (if experiment-as-a-folder were supported)?
Q4. Which of the branches OR subfolders would you prefer to keep/commit into Git?
I made some progress and created a small example, but I currently have no time to complete it.
Nevertheless, here is the link:
https://github.com/schlichtanders/dvc_hyperopt_example
The idea is simple: after defining two helper scripts, a hyperparameter search is just a small wrapper script which calls another .dvc file.
The two helpers are:
``bin/git_push_set_upstream++.py``, which pushes a local branch to the remote under a name with an incremental integer suffix. E.g. if your branch is "myhyperoptimizationbranch", it would be pushed as "myhyperoptimizationbranch/1" if it is the first one, or as "myhyperoptimizationbranch/43" if "myhyperoptimizationbranch/1" to "myhyperoptimizationbranch/42" already exist on the remote.
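
As a rough illustration of the suffix logic (a sketch only, not the code from the linked repository), the next free name can be computed from the list of branches already on the remote. The function names `next_remote_branch` and `push_command` and the default remote `origin` are assumptions; listing the remote branches is left to the caller.

```python
import re


def next_remote_branch(prefix, remote_branches):
    """Return the next free numbered branch name under `prefix`."""
    pattern = re.compile(re.escape(prefix) + r"/(\d+)")
    used = []
    for name in remote_branches:
        match = pattern.fullmatch(name)
        if match:
            used.append(int(match.group(1)))
    # start at 1 when no numbered branch exists yet
    return f"{prefix}/{max(used, default=0) + 1}"


def push_command(local_branch, target_branch, remote="origin"):
    """Build the git command that pushes `local_branch` as `target_branch`."""
    return ["git", "push", "--set-upstream", remote,
            f"{local_branch}:{target_branch}"]
```

With "myhyperoptimizationbranch/1" to "myhyperoptimizationbranch/42" present, `next_remote_branch("myhyperoptimizationbranch", existing)` returns "myhyperoptimizationbranch/43"; with no numbered branches it returns "myhyperoptimizationbranch/1".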
``bin/git_merge_hyperoptimization.py`` (which should rather be named dvc_merge_hyperoptimization), which takes a hyperoptimization branch prefix, looks at all sub-branches, and merges the best one according to a given metric, using dvc metrics. E.g. when pointed at "myhyperoptimizationbranch", it uses dvc metrics to get the metric information of "myhyperoptimizationbranch/1" to "myhyperoptimizationbranch/43" and merges the best one.
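
The selection step could look roughly like this. Again, this is a sketch under assumptions rather than the script from the repository: it expects the metric values to have been collected already (e.g. from dvc metrics) into a plain dict, and the names `best_branch` and `merge_command` are made up for illustration.

```python
def best_branch(metric_by_branch, higher_is_better=True):
    """Pick the branch whose metric value is best."""
    if not metric_by_branch:
        raise ValueError("no trial branches to compare")
    choose = max if higher_is_better else min
    return choose(metric_by_branch, key=metric_by_branch.get)


def merge_command(branch):
    """Build the git command that merges the winning branch."""
    return ["git", "merge", branch]
```

For an accuracy-like metric, `best_branch({"b/1": 0.80, "b/2": 0.93, "b/3": 0.90})` returns "b/2"; for a loss-like metric, pass `higher_is_better=False`.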
I hope to find time in November/December to finish this and to answer all your questions.
I had two thoughts about a potential API for hyperparameters, concerning how to choose whether or not to store the resulting models ("treat it as cache" and "treat it as optimal decision"). I posted them in another thread: https://github.com/iterative/dvc/issues/2379#issuecomment-543082399
If the API allowed such flexibility, the exact decision could easily be delegated to other libraries. Unfortunately, I don't have anything more concrete than this wish/feature request yet.
Related: https://discuss.dvc.org/t/best-practice-for-hyperparameters-sweep/244