dvc: deprecate persistent outputs

Created on 30 Jul 2019 · 7 Comments · Source: iterative/dvc

I don't see a clear benefit in having them; this concern was already expressed here https://github.com/iterative/dvc/issues/1884#issuecomment-483009901 and here https://github.com/iterative/dvc/issues/1214#issuecomment-485580891.

Supporting them is not a hassle, but we can also go with the argument that "less code is better code": if a feature doesn't offer a clear benefit, it only makes the interface more complex.

Current _types_ of outputs:

  • cached
  • not-cached
  • persisted
  • persisted-not-cached
  • external
  • external not-cached
  • external persisted / external persisted-not-cached? (I've never seen this one in the wild :sweat_smile:)
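
For illustration, these variants map roughly onto how a stage's outputs are declared. A minimal sketch using `dvc run` flags (flag names reflect DVC around the time of this thread; the script and paths are placeholders, so check `dvc run --help` for your version):

```bash
# Rough mapping of output types to `dvc run` flags; `train.py` and the
# output paths are hypothetical placeholders.
dvc run -o model.pkl python train.py                       # cached
dvc run -O report.html python train.py                     # not-cached
dvc run --outs-persist checkpoints/ python train.py        # persisted
dvc run --outs-persist-no-cache scratch/ python train.py   # persisted, not cached
dvc run -o s3://bucket/model.pkl python train.py           # external
dvc run -O s3://bucket/report.html python train.py         # external, not cached
```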
Labels: question, research


@guysmoilov, did you guys end up using these outputs?

Yes: https://dagshub.com/Guy/fairseq/src/dvc/dvc-example/train.dvc

I don't see any other sane way to allow for checkpoints and resuming training.
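
For illustration, a minimal sketch of what such a stage could look like (the script name, CLI flags, and paths are assumptions, not taken from the linked repo). The idea is that `--outs-persist` keeps the checkpoint directory in place across `dvc repro` runs instead of deleting it, so the training script can resume from the last saved checkpoint:

```bash
# Hypothetical training stage with a persistent checkpoint directory.
# DVC normally removes outputs before (re)running a stage; --outs-persist
# keeps checkpoints/ around so train.py can pick up where it left off.
dvc run -d train.py -d data/ \
        --outs-persist checkpoints/ \
        -o model_final.pt \
        python train.py --checkpoint-dir checkpoints/ --resume
```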

@guysmoilov, what about not tracking the checkpoints with dvc? Why do you need to _resume_ training?
I'm more used to the _queue-worker_ architecture, and I'm not familiar with "training infrastructures".

Still, I'm not sure if it's a _hack_ or something we should pay more attention to and attach to a specific use case with proper documentation.

@mroutis It's a very common pattern in deep learning.

  1. Pick a model & hyperparameters.
  2. Train for a few epochs to get intermediate results.
  3. Try a few different configurations, and get intermediate results for them.
  4. Pick the most promising configuration and give it more budget for training, resuming from the checkpoint where you paused.
  5. Rinse & repeat until you can claim SOTA on arXiv.

Hyperparameter optimization algorithms such as Hyperband even formalize and automate this process.
Existing frameworks such as TF have pretty well-established norms for working with checkpoints: usually, you pick a checkpoint directory and save the model to a checkpoint file every N epochs (with a filename like checkpoints/checkpoint_N).
Special names will be given to the following checkpoints:

  • checkpoint_best (according to a defined metric, probably validation loss)
  • checkpoint_latest (so it's easier for the resuming code to know which checkpoint to resume from)

And sometimes you'll do fancier things, like only keeping the latest 5 checkpoints, or keeping the best 3 checkpoints at any given time, so you can later average out their weights or turn them into an ensemble.
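
As a rough, framework-agnostic illustration of that convention (not specific to TF; `EPOCH` is assumed to be set by the training loop, and GNU tools are assumed for `head -n -5` / `xargs -r`):

```bash
# Sketch of checkpoint housekeeping: refresh the "latest" pointer and
# keep only the newest 5 numbered checkpoints.
cd checkpoints
ln -sf "checkpoint_${EPOCH}" checkpoint_latest
ls -1 checkpoint_[0-9]* | sort -t_ -k2 -n | head -n -5 | xargs -r rm -f
```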

Obviously, you need something like DVC to be able to resume from checkpoints, switch contexts to a different configuration, and to keep track of the runs.

Hope this explains things better.

@guysmoilov, thanks a lot for your comments! First time hearing about _early stopping_, and I didn't know that you could keep training a model that was already trained (it makes a lot of sense, tho)!

I'll close this issue, then :)

@shcheklein, what do you think about adding this info to the docs?

@mroutis So, I'm still hesitant about merging the docs for this. I totally understand the problem, but I'm not sure adding a flag and saving outputs this way is the right way to solve it. We need to brainstorm a little bit, and it can be part of a broader discussion around experiments management in general.
