Dvc: store whole DAG in one DVC-file

Created on 10 Apr 2019 · 56 comments · Source: iterative/dvc

I understand the merits of having multiple .dvc files for complex processes,
but it would be just great to have the option to store the whole DAG in one Dvcfile!

I feel it might help the overall readability of the structure

feature request p2-medium product research

All 56 comments

I might be overestimating this, but it looks to me that we would need to do a huge refactor to support this, or at least stop treating Dvcfile as a stage and create a separate class to deal with the logic of parsing such a file to come up with a pipeline.

I can see the analogy with make on this one, but from my perspective, dvc is a CLI tool to write, read and execute (stage) files, whereas make is reduced to reading and controlling the execution of (make) files.

I do appreciate the _aesthetics_ of having everything on a single file, tho.

I see.
as a last resort, perhaps it is at least possible to add paths to the previous steps as comments in the file?

@Casyfill I think we will release a new version very soon (next week? @efiop can confirm) that will preserve comments in the DVC files. I'm not sure we should be generating these comments on the DVC side though - it's hard to keep them up to date. What editor are you using? Maybe it's a good feature to support on the editor level - navigate up/down the pipeline cc @prihoda

Just for the context - we got a request on the ods.ai #tool_dvc channel to support this:

(translation is mine)

Hi! Is there a way (feature/third-party tool/script/...) to describe the whole dvc-pipeline as a single file (as it is happening in build tools - make, bazel, maven, etc) instead of having a file per stage

I can see the analogy with make on this one, but from my perspective, dvc is a CLI tool to write, read and execute (stage) files, whereas make is reduced to reading and controlling the execution of (make) files.

The single-file approach could be built upon the CLI approach as an extra optional layer. I would suggest something like tup as another syntax option to make, which I happen to like. It inverts the dependency graph into a flow graph, and the rules look like unix pipes. It can also be faster due to the way it checks for updates.

I do appreciate the aesthetics of having everything on a single file, tho.

It is more than just about aesthetics (although great aesthetics in a build tool would be nice). It's a complex problem, and waterbed theory is real.
If you only want dvc to be analogous to gcc, I suppose that is fine, but as projects get more complex you will want some sort of build tool like make, scons, waf, redo and so on. Eventually people will build the same thing in some ad hoc, bad way, so I see it as an inevitability that some sort of build automation tool will come about. I will probably be integrating them into my invoke tasks so that I can take care of pipeline things that aren't directly related to ML pipelines and wouldn't require dvc run. You could consider integrating or pairing with scons or waf, which are both Python projects.

Since dvc is a Python project, there is always the option of defining pipelines in Python, even though the syntax suffers a little bit. The way that prefect does its pipelines, using a context manager, isn't so bad.

Not really a solution, since not everyone uses emacs and its awesome UI features can't be emulated on all platforms, but magit is really the only way I interact with git and it makes that very easy. A graphical interface could also solve some of these issues without sacrificing the versionability of generated files and without introducing new files.

My two cents :smile: and also this essay is awesome reading for exactly this kind of stuff https://ngnghm.github.io/blog/2016/04/26/chapter-9-build-systems/

Ok, so this issue has triggered a recurring discussion about separating the pipeline and data subsystems of dvc. The main reason is that dvc-files are modified by dvc on every dvc repro/run, so storing multiple stages in one dvc-file will absolutely turn into a merge nightmare.

If we forget about pipelines, then DVC-files could become one-liner placeholder files that would be easy to merge and resolve conflicts. They would be used by commands like pull/push/checkout/etc to handle the data itself:

NOTE: I'm simplifying the formatting of the json/yaml files below, please don't mind it.

$ dvc add path/to/something
$ cat path/to/something.dvc
90104d9e83cfb825cf45507e90aadd27

So they are no longer human-readable/editable. They are still acting as a placeholder, giving visibility to the user that some file is located there, even through github web UI and without running dvc checkout.

Now to the pipeline part. As everyone has experienced, dvc-files are pretty weird in the regard that they are modified both by you and by dvc (writing/changing hashes), which causes non-trivial merge conflicts and is causing us a lot of hassle trying to preserve comments and things like that. Plus, dvc run is notoriously cumbersome when it comes to defining a long list of deps and outs on your command line. So to solve all of that, we could create new pipeline-files that would be hand-written and would never be touched by dvc itself. Those pipeline-files would be basically like current dvc-files but without the hashes, and could also contain more than one pipeline stage in them (this is what this ticket is about). To store hashes for the dependencies and outputs that those stages were run with, dvc will auto-generate a special hash file, which would basically be a one-line (to avoid git conflicts) yaml or json with the list of deps and outs and their hashes. So it would look something like:

$ cat Pipeline # Feels like `Dvcfile` would be a nice fit here, but not using it for compat reasons
stages:
  - cmd: mycmd1 dep out1 out2
    deps:
      - dep
    outs: # don't worry about metrics and stuff, it could be expanded from this no probs
      - out1
      - out2
$ dvc run Pipeline
$ cat out1.dvc
94614d6650e062655f9f77507dc9c1f2
$ cat out2.dvc
06cda08aa427e73492389a0f17c72d91
$ cat Pipeline.hash # or some better name
{"md5": "45678854", "deps": {"dep": "655f9f7750"}, "outs": [{"out1": "94614d6650e062655f9f77507dc9c1f2"}, {"out2": "06cda08aa427e73492389a0f17c72d91"}]} # note that `md5` is a hash of the Pipeline file itself

This way, when merging .dvc files, you choose your data version, but you don't have to touch your pipeline.hash file, so it will know by itself if it needs to run or not. So we get a complete decoupling of the data and pipeline parts, which is pretty sweet. External outs/deps stuff could be handled on top of that, I just don't want to get into those details.

Obviously, if we go that route, we will have two approaches there: try to keep backward compatibility or release 1.0 and basically start over. I have two concerns for the former one: naming new files and keeping old code alive. If we just move on to 1.0, we'll be able to drop all stage-writing code, all dvc run logic as we know it today and overall it seems like it will be much more elegant. But breaking compatibility is a very big pill to swallow...

Would love to hear what you guys think. CC @iterative/engineering

In my opinion this is a step in the right direction, I would go for adapting this change and breaking compatibility, but it definitely brings up a lot of questions.

  • So this means that all of the stages would be defined using one or multiple Pipeline files in your repo?
  • How do you handle cases when a new version of the Pipeline is committed but not the Pipeline.hash (or vice versa), which will make it incompatible for the other users?
  • Do you actually need to keep the .dvc files, since the hash is stored in the Pipeline.hash file? Are they now only a convenience for checking out out1 using an already existing dvc checkout out1.dvc?
  • What happens if the hash in the .dvc file is different from Pipeline.hash?
  • How do you handle merge conflicts in the Pipeline.hash file? This could occur when the user first designs the Pipeline, commits it, and then different users run a different part of it.

One thing that might make it simpler is to ditch the Pipeline.hash file and keep storing all the hashes in the .dvc file.

On a slightly unrelated note, one thing that I would welcome is to have an option to explicitly configure locations where the Pipeline files can be discovered, so that the whole repo does not have to be listed with every DVC action. This has been a pain point for me personally when working on a shared filesystem on our HPC with thousands of files, since listing can take up to several minutes. I actually opted to use Makefiles in those cases for that reason.

So this means that all of the stages would be defined using one or multiple Pipeline files in your repo?

I think it would make sense to allow multiple pipeline files, so that you could have one in your subdirs and do stuff like that. Having only one pipeline file seems too strict and limiting.

How do you handle cases when a new version of the Pipeline is committed but not the Pipeline.hash (or vice versa), which will make it incompatible for the other users?

There is also an idea of not having pipeline.hash, but rather storing it in .dvc/build-cache (or similar), in the form of a mapping from the hash of cmd + deps to the hashes of the outputs. Kudos @Suor. That way we will get a build cache by design and could possibly push it to the dvc remote alongside the data. If some file is missing, then other users will simply have to re-run their pipeline, though the results will already be there.

Do you actually need to keep the .dvc files, since the hash is stored in the Pipeline.hash file? Are they now only a convenience for checking out out1 using an already existing dvc checkout out1.dvc?

Yes, that will make the build-cache idea possible (described above). So .dvc files will only deal with caching, and hash files (or something in .dvc/build-cache) will deal with pipelines. It is best to think about those as completely separate entities. So hash files for pipelines are not used by checkout at all.

What happens if the hash in the .dvc file is different from Pipeline.hash?

If the corresponding data file has the same hash as described in the .dvc file, but that hash differs from the one in pipeline.hash, it means that your pipeline will need to be re-run. So think about those only as metadata about the last execution, which is then checked against the real world to determine whether we need to re-run or not.

How do you handle merge conflicts in the Pipeline.hash file? This could occur when the user first designs the Pipeline, commits it, and then different users run a different part of it.

In the build-cache idea described above, .dvc/build-cache is based on the hash of cmd + dep hashes, so there won't be conflicts. But also since we are considering pushing those to dvc remote, there won't be merges for those at all, since they won't be tracked by git.

One thing that might make it simpler is to ditch the Pipeline.hash file and keep storing all the hashes in the .dvc file.

Yes, theoretically we could combine both approaches under one .dvc file umbrella. E.g. if a particular .dvc file only has a hash inside, it means that it is "data" metadata that would be used on checkout and friends. And if it is a list of stages (with commands/deps/outs), then it will be treated as a pipeline file, so hashes won't be written into it, but rather stored in .dvc/build-cache. I really like your idea, I think this will work even if we decide to keep backward compatibility :+1:
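
For illustration, the two flavors under that combined umbrella might look roughly like this (the file names and the stage are made up, and the exact layout is not decided):

$ cat data.csv.dvc        # "data" metadata: only a hash, used by checkout/push/pull and friends
90104d9e83cfb825cf45507e90aadd27

$ cat prepare.dvc         # "pipeline" metadata: stages only, hashes go to .dvc/build-cache
stages:
  - cmd: ./prepare.py data.csv out.csv
    deps:
      - data.csv
    outs:
      - out.csv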

On a slightly unrelated note, one thing that I would welcome is to have an option to explicitly configure locations where the Pipeline files can be discovered, so that the whole repo does not have to be listed with every DVC action. This has been a pain point for me personally when working on a shared filesystem on our HPC with thousands of files, since listing can take up to several minutes. I actually opted to use Makefiles in those cases for that reason.

Have you considered adding those giant directories to .dvcignore? I don't really like the idea of such configuration, as it makes things less intuitive and less flexible than just collecting files. Plus we will still have to collect .dvcignores and such, and will also have to collect old dvcfiles for backward compatibility.
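
For reference, .dvcignore takes gitignore-style patterns, so excluding a couple of large directories from collection could look roughly like this (the paths are made up for illustration):

$ cat .dvcignore
data/raw/
experiments/scratch/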

There is also an idea of not having pipeline.hash, but rather storing it in .dvc/build-cache (or similar), in the form of a mapping from the hash of cmd + deps to the hashes of the outputs.

Can you elaborate more on the build-cache idea? I think I am missing something.

If we forget about pipelines, then DVC-files could become one-liner placeholder files that would be easy to merge and resolve conflicts
90104d9e83cfb825cf45507e90aadd27

Are we sure? I suspect that changes in the hash are what cause merge conflicts anyway, and the hash would still be the one thing left in single-line DVC-files. I recommend trying this with some mock files in Git first.

As everyone has experienced, dvc-files are pretty weird in the regard that they are modified by you and by the dvc

I don't personally ever change DVC-files manually and it's not really something we advocate for, in docs at least. I wonder how many users really need to do this. Maybe I'm totally off, but I think that human-readable DVC-files are great and help people understand what's happening (it helps with examples in docs), but it doesn't necessarily mean people should edit them other than in edge cases.


That said:

we could create new pipeline-files that would be hand-written and won't be ever touched by dvc itself... and could also contain more than one pipeline stage in them

It's not an unattractive idea (and I do think Dvcfile would be a nice default name, kind of like Dockerfile). In fact it sounds like a cool feature that would allow more automations/integrations in the future.
But would it be possible to make this an alternative feature, keeping the existing possibility to define individual stages via dvc add/run/import?

Pipeline.hash # or some better name

Dvcfile.lock? But as mentioned above this approach reintroduces the merge conflict situation...

try to keep backward compatibility or release 1.0 and basically start over

Seems like a 2.0 for sure, but I'm not seeing why we couldn't keep back compat with multi-line DVC-files as mentioned in my previous comment (except for Dvcfile name).

an idea of not having pipeline.hash, but rather storing it in .dvc/build-cache... based on the hash of cmd + dep hashes, so there won't be conflicts. But also since we are considering pushing those to dvc remote, there won't be merges for those at all, since they won't be tracked by git.

I'm also not getting how the build cache would work exactly, what it contains, etc. Maybe a more visual example could help describe it?

Thanks!

It is a really great design proposal from @efiop. Very solid ideas. I especially like the idea of rolling out the pipeline hashes under build-cache. Thanks to @Suor. (Except that I don't like the name, run-cache might be better than build-cache).

It is great that @efiop is deeply immersed in the problem. (To make it even more challenging 😄) I encourage you to think about related DVC areas and how they can affect the proposed design:

  1. So, build-cache (#1234) is already involved. Good - very useful scenario.
  2. If we separate out pipelines from data sources it might affect other scenarios:

    1. Some dvc gc strategies might depend on pipelines - like remove everything but source datasets. How can we make the separation? Should we move these GC strategies under pipeline commands/module/package?

    2. Should we mix data source cache and pipeline cache? How about extracting it to a separate dir .dvc/cache/run/ or even to a separate remote?

    3. We can go even further if we separate dirs in the cache... It can be related to some scenarios with granular remote management like #2095. A possible solution - pull different remotes into different dirs or give users the ability to specify a cache subdir for each import.

  3. How about metrics? Should we separate them from data sources as well, and how? Btw, we will be introducing a new type of metric for plots and visualization commands in DVC. How might this affect the design?
  4. Are you thinking about extracting a separate module at the level of DVC code, or a library/package? Benefits?

Can you elaborate more on the build-cache idea
not getting how the build cache would work exactly, what it contains, etc.

@prihoda I found this description/example about it: https://github.com/iterative/dvc/issues/1234#issuecomment-431115289

@prihoda @dmpetrov Sorry for the delay, guys, I was thinking about it some more and have some adjustments to it. Will post a summary here a bit later.

Sorry for the delay. So the current proposed plan is to not change dvc-file format, but extend it to allow for multi-stage files that look something like this:

1) dvc added and dvc import-ed dvc-files stay the same as they are right now.

2) Currently our dvc run dvc-files look something like this:

- cmd: mycmd1
  deps:
  - path: dep1
    md5: ...
  outs:
  - path: out1
    md5: ...

and I propose also allowing this format:

stages:
  - cmd: mycmd1
    deps:
      - path: dep1
    outs:
      - path: out1
  - cmd: mycmd2
  ...

so basically our previous dvc run stages, but now without hashes inside, so that they are unchanged when you dvc repro something, and you could conveniently keep your whole DAG (or a big chunk of it) in one file. Now, where do we store the hashes for those multi-stage files?

  • hashes are saved in <dvc-file-name>.lock file lying beside it. If something goes wrong during the merge, you could simply delete that file and dvc repro will be forced to reproduce it.
  • we should also introduce the build cache right away, that would look something like:
    .dvc/cache/build/<hash of cmd>/<hash of deps>, and would contain hashes of the outputs inside. This way, if we see that this cmd was already run with these deps, we could simply try to checkout the outputs by their hashes instead of reproducing them (a hypothetical entry is sketched right after this list). It is not yet clear how this build cache should be transferred though, so I would start with just keeping it locally and not tracking it with dvc or git. Maybe in the future we'll push it to our remotes, but for now it is an optional feature that is just really convenient with the new design.
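
A hypothetical build-cache entry under this scheme (paths shortened; the exact layout and field names are not decided) might look like:

$ cat .dvc/cache/build/3fa8c1.../9b7de0...   # <hash of cmd>/<hash of deps>
outs:
  out1: 94614d6650e062655f9f77507dc9c1f2
  out2: 06cda08aa427e73492389a0f17c72d91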

It seems to me that we could do this while keeping backward compatibility, but if it takes too much effort, we might as well consider breaking it and releasing 1.0.

Also, during the discussion with @dmpetrov , it was clear that our metrics concept is separate from data and pipeline stages (remember those cumbersome -m/-M options and dvc metrics commands?), so we might also consider extracting them into their own section in that multi-stage dvc file, like so:

$ cat Dvcfile
metrics:
    - path: path/to/metrics1
      type: json
      xpath: a.b.c
    - path/to/metrics2
stages:
    ...

which might also allow us to get rid of -m/-M flags for dvc run. It will also allow us to make metrics independent of all the possible changes in your pipeline, so you don't have to remember to use -m/-M when you decide to rework a few stages.

Just want to give you guys a _big_ applause! Breaking backwards compatibility hurts, but separating the pipelines from data sources is exactly what DVC needs.

Minor thing: Might the .lock conflict with other files like Pipfile.lock?

Minor thing: Might the .lock conflict with other files like Pipfile.lock?

@elgehelge Nope, because it will be either Dvcfile.lock or *.dvc.lock.

Okay, I toyed with a number of ideas around this (to check if we could do more than what @efiop proposed):

  1. Having no lockfile at all (i.e. a stage file can have multiple outputs).
  2. Having both kinds of pipeline as well as old-style stages.
  3. Also supporting multiple data source stages.

But, for the first iteration, I'm thinking of not doing that and just sticking to the core of the issue (being able to have multiple pipeline stages in a single Dvcfile + separating other stages from pipeline stages). So, I plan on moving ahead with:

  1. Pipeline stages will output one Dvcfile, Dvcfile.lock and separate output files.
  2. The single-stage dvcfiles will be used for checkouts. So, old-style stages can be used for both checkouts and pipelines whereas new multi-stage dvcfile will only be used during pipelines.

One part of the original issue that I'm not planning to address in the first iteration is the flexible schema. The idea was to allow the user to write the Dvcfile themselves, so a flexible schema would be perfect. But for now it'd be too much, and I will leave it to dvc run to address this.

Also, currently the name of a stage is the dvc file itself. So, during dvc repro, we can run a given stage via its stage file name, but with multi-stage files we cannot really do this, so we need some kind of name for each stage.

So, I propose the following kind of schema (i.e. having a name for each stage):

stages:
  name_of_the_stage:
    outs:
    deps:
      - 1.txt
    cmd:

I understand that not every stage needs to be named, but I'd like to force users to set one; it makes things easier for them to work with, and for us as well.

So, the user should be able to reproduce new-style stages via the given name (implementation not clear yet).
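
A possible invocation, assuming the file:stage_name addressing that comes up later in this thread (the syntax is not final):

$ dvc repro Dvcfile:name_of_the_stage    # address a stage inside a multi-stage file
$ dvc repro :name_of_the_stage           # shorthand when the default pipeline file is used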

I was a bit worried about making a dent in the optimizations, but for pipeline stages
we already load all the stages beforehand anyway, so the performance impact should be negligible.

dvc repro and dvc pipeline should support showing both old-style and new-style stages. dvc run will only be creating new-style pipeline stages (what if one already exists and the user wants to override it? It should dump it as an old-style stage file, but it's unclear how that is going to be achieved).

From the implementation point of view, our current Stage's responsibility is interleaved with handling the Dvcfile as well. This was okay before, but since we are talking about having multiple stages in a single Dvcfile, we need to split the responsibility between the Stage and Dvcfile components.

So, anything related to dumping/loading and path-related operations should be delegated to Dvcfile, and Stage will basically be an in-memory representation of, well, a stage. And Dvcfile will be responsible for loading stages and dumping them as appropriate (old-style stages vs new-style stages going to separate checkout stage files and the lockfile).

All problems in computer science can be solved by another level of indirection. :)

We also need to split stages into two kinds: pipeline stages (old pipeline stage files + new pipeline stage files) and checkout stages (old-style stage files only).

run/pipeline/repro will use pipeline stages. Other commands should not be affected at all.

How about dvc move/remove?

Need to individually verify these though :)

@skshetry thanks for the great summary! some quick questions:

Pipeline stages will output one Dvcfile, Dvcfile.lock and separate output files.

Just to clarify. It means that a command like dvc run -o something will create something.dvc, right? And that file will include only the hash (more or less) related to that artifact?

so, a flexible schema would be perfect

could you point to a discussion or elaborate on what flexible means?

Need to individually verify these though :)

Good time to make them about data-management only.


  • check lock, unlock commands
  • how easy would it be to write a script that translates an old-style project to a single Dvcfile?
  • what happens if users already have Dvcfile in their project that is old-style?

Just to clarify. It means that a command like dvc run -o something will create something.dvc, right? And that file will include only the hash (more or less) related to that artifact?

Yup, but the file is the same as if the user did dvc add something, so they'll have more flexibility around checkouts. This will in itself separate data management from the pipeline, in that run/repro/pipeline will only utilize Dvcfile and Dvcfile.lock.

could you point to a discussion or elaborate on what flexible means?

It was just my idea that, if we want users to be able to write pipelines themselves, it'd be great to allow some flexibility around how they could write those.
Eg:

stages:
   1_generator:
      outs: [1.txt, 2.txt]
...

vs.

stages:
    1_generator:
      outs:
        - 1.txt

Here, outs was just an example: we could define that outs needs to be a sequence and not really care how they write that sequence, maybe as a set, list, etc. And if Dvcfile were read-only
(ignoring dvc run), we could just read it. But we need dvc run, so this flexibility would be a curse when dumping (for outs it isn't, as it's related to a single stage, but what if the user wants to write stages not as a dictionary but as a list?). So, I thought it better to just leave that one out for now.

how easy would it be to write a script that translates an old-style project to a single Dvcfile?

Shouldn't be too hard, because we need to load old-style stages anyway and even dump old-style stages (old-style stages are dvc add/import/import-url files).

To this point, I have kept all the old dump code alive, thinking that we might need it in some cases. E.g.: what to do when the user runs dvc run --overwrite-dvcfile? Should we dump it as old-style or transform it to new-style? So, I'd like to be explicit about the conversion: either we throw an error or we just dump it as old-style. Maybe there are more similar edge cases that I haven't really considered. So, moving between old-style and new-style in both directions is possible.

what happens if users already have Dvcfile in their project that is old-style?

It does not really have to be Dvcfile. If it already exists, the user can create another file.
The idea was to use Dvcfile if the user does not specify one, using the following kind of schema:

<p1, anything that was already here>
stages:
   ... <p2, check above schema>

But I can see it could be a problem, since we would be writing the p1 part to the Dvcfile itself and p2 to the lockfile and the output stages. We could transform those as well, but I think it's better to not allow using that file at all.

@skshetry great summary and the plan. All the proposals look very reasonable including the mandatory stage names. They're just a few questions about the details.

  1. Pipeline stages will output one Dvcfile, Dvcfile.lock and separate output files: is this how data management is separated from the pipeline part? Do we have any other areas where the separation is happening?
  2. Do you expect an exact match in Dvcfile and lock file names (prefixes)?
  3. How should DVC resolve conflicts (missing lock file, a particular file is not in the lock file, you see a file in the lock but not in the corresponding outputs in the Dvcfile)? This pipeline/data-file discrepancy is not a big problem, I'm just trying to understand the rules and potential issues.
  4. The conflicts part becomes even more interesting when you start thinking about the build/run-cache, which introduces another level of indirection. It might not be relevant for today's discussion.
  5. Should we suggest users use a single Dvcfile and Dvcfile.lock for all the pipelines and optimize all the commands for this? My opinion - yes.

A final thought. This is quite a big change for DVC and a change in the file semantics. It is a good opportunity to think about terminology. Should we get rid of the dvc-file term (I personally don't like it) and use something like lock file and pipeline file/stage file instead?

Just to clarify. It means that a command like dvc run -o something will create something.dvc, right? And that file will include only the hash (more or less) related to that artifact?

Yup, but the file is the same as if the user did dvc add something, so they'll have more flexibility around checkouts. This will in itself separate data management from the pipeline, in that run/repro/pipeline will only utilize Dvcfile and Dvcfile.lock.

I got a bit lost in this example :) Why is it something.dvc, not something.lock?
@shcheklein could you please clarify what dvc run -o something means.

  1. Do you expect the full command looks like dvc run -o something -d file --stage-name train python classifier.py?
  2. Is the assumption that the command will be stored in the default Dvcfile? Then we have to provide the stage name like --stage-name "train".
  3. With (1) and (2), I'd expect something.lock, not something.dvc?

I'd appreciate the clarification.

@dmpetrov this is how I understand this:

Should we suggest users use a single Dvcfile and Dvcfile.lock for all the pipelines and optimize all the commands for this? My opinion

Agreed. Maybe we can allow files in subdirectories.

Do you expect the full command looks like dvc run -o something -d file --stage-name train python classifier.py?

yep. The way I understand this.

Is the assumption that the command will be stored in the default Dvcfile? Then we have to provide the stage name like --stage-name "train".

Yes. We can probably reuse -f (the file name option that we already have)? In general, there is no spec yet in this discussion for how dvc run would change.

With (1) and (2), I'd expect something.lock, not something.dvc

I see where this suggestion comes from. I think the whole idea is to keep .dvc files as simple pointers to DVC-tracked data artifacts, and eventually they won't be related to pipelines at all. Pipelines can use the data they reference or not, but they don't care about its existence at all. Another way to see this is that pipelines just deal with a virtual filesystem (similar to what dvc list returns). ... _contd_

Should we get rid of dvc-file term (I personally don't like it) and use something like lock file and pipeline file/stage file instead?

... _contd_ so, yes. DVC-file might become confusing if we keep pipelines in Dvcfile. I would think about the naming. Since we split Pipeline, I think it's fine to have a "lock" for them but keep "pointer" files (currently DVC-files) for data. They are different after all.

I also don't like lock since we have dvc lock and dvc unlock that are pipeline specific. It's better to rename them.


@skshetry thanks for the clarification!!

  • could you give an example/details of what is going to be stored in the "lock" file and in the .dvc pointer/output files (I hate the term output!)?
  • since it's easy to write a script that transforms one old-style project into a new one, would it be easier for us to do this and simplify code?

@skshetry Some questions from me:

  1. Will we have any sort of build cache? In the first iteration or at all? Ruslan's plan was to make a local build cache for the first iteration, and probably sync it as cache later.
  2. What is the exact mechanism how lock files work?
  3. Do we need distinct lock and pointer files? Or could they be the same? Because if they could, then we can either use .dvc files for both or make .dvc files obsolete and only use some new lock/pointer files. This would create a clear distinction between new-style and old-style dvc repos.

@shcheklein on conversion script. I think the big issue with it is that even if you convert your current repo state and commit it, you will still have history in old style, which will affect at least some commands.

@Suor that "script" can become a compat layer that we will use to access history. The point is to isolate old code as much as possible so that it does not stop us from anything we want to redesign. To clarify, if there is no significant additional code complexity and we don't take suboptimal decisions because of this backward compatibility thing, I'm totally fine with keeping it.

Will we have any sort of build cache? In the first iteration or at all? Ruslan's plan was to make a local build cache for the first iteration, and probably sync it as cache later.

It is not necessary for the multistage dvc-file, but it is a desirable feature (e.g. for CI/CD). We have a separate ticket for it and I have an implementation that I'm getting ready to push right now.

What is the exact mechanism how lock files work?

They would contain the cmd, hashes for deps and hashes for outputs, locking the execution state. We might want to use the same format as in the build cache though (it is not required); we will discuss it with @skshetry when syncing up.

Do we need distinct lock and pointer files? Or could they be the same? Because if they could, then we can either use .dvc files for both or make .dvc files obsolete and only use some new lock/pointer files. This would create a clear distinction between new-style and old-style dvc repos.

Yes, lock files are not the same (strictly speaking) as pointer (dvc) files. But they could have the same format; this is an implementation detail. The current plan is to keep compatibility with old dvc files too.

Do we need distinct lock and pointer files? Or could they be the same? Because if they could, then we can either use .dvc files for both or make .dvc files obsolete and only use some new lock/pointer files. This would create a clear distinction between new-style and old-style dvc repos.

Yes, lock files are not the same (strictly speaking) as pointer (dvc) files. But they could have the same format; this is an implementation detail. The current plan is to keep compatibility with old dvc files too.

"distinct lock and pointer files" is a very important question. It feels like it is a bit too much to have 3 different concepts: dvc-files, pipeline files and lock files.

Compatibility is important. But I don't think it is worth introducing a new concept (or keeping the old one) just for compatibility. We should avoid this as much as we can. It is potential technical debt.

Yes, lock files are not the same (strictly speaking) as pointer (dvc) files. But they could have the same format; this is an implementation detail. The current plan is to keep compatibility with old dvc files too.

If they have the same format, then why can't we make it a single thing?

@skshetry and @efiop any update on this? It seems like folks are very happy with the proposal and most of the design decisions in it. The only open question is how many concepts we need, 2 or 3 (dvc-file, pipeline file, lock file). It would be great to see the proposed file formats to understand the separation better.

+1 on file formats, maybe some examples here. Will help make sure we understand things the same.

The file format is almost the same as above. The stage file will be the same as shown above:

stages:
  line:
    always_changed: False
    locked: False
    wdir: '.'
    cmd: echo `cat foo | wc` > bar
    deps:
    - foo
    outs:
    - bar
    persist:
    - foo
    ...

But, I might leave a few things out of the first iteration (e.g. always_changed, locked, etc.).

Also, regarding the lockfile, the current implementation is similar to:

{
  "md5": "8bbfc8f55ef5ad29d607b5c3599cb74b",
  "deps": {
    "line": {
      "foo": {
        "md5": "d3b07384d113edec49eaa6238ad5ff00"
      }
    }
  },
  "outs": {
    "bar": {
      "md5": "071aa733fb2090d0b2b749b8856af4ff"
    }
  },
  "stages": {
    "line": {
      "md5": "836ffebe6f8fc2f92985e85a37a252cd"
    }
  }
}

Each stage creates one output stage file (if no-cache is not set):

md5: e65efb023dd2aebf3dad212a8b41d66b
outs:
- md5: 071aa733fb2090d0b2b749b8856af4ff
  path: foobar
  cache: true
  metric: false
  persist: false

I used md5 to simplify for the implementation, but, we (@efiop & I) discussed that md5 is not really essential and we can be better off with {stage.cmd, deps, outs} (similar to build cache). So, something like this, perhaps:

{
  "stage_name": {
    "cmd": "cat foo",
    "deps": {
      "1.txt": "<checksum>"
    },
    "outs": {
      ...
    }
  }
}

@shcheklein, it's just too much effort to do a compat layer for the history + migrations. The best thing is to either be backwards-compatible or not. Again, unless we think of something for single-stage dvc files (e.g. those added by dvc add/import etc.), we are kind of stuck with supporting old-style stage files. Something to discuss in the future.


P.S. However, stage collection gets a bit complicated (e.g. pipeline_stages needs to have all the stages, even the data source stages, except for pipeline-generated output stages :laughing:).

  1. Why do we have deps -> stage-name -> dep-filename in lock file?
  2. Aside from that lock-file looks like old stage file in json, so do we really need it as a separate thing?
  3. Do we need base md5 in lock file? How is it calculated?
  4. How parameters play with all of this?

Good question @Suor.

4. params hasn't really been discussed and is not a part of this at the moment.
3. md5 is not really used, just an example at this point. The final implementation will use a format similar to the build cache.
2. The old stage file is kept to allow for individual checkouts.
1. The deps -> stage-name -> dep-filename nesting in the lockfile is mostly for not introducing discrepancies between different pipeline stages (i.e. stages have separate deps). Not sure, though, if I understood the question.

  1. dep is just a filename, I can't see any point in prefixing it with stage, that would be the same file regardless of the stage. This notation looks meaningless and confusing.
  2. We don't need old stage files for individual checkouts; we might address stages with stages-filename.ext:stage-name, and we will need to implement such addressing anyway for repro. The real question was: can we drop old .dvc stage files and only use lock files, or not introduce lock files and use old-style dvc files instead?
  1. It can be different. Take an example:
echo "foo" > foo
dvc add foo
dvc run -d foo -o bar "cat foo foo > bar"
echo "foobar" > foo
dvc add foo
dvc run -d foo -o bar2 "cat foo foo > bar2"

EDIT: But, other repros will change as deps has changed.

Thanks, @skshetry! It all makes sense. A few more questions/concerns :)

  1. I understand that we need a lock file and it will be pipeline-specific, while .dvc files are kind of data pointer files. What I don't understand yet is why we need to save outs info into this lock file. And what would happen if we get a discrepancy between those? How will repro behave?

  2. Naming - Dvcfile and DVC-files. Should we just name pipelines dvc.yaml and dvc.lock? Or pipeline.yaml and pipeline.lock (and keep .dvc as pointers)? Any thoughts @dmpetrov?

  3. Backward compatibility. I also think it's important for us to have it. There are different levels though and different possible implementations (from a conversion script to full read/write old-style files). For example, reading previous commits is important (e.g. to run dvc get and/or dvc.api). Can we support read-only mode? And isolate code that does the transformation as much as possible? In general, if we can isolate the reader for the old code very well, I think I'm fine with the backward compatibility. Having a lot of ifs everywhere - that I would try to avoid as much as possible.

cc @shcheklein

  1. pointer files will only be read by non-pipeline-related commands (i.e. everything except run/repro/pipeline/lock/unlock). repro will use dependencies from the workspace and regenerate the pointer files and the lockfile if they change, the same behavior as we have at the moment.

  2. Better naming would be good. Plus, it will also be less confusing, e.g. when the user does dvc checkout <pipeline-file.dvc>. So, we do need better naming. Maybe even a .pipeline file.

@skshetry
In your example both stages depend on exactly the same file; the md5 is what allows differentiating file versions. The stage prefix does nothing to help with this.

Another way to say that is that stages depend on file paths, not on their contents. If file contents change, then the stage might be reproduced to update the deps md5s, run, and update the outs and their md5s. Neither the stage deps paths nor the stage outs paths are changed by that.

So again stage prefix somehow presumes that deps for different stages are somehow different, but they are not.

@Suor

So again stage prefix somehow presumes that deps for different stages are somehow different, but they are not.

but you can repro a single stage, right? And it should update only the deps for that stage.

@skshetry

pointer files will only be read by non-pipeline-related commands (i.e. everything except run/repro/pipeline/lock/unlock). repro will use dependencies from the workspace and regenerate the pointer files and the lockfile if they change, the same behavior as we have at the moment.

from the discussion with @skshetry :

  • I get the idea about data commands being independent from the pipelines - and I like it. We should keep .dvc pointer files.
  • we need to save the hash for outs in the lock at least for the Git-tracked files, so there is no way to avoid saving anything about outs in the lock file.
  • I still don't like the idea that we are saving the same hash of a data file or dir in two places effectively - lock and .dvc. There is a possibility of discrepancies. If pipelines write them why don't they read them, etc, etc.

One idea to contemplate:

in the pipeline.lock file, for DVC-tracked outputs, save the hash of the corresponding .dvc file. This way we will be able to detect changes in the data and in the corresponding .dvc file. It feels like that makes more sense.
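
A minimal sketch of what such a lock entry might look like under this idea (the stage name, file name, layout and hash are illustrative only, not a final format):

train:
  outs:
    model.pkl: 3fa8c1d2...   # md5 of model.pkl.dvc, not of the data itself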

@skshetry thank you for the example!

Just to make sure we are on the same page... If a user runs something like:

$ dvc add users.csv
$ dvc run -d users.csv -o model.pkl -o logs/ python train.py

Then we will have (naming is not important now):

  0. Data source file like users.csv.dvc with md5/pointers
  1. Pipeline file Dvcfile without any md5 but with the command, deps and outputs (your first example)
  2. Lock file of that pipeline Dvcfile.lock (your 2nd example) with the commands, deps, outputs and md5/pointers.
  3. Two dvc files (or lock files) logs.dvc and model.pkl.dvc, one for each output (your 3rd example), with the pointers.

If that's correct, do (0) and (3) have the same structure, or (2) and (3)?

@dmpetrov, (0) & (3) will have the same structure. Regarding the lockfile, its format will probably be as follows (the 4th example above):

{
  "stage_name": {
    "cmd": "cat foo",
    "deps": {
      "1.txt": "<checksum>"
    },
    "outs": {
      ...
    }
  }
}

@Suor, Can you please explain what you are trying to say? Are you talking about using hashes like we do in build cache? If yes, we decided against that because it'd require us to hash all the deps/outs to check if it has changed, and to allow for granularity.

@Suor, Can you please explain what you are trying to say? Are you talking about using hashes like we do in build cache? If yes, we decided against that because it'd require us to hash all the deps/outs to check if it has changed, and to allow for granularity.

@skshetry I see the problem that @Suor mentioned. With the stage_name level in the lock file, we allow deps duplication and outputs duplication. We should avoid this.

Currently, DVC prevents outputs duplications:

$ dvc run -f res1_new.dvc -d file2 -o res1 "cat file2 file2 > res1"
ERROR: failed to run command - file/directory 'res1' is specified as an output in more than one stage: res1_new.dvc
    res1.dvc
This is not allowed. Consider using a different output name.

For some reason, we do not prevent the deps duplication (but we should whenever possible). Your example:

echo "foo" > foo
dvc add foo
dvc run -d foo -o bar "cat foo foo > bar"
echo "foobar" > foo
dvc add foo
dvc run -d foo -o bar2 "cat foo foo > bar2"

Ideally, we should prevent both of these cases. Or at least not do any extra work (the extra stage_name level) to support one of these cases.

For some reason, we do not prevent the deps duplication

I don't understand what deps duplication is. A single dep might be used by many stages; I don't see the issue with that.

@skshetry The issue was with the lockfile format you presented in your first example; the example from your last message doesn't have that issue. So it's resolved.

@dmpetrov, for the record, there will be a .dvc file for each output, and regarding your example, there will be two output files, logs.dvc and model.pkl.dvc.

I don't understand what deps duplication is.

@Suor, I think, @dmpetrov is talking about different versions of deps in the stages.

In this discussion, we are mostly focused on the separation of the pipeline and the data management layer. This is definitely an important aspect of this issue. However, there are actually two pain points that this issue addresses:

  1. Not easy to understand a pipeline if it is split into multiple files
  2. If I run a command dvc run -d x -d y -o Ooo -M mm -m mmmm python code.py myparams it is not easy to modify the command if I add a new dependency or output or metrics.

(2) is actually a prerequisite for (1). It is very important to keep the pipeline file format simple and to have a clear connection between the CLI dvc run and the pipeline file content.

2. If I run a command dvc run -d x -d y -o Ooo -M mm -m mmmm python code.py myparams it is not easy to modify the command if I add a new dependency or output or metrics.

I was thinking more about this issue and talked with the team members. It is clear that the single-dvc-file/pipeline-file format should be as close as possible to the dvc run syntax. Users should not have to learn a new format after learning dvc run. We should keep this as a top-priority requirement for the format. Otherwise, some users will keep using bash scripts to build dvc pipelines.

Possible ways of formatting:

  1. Bash script with a list of dvc run commands seems like a good option for single-dvc-file format.

    • Pros: perfect matching between command and pipeline file

    • Cons: we cannot construct the whole DAG after running such a script. Some intermediate representation of the DAG is needed, so we have to generate dvc-files the same way we do today. So, a bash script is not the way to go for creating a single pipeline file.

dvc run -d logs/ -d taxonomy.csv -o users.csv -p process,thresholds ./myprocessor logs/ users.csv
dvc run -d users.csv -o model.pkl -M summary.json -m train_logs.csv -p train python train.py
  2. Makefile style, where we store the entire command in the single dvc file in the stages section.

    • Pros: no need to learn new format - one-to-one mapping between dvc run and the file. The file is very short.

    • Cons: the command lines are very long and we need to split them properly. Also, it might not be easy (if possible at all) to translate the cmd to a yaml string properly in DVC code. It is also not parsable by machines, in case users need to extract the dependencies of a particular stage, for example.

stages:
  process:
    cmd: "dvc run -d logs/ -d taxonomy.csv -o users.csv -p process,thresholds ./myprocessor logs/ users.csv"
  train:
    cmd: "dvc run -d users.csv -o model.pkl -M summary.json -m train_logs.csv -p train python train.py"
  3. Exact param name matching between the dvc run cmd and the pipeline file.

    • Pros: (almost) no need to learn new format - straightforward mapping between dvc run and the file.

    • Cons: not the same as the current dvc-format

stages:
  process:
    cmd: "./myprocessor logs/ users.csv"
    deps: [logs, taxonomy.csv]
    params:
      file: params.yaml
      params: [process, thresholds]
    outs: [users.csv]
  train:
    cmd: "python train.py"
    deps: [users.csv]
    params:
      file: params.yaml
      params: [train]
    outs: [model.pkl, tb_logs]
    metrics: [train_logs.csv]
    metrics_no_cache: [summary.json]
  4. The current dvc-file format, but we need to clean it up a bit by removing the noisy default values like cache: true, metric: false, persist: false.

    • Pros: it is very close to the existing format

    • Cons: need to learn a new format - it does not satisfy the top-priority requirements. Users will continue to use bash scripts to make dvc pipelines. Also, it is a bit too verbose and takes too much space.

stages:
  process:
    cmd: ./myprocessor logs/ users.csv
    deps:
    - path: logs
    - path: taxonomy.csv
    - path: params.yaml
      params: [process, thresholds]
    outs:
    - path: users.csv
  train:
    cmd: python train.py
    deps:
    - path: users.csv
    - path: params.yaml
      params: [train]
    outs:
    - path: model.pkl
    - path: train_logs.csv
      metrics: true
    - path: summary.json
      cache: false
      metrics: true
  5. Python code can be used as the pipeline specification.

    • Pros: Python folks will appreciate this approach. Nothing prevents us from implementing this option in addition to one of the above.

    • Cons: need to learn a new format - it does not satisfy the top-priority requirement. It will prevent non-Python users from using DVC (today a portion of DVC users are R users, for example). DevOps folks, who are the target audience for some DVC scenarios, won't like this idea.

import dvc

process = dvc.stage(
    cmd="./myprocessor logs/ users.csv",
    deps=["logs/", "taxonomy.csv"],
    outs=["users.csv"],
    params={
        "params.yaml": ["process", "thresholds"]
    }
)

train = dvc.run_stage(
    cmd="python train.py",
    deps=["users.csv"],
    outs=["model.pkl"],
    metrics_no_cache=["summary.json"],
    metrics=["train_logs.csv"],
    params={
        "params.yaml": ["train"]
    }
)

dvc.run_pipeline([process, train])

To me, the third option looks like the best one for single-dvc-file/pipeline-file.

@iterative/engineering, @Casyfill, @salotz, @prihoda @elgehelge I'd love to hear your opinion about the future file format.

EDIT: Python option was added.

Keeping opened, since we are not done yet :)

In the previous approach with the lockfile and the multi-stage Dvcfile, even though the hashes were duplicated between the output stage file and the lockfile, they were different concepts. So, the user could still think of (lockfile + dvcfile) as a single concept (they need not care about the lockfile at all) and could just dvc repro Dvcfile:stage_name. There's no ambiguity here, as there is a clear separation between a pipeline stage and an output file stage.

But with the newly suggested approach, it's quite the opposite. The pipelines.yaml will house stage templates for multiple stages (similar to the Dvcfile in the previous approach). This yaml file will generate multiple stage entries in a single file (just a list of whatever we have in our current single-stage file), which hence will bear the checksums that can be used to handle data-related commands, etc. So this removes the requirement for a lockfile and removes the duplication of checksums.

But this does share the same concepts and the same structure for a given stage among different files. This might make it complicated for the user. E.g.:
Should dvc lock pipelines.yaml:stage_name be used, or should it be dvc lock pipelines.dvc:stage_name?
Maybe we should allow both? We might even need to duplicate some information in both files anyway (e.g. locked, persist, cache, etc.), so duplication is still there.

Should dvc lock pipelines.yaml:stage_name be used or, should it be dvc lock pipelines.dvc:stage_name?

pipeline.yaml is the ultimate source of truth for everything except hashes, so locked should be stored in pipeline.yaml and not pipeline.dvc. It doesn't seem like locked: True matters for .dvc, so there is no need to include it there. Same with persist. With cache it is a bit trickier, because it will affect things like push/pull/checkout, but we could again look into the corresponding pipeline.yaml as the ultimate source of truth and not store cache: True/False in .dvc.

Hi, just sharing one of my answers given privately...

Should dvc lock pipelines.yaml:stage_name be used, or should it be dvc lock pipelines.dvc:stage_name? Maybe we should allow both?

If the base name is the same in both, why not just pipelines:split-into-two? DVC can determine from which internal file to get the info. When you're using the default pipelines file name (pipelines.yaml/lock) you could just skip it: dvc checkout split-into-two — maybe just print a warning.

What will dvc add DVC-files look like? Will those remain the same? And should we call them something else in order to stop using the term "DVC-file" (esp. if we keep Dvcfile as the default pipeline file name)? Maybe "DVC metadata file" or just "metadata file", or "data source file" ... Idk

That was meant to be a secret presentation. :sweat_smile:
@jorgeorpinel, the dvc add-ed files will still generate .dvc files. We are only changing the stage files.

you could just skip it: dvc checkout split-into-two — maybe just print a warning.

Yes, for the naming, and if it's the default file, yes, we can go with it. But for the data-related commands, it'd be better to be explicit with the file name, because otherwise it's going to be confusing.

Np. Original comment deleted, text moved to https://github.com/iterative/dvc/issues/1871#issuecomment-618719315 above. And thanks for the answers (both secretly and above)

for the data-related commands, it'd be better to be explicit with the file name, because otherwise it's going to be confusing

I'm not sure I get why that would be confusing. Is it because you need to open the pipeline file to know/remember the stage names? dvc pipeline show can probably help with this.

@jorgeorpinel, the only concern I have is that, because some of the data commands allow granular operations on files (e.g. dvc checkout foo), it might overlap with dvc checkout foo where foo might be a stage name.
Ideally, we will not be able to know what the user's intention was. We can check the stages, see if the name was mentioned, and then fall back to whichever way is possible, or even tell the user that it's ambiguous in this situation. But this is us guessing, right?

Take an example: a user has a file foo and creates another stage with the same name. Suppose, later, foo is no longer tracked. And now, when the user does dvc checkout foo, we will be falling back to the stage checkout, which may or may not be what the user intended. The :stage_name addressing makes the intention clear.

^ is me talking ideally. If we do not have better ideas, we can easily fallback, there's no problem with that.

Another approach can be to do nothing and keep the :stage_name addressing as-is. This will allow us to move away from *.dvc and *.yaml files easily, and the user can just do dvc checkout file --stage or something to check out the files from the stage that the specific file belongs to (and we can also keep :stage_name for convenience). My point is to make data-related commands be primarily about data, not stages or dvc files, while pipeline-related commands can use a plain and simple stage name (e.g. dvc repro <stage_name>).
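
To illustrate the ambiguity and the two ways it could be resolved (neither syntax is final, both are just the options discussed above):

$ dvc checkout foo          # ambiguous if a stage is also named foo: file path or stage name?
$ dvc checkout :foo         # :stage_name addressing makes the stage intention explicit
$ dvc checkout foo --stage  # or the hypothetical --stage flag mentioned above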

@iterative/engineering, this should be ready to try out. I'd love to get feedback.
Just check out master or pip install via:

$ pip install --user https://github.com/iterative/dvc/archive/master.zip

Remember, the pipeline file is only generated if you specify the hidden -n/--name flag on dvc run (otherwise, it will fall back to old-style *.dvc files).

A working example to try out:

dvc run --name "generate-foo" --outs foo \
    "echo 'foo' > foo"

And, the stage can be addressed via :stage_name (eg: dvc repro :generate-foo).

because some of the data commands allow granular operations on files (e.g. dvc checkout foo), it might overlap with dvc checkout foo where foo might be a stage name... we will not be able to know what the user's intention was

So are you changing the implicit arguments accepted by these commands? I think either the stage needs a flag, e.g. --stage, or path targets need one, e.g. --
Otherwise, how are you detecting what kind of argument you're getting anyway? Does argparse even support this?

The :stage_name addressing makes intention clear... do nothing and keep :stage_name addressing

OK, this also works but flags are more explicit (which is good). Both could be supported I guess.

My point is to make data-related commands be primarily about data, not stages or dvc files, while pipeline-related commands can use a plain and simple stage name

Agree

Closing in favor of https://github.com/iterative/dvc/issues/3693. Multistage dvcfiles are now the default for dvc run in the alpha release 1.0.0a0.
