Dvc: stage: preserve comments and don't add default fields removed by user

Created on 28 Mar 2019  ·  15Comments  ·  Source: iterative/dvc

md5: adbb969c66c0fe9449b17a57afcd9531
outs:
- cache: true
  md5: c157a79031e1c40f85931829bc5fc552
  metric: false
  path: file
  persist: false
wdir: .
enhancement p2-medium

Most helpful comment

It would also be great to maintain the spacing on the ‘cmd’ field. I often change the spacing such a long cmd with several options is split up with new lines in between each argument. DVC currently undoes this spacing

All 15 comments

As @shcheklein noted, this should not remove fields that were already set before. This is also relevant to user comments in dvc files, that we currently throw out, which is not nice. Need to consider using something like https://pypi.org/project/ruamel.yaml/ first.

It would also be great to maintain the spacing on the ‘cmd’ field. I often change the spacing such a long cmd with several options is split up with new lines in between each argument. DVC currently undoes this spacing

It would be nice if we could also have a custom filed for users to add some metadata.

For my use case in particular, i would like to add the repository and git commit in which the file was generated as well as the username of the person that generated the data. This is because i intend to use the datasets in more than one repository and i need a way to track the origin of the data

@martinjms , have you seen the issue related to reusing datasets, it might be interesting to you? #1487

https://pypi.org/project/ruamel.yaml/ - this parser or something like this can be used to preserve the info

I reviewed the issue and two several strategies to solve it:

  1. Superficial. Use ruamel.yaml to parse file, store structure with comments and everything in Stage, then work as usual. On save dump classes to dicts and apply diffs to stored structure. Then serialize structure to file.

  2. Fundamental. Use ruamel.yaml, resulting structure will serve as internal state of Stage, deps, outs classes. Some classes that don't have much or any polymorphic behaviour are eliminated and replaced either with their base or even structs + funcs. Since state = ruamel structure then we just dump it when needed.

@efiop what do you think? ;)

P.S. ruamel.yaml do preserve comments and string structure. Custom keys will also be saved in both cases.

@Suor just trying to better understand the second case. What will happen to classes (e.g. outs) that do have polymorphic behavior? Also what are the benefits of using this approach do you see?

P.S. ruamel.yaml do preserve comments and string structure. Custom keys will also be saved in both cases.

👍

@shcheklein they will stay classes, but all data access will be proxied to underlying ruamel structure (ordered dict with additional attrs), this way will they stay the same from outside and in big part inside too. The benefits of the fundamental approach are simpler code and less code.


Here is my thought process behind those solutions.

Currently we load yaml to python structure, validate it, then transform to classes. When dumping we go in reverse. Like this:

    yaml -x> structure -> validation -х> classes
    classes -> structure -> yaml

"x" on an arrow means we loose formatting info. If we use ruamel.yaml the first x is solved, but the second remains. To solve that we need to somehow pass formatting info to classes and then restore that on save, which is big and stupid work.

To avoid that we can store ruamel structure in Stage and apply diffs to it after classes -> structure dump step. This will be what I called superficial approach:

    yaml -> structure -> validation -х> classes
      + save old structure in Stage
    classes -> new-structure + old-structure -> yaml

However, if we lift any limitations then it looks like the best approach is to get rid of the classes:

    yaml -> structure -> validation
    structure -> yaml

This way it wouldn't matter if we use dicts from pyyaml or ordered dicts with additional info from ruamel.yaml.

Ok, but structure is only data we also need behaviour. There are two extreme ways to do that: functions and classes. Classes shine when we have polymorphic behaviour and we do have that with remotes, but less so with everything else.

Anyway we already have classes there, so it would be easier to hide structures within. Make all data access, both read and write, go to structures generated by ruamel.yaml and left the rest the same. It will simplify load/dump process immediately and will allow us removing some classes in the future.

Discussed this with @efiop.

Will be implementing superficial approach now. We will return to fundamental approach later in separate ticket when we decide to go after tech debt.

Another topic we discussed was future compatibility issues connected to arbitrary custom keys. Our solution was to allow users to write anything under single top level key meta and disallow in any other places. The tool will automatically move any custom keys there with a warning, as a less annoying behaviour then to just complain-and-stop.

@Suor Actually, I don't think we should do any moving automatically, because of potential problems with forwarding compatibility. Imagine we've introduced a new key newkey in a new release. One user creates a stage and then shares it with his colleague, that is running an older dvc. In that case, if we would to automatically move newkey to meta/newkey, would lead to bad behaviour, which, if committed back to repo, would break everything even for the first user that was using a new version of dvc. I think throwing a schema error as we do right now is a proper way to handle that (but of course we should probably make the error more user-friendly and suggest user to move that field if it is his own).

Just a reminder this now consists of two parts:

  • [x] retain comments and formatting
  • [ ] support custom fields under top level meta key

The first part is implemented in this PR https://github.com/iterative/dvc/pull/1885

Please, don't forget to update or create an issue for the documentation - probably, this one at least - https://dvc.org/doc/user-guide/dvc-file-format

So this is done but docs.

@AlJohri @martinjms Guys, this feature has been released in 0.40.0, please feel free to give it a try 🙂

Was this page helpful?
0 / 5 - 0 ratings