Dvc: Easy way to update existing DVC-file for "run" stage

Created on 10 Jan 2020 · 12 Comments · Source: iterative/dvc

I often need to update an existing DVC-file (say train.dvc) for a run stage with minor changes, such as slightly modifying the cmd section or adding new outs or deps.

The way I do it now is to reverse-engineer the original dvc run -d ... -o ... -f train.dvc <cmd> command that was used to create train.dvc.

This is cumbersome and seems like unnecessary work.

I've tried to edit train.dvc manually, but sometimes this resulted in a corrupted file.
It also seems that some special format is preserved when cmd is split across multiple lines.

Is there an easy, less hacky way to do this?

Labels: awaiting response, question

All 12 comments

@PavelKovalets unfortunately the easiest way to do this that comes to mind is to write a shell script that generates the pipeline/stage (use --no-exec for dvc run) and then run dvc commit. @iterative/engineering any other ideas?

update existing dvc file (say train.dvc) for run stage with some minor changes like slightly modifying the cmd section, or adding new outs or deps...
I've tried to edit train.dvc manually, but sometimes it resulted in a corrupted file.

Editing the checksums manually seems very error prone. If it's just changing the command, that should be doable manually (no cache corruption) from what I understand.

@PavelKovalets could you provide an actual example perhaps? The original run command, the resulting DVC-file, and what you want to change exactly.

💡 Maybe a DVC command to generate DVC checksums from arbitrary files could be useful in these or other situations, basically exposing our internal function that does this (which I know is based on MD5 but not always exactly that).

Definite +1 on checksum command.
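
To make the checksum idea concrete, here is a minimal sketch in Python of hashing an arbitrary file. It computes a plain MD5 over the raw bytes only. As noted above, DVC's internal checksum is based on MD5 but is not always exactly that, so the value this returns may not match what DVC records in a DVC-file. The function name file_md5 and the chunk size are illustrative assumptions, not part of DVC.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the raw bytes of the file at `path`.

    Hypothetical helper: this is a plain MD5, not DVC's internal checksum
    function, so the result may differ from the md5 field in a DVC-file.
    """
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling file_md5("some/file") on a small file and comparing the result with the md5 recorded for the same file in a DVC-file is a quick way to see whether the two agree for your data.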

Hi @PavelKovalets !

I've tried to edit train.dvc manually, but sometimes it resulted in a corrupted file.
Also it seems that there is some special format that is preserved while splitting cmd into multiple line.

Just regular yaml. Maybe you could show the error you've been getting? There is a chance that it is a bit too strict and we could ease up on it. The hashes could be (re-)computed without running your command with dvc commit.

Thanks, everybody, for the suggestions. I've ended up editing .dvc files manually and running dvc repro ... afterwards, which updates the formatting of the file and recalculates the hashes.

But this still seems like a hack, because sometimes a file name that contains spaces, e.g. "File name", is split across two lines, which causes issues with manual editing.

You mean when you specify it in dvc run?

Yes

@PavelKovalets So dvc run -o Part1 Part2? Or do you properly escape it to prevent shell treating it as 2 arguments?

I mean that in the following case the line splitting in the .dvc file is confusing (I work on Windows, but I suppose this also applies to Linux):

  1. Run dvc run -f echo.dvc echo "folder/folder/folder/folder/folder/folder/folder/folder 1/folder/folder 2/folder/folder/folder 3/folder/folder/folder/folder" where the long argument stands in for the many arguments of a real command

  2. Check the echo.dvc file content:

md5: c8212d12fc6f787fb594a136f73c34ef
cmd: echo "folder/folder/folder/folder/folder/folder/folder/folder 1/folder/folder
  2/folder/folder/folder 3/folder/folder/folder/folder"
wdir: .
  3. When I edit the echo.dvc file to change the command arguments, it is not clear how many spaces the actual file name contains: was it ".../folder 2/..." or ".../folder  2/..."? The custom DVC formatting and the multi-line split hide this.

@PavelKovalets Got it. The format is caused by yaml dumping it like that, so the confusion here is because of yaml multiline string formats. When editing your dvc files by hand, you could use other styles of splitting https://yaml-multiline.info/ if you want to.
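
For the echo.dvc example above, the cmd value is a plain (unquoted) multi-line YAML scalar. When YAML loads such a value, each line break inside it, together with the indentation of the continuation line, is folded into a single space. The sketch below imitates that rule in plain Python without a YAML library. The helper name fold_plain_scalar is made up for illustration, and it ignores blank lines inside the value, which YAML treats differently.

```python
def fold_plain_scalar(lines):
    """Join the physical lines of a plain YAML scalar as line folding does.

    Simplified sketch: each line is stripped of surrounding whitespace and
    the pieces are joined with one space. Blank lines inside the value
    (which YAML turns into newlines) are not handled here.
    """
    return " ".join(line.strip() for line in lines)


# The two physical lines of the cmd value from echo.dvc, without the "cmd: " key.
cmd_lines = [
    'echo "folder/folder/folder/folder/folder/folder/folder/folder 1/folder/folder',
    '  2/folder/folder/folder 3/folder/folder/folder/folder"',
]

print(fold_plain_scalar(cmd_lines))
```

The printed command contains "folder 2" with exactly one space, so the line break in the file stands for a single space in the loaded value.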

@PavelKovalets unfortunately the easiest way to do this that comes to my head is to write a shell script that generates the pipeline/stage (use --no-exec for dvc run) and then run dvc commit. @iterative/engineering any other ideas?

Bash scripts are a pain with conda (and I assume that dvc is part of the conda env in many projects) because of https://github.com/conda/conda/issues/7980. So the more I can avoid bash scripts, the better.

@antonkulaga You could write it with a python script with a similar amount of effort :slightly_smiling_face:
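
A rough sketch of what such a Python script could look like, following the approach suggested earlier in the thread: regenerate the stage with dvc run --no-exec, then run dvc commit. Only the flags mentioned in this thread (-f, -d, -o, --no-exec) are used. The function names, file names, and the train.py command are hypothetical placeholders. The sketch also assumes that the outputs already exist in the workspace when dvc commit runs, and it does not handle any confirmation prompts DVC may show.

```python
import subprocess


def build_run_args(stage_file, cmd, deps=(), outs=()):
    """Build the argument list for `dvc run --no-exec`.

    `cmd` is passed as one string so DVC receives the command unchanged.
    Passing a list to subprocess avoids shell quoting problems with
    file names that contain spaces.
    """
    args = ["dvc", "run", "--no-exec", "-f", stage_file]
    for dep in deps:
        args += ["-d", dep]
    for out in outs:
        args += ["-o", out]
    args.append(cmd)
    return args


def regenerate_stage(stage_file, cmd, deps=(), outs=()):
    """Rewrite the stage file without running cmd, then record the hashes."""
    subprocess.run(build_run_args(stage_file, cmd, deps, outs), check=True)
    # Assumption: the outputs already exist, so dvc commit can hash them.
    subprocess.run(["dvc", "commit", stage_file], check=True)
```

With this in a file kept next to the pipeline, changing a dependency or the command means editing one call such as regenerate_stage("train.dvc", "python train.py", deps=["train.py", "data/File name.csv"], outs=["model.pkl"]) and rerunning the script, instead of reconstructing the original dvc run command by hand.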
