(see Discord context)
It would sometimes be quite handy to have a flag like --add-stdout foo to dvc run, e.g.
dvc run -d input.csv --add-stdout output.csv mycommand.sh
which would capture stdout to a file, and add that file as an output of the pipeline step. It would basically be equivalent to
dvc run -d input.csv --outs output.csv 'mycommand.sh > output.csv'
but the implementation would presumably be to just change dvc's subprocess.run(cmd) call to subprocess.run(cmd, stdout=file) - you wouldn't literally change the cmd string to use shell redirection.
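To make the idea concrete, here is a minimal sketch of the redirection described above. It is not dvc's actual code: run_with_stdout_to_file is a hypothetical helper, and dvc's real subprocess call may pass other arguments (shell, cwd, env) that are left out here.

```python
import subprocess
import sys


def run_with_stdout_to_file(cmd, stdout_path):
    """Run cmd, write its stdout to stdout_path, and return the exit code.

    Hypothetical helper for illustration only; not part of dvc.
    """
    with open(stdout_path, "wb") as out_file:
        # The command is passed through unchanged: nothing like "> file"
        # is appended to it. The child process just gets out_file as stdout.
        completed = subprocess.run(cmd, stdout=out_file)
    return completed.returncode
```

For example, run_with_stdout_to_file([sys.executable, "-c", "print('a,b')"], "output.csv") would leave the printed line in output.csv without any shell redirection in the command.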
Name is negotiable of course.
Benefits include:
Duplicating shell features will open a whole new universe for us to implement and a whole new UI for people to get used to. I am strongly against adding this. We should encourage people to either use shell redirection or adjust their commands instead.
While I agree that I find myself repeating the input/output filename in almost every dvc stage already, e.g.:
md5: 33abd8c0a78648177c165c9a5b549ea7
cmd: ../scripts/sample.py raw/train.parquet train_sample.parquet
wdir: .
deps:
- md5: 65dd48acd71d88dec45949e0f8a17817
  path: raw/train.parquet
outs:
- md5: cd0ec3b2b25d5bdbdebe872dbfcf6576
  path: train_sample.parquet
  cache: true
  metric: false
  persist: false
My vote is we either tackle this problem more broadly (not _just_ for the case of simple redirection) or leave it as is?
If name duplication is the root of this then we should count this issue as +1 to providing some dedup methods.