Dvc: Parametrize pipeline using DVC properties

Created on 5 Dec 2018 · 13 comments · Source: iterative/dvc

It would be useful if I could parametrize my pipeline using environment variables, which could be read from a properties file specified using dvc config env my.properties. DVC would load those environment variables when running the command.

For example, I could have this properties file:

DVC_NICKNAME=David

And run:

dvc run -o hello.txt 'echo "Hello ${DVC_NICKNAME}!" > hello.txt'
dvc run -o cheers.txt 'echo "Cheers ${DVC_NICKNAME}!" > cheers.txt'

And produce "Hello David!" and "Cheers David!" files.

Users would just have to make sure to quote the command or use interactive mode #1415.

The DVC file would contain the variable reference:

cmd: echo "Hello ${DVC_NICKNAME}!" > hello.txt

The value would be added to the environment by DVC at DVC startup so it would be handled natively by the shell.
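
For illustration, a minimal sketch (not actual DVC code) of what loading such a properties file into the environment at startup could look like - the file name, format and helper name are assumptions:

import os

def load_env_properties(path="my.properties"):
    # Hypothetical sketch: read KEY=VALUE lines and export them so that any
    # shell command DVC spawns later inherits the variables.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            os.environ[key] = value

# If this ran once at DVC startup, the shell would expand ${DVC_NICKNAME}
# natively in commands like: echo "Hello ${DVC_NICKNAME}!" > hello.txt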

In order for dvc status to be able to detect that variables in a stage changed, we can calculate the internal md5 checksum on the contents with the variable values injected in place of the variable names, so that it would be handled as if the contents of the DVC file had changed. This can be done using os.path.expandvars. But unfortunately, this would only replace variable references used directly in the shell command; it would not cover cases where you're using the environment variable inside a script. The only foolproof way would be to force the user to explicitly request the environment variables that should be injected from the properties file, e.g. using dvc run -e DVC_NICKNAME -e DVC_OTHER. That would basically allow adding additional "env dependencies" to stages.
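
As a rough sketch of that checksum idea (hypothetical, not DVC's actual implementation), the command string could be expanded with os.path.expandvars before hashing, so a changed variable value shows up as a changed stage:

import hashlib
import os

def stage_cmd_checksum(cmd):
    # Substitute environment variables into the command first, so changing
    # DVC_NICKNAME from "David" to e.g. "Eve" changes the resulting checksum.
    expanded = os.path.expandvars(cmd)  # leaves unknown variables untouched
    return hashlib.md5(expanded.encode("utf-8")).hexdigest()

os.environ["DVC_NICKNAME"] = "David"
print(stage_cmd_checksum('echo "Hello ${DVC_NICKNAME}!" > hello.txt'))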

It would be nice to also inject the variables into paths to dependencies, so that you can parametrize those as well. This could also be done using os.path.expandvars. It would change the DAG dynamically, but AFAIK it should actually magically work without breaking anything, right? As long as you initialize the environment at each DVC startup and call expandvars when reading the dependency paths.
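
The same expandvars call could in principle be applied to dependency and output paths when the DVC file is read; the sketch below is illustrative only, and the ${DVC_DATASET} path is a made-up example:

import os

def expand_dep_paths(paths):
    # Expand ${VAR} references in dependency/output paths read from a DVC
    # file, so the DAG is built from the resolved paths.
    return [os.path.expandvars(p) for p in paths]

os.environ["DVC_DATASET"] = "imagenet"  # made-up variable for illustration
print(expand_dep_paths(["data/${DVC_DATASET}/train.csv"]))
# -> ['data/imagenet/train.csv']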

feature request p2-medium

Most helpful comment

@shcheklein It can be used in any pipeline when you're providing the same parameters in different stages. I was solving it by manually specifying the parameter multiple times and I didn't realize it could be solved using a custom config file provided as a dependency as suggested by @efiop.

The problem is that if the config properties were provided as environment variables, even a global DVC config file would have to break granularity of caching, since you could use those variables hidden inside bash scripts so there would be no way to check which variables are used.

[edited] So the only benefit would probably be if the variables could also be used in dependencies/outputs. For example, configuring the highest performing model file and using that throughout the pipeline. But not sure it's worth the effort - currently I'm solving it by just having a special location "models/top.pkl" where I copy it.

All 13 comments

This one is interesting, @prihoda :thinking:

I've never encountered the case of using variables in DVC commands, maybe my use cases are very simple :sweat_smile:!

This looks like a _Makefile_ behavior, where you can define variables on the top of your file and use them in the rules.

I prefer being "explicit rather than implicit" and I'll think twice before introducing this request.
Let's leave this open and wait for some thumbs up or comments from other users :slightly_smiling_face:

If you need this right now, a workaround would be to use something like direnv, define the variables there, and just let the shell expand them.

@prihoda If going with plain env vars, I think a more natural approach, would be to define something like a config file, that you would specify as a dependency for your stage. E.g.

# env.sh
DVC_NICKNAME=David

and you would use it in your stage like so:

$ dvc run -d env.sh -o hello.txt 'source env.sh && echo "Hello ${DVC_NICKNAME}!" > hello.txt'

Though, it doesn't solve dynamic dep/out expansion. Maybe we could consider introducing a -e env.sh (or maybe env.yml, to make it more cross-platform) option that would make dvc read it before expanding deps/outs/cmd paths. And it would make env.sh a simple direct dependency for the stage, which seems to suit the current dvc architecture nicely. I think we've discussed this briefly in https://github.com/iterative/dvc/issues/1119. Need to take a closer look into this.

@prihoda could you please describe a "real-life" example where parametrizing pipelines like this would give benefits? Were you trying to solve some problem? (I do have some ideas of my own, I just want to know your thoughts on this)

@mroutis what do you mean by "explicit vs implicit" in this case? Maybe I'm missing something, but having a way of passing some parameters (in a way that DVC tracks changes, expands, etc.) can be done more or less explicitly - like an explicit config file with all these parameters.

@efiop creating a single config file and using it as a dependency (via -d) breaks granularity of caching - every change in this global config makes the whole pipeline outdated (because usually a lot of stages depend on different variables in this config).

@shcheklein It can be used in any pipeline when you're providing the same parameters in different stages. I was solving it by manually specifying the parameter multiple times and I didn't realize it could be solved using a custom config file provided as a dependency as suggested by @efiop.

The problem is that if the config properties were provided as environment variables, even a global DVC config file would have to break granularity of caching, since you could use those variables hidden inside bash scripts so there would be no way to check which variables are used.

[edited] So the only benefit would probably be if the variables could also be used in dependencies/outputs. For example, configuring the highest performing model file and using that throughout the pipeline. But not sure it's worth the effort - currently I'm solving it by just having a special location "models/top.pkl" where I copy it.

@mroutis what do you mean by "explicit vs implicit" in this case?

I was using "explicit" to refer to any commands using additional context from the environment (for example, variables).

However, I really like the ideas proposed:

  • Having the variables as a dependency (either with the -e option or an env.sh file); this way, if the env changes, the stage is going to be reproduced (with the -e option we could even raise an error if the user doesn't have those variables in their environment)

I like the idea with -e as well. To be completely fair, I don't like that with DVC you have to specify (and keep up to date) all the dependencies yourself, but I don't see any good implicit options.

I don't like that with DVC you have to specify (and keep them up to date) all the dependencies yourself

@shcheklein, have you seen any other solution that deals with dependencies implicitly?

Maybe we could watch the current directory for events triggered by the command's PID (implying that every "read" file is a dependency and "created" one an output), sadly, there's a lot of edge cases :disappointed: (process creating temp files, windows support, remote dependencies/outputs as HTTP or S3, etc.)

The only solution that I'm thinking about is "implicit" rules _a la Makefile_, but I don't think something similar could work for DVC.


By the way, I didn't understand quite well the implications of "breaking granularity of caching" by having a file with parameters, would you mind explaining?

By the way, I didn't understand quite well the implications of "breaking granularity of caching" by having a file with parameters, would you mind explaining?

Yep. Let's imagine you have a global env.sh file with two parameters A and B. And you have two stages - S1 and S2. S1 depends on (uses) A, S2 depends on B. We have to specify -d env.sh for both stages to capture these dependencies. The problem is that -d env.sh is not granular enough, in the sense that if you change A, dvc makes S2 stale along with S1 and we have to run it again. Basically, what I usually saw happening in this scenario (one global config) is that almost every stage depends on this single file and every change to the file invalidates all intermediate results (cached data produced by some intermediate stages in the pipeline DAG). Hope all of this makes sense :)

@shcheklein, have you seen any other solution that deals with dependencies implicitly?

Make does not itself implicitly derive dependencies as far as I remember. You have to specify them (or use autotools or gcc to parse source and create a list with dependencies). CMake does this automatically. But I agree, I don't think this can work with DVC.

I don't know if there are other tools to be honest.

Maybe we could watch the current directory for events triggered by the command's PID (implying that every "read" file is a dependency and "created" one an output), sadly, there's a lot of edge cases :disappointed: (process creating temp files, windows support, remote dependencies/outputs as HTTP or S3, etc.)

Yep. This is fragile. I think @dmpetrov and @efiop tried this already a while ago.

Another user has expressed a use case for supporting env vars in stage/pipeline files. Context: https://discord.com/channels/485586884165107732/485596304961962003/715639271914209332

I want to create a dvc stage with code from a third-party python package that needs to be installed first. Since the path to the code of this source file might look different for different contributors, I wonder if it is even possible to track such a file as a dependency
I'm talking about an executable from a third-party python package that I would directly use as a source file in the stage
they will just go in the corresponding Python installation bin folder and would then be in the PYTHONPATH afterwards..

as an example:
pip install -> executable then lies under ~/.local/bin or something
then I want to use it as dvc run -d .py ... python .py ...
problem will be that .py will look different for other contributors of the project..

Another user expressed interest in this feature implicitly in https://discord.com/channels/485586884165107732/485596304961962003/765677497843974204:

what if I want to execute pipeline for another dataset, should I do it manually? change params.yaml and other stuff every time?
