Dvc: Avoid shell (re-)interpretation in `dvc run`

Created on 8 Aug 2019 · 15Comments · Source: iterative/dvc

When the command for dvc run is complicated enough, or needs internal quoting, it can be somewhat difficult to get the quoting correct:

# Fails
% dvc run -f thing.dvc -d TGP_Opp_2014.csv Rscript -e 'saveRDS(1:10, "ops-data.rds")'
Running command:
    Rscript -e saveRDS(1:10, "ops-data.rds")
zsh:1: unknown file attribute: 1
ERROR: failed to run command - stage 'thing.dvc' cmd 'Rscript -e saveRDS(1:10, "ops-data.rds")' failed

# Fails
% dvc run -f thing.dvc -d TGP_Opp_2014.csv 'Rscript -e "saveRDS(1:10, 'ops-data.rds')"'
Running command:
    Rscript -e "saveRDS(1:10, ops-data.rds)"
Error in saveRDS(1:10, ops - data.rds) : object 'ops' not found
Execution halted
ERROR: failed to run command - stage 'thing.dvc' cmd 'Rscript -e "saveRDS(1:10, ops-data.rds)"' failed

# Succeeds
% dvc run -f thing.dvc -d TGP_Opp_2014.csv 'Rscript -e "saveRDS(1:10, \"ops-data.rds\")"'

# Succeeds
% dvc run -f thing.dvc -d TGP_Opp_2014.csv 'Rscript -e "saveRDS(1:10, '\''ops-data.rds'\'')"'

# Succeeds
% dvc run -f thing.dvc -d TGP_Opp_2014.csv 'Rscript -e "saveRDS(1:10, '"'"'ops-data.rds'"'"')"'

The reason for the failure is that the shell evaluates the command arguments twice - once when passing the arguments to dvc run, and again when dvc run actually executes the command to create the artifact. Once this is understood, the behavior makes consistent sense, but is still a bit tricky to deal with sometimes.

(The last example succeeds because of the fun fact that the best way to escape single-quotes in a single-quoted string is to use the pattern'"'"'. The penultimate example is similar, but using '\'' instead. Simple backslash quoting of single quotes does not work - see https://stackoverflow.com/questions/1250079/how-to-escape-single-quotes-within-single-quoted-strings)

I assume that normally the words of the command are pasted back together with space separators, then invoked by the shell - which splits them back apart using its quoting rules, and does all the other stuff a shell does (globbing, substitution, running multiple commands if there are semicolons, etc.).

Proposal is to add some combination of a couple more flags:

--no-shell - indicate that the words of the command shouldn't be pasted back together and invoked using the shell, they should exec the command & its args directly (i.e. the spiritual equivalent of Python's subprocess.run(..., shell=False)).
-- - indicate that argument-processing should stop, and all the rest of the words are part of the command - see precedent here. This is probably not necessary though, since it seems that the first non-option argument already stops argument processing?

Furthermore, a good heuristic for figuring out whether to invoke the shell or not, when running the command, might be whether or not the command has multiple words - if it's a single word, invoke the shell, if it's multiple words, don't use the shell. There's some precedent for this heuristic in Perl, and the Python subprocess command also has some ties between the number of words and the shell behavior: "If passing a single string, either shell must be True (see below) or else the string must simply name the program to be executed without specifying any arguments."

feature request p2-medium research

Source

kenahoo

❤4

Most helpful comment

So, --no-shell seems to be a reasonable option. We would need to add a flag to the DVC file as far as I understand. From the implementation perspective it's a quite simple change. @efiop do you have other concerns here, see edge cases?

@kenahoo can you clarify/give more example re the single word vs multiple words, please?

shcheklein on 9 Aug 2019

👍2

All 15 comments

Discord context https://discordapp.com/channels/485586884165107732/563406153334128681/609092672128286741

efiop on 8 Aug 2019

@kenahoo can you clarify/give more example re the single word vs multiple words, please?

shcheklein on 9 Aug 2019

👍2

Furthermore, a good heuristic for figuring out whether to invoke the shell or not, when running the command, might be whether or not the command has multiple words - if it's a single word, invoke the shell, if it's multiple words, don't use the shell.

I don't see the benefit of this heuristic (maybe because I don't know the difference between subprocess.run(..., shell=False) and subprocess.run(..., shell=True) ).
So, the question is why not always use shell=False, which is the default and the recommended value?

dashohoxha on 9 Aug 2019

@dashohoxha

So, the question is why not always use shell=False, which is the default and the recommended value?

Because you might want your command to be evaluated by the shell, as you usually do when working in shell. E.g. by using env vars, piping, etc. The heuristics tells us that if you are passing 1 word (e.g. escaped command which is passed as 1 argument to the CLI dvc run 'cat data | grep something') it means that you probably want shell=True, otherwise shell=False is fine. I'm not qite sure that it is reliable enough, so going with --shell --no-shell feels like a safer choice for me. We might consider switching to --no-shell by default later. The reason we went with --shell by default, is that it felt more natural when coming from not-wrapping your commands in dvc run.

efiop on 9 Aug 2019

@efiop Thanks for the explanation.
I missed the fact that the single command can actually be quoted (like 'cat data | grep something') and may need to be interpreted from the shell.

My opinion is that putting such complex commands in a bash script would usually be better. However there still may be some cases when you want to run the command from a shell.

I think that --no-shell should be the default option, unless it breaks some existing projects that use DVC.

dashohoxha on 10 Aug 2019

I do think that --no-shell as the default would break some existing projects. For example, I have an artifact with this command:

dvc run ... 'jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace foo.ipynb && jupyter nbconvert --execute --to=html --ExecutePreprocessor.timeout=180 foo.ipynb'

Since && is shell syntax, it would fail if it were run without shell interpretation. Similarly with anything that uses ; or |.

Sometimes the command can be transformed to something that doesn't need the shell - for example, @dashohoxha's cat data | grep something can be written as grep something data, which is usually a better and more efficient thing to do anyway.

A question (which I don't understand the details of very well) is what to do in cases like mine above, where the command is stored in the .dvc file like so:

% head -n3 foo.dvc 
wdir: .
cmd: jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace foo.ipynb
  && jupyter nbconvert --execute --to=html --ExecutePreprocessor.timeout=180 foo.ipynb

Is the cmd read in as a single string? Or is it two strings, or 12 strings? If it's one string, then I think everything is fine.

kenahoo on 12 Aug 2019

👍1

I guess YAML reads it as a single string, so that should work well:

% Rscript -e 'yaml::yaml.load_file("foo.dvc")$cmd'
[1] "jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace foo.ipynb && jupyter nbconvert --execute --to=html --ExecutePreprocessor.timeout=180 foo.ipynb"

By which I mean - existing .dvc files should only have the arguments stored as single strings, and are all currently using shell=True, so that should work well for backward compatibility.

kenahoo on 12 Aug 2019

👍1

dvc run ... 'jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace foo.ipynb && jupyter nbconvert --execute --to=html --ExecutePreprocessor.timeout=180 foo.ipynb'

@kenahoo I wonder, are there any drawbacks of putting a long command like this in a small bash script?

I do think that --no-shell as the default would break some existing projects.

I think that at some point DVC should introduce proper versioning and releases. I mean that it should have major releases and minor releases.

Minor releases are only for bug fixes and efficiency improvements, but they don't change the interface and the features (they are like maintenance releases).

Major releases on the other hand can also introduce interface and feature changes, while also documenting these changes from the last major release.

This way people know that as long as they update to minor releases they can expect their project to work as before. However when they upgrade to major releases they may need to fix a few things on their project, to match the latest changes in the interface.

dashohoxha on 12 Aug 2019

@kenahoo I wonder, are there any drawbacks of putting a long command like this in a small bash script?

Shouldn't be any technical drawbacks, it's really just a matter of organizing the code in a way that's maintainable. Sometimes it's nice to encapsulate in a script, sometimes it just becomes another file that needs managing.

I think that at some point DVC should introduce proper versioning and releases. I mean that it should have major releases and minor releases.

Yes, though to be really useful it would need some way to figure out which artifacts were created under which version of the tool (presumably by putting version numbers in .dvc files), and whether or not they need to be updated - ideally using automatic upgrade tools.

kenahoo on 12 Aug 2019

After reading through the thread again, I wanted to emphasize something a bit:

Changing the default from shell=True to shell=False would definitely break existing projects. I wouldn't personally advocate for that.

I guess I don't know how widespread shell-isms are in people's pipelines, but I wouldn't be surprised if they're quite common.

kenahoo on 12 Aug 2019

👍1

Yes, though to be really useful it would need some way to figure out which artifacts were created under which version of the tool (presumably by putting version numbers in .dvc files)

Maybe the version can be stored somewhere on the .dvc/ meta-dir, not on every .dvc file. I don't think a project can contain artifacts from different versions of dvc.

Changing the default from shell=True to shell=False would definitely break existing projects.

Maybe this can be a configuration option, so that users can set it according to their needs.

dashohoxha on 13 Aug 2019

I'll take --no-shell part. Planning to make pr by Tuesday.

Melevir on 28 Nov 2020

👍1

So, to make it clear, this is what I'll do:

add --no-shell param to CmdRun's parser (dvc.command.run.add_parser);
access this argument to dvc.stage.run.cmd_run;
if specified, add shell=False to subprocess.Poped;
add tests.

That seems it or I'm missing something?

Melevir on 28 Nov 2020

Hi @Melevir !

Thanks for looking into it! The plan regarding run looks pretty good! But, unless I'm missing something, looks like you are forgetting about repro?

access this argument to dvc.stage.run.cmd_run;
if specified, add shell=False to subprocess.Poped;

Note that cmd_run has a bit more complex logic regarding shell (at least for posix), so it will be a little bit more involved than that. Let us know if you'll have any questions :)

efiop on 28 Nov 2020

👍1

I've made a pr for run command only.
Please, review it, then I'll make a new pr for repro command and will make a pr with docs improvements.
Tests and linters are green locally, lets see what CI will say.

Melevir on 30 Nov 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings