Machinelearningnotebooks: Ambiguity on PythonScriptStep's allow_reuse parameter

Created on 6 Apr 2019 · 6 Comments · Source: Azure/MachineLearningNotebooks

What exactly does "settings/inputs" mean in this scenario?
For example, if allow_reuse=True, will a new run be generated if I change:

the code of the underlying Python script provided to the script_name parameter,
or a script_arg?

allow_reuse
bool
Whether the step should reuse previous results when run with the same settings/inputs. If this is false, a new run will always be generated for this step during pipeline execution.

I might be wrong, but I think in both cases, a new run will not be generated. This is rather frustrating when developing a pipeline with many steps...

Document Details

⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

ID: 2836b6f1-7d38-6091-e67a-1a4305fc8f4d
Version Independent ID: 4153252b-fa22-0ccd-b9bd-a460f422339f
Content: azureml.pipeline.steps.python_script_step.PythonScriptStep class - Azure Machine Learning Python
Content Source: AzureML/docs-ref-autogen/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.PythonScriptStep.yml
Service: machine-learning
Sub-service: core
GitHub Login: @j-martens
Microsoft Alias: jmartens

Source

swanderz

Most helpful comment

By default, a pipeline step is reused when the script specified with script_name, the step's inputs, and the step's parameters are all unchanged: the output of the previous step run is used instead of running the step again. When a step is reused, the job is not submitted to the compute target; the results from the previous run are immediately available to any subsequent steps.
Azure Machine Learning Pipelines provide ways to control and alter this behavior.
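
As a rough mental model of the reuse rule described above, the sketch below keys a step's result on a hash of its script, inputs, and arguments. This is only an illustration: `step_fingerprint` and `ReuseCache` are hypothetical names, and the real service-side logic is not shown in this thread and may differ.

```python
import hashlib
import json


def step_fingerprint(script_text, inputs, arguments):
    """Hash the things that must stay the same for a step to be reused."""
    payload = json.dumps(
        {"script": script_text, "inputs": sorted(inputs), "arguments": list(arguments)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ReuseCache:
    """Toy stand-in for the record of earlier step runs (hypothetical)."""

    def __init__(self):
        self._results = {}

    def run_step(self, script_text, inputs, arguments, allow_reuse=True):
        key = step_fingerprint(script_text, inputs, arguments)
        if allow_reuse and key in self._results:
            # Nothing is submitted to compute; the earlier result is handed on.
            return "reused", self._results[key]
        result = f"output-{len(self._results) + 1}"  # pretend the step ran
        self._results[key] = result
        return "executed", result
```

In this model, calling `run_step` twice with identical values executes once and then reuses, while changing the script text, an input, or an argument produces a new key and a fresh execution.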

### allow_reuse Flag
You can specify allow_reuse=False as a parameter of the Step. When allow_reuse is set to False, the step run won’t be reused, and a new run will always be generated for the step during pipeline execution. Default behavior of Pipelines is to set allow_reuse=True for steps.

step = PythonScriptStep(name="Hello World",
                         script_name="hello_world.py", 
                         compute_target=aml_compute, 
                         source_directory= source_directory,
                         allow_reuse=False
                        )

### regenerate_outputs Flag
If regenerate_outputs is set to True in the Experiment.submit() call, that submission forces regeneration of all step outputs and disallows data reuse for every step of the run. Once the run is complete, however, subsequent runs may reuse its results. By default, Pipelines set regenerate_outputs=False for experiment submit calls.

exp = Experiment(ws, 'Hello_World')
pipeline_run = exp.submit(pipeline, regenerate_outputs=False)
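
One way to read how the two flags combine, based only on the descriptions in this comment, is the small decision function below. `should_reuse` is a hypothetical helper written for illustration, not part of the Azure ML SDK.

```python
def should_reuse(allow_reuse, regenerate_outputs, has_matching_previous_run):
    """Return True if a step's earlier result would be reused (illustrative only)."""
    if regenerate_outputs:
        # Run-level switch: every step of this submission is regenerated.
        return False
    if not allow_reuse:
        # Step-level switch: this step always gets a new run.
        return False
    # Otherwise reuse depends on whether an unchanged earlier run exists.
    return has_matching_previous_run
```

Under this reading, regenerate_outputs=True overrides allow_reuse=True for the current submission only.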

### hash_paths Parameter
By default, only the main script file is hashed. In addition to the default hashing behavior of the script, you can specify files or directories to be included in the hash calculations via the hash_paths parameter. Paths specified here can be absolute paths or relative paths to the source_directory. To include all contents of source_directory, specify hash_paths='.'
With the additional parameter as specified below, each pipeline run tracks changes when either the script or the notebook hello_world.ipynb changes.

step = PythonScriptStep(name="Hello World",
                         script_name="hello_world.py", 
                         compute_target=aml_compute, 
                         source_directory= source_directory,
                         allow_reuse=False,
                         hash_paths=['hello_world.ipynb']
                        )
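
To make the idea of extra hashed paths concrete, here is a minimal sketch that hashes the main script plus any additional files or directories, resolving relative paths against source_directory. `snapshot_hash` is a hypothetical function; how the service actually computes its hash is an assumption here.

```python
import hashlib
from pathlib import Path


def snapshot_hash(source_directory, script_name, hash_paths=()):
    """Hash the main script plus any extra files or directories (illustrative)."""
    if isinstance(hash_paths, str):
        hash_paths = [hash_paths]
    root = Path(source_directory)
    files = {(root / script_name).resolve()}
    for entry in hash_paths:
        path = Path(entry)
        if not path.is_absolute():
            path = root / path  # relative paths are relative to source_directory
        if path.is_dir():
            files.update(p.resolve() for p in path.rglob("*") if p.is_file())
        else:
            files.add(path.resolve())
    digest = hashlib.sha256()
    for path in sorted(files):  # stable order so the hash is deterministic
        digest.update(str(path).encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

With only the script hashed, editing hello_world.ipynb leaves the hash unchanged; with hash_paths=['hello_world.ipynb'] or hash_paths='.', the same edit changes it.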

sanpil on 9 Apr 2019

❤2 👍1

All 6 comments

@sanpil ok this is great information. thank you.

If I do the following, will the step be reused the second time I run the pipeline?

define the PythonScriptStep below
submit said step as part of a PipelineRun for the first time, and it completes without error
change the code inside of hello_world.py
run the pipeline again

hello_step = PythonScriptStep(name="Hello World",
                         script_name="hello_world.py", 
                         compute_target=aml_compute, 
                         source_directory= source_directory,
                         allow_reuse=True)

If it does not re-run, I would argue that this behavior should be documented in the allow_reuse parameter's description.

swanderz on 10 Apr 2019

In the above scenario, if you make a change to hello_world.py then rerun the pipeline, the step WILL NOT be reused (it will re-run). If you see otherwise, please provide the Pipeline Run ID and Step Run ID and we will take a look.

sanpil on 10 Apr 2019

Hello,
I would like to know what happens when the input data to a step changes, e.g. it grows from X rows to Y rows, while allow_reuse=True. I have seen the step not re-run and instead use the previous step run's result. Is this expected? I would expect the step to re-run because the input data has changed, even though the input data file name is the same.

gargiulman on 9 Aug 2019

👍1

If the data is in a datastore, we would not be able to detect the data change. If the data is uploaded as part of the snapshot (under source_directory) [this is not recommended though], then the hash will change and will trigger a rerun.
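
Since the first comment says a step is reused only while its parameters stay the same, one possible workaround is to pass a fingerprint of the data as a script argument, so that a data change also changes the step's arguments. The sketch below assumes a local copy of the data is available to hash; `file_fingerprint`, `step_arguments`, and the `--data-fingerprint` flag are hypothetical names, and this approach is not confirmed anywhere in this thread.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path, chunk_size=1 << 20):
    """Return a SHA-256 hex digest of the file contents, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def step_arguments(data_path):
    """Build a script argument list that changes whenever the data changes."""
    # "--data-fingerprint" is a made-up flag; the step script would need to accept it.
    return [
        "--data-path", str(data_path),
        "--data-fingerprint", file_fingerprint(data_path),
    ]
```

The resulting list could then be supplied as the step's script arguments; the script itself can ignore the fingerprint value.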

sanpil on 9 Aug 2019

On this same note, it's confusing that the wording changes between docs. The main how-to guides, and even the comments here, say that if the script changes, the pipeline will not reuse the previous results.

The actual behavior seems to be that if the snapshot changes, the previous results will not be reused, as stated in the Remarks section: https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.pythonscriptstep?view=azure-ml-py#remarks

This behavior makes sense, but the documentation is inconsistent, which makes it confusing.