What exactly does "settings/inputs" mean in this scenario?
For example, if allow_reuse = True, will a new run be generated if I change the script_name parameter or the script arguments?
allow_reuse (bool): Whether the step should reuse previous results when run with the same settings/inputs. If this is false, a new run will always be generated for this step during pipeline execution.
By default, when the script specified via script_name, the inputs, and the parameters of a step remain the same, the output of a previous step run is reused instead of running the step again. When a step is reused, the job is not submitted to the compute; instead, the results from the previous run are immediately available to the next step runs.
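To make "settings/inputs" concrete, here is a minimal sketch of a fingerprint-based reuse check. It is a toy model, not the Azure ML implementation: `step_fingerprint`, `should_reuse`, and the `previous_runs` dictionary are hypothetical names, and the real service may hash different things.

```python
import hashlib
import json


def step_fingerprint(script_source: str, inputs: list, params: dict) -> str:
    """Hypothetical fingerprint: hash of script text, input references, and parameters."""
    payload = json.dumps(
        {"script": script_source, "inputs": sorted(inputs), "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_reuse(fingerprint: str, previous_runs: dict, allow_reuse: bool = True) -> bool:
    """Reuse only if reuse is allowed and an identical fingerprint was seen before."""
    return allow_reuse and fingerprint in previous_runs


# Record a first run, then compare later submissions against it.
previous_runs = {}
fp_first = step_fingerprint("print('hello')", ["raw_data"], {"epochs": 10})
previous_runs[fp_first] = "output-of-run-1"

fp_same = step_fingerprint("print('hello')", ["raw_data"], {"epochs": 10})
fp_changed_arg = step_fingerprint("print('hello')", ["raw_data"], {"epochs": 20})
```

In this model an unchanged step matches the stored fingerprint and is reused, while changing a script argument produces a new fingerprint and therefore a new run.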
Azure Machine Learning Pipelines provide ways to control and alter this behavior.
### allow_reuse Flag
You can specify allow_reuse=False as a parameter of the Step. When allow_reuse is set to False, the step run won’t be reused, and a new run will always be generated for the step during pipeline execution. Default behavior of Pipelines is to set allow_reuse=True for steps.
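from azureml.pipeline.steps import PythonScriptStep
# allow_reuse=False forces a new run of this step on every pipeline execution
step = PythonScriptStep(name="Hello World",
                        script_name="hello_world.py",
                        compute_target=aml_compute,
                        source_directory=source_directory,
                        allow_reuse=False)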
### regenerate_outputs Flag
If regenerate_outputs is set to True for the Experiment.submit() call, the submission always forces generation of all step outputs and disallows data reuse for any step of this run. Once this run is complete, however, subsequent runs may reuse the results of this run. Default behavior of Pipelines is to set regenerate_outputs=False for experiment submit calls.
exp = Experiment(ws, 'Hello_World')
pipeline_run = exp.submit(pipeline, regenerate_outputs=False)
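The interplay between a forced regeneration and later reuse can be sketched with a toy runner. This is only an illustration of the semantics described above: `ToyPipelineRunner` and its `submit` method are hypothetical and do not reflect how the service is implemented.

```python
class ToyPipelineRunner:
    """Toy stand-in for a pipeline service; not the Azure ML implementation."""

    def __init__(self):
        self.cache = {}      # step fingerprint -> stored output
        self.executions = 0  # how many times a step actually ran

    def submit(self, fingerprint, compute, regenerate_outputs=False):
        # Reuse is only considered when regeneration is not forced.
        if not regenerate_outputs and fingerprint in self.cache:
            return self.cache[fingerprint], "reused"
        output = compute()
        self.executions += 1
        self.cache[fingerprint] = output  # later submissions may reuse this result
        return output, "executed"


runner = ToyPipelineRunner()
compute = lambda: "hello world"

statuses = [
    runner.submit("step-a", compute)[1],                           # first run
    runner.submit("step-a", compute)[1],                           # unchanged step
    runner.submit("step-a", compute, regenerate_outputs=True)[1],  # forced regeneration
    runner.submit("step-a", compute)[1],                           # reuses the forced run
]
print(statuses)
```

The forced submission executes the step even though a cached result exists, and the submission after it picks up that fresh result.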
### hash_paths Parameter
By default, only the main script file is hashed. In addition to the default hashing behavior of the script, you can specify files or directories to be included in the hash calculations via the hash_paths parameter. Paths specified here can be absolute paths or relative paths to the source_directory. To include all contents of source_directory, specify hash_paths='.'
With the additional parameter as specified below, each pipeline run tracks changes when either the script or the notebook hello_world.ipynb changes.
step = PythonScriptStep(name="Hello World",
                        script_name="hello_world.py",
                        compute_target=aml_compute,
                        source_directory=source_directory,
                        allow_reuse=False,
                        hash_paths=['hello_world.ipynb'])
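As a rough illustration of what including extra paths in the hash means, the sketch below hashes the main script plus any additional files or directories. `snapshot_hash` is a hypothetical helper written for this example; the actual hashing scheme used by the service is not documented in this thread.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot_hash(source_directory, script_name, hash_paths=()):
    """Hash the main script plus any extra files or directories (toy model)."""
    root = Path(source_directory)
    files = {root / script_name}
    for entry in hash_paths:
        path = Path(entry)
        if not path.is_absolute():
            path = root / path  # relative paths resolve against source_directory
        if path.is_dir():
            files.update(p for p in path.rglob("*") if p.is_file())
        else:
            files.add(path)
    digest = hashlib.sha256()
    for path in sorted(files):  # sorted so the result does not depend on set order
        digest.update(str(path).encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "hello_world.py").write_text("print('hello')")
    (root / "hello_world.ipynb").write_text("{}")

    script_only_before = snapshot_hash(root, "hello_world.py")
    tracked_before = snapshot_hash(root, "hello_world.py", ["hello_world.ipynb"])

    (root / "hello_world.ipynb").write_text('{"cells": []}')  # edit only the notebook

    script_only_after = snapshot_hash(root, "hello_world.py")
    tracked_after = snapshot_hash(root, "hello_world.py", ["hello_world.ipynb"])
```

Editing only the notebook leaves the script-only hash unchanged, while the hash that tracks the notebook changes, which is the condition that would trigger a rerun.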
@sanpil ok this is great information. thank you.
If I do the following and then change hello_world.py, will the step be reused the second time I run the pipeline?
hello_step = PythonScriptStep(name="Hello World",
                              script_name="hello_world.py",
                              compute_target=aml_compute,
                              source_directory=source_directory,
                              allow_reuse=True)
If it does not re-run, I would argue that this behavior should be reflected in the allow_reuse parameter's description.
In the above scenario, if you make a change to hello_world.py then rerun the pipeline, the step WILL NOT be reused (it will re-run). If you see otherwise, please provide the Pipeline Run ID and Step Run ID and we will take a look.
Hello,
I would like to know what happens when the input data to a step changes (say it grows from X rows to Y rows) and allow_reuse = True. I have seen the step not re-run and instead use the previous step run's result. Is this expected? I would expect the step to rerun because the input data has changed, even though the input data file name is the same.
If the data is in a datastore, we are not able to detect the data change. If the data is uploaded as part of the snapshot (under source_directory), which is not recommended, then the hash will change and trigger a rerun.
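The difference can be pictured with a small sketch, assuming (purely for illustration) that a datastore input is identified by its path while a snapshotted file is identified by its contents. The function names and the "datastore/data.csv" path are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def reference_fingerprint(datastore_path: str) -> str:
    """Fingerprint of a by-reference input: only the path string is hashed."""
    return hashlib.sha256(datastore_path.encode("utf-8")).hexdigest()


def content_fingerprint(file_path: Path) -> str:
    """Fingerprint of a snapshotted file: the bytes themselves are hashed."""
    return hashlib.sha256(file_path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "data.csv"
    data_file.write_text("id\n1\n2\n")  # X rows

    ref_before = reference_fingerprint("datastore/data.csv")
    content_before = content_fingerprint(data_file)

    data_file.write_text("id\n1\n2\n3\n4\n")  # Y rows, same file name

    ref_after = reference_fingerprint("datastore/data.csv")
    content_after = content_fingerprint(data_file)
```

Under that assumption the by-reference fingerprint is identical before and after the data grows, so nothing signals a rerun, whereas the content fingerprint changes.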
On this same note, it's confusing that the wording/explanation changes between docs. In the main how-to guides and even in the comments it says that if the script changes the pipeline will not reuse the previous results.
It seems the actual behavior is that if the snapshot changes, the previous results will not be reused, as stated in the remarks section: https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.pythonscriptstep?view=azure-ml-py#remarks
This behavior makes sense but the documentation is inconsistent which makes it confusing.