I've noticed a strange phenomenon and haven't been able to find an explanation. When I try to generate a DataReference output from a PythonScriptStep, I have to set it as an input rather than an output to the step itself. This is not the case, however, when I use PipelineData objects. Why is this?
For example:
data_ref_output = DataReference(datastore, data_reference_name=data_ref_name, path_on_datastore=ds_path)
my_step = PythonScriptStep(name="some_name",
                           script_name="my-script.py",
                           arguments=["--data_ref_output", data_ref_output],
                           source_dir="./main",
                           inputs=[data_ref_output],
                           # etc...
                           )
Very strange, @nicolemhfarley. Do you mean that if you pass data_ref_output to the outputs param of my_step, it won't work? cc: @MayMSFT
PipelineData is meant for outputs. Every time you run the pipeline, it generates a new folder in your datastore to store the output files. So instead of defining data_ref_output as a DataReference, you should define it as a PipelineData object and pass it in the outputs parameter.
doc: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline
@nicolemhfarley We will now proceed to close this thread. If there are further questions regarding this matter, please tag me in your reply. We will gladly continue the discussion and we will reopen the issue.
@swanderz Yes.
@MayMSFT @ram-msft I am aware of PipelineData objects and their use. In fact, I use them frequently for data transfer between steps. Your response saying that I should just use a PipelineData object doesn't address the issue that I raised. I am specifying a specific output location on a dedicated Datastore so that other groups can consume these data for purposes that are completely independent of the platform. This is very easy to do using a DataReference, with the strange caveat that it has to be specified as an input to the step producing said data. As far as I am aware, such a thing is not possible using PipelineData. Is that not the case?
@sanpil Can we specify a target path for PipelineData? Looking at the API reference, it doesn't look like it's supported. Any reason why?
We can't. To support reuse, the system generates the path itself in the current implementation.
@sanpil, in @nicolemhfarley's case above, what would be the suggested solution to write the same data (produced by a pipeline step) into two different locations? One location is in a pipeline data object (that is used to define port-binding between steps and is logged in a run specific artifacts path), and another location that is somewhere on a completely separate Azure Blob, not related to AzureML that is being used by other users in the business?
Is there a preferred way to tackle the second location that I mentioned above, i.e. writing the output of a step to a separate blob? The sticking point is that even though the defined data reference is technically an output path, it still has to be passed to the step through inputs.
@ram-msft can you please re-open this issue as it's not solved yet?
On our team, we add a DataTransferStep for each PipelineData that we'd like to persist outside of the pipeline. We use ds_pipeline to hold all the PipelineData objects and ds_output_blob to persist a separate, consolidated set of artifacts for each repo. @nicolemhfarley @jadhosn Below is an example of a very common pattern we use. Hopefully this helps...
from azureml.core import Datastore
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import DataTransferStep, PythonScriptStep

# ws, ds_path, data_factory_compute and pipeline_reuse are defined elsewhere
ds_pipeline = Datastore.get(ws, datastore_name='pipeline_blob_container')
ds_output_blob = Datastore.get(ws, datastore_name='output_blob_container')

# fixed destination outside the pipeline's own storage
output_spot = DataReference(
    data_reference_name='output_data',
    datastore=ds_output_blob,
    path_on_datastore=ds_path,
    overwrite=True)

# intermediate output; its path is generated by the system on each run
data_ref_output = PipelineData('data_ref_output', datastore=ds_pipeline)

my_step = PythonScriptStep(name="some_name",
                           script_name="my-script.py",
                           arguments=["--data_ref_output", data_ref_output],
                           source_dir="./main",
                           outputs=[data_ref_output],
                           # etc..
                           )

# copy the step's output from the pipeline datastore to the fixed destination
transfer_results_step = DataTransferStep(
    name="transfer_results",
    source_data_reference=data_ref_output,
    destination_data_reference=output_spot,
    source_reference_type='directory',
    destination_reference_type='directory',
    compute_target=data_factory_compute,
    allow_reuse=pipeline_reuse)
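For completeness, here is a minimal sketch of what the script side could look like. Both snippets in this thread pass `--data_ref_output` to my-script.py, so the script only has to treat that argument as a directory and write into it. The file name `results.csv`, the placeholder rows, and the helper names `parse_args`, `write_results` and `main` are all hypothetical, and the sketch assumes the path arrives as a plain local directory path on the compute target.

```python
import argparse
import os


def parse_args(argv=None):
    # The argument name matches the one used in the thread's snippets.
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_ref_output", required=True,
                        help="Directory the step should write its results to")
    return parser.parse_args(argv)


def write_results(output_dir, rows, file_name="results.csv"):
    # The output directory may not exist yet, so create it first.
    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, file_name)
    with open(out_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(",".join(str(value) for value in row) + "\n")
    return out_path


def main(argv=None):
    args = parse_args(argv)
    # Placeholder data (hypothetical); a real step would compute its results here.
    rows = [("id", "value"), (1, 0.5), (2, 0.7)]
    return write_results(args.data_ref_output, rows)
```

In the actual script you would call `main()` at the bottom of the file. Because the script only sees a directory path, the same code should work whether the pipeline definition binds that argument to a PipelineData output or to a DataReference.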
Agree with the above suggestion by Anders.
@swanderz Nice. Thanks for the tip!