Machinelearningnotebooks: DataReference objects as step inputs even when they're functionally outputs

Created on 5 Mar 2020 · 11 comments · Source: Azure/MachineLearningNotebooks

I've noticed a strange phenomenon and haven't been able to find an explanation. When I try to generate a DataReference output from a PythonScriptStep, I have to set it as an input rather than an output to the step itself. This is not the case, however, when I use PipelineData objects. Why is this?

For example:

data_ref_output = DataReference(datastore, data_reference_name=data_ref_name, path_on_datastore=ds_path)

my_step = PythonScriptStep(name="some_name", 
     script_name="my-script.py", 
     arguments=["--data_ref_output", data_ref_output],
     source_dir="./main",
     inputs=[data_ref_output],
     # etc...
)

Pipelines cxp product-question triaged

nicolemhfarley

🚀2

All 11 comments

Very strange, @nicolemhfarley. Do you mean that if you pass data_ref_output to the outputs param of my_step, it won't work? cc: @MayMSFT

swanderz on 10 Mar 2020

PipelineData is for output. Every time you run the pipeline, it generates a new folder in your datastore to store the output files. So instead of defining the output as a DataReference, you should define it as a PipelineData object and pass it in the outputs parameter.
doc: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline

MayMSFT on 10 Mar 2020

@nicolemhfarley We will now proceed to close this thread. If there are further questions regarding this matter, please tag me in your reply. We will gladly continue the discussion and we will reopen the issue.

ram-msft on 11 Mar 2020

@swanderz Yes.

@MayMSFT @ram-msft I am aware of PipelineData objects and their use. In fact, I use them frequently for data transfer between steps. Your response saying that I should just use a PipelineData object doesn't address the issue that I raised. I am specifying a specific output location on a dedicated Datastore so that other groups can consume these data for purposes that are completely independent of the platform. This is very easy to do using a DataReference, with the strange caveat that it has to be specified as an input to the step producing said data. As far as I am aware, such a thing is not possible using PipelineData. Is that not the case?

nicolemhfarley on 13 Mar 2020

🚀2

@sanpil can we specify a target path for PipelineData? Looking at the API reference, it doesn't look like it's supported. Any reason why?

MayMSFT on 13 Mar 2020

We can't. In the current implementation, the system generates the path itself so that step outputs can be reused.

sanpil on 13 Mar 2020

@sanpil, in @nicolemhfarley's case above, what would be the suggested solution for writing the same data (produced by a pipeline step) to two different locations? One location is a PipelineData object (used to define port binding between steps and logged in a run-specific artifacts path). The other is somewhere on a completely separate Azure Blob, unrelated to AzureML, that is used by other users in the business.

Is there a preferred way to tackle the second location that I mentioned above? Writing the output of a step in a separate blob? What raises an issue is that even though technically the defined data reference is an output path, it still has to be fed into the step in the inputs.

jadhosn on 15 Mar 2020

🚀2

@ram-msft can you please re-open this issue as it's not solved yet?

jadhosn on 15 Mar 2020

On our team, we have a DataTransferStep for each PipelineData that we'd like to persist outside of the pipeline. We use ds_pipeline to hold all the PipelineData objects and ds_output_blob to persist a separate consolidated list of artifacts for each repo. @nicolemhfarley @jadhosn below is an example of a very common pattern we use. Hopefully this helps...

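# Imports for the classes used below (azureml-sdk v1).
from azureml.core import Datastore
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import DataTransferStep, PythonScriptStep

# ws, ds_path, data_factory_compute and pipeline_reuse are assumed to be defined elsewhere.
ds_pipeline = Datastore.get(ws, datastore_name='pipeline_blob_container')
ds_output_blob = Datastore.get(ws, datastore_name='output_blob_container')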
output_spot = DataReference(
    data_reference_name='output_data',
    datastore=ds_output_blob,
    path_on_datastore=ds_path,
    overwrite=True)

data_ref_output = PipelineData('data_ref_output', datastore=ds_pipeline)

my_step = PythonScriptStep(name="some_name", 
     script_name="my-script.py", 
     arguments=["--data_ref_output", data_ref_output],
     source_dir="./main",
     outputs=[data_ref_output],
     # etc..
)

transfer_results_step = DataTransferStep(
    name="transfer_results",
    source_data_reference=data_ref_output,
    destination_data_reference=output_spot,
    source_reference_type='directory',
    destination_reference_type='directory',
    compute_target=data_factory_compute,
    allow_reuse=pipeline_reuse)
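
For completeness, here is a minimal sketch of what the script side (my-script.py) might look like. It is not taken from this thread. It assumes that the value passed after --data_ref_output arrives in the script as a plain directory path that the step can write to. The helper write_results, the file name results.csv and the placeholder rows are hypothetical and only there for illustration.

```python
import argparse
import os


def write_results(output_dir, rows, filename="results.csv"):
    """Write rows as comma-separated lines into output_dir; return the file path."""
    # The output folder may not exist yet, so create it before writing.
    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, filename)
    with open(out_path, "w") as f:
        for row in rows:
            f.write(",".join(str(value) for value in row) + "\n")
    return out_path


def main(argv=None):
    parser = argparse.ArgumentParser()
    # Same flag name as in the step's `arguments` list.
    parser.add_argument("--data_ref_output", required=True)
    args = parser.parse_args(argv)
    # Placeholder data (hypothetical); a real step would compute its results here.
    rows = [("id", "score"), (1, 0.5)]
    return write_results(args.data_ref_output, rows)

# In the actual script file, finish with a call to main().
```

With something like this, the script only ever sees a directory path, so the same code should work whether the step binds the argument to a PipelineData output or to a DataReference.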

swanderz on 17 Mar 2020

🚀1 🎉1

Agree with the above suggestion by Anders.

sanpil on 17 Mar 2020

🚀2

@swanderz Nice. Thanks for the tip!

nicolemhfarley on 17 Mar 2020
