Machinelearningnotebooks: You must provide `azureml.pipeline.core.pipeline_output_dataset.PipelineOutputTabularDataset` to create a remote run

Created on 4 Nov 2019 · 6 Comments · Source: Azure/MachineLearningNotebooks

I'm trying to pass an intermediate dataset into AutoMLStep, and it throws an error:
"message": "You must provide a 'data_script' or provide data with azureml.dataprep.Dataflow, azureml.core.Dataset, or azureml.data.DatasetDefinition, or azureml.pipeline.core.pipeline_output_dataset.PipelineOutputTabularDataset to create a remote run"

However, when I make a PipelineData object (prepared_churn_featuredata), the class is AZUREML_DATAREFERENCE_prepared_churn_featuredata.

I was following this file: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-data-dependency-steps.ipynb

```python
datastore = Datastore.get(ws, datastore_name='churn_data')
input_ds = DataReference(datastore, data_reference_name="churn_featuredata")
prepared_churn_featuredata = PipelineData(
    "prepared_churn_featuredata",
    datastore=datastore,
    pipeline_output_name="prepared_churn_featuredata")

ingest_step = PythonScriptStep(
    script_name="churn-automl-dataingest.py",
    arguments=["--input_data", input_ds, "--output_prepped", prepared_churn_featuredata],
    inputs=[input_ds],
    outputs=[prepared_churn_featuredata],
    compute_target=compute_target,
    runconfig=run_config
)

label_y = "Churned_Ind"
automl_config = AutoMLConfig(task='classification',
                             primary_metric='AUC_weighted',
                             compute_target=compute_target,
                             iteration_timeout_minutes=60,
                             iterations=50,
                             training_data=prepared_churn_featuredata,
                             label_column_name=label_y,
                             n_cross_validations=2
                             )

from azureml.pipeline.core import TrainingOutput

metrics_output_name = 'ChemPoint_Churn_Metrics'
best_model_output_name = 'ChemPoint_Churn_Model'

metrics_data = PipelineData(name='metrics_data',
                            datastore=datastore,
                            pipeline_output_name=metrics_output_name,
                            training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                          datastore=datastore,
                          pipeline_output_name=best_model_output_name,
                          training_output=TrainingOutput(type='Model'))

automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    inputs=[prepared_churn_featuredata],
    outputs=[metrics_data, model_data],
    allow_reuse=True)

from azureml.pipeline.core import Pipeline
pipeline = Pipeline(
    description="Churn_AutoML_Pipeline",
    workspace=ws,
    steps=[ingest_step, automl_step])
```

If I instead treat the intermediate dataset like this, then the type is "azureml.pipeline.core.pipeline_output_dataset.PipelineOutputFileDataset", which is close to the requested one, but I don't know how to convert between them. I couldn't even find this class on the internet.

prepared_churn_featuredata = PipelineData(
    "prepared_churn_featuredata",
    datastore=datastore,
    pipeline_output_name="prepared_churn_featuredata").as_dataset()
prepared_churn_featuredata = prepared_churn_featuredata.register(
    name='prepared_churn_featuredata', create_new_version=True)

I've also tried just casting the intermediate data as a dataset, and that didn't work either:

automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    inputs=[prepared_churn_featuredata.as_dataset()],
    outputs=[metrics_data, model_data],
    allow_reuse=True)

lphomiej

All 6 comments

Here is an example of how you can use intermediate data with AutoMLStep:

# Standard-library imports used below; the azureml imports are omitted as in the original snippet.
import logging
from uuid import uuid4

run_config = RunConfiguration()

ds = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')

dstore = Datastore(ws, 'workspaceblobstore')
# Intermediate output, exposed as a file dataset
output_dataset = PipelineData('output', datastore=dstore, output_mode="upload",
                              output_path_on_compute='/tmp/data/').as_dataset()

prep_step = PythonScriptStep(
    script_name="train.py",
    arguments=[ds.as_named_input('train').as_mount('/tmp/{}'.format(uuid4()))],
    outputs=[output_dataset],
    source_directory='.',
    compute_target=compute_target,
    runconfig=run_config
)

automl_settings = {
    "iteration_timeout_minutes": 5,
    "iterations": 1,
    "n_cross_validations": 2,
    "primary_metric": 'AUC_weighted',
    "preprocess": True,
    "max_concurrent_iterations": 5,
    "verbosity": logging.INFO
}

# Promote the file dataset to a tabular one before handing it to AutoML
tds = output_dataset.parse_delimited_files()
tds_x = tds.drop_columns('Survived')
tds_y = Dataset.Tabular.from_delimited_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv').keep_columns('Survived')

automl_config = AutoMLConfig(task='classification',
                             debug_log='automl_errors.log',
                             path='.',
                             X=tds_x,
                             y=tds_y,
                             compute_target=compute_target,
                             **automl_settings)
automl_config._run_configuration = run_config
train_step = AutoMLStep('automl', automl_config, passthru_automl_config=False)

purnesh42H on 6 Nov 2019

Hi @lphomiej, you are very close. To use an intermediate output from a previous step as an AutoML input, you need to convert the intermediate data into a tabular dataset. AutoML works on structured tabular data, whereas intermediate data in a pipeline are just files: nothing has been parsed and no schema is associated with them.
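As a rough analogy in plain Python (not the SDK's implementation), the difference between "just files" and a tabular view can be sketched as follows. The file name `part-0.csv`, the `customer_id` column and the two helper functions are made up for illustration; only `Churned_Ind` comes from the question.

```python
import csv
import os
import tempfile


def file_view(folder):
    """What a file-based output knows: file names and sizes, nothing more."""
    return {name: os.path.getsize(os.path.join(folder, name))
            for name in sorted(os.listdir(folder))}


def tabular_view(folder):
    """Parse every CSV in the folder into rows keyed by column name."""
    rows = []
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name), newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


# Hypothetical intermediate output written by an earlier step.
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "part-0.csv"), "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "Churned_Ind"])
    writer.writerow(["42", "1"])

files = file_view(folder)    # only names and byte counts
rows = tabular_view(folder)  # rows with named columns
print(files)
print(rows)
```

Only the second view has column names that something like a label column could refer to, which is the gap the conversion step fills.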

To convert intermediate data into a tabular dataset, you can use the code below (assuming the files are delimited files):

```python
prepared_churn_featuredata = PipelineData("prepared_churn_featuredata", datastore=datastore, pipeline_output_name="prepared_churn_featuredata")\
    .as_dataset()\
    .parse_delimited_files()

prepared_churn_featuredata = prepared_churn_featuredata.register(name='prepared_churn_featuredata', create_new_version=True)

automl_config = AutoMLConfig(
# other configs,
training_data=prepared_churn_featuredata,
label_column_name='{the label column name}'
)

automl_step = AutoMLStep(name='automl_module', automl_config=automl_config, outputs=[metrics_data, model_data], allow_reuse=True)
```

rongduan-zhu on 6 Nov 2019

@rongduan-zhu, following up on this issue: as discussed above, I'm trying to pass a FileDataset to an AutoML step. Your suggested fix no longer works on the latest stable release, 1.0.85 (most probably because of the switch to the new datasets since November). Can you please share how subsetting a FileDataset would work for the purpose of AutoML? @purnesh42H

jadhosn on 9 Mar 2020

OK, it took a while, but I figured out a working solution, mainly thanks to the updated examples, I have to admit. Great work on those examples!

I'm using AzureML 1.3.0 (for me this did not work with previous versions).

The trick is to create PipelineData(name, store) and call .as_dataset() before passing it to a python step in the outputs!

It's not required to _register_ the PipelineData before passing it to the AutoMLConfig, but it's required to promote the dataset into Tabular, using either parse_parquet_files() or parse_delimited_files(). This can be done just before passing it into AutoMLConfig or when creating the PipelineData.

Note: the taxi examples use the Parquet file format. It is better than CSV for several reasons, mainly preserved data types and smaller file size. But 'pyarrow' is not present in the default runtime. If you don't want to mess with environments, keep using CSV files instead.
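To illustrate the data type point with nothing but the standard library: a CSV round trip turns every value into text, so whatever reads the file later has to infer or be told the column types. The rows below are made up; only the `day_date` column name is borrowed from the solution in this comment.

```python
import csv
import io
from datetime import date

# Hypothetical rows with an int, a float and a date column.
rows = [
    {"trip_id": 1, "fare": 12.5, "day_date": date(2020, 4, 23)},
    {"trip_id": 2, "fare": 7.0, "day_date": date(2020, 4, 24)},
]

# Write the rows to an in-memory CSV ...
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["trip_id", "fare", "day_date"])
writer.writeheader()
writer.writerows(rows)

# ... and read them back: every value is now a plain string.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
restored_types = {key: type(value).__name__ for key, value in restored[0].items()}
print(restored_types)  # {'trip_id': 'str', 'fare': 'str', 'day_date': 'str'}
```

Parquet stores the column types alongside the data, which is why it avoids this step, at the cost of needing pyarrow in the environment.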

Here is my solution:

# For Parquet: requires adding the pyarrow package!
# train_data = PipelineData("train_data", datastore=datastore).as_dataset().parse_parquet_files()

train_data = (
    PipelineData("train_data", datastore=datastore)
        .as_dataset()
        .parse_delimited_files()
)

split_step = PythonScriptStep(
    name="split-train-test",
    source_directory="steps",
    script_name="split_train_test.py",
    arguments=[
        # No need to define input, use: Run.get_context().input_datasets['clean_data']
        "--train_output", train_data,
        "--test_output", test_data
    ],
    inputs=[clean_data],
    outputs=[train_data, test_data],
    compute_target=aml_compute,
)

automl_config = AutoMLConfig(
    # other configs...
    training_data=train_data,  # <<--  PipelineOutputTabularDataset goes in
    label_column_name='day_date'
)

automl_step = AutoMLStep(
    name='automl_module', 
    automl_config=automl_config, 
    outputs=[metrics_data, model_data],
    passthru_automl_config=False,
)
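For completeness, here is a minimal sketch of what a script such as `split_train_test.py` might look like on the writing side. It is not the script used above. It assumes that `--train_output` and `--test_output` resolve to folder paths that the script has to create itself, and it uses placeholder rows instead of loading `clean_data`. The helper names, the file name `data.csv` and the `value` column are hypothetical.

```python
import argparse
import csv
import os
import random
import tempfile


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle deterministically and split into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(folder, header, rows, file_name="data.csv"):
    """Write one delimited file into the output folder passed by the pipeline."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, file_name)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_output", required=True)
    parser.add_argument("--test_output", required=True)
    args = parser.parse_args(argv)

    # Placeholder data; a real step would load the 'clean_data' input here.
    header = ["day_date", "value"]
    rows = [["2020-04-%02d" % d, d] for d in range(1, 11)]

    train, test = split_rows(rows)
    write_csv(args.train_output, header, train)
    write_csv(args.test_output, header, test)


# Local dry run with temporary folders standing in for the pipeline outputs.
base = tempfile.mkdtemp()
main(["--train_output", os.path.join(base, "train"),
      "--test_output", os.path.join(base, "test")])
```

Writing plain delimited files with a header row is what lets the downstream `parse_delimited_files()` call make sense of the output; in a real pipeline run the dry-run lines at the bottom would be replaced by a plain `main()` call.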

abij on 23 Apr 2020

@abij - That's a great summary of your experience! Thanks for providing it!

The team is writing a HOW TO for AutoMLStep (to be published at docs.microsoft.com) so I'll make sure your comments here are also covered in that doc. 👍