Machinelearningnotebooks: Dir structure of PipelineOutputFileDataset?

Created on 5 Feb 2020 · 2 Comments · Source: Azure/MachineLearningNotebooks

@MayMSFT The example given in how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-keras/keras_mnist.py is confusing because it doesn't communicate the directory structure of the PipelineOutputFileDataset.

X_train_path = glob.glob(os.path.join(data_folder, '**/train-images-idx3-ubyte.gz'), recursive=True)[0]

When I try to use PipelineOutputFileDataset, I'm getting this error. Is it a requirement to use glob with a DatasetConsumptionConfig? Or is it that now I have to include the split_data

User program failed with FileNotFoundError:
[Errno 2] No such file or directory: 'DatasetConsumptionConfig:split_data/data.pkl'
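The path in the error appears to be the literal string `DatasetConsumptionConfig:split_data` rather than a mounted directory. A minimal sketch of a defensive lookup for `train.py` follows, assuming the script receives the folder through `--input_dir`. The helper name `find_input_file` is hypothetical and not part of the Azure ML SDK; it only uses the standard library to separate "the argument is not a directory" from "the file is nested deeper than expected".

```python
import glob
import os


def find_input_file(input_dir, filename):
    """Return the first path named `filename` found anywhere under `input_dir`."""
    # An argument that was never resolved to a mount point is not a directory,
    # so fail early with a message that says so.
    if not os.path.isdir(input_dir):
        raise FileNotFoundError(
            f"--input_dir {input_dir!r} is not a directory; "
            "the dataset argument may not have been resolved to a mounted path"
        )
    # With recursive=True, '**' matches zero or more nested directories,
    # so this finds the file at the root or at any depth below it.
    pattern = os.path.join(glob.escape(input_dir), "**", filename)
    matches = sorted(glob.glob(pattern, recursive=True))
    if not matches:
        raise FileNotFoundError(f"{filename!r} not found under {input_dir!r}")
    return matches[0]
```

In `train.py` this could replace the hard-coded join, for example `data = load(find_input_file(args.input_dir, 'data.pkl'))`.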

before `PipelineOutputFileDataset`

This is what is currently working for me.

end of `get_data.py`

joblib.dump(value=data, filename=os.path.join(args.output_dir, "data.pkl"))

beginning of `train.py`

data = load(os.path.join(args.input_dir, 'data.pkl'))

split_data = PipelineData('split_data', datastore=ds_pipeline)

split_step = PythonScriptStep(
    name='Split data',
    script_name='get_data.py',
    arguments=['--input_dir', gold_data,
               '--output_dir', split_data],
    compute_target=compute_target,
    inputs=[gold_data],
    outputs=[split_data],
    runconfig=run_config,
    source_directory=os.path.join(os.getcwd(), 'compute', 'train'),
    allow_reuse=pipeline_reuse
)

# hyperdrive config
est_config_aml = Estimator(
    source_directory=os.path.join(os.getcwd(), 'compute', 'train'),
    entry_script="train.py",
    compute_target=compute_target,
    environment_definition=run_config.environment
)

random_sampling = RandomParameterSampling({
    # 'boosting_type' : choice('gbdt', 'dart'),
    'learning_rate': quniform(0.05, 0.1, 0.01),
    'num_leaves': quniform(4, 50, 1),
    # "max_bin": quniform(50, 300, 5),
    "min_child_samples": quniform(10, 200, 5),
    "colsample_bytree": quniform(0.3, 1, 0.1),
    "subsample": quniform(0.3, 1, 0.1)
})

hyperdrive_run_config = HyperDriveConfig(
    estimator=est_config_aml,  # AML
    hyperparameter_sampling=random_sampling,
    primary_metric_name="geometric mean",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=hyperdrive_runs,
    max_concurrent_runs=8)

hyperdrive_step = HyperDriveStep(
    name='kickoff hyperdrive jobs',
    hyperdrive_config=hyperdrive_run_config,
    estimator_entry_script_arguments=["--input_dir", split_data],
    inputs=[split_data],
    metrics_output=hyperdrive_json,
    allow_reuse=pipeline_reuse
)

attempt to implement `PipelineOutputFileDataset`

train.py is failing with the following error:

User program failed with FileNotFoundError:
[Errno 2] No such file or directory: 'DatasetConsumptionConfig:split_data/data.pkl'

The only changes I made to the above are:

split_data = (
    PipelineData('split_data', datastore=ds_pipeline)
    .as_dataset()
    .register(
        name="ret-holdout-split",
        create_new_version=True)
)

hyperdrive_step = HyperDriveStep(
    name='kickoff hyperdrive jobs',
    hyperdrive_config=hyperdrive_run_config,
    estimator_entry_script_arguments=[
        "--input_dir", split_data.as_named_input('split_data').as_mount()
    ],
    inputs=[],
    metrics_output=hyperdrive_json,
    allow_reuse=pipeline_reuse
)
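To see what directory structure the training script actually receives, one option is to log every file under the input folder before trying to load anything. The sketch below assumes the folder arrives as `args.input_dir`. The helper name `list_tree` is hypothetical and the code uses only the standard library.

```python
import os


def list_tree(root):
    """Return the sorted relative paths of every file under `root`."""
    found = []
    # os.walk yields nothing for a path that does not exist,
    # so an empty result can also mean the path was never mounted.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.append(os.path.relpath(full_path, root))
    return sorted(found)
```

Calling `print(args.input_dir, list_tree(args.input_dir))` at the top of `train.py` writes the resolved path and its contents to the run log, which shows whether `data.pkl` sits at the root or in a subfolder.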

Training product-issue

swanderz

Most helpful comment

Hi, Anders.

Do you have pipeline_reuse set to True?
If so, the pipeline tries to reuse output from the previous run. In your case, it takes the output from the previous run of split_step (without datasets) and tries to reuse it in the current run (with datasets).

We are working on a fix, but to work around the issue for now, could you try one of these:

Change pipeline_reuse to False;
or pass regenerate_outputs=True when submitting the experiment:

# ...
pipeline_run = Pipeline(
    workspace, 
    steps=[get_data_step, hd_step],
    description="test-pipeline"
)
pipeline_run.submit("test-pipeline", regenerate_outputs=True)

myshylin on 10 Feb 2020


> We are working on the fix but to work around the issue now could you try do one of these things:
>
> Change pipeline_reuse to False;
> or pass regenerate_outputs = True when submitting experiment:

@myshylin that worked! Thanks for the help. I know it can be hard to debug without a reprex, so I very much appreciate you lending your brain on this.

swanderz on 10 Feb 2020
