@MayMSFT The example given in how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-keras/keras_mnist.py is confusing because it doesn't communicate the directory structure of the PipelineOutputFileDataset.
X_train_path = glob.glob(os.path.join(data_folder, '**/train-images-idx3-ubyte.gz'), recursive=True)[0]
When I try to use PipelineOutputFileDataset, I'm getting this error. Is it a requirement to use glob with a DatasetConsumptionConfig? Or is it that now I have to include the split_data
User program failed with FileNotFoundError:
[Errno 2] No such file or directory: 'DatasetConsumptionConfig:split_data/data.pkl'
Without PipelineOutputFileDataset
This is what is currently working for me.
get_data.py:
joblib.dump(value=data, filename=os.path.join(args.output_dir, "data.pkl"))
train.py:
data = load(os.path.join(args.input_dir, 'data.pkl'))
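For context, here is a minimal sketch of how the two scripts hand the file to each other, assuming both read their directory from a command-line flag via argparse. It uses the standard-library pickle in place of joblib so it runs without extra packages, and the helper names (parse_dir_arg, write_data, read_data) are hypothetical, not taken from my actual scripts.

```python
import argparse
import os
import pickle
import tempfile


def parse_dir_arg(flag, argv):
    """Parse one directory flag such as --output_dir or --input_dir."""
    parser = argparse.ArgumentParser()
    parser.add_argument(flag, type=str, required=True)
    args, _ = parser.parse_known_args(argv)
    # argparse stores "--output_dir" under the attribute "output_dir".
    return getattr(args, flag.lstrip("-"))


def write_data(data, output_dir):
    """get_data.py side: serialize data to <output_dir>/data.pkl."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "data.pkl")
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return path


def read_data(input_dir):
    """train.py side: load <input_dir>/data.pkl."""
    with open(os.path.join(input_dir, "data.pkl"), "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # A temporary directory stands in for the pipeline's intermediate storage.
    shared_dir = tempfile.mkdtemp()
    out_dir = parse_dir_arg("--output_dir", ["--output_dir", shared_dir])
    write_data({"X": [1, 2, 3]}, out_dir)
    in_dir = parse_dir_arg("--input_dir", ["--input_dir", shared_dir])
    print(read_data(in_dir))
```

The point is only that train.py expects data.pkl to sit directly inside whatever directory --input_dir resolves to.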
split_data = PipelineData('split_data', datastore=ds_pipeline)
split_step = PythonScriptStep(
    name='Split data',
    script_name='get_data.py',
    arguments=['--input_dir', gold_data,
               '--output_dir', split_data],
    compute_target=compute_target,
    inputs=[gold_data],
    outputs=[split_data],
    runconfig=run_config,
    source_directory=os.path.join(os.getcwd(), 'compute', 'train'),
    allow_reuse=pipeline_reuse
)
# hyperdrive config
est_config_aml = Estimator(
    source_directory=os.path.join(os.getcwd(), 'compute', 'train'),
    entry_script="train.py",
    compute_target=compute_target,
    environment_definition=run_config.environment
)
random_sampling = RandomParameterSampling({
    # 'boosting_type' : choice('gbdt', 'dart'),
    'learning_rate': quniform(0.05, 0.1, 0.01),
    'num_leaves': quniform(4, 50, 1),
    # "max_bin": quniform(50, 300, 5),
    "min_child_samples": quniform(10, 200, 5),
    "colsample_bytree": quniform(0.3, 1, 0.1),
    "subsample": quniform(0.3, 1, 0.1)
})
hyperdrive_run_config = HyperDriveConfig(
    estimator=est_config_aml,  # AML
    hyperparameter_sampling=random_sampling,
    primary_metric_name="geometric mean",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=hyperdrive_runs,
    max_concurrent_runs=8)
hyperdrive_step = HyperDriveStep(
    name='kickoff hyperdrive jobs',
    hyperdrive_config=hyperdrive_run_config,
    estimator_entry_script_arguments=["--input_dir", split_data],
    inputs=[split_data],
    metrics_output=hyperdrive_json,
    allow_reuse=pipeline_reuse
)
With PipelineOutputFileDataset
train.py is failing with the following error:
User program failed with FileNotFoundError:
[Errno 2] No such file or directory: 'DatasetConsumptionConfig:split_data/data.pkl'
The only changes I make to the above are:
split_data = (
    PipelineData('split_data', datastore=ds_pipeline)
    .as_dataset()
    .register(
        name="ret-holdout-split",
        create_new_version=True)
)
hyperdrive_step = HyperDriveStep(
    name='kickoff hyperdrive jobs',
    hyperdrive_config=hyperdrive_run_config,
    estimator_entry_script_arguments=[
        "--input_dir", split_data.as_named_input('split_data').as_mount()
    ],
    inputs=[],
    metrics_output=hyperdrive_json,
    allow_reuse=pipeline_reuse
)
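A small debugging sketch that might help narrow this down from inside train.py: it checks that the value received for --input_dir is a real directory on the compute, then searches it recursively the way keras_mnist.py does with glob. The helper name find_file is hypothetical, and this is only a diagnostic aid, not a claim about how the mounted dataset is laid out.

```python
import glob
import os
import tempfile


def find_file(data_folder, filename):
    """Return the first path under data_folder whose basename is filename."""
    if not os.path.isdir(data_folder):
        # A value like 'DatasetConsumptionConfig:split_data' ends up here,
        # because it is a plain string rather than a mounted directory.
        raise FileNotFoundError(
            f"input is not a directory on this compute: {data_folder!r}")
    # '**' with recursive=True matches zero or more nested directories.
    pattern = os.path.join(glob.escape(data_folder), "**", filename)
    matches = sorted(glob.glob(pattern, recursive=True))
    if not matches:
        present = [os.path.join(root, name)
                   for root, _, names in os.walk(data_folder)
                   for name in names]
        raise FileNotFoundError(
            f"{filename!r} not found under {data_folder!r}; "
            f"files present: {present}")
    return matches[0]


if __name__ == "__main__":
    # Simulate an input folder where the file sits one level down.
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "nested"))
    with open(os.path.join(root, "nested", "data.pkl"), "wb"):
        pass
    print(find_file(root, "data.pkl"))
```

If the first check fails, the argument was never resolved to a path, so glob would not help; if only the second fails, the listing shows where the file actually landed.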
Hi, Anders.
Do you have pipeline_reuse set to True?
If so, the pipeline tries to reuse the output from the previous run. In your case, it takes the output from the previous run of split_step (without datasets) and tries to reuse it in the current run (with datasets).
We are working on a fix, but to work around the issue for now, could you try one of these things:
- Change pipeline_reuse to False;
- or pass regenerate_outputs=True when submitting the experiment:
# ...
pipeline_run = Pipeline(
    workspace,
    steps=[get_data_step, hd_step],
    description="test-pipeline"
)
pipeline_run.submit("test-pipeline", regenerate_outputs=True)
@myshylin that worked! Thanks for the help. I know it can be hard to debug without a reprex, so I very much appreciate you lending your brain on this.