The run.input_datasets dictionary is empty, even after passing the dataset into the PythonScriptStep.
input_dataset = Dataset.get_by_name(ws, name='super_secret_data')
cleanStep = PythonScriptStep(
script_name = "clean.py",
inputs = [input_dataset.as_named_input('important_dataset')],
outputs = [output_data],
compute_target = cpu_cluster,
source_directory = experiment_folder
)
run = Run.get_context()
print(run.input_datasets)
input_ds = run.input_datasets['important_dataset']
input_df = input_ds.to_pandas_dataframe()
When the pipeline is run, the log for the clean.py step shows the run.input_datasets object is an empty dict and therefore the script fails with a KeyError.
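A small guard around the lookup can make this failure easier to read in the driver log. The following is only a sketch: `get_named_input` is a hypothetical helper, not part of the azureml SDK, and it assumes `run.input_datasets` behaves like a plain mapping, as the empty dict described above suggests.

```python
from typing import Any, Mapping


def get_named_input(input_datasets: Mapping[str, Any], name: str) -> Any:
    """Return the dataset registered under `name`, or fail with a readable message.

    `input_datasets` is assumed to be a dict-like object such as run.input_datasets.
    """
    try:
        return input_datasets[name]
    except KeyError:
        # List the names that were actually passed in, or a marker when there are none.
        available = sorted(input_datasets) or ["<none>"]
        raise KeyError(
            f"Named input {name!r} not found; available inputs: {', '.join(available)}"
        ) from None
```

In clean.py this could replace the direct indexing, for example `input_ds = get_named_input(run.input_datasets, 'important_dataset')`. It does not fix the empty dictionary; it only reports which names actually reached the script.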
@colbyford
strange... Is it a Tabular or File Dataset? cc: @MayMSFT
It's a Tabular Dataset in a blob Datastore
spitball ideas while we wait for @MayMSFT to respond:
azureml-defaults and azureml-dataprep in cpu_cluster's pip_dependencies?
input_dataset and .as_named_input()?
@swanderz - Yes, the pip dependencies are there, and the space was a GitHub-only typo on my part; the space is not in the source code. Will edit it out here.
what does the 70_driver_log.txt look like? Does DatasetContextManager have anything interesting to say?
Can you share your azureml SDK and azureml-dataprep versions as well? Thanks.
Here is 70_driver_log.txt
```
bash: /azureml-envs/azureml_1b417bb747e35859ebf611fb43071e9c/lib/libtinfo.so.5: no version information available (required by bash)
bash: /azureml-envs/azureml_1b417bb747e35859ebf611fb43071e9c/lib/libtinfo.so.5: no version information available (required by bash)
Starting the daemon thread to refresh tokens in background for process with pid = 153
Entering Run History Context Manager.
{}
The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.0030050277709960938 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 153
Traceback (most recent call last):
File "clean.py", line 24, in <module>
df = run.input_datasets['important_dataset']
File "/azureml-envs/azureml_1b417bb747e35859ebf611fb43071e9c/lib/python3.6/site-packages/azureml/core/run.py", line 2205, in __getitem__
return super().__getitem__(key)
KeyError: 'important_dataset'
2019/12/19 15:03:08 MPI Publisher Details : Intel(R) MPI Library for Linux* OS, Version 2018 Update 3 Build 20180411 (id: 18329)
Copyright 2003-2018 Intel Corporation.
2019/12/19 15:03:08 MPI Publisher: intel ; Publisher version: 2018
```
@colbyford - Could you find the versions of AML that were used for these runs?
The version of azureml.core is 1.0.79
This is the only part of the SDK that we're using. (No AutoML, etc.)
I've just tested to double confirm. It works fine. @rongduan-zhu what information do you need to help debug this? or we can schedule a video call to help the customer debug
@ezwiefel Based on the driver log, it looks like the code that was supposed to set up input_datasets is not run. Can you please paste the code that shows how you set up the conda dependencies? I don't see you passing in a run configuration, which is where you would specify the conda dependencies, to the PythonScriptStep.
@colbyford would have that code.
Since we edited the notebook to get this working and then I ported the working version over to the client's other workspace, I can't seem to find the version with the issue(s).
We didn't pass in any conda dependencies at this point (as it wasn't necessary).
We can probably just close this issue at this point. We can keep you posted if we run into this issue again, though.
@ezwiefel
We will now proceed to close this thread. If there are further questions regarding this matter, please respond here and tag @YutongTie-MSFT, and we will gladly continue the discussion.
Hi,
I am facing the same issue. I am using TabularDataset.
Installed below dependencies:
env = Environment('my_env')
cd = CondaDependencies.create(pip_packages=['tensorflow==1.12.0','keras==2.2.4','azureml-sdk','azureml-defaults','matplotlib', 'scikit-learn', 'azureml-dataprep[pandas,fuse]>=1.1.14'])
env.python.conda_dependencies = cd
est = TensorFlow(source_directory=script_folder,
script_params=script_params,
compute_target=compute_target,
inputs=[ds.as_named_input('my_data')],
entry_script='keras_lstm.py',
environment_definition= env)
Script:
dataset = run.input_datasets["my_data"]
Error:
return super().__getitem__(key)
KeyError: 'my_data'
Could someone please share a solution, if there is one?
Thanks,
SJ
Hi, could you please look at the code in the comments above? I have added the dependencies mentioned in https://github.com/Azure/MachineLearningNotebooks/issues/707#issuecomment-567585408.
@YutongTie-MSFT - Given @joshisn26 has the same issue with more details, please reopen this issue.
@joshisn26 what version of the azureml-core do you have installed locally? You can check this by doing
from azureml.core import VERSION
print(VERSION)
The reason I am asking is that when you create a conda dependency with CondaDependencies.create, it pins the azureml packages to the versions installed locally by default. You can change that by setting the pin_sdk_version parameter to False, which makes it install the latest versions of the azureml packages (unless you explicitly set a version or version range). If possible, please also change the name of the Environment to force the creation of a new environment, since it might be using an old cached Environment that has old versions of the azureml packages installed.
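Since the default pinning ties the remote environment to whatever SDK version is installed locally, it can help to compare the local version against the one from a run that works. This is a minimal sketch, assuming plain dotted numeric version strings like those reported in this thread; `parse_version` and `is_older` are hypothetical helpers, not part of azureml.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '1.0.74' into the tuple (1, 0, 74)."""
    return tuple(int(part) for part in version.split("."))


def is_older(local_version: str, reference_version: str) -> bool:
    """Return True when local_version sorts before reference_version.

    Comparing integer tuples avoids the pitfalls of comparing the raw strings,
    where '1.0.9' would sort after '1.0.85'.
    """
    return parse_version(local_version) < parse_version(reference_version)


# Versions mentioned in this thread: a Notebook VM on 1.0.74 versus a newer one on 1.0.81.
print(is_older("1.0.74", "1.0.81"))
```

The value printed by `from azureml.core import VERSION` could be passed as `local_version`.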
This is the code snippet I used, and I was unable to repro the error:
conda = CondaDependencies.create(
pip_packages=['tensorflow==1.12.0', 'azureml-sdk', 'azureml-dataprep[fuse,pandas]', 'azureml-telemetry'],
pin_sdk_version=False
)
env = Environment('github_707')
env.python.conda_dependencies = conda
inputs = [
Dataset.Tabular.from_delimited_files((dstore, 'titanic.csv')).as_named_input('test')
]
tf = TensorFlow('.', entry_script='test.py', compute_target=ComputeTarget(ws, 'mlc'), environment_definition=env,
inputs=inputs)
test.py
from azureml.core import Run
print(Run.get_context().input_datasets['test'])
70_driver_log.txt
TabularDataset
{
"source": [
"('workspaceblobstore', 'titanic.csv')"
],
"definition": [
"GetDatastoreFiles",
"ParseDelimited",
"DropColumns",
"SetColumnTypes"
],
"registration": {
"id": "bdd57c83-589f-4bf3-8b17-0076981a9eae",
"name": null,
"version": null,
"workspace": "Workspace.create(name='', subscription_id='', resource_group='')"
}
}
@rongduan-zhu Thanks for the reply! I am running the code on a Notebook VM, and the azureml.core version is 1.0.74.
As suggested above, I did the following step by step:
But I am still getting the same error. Here's the code:
env = Environment('new_env')
cd = CondaDependencies.create(pip_packages=['tensorflow','keras','azureml-sdk','azureml-telemetry','matplotlib', 'scikit-learn', 'azureml-dataprep[pandas,fuse]'], pin_sdk_version=False)
env.python.conda_dependencies = cd
@joshisn26 I am still unable to repro the error even using a notebook VM (however my SDK version is 1.0.81 since it's a new notebook VM). Here is the notebook cell:
from azureml.core import Workspace, Dataset, Datastore, ComputeTarget, Environment, Experiment
from azureml.core.runconfig import CondaDependencies
from azureml.train.dnn import TensorFlow
conda = CondaDependencies.create(pip_packages=['tensorflow','keras','azureml-sdk','azureml-telemetry','matplotlib', 'scikit-learn', 'azureml-dataprep[pandas,fuse]'], pin_sdk_version=False)
env = Environment('github_707_2')
env.python.conda_dependencies = conda
ws = Workspace.create(name='centraleuap', subscription_id='35f16a99-532a-4a47-9e93-00305f6c40f2', resource_group='rongduan-dev', exist_ok=True)
dstore = Datastore(ws, 'workspaceblobstore')
inputs = [
Dataset.Tabular.from_delimited_files((dstore, 'titanic.csv')).as_named_input('test')
]
tf = TensorFlow('.', entry_script='test.py', compute_target=ComputeTarget(ws, 'mlc'), environment_definition=env,
inputs=inputs)
exp = Experiment(ws, 'github_707')
run = exp.submit(tf)
run
Can you please share the image build log and driver log of your failed run?
@joshisn26 bump. Can you please share the image build log and driver log of your failed run?
I've got a reproducible (_hopefully_) example for this. Clone the repo to get the conda yml and python script. My theory is that it doesn't like the 4-letter environment variable for some reason...
https://github.com/swanderz/MachineLearningNotebooks/blob/empty_input_datasets/how-to-use-azureml/work-with-data/datasets-tutorial/pipeline-with-datasets/dataset_reprex.ipynb
70_driver_log.txt
Azure ML SDK Version: 1.0.85
Azure ML dataprep Version: 1.1.38
is this none? nrows = <class 'NoneType'>
run details:
~~~redacted~~~
input_datasets:
[]
The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.25638699531555176 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 153
Traceback (most recent call last):
File "extract.py", line 107, in <module>
dataset = run.input_datasets['dsIM']
File "/azureml-envs/azureml_ef603ddd7729973f7484ef984f5589d2/lib/python3.6/site-packages/azureml/core/run.py", line 2205, in __getitem__
return super().__getitem__(key)
KeyError: 'dsIM'
@YutongTie-MSFT please reopen!
Thanks, Anders. This was caused by a bug in our code: passing a string that contains a capital letter to dataset.as_named_input() causes the error. We will fix it in the Feb 17 release. The current workaround is to use lowercase letters only.
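The workaround can be applied in one place so that the pipeline definition and the step script agree on the key. A minimal sketch, assuming the only problem is the capital letters described above; `lowercase_input_name` is a hypothetical helper, not an azureml function.

```python
def lowercase_input_name(name: str) -> str:
    """Return the name in lowercase, for use as a named-input key."""
    return name.lower()


# The same helper would be used on both sides, for example (not executed here):
#   dataset.as_named_input(lowercase_input_name("dsIM"))    # pipeline definition
#   run.input_datasets[lowercase_input_name("dsIM")]        # inside the step script
print(lowercase_input_name("dsIM"))
```

Names that are already lowercase, such as `important_dataset`, pass through unchanged.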
I'm also having this issue (Azure ML SDK Version: 1.6.0)
run.get_details()['inputDatasets'] shows the datasets that I gave as inputs
run.input_datasets is {}
run.register_model() registers the model _without_ reference to the input datasets.
The above happens regardless of whether the run is local or on a compute instance in Azure.
@MayMSFT apologies if this is not the correct place to raise this issue.
@grjzwaan can you help me understand the issue here? Did your experiment run successfully? Is the issue that the model is registered without reference to the input dataset?
@MayMSFT
Yes, the experiment runs successfully and the problem is that the model is registered _without_ reference to the input dataset.
@grjzwaan that's a missing feature. For now, you have to manually register the model with its datasets: run.register_model(datasets=(...))
We will add this feature request to our roadmap to automatically register models with their input and output datasets. Thanks for your feedback!
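The manual registration could look roughly like the sketch below. It assumes that the `datasets` argument of `run.register_model` takes a list of (name, dataset) tuples. The helper names, the model name, the model path, and the dataset key are all hypothetical, and `StandInRun` is only a stand-in so that the example runs without a workspace.

```python
from typing import Any, List, Mapping, Tuple


def build_dataset_links(named_datasets: Mapping[str, Any]) -> List[Tuple[str, Any]]:
    """Convert {name: dataset} into a list of (name, dataset) tuples."""
    return [(name, dataset) for name, dataset in named_datasets.items()]


def register_model_with_datasets(run: Any, model_name: str, model_path: str,
                                 named_datasets: Mapping[str, Any]) -> Any:
    """Call run.register_model and pass the datasets explicitly."""
    return run.register_model(
        model_name=model_name,
        model_path=model_path,
        datasets=build_dataset_links(named_datasets),
    )


class StandInRun:
    """Stand-in for azureml.core.Run; it just echoes the arguments it receives."""

    def register_model(self, model_name, model_path=None, datasets=None):
        return {"model_name": model_name, "model_path": model_path, "datasets": datasets}


# Hypothetical names; with a real run, the values would be Dataset objects.
result = register_model_with_datasets(
    StandInRun(), "my_model", "outputs/model.pkl", {"training_data": "dataset-object"}
)
print(result["datasets"])
```

With a real run, `Run.get_context()` would take the place of `StandInRun()`, and the mapping would hold the datasets used for training.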
@MayMSFT Ok, thanks!
Perhaps it would be good to add a note on run.input_datasets saying that the attribute remains empty?
The first thing I tried was to use this information to register the dataset to the model.