MachineLearningNotebooks: The `run.input_datasets` dictionary is empty - even after passing into the PythonScriptStep

Created on 19 Dec 2019 · 29 Comments · Source: Azure/MachineLearningNotebooks

The run.input_datasets dictionary is empty - even after passing into the PythonScriptStep.

Pipeline.ipynb

input_dataset = Dataset.get_by_name(ws, name='super_secret_data')

cleanStep = PythonScriptStep(
    script_name = "clean.py",
    inputs = [input_dataset.as_named_input('important_dataset')],
    outputs = [output_data],
    compute_target = cpu_cluster,
    source_directory = experiment_folder
)

clean.py

run = Run.get_context()
print(run.input_datasets)

input_ds = run.input_datasets['important_dataset']
input_df = input_ds.to_pandas_dataframe()

When the pipeline is run, the log for the clean.py step shows the run.input_datasets object is an empty dict and therefore the script fails with a KeyError.
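
A defensive lookup makes this failure easier to read in the step log. The following is a minimal sketch, not part of the original report: `get_named_input` is a hypothetical helper, and it only assumes that `run.input_datasets` behaves like a dictionary, as the traceback in this thread suggests.

```python
from typing import Any, Mapping


def get_named_input(input_datasets: Mapping[str, Any], name: str) -> Any:
    """Return the dataset stored under `name`, or fail with a readable message."""
    if name not in input_datasets:
        # List what is actually there so an empty dict is obvious in the log.
        available = sorted(input_datasets) or ["<none>"]
        raise KeyError(
            f"named input {name!r} not found; available inputs: {', '.join(available)}"
        )
    return input_datasets[name]


# Inside clean.py this would be called as (hypothetical usage):
#   input_ds = get_named_input(run.input_datasets, 'important_dataset')
```

With an empty dictionary the error message reports `<none>` as the available inputs, which separates "nothing was injected" from "the name is misspelled".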

Data4ML Data Prep Services awaiting-product-team-response cxp product-question triaged

All 29 comments

@colbyford

strange... Is it a Tabular or File Dataset? cc: @MayMSFT

It's a Tabular Dataset in a blob Datastore

spitball ideas while we wait for @MayMSFT to respond:

  1. do you have azureml-defaults and azureml-dataprep in cpu_cluster's pip_dependencies?
  2. why do I see a space between input_dataset and .as_named_input()?

@swanderz - Yes, the pip dependencies are there, and the space was a GitHub-only typo on my part; it is not in the source code. Will edit it out here.

what does the 70_driver_log.txt look like? Does DatasetContextManager have anything interesting to say?

can you share your azureml-sdk and azureml-dataprep versions as well? thanks

Here is 70_driver_log.txt
```
bash: /azureml-envs/azureml_1b417bb747e35859ebf611fb43071e9c/lib/libtinfo.so.5: no version information available (required by bash)
bash: /azureml-envs/azureml_1b417bb747e35859ebf611fb43071e9c/lib/libtinfo.so.5: no version information available (required by bash)
Starting the daemon thread to refresh tokens in background for process with pid = 153
Entering Run History Context Manager.
{}

The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.0030050277709960938 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 153
Traceback (most recent call last):
File "clean.py", line 24, in <module>
df = run.input_datasets['important_dataset']
File "/azureml-envs/azureml_1b417bb747e35859ebf611fb43071e9c/lib/python3.6/site-packages/azureml/core/run.py", line 2205, in __getitem__
return super().__getitem__(key)
KeyError: 'important_dataset'

2019/12/19 15:03:08 MPI Publisher Details : Intel(R) MPI Library for Linux* OS, Version 2018 Update 3 Build 20180411 (id: 18329)
Copyright 2003-2018 Intel Corporation.
2019/12/19 15:03:08 MPI Publisher: intel ; Publisher version: 2018
```

@colbyford - Could you find the versions of AML that were used for these runs?

The version of azureml.core is 1.0.79
This is the only part of the SDK that we're using. (No AutoML, etc.)

I've just tested to double confirm. It works fine. @rongduan-zhu what information do you need to help debug this? or we can schedule a video call to help the customer debug

@ezwiefel Based on the driver log, it looks like the code that was supposed to set up input_datasets is not run. Can you please paste the code that shows how you set up the conda dependencies? I don't see you passing in a run configuration, which is where you would specify the conda dependencies, to the PythonScriptStep.
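
One way to repeat this check on a downloaded log is to search it for the dataset setup marker. This is a rough sketch: `lines_mentioning` is a hypothetical helper, the marker string is the `DatasetContextManager` name asked about earlier in this thread, and treating its absence as "dataset setup did not run" is an assumption rather than documented behaviour.

```python
from typing import List


def lines_mentioning(log_text: str, marker: str) -> List[str]:
    """Return the log lines that contain `marker` (case-sensitive)."""
    return [line for line in log_text.splitlines() if marker in line]


# Excerpt of the 70_driver_log.txt posted earlier in this thread.
driver_log = """Starting the daemon thread to refresh tokens in background for process with pid = 153
Entering Run History Context Manager.
{}
"""

hits = lines_mentioning(driver_log, "DatasetContextManager")
print(hits)  # prints [] for the excerpt above
```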

@colbyford would have that code.

Since we edited the notebook to get this working and then I ported the working version over to the client's other workspace, I can't seem to find the version with the issue(s).

We didn't pass in any conda dependencies at this point (as it wasn't necessary).

We can probably just close this issue at this point. We can keep you posted if we run into this issue again, though.

@ezwiefel
We will now proceed to close this thread. If there are further questions regarding this matter, please respond here and @YutongTie-MSFT and we will gladly continue the discussion.

Hi,

I am facing the same issue. I am using a TabularDataset.
I installed the dependencies below:

env = Environment('my_env')
cd = CondaDependencies.create(pip_packages=['tensorflow==1.12.0','keras==2.2.4','azureml-sdk','azureml-defaults','matplotlib', 'scikit-learn', 'azureml-dataprep[pandas,fuse]>=1.1.14'])
env.python.conda_dependencies = cd


est = TensorFlow(source_directory=script_folder,
                script_params=script_params,
                compute_target=compute_target,
                inputs=[ds.as_named_input('my_data')],
                entry_script='keras_lstm.py',
                environment_definition=env)

Script:
dataset = run.input_datasets["my_data"]

Error:

return super().__getitem__(key)
KeyError: 'my_data'

Could someone please share solution if any?

Thanks,
SJ

Hi, could you please look at the code in my comment above? I have added the dependencies mentioned in https://github.com/Azure/MachineLearningNotebooks/issues/707#issuecomment-567585408.

@YutongTie-MSFT - Given @joshisn26 has the same issue with more details, please reopen this issue.

@joshisn26 what version of the azureml-core do you have installed locally? You can check this by doing

from azureml.core import VERSION
print(VERSION)

The reason I am asking is that when you create a conda dependency with CondaDependencies.create, by default it pins the azureml packages to the versions installed locally. You can change that by setting the pin_sdk_version parameter to False, which installs the latest versions of the azureml packages (unless you explicitly set a version or version range). If possible, please also change the name of the Environment to force the creation of a new environment, since it might be using an old cached Environment that has old versions of the azureml packages installed.

This is the code snippet I used, and I was unable to repro the error:

conda = CondaDependencies.create(
    pip_packages=['tensorflow==1.12.0', 'azureml-sdk', 'azureml-dataprep[fuse,pandas]', 'azureml-telemetry'],
    pin_sdk_version=False
)

env = Environment('github_707')
env.python.conda_dependencies = conda

inputs = [
    Dataset.Tabular.from_delimited_files((dstore, 'titanic.csv')).as_named_input('test')
]
tf = TensorFlow('.', entry_script='test.py', compute_target=ComputeTarget(ws, 'mlc'), environment_definition=env,
                    inputs=inputs)

test.py

from azureml.core import Run

print(Run.get_context().input_datasets['test'])

70_driver_log.txt

TabularDataset
{
  "source": [
    "('workspaceblobstore', 'titanic.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ],
  "registration": {
    "id": "bdd57c83-589f-4bf3-8b17-0076981a9eae",
    "name": null,
    "version": null,
    "workspace": "Workspace.create(name='', subscription_id='', resource_group='')"
  }
}

@rongduan-zhu Thanks for the reply! I am running the code on a Notebook VM and the azureml.core version is 1.0.74.
As suggested above, I tried the following, one at a time:

  1. Change environment name
  2. Use pin_sdk_version
  3. Both

But I am still getting the same error. Here's the code:

env = Environment('new_env')
cd = CondaDependencies.create(pip_packages=['tensorflow','keras','azureml-sdk','azureml-telemetry','matplotlib', 'scikit-learn', 'azureml-dataprep[pandas,fuse]'], pin_sdk_version=False)
env.python.conda_dependencies = cd

@joshisn26 I am still unable to repro the error even using a notebook VM (however my SDK version is 1.0.81 since it's a new notebook VM). Here is the notebook cell:

from azureml.core import Workspace, Dataset, Datastore, ComputeTarget, Environment, Experiment
from azureml.core.runconfig import CondaDependencies
from azureml.train.dnn import TensorFlow

conda = CondaDependencies.create(pip_packages=['tensorflow','keras','azureml-sdk','azureml-telemetry','matplotlib', 'scikit-learn', 'azureml-dataprep[pandas,fuse]'], pin_sdk_version=False)

env = Environment('github_707_2')
env.python.conda_dependencies = conda

ws = Workspace.create(name='centraleuap', subscription_id='35f16a99-532a-4a47-9e93-00305f6c40f2', resource_group='rongduan-dev', exist_ok=True)
dstore = Datastore(ws, 'workspaceblobstore')
inputs = [
    Dataset.Tabular.from_delimited_files((dstore, 'titanic.csv')).as_named_input('test')
]
tf = TensorFlow('.', entry_script='test.py', compute_target=ComputeTarget(ws, 'mlc'), environment_definition=env,
                inputs=inputs)
exp = Experiment(ws, 'github_707')
run = exp.submit(tf)
run

Can you please share the image build log and driver log of your failed run?

@joshisn26 bump. Can you please share the image build log and driver log of your failed run?

I've got a reproducible (_hopefully_) example for this. Clone the repo to get the conda yml and python script. My theory is that it doesn't like the 4-letter environment variable for some reason...
https://github.com/swanderz/MachineLearningNotebooks/blob/empty_input_datasets/how-to-use-azureml/work-with-data/datasets-tutorial/pipeline-with-datasets/dataset_reprex.ipynb

70_driver_log.txt

Azure ML SDK Version:  1.0.85
Azure ML dataprep Version:  1.1.38
is this none? nrows =  <class 'NoneType'>
run details:
~~~redacted~~~
input_datasets:
 []


The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.25638699531555176 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 153
Traceback (most recent call last):
  File "extract.py", line 107, in <module>
    dataset = run.input_datasets['dsIM']
  File "/azureml-envs/azureml_ef603ddd7729973f7484ef984f5589d2/lib/python3.6/site-packages/azureml/core/run.py", line 2205, in __getitem__
    return super().__getitem__(key)
KeyError: 'dsIM'

@YutongTie-MSFT please reopen!

thanks Anders. This was caused by a bug in our code. For dataset.as_named_input(), passing a string that contains a capital letter causes the error. We will fix it in the Feb 17 release. The current workaround is to use lowercase letters only.
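
Until the fix ships, the workaround can be applied in one place so that the pipeline and the script agree on the name. A minimal sketch, assuming the same lowercase string is then passed to `as_named_input()` and used as the key into `run.input_datasets`; `safe_input_name` is a hypothetical helper.

```python
def safe_input_name(name: str) -> str:
    """Lowercase a dataset input name to sidestep the capital-letter bug."""
    lowered = name.lower()
    if lowered != name:
        # Make the rename visible so the script side can use the same key.
        print(f"renaming input {name!r} -> {lowered!r}")
    return lowered


name = safe_input_name("dsIM")  # 'dsim'
# Pipeline side (hypothetical usage): dataset.as_named_input(name)
# Script side (hypothetical usage):   run.input_datasets[name]
```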

I'm also having this issue (Azure ML SDK Version: 1.6.0)

  • No errors in 70_driver_log.txt
  • On ml.azure.com the dataset is listed
  • run.get_details()['inputDatasets'] shows the datasets that I gave as inputs
  • run.input_datasets is {}
  • run.register_model() registers the model _without_ reference to the input datasets.

This happens whether I run locally or on a compute instance in Azure.
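
Since `run.get_details()['inputDatasets']` is populated while `run.input_datasets` is empty, the details can serve as a fallback source for the name-to-dataset mapping. The sketch below is only an illustration: `inputs_from_details` is a hypothetical helper, and the entry layout (a `dataset` key plus a `consumptionDetails` dict carrying `inputName`) is an assumption that should be checked against the output of your own SDK version.

```python
from typing import Any, Dict, Mapping


def inputs_from_details(details: Mapping[str, Any]) -> Dict[str, Any]:
    """Map input names to the dataset entries listed under 'inputDatasets'."""
    named: Dict[str, Any] = {}
    for entry in details.get("inputDatasets", []):
        # Assumed layout: {'dataset': ..., 'consumptionDetails': {'inputName': ...}}
        input_name = entry.get("consumptionDetails", {}).get("inputName")
        if input_name is not None:
            named[input_name] = entry.get("dataset")
    return named


# Stand-in for run.get_details(); the values are placeholders, not real output.
fake_details = {
    "inputDatasets": [
        {"dataset": "<dataset object>", "consumptionDetails": {"inputName": "my_data"}},
        {"dataset": "<unnamed dataset>", "consumptionDetails": {}},
    ]
}
named_inputs = inputs_from_details(fake_details)
```

Entries without an input name are skipped instead of being stored under a `None` key.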

@MayMSFT apologies if this is not the correct place to raise this issue.

@grjzwaan can you help me understand the issue here? Did your experiment run successfully? Is the issue that the model is registered without a reference to the input dataset?

@MayMSFT

Yes, the experiment runs successfully and the problem is that the model is registered _without_ reference to the input dataset.

@grjzwaan that's a missing feature. For now you have to register the model with its datasets manually: run.register_model(datasets=(...))

We will add this feature request to our roadmap: automatically registering models with their input and output datasets. Thanks for your feedback!
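
The manual registration can be sketched as follows. This is an illustration under assumptions, not the team's recommended code: the helper names, model name, and path are hypothetical, and the `datasets` argument is built as a list of `(purpose, dataset)` tuples, which is how I understand the SDK's parameter. The thread itself only shows `datasets=(...)`.

```python
from typing import Any, List, Mapping, Tuple


def dataset_refs(named: Mapping[str, Any]) -> List[Tuple[str, Any]]:
    """Turn {purpose: dataset} into the (purpose, dataset) tuples for `datasets`."""
    return [(purpose, dataset) for purpose, dataset in named.items()]


def register_with_datasets(run: Any, model_name: str, model_path: str,
                           named: Mapping[str, Any]) -> Any:
    """Register a model on `run` and attach the given datasets explicitly."""
    return run.register_model(
        model_name=model_name,
        model_path=model_path,
        datasets=dataset_refs(named),
    )


# Hypothetical usage, with `run` being the completed Run and `ds` the dataset:
#   register_with_datasets(run, 'my_model', 'outputs/model.pkl', {'training': ds})
```

Keeping the tuple construction in its own function makes it easy to unit test without a workspace.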

@MayMSFT Ok, thanks!

Perhaps it would be good to add a note next to run.input_datasets saying that the attribute remains empty?

The first thing I tried was to use this information to register the dataset to the model.
