Machinelearningnotebooks: The `run.input_datasets` dictionary is empty - even after passing into the PythonScriptStep

Created on 19 Dec 2019 · 29 Comments · Source: Azure/MachineLearningNotebooks

The run.input_datasets dictionary is empty - even after passing into the PythonScriptStep.

Pipeline.ipynb

input_dataset = Dataset.get_by_name(ws, name='super_secret_data')

cleanStep = PythonScriptStep(
    script_name = "clean.py",
    inputs = [input_dataset.as_named_input('important_dataset')],
    outputs = [output_data],
    compute_target = cpu_cluster,
    source_directory = experiment_folder
)

clean.py

run = Run.get_context()
print(run.input_datasets)

input_ds = run.input_datasets['important_dataset']
input_df = input_ds.to_pandas_dataframe()

When the pipeline is run, the log for the clean.py step shows the run.input_datasets object is an empty dict and therefore the script fails with a KeyError.
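To make this failure easier to diagnose from the step log, the lookup can be wrapped so that a missing name reports which named inputs the run actually received. This is a minimal sketch, not part of the SDK: `get_named_input` is a hypothetical helper, and it only assumes that `run.input_datasets` behaves like a dict.

```python
from typing import Any, Mapping


def get_named_input(input_datasets: Mapping[str, Any], name: str) -> Any:
    """Return the dataset registered under `name`, or fail with a readable message.

    `input_datasets` is expected to be `run.input_datasets` (assumed dict-like).
    """
    if name not in input_datasets:
        # List what is actually available so an empty dict is obvious in the log.
        available = sorted(input_datasets) or ["<none>"]
        raise KeyError(
            f"Named input {name!r} not found; available inputs: {', '.join(available)}"
        )
    return input_datasets[name]
```

In clean.py this would be called as `input_ds = get_named_input(run.input_datasets, 'important_dataset')`.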

Data4ML Data Prep Services awaiting-product-team-response cxp product-question triaged

ezwiefel

All 29 comments

@colbyford

ezwiefel on 19 Dec 2019

strange... Is it a Tabular or File Dataset? cc: @MayMSFT

swanderz on 19 Dec 2019

It's a Tabular Dataset in a blob Datastore

colbyford on 19 Dec 2019

spitball ideas while we wait for @MayMSFT to respond:

- do you have azureml-defaults and azureml-dataprep in cpu_cluster's pip_dependencies?
- why do I see a space between input_dataset and .as_named_input()?

swanderz on 19 Dec 2019

@swanderz - Yes, the pip dependencies are there, and the space was a GitHub-only typo on my part; it is not in the source code. I will edit it out here.

ezwiefel on 19 Dec 2019

what does the 70_driver_log.txt look like? Does DatasetContextManager have anything interesting to say?

swanderz on 19 Dec 2019

Can you share your azureml-sdk and azureml-dataprep versions as well? Thanks.

MayMSFT on 20 Dec 2019

Here is 70_driver_log.txt
```
bash: /azureml-envs/azureml_1b417bb747e35859ebf611fb43071e9c/lib/libtinfo.so.5: no version information available (required by bash)
bash: /azureml-envs/azureml_1b417bb747e35859ebf611fb43071e9c/lib/libtinfo.so.5: no version information available (required by bash)
Starting the daemon thread to refresh tokens in background for process with pid = 153
Entering Run History Context Manager.
{}

The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.0030050277709960938 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 153
Traceback (most recent call last):
  File "clean.py", line 24, in <module>
    df = run.input_datasets['important_dataset']
  File "/azureml-envs/azureml_1b417bb747e35859ebf611fb43071e9c/lib/python3.6/site-packages/azureml/core/run.py", line 2205, in __getitem__
    return super().__getitem__(key)
KeyError: 'important_dataset'

2019/12/19 15:03:08 MPI Publisher Details : Intel(R) MPI Library for Linux* OS, Version 2018 Update 3 Build 20180411 (id: 18329)
Copyright 2003-2018 Intel Corporation.
2019/12/19 15:03:08 MPI Publisher: intel ; Publisher version: 2018
```

@colbyford - Could you find the versions of AML that were used for these runs?

ezwiefel on 20 Dec 2019

The version of azureml.core is 1.0.79
This is the only part of the SDK that we're using. (No AutoML, etc.)

colbyford on 22 Dec 2019

I've just tested to double-check, and it works fine. @rongduan-zhu, what information do you need to help debug this? Alternatively, we can schedule a video call to help the customer debug.

MayMSFT on 23 Dec 2019

@ezwiefel Based on the driver log, it looks like the code that was supposed to set up input_datasets is not run. Can you please paste the code that shows how you set up the conda dependencies? I don't see you passing in a run configuration, which is where you would specify the conda dependencies, to the PythonScriptStep.

rongduan-zhu on 26 Dec 2019

@colbyford would have that code.

ezwiefel on 27 Dec 2019

Since we edited the notebook to get this working and then I ported the working version over to the client's other workspace, I can't seem to find the version with the issue(s).

We didn't pass in any conda dependencies at this point (as it wasn't necessary).

We can probably just close this issue at this point. We can keep you posted if we run into this issue again, though.

colbyford on 27 Dec 2019

@ezwiefel
We will now proceed to close this thread. If there are further questions regarding this matter, please respond here and tag @YutongTie-MSFT, and we will gladly continue the discussion.

YutongTie-MSFT on 29 Dec 2019

Hi,

I am facing the same issue. I am using a TabularDataset.
I installed the dependencies below:

env = Environment('my_env')
cd = CondaDependencies.create(pip_packages=['tensorflow==1.12.0','keras==2.2.4','azureml-sdk','azureml-defaults','matplotlib', 'scikit-learn', 'azureml-dataprep[pandas,fuse]>=1.1.14'])
env.python.conda_dependencies = cd


est = TensorFlow(source_directory=script_folder,
                 script_params=script_params,
                 compute_target=compute_target,
                 inputs=[ds.as_named_input('my_data')],
                 entry_script='keras_lstm.py',
                 environment_definition=env)

Script:
dataset = run.input_datasets["my_data"]

Error:

return super().__getitem__(key)
KeyError: 'my_data'

Could someone please share a solution, if there is one?

Thanks,
SJ

joshisn26 on 2 Jan 2020

👀1

> @ezwiefel Based on the driver log, it looks like the code that was supposed to set up input_datasets is not run. Can you please paste the code that shows how you set up the conda dependencies? I don't see you passing in a run configuration, which is where you would specify the conda dependencies, to the PythonScriptStep.

Hi, could you please look at the code in my comment above? I have added the dependencies mentioned in https://github.com/Azure/MachineLearningNotebooks/issues/707#issuecomment-567585408.

joshisn26 on 2 Jan 2020

❤1

@YutongTie-MSFT - Given @joshisn26 has the same issue with more details, please reopen this issue.

ezwiefel on 2 Jan 2020

@joshisn26 what version of azureml-core do you have installed locally? You can check this by running:

from azureml.core import VERSION
print(VERSION)

The reason I am asking is that CondaDependencies.create, by default, pins the azureml packages to the versions installed locally. You can change that by setting the pin_sdk_version parameter to False, which installs the latest versions of the azureml packages (unless you explicitly set a version or version range). If possible, please also change the name of the Environment to force the creation of a new environment, since the run might be using an old cached Environment that has old versions of the azureml packages installed.
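Since the remote environment can end up with different package versions than the local machine, it may also help to log the versions from inside the entry script itself. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, the package names are simply the ones mentioned in this thread, and `importlib.metadata` requires Python 3.8 or later (the runs in this thread used Python 3.6, where this would not work as written).

```python
from importlib import metadata  # standard library, Python 3.8+
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names are assumptions taken from this thread.
    for name, version in installed_versions(["azureml-core", "azureml-dataprep"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Placing the call at the top of the entry script puts the versions at the start of 70_driver_log.txt, ahead of any traceback.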

This is the code snippet I used, and I was unable to repro the error:

conda = CondaDependencies.create(
    pip_packages=['tensorflow==1.12.0', 'azureml-sdk', 'azureml-dataprep[fuse,pandas]', 'azureml-telemetry'],
    pin_sdk_version=False
)

env = Environment('github_707')
env.python.conda_dependencies = conda

inputs = [
    Dataset.Tabular.from_delimited_files((dstore, 'titanic.csv')).as_named_input('test')
]
tf = TensorFlow('.', entry_script='test.py', compute_target=ComputeTarget(ws, 'mlc'), environment_definition=env,
                    inputs=inputs)

test.py

from azureml.core import Run

print(Run.get_context().input_datasets['test'])

70_driver_log.txt

TabularDataset
{
  "source": [
    "('workspaceblobstore', 'titanic.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ],
  "registration": {
    "id": "bdd57c83-589f-4bf3-8b17-0076981a9eae",
    "name": null,
    "version": null,
    "workspace": "Workspace.create(name='', subscription_id='', resource_group='')"
  }
}

rongduan-zhu on 2 Jan 2020

@rongduan-zhu Thanks for the reply! I am running the code on a Notebook VM, and the azureml.core version is 1.0.74.
As suggested above, I tried the following, one at a time:

1. Change the environment name
2. Use pin_sdk_version
3. Both

But I am still getting the same error. Here's the code:

env = Environment('new_env')
cd = CondaDependencies.create(pip_packages=['tensorflow','keras','azureml-sdk','azureml-telemetry','matplotlib', 'scikit-learn', 'azureml-dataprep[pandas,fuse]'], pin_sdk_version=False)
env.python.conda_dependencies = cd

joshisn26 on 2 Jan 2020

@joshisn26 I am still unable to repro the error even using a notebook VM (however my SDK version is 1.0.81 since it's a new notebook VM). Here is the notebook cell:

from azureml.core import Workspace, Dataset, Datastore, ComputeTarget, Environment, Experiment
from azureml.core.runconfig import CondaDependencies
from azureml.train.dnn import TensorFlow

conda = CondaDependencies.create(pip_packages=['tensorflow','keras','azureml-sdk','azureml-telemetry','matplotlib', 'scikit-learn', 'azureml-dataprep[pandas,fuse]'], pin_sdk_version=False)

env = Environment('github_707_2')
env.python.conda_dependencies = conda

ws = Workspace.create(name='centraleuap', subscription_id='35f16a99-532a-4a47-9e93-00305f6c40f2', resource_group='rongduan-dev', exist_ok=True)
dstore = Datastore(ws, 'workspaceblobstore')
inputs = [
    Dataset.Tabular.from_delimited_files((dstore, 'titanic.csv')).as_named_input('test')
]
tf = TensorFlow('.', entry_script='test.py', compute_target=ComputeTarget(ws, 'mlc'), environment_definition=env,
                inputs=inputs)
exp = Experiment(ws, 'github_707')
run = exp.submit(tf)
run

Can you please share the image build log and driver log of your failed run?

rongduan-zhu on 3 Jan 2020

@joshisn26 bump. Can you please share the image build log and driver log of your failed run?

ddivakaruni on 16 Jan 2020

I've got a reproducible (_hopefully_) example for this. Clone the repo to get the conda yml and python script. My theory is that it doesn't like the 4-letter environment variable for some reason...
https://github.com/swanderz/MachineLearningNotebooks/blob/empty_input_datasets/how-to-use-azureml/work-with-data/datasets-tutorial/pipeline-with-datasets/dataset_reprex.ipynb

`70_driver_log.txt`

Azure ML SDK Version:  1.0.85
Azure ML dataprep Version:  1.1.38
is this none? nrows =  <class 'NoneType'>
run details:
~~~redacted~~~
input_datasets:
 []


The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.25638699531555176 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 153
Traceback (most recent call last):
  File "extract.py", line 107, in <module>
    dataset = run.input_datasets['dsIM']
  File "/azureml-envs/azureml_ef603ddd7729973f7484ef984f5589d2/lib/python3.6/site-packages/azureml/core/run.py", line 2205, in __getitem__
    return super().__getitem__(key)
KeyError: 'dsIM'

swanderz on 31 Jan 2020

@YutongTie-MSFT please reopen!

swanderz on 31 Jan 2020

Thanks, Anders. This was caused by a bug in our code: for dataset.as_named_input(), passing a name that contains a capital letter causes the error. We will fix it in the Feb 17 release. The current workaround is to use lowercase letters only.
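Until the fix ships, the workaround can be applied in one place so the pipeline definition and the script agree on the name. This is a minimal sketch; `lowercase_input_name` is a hypothetical helper and not part of the SDK.

```python
import warnings


def lowercase_input_name(name: str) -> str:
    """Return `name` in lowercase, warning when the original contained capitals."""
    lowered = name.lower()
    if lowered != name:
        warnings.warn(
            f"Input name {name!r} contains capital letters; using {lowered!r} instead."
        )
    return lowered
```

For the repro above this would look like `dataset.as_named_input(lowercase_input_name('dsIM'))` when building the step, with the script then reading `run.input_datasets['dsim']`.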

MayMSFT on 31 Jan 2020

🎉1

I'm also having this issue (Azure ML SDK Version: 1.6.0)

- No errors in 70_driver_log.txt
- On ml.azure.com the dataset is listed
- run.get_details()['inputDatasets'] shows the datasets that I gave as inputs
- run.input_datasets is {}
- run.register_model() registers the model _without_ reference to the input datasets.

The above happens whether I run locally or on a compute instance in Azure.

@MayMSFT apologies if this is not the correct place to raise this issue.

grjzwaan on 10 Jun 2020

👀1

@grjzwaan can you help me understand the issue here? Did your experiment run successfully? Is the issue that the model is registered without a reference to the input dataset?

MayMSFT on 10 Jun 2020

@MayMSFT

Yes, the experiment runs successfully and the problem is that the model is registered _without_ reference to the input dataset.

grjzwaan on 11 Jun 2020

@grjzwaan that's a missing feature. For now, you have to register the model with its datasets manually: run.register_model(datasets=(...))

we will add this feature request into our roadmap to auto register models with input and output dataset. Thanks for your feedback!
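A minimal sketch of the manual registration, assuming that `datasets` accepts a list of (label, dataset) tuples, which is how I understand the SDK v1 reference for `Run.register_model` to describe it. The helper names `dataset_links` and `register_with_datasets` are hypothetical, and the dataset objects have to be obtained separately (for example with `Dataset.get_by_name`).

```python
from typing import Any, Dict, List, Tuple


def dataset_links(named_datasets: Dict[str, Any]) -> List[Tuple[str, Any]]:
    """Turn {label: dataset} into the list-of-tuples shape assumed for `datasets=`."""
    return [(label, dataset) for label, dataset in named_datasets.items()]


def register_with_datasets(run: Any, model_name: str, model_path: str,
                           named_datasets: Dict[str, Any]) -> Any:
    """Register a model from `run` and attach the given datasets to it.

    `run` is expected to be an azureml.core.Run; only `register_model` is called.
    """
    return run.register_model(
        model_name=model_name,
        model_path=model_path,
        datasets=dataset_links(named_datasets),
    )
```

The label is free text describing how the dataset relates to the model, for example 'training data'.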

MayMSFT on 11 Jun 2020

@MayMSFT Ok, thanks!

Perhaps it would be good to add a note to the run.input_datasets documentation saying that the attribute remains empty in this case?

The first thing I tried was to use this information to register the dataset to the model.

grjzwaan on 12 Jun 2020

👍1
