MachineLearningNotebooks: Trying to register a dataset with the Python SDK

Created on 7 Aug 2020 · 3 Comments · Source: Azure/MachineLearningNotebooks

I'm trying to programmatically register a dataset with the Python SDK. The dataset will come from a datastore backed by a storage container, and will point to a CSV file. I'm trying the following:

from azureml.core import Dataset
dataset = Dataset.Tabular.from_delimited_files(ds.path(relative_path), validate=False)
dataset.register(ws, name)

This fails when I run it on Arch Linux, with an error stating that Arch Linux is not supported. Going through the stack trace, it looks like the ML SDK asks the dotnetcore2 Python package to install additional dependencies at runtime. I have .NET Core 2 and 3 installed on the machine. Is there a way to install whatever is needed up front, so that binaries aren't downloaded at runtime? I'll need to run this script as part of a CI pipeline, and downloading binaries on every run isn't sensible.

Also, is there a way to register a dataset without the data being accessed or used locally, via .NET or otherwise? I just want a dataset in the Azure ML workspace; I don't care about using the data locally (or in the CI pipeline).

Thanks.

MLOps product-question

All 3 comments

I'm pretty sure you can do this from the Azure ML UI (Create Datasets in the Studio). Another alternative would be to use a different OS, or an Azure ML Compute Instance Notebook VM.

I've managed to make some progress. The missing dependency appears to be the Python lttngust package. It isn't a pip or conda package but an optional dependency of .NET Core, installed at the OS level. I'm running Arch Linux, so

sudo pacman -S python-lttngust

allowed me to get past the dependency check. Looking through the codebase, it appears that azureml puts dotnetcore2 in site-packages/bin of your Python environment and then checks for OS-level dependencies. If it can't find them, it copies them from Azure blobs (only for the "supported" OSes), and once everything is in place it writes a deps/success file. To be fair, this seems quite strange, since a library is effectively downloading binaries at runtime, but ah well...
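To check up front (for example while building a CI image) whether that success marker is already present, something like the following could work. This is only a sketch: it assumes the marker is a file whose name starts with "success" inside a directory named "deps" somewhere under the installed dotnetcore2 package, which is my reading of the description above and not a documented layout. Both function names are made up for illustration.

```python
import importlib.util
from pathlib import Path


def find_success_markers(package_dir):
    """List files named like 'success*' that sit inside a 'deps' directory.

    Assumption: the dependency check leaves such a marker behind; the exact
    file name and location are not documented, so the search is deliberately loose.
    """
    root = Path(package_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file()
        and p.name.lower().startswith("success")
        and "deps" in (part.lower() for part in p.relative_to(root).parts[:-1])
    )


def dotnetcore2_dir():
    """Return the installed dotnetcore2 package directory, or None if it is absent."""
    spec = importlib.util.find_spec("dotnetcore2")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


if __name__ == "__main__":
    pkg = dotnetcore2_dir()
    if pkg is None:
        print("dotnetcore2 is not installed in this environment")
    else:
        print(find_success_markers(pkg) or "no success marker found yet")
```

An empty result would suggest the dependency check has not completed in that environment yet.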

@swanderz the UI is not an option, as I'm automating this. We may need another OS in the build pipeline unless I can get lttngust onto our existing images. Either way, the automation will run outside the Azure ML environment.

I'm still not sure whether this pulls the data down or not, but at the very least dataset registration is working.
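On the question of registering without touching the data, a registration helper could be sketched as below. As far as I understand the SDK documentation, validate=False skips the check that the data can be loaded, and column type inference may also read the data unless infer_column_types=False is passed; treat both as assumptions to verify against your SDK version. The helper name and its arguments are hypothetical.

```python
def register_csv_dataset(ws, datastore_name, relative_path, dataset_name):
    """Register a CSV file in a datastore as a tabular dataset, skipping validation."""
    # Imported lazily so this module can be loaded on machines without azureml-core.
    from azureml.core import Dataset, Datastore

    datastore = Datastore.get(ws, datastore_name)
    dataset = Dataset.Tabular.from_delimited_files(
        path=[(datastore, relative_path)],
        validate=False,            # skip the "can this data be loaded" check
        infer_column_types=False,  # assumption: inference would otherwise read the file
    )
    # create_new_version=True lets repeated CI runs re-register under the same name.
    return dataset.register(workspace=ws, name=dataset_name, create_new_version=True)
```

In a pipeline this would be called with a workspace obtained from, for example, Workspace.from_config(). Note that this does not by itself remove the dotnetcore2 dependency check described above.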

For those facing similar issues (i.e. "NotImplementedError: Unsupported Linux distribution"), running the following in a Python session that uses the same Python environment will show you which dependencies are missing. Once you install them at the OS level, it should work:

from dotnetcore2 import runtime
runtime._enable_debug_logging()
runtime.ensure_dependencies()

The third call emits a debug line listing the missing dependencies.
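For a CI pipeline, the same three calls can be wrapped in a small pre-flight check that fails the build early. This is a rough sketch: the function name is made up, the optional runtime argument exists only so the check can be exercised without dotnetcore2 installed, and _enable_debug_logging is a private helper that may change between releases.

```python
def dotnet_dependencies_ok(runtime=None):
    """Return True if dotnetcore2 reports its OS-level dependencies as satisfied."""
    if runtime is None:
        # Lazy import of the same module used in the snippet above.
        from dotnetcore2 import runtime
    runtime._enable_debug_logging()  # private helper; prints the missing dependencies
    try:
        runtime.ensure_dependencies()
    except NotImplementedError as err:
        # Raised on distributions the package does not recognise.
        print(f"dotnetcore2 dependency check failed: {err}")
        return False
    return True


if __name__ == "__main__":
    import sys
    sys.exit(0 if dotnet_dependencies_ok() else 1)
```

Running this as an early pipeline step surfaces the missing OS packages before the dataset registration script is reached.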

(A similar issue: https://github.com/Azure/MachineLearningNotebooks/issues/713 )

@ashic you rock for:
1) being a problem-solver, and
2) sharing your findings back with the community
gold star for you!
cc: @MayMSFT
