Machinelearningnotebooks: Stalled pipelines

Created on 14 Oct 2020 · 4 Comments · Source: Azure/MachineLearningNotebooks

I am not sure how to solve the following issue or how to make it reproducible, but hopefully I will get an answer.

I have multi-step pipelines that take a very long time to start. Sometimes they take more than 15 hours to start; other times the very same pipeline, with no code changes at all, starts right away. This randomness in start times is quite unexpected and makes working with AML very hard.

All of these pipelines have one thing in common: they start with a DatabricksStep. For some reason I can't explain, these steps take a lot of time to start. Other steps like PythonScriptStep start right away.

What can be the issue? Could it be a VNET issue between Azure Databricks and Azure Machine Learning services?

Pipelines product-issue


jarandaf


All 4 comments

After much trial and error, I would like to share my findings so that nobody loses as much time as I did.

As described above, the main issue was somehow related to the DatabricksStep class. The source_directory parameter can be problematic, and I highly recommend specifying a folder with as few contents as possible (e.g. a directory with a single script, the one to be run by the step).

Since the pipeline code I used was part of an internal package, I was passing a fairly broad directory as source_directory and only then specifying the script to be run via the python_script_name argument. This should be avoided, as I found out that the source_directory is critical for the following reasons:

It is used to compute code changes between successive runs. Any change to a file under that folder will trigger a rerun on subsequent pipeline runs, no matter what.
Snapshots are created from the contents of this folder, and these are used by the step (in my case, these snapshots were apparently not properly created or made available; perhaps even a folder with no more than 20 files is too much 🤦 ).
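
To illustrate the first point, here is a minimal sketch that assumes change detection behaves roughly like a content hash over every file under the folder. This is an illustration only, not the SDK's actual implementation, and `directory_fingerprint` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def directory_fingerprint(source_directory: str) -> str:
    """Hash the relative path and contents of every file under a folder.

    Hypothetical helper: it only illustrates why a broad folder is fragile,
    and does not reproduce how the SDK detects changes.
    """
    root = Path(source_directory)
    digest = hashlib.sha256()
    # Sort so the result does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Under this assumption, editing any file in the folder changes the fingerprint, even a file the step never imports, so the fewer files the folder holds, the fewer unrelated edits can invalidate a previous run.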

All the issues disappeared once I used folders with a single script under them, the script used by the DatabricksStep.
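
One way to get such a folder without restructuring a package is to copy the script into a fresh directory before building the step. The sketch below uses only the standard library; `stage_single_script` is a hypothetical helper, and the commented usage assumes the azureml-sdk is installed and a Databricks compute is attached (the path and step name are placeholders).

```python
import shutil
import tempfile
from pathlib import Path
from typing import Tuple


def stage_single_script(script_path: str) -> Tuple[str, str]:
    """Copy one script into a fresh, otherwise empty directory.

    Returns (source_directory, python_script_name), the two values the
    workaround in this thread passes to DatabricksStep.
    """
    script = Path(script_path).resolve()
    if not script.is_file():
        raise FileNotFoundError(f"Script not found: {script}")
    staging_dir = Path(tempfile.mkdtemp(prefix="databricks_step_"))
    shutil.copy2(script, staging_dir / script.name)
    return str(staging_dir), script.name


# Hypothetical usage (not run here; needs a workspace and Databricks compute):
#
#   from azureml.pipeline.steps import DatabricksStep
#   source_directory, script_name = stage_single_script("my_package/jobs/train.py")
#   step = DatabricksStep(
#       name="train",
#       source_directory=source_directory,
#       python_script_name=script_name,
#       # plus the cluster and compute_target arguments for your setup
#   )
```

The staged directory contains exactly one file, so the snapshot stays as small as possible. This only works if the script does not import sibling modules from the original package; those would have to be installed on the cluster some other way.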

From my point of view, this should be considered an SDK bug. If snapshot creation can be an issue for folders with "many" files, this should be clearly stressed in the docs, and the log traces should report it.

jarandaf on 5 Nov 2020


@jarandaf Thank you for the feedback. I have opened a bug for our team to investigate this scenario. If it is a design change we plan to implement, we will loop it into our planning and release cycle.

shbijlan on 11 Nov 2020


@jarandaf To give you an update, our engineering team has recently made a change to improve the reliability of the Databricks upload. We are also working on clarifying the documentation to address the issue raised here.

shbijlan on 3 Dec 2020


The documentation has been updated and should go live in a few hours.