Machinelearningnotebooks: BUG: compute queues for many hours, then silent failure on DataTransferStep

Created on 28 Feb 2020 · 7 Comments · Source: Azure/MachineLearningNotebooks

Overview

Wish I could give a clean repro on this one, but I can't really. Multiple people across two of our repos are experiencing this lagged compute provisioning issue and ghostly failures of random steps.

SDK version:

compute: 1.0.85
local: both 1.0.85 and 1.1.1rc0

Timeline:

Time (UTC)      Event
2/28/20 3:36    kick off pipeline
2/28/20 3:36    1st StepRun, "extract.py"  created
2/28/20 5:28    I manually cancel Pipeline in UI & delete compute
2/28/20 5:33    create new compute 'isitmarchyetpt2'
2/28/20 5:35    kick off pipeline again
2/28/20 10:05   extract.py "created"
2/28/20 10:11   extract.py "finishes"
2/28/20 10:31   DataTransferStep fails
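
For reference, the waits implied by this timeline can be worked out with a few lines of Python. This is only a sketch of the arithmetic: the timestamps are copied from the table above, while the labels and variable names are made up for illustration.

```python
from datetime import datetime

FMT = "%m/%d/%y %H:%M"


def parse(ts):
    """Parse a timeline entry such as '2/28/20 3:36' (UTC)."""
    return datetime.strptime(ts, FMT)


# Timestamps (UTC) copied from the timeline above.
first_kickoff = parse("2/28/20 3:36")
first_cancel = parse("2/28/20 5:28")
second_kickoff = parse("2/28/20 5:35")
extract_created = parse("2/28/20 10:05")
extract_finished = parse("2/28/20 10:11")
transfer_failed = parse("2/28/20 10:31")

# Labels are illustrative; each value is a datetime.timedelta.
waits = {
    "first run, kickoff to manual cancel": first_cancel - first_kickoff,
    "second run, kickoff to extract.py created": extract_created - second_kickoff,
    "extract.py created to finished": extract_finished - extract_created,
    "extract.py finished to DataTransferStep failure": transfer_failed - extract_finished,
}

for label, delta in waits.items():
    print(f"{label}: {delta}")
```

So the first run sat for 1 h 52 min before the manual cancel, and the second run waited 4 h 30 min before extract.py was created, while extract.py itself took only 6 minutes.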

Relevant RunIds

| Step                     | StepRunId                            | PipelineRunId                        |
|--------------------------|--------------------------------------|--------------------------------------|
| extract.py (1st StepRun) | bc4195a7-9306-4edf-97dd-dd6          | af725f04-a6ff-4cff-81b4-1a54e        |
| extract.py (1st StepRun) | b27bdc2f-8dd7-4de0-ad67-e3cb02669078 | b19aa251-3946-478a-a80d-16a42862dd4c |
| Failed DataTransferStep  | f15444d2-7f22-4611-b20f-22b4d5baf05f | b19aa251-3946-478a-a80d-16a42862dd4c |

Logs -- Failed Data Transfer Step

`logs/azureml/stderrlogs.txt`

[2020-02-28 10:31:44Z] Data transfer failed:

`logs/azureml/executionlogs.txt`

[2020-02-28 10:31:10Z] Parsing command text 
[2020-02-28 10:31:10Z] Parsed command line. Will be submitting job : {
  "Command": 2,
  "CopyCommand": null,
  "WaitCommand": null,
  "DataCopyCommand": {
    "ComputeName": "adf",
    "AzureDataFactoryConfig": null,
    "SourceDataId": "2e4cbf83-0b4b-4f63-811b-6c29ff27c94d",
    "DestinationDataId": "17bae934-c8e2-4f08-8bea-e1312c5f4a29",
    "OutputDataId": "8e9f10d1-8934-4085-82b3-8cd56aed217f",
    "CopyOperationEntity": null,
    "CopyOptions": "{\"source_reference_type\": \"directory\", \"destination_reference_type\": \"directory\"}"
  },
  "IsDataManagementEnabled": true,
  "ComplianceCluster": null,
  "EuclidWorkspaceId": null
}
[2020-02-28 10:31:10Z] Copy source: Blob storage account: avadevitsmlstorage, directory: dealpipeline/azureml/b909f1ca-eab7-42f6-abd8-69fcb74244e1/gold_data1, filename: , binary copy: True, using SAS: False
[2020-02-28 10:31:10Z] Copy sink: Blob storage account: avadevitsmlstorage, directory: dealoutput/deal-PRhotfix/2020-02-27 21.35.08/stage1, filename: , binary copy: True, using SAS: False
[2020-02-28 10:31:26Z] RunId:[f15444d2-7f22-4611-b20f-22b4d5baf05f] ParentRunId:[b19aa251-3946-478a-a80d-16a42862dd4c] ComputeTarget:[ADF]

AML Compute Instance Compute assigned-to-author product-question triaged

Source

swanderz

All 7 comments

Not sure if this is relevant at all, but after reverting the local and remote envs to 1.0.85 in one of our repos and submitting a pipeline run via the SDK, it stalled for ~12 min on the create-compute step of the notebook... never seen that before! This is before the run is submitted, and with min_nodes = 0.

swanderz on 2 Mar 2020

@sanpil @rastala
we're not seeing this anymore. What fixed it (_I think_) was either:

1) a fix on your side, or
2) reinstalling the SDK.

My hypothesis for 2) is that although the stable SDK version remains 1.0.85, many of the underlying required packages have been upgraded numerous times in the past month. So `pip install azureml-sdk==1.0.85` a month ago would result in a different environment than the same command today. I have no insight into what exactly changes in these patches.

`azureml-core` release history
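
One way to test this hypothesis would be to snapshot the installed `azureml-*` package versions in a working environment and in a stalled one, then compare them. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+); the function names `snapshot_versions` and `diff_snapshots` are hypothetical helpers, not part of the Azure ML SDK.

```python
from importlib import metadata


def snapshot_versions(prefix="azureml"):
    """Return {package name: version} for installed packages whose name starts with prefix."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name and name.lower().startswith(prefix):
            versions[name.lower()] = dist.version
    return versions


def diff_snapshots(old, new):
    """Return {name: (old_version, new_version)} for packages that differ.

    None marks a package that is missing from one of the two snapshots.
    """
    changed = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


if __name__ == "__main__":
    # Print in requirements-file style so the output can be saved and compared later.
    for name, version in sorted(snapshot_versions().items()):
        print(f"{name}=={version}")
```

Saving this output (or plain `pip freeze`) right after each install would show whether two installs of the same pinned `azureml-sdk` version really pulled in different underlying packages.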

swanderz on 3 Mar 2020

@sanpil @rastala FYI

@swanderz
We will now proceed to close this thread. If there are further questions regarding this matter, please respond here and @YutongTie-MSFT and we will gladly continue the discussion.

YutongTie-MSFT on 4 Mar 2020

@YutongTie-MSFT please re-open. I'm working with @sonnypark on resolving this issue.

swanderz on 4 Mar 2020


@swanderz We're seeing the same issue with the compute queuing. We've tried opening more than one case with MS for different workspaces. The way we worked around it (a hacky fix) is to allocate at least 1 node as the minimum for the compute on creation, then tear it down manually at the end of the day. There's also a bug in SDK-created compute clusters: even if you allocate 20 nodes, the cluster doesn't rescale when the pipelines need it, so we go through the portal and edit the number of nodes to get it to start resizing.
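
A minimal sketch of that workaround with the v1 SDK might look like the following. It assumes an existing `Workspace` object `ws`; the helper names, the VM size, and `max_nodes=4` are placeholders rather than values from this thread, and scaling `min_nodes` back to 0 is only one possible reading of "manual teardown".

```python
def cluster_settings(keep_warm, max_nodes=4, vm_size="STANDARD_D2_V2"):
    """Build keyword arguments for AmlCompute.provisioning_configuration.

    keep_warm=True applies the workaround (min_nodes=1) so one node stays allocated.
    max_nodes and vm_size are placeholder values, not taken from the thread.
    """
    return {
        "vm_size": vm_size,
        "min_nodes": 1 if keep_warm else 0,
        "max_nodes": max_nodes,
    }


def create_cluster(ws, name, keep_warm=True):
    """Create an AmlCompute cluster in workspace ws and wait for provisioning."""
    # Imported lazily so cluster_settings() works without the SDK installed.
    from azureml.core.compute import AmlCompute, ComputeTarget

    config = AmlCompute.provisioning_configuration(**cluster_settings(keep_warm))
    cluster = ComputeTarget.create(ws, name, config)
    cluster.wait_for_completion(show_output=True)
    return cluster


def scale_to_zero(cluster):
    """End-of-day step: let the cluster release its last node (assumed teardown)."""
    cluster.update(min_nodes=0)
```

Keeping one node allocated avoids waiting on provisioning, at the cost of paying for an idle node until `min_nodes` is set back to 0 or the cluster is deleted.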

jadhosn on 15 Mar 2020


@YutongTie-MSFT not sure if you have the right person assigned

jadhosn on 20 Apr 2020

I believe this issue has already been addressed and resolved under: https://icm.ad.msft.net/imp/v3/incidents/details/177662847/home

sonnypark on 20 Apr 2020
