Machinelearningnotebooks: BUG: compute queues for many hours, then silent failure on DataTransferStep

Created on 28 Feb 2020 · 7 Comments · Source: Azure/MachineLearningNotebooks

Overview

Wish I could give a clean repro on this one, but can't really. Multiple people across two of our repos are experiencing this lagged compute provisioning issue and ghostly failures of random steps.

SDK version:

compute: 1.0.85
local: both 1.0.85 and 1.1.1rc0

Timeline:

| Time (UTC)    | Event                                            |
|---------------|--------------------------------------------------|
| 2/28/20 3:36  | kick off pipeline                                |
| 2/28/20 3:36  | 1st StepRun, "extract.py" created                |
| 2/28/20 5:28  | I manually cancel Pipeline in UI & delete compute |
| 2/28/20 5:33  | create new compute 'isitmarchyetpt2'             |
| 2/28/20 5:35  | kick off pipeline again                          |
| 2/28/20 10:05 | extract.py "created"                             |
| 2/28/20 10:11 | extract.py "finishes"                            |
| 2/28/20 10:31 | DataTransferStep fails                           |
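
To put numbers on "many hours", the gaps in the timeline can be worked out directly. This is only a small arithmetic sketch over the timestamps listed above; the variable names are made up for illustration.

```python
from datetime import datetime

# Timestamp layout used in the timeline, e.g. "2/28/20 10:05" (UTC).
FMT = "%m/%d/%y %H:%M"


def elapsed(start: str, end: str):
    """Return the time between two timeline entries as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# First attempt: kickoff until the manual cancel.
first_run_wait = elapsed("2/28/20 3:36", "2/28/20 5:28")
# Second attempt: kickoff until extract.py shows as "created".
second_run_wait = elapsed("2/28/20 5:35", "2/28/20 10:05")

print(first_run_wait)   # 1:52:00
print(second_run_wait)  # 4:30:00
```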

Relevant RunIds

| Step                     | StepRunId                            | PipelineRunId                        |
|--------------------------|--------------------------------------|--------------------------------------|
| extract.py (1st StepRun) | bc4195a7-9306-4edf-97dd-dd6          | af725f04-a6ff-4cff-81b4-1a54e        |
| extract.py (2nd StepRun) | b27bdc2f-8dd7-4de0-ad67-e3cb02669078 | b19aa251-3946-478a-a80d-16a42862dd4c |
| Failed DataTransferStep  | f15444d2-7f22-4611-b20f-22b4d5baf05f | b19aa251-3946-478a-a80d-16a42862dd4c |

Logs -- Failed Data Transfer Step

logs/azureml/stderrlogs.txt

[2020-02-28 10:31:44Z] Data transfer failed: 

logs/azureml/executionlogs.txt

[2020-02-28 10:31:10Z] Parsing command text 
[2020-02-28 10:31:10Z] Parsed command line. Will be submitting job : {
  "Command": 2,
  "CopyCommand": null,
  "WaitCommand": null,
  "DataCopyCommand": {
    "ComputeName": "adf",
    "AzureDataFactoryConfig": null,
    "SourceDataId": "2e4cbf83-0b4b-4f63-811b-6c29ff27c94d",
    "DestinationDataId": "17bae934-c8e2-4f08-8bea-e1312c5f4a29",
    "OutputDataId": "8e9f10d1-8934-4085-82b3-8cd56aed217f",
    "CopyOperationEntity": null,
    "CopyOptions": "{\"source_reference_type\": \"directory\", \"destination_reference_type\": \"directory\"}"
  },
  "IsDataManagementEnabled": true,
  "ComplianceCluster": null,
  "EuclidWorkspaceId": null
}
[2020-02-28 10:31:10Z] Copy source: Blob storage account: avadevitsmlstorage, directory: dealpipeline/azureml/b909f1ca-eab7-42f6-abd8-69fcb74244e1/gold_data1, filename: , binary copy: True, using SAS: False
[2020-02-28 10:31:10Z] Copy sink: Blob storage account: avadevitsmlstorage, directory: dealoutput/deal-PRhotfix/2020-02-27 21.35.08/stage1, filename: , binary copy: True, using SAS: False
[2020-02-28 10:31:26Z] RunId:[f15444d2-7f22-4611-b20f-22b4d5baf05f] ParentRunId:[b19aa251-3946-478a-a80d-16a42862dd4c] ComputeTarget:[ADF]

All 7 comments

Not sure if this is relevant at all, but after reverting the local and remote envs to 1.0.85 in one of our repos and submitting a pipeline run via the SDK, it stalled for ~12 min on the create-compute step of the notebook... never seen that before! This is before the run is submitted, and min_nodes = 0.

@sanpil @rastala
we're not seeing this anymore. what fixed it (_I think_) was either:

  1. a fix on your side, or
  2. reinstalling the SDK.

my hypothesis for 2) is that although the stable SDK version remains 1.0.85, many of its underlying required packages have been upgraded numerous times in the past month. So `pip install azureml-sdk==1.0.85` a month ago would result in a different environment than the same command today. I have no insight into what exactly changes in these patches.
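
One way to check a hypothesis like this would be to snapshot the installed package versions in each environment and compare them. The sketch below is only an illustration using the standard library's `importlib.metadata` (Python 3.8+); the helper names `installed_versions` and `diff_versions` are hypothetical and not part of the Azure ML SDK.

```python
from importlib.metadata import distributions


def installed_versions():
    """Return {package_name: version} for every distribution in this environment."""
    found = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing or broken metadata
            found[name.lower()] = dist.version
    return dict(sorted(found.items()))


def diff_versions(old, new):
    """Return {name: (old_version, new_version)} for packages that differ.

    None marks a package that is absent from one of the two snapshots.
    """
    changed = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed[name] = (old.get(name), new.get(name))
    return changed


# Hypothetical snapshots, made up purely to show the shape of the result.
before = {"pkg-a": "1.0", "pkg-b": "2.0"}
after = {"pkg-a": "1.0", "pkg-b": "2.1", "pkg-c": "0.1"}
print(diff_versions(before, after))
```

Saving the output of `installed_versions()` (for example as JSON) right after each install would make it possible to diff an old environment against a fresh one even when the pinned top-level version is identical.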

azureml-core release history

@sanpil @rastala FYI

@swanderz
We will now proceed to close this thread. If there are further questions regarding this matter, please respond here and @YutongTie-MSFT and we will gladly continue the discussion.

@YutongTie-MSFT please re-open. I'm working with @sonnypark on resolving this issue.

@swanderz We're seeing the same issue with the compute queuing. We've tried opening more than one case with MS for different workspaces. Our workaround (a hacky fix) is to allocate at least 1 node as the minimum for the compute on creation, then tear it down manually at EOD. There's also a bug in SDK-created compute clusters: even if you allocate 20 nodes, the cluster doesn't rescale when the pipelines need it, so we go through the portal and edit the number of nodes to get it to start resizing.
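
For readers who want to try the keep-one-node-warm workaround described above, a minimal sketch with the v1 SDK might look like the following. The function names and the `scale_settings` helper are hypothetical, and the workspace, cluster name, and VM size are left as parameters. Scaling back to zero with `update` is only one reading of "manual teardown"; deleting the cluster would be another. Given the rescaling problem reported above, this sketch cannot confirm that an SDK-side `update` takes effect.

```python
def scale_settings(keep_warm, max_nodes):
    """Return min/max node counts: one idle node while warm, zero otherwise."""
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    return {"min_nodes": 1 if keep_warm else 0, "max_nodes": max_nodes}


def create_warm_cluster(ws, name, vm_size, max_nodes):
    """Create an AmlCompute cluster that always keeps one node allocated."""
    # Imported lazily so the helper above is usable without the SDK installed.
    from azureml.core.compute import AmlCompute, ComputeTarget

    config = AmlCompute.provisioning_configuration(
        vm_size=vm_size, **scale_settings(True, max_nodes)
    )
    target = ComputeTarget.create(ws, name, config)
    target.wait_for_completion(show_output=True)
    return target


def release_cluster(target, max_nodes):
    """End-of-day step: let the cluster scale back down to zero nodes."""
    target.update(**scale_settings(False, max_nodes))
```

Keeping a node allocated trades idle cost for avoiding the provisioning wait, which is why the end-of-day release step matters.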

@YutongTie-MSFT not sure if you have the right person assigned

I believe this issue has been already addressed and resolved under: https://icm.ad.msft.net/imp/v3/incidents/details/177662847/home
