Region: westus2
RunId: aba5defa-b35c-4072-a6a2-2c0437cd77ed
PipelineRunId: 9e037eee-ab69-4db2-b67a-3200a749c9bb
RunId: f0f73421-243b-40a5-967d-e1bf6f85b727
PipelineRunId: 3885bfcc-6486-46ba-9199-10d15e8ce8a5
Our production ML pipeline is failing again.
Here's some of the other related #transient issues we've been having (#1045 #1059 #1058)
executionlogs.txt[2020-09-02 14:01:55Z] Parsing command text
[2020-09-02 14:01:55Z] Parsed command line. Will be submitting job : {
"Command": 2,
"CopyCommand": null,
"WaitCommand": null,
"DataCopyCommand": {
"ComputeName": "adf",
"AzureDataFactoryConfig": null,
"SourceDataId": "c989fcc3-0847-479a-98d0-a574b79e7f91",
"DestinationDataId": "73c04d7f-5138-4c49-bf70-eb55526db6ad",
"OutputDataId": "f613e646-6da7-4bca-9716-ca1e49ea99d9",
"CopyOperationEntity": null,
"CopyOptions": "{\"source_reference_type\": \"directory\", \"destination_reference_type\": \"directory\"}",
"PolicyValidationStatus": 0
},
"IsDataManagementEnabled": true,
"ComplianceCluster": null,
"EuclidWorkspaceId": null
}
[2020-09-02 14:01:56Z] Copy source: Blob storage account: avaprditsmlstorage, directory: dealpipeline/azureml/46e96948-27a9-4e52-83bb-aea407ca0218/interpretability2, filename: , binary copy: True, using SAS: False
[2020-09-02 14:01:56Z] Copy sink: Blob storage account: avaprditsmlstorage, directory: dealoutput/deal-master/WinLoss/2020-09-02 05.28.40/stage2, filename: , binary copy: True, using SAS: False
[2020-09-02 14:02:02Z] RunId:[aba5defa-b35c-4072-a6a2-2c0437cd77ed] ParentRunId:[9e037eee-ab69-4db2-b67a-3200a749c9bb] ComputeTarget:[ADF]
[2020-09-02 14:02:20Z] Data transfer is in progress.
stderrlogs.txt[2020-09-02 14:02:38Z] Data transfer failed:
{
"errorCode": "2200",
"message": "Failure happened on 'Sink' side.
ErrorCode=AzureStorageOperationFailedConcurrentWrite,'
Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=
"Error occurred when trying to upload a file.
Its possible because you have multiple concurrent copy activities runs writing to the same file
'dealoutput/deal-master/WinLoss/2020-09-02 05.28.40/stage2/df_shap_open.tsv'.
Check your ADF configuration.",
Source=Microsoft.DataTransfer.ClientLibrary,''
Type=Microsoft.WindowsAzure.Storage.StorageException,
Message=The remote server returned an error:
(400) Bad Request.,
Source=Microsoft.WindowsAzure.Storage,
StorageExtendedMessage=
The specified block list is invalid.
RequestId:a360604b401e00b0-0b31-81123d000000
Time:2020-09-02T14:02:13.1428713Z,,''
Type=System.Net.WebException,
Message=The remote server returned an error: (400) Bad Request.,
Source=Microsoft.WindowsAzure.Storage,'",
"failureType": "UserError",
"target": "CopyActivity",
"details": []
}

cc: @akshay-0
Today, our error message is now just blank...
executionlogs.txt[2020-09-03 14:00:33Z] Parsing command text
[2020-09-03 14:00:33Z] Parsed command line. Will be submitting job : {
"Command": 2,
"CopyCommand": null,
"WaitCommand": null,
"DataCopyCommand": {
"ComputeName": "adf",
"AzureDataFactoryConfig": null,
"SourceDataId": "4b526acb-e5a8-4021-a145-748d9ab38080",
"DestinationDataId": "7f952012-14d4-4a52-a289-e616aea91d80",
"OutputDataId": "cb958f40-d202-4f46-928e-ab0e0dd079aa",
"CopyOperationEntity": null,
"CopyOptions": "{\"source_reference_type\": \"directory\", \"destination_reference_type\": \"directory\"}",
"PolicyValidationStatus": 0
},
"IsDataManagementEnabled": true,
"ComplianceCluster": null,
"EuclidWorkspaceId": null
}
[2020-09-03 14:00:33Z] Copy source: Blob storage account: avaprditsmlstorage, directory: dealpipeline/azureml/145c96cb-c142-49c0-9c7a-baee27d2ac33/interpretability1, filename: , binary copy: True, using SAS: False
[2020-09-03 14:00:33Z] Copy sink: Blob storage account: avaprditsmlstorage, directory: dealoutput/deal-master/WinLoss/2020-09-02 05.28.40/stage1, filename: , binary copy: True, using SAS: False
[2020-09-03 14:02:11Z] RunId:[54896683-6772-431d-a1f2-cc9b7028b94d] ParentRunId:[585b0d23-89b8-4738-81da-ac066d0308b5] ComputeTarget:[ADF]
stderrlogs.txt[2020-09-03 14:02:39Z] Data transfer failed: {
"errorCode": "",
"message": "",
"failureType": "",
"target": "CopyActivity"
}
@swanderz Thank you for your feedback. We are actively investigating this issue and will update you shortly.
@swanderz After investigation by the engineering team, it was found that two runs that were going in parallel and caused race condition because they were writing to same output file.
For the second issue you shared, ADF team has acknowledged an issue on their side and are working on a fix. I will update you with ETA for the fix once available.
@shbijlan thanks for the update. The first issue was our fault, we hotfixed our codebase and will monitor for a day or two to make sure its resolved. Perhaps we close this issue and create on specifically for the ADF-side error?
@swanderz we can keep this thread open until we have a fix with ETA for the ADF-side error. I will share updates here.
@swanderz apologies for not sharing any update on this item. If it is still an issue, would you be able to open a new thread on the ADF error and we can mark it with the appropriate labels to be routed to the right team?
@shbijlan is there an update to share?
@swanderz no update on the ADF issue.
Edit: Looks like the ADF issue is not being worked on due to its transient nature. Please let us know if it comes up again.