Describe the issue
We've got quite a lot of pipelines that are scheduled to run tests on nextminor early every morning. They are running on several agents installed on the same machine.
Ever since the switch to BCArtifacts we've had concurrency issues. Most of them were fixed with the implementation of mutexes (ref #1088).
But now and then (roughly every other week) we still get failing pipelines that seem to be caused by several pipelines running at the same time on the same machine.
The result seems to be an image that does not work. All pipelines using the image that was created by the first pipeline run fail with the error "Service Tier doesn't exist / is not installed"
The workaround is to remove all created images and rerun the pipelines - causing them to rebuild the image.
Scripts used to create container and cause the issue
Since this only occurs in pipelines with a lot of options, I'm having a hard time isolating it to a simple repro script.
But the pipelines are using the following parameters to New-BcContainer:
{
  "PublishPorts": [
  ],
  "doNotExportObjectsToText": true,
  "myScripts": [
    {
      "MainLoop.ps1": "while ($true) { start-sleep -seconds 10 }"
    },
    {
      "SetupWebClient.ps1": "try {\r\n . c:\\run\\SetupWebClient.ps1\r\n} catch {\r\n Write-Host \"SetupWebClient failed, waiting a few moments before retrying\"\r\n Start-sleep -seconds 10\r\n . c:\\run\\SetupWebClient.ps1\r\n}"
    }
  ],
  "includeAL": true,
  "containerName": "CI-bld17",
  "enableTaskScheduler": false,
  "Credential": {
    "UserName": "***",
    "Password": {
      "Length": 12
    }
  },
  "doNotCheckHealth": true,
  "additionalParameters": [
    "--volume \"C:\\Agent_AS011_scheduled_01:C:\\Agent\""
  ],
  "includeCSide": false,
  "accept_eula": true,
  "vsixFile": "https://ms-dynamics-smb.gallerycdn.vsassets.io/extensions/ms-dynamics-smb/al/6.1.352568/1602750099508/Microsoft.VisualStudio.Services.VSIXPackage",
  "artifactUrl": "https://bcinsider.azureedge.net/sandbox/17.1.17840.0/se***",
  "multitenant": false,
  "accept_outdated": true,
  "alwaysPull": {
    "IsPresent": true
  },
  "auth": "NavUserPassword",
  "imageName": "nabbcimage",
  "shortcuts": "None",
  "updateHosts": true,
  "licenseFile": "***"
}
Full output of scripts
The outputs of the two pipelines that are messing with each other:
If you diff the above files with your favorite diff app, you'll see how they interfere with each other. At one point they are running the processes 0.07 seconds apart, hinting that there is a small time window where this issue might occur.
After the above happens, all other pipelines try to use the image that was created by the above pipelines. They all fail with an output like this: Subsequent fails.txt
Additional context
I assume that the mutexes are taken a bit too late or released too early, but haven't had the time to dig into this properly...
Here's a log from a successful run, where the pipeline created a new image and then used it for a new container:
success.txt
I think I found something...
It seems as if the mutex is taken after the check for whether a new image is needed. The second thread then waits for the first thread to finish building the image, and afterwards rebuilds the same image while the first thread is using it to create a container. Or something like that; it feels like a messy description. :)
Look at the following in CI-bld18.txt:
2020-10-20T02:06:19.6494486Z Fetching all docker images <--- Checks what images already exist
2020-10-20T02:06:19.9872083Z ArtifactUrl and ImageName specified
2020-10-20T02:06:22.0955568Z Waiting for other process building image nabbcimage:sandbox-17.1.17840.0-se <---- Waiting for build mutex
2020-10-20T02:08:08.0466579Z Other process completed building
2020-10-20T02:08:08.0716354Z Building image nabbcimage:sandbox-17.1.17840.0-se based on mcr.microsoft.com/dynamicsnav:10.0.17763.1518-generic with https://bcinsider.azureedge.net/sandbox/17.1.17840.0/se <--- Creates an image that just has been created
2020-10-20T02:08:08.0741012Z Pulling latest image mcr.microsoft.com/dynamicsnav:10.0.17763.1518-generic
I fixed that in 1.0.9:
Issue #1367 - If two processes build the same image at the same time, the second one always rebuilds
but you are using 1.0.9 - strange
Seems strange.
Let me double check if I messed up when I updated to 1.0.9...
I am creating a machine with 4 agents as we speak...:-)
Reinstalled 1.0.9 with Install-Module and ran a directory compare between that and the copy in my pipeline tools - I am running an unmodified v1.0.9 (except for 5 files that differ because you've got mixed line endings in them)
The annoying thing is that this does not happen every day, even though the nextminor artifacts are updated quite often.
But hopefully you'll see the same behavior in your pipelines!
I see that the image is rebuilt - odd... - will fix that right away.
The issue with the rebuild was that the allimages cache was passed in from New-BcContainer,
meaning that running two New-BcImage commands at the same time would actually work (which I tested) - just not if they were called from New-BcContainer (which I didn't:-()
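The pattern being discussed (re-read the image list only after the build mutex has been acquired, rather than trusting a list fetched before the wait) can be sketched roughly as follows. This is an illustration, not the module's actual code: `Invoke-BuildIfMissing`, `$GetImages` and `$BuildImage` are hypothetical names, and the in-memory list stands in for Docker.

```powershell
# Hypothetical helper, NOT part of the container helper module: build an image
# only if it is still missing once the build mutex has been acquired.
function Invoke-BuildIfMissing {
    param(
        [Parameter(Mandatory)] [string] $ImageName,
        [Parameter(Mandatory)] [scriptblock] $GetImages,   # returns current image names
        [Parameter(Mandatory)] [scriptblock] $BuildImage   # builds the named image
    )

    # Sanitize the tag so it is safe to use as a named-mutex name.
    $mutexName = "build-" + ($ImageName -replace '[^A-Za-z0-9.-]', '_')
    $mutex = New-Object System.Threading.Mutex($false, $mutexName)
    $acquired = $false
    try {
        try {
            $acquired = $mutex.WaitOne()
        }
        catch [System.Threading.AbandonedMutexException] {
            # The previous owner exited without releasing; we own the mutex now.
            $acquired = $true
        }

        # Key step: query the images AFTER the wait, not before it. A list
        # fetched earlier (or handed in by a caller) may be stale by now.
        $images = @(& $GetImages)
        if ($images -contains $ImageName) {
            return $false   # another process built it while we waited
        }

        & $BuildImage $ImageName | Out-Host
        return $true
    }
    finally {
        if ($acquired) { $mutex.ReleaseMutex() }
        $mutex.Dispose()
    }
}

# Stand-in for Docker so the sketch runs anywhere: an in-memory image list.
$fakeImages = [System.Collections.Generic.List[string]]::new()
$getImages  = { $fakeImages }
$buildImage = { param($name) $fakeImages.Add($name) }

$tag    = 'nabbcimage:sandbox-17.1.17840.0-se'
$first  = Invoke-BuildIfMissing -ImageName $tag -GetImages $getImages -BuildImage $buildImage
$second = Invoke-BuildIfMissing -ImageName $tag -GetImages $getImages -BuildImage $buildImage
# $first is True (image was built), $second is False (build skipped)
```

Against a real Docker host, `$GetImages` could be something like `{ docker images --format "{{.Repository}}:{{.Tag}}" }`; that substitution is an assumption about how one might wire it up, not a description of the shipped fix.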
I don't think the failed builds are caused by this though - I will keep this open and look out for this.
Will clean all images on my build server and run all pipelines.
BTW - when I run nextminor and nextmajor daily - I do not cache the image. It will always have changed and building images takes extra time and space to clean up.
I'll update our pipelines to use your updated New-NavImage. (https://github.com/microsoft/navcontainerhelper/blob/11e6e9c3e331330c71d5cd8ff70f8def8ce41543/ContainerHandling/New-NavImage.ps1) - And time will tell :)
Not caching nextminor and nextmajor images works if you've got just one or a few pipelines. Since we've got a lot of pipelines being scheduled for nextminor/nextmajor, adding 10 minutes for every run is not an option...
I'm thinking of rewriting this for our pipelines, so I have a new pipeline scheduled every night that removes images and builds the current (if needed), nextminor, nextmajor and even a prevmajor (customers not yet upgraded to the current major need this during the 60 day upgrade time) for the countries we need. Then all other pipelines will use those images instead.
So instead of using myimage:sandbox-17.0.17126.17922-se or myimage:sandbox-17.1.17933.0-se, they will use current:se and nextminor:se and so on. Pipelines running OnPrem images would still need to use the old behavior, since they are always pinned to a specific version.
That would probably solve a few of my issues with the current approach:
But my time is limited for this, so if you've got a better way of solving the above issues I'll be happy to try it out! ;-)
Actually, I think your code change in New-NavImage is just what is needed to fix this issue...
You can run this script a few hours before you run the pipelines:
$vaultName = "BuildVariables"
$insiderSasTokenSecret = Get-AzKeyVaultSecret -VaultName $vaultName -Name "insiderSasToken"
if ($insiderSasTokenSecret) { $insiderSasToken = $insiderSasTokenSecret.SecretValueText } else { $insiderSasToken = "" }
Write-Host "Determining artifacts to use"
$images = @("current","nextminor","nextmajor")
$countries = @("se","dk")
$images | ForEach-Object {
    $image = $_
    $storageAccount = ""
    $type = "Sandbox"
    $select = $image
    $version = ""
    $minver = $null
    # Pass 1: take the latest artifact per country and remember the lowest version among them
    $countries | ForEach-Object {
        $url = Get-BCArtifactUrl -storageAccount $storageAccount -type $type -version $version -country $_.Trim() -select $select -sasToken $insiderSasToken | Select-Object -First 1
        if ($url) {
            $ver = [Version]$url.Split('/')[4]
            if ($null -eq $minver -or $ver -lt $minver) {
                $minver = $ver
                $minsto = $url.Split('/')[2].Split('.')[0]
                $minsel = "Latest"
                $mintok = $url.Split('?')[1]; if ($mintok) { $mintok = "?$mintok" }
            }
        }
    }
    if ($null -eq $minver) {
        Write-Host -ForegroundColor Red "Unable to locate artifacts for $image"
    }
    else {
        # Pass 2: request that exact version for every country and build one image per country
        $storageAccount = $minsto
        $version = $minver.ToString()
        $select = $minsel
        $sasToken = $mintok
        $countries | ForEach-Object {
            $country = $_
            $artifactUrl = Get-BCArtifactUrl -storageAccount $storageAccount -type $type -version $version -country $country -select $select -sasToken $sasToken | Select-Object -First 1
            if (!($artifactUrl)) {
                Write-Host -ForegroundColor Red "Unable to locate artifacts for country $country for $image (version $version)"
            }
            else {
                Write-Host -ForegroundColor Yellow "$($image):$($country) is $($artifactUrl.Split('?')[0])"
                New-BcImage -artifactUrl $artifactUrl -imageName "$($image):$($country)" -multitenant -licenseFile "C:\temp\license.flf"
            }
        }
    }
}
The reason it is a bit complex is that it finds the latest version in the current, nextminor or nextmajor release stream for which the requested countries are available.
If you only have se - it is very easy, but I don't want to run one version for dk and another for se because artifacts are in the middle of updating (which actually happens all the time).
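To make the URL handling in the script concrete, here is a small worked example of how the version, storage account and token are pulled out of an artifact URL and how the lowest version across countries is chosen. `Get-ArtifactUrlParts` is a hypothetical helper, and the dk URL and the `sv=placeholder` token are made up for illustration; only the URL layout is taken from the thread.

```powershell
# Hypothetical helper for illustration: split an artifact URL the same way the
# script above does (segment 2 = host, segment 4 = version, text after '?' = token).
function Get-ArtifactUrlParts {
    param([Parameter(Mandatory)] [string] $Url)

    $segments = $Url.Split('/')
    $token    = $Url.Split('?')[1]
    [pscustomobject]@{
        StorageAccount = $segments[2].Split('.')[0]
        Version        = [Version]$segments[4]
        Token          = if ($token) { "?$token" } else { '' }
    }
}

# Illustrative URLs only: the dk entry and the token value are assumptions.
$urls = @(
    'https://bcinsider.azureedge.net/sandbox/17.1.17933.0/dk?sv=placeholder',
    'https://bcinsider.azureedge.net/sandbox/17.1.17840.0/se?sv=placeholder'
)

# The lowest of the per-country versions is the one to request for every country.
$lowest = $urls |
    ForEach-Object { Get-ArtifactUrlParts -Url $_ } |
    Sort-Object Version |
    Select-Object -First 1

Write-Host "$($lowest.StorageAccount) $($lowest.Version) $($lowest.Token)"
# prints: bcinsider 17.1.17840.0 ?sv=placeholder
```

Casting to `[Version]` matters here: comparing the raw strings would order versions lexically rather than numerically.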
Thanks, that will give me a great start!
I also wanted the different countries to have the same build, so you saved me quite some time here. :)
I had a piece of this code in Run-AlPipeline, so I thought I would extract it since you asked:-)
In Run-AlPipeline you can also specify -additionalCountries which will build and test for all.
The fix was shipped in 1.0.10 - will close this now.
Let's open a new issue if concurrency issues persist - I will also be running all of mine nightly at the same time on the same machine.