Arcade: "Build promotion pipeline Yaml" changes require Guardian Service connection

Created on 3 Feb 2020  路  28Comments  路  Source: dotnet/arcade

  • Is this issue blocking (yes)
  • Is this issue causing unreasonable pain (yes)

As introduced here: https://github.com/dotnet/arcade/pull/4625 , this change turns on SDL checks like Guardian by default (possibly intentionally) but we didn't have the service connection for Guardian defined on devdiv, so Roslyn's arcade-based builds broke.

I'm putting this up to be discussed in Arcade triage but basically we need to have some way when changes occur to make sure that any Service connections or variable groups (or literally any other shared resource I'm failing to list here) referenced in the eng\common folder need to exist everywhere Arcade bits flow.

Most helpful comment

This was not a breaking change in AzDO.

I chatted with @riarenas about this a bit, and @jaredpar expressed frustration that Arcade is breaking signed builds a lot of the time. We have the arcade-validation repo builds specifically for this purpose, but we're still missing things. This comes down to a few things:

  • dnceng is different than DevDiv
  • arcade-validation does not have enough coverage for the breadth of scenarios that show up in official builds of real repos.

The second bullet should be one that we should be able to remedy in arcade-validation over time, but gaining that test coverage, given the complexity of official builds, will be time consuming and involve breaks in repos in the process.

Here is what I think we should do:

  1. Change the arcade-validation official build to be a 'stub' build
  2. The arcade-validation build launches a series of official builds that must pass before the build can be promoted. These builds will always include two base builds:

    • An arcade-validation build in dnceng

    • An arcade-validation build in DevDiv

  3. In addition to those two base builds, there will be a series of identified repos, base branches, and build pipelines that must pass prior to promoting a build to the Eng - Latest channel. This set will vary over time based on where we think our test coverage is lacking. Let's say it initially includes dotnet/roslyn and dotnet/runtime.
  4. When we encounter failures that show up in the target repos but not in the arcade-validation builds, we will investigate, determine whether the failure is our fault and determine a course of action. We will add test coverage to arcade-validation where needed. The course of action may be that the target repo needs to make changes if they were using arcade in an unsupported way/
  5. Over time, as arcade-validation becomes a reliable measure of whether arcade will pass in all target repos, we can remove the extra validation builds to reduce flakiness and end to end validation time.

Mechanically, what it means is that arcade-validation will branch off the repos+branches we need to validate, update to the candidate version of arcade, and queue the build. The build must pass for arcade to move forward.

One additional notes: It's important that this not be a license for repos to use arcade in non-standard ways and get support for it.

/cc @shawnro @markwilkie @Chrisboh @tkapin

All 28 comments

Is this issue blocking (no)

Why no? AFAIK this is blocking all our signed builds.

Is this issue blocking (no)

Why no? AFAIK this is blocking all our signed builds.

you may change that a yes above if you like; I considered it non-blocking because it had a few-minutes, no-code workaround.

As introduced here: #4625 , this change turns on SDL checks like Guardian by default (possibly intentionally) but we didn't have the service connection for Guardian defined on devdiv, so Roslyn's arcade-based builds broke.

I don't think this was what introduced the problem. The Build Promotion Pipeline didn't require new service connections to work and the only related change in execute-sdl.yml was to make Run_SDL disabled by default. I think the issue might be that we didn't ran the SDL tools on devdiv yet? @sunandabalu might know better.

@MattGal told me that he created the missing service connection in DevDiv so I went ahead and queued a new build here: https://devdiv.visualstudio.com/DevDiv/_build/results?buildId=3438980&view=results @MattGal had already done that :-)

As introduced here: #4625 , this change turns on SDL checks like Guardian by default (possibly intentionally) but we didn't have the service connection for Guardian defined on devdiv, so Roslyn's arcade-based builds broke.

I don't think this was what introduced the problem. The Build Promotion Pipeline didn't require new service connections to work and the only related change in execute-sdl.yml was to make Run_SDL disabled by default. I think the issue might be that we didn't ran the SDL tools on devdiv yet? @sunandabalu might know better.

~@MattGal told me that he created the missing service connection in DevDiv so I went ahead and queued a new build here: https://devdiv.visualstudio.com/DevDiv/_build/results?buildId=3438980&view=results~ @MattGal had already done that :-)

I don't think enabled or disabled is the deal here, in the case of the build I queued after adding the service connection the usage of it was indeed disabled. I still think it was this refactoring that brought the reference to the connection along and caused the issue... am I missing something?

@MattGal now I understand what happened here.

  • The problem was indeed caused after the merge of PR #4625.
  • The failure, however, wasn't caused by changes in execute-sdl.yml
  • The failure was caused by the removal of a YAML template condition that now causes execute-sdl.yml to always be evaluated by ADO parser.

    • ADO requires that service connections be available at build-def parsing time even though the job that uses the service connection won't really execute.

Now I got what happened :-)

So even though the SDL stuff wasn't actually running, AzDO is requires that the service connections be there regardless?

cc/ @mmitche @sunandabalu @jaredpar

Yes, this makes sense too since the conditions that disable a step from usage have to be evaluated at the last possible second (as, "previous step succeeded" is a very common check)

@markwilkie - That's correct.

The thing is that the type of the condition also change. Before we had a condition that was evaluated at template expansion time and then it got changed to a runtime condition..

Two questions then:

1) What are the next steps here to fix the root cause
2) Was there a breaking change from AzDO? (e.g. template expansion vs. runtime)

This was not a breaking change in AzDO.

I chatted with @riarenas about this a bit, and @jaredpar expressed frustration that Arcade is breaking signed builds a lot of the time. We have the arcade-validation repo builds specifically for this purpose, but we're still missing things. This comes down to a few things:

  • dnceng is different than DevDiv
  • arcade-validation does not have enough coverage for the breadth of scenarios that show up in official builds of real repos.

The second bullet should be one that we should be able to remedy in arcade-validation over time, but gaining that test coverage, given the complexity of official builds, will be time consuming and involve breaks in repos in the process.

Here is what I think we should do:

  1. Change the arcade-validation official build to be a 'stub' build
  2. The arcade-validation build launches a series of official builds that must pass before the build can be promoted. These builds will always include two base builds:

    • An arcade-validation build in dnceng

    • An arcade-validation build in DevDiv

  3. In addition to those two base builds, there will be a series of identified repos, base branches, and build pipelines that must pass prior to promoting a build to the Eng - Latest channel. This set will vary over time based on where we think our test coverage is lacking. Let's say it initially includes dotnet/roslyn and dotnet/runtime.
  4. When we encounter failures that show up in the target repos but not in the arcade-validation builds, we will investigate, determine whether the failure is our fault and determine a course of action. We will add test coverage to arcade-validation where needed. The course of action may be that the target repo needs to make changes if they were using arcade in an unsupported way/
  5. Over time, as arcade-validation becomes a reliable measure of whether arcade will pass in all target repos, we can remove the extra validation builds to reduce flakiness and end to end validation time.

Mechanically, what it means is that arcade-validation will branch off the repos+branches we need to validate, update to the candidate version of arcade, and queue the build. The build must pass for arcade to move forward.

One additional notes: It's important that this not be a license for repos to use arcade in non-standard ways and get support for it.

/cc @shawnro @markwilkie @Chrisboh @tkapin

I'm sure we are, but I want us to make sure we're communicating directly with the people who are feeling this pain to let them know that we prioritizing and actively working on preventative measures so that these types of things don't keep happening.

This will add a considerable amount of time from the moment the arcade build finishes to the moment it starts flowing to the repos, I guess is a price to pay to warranty that the version which is flowing is clean enough but I think it would be good to let partners know

This will add a considerable amount of time from the moment the arcade build finishes to the moment it starts flowing to the repos, I guess is a price to pay to warranty that the version which is flowing is clean enough but I think it would be good to let partners know

We still have an option of promoting builds manually if we need. Most partners are taking daily updates, so they will not see a tremendous amount of difference. This is about quality first. We should then be working on our arcade-validation coverage, so that we can have a fast validation that we trust.

I added the details captured here to #4719

dnceng is different than DevDiv

Part of the way we can help is getting off of DevDiv. That's a goal we've been working towards and we're close to removing the biggest obstacle in our way of moving: OptProf training.

Unfortunately though in order to fix our OptProf issues we need consistent signed builds to insert into Visual Studio. The recent arcade publish issues have essentially blocked this work because it interrupts our signed build cadence.

arcade-validation does not have enough coverage for the breadth of scenarios that show up in official builds of real repos.

A point I've pushed several people on in person is whether we can actually truly close this gap. Publish validation is hard because there are just too many variables involved:

  • Build definitions of every repository
  • AzDO instance that is used to build repositories. Each instance has a different security context, different set of apps, etc ... to validate.
  • Publish locations. Every repository seems to publish to a variety of non-pipeline end points, each has different security requirements, different tokens, etc ...
  • Time. There is a non-trivial time lag between when arcade validation occurs and when repositories actually start ingesting the changes. How much of the variables above change in that gap (think about for instance secret rotation, variable group changes, etc ...)

The question I've been asking myself are:

  • Can arcade really validate the combination of all these variables? Maybe ... but it seems really expensive.
  • Is it worth doing this or is there a better way we could partition responsibilities here.

One idea we've been discussing is changing how we build. For repositories maybe official builds should focus on build only and not publish. The job of the official build would be to produce signed binaries in the form of build and pipeline artifacts. There would be a separate job, which was central to arcade, which was responsible for taking the artifacts and pushing them to all of our publish locations:

  • NuGet
  • Sleet
  • Dependency flow
  • etc ...

This would make our official builds significantly more reliable, reduce the validation burden of arcade and let us handle publishing issues centrally. It also has the benefit that when publishing goes down for a period of time no information is lost. The official build output is still waiting around in artifact form for the publishing process to pick up and continue working with.

This would make our official builds significantly more reliable, reduce the validation burden of arcade and let us handle publishing issues centrally. It also has the benefit that when publishing goes down for a period of time no information is lost. The official build output is still waiting around in artifact form for the publishing process to pick up and continue working with.

The build output is still waiting in the current case. Anyways, @JohnTortugo opened up an issue to implement this. I also think this is a good step forward, but except in cases where a repo cares less about the arcade publishing vs. VS insertion, and they are separate, this is likely to be similar to today:

  • If the publishing pipeline fails, the result is largely the same. The build is not ready for use. Yes, you can say "I have a green build", but that's not really helpful. You can say "my build stage is green today".
  • We will still need separate dnceng and DevDiv implementations, which has the same sort of pitfalls as it does today.
  • Furthermore, we've still seen breaks in official builds outside publishing that need coverage.

Anyways, I think we should take both approaches. Separate out the publishing pipeline and add more repos to what we use for validation prior to promotion.

  • Also, we are trying to move things out of the publishing stages altogether and into the build (meaning full .NET Core build) promotion pipelines, to avoid a lot of these issues altogether. We increased complexity in those stages because we had nowhere else to put validation, but that resulted in some major pain (e.g. symbol servers have problems)

Another thing that we should think about is reducing the amount of dependencies in the build pipeline.
This particular issues was caused by not having a service connection for Guardian. Is the service connection for Guardian actually needed? Can it be eliminated?

Another thing that we should think about is reducing the amount of dependencies in the build pipeline.
This particular issues was caused by not having a service connection for Guardian. Is the service connection for Guardian actually needed? Can it be eliminated?

That's specifically what this is about:

Also, we are trying to move things out of the publishing stages altogether and into the build (meaning full .NET Core build) promotion pipelines, to avoid a lot of these issues altogether. We increased complexity in those stages because we had nowhere else to put validation, but that resulted in some major pain (e.g. symbol servers have problems)

What can we do in the immediate term to ensure we're not breaking partners in DevDiv that are actively trying to make progress towards dnceng? Or, is there some small price that partners in DevDiv can pay to achieve higher quality? @mmitche, how far away is inter-branch dependency flow? ie, Roslyn could take updates from Arcade into Roslyn@arcade-updates which publishes to some channel that, if successful, flows to Roslyn@master.

What can we do in the immediate term to ensure we're not breaking partners in DevDiv that are actively trying to make progress towards dnceng? Or, is there some small price that partners in DevDiv can pay to achieve higher quality? @mmitche, how far away is inter-branch dependency flow? ie, Roslyn could take updates from Arcade into Roslyn@arcade-updates which publishes to some channel that, if successful, flows to Roslyn@master.

That would be possible today, no extra features required besides some extra channels. I am reticent to say that it's on roslyn to validate this stuff though. I think the problem is most likely we won't convince everyone to move off of DevDiv. Nuget comes to mind. As I see it, we're supporting people in DevDiv today. If we want to make the statement that we don't support them there, then they could take the steps to get off of DevDiv, or add an extra valdiation step themselves. But AFAIK that's not the statement we have made, and thus putting the burden on our partners to validate is not a good approach.

Is this a lot of work or a large tax if they wanted to get more stable official builds?

To add an arcade-validation build in DevDiv? I don't think it's a lot of work. Especially since I think the other validation builds are potentially also worth doing. The statement being made is that arcade-validation is not good enough to not break builds in DevDiv OR dnceng.

Any input @jcagme ? Maybe this has been sorted based on the recent work in this area?

SDL related features in the release pipeline only include BinSkim. Source code checks need to be done during builds. Given this and assuming nothing else has been done around SDL in builds, I guess things should be the same as of Feb 4th.

Triage: Adding to the release epic as SDL is going to execute there.

Required infrastructure changes/update so SDL runs in validation and release pipelines are complete. Repo owners need to onboard to get the benefits but ATM we are not pushing on them to make the change and rather have them opt in when they have time

Was this page helpful?
0 / 5 - 0 ratings

Related issues

JohnTortugo picture JohnTortugo  路  35Comments

ViktorHofer picture ViktorHofer  路  27Comments

riarenas picture riarenas  路  49Comments

vatsan-madhavan picture vatsan-madhavan  路  28Comments

JohnTortugo picture JohnTortugo  路  29Comments