Caseflow: Post-mortem: Certification v2 interfering with Certification v1 | 10:28AM June 7th, 2017

Created on 7 Jun 2017  Â·  25Comments  Â·  Source: department-of-veterans-affairs/caseflow

Events:

10:03 : Deployed Caseflow to production. Build 105

10:22 : Sentry reported a BGS Soapfault on findPersonByPtcpntld. The searched person was not found. This brought to our attention immediately because this BGS endpoint is not supposed to be called in Certification V1.

10:42 : @amprokop posted PR to address the issue.

11:03 : The damage was beginning to be understood. Caseflow was setting BFSO field to 3, which translates to nil. We were deleting the Attorney Representative's Type field.

11:22 : Merge Caseflow fix that addresses the issue. https://github.com/department-of-veterans-affairs/caseflow/pull/2264. (Presumably we deployed immediately thereafter.)

12:00 : Since the previous values was not recorded, @ToddStumpf was brought into the discussion to restore a VACOLS snapshot. Todd was able to restore an earlier hourly (09:46) backup to the spare failover instance of VACOLS.

13:00 - 16:00: Verified that 18 entries were affected, and they were manually restored to the previous values through manual SQL commands. The commands were validated by 5 engineers before the commit.

Affected parties:

18 rows in VACOLS were mutated incorrectly. Specifically, the BFSO field that determines the Veterans' Attorney Rep type was set to 3, which means nil.

Solutions:

  • Deployed the patch to production environment to address the issue.
  • Manually restored a VACOLS snapshot to 3 hours prior to the incident.
  • Manually ran SQL commands to restore the entries.

Key issues

  • There was a feature leak. Certification V2 (an unreleased feature) has a logic path leak, and was mutating data incorrectly in VACOLS.

  • We were very fortunate for 2 reasons. Any slight misalignment would have caused permanent data loss.

    • BGS reported an unrelated error that lead us to the discovery of this bug. If BGS did not report any SoapFault, we would have never found this issue.
    • @ToddStumpf was still around, and has access to VACOLS. He was able to restore a snapshot for us within an hour.

    • While the fix was simple, we waited for over 40 min for Travis to validate the PR. The test usually takes about 15-20 minutes. And they were flaky, so the PR was built twice.

    • We hesitated to rollback the deployment because we didn't understand the scope of the damage. The rollback decision needs to be accelerated.

    • We did not have the previous VACOLS values in logs, and therefore we have to rely on DB snapshots to recover. As @ToddStumpf transition out of USDS next month, this path will be extremely painful. This is because VACOLS backup only persist for 3 days, and we may not receive support from OIT in time.

  • We do not have a way to bring down a specific portion of the site without affecting all users. While Certification should've been taken down, Reader and Dispatch should continue to operate.

Action items:

caseflow-certification

All 25 comments

So this bug slipped past all three gate - Code Review, Unit Test and Integration Test.

Thoughts:

  • Adding more gates will not help. We need to resolve this issue in one of the existing stages.
  • Code Review stage is likely in best position to capture this issue. All mutation to external system (VACOLS, BGS, VBMS) must be gated with feature flags.
  • Like @kierachell posted, all mutation to external system must have before/after value recorded. Whether or not it should be a history table in PG or just a log entry in CW is up for debate.
  • We should not rely on VACOLS snapshot to bail us out.

the mutation log is a powerful but dangerous thing: to be useful it must be accurate, to be accurate, it must have PII. If it has PII -- can it be in CW? it may have to reside on local disk...

the VACOLS backups only persist for 3 days; if the error hadn't been caught quickly, the prior states could have been lost. A more robust backup policy for the db would be:

  • hourly retained for several days (3 days)
  • daily retained for several weeks (14 days)
  • weekly retained for several months (60 days)
  • monthly retained for several months (180 days)
    that usually requires some sort of tape backup, however, which is probably not available. We could look at using glacier, though, as that is available in GovCloud.

CW already have PII. I brought up this topic months ago, and as far as I can tell, there is no reason why PII can not be in CW.

I've been trying to think of any technique that would be expected to catch this before going out, except code review. Nothing is coming to mind.

We should be verifying that any active feature can be deactivated without impacting production -- as we want to be able to turn off a feature if necessary -- but this is the opposite case ...

An approach that was mentioned by Alan earlier:

  • maintain a set of companion tables that record prior state right in the db itself

As we're using Oracle, it seems to have triggers which could fire automatically to populate the companion tables (I've heard them called "audit tables", but "history tables" or "prior-value tables" also capture the idea).

This is something worthy of discussion and debate. And research. There may be a gem that does this already for ActiveRecord (if it's not a built-in feature).

I don't think feature flags are the problem here. This would have been a bad bug even if all our users were on Cert V2.

The offending line of code was a logic branch that had no associated unit tests, which would have prevented this sort of issue.

@kierachell was this deployed before or after you had a chance to QA this story? This could be our example of a bug that was introduced because we deployed prior to QA looking at the story. However, I'm not sure what we should have done, there were (are?) a ton of blocked Cert V2 issues in validation for weeks which would have prevented deployment.

Like @kierachell posted, all mutation to external system must have before/after value recorded. Whether or not it should be a history table in PG or just a log entry in CW is up for debate.

I like this idea. VACOLS updates could have better logging.

@ToddStumpf CW can have PII in it.

I QA'd part of the story, however:

The error that caused the alert was an edge case from #2061, which was a blessing in disguise. None of these would've "broken" actual Certification - it's just an annoying Sentry error.

I helped create the logic for #2098, so I was aware of needing to validate it and it would've silently passed both v1 and v2 tests because it doesn't cause errors or create any obvious noise. I did not fully validate #2098 prior to deploy (assuming that it was behind a feature toggle); I was going to...after the deploy.

I probably would've noticed the new VACOLS update logs in v1 if I was to go super-thorough through the code and logs (the AC not having mentioned anything about "make sure v1 does _not_ update anything) and questioned their presence.

I guess the point for QA is to be more vigilant about these sort of regressions: making sure that v1 of whatever is being upgraded is not affected by v2 - even silently. That would block deploys for any pressing "feature rollouts" however.

Unit Test and Integration Test would not catch a bug like that.

  1. We need to be notified (via Sentry?) that the code that should hidden behind the feature flag is being used in Production. However, in this case, we forgot to hide the code behind the feature flag in the first place
  2. Be careful mixing the current code with the code hidden behind the feature flag
  3. QA needs to be aware and validate that the code hidden behind the feature flag is not running in Production (tricky to validate)
  4. Implement a table on our side (VACOLS action history)
  5. We need to monitor Vacols updates to ensure we are not corrupting data

_"I don't think feature flags are the problem here. This would have been a bad bug even if all our users were on Cert V2. The offending line of code was a logic branch that had no associated unit tests, which would have prevented this sort of issue."_

I am not sure if unit test would've prevented this unless the unit test is written with the intent of preventing feature leak. Keep in mind, there was no exception thrown in this code. Everything was operating _normally_ at the surface. We only observed anomaly after BGS reported a "Person not found" soapfault.

Yeah. Potentially, we should have done nil checking and thrown an exception — I don't think there's a business case where we actually want to update the VACOLS rep to nil here.

We could have caught this with a unit test, but it would have had to be awfully specific. Potentially we can establish a team norm that you have to test that all code behind a feature flag does not run unless the feature flag is enabled?

If we're looking at FeatureToggle, then a good place to validate something like this is @NickHeiner 's checklist:
https://github.com/department-of-veterans-affairs/caseflow/blob/master/docs/nick-cr-how-to.pdf

  • When working with FeatureToggle-able code: double-check that it's behind the flag.

Would it help to make this a full on discussion for next week? Say... Wednesday?
Action items I see:

  1. Before 'n' After logs to CW
  2. Code Review checklist item
  3. QA really watch the logs of any v1 testing
  4. Discuss more options next week (or whenever people are back)

Another issue here that we should consider is— we knew the problem and the fix and had the code on hand for somewhere between 30-45 minutes before we were able to merge it into master. In this case, the absolute amount of data that was affected was very low. But it would have been halved if we had a reliable build.

There are two problems here:

  • Travis is unreliable.
  • We weren't able to take only Certification out of service.

Yes, the timing was really weird. We found the anomaly about 10 min after deployment. It _looked_ like a easy fix, but Travis PR build time really hurt us. Because of the Travis PR penalty, maybe a rollback should've been chosen as the default path.

I think we either have an immediate rollback or we immediately take the site down after we realize we have a bug that potentially affects production data.

Or a combination of both— we could take the site down then rollback.

Good point. I am going to take an action item on fast rollback using ASG/LC. https://github.com/department-of-veterans-affairs/appeals-deployment/issues/595

Going from what @amprokop and @askldjd mentioned I made an issue:
https://github.com/department-of-veterans-affairs/caseflow/issues/2265

  • I leave Travis snafu up to ya'll, but the above will be able to address several problems:

    • Service degradation

    • Stop bleeding for app-specific bugs

So, here's a summary of the action items that pop out to me.

  1. Clearer norms around unit testing (Do we have any team guidelines at all?). Discussion issue to come, @amprokop to take that on.
  2. Clearer norms around new service integrations. Discussion issue to come, @amprokop to take that on.
  3. CI is slow and unreliable. @mdbenjam to make action items for himself/whiskey (anyone else?)
  4. More granular way to take down specific apps, captured in https://github.com/department-of-veterans-affairs/caseflow/issues/2265
  5. Fast rollback of Caseflow, captured in (https://github.com/department-of-veterans-affairs/appeals-deployment/issues/595)
  6. Playbook for this type of situation in the future (how to we determine when to take an app out of service? rollback? both?). Can be captured in a doc. @NickHeiner, do you have bandwidth to spearhead this?

@NickHeiner, @nicholasholtz, @kierachell, @askldjd, @aroltsch, @ToddStumpf

Re: https://github.com/department-of-veterans-affairs/caseflow/issues/2268#issuecomment-307102330

@amprokop If I understand what this feature is doing, there definitely could be a situation where a rep in VACOLS is replaced with nil; sometimes Vets decide to drop their reps without replacing them.

CI is slow and unreliable. @mdbenjam to make action items for himself/whiskey (anyone else?)

I'm on this. See #2293.

Playbook for this type of situation in the future (how to we determine when to take an app out of service? rollback? both?). Can be captured in a doc. @NickHeiner, do you have bandwidth to spearhead this?

Yes, I'll take this on, although it may take me a little while to get around to it.

I agree with the previously stated action items.

I am skeptical that we can have enough automated testing to drive risk down to 0%. It's easy to write tests for known issues. It's hard to write tests for every way that any thing could possibly go wrong. Maybe we could have used more testing here, but I don't want to rely on that.

I would like a stronger team norm around code review for code that writes to a system we don't directly control. Maybe this is too cumbersome, but what if we asked for two reviewers to sign off on such a situation?

@amprokop If I understand what this feature is doing, there definitely could be a situation where a rep in VACOLS is replaced with nil; sometimes Vets decide to drop their reps without replacing them.

@nicholasholtz — VACOLS actually has a "No Representative" value ("L" in BFSO), so we would always set it to "L", never nil.

@amprokop @nickheiner-usds @askldjd
Seems like the discussion is over: I will extract action items and close this

Was this page helpful?
0 / 5 - 0 ratings