As a user, I would like to be informed that Caseflow's dependencies may be experiencing an outage so I can decide whether to continue using Caseflow or come back later.
This may take the form of a site-wide banner, a modal, or a page that we redirect to.
"We've detected some issues with the systems that Caseflow depends on. We're aware of the issue and will remove this message when functionality is restored."
Designs forthcoming. Please challenge the hypotheses above or add additional thoughts in the comment section!
tl;dr Use Caseflow Monitor to detect downtime, and show a friendly warning banner in the app.
Today, Caseflow applications depend on several VA intranet applications to function properly. If any of these systems experiences downtime during business hours, our system is partially functional at best and completely nonoperational at worst. When these systems go down, an engineer manually toggles a flag in Redis to take the site out of service. This process is not sustainable for several reasons.
With Caseflow Monitor, we have the capability to detect this downtime in near real time. Monitor queries the APIs of these external VA dependencies every few minutes and calculates status and up rate. The data is available through a RESTful API, and long-term data is reported through Prometheus for aggregation. So the pieces are all there; we just need to put them together.
A dead simple approach is to query Monitor periodically (e.g. with a Sidekiq job) to determine whether VA systems are stable. If they are not, display a Site Degradation banner that says _"Sorry, we are experiencing some server issues at the moment. We are working hard on it. :construction:"_.
Feedback is always welcome.
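A minimal sketch of that check in Python, for illustration only. The payload shape (a mapping from dependency name to an up rate in percent), the 90% threshold, and the names `degraded_dependencies` and `banner_message` are all assumptions; Monitor's actual API response is not described in this thread. A periodic job would fetch the report and cache the result for the app to read.

```python
from typing import Dict, List, Optional

# Hypothetical threshold: a dependency counts as degraded below this up rate (percent).
UP_RATE_THRESHOLD = 90.0

# Banner copy taken from the proposal at the top of this issue.
BANNER_TEXT = (
    "We've detected some issues with the systems that Caseflow depends on. "
    "We're aware of the issue and will remove this message when "
    "functionality is restored."
)


def degraded_dependencies(report: Dict[str, float],
                          threshold: float = UP_RATE_THRESHOLD) -> List[str]:
    """Return the names of dependencies whose up rate is below the threshold."""
    return sorted(name for name, up_rate in report.items() if up_rate < threshold)


def banner_message(report: Dict[str, float]) -> Optional[str]:
    """Return the warning banner text, or None when every dependency looks healthy."""
    return BANNER_TEXT if degraded_dependencies(report) else None


# Made-up numbers in the assumed payload shape.
sample_report = {"BGS": 99.5, "VACOLS": 100.0, "VBMS": 42.0}
print(degraded_dependencies(sample_report))  # ['VBMS']
print(banner_message(sample_report) is not None)  # True
```

Keeping the decision in a pure function means the threshold can be tuned and tested without calling Monitor.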
@amprokop
We'll have a discussion on this next Tuesday, June 6th after standup.
@shellicious is planning to produce a mockup before then to aid discussion! Can you post it in here once complete, Shelly?
And adding a bit more background.
VBMS prod has been down for maintenance and upgrades for the past 3 weekends. Our Dispatch users were experiencing failures on Saturday.
BGS/Siteminder prod has gone down twice in May during business hours. During these outages, I was about 30 minutes to 1 hour behind on taking the site down.
On each occasion, we received a number of 500 errors. If we can make users immediately aware of the situation, I believe it would reduce user frustration.
So this is not the same as taking the site offline, as does the manual toggle today? This would instead be warning users that they may experience issues?
For the message, could we change "server issues" to something like "issues with the systems that Caseflow depends on" and change "we are working on it" to something like "we're aware of the issue and will remove this message when functionality is restored." The first change is just to finger point, albeit vaguely. The second is to be a bit more honest, as we may not be taking any specific actions, as in the examples Alan cites above.
> So this is not the same as taking the site offline, as does the manual toggle today? This would instead be warning users that they may experience issues?
Yes. This is just a warning that they _may_ experience issues. We are not taking down the site during this period.
As for wording, please feel free to change it. I have little opinion and expertise in this matter. Thanks!
To refocus this discussion, I've added some edits. Take a look and add more thoughts as they arise!
This will be discussed after standup on Tuesday, June 6th.
So if BGS goes down, then we're totally down.
If VACOLS goes down, then Certification/Dispatch/Reader/Hearings/API go down.
If VBMS goes down, then... well it depends.
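The dependency-to-impact relationships above could be captured in a small lookup table. This is only a sketch: the names `IMPACT`, `ALL_APPS`, and `describe_impact` are hypothetical, and the VBMS entry is deliberately left unspecified because its impact "depends".

```python
from typing import Dict, FrozenSet, Union

ALL_APPS = "all"  # sentinel: every Caseflow app is affected

# None means the impact is not settled in this thread (the VBMS case).
Impact = Union[str, FrozenSet[str], None]

IMPACT: Dict[str, Impact] = {
    "BGS": ALL_APPS,
    "VACOLS": frozenset({"Certification", "Dispatch", "Reader", "Hearings", "API"}),
    "VBMS": None,
}


def describe_impact(dependency: str) -> str:
    """Summarize which apps an outage of the given dependency affects."""
    impact = IMPACT[dependency]  # raises KeyError for dependencies not listed above
    if impact is None:
        return f"{dependency} outage: impact varies"
    if isinstance(impact, str):
        return f"{dependency} outage: all of Caseflow is affected"
    return f"{dependency} outage: affects " + ", ".join(sorted(impact))


for name in ("BGS", "VACOLS", "VBMS"):
    print(describe_impact(name))
```

A table like this would let a banner name the affected apps rather than warning about everything at once, if the team decides that level of detail is useful.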
Thanks for the edits @amprokop, I'll post mocks here later today
General idea LGTM. I agree with Chris' points that we should be clear that we're not actively trying to fix our dependencies, but are aware of the issue. And I think this is much better than just failing randomly.
Mocks! with these goals in mind:
and assumptions:
Separate page option
Here we display the warning message on a separate page, and allow the user to continue onto the Caseflow app they were trying to use.

Warning message underneath navigation
Here are mocks following visual design patterns from alerts in the style guide. They have copy of different lengths and detail, as well as different type treatment.
With a 2-line message and styling close to our current alert styling:

With a 1-line message and type at normal weight:

With a short 1-line message and an icon:

With a red background and a long message (I don't like this, it's very jarring):

Warning message at top of page
Here are mocks with warning messages that appear at the very top of the page, above the navigation.
Icon and 1-line message, bold weight type:

1-line message, normal weight type, without icon:

1-line message, bold type, without icon:

Copy used for the messages
Here are the variations of messages in the mocks:
Here's the long message in the separate page:
@amprokop @askldjd @NickHeiner @cmgiven @kierachell, appreciate your feedback on separate page vs. in-page alert, and thoughts on copy.
@Chingujo @gnakm @lakohl @abbyraskin, appreciate your design feedback on:
Looks great. Add to the design meeting agenda this week?
@cmgiven sure! I will be out; I trust y'all to come to a conclusion. We are also discussing this tomorrow (Tues 6/6) after standup.
Agree that this would be a good topic for the Design meeting, as it will be a pattern that we use across apps. Personally, I like the first version (possibly with a little more grey space between it and the Vet ID). It's the easiest to read and lets the user know they can continue working regardless. I would suggest a slight adjustment to the message so we don't say we are working on fixing the problem if this is not true.
I'm not so fond of the versions with the warning above the Caseflow logo.
We decided to pursue this via unanimous consent.
We came to a consensus that taking the site down on an automated basis is not something we want to do. Outages are a spectrum. We can revisit this topic if we begin to believe we’re really good at automatic detection of outages. Alan described his experience with VBMS partial outages.
Shelly expressed a preference for the separate page option. Alex (and/or Lara/Lauren?) expressed a preference for the banner option. Nick expressed his trust in the design team.
Design team to take this up and settle on copy, placement, behavior, and separate page/banner.
cc @lakohl who seems likely to bring this topic up next ^
I noticed this banner in Slack yesterday that's related to the problem we are trying to solve here and ran it by Shelly - something to reference for the design discussion tomorrow, if helpful. Slack provides users with more information by linking to a status update page, a strategy that may or may not be useful/possible for our use case as we think about banner vs. separate page vs. both.

Ticket for implementation can now be found here: #2294
Thanks everybody for your input!
Copy used for the messages
Here are the variations of messages in the mocks:
This may cause delays or errors. Keep calm and carry on as we work to fix these issues.