Caseflow: Enhancement: Display site degradation banner automatically

Created on 2 Jun 2017 · 17 Comments · Source: department-of-veterans-affairs/caseflow

As a user, I would like to be informed that Caseflow's dependencies may be experiencing an outage so I can decide whether to continue using Caseflow or come back later.

This may take the form of a site-wide banner, a modal, or a page that we redirect to.

Potential copy

"We've detected some issues with the systems that Caseflow depends on. We're aware of the issue and will remove this message when functionality is restored."

Details

  • Since we don't want outages in our monitoring system to take production down, we should not actually prevent the user from continuing; we should merely provide information.
  • We should make this as simple as possible, at least at first: there should be just one message, it should be displayed if either or both of BGS and VBMS are down, and it should be displayed across all Caseflow apps.
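
A minimal sketch of the rule described in the two points above, not Caseflow's actual implementation. It assumes a hypothetical `statuses` mapping from dependency name to an "is up" boolean; the function name, the constant names, and the mapping shape are all illustrative. The function only returns a message to display and never blocks the user.

```python
from typing import Mapping, Optional

# Copy taken from the "Potential copy" section of this issue.
DEGRADATION_MESSAGE = (
    "We've detected some issues with the systems that Caseflow depends on. "
    "We're aware of the issue and will remove this message when "
    "functionality is restored."
)

# Dependencies named in this issue; the tuple itself is an illustrative assumption.
WATCHED_DEPENDENCIES = ("BGS", "VBMS")


def degradation_banner(statuses: Mapping[str, bool]) -> Optional[str]:
    """Return the single site-wide message if any watched dependency is down.

    `statuses` maps a dependency name to True (up) or False (down).
    A dependency missing from `statuses` is treated as up, so a gap in
    monitoring data never produces a banner on its own.
    """
    any_down = any(
        statuses.get(name, True) is False for name in WATCHED_DEPENDENCIES
    )
    return DEGRADATION_MESSAGE if any_down else None
```

Because the function returns `None` when nothing is known to be down, every app can call it on each page render and skip the banner without special-casing missing data.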

Designs forthcoming. Please challenge the hypotheses above or add additional thoughts in the comment section!


Original post

tl;dr Use Caseflow Monitor to detect downtime, and show a friendly warning banner in the app.

Today, Caseflow applications depend on several VA intranet applications to function properly. If any of these systems experiences downtime during business hours, our system is partially functional at best and completely nonoperational at worst. When these systems go down, an engineer manually toggles a flag in Redis to take the site out of service. This process is not sustainable for several reasons.

  • Dispatch users work on Saturday morning. :zzz:
  • Certification users start at 6am EST. :sleeping_bed:
  • eX users often work til 10pm EST. :tv:

With Caseflow Monitor, we have the capability to detect this downtime in near real time. Monitor queries the APIs of these external VA dependencies every few minutes and calculates status and up rate. The data is available through a RESTful API, and long-term data is reported through Prometheus for aggregation. So, the pieces are all there. We just need to put them together.

A dead simple approach is to query Monitor periodically (e.g. from a Sidekiq job) to determine whether VA systems are stable. If they are not, display a Site Degradation banner that says _"Sorry, we are experiencing some server issues at the moment. We are working hard on it. :construction: "_.
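
The polling approach could look roughly like the sketch below. It is written in Python for brevity rather than as a Sidekiq job, and `fetch_statuses` is a hypothetical callable standing in for a request to Monitor's API, whose real endpoints and response shape are not described in this issue. The class name, the `max_age_seconds` value, and the choice to let stale results expire are assumptions made for illustration.

```python
import time
from typing import Callable, Mapping, Optional


class DegradationFlag:
    """Caches whether the degradation banner should be shown.

    `fetch_statuses` is a hypothetical callable that asks Monitor for the
    current state of each dependency and returns {name: is_up}.
    """

    def __init__(
        self,
        fetch_statuses: Callable[[], Mapping[str, bool]],
        max_age_seconds: float = 600.0,  # illustrative value, not from the issue
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_statuses
        self._max_age = max_age_seconds
        self._clock = clock
        self._degraded = False
        self._checked_at: Optional[float] = None

    def refresh(self) -> None:
        """Run this from a periodic background job."""
        try:
            statuses = self._fetch()
        except Exception:
            # Monitor itself may be down; keep the last known state and
            # let it expire instead of failing the caller.
            return
        self._degraded = not all(statuses.values())
        self._checked_at = self._clock()

    def show_banner(self) -> bool:
        """True only when a recent check found a dependency down."""
        if self._checked_at is None:
            return False
        is_fresh = self._clock() - self._checked_at <= self._max_age
        return self._degraded and is_fresh
```

A background job would call `refresh()` every few minutes, and page rendering would only read `show_banner()`, so a slow or failing Monitor request never sits in the path of a user's page load.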

Feedback is always welcome.

@amprokop

All 17 comments

We'll have a discussion on this next Tuesday, June 6th after standup.

@shellicious is planning to produce a mockup before then to aid discussion! Can you post it here once it's complete, Shelly?

And adding a bit more background.

  • VBMS prod has been down for the past 3 weekends for maintenance and upgrades. Our Dispatch users were experiencing failures on Saturday.

  • BGS/Siteminder prod has gone down twice in May during business hours. During the outages, I was about 30 minutes to 1 hour behind on taking the site down.

On each occasion, we received a number of 500 errors. If we can make users immediately aware of the situation, I believe it would reduce user frustration.

So this is not the same as taking the site offline, as does the manual toggle today? This would instead be warning users that they may experience issues?

For the message, could we change "server issues" to something like "issues with the systems that Caseflow depends on" and change "we are working on it" to something like "we're aware of the issue and will remove this message when functionality is restored"? The first change is just to point fingers, albeit vaguely. The second is to be a bit more honest, as we may not be taking any specific actions, as in the examples Alan cites above.

> So this is not the same as taking the site offline, as does the manual toggle today? This would instead be warning users that they may experience issues?

Yes. This is just a warning that they _may_ experience issues. We are not taking down the site during this period.

As for wording, please feel free to change it. I have little opinion and expertise in this matter. Thanks!

To refocus this discussion, I've added some edits. Take a look and add more thoughts as they arise!

This will be discussed after standup on Tuesday, June 6th.

So if BGS goes down, then we're totally down.
If VACOLS goes down, then Certification/Dispatch/Reader/Hearings/API go down.
If VBMS goes down, then... well it depends.
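
The three statements above can be written down as a small lookup. This is only a sketch of what the comment says; the names `ALL_APPS`, `IMPACT`, and `describe_impact` are illustrative, and VBMS is deliberately left unspecified because the comment only says "it depends".

```python
from typing import FrozenSet, Mapping, Optional, Union

ALL_APPS = "all"  # sentinel: every Caseflow app is affected

IMPACT: Mapping[str, Optional[Union[str, FrozenSet[str]]]] = {
    "BGS": ALL_APPS,
    "VACOLS": frozenset(
        {"Certification", "Dispatch", "Reader", "Hearings", "API"}
    ),
    "VBMS": None,  # "it depends": not specified in this thread
}


def describe_impact(dependency: str) -> str:
    """Summarize which apps an outage of `dependency` affects."""
    if dependency not in IMPACT:
        return f"{dependency}: not a tracked dependency"
    impact = IMPACT[dependency]
    if impact is None:
        return f"{dependency}: impact varies"
    if impact == ALL_APPS:
        return f"{dependency}: all apps down"
    return f"{dependency}: " + ", ".join(sorted(impact)) + " down"
```

A table like this would only matter if the banner later became app-specific; the single site-wide message proposed in this issue does not need it.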

Thanks for the edits @amprokop, I'll post mocks here later today

General idea LGTM. I agree with Chris' points that we should be clear that we're not actively trying to fix our dependencies, but are aware of the issue. And I think this is much better than just failing randomly.

Mocks! with these goals in mind:

  • inform Caseflow app users that they may experience difficulties using the app
  • enable Caseflow app users to continue using the app

and assumptions:

  • Users are aware of how to contact the support team and send feedback. We don't need to emphasize ways to submit feedback in the warning message, especially because displaying the warning means we've detected issues.
  • I suggest avoiding modals, because they can be disruptive to a workflow and create accessibility issues. Additionally, in Caseflow, we use modals to collect information from users. Using a modal to communicate a warning to users is a new pattern that doesn't contribute to a smooth user experience. We have existing patterns for communicating warnings that users already associate with warnings.

Separate page option

Here we display the warning message on a separate page, and allow the user to continue onto the Caseflow app they were trying to use.

  • Pros:

    • More space to explain what's happening and what to expect

    • Existing pattern for communicating information with users

  • Cons:

    • Separate page might be disruptive

    • Once they navigate to the app, they won't see a banner anymore, so they might not know when the issues are resolved. Then again, we can always show the separate page and a warning message on the app.

separate page

Warning message underneath navigation

Here are mocks following visual design patterns from alerts in the style guide. They have copy of different lengths and detail, as well as different type treatment.

  • Pros:

    • Because it's displayed in app, we can remove banner when Caseflow is working or use a different banner to communicate that the issues are fixed

  • Cons:

    • May not be easily seen on a page

    • Text-heavy message clutters the page, and potentially doesn't provide enough space for a longer explanation

With a 2-line message and styling close to our current alert styling:

long error under nav bold and reg

With a 1-line message and type at normal weight:

medium error under nav regular

With a short 1-line message and an icon:

short error under nav icon

With a red background and a long message (I don't like this, it's very jarring):

long error under nav bold and reg

Warning message at top of page

Here are mocks with warning messages that appear at the very top of the page, above the navigation.

  • Pros:

    • Because it's displayed in app, we can remove banner when Caseflow is working or use a different banner to communicate that the issues are fixed

  • Cons:

    • May not be easily seen on a page

    • Text-heavy message clutters the page, and potentially doesn't provide enough space for a longer explanation

    • Putting a message above the nav might be too different from existing templates/patterns

Icon and 1-line message, bold weight type:

icon medium error page top

1-line message, normal weight type, without icon:

medium error page top

1-line message, bold type, without icon:

bold medium error page top

Copy used for the messages

Here are the variations of messages in the mocks:

  • Longest message (2 lines): We've detected some issues with the systems that Caseflow depends on and are working on fixing them. Caseflow pages may take longer to load or display more errors.
  • Long message (2 lines): Caseflow has detected issues with systems it depends on. [line break]
    This may cause delays or errors. Keep calm and carry on as we work to fix these issues.
  • Medium message (1 line): Systems Caseflow uses are having issues, which may cause delays or errors. Keep calm and carry on as we work to fix them!
  • Short message (1 line): Caseflow has detected issues with systems it uses. Don’t worry, we’re working to fix them!

Here's the long message in the separate page:

  • Caseflow has detected issues in systems it uses. The Caseflow team has been notified and is working to resolve them. You may notice delays or errors as you use Caseflow apps. Thanks for your patience as we work to get Caseflow running smoothly again.

@amprokop @askldjd @NickHeiner @cmgiven @kierachell, appreciate your feedback on separate page vs. in-page alert, and thoughts on copy.

@Chingujo @gnakm @lakohl @abbyraskin, appreciate your design feedback on:

  • separate page vs. in-page alert
  • if an in-page alert, type treatment and display of icon in the message

Looks great. Add to the design meeting agenda this week?

@cmgiven sure! I will be out, but I trust y'all to come to a conclusion. We are also discussing this tomorrow (Tues 6/6) after standup.

Agree that this would be a good topic for the Design mtg., as it will be a pattern that we use across apps. Personally I like the first version (possibly with a little more grey space between it and the Vet ID). It's the easiest to read and lets the user know they can continue working regardless. I would suggest a slight adjustment to the message, so we don't say we are working on fixing the problem if this is not true.

I'm not so fond of the versions with the warning above the Caseflow logo.

We decided to pursue this via unanimous consent.

We came to a consensus that taking the site down on an automated basis is not something we want to do. Outages are a spectrum. We can revisit this topic if we begin to believe we’re really good at automatic detection of outages. Alan described his experience with VBMS partial outages.

Shelly expressed a preference for the separate page option. Alex (and/or Lara/Lauren?) expressed a preference for the banner option. Nick expressed his trust in the design team.

Design team to take this up and settle on copy, placement, behavior, and separate page/banner.

cc @lakohl who seems likely to bring this topic up next ^

I noticed this banner in Slack yesterday that's related to the problem we are trying to solve here and ran it by Shelly - something to reference for the design discussion tomorrow, if helpful. Slack provides users with more information by linking to a status update page, a strategy that may or may not be useful/possible for our use case as we think about banner vs. separate page vs. both.

screen shot 2017-06-06 at 4 23 34 pm

Ticket for implementation can now be found here: #2294
Thanks everybody for your input!
