The Admin UI is currently organized primarily around time series graphs. In addition to time series, however, we record major "events" whenever a significant cluster state change occurs. These events currently consist of node membership changes and DDL changes.
Due to the nature of these changes, they may correlate strongly with visible changes in our time series; for example, a node going down temporarily may result in an increase in traffic to other nodes, or a significant amount of rebalancing traffic. Schema changes may result in increased error counts if they break an application, or in the case of large tables may result in a lot of internal traffic to alter an index.
In order to help clarify this to users, we are going to attempt to overlay these events visually on the graphs. Some of these events (such as a new node joining) occur at a single instant; others, such as a schema change or a node being down, span a period of time. In either case, they can be overlaid as vertical markers on any graph with a time axis.
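As a rough illustration of the instant-versus-period distinction, here is a minimal sketch of turning events into overlay markers for a graph's visible time window. The `ClusterEvent` and `Marker` shapes and the `toMarkers` helper are hypothetical names made up for this sketch; they are not the Admin UI's actual types.

```typescript
// Hypothetical event shape; the real Admin UI payloads may differ.
interface ClusterEvent {
  type: string;
  startMs: number; // when the event began
  endMs?: number;  // absent for instantaneous events (e.g. a node joining)
}

// A vertical marker on the graph, in pixel coordinates along the time axis.
interface Marker {
  type: string;
  kind: "instant" | "range";
  x0: number; // left edge
  x1: number; // right edge (equal to x0 for instants)
}

// Map events onto the visible window, dropping events that fall entirely
// outside it and clipping ranges that extend past either edge.
function toMarkers(
  events: ClusterEvent[],
  windowStartMs: number,
  windowEndMs: number,
  widthPx: number,
): Marker[] {
  const span = windowEndMs - windowStartMs;
  if (span <= 0) {
    return [];
  }
  const toX = (t: number): number => ((t - windowStartMs) / span) * widthPx;
  const markers: Marker[] = [];
  for (const e of events) {
    const end = e.endMs === undefined ? e.startMs : e.endMs;
    if (end < windowStartMs || e.startMs > windowEndMs) {
      continue; // not visible in this window
    }
    markers.push({
      type: e.type,
      kind: e.endMs === undefined ? "instant" : "range",
      x0: toX(Math.max(e.startMs, windowStartMs)),
      x1: toX(Math.min(end, windowEndMs)),
    });
  }
  return markers;
}
```

An instant marker would be drawn as a single vertical line at `x0`, while a range marker would be drawn as a shaded band from `x0` to `x1`.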
Three parts to completing this enhancement:
There is design work in progress to find an acceptable appearance for the overlays; that may include a decision as to whether events are always visible overlaid, or whether they can be toggled through some mechanism. However, much of the technical work can be
I'm in favor of having these events _always_ visible in the selected timeframe, rather than having a toggle.
Here's my preferred visual to implement the time range on graphs:

Other ideas are:


After some work towards this implementation, we have noticed a few issues with very large numbers of events:
A proposed alternative solution is to develop a single "events" timeline that will display above the graphs. This would replace "events over graphs", as well as the "events list" on the right-hand summary bar.
This control, once designed, would be better insulated from issues arising from a large number of events.
The current designs are still seen as adequate for our current goal of supporting 10-node clusters well. Issues with huge numbers of events do not become intractable until clusters and applications reach a somewhat larger scale. That scale is an inevitable target for our UI, so we are recording this alternative solution here; however, for the 1.0 timescale we are continuing with the current events-on-graph design.
@mrtracy
This is what the alternative solution might look like:

Perhaps we can allow some sort of toggles on top of the current implementation to avoid some obvious problems, or make them optional at least. Here's an idea of what that might look like:

I like the second (more compact) one better, especially if I can hover over one of those dots and have it tell me about the event or highlight the event in the event list.
That's the plan, @petermattis.

@vivekmenezes suggested taking out the midpoint (diamond icon) of the events. This work has been reprioritized and will be done after @mrtracy tackles some bugs.
Moving to the 1.1 milestone.
For our 1.2 scope, here are the acceptance criteria that we need to meet. @kuanluo It would be great if we could get a first pass design within the next week or so. Let me know if you need more time.
Acceptance Criteria
Supported User Flow
Step 1: Identify that an event occurred at a time period that may or may not have resulted in a change in performance metrics.
Step 2: Dig in deeper to get more actionable insights and context on the event.
Step 3: Get pointed towards more information that users can use to take action and resolve the problem.
Feature Specification
Design States Needed
Edge Cases
@kuanluo FYI, I just added in two states for the jobs metrics - paused and resumed. We probably want to surface what this looks like there.

To address 1, we discussed a heatmap approach where all events are presented at the same height. The darker the color, the more events overlap in the same period of time.
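To make the heatmap idea concrete, here is a minimal sketch of how event density could be computed: split the visible window into equal-width buckets, count the events in each, and scale the counts to an opacity between 0 and 1. The function names and the bucketing scheme are assumptions made for this sketch, and it treats every event as a single timestamp for simplicity; events that span a period would need to be counted in every bucket they cover.

```typescript
// Count how many event timestamps fall into each equal-width time bucket
// of the visible window. Timestamps outside the window are ignored.
function bucketCounts(
  eventTimesMs: number[],
  windowStartMs: number,
  windowEndMs: number,
  bucketCount: number,
): number[] {
  if (bucketCount <= 0) {
    return [];
  }
  const counts: number[] = new Array<number>(bucketCount).fill(0);
  const span = windowEndMs - windowStartMs;
  if (span <= 0) {
    return counts;
  }
  for (const t of eventTimesMs) {
    if (t < windowStartMs || t > windowEndMs) {
      continue;
    }
    // An event exactly at the window's end belongs to the last bucket.
    const idx = Math.min(
      bucketCount - 1,
      Math.floor(((t - windowStartMs) / span) * bucketCount),
    );
    counts[idx]++;
  }
  return counts;
}

// Scale counts so the busiest bucket is fully opaque (1) and empty
// buckets are fully transparent (0).
function bucketOpacity(counts: number[]): number[] {
  const max = Math.max(0, ...counts);
  return counts.map((c) => (max === 0 ? 0 : c / max));
}
```

Scaling relative to the busiest bucket keeps the darkest color meaningful at any cluster size, at the cost of the same shade representing different absolute counts in different time windows.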

To address 2 & 3, the proposal is to make the events in the log clickable. On hover, it will highlight the corresponding event on the graph. Once clicked, it will expand to show the full info.

cc @dianasaur323 @cuongdo @josueeee @Amruta-Ranade
Thanks, @kuanluo. All your suggestions sound good. Looking forward to the next round of designs!
Having events heatmap at the bottom of the time series graph:

Hovering over the events will highlight them in the events log. The tooltip will display graph data as well as the events:

Hovering on the "click to expand" icon:

Click to expand the events:

This looks nice!
@kuanluo this looks great, and we can finalize text in your sketch file. I think a full event log is still useful, so we still need that page. I believe we also wanted an expanded view in the event log sidebar, so that users could expand that and see more info? It looks like it's included in an alternative view in a comment above. Is that going to be included in the sketch file? I have noticed that sometimes we miss some edge cases in the user flow when we build out Admin UI pieces (thinking of the jobs table last time), so let's make sure to capture those this time around.
Yes, @dianasaur323 thanks for bringing that up. The "click to view full event details + actionable links" will be included in the sketch file.
I'll just go ahead and register my dissenting opinion here: I don't think the gray background adds any value, and I think it looks cleaner without it. Otherwise I think this looks great.
List of events (Need UI text for these):
case eventTypes.CREATE_DATABASE:
case eventTypes.DROP_DATABASE:
case eventTypes.CREATE_TABLE:
case eventTypes.DROP_TABLE:
case eventTypes.ALTER_TABLE:
case eventTypes.CREATE_INDEX:
case eventTypes.DROP_INDEX:
case eventTypes.CREATE_VIEW:
case eventTypes.DROP_VIEW:
case eventTypes.REVERSE_SCHEMA_CHANGE:
case eventTypes.FINISH_SCHEMA_CHANGE:
case eventTypes.NODE_JOIN:
case eventTypes.NODE_DECOMMISSIONED:
case eventTypes.NODE_RECOMMISSIONED:
case eventTypes.NODE_RESTART:
case eventTypes.SET_CLUSTER_SETTING:
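One way the UI text could be wired up once it is written is a lookup table with a readable fallback, so that an event type without final copy still renders something sensible. This is only a sketch: the local `eventTypes` object is a stand-in for the real constants, its string values are assumptions, and the label wording is placeholder text rather than proposed copy.

```typescript
// Local stand-in for the eventTypes constants listed above. The string
// values here are assumptions made for this sketch.
const eventTypes = {
  CREATE_DATABASE: "create_database",
  NODE_JOIN: "node_join",
  SET_CLUSTER_SETTING: "set_cluster_setting",
} as const;

// Placeholder wording only; the final UI text still needs to be written.
const eventLabels: Record<string, string> = {
  [eventTypes.CREATE_DATABASE]: "Database created",
  [eventTypes.NODE_JOIN]: "Node joined",
  [eventTypes.SET_CLUSTER_SETTING]: "Cluster setting changed",
};

// Return the UI text for an event type, falling back to a humanized form
// of the raw type string (e.g. "drop_index" becomes "Drop index").
function eventLabel(eventType: string): string {
  const label = eventLabels[eventType];
  if (label !== undefined) {
    return label;
  }
  const words = eventType.replace(/_/g, " ").trim();
  if (words.length === 0) {
    return "Unknown event";
  }
  return words.charAt(0).toUpperCase() + words.slice(1);
}
```

The fallback means new event types added on the server side would not show up blank in the UI before their text is agreed on.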
I agree, I don't like the grey background. Thanks for adding that, Amruta. Helpful.
Ok @kuanluo, here are the flows that we can cover with users:
Let's go with the three user flows described in the acceptance criteria comment above. Namely, for each of the below scenarios, we should ask them to identify an issue, dig in deeper to understand the issue, and identify where they would go next after identifying the issue.
Scenario 1: A backup causes a drop in SQL queries.
Scenario 2: A node failure causes SQL queries for that node to go to zero before it recovers. Overlay this with a bunch of schema changes and backup and restore events to see if the user can parse things out.
Scenario 3: A node failure occurred yesterday. See how the user expects time selection changes to affect the event log.
That should be good enough for now. Open to comments from anyone else who might have an opinion on this.
Moving this into our next release for now :'(