Cht-core: Add user telemetry for replication requests

Created on 14 Apr 2020  ·  14 Comments  ·  Source: medic/cht-core

What feature do you want to improve?
We don't have a good idea of how often phones are trying and failing to sync data.

Describe the improvement you'd like
Every time a sync attempt is made, we should record some user telemetry so we can monitor and improve syncing. It could also help us set more useful Service Level Objectives based on the impact of stale data on the project, rather than on raw server uptime, which doesn't affect offline users much.

Particularly it would be useful to know...

  • How long the request took to complete.
  • The status, eg: success, server offline, phone offline, other
  • How long it's been since the last successful sync so we can work out how stale the data is

Describe alternatives you've considered
We could calculate this server side based on logs, but that would miss data points when the server or phone are offline.

Chargable Monitoring 2 - Medium Improvement

All 14 comments

It's going to be difficult to get any useful information by just recording how long sync took. We can't add telemetry for a specific request (because PouchDB does everything in the background), only for the whole operation, and the duration can vary greatly depending on the number of docs that needed to be synced (either upwards or downwards), how many batches and general payload size.

Connection quality is also a huge factor.

Were you thinking of a more fine-grained recording?

We could add finer-grained metrics as well, but the main metric I was after was how long it takes actual users, with real connections, to complete a replication. It would be good to record how many changes were synced both ways to take that into account.

We can already see individual requests from each user by inspecting the logs if we need to dig down deeper. The advantage of the holistic metric is being able to record how long it takes to do a complete sync from the user's perspective.
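A minimal sketch of the holistic measurement described above: one timer around the whole sync, plus the number of docs moved in each direction. The names `SyncOutcome`, `SyncTelemetry`, and `startSyncTimer` are hypothetical and not taken from cht-core; the sketch assumes the calling code can tell when the complete sync has settled.

```typescript
// Hypothetical outcome reported by whatever code runs the replication.
interface SyncOutcome {
  succeeded: boolean;
  docsPushed: number; // docs sent to the server
  docsPulled: number; // docs received from the server
}

// Hypothetical telemetry entry for one complete sync attempt.
interface SyncTelemetry {
  durationMs: number;
  succeeded: boolean;
  docsSynced: number;
}

// Starts a timer and returns a function that closes it.
// `now` is injectable so the timer can be exercised without real delays.
function startSyncTimer(
  now: () => number = Date.now
): (outcome: SyncOutcome) => SyncTelemetry {
  const startedAt = now();
  return (outcome) => ({
    durationMs: now() - startedAt,
    succeeded: outcome.succeeded,
    docsSynced: outcome.docsPushed + outcome.docsPulled,
  });
}

// Usage: start before replication begins, finish once the whole sync settles.
const finishSync = startSyncTimer();
const syncEntry = finishSync({ succeeded: true, docsPushed: 2, docsPulled: 3 });
console.log(syncEntry);
```

Recording the doc count next to the duration lets a slow sync of many docs be told apart from a slow sync of a few docs on a poor connection.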

Thanks for flagging this ticket for me @garethbowen. In follow-up to our conversation with @yembrick today, I wanted to add that I'd be interested in a number of metrics related to sync times - average # of times a CHW syncs per month, amount of time it takes per sync, and the % of CHWs who do not sync at all in a given month.
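As a rough illustration of how the three sync metrics above could be derived, here is a sketch that assumes telemetry yields one record per sync attempt and that the full list of CHWs is known separately. `SyncEvent`, `MonthlySyncSummary`, and `summarizeMonth` are hypothetical names, not part of the CHT telemetry schema.

```typescript
// Hypothetical record: one sync attempt by one CHW within the month.
interface SyncEvent {
  user: string;
  durationMs: number;
}

interface MonthlySyncSummary {
  avgSyncsPerUser: number;    // average number of syncs per CHW
  avgDurationMs: number;      // average time taken per sync
  percentNeverSynced: number; // % of CHWs with no sync at all
}

function summarizeMonth(allUsers: string[], events: SyncEvent[]): MonthlySyncSummary {
  // Prototype-free object so user names never collide with built-in keys.
  const counts: { [user: string]: number } = Object.create(null);
  let totalDurationMs = 0;
  for (const event of events) {
    counts[event.user] = (counts[event.user] || 0) + 1;
    totalDurationMs += event.durationMs;
  }
  const neverSynced = allUsers.filter((user) => !counts[user]).length;
  return {
    // Guard against division by zero when there are no users or no events.
    avgSyncsPerUser: allUsers.length ? events.length / allUsers.length : 0,
    avgDurationMs: events.length ? totalDurationMs / events.length : 0,
    percentNeverSynced: allUsers.length ? (100 * neverSynced) / allUsers.length : 0,
  };
}
```

The list of all CHWs has to come from outside the telemetry data, because a CHW who never syncs produces no sync records to count.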

More broadly, I wanted to flag a few additional telemetry-derived metrics that we are interested in exploring for inclusion in our impact monitoring across projects:

  • number of CHT app logins (total, per CHW)
  • number of tasks (total, per CHW), _ideally broken down by completed, cancelled, or still active_
  • number of messages (total, per CHW)
  • Average “boot time” for the CHT application
  • Average time to search contacts across all CHWs per month
  • Average time to search reports across all CHWs per month
  • Average time to create and save a given report (i.e. home visit form, patient assessment form)

More to come on this soon but I wanted to give a sense of the metrics we are considering to explore what is possible in our product telemetry data.

@helizabetholsen - As we're warming up the 3.12 release, did you have any info on the "More to come on this soon" from your comment? We can't promise we can deliver all the metrics, but we're happy to listen to requests!

Please allocate any time spent on this to Project | 214 Research in Clicktime.

Hi @mrjones-plip, were we able to define exactly the metrics we need to add for the scope of this ticket? So, we can start designing the solution and know when work is achieved :) Thanks!

It would be worth asking @kennsippell if he has any specific metrics he'd like to capture here. From my perspective, the metrics linked above would be useful:

  • number of times the CHT app is opened (total, per CHW)
  • number of tasks (total, per CHW), ideally broken down by completed, cancelled, or still active [see Maria's analysis on tasks]
  • number of messages (total, per CHW)
  • Average “boot time” for the CHT application
  • Average time to search contacts across all CHWs per month
  • Average time to search reports across all CHWs per month
  • Average time to create and save a given report (i.e. home visit form, patient assessment form)
  • number of user feedback docs created, common errors -- @kennsippell

There are additional metrics that @n-orlowski would like for us to capture for user monitoring and feedback, which include:
Session - “From the time the app opens to the time the app closes”

  • Number of sessions
  • Time between sessions
  • Length of the session
  • Timing of the session
  • Page views for each session
  • What the user was doing when the session ended

In addition, there is more context here for the type of metrics we'd like to engage in monitoring via telemetry.

@helizabetholsen - like in 6741, we're gonna keep this ticket narrowly scoped to just gathering telemetry around replication times per the top body of this ticket.

A lot of the items you mentioned overlap with #6651, but that ticket isn't currently scheduled for 3.12 (and likely won't be added).

Oops! Linked to the wrong ticket in the prior comment - fixed. Sorry @helizabetholsen!

I've worked on adding these new telemetry entries, and looking for some feedback. Thanks in advance!

| key | value | recorded |
| --- | ------- | -------- |
| replication:user-initiated | 1 | when the user clicks "Sync now" |
| replication:<database>:<direction>:success | number representing how long it took to replicate, in ms | when replication is successful |
| replication:<database>:<direction>:failure | number representing how long it took to replicate, in ms | when replication fails |
| replication:<database>:<direction>:failure:reason:offline:client | 1 | replication fails because of connection error and the app detects the client is offline |
| replication:<database>:<direction>:failure:reason:offline:server | 1 | replication fails because of connection error and the app detects the client is online |
| replication:<database>:<direction>:failure:reason:error | 1 | replication fails because of other errors |
| replication:<database>:<direction>:docs | number of replicated docs | For one-directional, stores number of "read" docs, for sync, stores sum of read docs for every direction. |
| replication:medic:<direction>:ms-since-last-replicated-date | number in ms representing the difference between now and when the client last replicated successfully | only recorded for medic database, every time replication is attempted |
| replication:medic:<direction>:denied | 1 | when replication is denied |

Unless otherwise specified, the "database" and "direction" placeholders stand for any of these combinations:

| database | direction |
| --- | --- |
| medic | from or to |
| meta | sync |

Potentially suggest renaming user-action to user-initiated?
Unsure on meaning of denied - does that mean insufficient permission? If so, maybe call it that?

I'd be interested to understand how long initial replications are taking specifically. Maybe this doesn't matter too much and we can just look at "worst case" replications? But we know initial replications can be super heavy and a pain point for users (some report them taking hours or even days to complete). Counting initial replications would also be valuable, since we have a narrative that they are not common, and this is central to our technical strategy. I'm not sure how best to capture that dimension -- maybe just one more counter, replication:initial or replication:<database>:<direction>:<initial>:success?

denied

This is actually the name of the emitted event in PouchDB.
When replicating to, it generally means users are trying to create docs that they aren't allowed to create - most likely because of hierarchy permissions.
When replicating from (or generally), denied can happen when the doc fails the validate_doc_update function - which is core CouchDB functionality.
Both of these cases kinda fall under "insufficient permissions" I guess, I have no problem in renaming the metric.

Initial replication happens outside of Angular (theoretically). We start initial replication on bootstrap (pre-Angular) when we detect the user's database lacks two essential docs: the medic-client ddoc and the settings doc (both of which are required for the app to function). If these docs exist, we load the app.
If a user who needs initial replication starts it, downloads the ddoc and settings doc, then closes the app and opens it again, bootstrap won't restart initial replication and will boot the app instead. Replication will continue once the app boots, and it still counts as "initial replication", but it no longer happens in the bootstrap setting.

So even if I add a telemetry entry for initial replication outside Angular, there's no guarantee that it would actually record how long initial replication took.
We could, theoretically, detect if it's a fresh DB (no view indexes, for example) and "call" an app-side replication initial, but I'm unsure whether the extra effort is needed. How valuable do you think getting this information accurately is?
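A small sketch of the bootstrap check described above and of why it can't reliably time initial replication. The doc ids in `ESSENTIAL_DOC_IDS` are assumptions (the text only names "the medic-client ddoc and the settings doc"), and the function names are hypothetical.

```typescript
// Assumed ids for the two essential docs; not confirmed by this thread.
const ESSENTIAL_DOC_IDS: string[] = ['_design/medic-client', 'settings'];

// Returns the essential docs that are not yet in the local database.
function missingEssentialDocs(localDocIds: string[]): string[] {
  return ESSENTIAL_DOC_IDS.filter((id) => localDocIds.indexOf(id) === -1);
}

// Bootstrap runs initial replication only while an essential doc is missing.
function needsInitialReplication(localDocIds: string[]): boolean {
  return missingEssentialDocs(localDocIds).length > 0;
}

// Fresh database: bootstrap starts initial replication.
console.log(needsInitialReplication([]));

// Interrupted after the two essential docs arrived: bootstrap boots the app,
// even though most docs are still missing and replication resumes app-side.
console.log(needsInitialReplication(['_design/medic-client', 'settings']));
```

The second case is the gap: a timer started and stopped in bootstrap would only cover the first sitting, not the replication that finishes after the app boots.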

denied seems appropriate given the definition

Sounds like initial replication is a tough one. Even if we add it outside Angular, sounds like our numbers will be whacky. Maybe let's skip that detail...

This is ready for AT on 6354-telemetry-for-replication

The new telemetry entries are detailed in the above comment (https://github.com/medic/cht-core/issues/6354#issuecomment-839975519) and also in this documentation PR: https://github.com/medic/cht-docs/pull/502
