Cht-core: Improve Sentinel performance

Created on 24 Jan 2018  ·  20 comments  ·  Source: medic/cht-core

Zerothly, we should benchmark sentinel, see if we consider it fast enough, and then look at progressively more complex solutions to increase its throughput.

Even without benchmarking (though do this first, maybe it's Good Enough™ already!), Sentinel has some clear places for optimisation.

Firstly, it has an intentional 50ms delay in it between processing each document. According to the git log and #907, this was done because sentinel going at ludicrous speed caused CouchDB to freeze and become unstable. We should investigate how true this is, and why. Is this still true with CouchDB 2.0? Is this just because of lots of individual writes? Or is this a combination of writes and view regenerations in between those writes? If a delay really is required, is there a small value that we could use (it used to be 500ms!)? Is there another metric entirely other than "arbitrary sleep time" that we could use to maintain stability? Answering these questions will help determine some larger solutions.
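To make those questions measurable, here is a minimal sketch (hypothetical names, not the actual Sentinel code) of a processing loop where the inter-document delay is a parameter, so 0ms, 5ms, 50ms, and 500ms can be benchmarked side by side against CouchDB stability:

```javascript
// Hypothetical sketch: the inter-document delay is a tunable parameter
// rather than a hard-coded 50ms, so different values can be benchmarked.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processChanges(changes, handleChange, delayMs = 50) {
  const start = Date.now();
  for (const change of changes) {
    await handleChange(change);
    if (delayMs > 0) {
      await sleep(delayMs); // the delay under investigation
    }
  }
  return Date.now() - start; // total wall time, for benchmarking
}
```

Running the same doc set with delayMs of 0 and 50 isolates the cost of the sleep itself from the cost of the transitions.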

Secondly, there are a couple of places, deleteInfoDoc and deleteReadDocs, which could be converted into asynchronous queues[1] instead of their actions blocking further processing. These may not run that often, because they only run when a document is deleted, so the code change may not be worth the effort.
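As a rough illustration of that conversion, here is a minimal in-memory async queue (the real thing would need to be disk-backed, per the footnote): delete work is enqueued and drained in the background so it no longer blocks the main change-processing loop.

```javascript
// Minimal in-memory async queue sketch; names are illustrative.
function createQueue(worker) {
  const items = [];
  let draining = false;

  const drain = async () => {
    if (draining) return;
    draining = true;
    while (items.length) {
      await worker(items.shift());
    }
    draining = false;
  };

  return {
    push(item) {
      items.push(item);
      drain(); // fire and forget; a real queue needs error handling
    },
    size: () => items.length,
  };
}

// e.g. const deleteInfoDocQueue = createQueue(deleteInfoDoc);
```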

Thirdly, there are a couple of easy strategies for improving performance that don't require much re-architecting. Both strategies described below would require adding some kind of partition function (other name ideas: index, groupBy) to the transition API, which each transition would implement. Given a document, partition would return a value describing how to group / categorise the action the transition performs on that document. This can then be used to try to increase performance:

  1. Increase throughput by increasing the number of queues we have, from one to as many queues as we have unique values from partition.
  2. Reduce writes by putting writes into a queue, and not writing until we're going to process a document for which there is already the same partition value in the queue.

The current thought about the implementation of partition is that for most things it would be the patient id, or perhaps the chw's / contact's uuid. However, for some transitions it would be a constant value (eg the transition's name), to force serial processing inside that transition. The clearest example is multi-report alerts (after you see N things, send a message: this cannot be parallelised without re-writing the transition, and probably the transition processing flow).
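A minimal sketch of what that proposed partition API could look like (transition and field names are illustrative, not the real cht-core implementation):

```javascript
// Sketch of the proposed `partition` API: each transition returns a
// grouping key for a document.
const transitions = {
  accept_patient_reports: {
    // ordering matters per patient, so group by patient id
    partition: (doc) => doc.fields && doc.fields.patient_id,
  },
  multi_report_alerts: {
    // counts across all docs, so a constant key forces one serial queue
    partition: () => 'multi_report_alerts',
  },
};

// Approach 1: one queue per unique partition value, to raise throughput
// while preserving ordering within each group.
function enqueue(queues, transitionName, doc) {
  const key = transitions[transitionName].partition(doc);
  if (!queues.has(key)) {
    queues.set(key, []);
  }
  queues.get(key).push(doc);
  return key;
}
```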

So with the first approach we would write documents from two CHPs concurrently, while still maintaining ordering within each CHP. And with the second, we would delay writing until we were about to process a document for a CHP for which we already have a queued write. The first approach hits CouchDB harder, the second hits it less. Which approach is best (or whether to do both) depends on the modern reasoning behind the 50ms delay.
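One possible interpretation of the second approach, sketched below: buffer the latest write per partition key, and only flush a buffered write when another doc with the same key needs processing (plus a final flush). All names here are hypothetical.

```javascript
// Sketch of approach 2: delay writes, keyed by partition value, so the
// write for a CHP is only performed when a newer doc for the same CHP
// arrives (preserving per-CHP ordering), or at a final flush.
function createWriteBuffer(write) {
  const pending = new Map(); // partition key -> latest unwritten doc
  return {
    async add(key, doc) {
      if (pending.has(key)) {
        await write(pending.get(key)); // flush the earlier write first
      }
      pending.set(key, doc);
    },
    async flushAll() {
      for (const doc of pending.values()) {
        await write(doc);
      }
      pending.clear();
    },
  };
}
```

A real implementation would also flush after N seconds regardless, per the footnote about queues.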

Fourthly, and similarly to the previous point, we could look at the relationships between each transition and the others, and with itself. We could do this to categorise transitions and work out which ones can be parallelised without trouble, and which rely on others (can't accept patient reports before the registration is processed) or on themselves (multi-report-alerts, as described above). It's not clear whether this on its own is more useful than just mostly doing it as part of the third point, though.

Lastly, if none of that is enough, it's time to give up the idea that we can just re-write bits of Sentinel, and look at a whole new architecture. Yikes.

[1] For all queues we talk about in this doc, we'd expect that they are disk-backed, pick up queued items on re-boot, and where relevant flush after N seconds regardless. Once we add queues we'd also need to make sure the "seq we're up to" flow still works properly, as it would be disrupted by this kind of thing.

2 - Medium Sentinel Performance


All 20 comments

First things first:

  • [ ] create / copy a database that can be used to benchmark Sentinel.
  • [ ] benchmark Sentinel working through that db from the start, locally, on AWS, on CouchDB 1.x and CouchDB 2.x. Try to use the app at the same time
  • [ ] Come up with some snazzy numbers on throughput metrics.

Hi @browndav,

Please ensure this ticket has a Priority, Status and Type label.

(See triaging old issues for more detail)

This is seven days old and seems sensible enough to do; I've marked it triaged and assigned a medium priority.

@browndav why did you assign this ticket to me?

@SCdF I think we should consider adding this to 2.15.0 or 3.0.0 to make sure we are at least investigating early in the year and can plan for improvements.

Makes sense to me. Added to 2.15.

^ Once we have a perf benchmark we should test what happens if we move -info docs into another DB. This will probably:

  • Make Sentinel faster because there are fewer _changes entries in medic
  • Make view generation faster because there are fewer documents to walk over

Hat tip: @alxndrsn

We haven't gotten to this in 2.15.0. Moving to 3.0.

One thing to look into here is moving certain "hot path" views into their own ddoc. These are somewhat documented here: https://github.com/medic/medic-webapp/issues/2849

The general idea is that 2-3 views are called for every single change, which causes the entire ddoc's views to be regenerated for those couple of extra changes, which is super inefficient. If we put those "hot path" views in their own ddoc, this regeneration should be faster.
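As a sketch of the idea (the view name and map body below are placeholders, not the real medic views): only the views hit on every change live in a small dedicated design doc, so a new change triggers regeneration of just those views rather than everything in the main ddoc.

```javascript
// Illustrative "hot path" design doc; CouchDB stores map functions as
// strings inside the ddoc, and regenerates views per-ddoc.
const hotPathDdoc = {
  _id: '_design/medic-hot-path',
  views: {
    docs_by_hot_path: {
      map: 'function (doc) { emit(doc.type); }',
    },
  },
};
```

Uploading it would be a normal document PUT to the medic db; the trade-off is that the hot views are then indexed separately from the main ddoc's views.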

Deprioritised out of 3.0.0.

@vimemo Any update on this one? Should we move it to 3.5.0?

@garethbowen, yes. Let's please move it to 3.5. I have timed things with and without the 50ms processing delay and haven't made any code updates. I have run into various issues and inconsistent times, but in most cases they were down to my local setup.

Sorry about dropping this ticket. Unfortunately, I was not able to do much here other than update and locally test the default processing delay (https://github.com/medic/medic/pull/5306). It still needs to be tested as suggested by @garethbowen, with something similar to the great e2e Sentinel tests introduced by @dianabarsan in the previous release.

@dianabarsan I know you did some performance testing as part of this - can you add a comment with the metrics you got so we can include them in the release notes?

A small Sentinel performance test:

Settings:

"transitions": {
    "default_responses": true,
    "update_clinics": true,
    "accept_patient_reports": true,
    "conditional_alerts": false,
    "update_sent_by": false,
    "registration": true,
    "update_notifications": true,
    "update_scheduled_reports": false,
    "multi_report_alerts": false,
    "update_sent_forms": false,
    "generate_patient_id_on_people": true,
    "death_reporting": true
}

Sample doc:

{
    "_id": <unique UUID>,
    "type": "data_record",
    "from": <known phone number>,
    "fields": {},
    "reported_date": 1556888171937,
    "sms_message": {
        "type": "sms_message",    
        "message": "This is a random sms",
        "from": <known phone number>
    }
}

Two transitions run over the docs: update_clinics and default_responses. Docs are inserted by posting to _bulk_docs.
Tests are run over the same DB, regardless of the app version.
Counter starts before sentinel processes the first change and ends after processing the last change.

| Version | Docs # | Iteration 1 | Iteration 2 | Iteration 3 | AVG | Delta |
|--------------------|----------|-------------|-------------|-------------|----------|-------|
| 2.18 | 50 | 39.4 s | 41.3 s | 41.9 s | 40.9 s | |
| 3.5(master) | 50 | 17.7 s | 17.2 s | 17.4 s | 17.4 s | 57% |
| 3.5(updated) | 50 | 7.6 s | 8.2 s | 8.7 s | 8.2 s | 52% |

| Version | Docs # | Iteration 1 | Iteration 2 | Iteration 3 | AVG | Delta |
|--------------------|----------|-------------|-------------|-------------|----------|-------|
| 2.18 | 200 | 149.6 s | 142.3 s | 142.2 s | 144.7 s | |
| 3.5(master) | 200 | 65.8 s | 68.3 s | 65.6 s | 66.6 s | 54% |
| 3.5(updated) | 200 | 32.1 s | 35.3 s | 32.7 s | 33.4 s | 49% |


A small Api performance test:

Same settings as above.

Sample message:

{
    id: <unique UUID>,
    from: <known phone number>,
    content: 'This is a random sms'
}

Docs are inserted by posting to /api/sms. Two transitions run over each doc: update_clinics and default_responses.
Counter starts before processing the docs and ends after all docs are saved.

| Version | Docs # | Iteration 1 | Iteration 2 | Iteration 3 | AVG | Delta |
|--------------------|----------|-------------|-------------|-------------|-----|-----------|
| 3.5(master) | 50 | 4691 ms | 4585 ms | 4235 ms | 4503 ms | |
| 3.5(updated) | 50 | 3464 ms | 3374 ms | 3402 ms | 3413 ms | 24% |


The updated version of 3.5 (https://github.com/medic/medic/pull/5633) includes a few performance fixes:

  • Removes the 50ms delay between processing changes in Sentinel
  • The Sentinel feed skips tombstones
  • Info docs are saved after all transitions run, instead of after every transition

Clearly the biggest performance improvement came in 3.0, where we updated Sentinel to skip processing info docs and to save new info docs into a separate db.

We could formalize a similar test using sample data (maybe the same sample data that our QA team uses for the performance test suite). I'm not sure which metric would be most relevant for the release notes.

Ready for AT on 4128-sentinel-transitions-performance.

Post AT, please do not merge. It requires publishing a shared library and bumping the dependencies on the branch.

@dianabarsan how did you measure the duration (ms)?

I updated API and Sentinel code to measure the time it took to process all docs.
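The kind of measurement described could be as simple as a timing wrapper around the batch processing (a sketch, not the actual code change):

```javascript
// Wrap an async batch operation in timestamps and log the elapsed
// duration, matching the "duration (ms)" lines reported below.
async function timed(label, fn) {
  const start = process.hrtime.bigint();
  const result = await fn();
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${label} duration (ms): ${Math.round(ms)}`);
  return { result, ms };
}
```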

@dianabarsan
AT complete!

Sentinel (50 docs)

before change ->
duration (ms): 18881 (first)
duration (ms): 19211 (second)
duration (ms): 19133 (third)
avg (ms): 19075

after change ->
duration (ms): 11812 (first)
duration (ms): 11905 (second)
duration (ms): 11871 (third)
avg (ms): 11862
delta: -38%

API

before change ->
duration (ms): 9440 (first)
duration (ms): 9311 (second)
duration (ms): 8348 (third)
avg (ms): 9033

after change ->
duration (ms): 7374 (first)
duration (ms): 7522 (second)
duration (ms): 6132 (third)
avg (ms): 7009
delta: -22%

LGTM.

Not merged.

Reopening so we don't forget to merge the code! @dianabarsan Can you merge and close when ready?
