Cht-core: Improve Sentinel performance

Created on 24 Jan 2018  ·  20 comments  ·  Source: medic/cht-core

Zerothly, we should benchmark sentinel, see if we consider it fast enough, and then look at progressively more complex solutions to increase its throughput.

Even without benchmarking (though do this first, maybe it's Good Enough™ already!), Sentinel has some clear places for optimisation.

Firstly, it has an intentional 50ms delay in it between processing each document. According to the git log and #907, this was done because sentinel going at ludicrous speed caused CouchDB to freeze and become unstable. We should investigate how true this is, and why. Is this still true with CouchDB 2.0? Is this just because of lots of individual writes? Or is this a combination of writes and view regenerations in between those writes? If a delay really is required, is there a small value that we could use (it used to be 500ms!)? Is there another metric entirely other than "arbitrary sleep time" that we could use to maintain stability? Answering these questions will help determine some larger solutions.
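To make those questions measurable, here is a minimal sketch (hypothetical names, not the actual Sentinel code) of a processing loop where the inter-document delay is a parameter, so 0ms, 5ms, 50ms, and 500ms can be benchmarked side by side against CouchDB stability:

```javascript
// Hypothetical sketch: the inter-document delay is a tunable parameter
// rather than a hard-coded 50ms, so different values can be benchmarked.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processChanges(changes, handleChange, delayMs = 50) {
  const start = Date.now();
  for (const change of changes) {
    await handleChange(change);
    if (delayMs > 0) {
      await sleep(delayMs); // the delay under investigation
    }
  }
  return Date.now() - start; // total wall time, for benchmarking
}
```

Running the same doc set with delayMs of 0 and 50 isolates the cost of the sleep itself from the cost of the transitions.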

Secondly, there are a couple of places, deleteInfoDoc and deleteReadDocs, which could be converted into asynchronous queues[1] instead of their actions blocking further processing. These may not run that often, because they only run when a document is deleted, so the code change may not be worth the effort.
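As a rough illustration of that conversion, here is a minimal in-memory async queue (the real thing would need to be disk-backed, per the footnote): delete work is enqueued and drained in the background so it no longer blocks the main change-processing loop.

```javascript
// Minimal in-memory async queue sketch; names are illustrative.
function createQueue(worker) {
  const items = [];
  let draining = false;

  const drain = async () => {
    if (draining) return;
    draining = true;
    while (items.length) {
      await worker(items.shift());
    }
    draining = false;
  };

  return {
    push(item) {
      items.push(item);
      drain(); // fire and forget; a real queue needs error handling
    },
    size: () => items.length,
  };
}

// e.g. const deleteInfoDocQueue = createQueue(deleteInfoDoc);
```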

Thirdly, there are a couple of easy strategies for improving performance that don't require much re-architecting. Both strategies described below would require adding some kind of partition function (other name ideas: index, groupBy) to the transition API, which each transition would implement. Given a document, partition would return a value describing how to group / categorise the action the transition performs on that document. This can then be used to try to increase performance:

  1. Increase throughput by increasing the number of queues we have, from one to as many queues as we have unique values from partition.
  2. Reduce writes by putting writes into a queue, and not writing until we're going to process a document for which there is already the same partition value in the queue.

The current thought about the implementation of partition is that for most things it would be the patient id, or perhaps the chw's / contact's uuid. However, for some transitions it would be a constant value (eg the transition's name), to force serial processing inside that transition. The clearest example is multi-report alerts (after you see N things, send a message: this cannot be parallelised without re-writing the transition, and probably the transition processing flow).
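A minimal sketch of what that proposed partition API could look like (transition and field names are illustrative, not the real cht-core implementation):

```javascript
// Sketch of the proposed `partition` API: each transition returns a
// grouping key for a document.
const transitions = {
  accept_patient_reports: {
    // ordering matters per patient, so group by patient id
    partition: (doc) => doc.fields && doc.fields.patient_id,
  },
  multi_report_alerts: {
    // counts across all docs, so a constant key forces one serial queue
    partition: () => 'multi_report_alerts',
  },
};

// Approach 1: one queue per unique partition value, to raise throughput
// while preserving ordering within each group.
function enqueue(queues, transitionName, doc) {
  const key = transitions[transitionName].partition(doc);
  if (!queues.has(key)) {
    queues.set(key, []);
  }
  queues.get(key).push(doc);
  return key;
}
```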

So with the first approach we would write documents from two CHPs concurrently, while still maintaining ordering within each CHP. And with the second, we would delay writing until we were about to process a document for a CHP for which we already have a queued write. The first approach hits CouchDB harder, the second hits it less. Which approach is best (or whether to do both) depends on the modern reasoning behind the 50ms delay.
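One possible interpretation of the second approach, sketched below: buffer the latest write per partition key, and only flush a buffered write when another doc with the same key needs processing (plus a final flush). All names here are hypothetical.

```javascript
// Sketch of approach 2: delay writes, keyed by partition value, so the
// write for a CHP is only performed when a newer doc for the same CHP
// arrives (preserving per-CHP ordering), or at a final flush.
function createWriteBuffer(write) {
  const pending = new Map(); // partition key -> latest unwritten doc
  return {
    async add(key, doc) {
      if (pending.has(key)) {
        await write(pending.get(key)); // flush the earlier write first
      }
      pending.set(key, doc);
    },
    async flushAll() {
      for (const doc of pending.values()) {
        await write(doc);
      }
      pending.clear();
    },
  };
}
```

A real implementation would also flush after N seconds regardless, per the footnote about queues.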

Fourthly, and similarly to the previous point, we could look at the relationships between each transition and the others, and with itself. We could do this to categorise transitions and work out which ones can be parallelised without trouble, and which rely on others (can't accept patient reports before the registration is processed) or on themselves (multi-report-alerts, as described above). It's not clear whether this on its own is more useful than just mostly doing it as part of the third point, though.

Lastly, if none of that is enough, it's time to give up the idea that we can just re-write bits of Sentinel, and look at a whole new architecture. Yikes.

[1] For all queues we talk about in this doc, we'd expect that they are disk-backed, pick up queued items on re-boot, and where relevant flush after N seconds regardless. Once we add queues we'd also need to make sure the "seq we're up to" flow still works properly, as it would be disrupted by this kind of thing.

2 - Medium Sentinel Performance


All 20 comments

First things first:

  • [ ] create / copy a database that can be used to benchmark Sentinel.
  • [ ] benchmark Sentinel working through that db from the start, locally, on AWS, on CouchDB 1.x and CouchDB 2.x. Try to use the app at the same time
  • [ ] Come up with some snazzy numbers on throughput metrics.

Hi @browndav,

Please ensure this ticket has a Priority, Status and Type label.

(See triaging old issues for more detail)

This is seven days old and seems sensible enough to do; I've marked it triaged and assigned a medium priority.

@browndav why did you assign this ticket to me?

@SCdF I think we should consider adding this to 2.15.0 or 3.0.0 to make sure we are at least investigating early in the year and can plan for improvements.

Makes sense to me. Added to 2.15.

^ Once we have a perf benchmark we should test what happens if we move -info docs into another DB. This will probably:

  • Make Sentinel faster because there are fewer _changes entries in medic
  • Make view generation faster because there are fewer documents to walk over

Hat tip: @alxndrsn

We haven't gotten to this in 2.15.0. Moving to 3.0.

One thing to look into here is moving certain "hot path" views into their own ddoc. These are somewhat documented here: https://github.com/medic/medic-webapp/issues/2849

The general idea is that 2-3 views are called for every single change, which causes the entire ddoc's views to be regenerated for those couple of extra changes, which is super inefficient. If we put those "hot path" views in their own ddoc, this regeneration should be faster.
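As a sketch of the idea (the view name and map body below are placeholders, not the real medic views): only the views hit on every change live in a small dedicated design doc, so a new change triggers regeneration of just those views rather than everything in the main ddoc.

```javascript
// Illustrative "hot path" design doc; CouchDB stores map functions as
// strings inside the ddoc, and regenerates views per-ddoc.
const hotPathDdoc = {
  _id: '_design/medic-hot-path',
  views: {
    docs_by_hot_path: {
      map: 'function (doc) { emit(doc.type); }',
    },
  },
};
```

Uploading it would be a normal document PUT to the medic db; the trade-off is that the hot views are then indexed separately from the main ddoc's views.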

Deprioritised out of 3.0.0.

@vimemo Any update on this one? Should we move it to 3.5.0?

@garethbowen, yes. Let's please move it to 3.5. I have timed things with and without the 50ms processing delay and haven't made any code updates. I have run into various issues and inconsistent times, but in most cases they were down to my local setup.

Sorry about dropping this ticket. Unfortunately, I was not able to do much here other than update and locally test the default processing delay (https://github.com/medic/medic/pull/5306). It still needs to be tested as suggested by @garethbowen, with something similar to the great e2e Sentinel tests introduced by @dianabarsan in the previous release.

@dianabarsan I know you did some performance testing as part of this - can you add a comment with the metrics you got so we can include them in the release notes?

A small Sentinel performance test:

Settings:

"transitions": {
    "default_responses": true,
    "update_clinics": true,
    "accept_patient_reports": true,
    "conditional_alerts": false,
    "update_sent_by": false,
    "registration": true,
    "update_notifications": true,
    "update_scheduled_reports": false,
    "multi_report_alerts": false,
    "update_sent_forms": false,
    "generate_patient_id_on_people": true,
    "death_reporting": true
}

Sample doc:

{
    "_id": <unique UUID>,
    "type": "data_record",
    "from": <known phone number>,
    "fields": {},
    "reported_date": 1556888171937,
    "sms_message": {
        "type": "sms_message",    
        "message": "This is a random sms",
        "from": <known phone number>
    }
}

Two transitions run over the docs: update_clinics and default_responses. Docs are inserted by posting to _bulk_docs.
Tests are run over the same DB, regardless of the app version.
Counter starts before sentinel processes the first change and ends after processing the last change.

| Version | Docs # | Iteration 1 | Iteration 2 | Iteration 3 | AVG | Delta |
|--------------------|----------|-------------|-------------|-------------|----------|-------|
| 2.18 | 50 | 39.4 s | 41.3 s | 41.9 s | 40.9 s | |
| 3.5(master) | 50 | 17.7 s | 17.2 s | 17.4 s | 17.4 s | 57% |
| 3.5(updated) | 50 | 7.6 s | 8.2 s | 8.7 s | 8.2 s | 52% |

| Version | Docs # | Iteration 1 | Iteration 2 | Iteration 3 | AVG | Delta |
|--------------------|----------|-------------|-------------|-------------|----------|-------|
| 2.18 | 200 | 149.6 s | 142.3 s | 142.2 s | 144.7 s | |
| 3.5(master) | 200 | 65.8 s | 68.3 s | 65.6 s | 66.6 s | 54% |
| 3.5(updated) | 200 | 32.1 s | 35.3 s | 32.7 s | 33.4 s | 49% |


A small Api performance test:

Same settings as above.

Sample message:

{
    id: <unique UUID>,
    from: <known phone number>,
    content: 'This is a random sms'
}

Docs are inserted by posting to /api/sms. Two transitions run over each doc: update_clinics and default_responses.
Counter starts before processing the docs and ends after all docs are saved.

| Version | Docs # | Iteration 1 | Iteration 2 | Iteration 3 | AVG | Delta |
|--------------------|----------|-------------|-------------|-------------|-----|-----------|
| 3.5(master) | 50 | 4691 ms | 4585 ms | 4235 ms | 4503 ms | |
| 3.5(updated) | 50 | 3464 ms | 3374 ms | 3402 ms | 3413 ms | 24% |


The updated version of 3.5 (https://github.com/medic/medic/pull/5633) includes a few performance fixes:

  • Removes the 50ms delay between processing changes in Sentinel
  • The Sentinel feed skips tombstones
  • Info docs are saved after all transitions run, instead of after every transition

Clearly the biggest performance improvement came in 3.0, where we updated Sentinel to skip processing info docs and to save new info docs into a separate db.

We could formalize a similar test using sample data (maybe the same sample data that our QA team uses for the performance test suite). I'm not sure which metric would be most relevant for the release notes.

Ready for AT on 4128-sentinel-transitions-performance.

Post AT, please do not merge. It requires publishing a shared library and bumping the dependencies on the branch.

@dianabarsan how did you measure the duration (ms)?

I updated API and Sentinel code to measure the time it took to process all docs.
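The kind of measurement described could be as simple as a timing wrapper around the batch processing (a sketch, not the actual code change):

```javascript
// Wrap an async batch operation in timestamps and log the elapsed
// duration, matching the "duration (ms)" lines reported below.
async function timed(label, fn) {
  const start = process.hrtime.bigint();
  const result = await fn();
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${label} duration (ms): ${Math.round(ms)}`);
  return { result, ms };
}
```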

@dianabarsan
AT complete!

Sentinel (50 docs)

before change ->
duration (ms): 18881 (first)
duration (ms): 19211 (second)
duration (ms): 19133 (third)
avg (ms): 19075

after change ->
duration (ms): 11812 (first)
duration (ms): 11905 (second)
duration (ms): 11871 (third)
avg (ms): 11862
delta: -38%

API

before change ->
duration (ms): 9440 (first)
duration (ms): 9311 (second)
duration (ms): 8348 (third)
avg (ms): 9033

after change ->
duration (ms): 7374 (first)
duration (ms): 7522 (second)
duration (ms): 6132 (third)
avg (ms): 7009
delta: -22%

LGTM.

Not merged.

Reopening so we don't forget to merge the code! @dianabarsan Can you merge and close when ready?
