Cht-core: sentinel keeps processing the same backlog

Created on 20 Jul 2016 · 8 comments · Source: medic/cht-core

When I restart sentinel locally I get this in the log:

2016-07-20T21:15:46.371Z - info: backlog: processing 247 items, 500ms per item
2016-07-20T21:17:20.000Z - info: backlog: processing complete, 0 failures
2016-07-20T21:20:00.025Z - info: checking schedule again in 5 minutes

Next restart - same thing. It processes the same 247 items over and over again.

My last_valid_seq view has no rows because all transitions are disabled, so it should default to "now".

This isn't particularly expensive but it is a waste of cycles.

Bug

All 8 comments

This also affects production instances - from production logs:

2016-07-08T22:48:29.940Z - info: 2016-07-08T22:48:29.940Z - info: backlog: processing 363838 items, 500ms per item
...
2016-07-09T01:12:48.325Z - info: 2016-07-09T01:12:48.324Z - info: backlog: processing 363869 items, 500ms per item
...
2016-07-10T00:00:20.815Z - info: 2016-07-10T00:00:20.815Z - info: backlog: processing 365168 items, 500ms per item
...
2016-07-10T18:54:10.155Z - info: 2016-07-10T18:54:10.155Z - info: backlog: processing 368087 items, 500ms per item

Although I couldn't find it in the docs, from my testing it appears that the since parameter is ignored when descending=true, so the current backlog processor is always going to process all docs from now back to the start of time.
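
One quick way to see this is to hit the _changes endpoint directly with and without since while descending=true. This is only a sketch of the check (Node 18+ built-in fetch; COUCH_URL and the medic db name are assumptions), not sentinel code:

// Sketch: compare a descending changes feed with and without `since`.
// Per the testing described above, both start from the latest seq,
// i.e. `since` is effectively ignored when descending=true.
const COUCH_URL = process.env.COUCH_URL || 'http://localhost:5984';

const changes = async (params) => {
  const qs = new URLSearchParams(params).toString();
  const res = await fetch(`${COUCH_URL}/medic/_changes?${qs}`);
  return res.json();
};

(async () => {
  const withSince = await changes({ descending: true, since: 1000, limit: 3 });
  const withoutSince = await changes({ descending: true, limit: 3 });
  console.log(withSince.results.map(r => r.seq));
  console.log(withoutSince.results.map(r => r.seq));
})();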

Apparently @SCdF hit this earlier today, so I'll try and catch him online tonight, but here are my findings so far...

  1. The code currently attempts to process the backlog from last_valid_seq till 0, while processing any new changes that come in from last_valid_seq.
  2. As the since param doesn't work with descending, and there's no parameter for the end of the sequence, the backlog processing is actually working from now till 0.
  3. With large scale deployments this means the backlog processing is potentially processing hundreds of thousands of documents every time it restarts with a 500ms delay for throttling.
  4. Sentinel is restarted every time a configuration changes which means it may never clear the backlog and instead keep checking the same documents over and over again.
  5. The last_valid_seq is fetched from the last_valid_seq view which is only updated when the transition is enabled and actually makes a modification to the doc. This means when all transitions are disabled (eg: a 2.x instance) the last_valid_seq is never updated.

Some things we can do to improve this...

  1. Update the last_valid_seq whether or not the transition applies to the change. This will fix the case where disabled transitions cause the same docs to be checked every start. This will cause excessive doc updates and a high chance of conflicts, so we should consider storing the last_valid_seq in a separate doc, either as one massive doc or many small ones. Small ones probably scale better...
  2. Replace the two existing feeds with one which works in ascending order from the last_valid_seq || 0 (see the sketch after this list).
  3. Reduce the 500ms pause between docs as we need to keep up with a higher throughput now. Consider batching and bulk inserts.
  4. Re-run transitions on docs which have failed transitions before. Perhaps on start look up in a view to find all docs that haven't been successfully transitioned.
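
Roughly what point 2 could look like — only a sketch under assumed names (COUCH_URL, the db name, processDoc), not an actual implementation:

// Rough sketch of a single ascending feed starting from last_valid_seq || 0.
// COUCH_URL, the db name and processDoc() are placeholders.
const COUCH_URL = process.env.COUCH_URL || 'http://localhost:5984';
const DB = 'medic';

const processDoc = async (doc) => {
  // run the enabled transitions against the doc here
};

const processChanges = async (since) => {
  for (;;) {
    const qs = new URLSearchParams({ since, limit: 100, include_docs: true });
    const res = await fetch(`${COUCH_URL}/${DB}/_changes?${qs}`);
    const body = await res.json();
    for (const change of body.results) {
      await processDoc(change.doc);
    }
    since = body.last_seq; // persist this so a restart resumes here, not at 0
    if (!body.results.length) {
      // caught up - wait before checking again rather than rescanning from 0
      await new Promise(resolve => setTimeout(resolve, 5 * 60 * 1000));
    }
  }
};

processChanges(0); // last_valid_seq || 0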

Some more analysis...

  1. The 3 "ohw" prefixed transitions will never run - the transition runner checks for a filter method which none of those transitions have.
  2. Most of the rest of the transitions will only run once per change, even if the change errors out, because they check hasRun first and the transition status is recorded when an error occurs. So even if the doc is updated, the transition won't run again unless it's manually edited to remove the transition state. I wonder if the hasRun function should be changed to hasRunSuccessfully, because the doc may have been changed in a way that means the transition now succeeds (see the sketch after this list).
  3. The twilio_message transition should keep processing pending messages until they succeed. Currently it's only run on change and on backlog processing, which is unreliable, so it won't handle situations where the server lacks internet access.
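
To make the hasRun point concrete, the distinction is roughly the following — the doc.transitions[name].ok shape is an assumption for illustration, not necessarily the exact structure sentinel records:

// hasRun skips the transition if it has ever been recorded, even on failure.
const hasRun = (doc, name) =>
  Boolean(doc.transitions && doc.transitions[name]);

// hasRunSuccessfully only skips when the previous run succeeded, so a doc
// that errored out gets another chance the next time it changes.
const hasRunSuccessfully = (doc, name) =>
  Boolean(doc.transitions && doc.transitions[name] && doc.transitions[name].ok);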

Code review

PR at https://github.com/medic/medic-sentinel/pull/81

I've replaced the broken last_valid_seq view with a doc with id sentinel-meta-data that records the last seq the backlog processor got to. This starts at 0, so existing sites will start processing again, but transitions are designed to be run multiple times, so other than a slight performance hit it should be fine.
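
In rough terms the approach looks like this — a sketch only, not the PR code; the processed_seq field name and the PouchDB-style db handle are assumptions:

// Sketch of reading and updating the sentinel-meta-data doc that records the
// last seq the backlog processor got to. Assumes a PouchDB-style db handle
// and a processed_seq field - both assumptions, not the PR's exact shape.
const META_ID = 'sentinel-meta-data';

const getProcessedSeq = async (db) => {
  try {
    const meta = await db.get(META_ID);
    return meta.processed_seq || 0;
  } catch (err) {
    if (err.status === 404) {
      return 0; // no meta doc yet: existing sites start from the beginning
    }
    throw err;
  }
};

const updateProcessedSeq = async (db, seq) => {
  let meta;
  try {
    meta = await db.get(META_ID);
  } catch (err) {
    if (err.status !== 404) {
      throw err;
    }
    meta = { _id: META_ID };
  }
  meta.processed_seq = seq;
  return db.put(meta);
};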

I'll create issues for the other bugs listed above as this is the only blocker.

Batch processing a few hundred docs at a time would be nice but probably best to leave it for next time. LGTM

Looks like this has broken transitions? If I:

  • Run couch, api and sentinel
  • have a gateway user
  • send in the following: curl -i -u gateway:pass --data-urlencode 'message=P 4 Alice' --data-urlencode 'from=+447765902019' --data-urlencode 'sent_timestamp=1469027270' -X POST http://localhost:5988/api/v1/records

With prior codebases, it would generate tasks and scheduled tasks, as well as flesh out other parts of the document:

[screenshot]

After this change, it doesn't look like sentinel finds the document that needs processing:

[screenshot]

For acceptance testing, restart sentinel multiple times and check the log to make sure it's progressing through the backlog, not just starting from the beginning every time.
