Original ticket: https://github.com/medic/medic-projects/issues/2409
This has been reported a few times. The workaround has been to delete in smaller batches, but we should probably fix this so that new projects and projects that are upgrading can make better use of this feature. To reproduce, try to delete a few thousand records on any instance and you'll get a 500 error.
@estellecomment has suggested batching as a potential solution.
cc @ashish-medic @bishwas-medic
Possibly related: https://github.com/medic/medic-api/issues/96
@sglangevin what do you think is the maximum number of records that should be supported in the bulk delete UI? Naively I'd suggest that if you're deleting hundreds of documents you wouldn't want to be doing it through the UI anyway; that's something a script is more suited to. However, I may be missing some use case here.
Use case: the carrier wants to send new ringtones to the gateway phone. A message loop ensues and 10k messages are created for nothing. The tech lead has to delete them.
Another use case: hundreds or thousands of messages are sent during training and we want to delete them. Bulk delete was meant to make it easy for PMs to delete training data.
Yes, I understand there exist situations where you'd want to delete thousands of documents. I'm sure there are many more than mentioned. The question was whether or not the UI was required to solve them.
Considering that Gareth sounded surprised that bulk delete was being used to delete thousands of documents, the requirement for it to scale to the thousands clearly got lost somewhere.
@sglangevin are the search tools adequate for finding these documents?
I am wondering if we'd also need to take another look at the things we're doing that enable us to have a nice UI (for example, every time you select a document you see info about it added to a list), because I worry that is also going to be awful for thousands of documents, especially on poor connections.
OK, I've thought about this some more. Here is what I think we'd need to do to support deleting 10,000 documents in the UI:
`{id: 1234, progress: 400, total: 10000}` - as api completes a successful batch it updates this number in some shared data structure somewhere.

Things I'm not sure how to solve:
If it doesn't already (I don't have enough local data to test), select all should select all search results, not just the results currently loaded in the LHS.
I can confirm that all the reports, not just the ones displayed in the LHS, are selected.
Deleting a bunch of docs will generate a lot of UI churn for admins if they're on the same screen (the _changes feed will fire a lot). How do we make sure that chattiness doesn't cause problems?
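As a rough sketch of how the batched delete with a shared `{id, progress, total}` record might hang together (illustrative only: `deleteBatch`, `progressStore` and `bulkDelete` are hypothetical names, not actual medic-api code):

```javascript
// Illustrative sketch: split the ids into batches, delete each batch, and
// update a shared progress record ({id, progress, total}) that the client
// could poll. Nothing here is real medic-api code.
const BATCH_SIZE = 100;

const progressStore = {}; // shared data structure, keyed by job id

const chunk = (arr, size) => {
  const out = [];
  for (let i = 0; i < arr.length; i += size) {
    out.push(arr.slice(i, i + size));
  }
  return out;
};

const bulkDelete = async (jobId, docIds, deleteBatch) => {
  progressStore[jobId] = { id: jobId, progress: 0, total: docIds.length };
  for (const batch of chunk(docIds, BATCH_SIZE)) {
    await deleteBatch(batch); // e.g. a POST to _bulk_docs with _deleted: true
    progressStore[jobId].progress += batch.length;
  }
  return progressStore[jobId];
};
```

A client could then poll the job id and render a progress bar instead of holding one huge request open.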
We've already seen this creating issues. It ends up using a lot of memory, and the browser hangs and needs a refresh. In the worst case, we had to close the window and open a new one.
As an alternative to Stefan's backend approach (a middle road, pushing the client side to its limit), you might have a UI that shows a max of 1000, so to delete 10,000 records you would need to do 10 searches until your data is cleaned out. It still uses the bulk edit API, but the client makes the request. We could add an endpoint to medic-api to help reduce the payload the client needs to send, but backend changes would be minimal. It would be a pain for end users, but less of a pain. I also agree the current UI design might not suffice; we'd probably create a new search/bulk edit screen that looks more like a table, allowing you to scan many records at once, optionally expand them, and maybe even choose which fields/columns to display.
Once we have limited client-side support we might improve incrementally by adding backend support for processing bulk edit jobs. At that point I think we're looking at migrations (there's a fine line between bulk edits and migrations), and to do a migration I think we need to use a view (or a mango query in CouchDB 2) to know when it is done. The bulk edit screen would look more like the temp view browser in Futon or the mango query screen in Fauxton: you write a query and the screen updates with a pager and results. We would limit the query to certain record types to avoid unintentionally deleting system docs, and we would detach from the changes feed(s). I'm not sure how to manage other clients already listening to changes, but at least during bulk edit this client would be stable. When the bulk edit request is done running we'd process the results, give you the number of successful updates (e.g. 489 docs updated, 11 failed), then run the query you were using again so you can continue editing.
Since this sounds like a big project, this middle-road approach might allow us to focus on the client side first and then make further improvements to the backend later?
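The middle-road loop described above might look something like this sketch (`search` and `bulkDelete` are assumed helper names standing in for the real search service and bulk edit endpoint):

```javascript
// Sketch of the middle-road approach: the client repeatedly searches for up
// to MAX_PAGE matching docs and deletes them via the bulk edit API until the
// search comes back empty. search and bulkDelete are assumed helpers, not
// the real medic-webapp/medic-api functions.
const MAX_PAGE = 1000;

const deleteAllMatching = async (search, bulkDelete) => {
  let deleted = 0;
  while (true) {
    const docs = await search({ limit: MAX_PAGE });
    if (!docs.length) {
      return deleted; // data is cleaned out
    }
    await bulkDelete(docs.map(doc => ({ ...doc, _deleted: true })));
    deleted += docs.length;
  }
};
```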
@mandric I agree with your first sentence. A simple hack to just select no more than N (I feel like it's more likely to be 100 than 1000) docs. @sglangevin do you have thoughts here?
_(Not sure about the rest of your proposal, I don't really get it / agree with it, but that's neither here nor there)_
@SCdF can we make it 500? I really like your longer proposal and I'm wondering if we can put in a limit as a quick fix (like the idea, @mandric) and file an issue for the rest which I think would be a great improvement on the back end.
@bishwas-medic if we limit to deleting 500 at a time, do you think that would be enough to handle normal training data deletion? For issues like the one you faced with the operator messages looping and creating 10k messages, we could still use a script for deletion for now and can work on further bulk delete improvements in the near future.
@SCdF are there other pieces of your proposal that would be necessary in order for the hack to work? Or would it be as simple as it sounds?
Not sure if this helps but here's my "state of the art" project template when it comes to large bulk edits or deletes. https://github.com/medic/medic-bulk-utils/tree/records-support/project-templates/migrations/move-child_name-to-patient_name
@sglangevin yeah, you are correct. The situation we faced this week (around 42K messages across 3 instances) rarely happens and we can use the scripts to delete them like we did this time. It's the training data that we mostly want the feature for. For most trainings, we expect around 600 messages (20 trainees × 30 messages) each day, so limiting to 500 should be a great start.
@sglangevin we can make the number any number you like :P The question is whether or not it will scale to that, and I don't know that answer, we'd have to test.
Also, @bishwas-medic / @sglangevin my understanding (though this may be old info) is that we didn't train on the same instances that we worked on, precisely for this reason. It sounds like this has changed?
> we didn't train on the same instances that we worked on
We've never done that for projects in Asia. That could work for small projects in small areas, but for most of our projects, we have a rolling training schedule. So while one group of training is going on, the older group will have already started sending in reports. That complicates the deletion process.
@SCdF we are only doing that for projects where CHWs are using the Android app. For SMS projects, we use the same instance for training and the deployment.
Select All currently seems to be broken.
While individually selecting reports correctly adds them to $scope.model.docs (somehow), select all manages to add the right number of entries to this array, but they are all undefined instead of having a value. Unless you expand a doc on the RHS after hitting select all: any docs you expand will be hydrated from undefined to the correct doc, which can then be used to delete them.
No idea why this is happening, mostly because for now it's entirely unclear to me how $scope.model.docs actually gets set.
Fixed the above issue, see: https://github.com/medic/medic-webapp/issues/3646
Unfortunately medic-api (or express, or nodejs) is causing us to get this if we try to push more than 150 or so docs:
```
::1 - admin [12/Jul/2017:16:03:20 +0000] "POST /medic/_bulk_docs HTTP/1.1" - - "-" "curl/7.51.0"
ERROR { Error: socket hang up
    at createHangUpError (_http_client.js:254:15)
    at Socket.socketOnEnd (_http_client.js:346:23)
    at emitNone (events.js:91:20)
    at Socket.emit (events.js:185:7)
    at endReadableNT (_stream_readable.js:974:12)
    at _combinedTickCallback (internal/process/next_tick.js:74:11)
    at process._tickDomainCallback (internal/process/next_tick.js:122:9)
  name: 'Error',
  scope: 'socket',
  errid: 'request',
  code: 'ECONNRESET',
  description: 'socket hang up',
  stacktrace:
   [ 'Error: socket hang up',
     '    at createHangUpError (_http_client.js:254:15)',
     '    at Socket.socketOnEnd (_http_client.js:346:23)',
     '    at emitNone (events.js:91:20)',
     '    at Socket.emit (events.js:185:7)',
     '    at endReadableNT (_stream_readable.js:974:12)',
     '    at _combinedTickCallback (internal/process/next_tick.js:74:11)',
     '    at process._tickDomainCallback (internal/process/next_tick.js:122:9)' ] }
```
Server error: (identical stack trace repeated)
This seems to be related to our friend auditing, unfortunately.
This looks to be a problem in nano, or a library nano uses: https://github.com/apache/couchdb-nano/issues/54
I've changed how auditing works so it batch-audits large audit requests.
This has solved the backend issue for now.
However, deleting 250 documents at once still destroys our UI. It freezes my tab for minutes (on a beefy laptop). I can't imagine what it would do on a normal laptop.
@garethbowen do you have any off the cuff thoughts about what this could be caused by? I presume a very large changes feed, combined with a bunch of angular watchers or something? Any thoughts as to how we could even begin to fix this?
Actually, maybe it's OK. I ran it without the debug tools open and it performed much more reasonably. Deleting 500 documents takes 10 seconds or so before we say the delete has finished, and then the UI chugs along, unusable, for a minute or so while angular / pouch's change feed catches up.
I think for a reasonably gross hack for something that will be rarely used like this maybe it's OK.
@garethbowen PR for couchdb-audit: https://github.com/medic/couchdb-audit/pull/4
Back to you @SCdF
The fix for the UI is probably pagination. The design said not to paginate because users are expected to review each item before deleting, but this clearly doesn't hold for use cases where they want to delete thousands of records.
I guess the correct pagination approach is to show the first x (200 or so?) records, user can review, click delete, then we show the next x records. Maybe raise this with design or raise an issue to discuss it further?
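The review-then-delete flow described above could be sketched like this (`fetchPage`, `confirm` and `deleteDocs` are assumed callbacks, not real medic-webapp functions):

```javascript
// Illustrative sketch of paginated review-then-delete: fetch a page of
// PAGE_SIZE records, let the user review and confirm it, delete it, then
// fetch the next page. All names here are hypothetical.
const PAGE_SIZE = 200;

const paginatedDelete = async (fetchPage, confirm, deleteDocs) => {
  let deleted = 0;
  for (;;) {
    const page = await fetchPage(PAGE_SIZE);
    if (!page.length) {
      break; // nothing left matching the search
    }
    if (!(await confirm(page))) {
      break; // user reviewed this page and cancelled
    }
    await deleteDocs(page);
    deleted += page.length;
  }
  return deleted;
};
```

Keeping each delete to a page keeps the _changes churn bounded, which is what was hurting the UI in the first place.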
Made some changes that are large enough to require another review I think @garethbowen.
> The design said not to paginate because users are expected to review each item before deleting
This is sort of an aside from this ticket, but I think it would be really helpful if @sglangevin could work with @diannakane (or design in general) on our new expectations for bulk delete, based on usage in deployed environments. From what I'm hearing, it sounds like it needs another run around with design.
@SCdF Back to you. The code looks good but the build is failing. When you get a clean build, merge away!
Oh, right. Tests. You've twisted my arm, I'll write some.
I'm not sure I fully understand what the decision was here, but paginating so that users can only delete a few hundred records at a time should be fine for this use case. This feature is used by every project, usually immediately before the initial deployment, but it may also be used if additional health workers are trained when a project expands. So it's used by everyone, though not on a daily basis.
It sounds like we may have had a bit of miscommunication around the intended use of this feature and the performance implications of certain design decisions. We can review again and come back to this, perhaps after the audit stuff is done in 3.0? As long as this is working well enough, this isn't a high priority for redesign. If we continue to have problems with it, we can bump up the priority. cc @diannakane
This is available in 2.12.2-beta.2. Note that this will currently not work in master due to https://github.com/medic/medic-webapp/issues/3646
I will probably ask @sglangevin to test/confirm this on an instance with 'more than a few hundred records'?
I'm assuming @SCdF meant 2.12.1-beta.2. I've already tested this, and 2.12.1 was released.