Original ticket: https://github.com/medic/medic-projects/issues/2409
This has been reported a few times. The workaround has been to delete in smaller batches, but we should probably fix this so that new projects and projects that are upgrading can make better use of this feature. To reproduce, try to delete a few thousand records on any instance and you'll get a 500 error.
@estellecomment has suggested batching as a potential solution.
cc @ashish-medic @bishwas-medic
Possibly related: https://github.com/medic/medic-api/issues/96
@sglangevin what do you think is the maximum number of records that should be supported in the bulk delete UI? Naively I'd suggest that if you're deleting hundreds of documents you wouldn't want to be doing it through the UI anyway; that's something a script is more suited to. However, I may be missing some use case here.
Use case: the carrier wants to send new ringtones to the gateway phone. A message loop ensues and 10k messages are created for nothing. The tech lead has to delete them.
Another use case: hundreds or thousands of messages are sent during training and we want to delete them. Bulk delete was meant to make it easy for PMs to delete training data.
Yes, I understand there exist situations where you'd want to delete thousands of documents. I'm sure there are many more than mentioned. The question was whether or not the UI was required to solve them.
Considering that Gareth sounded surprised that bulk delete was being used to delete thousands of documents, the requirement for it to scale to the thousands clearly got lost somewhere.
@sglangevin are the search tools adequate for finding these documents?
I am wondering if we'd also need to take another look at the things we're doing that enable us to have a nice UI (for example, every time you select a document you see info about it added to a list), because I worry that is also going to be awful for thousands of documents, especially on poor connections.
OK, I've thought about this some more. Here is what I think we'd need to do to support deleting 10,000 documents in the UI:
`{id: 1234, progress: 400, total: 10000}` - as api completes a successful batch it updates this number in some shared data structure somewhere.

Things I'm not sure how to solve:
If it doesn't already (I don't have enough local data to test), select all should select all search results, not just the results currently loaded in the LHS.
I can confirm that all the reports, not just the ones displayed in the LHS, are selected.
Deleting a bunch of docs will generate a lot of UI churn for admins if they're on the same screen (the _changes feed will fire a lot). How do we make sure that chattiness doesn't cause problems?
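As a rough sketch of how the batched delete with a shared `{id, progress, total}` record might hang together (illustrative only: `deleteBatch`, `progressStore` and `bulkDelete` are hypothetical names, not actual medic-api code):

```javascript
// Illustrative sketch: split the ids into batches, delete each batch, and
// update a shared progress record ({id, progress, total}) that the client
// could poll. Nothing here is real medic-api code.
const BATCH_SIZE = 100;

const progressStore = {}; // shared data structure, keyed by job id

const chunk = (arr, size) => {
  const out = [];
  for (let i = 0; i < arr.length; i += size) {
    out.push(arr.slice(i, i + size));
  }
  return out;
};

const bulkDelete = async (jobId, docIds, deleteBatch) => {
  progressStore[jobId] = { id: jobId, progress: 0, total: docIds.length };
  for (const batch of chunk(docIds, BATCH_SIZE)) {
    await deleteBatch(batch); // e.g. a POST to _bulk_docs with _deleted: true
    progressStore[jobId].progress += batch.length;
  }
  return progressStore[jobId];
};
```

A client could then poll the job id and render a progress bar instead of holding one huge request open.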
We've already seen this creating issues. It ends up using a lot of memory, and the browser hangs and needs a refresh. In the worst case, we had to close the window and open a new one.
As an alternative to Stefan's backend approach (a middle road, pushing the client side to its limit), you might have a UI that shows a max of 1000, so to delete 10,000 records you would need to do 10 searches until your data is cleaned out. It still uses the bulk edit API, but the client makes the request. We could add an endpoint to medic-api to help reduce the payload the client needs to send, but backend changes would be minimal. It would be a pain for end users, but less of a pain. I also agree the current UI design might not suffice; we'd probably create a new search/bulk edit screen that looks more like a table, allowing you to scan many records at once, optionally expand them, and maybe even choose which fields/columns to display.
Once we have limited client-side support we might improve incrementally by adding backend support for processing bulk edit jobs. At that point I think we're looking at migrations (there's a fine line between bulk edits and migrations), and to do a migration I think we need to use a view (or a mango query in CouchDB 2) to know when it is done. The bulk edit screen would look more like the temp view browser in Futon or the mango query screen in Fauxton: you write a query and the screen updates with a pager and results. We would limit the query to certain record types to avoid unintentionally deleting system docs, and we would detach from the changes feed(s). I'm not sure how to manage other clients already listening to changes, but at least during bulk edit this client would be stable. When the bulk edit request is done running we'd process the results, give you the number of successful updates (e.g. 489 docs updated, 11 failed), then run the query you were using again so you can continue editing.
Since this sounds like a big project, this middle-road approach might allow us to focus on the client side first and then make further improvements to the backend later?
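The middle-road loop described above might look something like this sketch (`search` and `bulkDelete` are assumed helper names standing in for the real search service and bulk edit endpoint):

```javascript
// Sketch of the middle-road approach: the client repeatedly searches for up
// to MAX_PAGE matching docs and deletes them via the bulk edit API until the
// search comes back empty. search and bulkDelete are assumed helpers, not
// the real medic-webapp/medic-api functions.
const MAX_PAGE = 1000;

const deleteAllMatching = async (search, bulkDelete) => {
  let deleted = 0;
  while (true) {
    const docs = await search({ limit: MAX_PAGE });
    if (!docs.length) {
      return deleted; // data is cleaned out
    }
    await bulkDelete(docs.map(doc => ({ ...doc, _deleted: true })));
    deleted += docs.length;
  }
};
```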
@mandric I agree with your first sentence. A simple hack to just select no more than N (I feel like it's more likely to be 100 than 1000) docs. @sglangevin do you have thoughts here?
_(Not sure about the rest of your proposal, I don't really get it / agree with it, but that's neither here nor there)_
@SCdF can we make it 500? I really like your longer proposal and I'm wondering if we can put in a limit as a quick fix (like the idea, @mandric) and file an issue for the rest which I think would be a great improvement on the back end.
@bishwas-medic if we limit to deleting 500 at a time, do you think that would be enough to handle normal training data deletion? For issues like the one you faced with the operator messages looping and creating 10k messages, we could still use a script for deletion for now and can work on further bulk delete improvements in the near future.
@SCdF are there other pieces of your proposal that would be necessary in order for the hack to work? Or would it be as simple as it sounds?
Not sure if this helps but here's my "state of the art" project template when it comes to large bulk edits or deletes. https://github.com/medic/medic-bulk-utils/tree/records-support/project-templates/migrations/move-child_name-to-patient_name
@sglangevin yeah, you are correct. The situation we faced this week (around 42K messages across 3 instances) rarely happens and we can use the scripts to delete them like we did this time. It's the training data that we mostly want the feature for. For most trainings, we expect around 600 messages (20 trainees × 30 messages) each day, so limiting to 500 should be a great start.
@sglangevin we can make the number any number you like :P The question is whether or not it will scale to that, and I don't know that answer, we'd have to test.
Also, @bishwas-medic / @sglangevin my understanding (though this may be old info) is that we didn't train on the same instances that we worked on, precisely for this reason. It sounds like this has changed?
> we didn't train on the same instances that we worked on
We've never done that for projects in Asia. That could work for small projects in small areas, but for most of our projects, we have a rolling training schedule. So while one group of training is going on, the older group will have already started sending in reports. That complicates the deletion process.
@SCdF we are only doing that for projects where CHWs are using the Android app. For SMS projects, we use the same instance for training and the deployment.
Select All currently seems to be broken.
While individually selecting reports correctly adds them to $scope.model.docs (somehow), select all manages to add the right number of entries to this array, but they are all undefined instead of having a value. Unless you expand a doc on the RHS after hitting select all: any docs you expand will be hydrated from undefined to the correct doc, which can then be used to delete them.
No idea why this is happening, mostly because for now it's entirely unclear to me how $scope.model.docs actually gets set.
Fixed the above issue, see: https://github.com/medic/medic-webapp/issues/3646
Unfortunately medic-api (or express, or nodejs) is causing us to get this if we try to push more than 150 or so docs:
```
::1 - admin [12/Jul/2017:16:03:20 +0000] "POST /medic/_bulk_docs HTTP/1.1" - - "-" "curl/7.51.0"
ERROR { Error: socket hang up
    at createHangUpError (_http_client.js:254:15)
    at Socket.socketOnEnd (_http_client.js:346:23)
    at emitNone (events.js:91:20)
    at Socket.emit (events.js:185:7)
    at endReadableNT (_stream_readable.js:974:12)
    at _combinedTickCallback (internal/process/next_tick.js:74:11)
    at process._tickDomainCallback (internal/process/next_tick.js:122:9)
  name: 'Error',
  scope: 'socket',
  errid: 'request',
  code: 'ECONNRESET',
  description: 'socket hang up',
  stacktrace:
   [ 'Error: socket hang up',
     '    at createHangUpError (_http_client.js:254:15)',
     '    at Socket.socketOnEnd (_http_client.js:346:23)',
     '    at emitNone (events.js:91:20)',
     '    at Socket.emit (events.js:185:7)',
     '    at endReadableNT (_stream_readable.js:974:12)',
     '    at _combinedTickCallback (internal/process/next_tick.js:74:11)',
     '    at process._tickDomainCallback (internal/process/next_tick.js:122:9)' ] }
```
Server error: (identical stack trace repeated)
This seems to be related to our friend auditing, unfortunately.
This looks to be a problem in nano, or a library nano uses: https://github.com/apache/couchdb-nano/issues/54
I've changed how auditing works so it batch-audits large audit requests.
This has solved the backend issue for now.
However, deleting 250 documents at once still destroys our UI. It freezes my tab for minutes (on a beefy laptop). I can't imagine what it would do on a normal laptop.
@garethbowen do you have any off the cuff thoughts about what this could be caused by? I presume a very large changes feed, combined with a bunch of angular watchers or something? Any thoughts as to how we could even begin to fix this?
Actually, maybe it's OK. I ran it without the debug tools open and it performed much more reasonably. Deleting 500 documents takes 10 seconds or so before we say the delete has finished, and then the UI chugs along, unusable, for a minute or so while angular / pouch's change feed catches up.
I think for a reasonably gross hack for something that will be rarely used like this maybe it's OK.
@garethbowen PR for couchdb-audit: https://github.com/medic/couchdb-audit/pull/4
Back to you @SCdF
The fix for the UI is probably pagination. The design said not to paginate because users are expected to review each item before deleting, but this clearly doesn't hold for use cases where they want to delete thousands of records.
I guess the correct pagination approach is to show the first x (200 or so?) records, user can review, click delete, then we show the next x records. Maybe raise this with design or raise an issue to discuss it further?
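The review-then-delete flow described above could be sketched like this (`fetchPage`, `confirm` and `deleteDocs` are assumed callbacks, not real medic-webapp functions):

```javascript
// Illustrative sketch of paginated review-then-delete: fetch a page of
// PAGE_SIZE records, let the user review and confirm it, delete it, then
// fetch the next page. All names here are hypothetical.
const PAGE_SIZE = 200;

const paginatedDelete = async (fetchPage, confirm, deleteDocs) => {
  let deleted = 0;
  for (;;) {
    const page = await fetchPage(PAGE_SIZE);
    if (!page.length) {
      break; // nothing left matching the search
    }
    if (!(await confirm(page))) {
      break; // user reviewed this page and cancelled
    }
    await deleteDocs(page);
    deleted += page.length;
  }
  return deleted;
};
```

Keeping each delete to a page keeps the _changes churn bounded, which is what was hurting the UI in the first place.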
Made some changes that are large enough to require another review I think @garethbowen.
> The design said not to paginate because users are expected to review each item before deleting
This is sort of an aside from this ticket, but I think it would be really helpful if @sglangevin could work with @diannakane (or design in general) on our new expectations for bulk delete, based on usage in deployed environments. From what I'm hearing, it sounds like it needs another run around with design.
@SCdF Back to you. The code looks good but the build is failing. When you get a clean build, merge away!
Oh, right. Tests. You've twisted my arm, I'll write some.
I'm not sure I fully understand what the decision was here, but paginating so that users can only delete a few hundred records at a time should be fine for this use case. This feature is used by every project, usually immediately before the initial deployment, but it may also be used if additional health workers are trained when a project expands. So it's used by everyone, though not on a daily basis.
It sounds like we may have had a bit of miscommunication around the intended use of this feature and the performance implications of certain design decisions. We can review again and come back to this, perhaps after the audit stuff is done in 3.0? As long as this is working well enough, this isn't a high priority for redesign. If we continue to have problems with it, we can bump up the priority. cc @diannakane
This is available in 2.12.2-beta.2. Note that this will currently not work in master due to https://github.com/medic/medic-webapp/issues/3646
I will probably ask @sglangevin to test/confirm this on an instance with 'more than a few hundred records'?
I'm assuming @SCdF meant 2.12.1-beta.2. I've already tested this, and 2.12.1 was released.