When replicating to a Tecno WX3, the IndexedDB database appears to die and errors show in the console. Reloading the app completes the startup process and the app becomes usable.
```
Publishing 'progress' event with: Object {purged: 2325, processed: 486, total: 486}
Purge complete, purged 2325 documents
Publishing 'optimise' event with: undefined
Database has a global failure DOMException: Transaction timed out due to inactivity.
Uncaught (in promise) n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
```
Steps to reproduce:
1. Use the user filysa on the purge-upgrade-muso-mali instance.
2. Log in to a fresh app install on a Tecno WX3, with chrome://inspect running against the mobile app.
What should happen:
The app should launch without error.
What actually happens:
The app replicates and purges documents, then the 404 error page is displayed and the error above appears in the log. Currently we have a solution that handles this error when it is unhandled, but in this case the error is being handled.
https://medic.slack.com/files/UBHER12S2/FFZ6U3HEJ/indexed_db_went_bad.log
Environment:
> Currently we have a solution to handle this when it is unhandled but this is being handled.
To clarify -- this quote refers to the changes from #5045, which prompt the user to reload the page for the unhandled rejection `DOMException: Failed to execute 'transaction' on 'IDBDatabase': The database connection is closing.`
The exception in this issue is `DOMException: Transaction timed out due to inactivity.` Although both are DOMExceptions, this isn't the same issue seen in Siaya. As written, I wouldn't expect to see a reload dialog for the error reported in this issue, whether handled or unhandled.
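To illustrate the distinction, a minimal sketch of how the two failures could be told apart (the function name and error shapes are illustrative, not the actual app code -- only the two DOMException messages come from the logs). The #5045 reload prompt keys off the "connection is closing" message, while this issue's error surfaces as a PouchDB error object with reason `TimeoutError`, so a check like this would never show the dialog for it:

```javascript
// Hypothetical classifier: should a "please reload" dialog be shown for
// this rejection? Only the "connection is closing" DOMException (the Siaya
// issue, handled by #5045) qualifies; the TimeoutError in this issue does not.
function shouldPromptReload(err) {
  if (!err) {
    return false;
  }
  const message = err.message || '';
  return message.includes('The database connection is closing');
}

// The Siaya-style failure: DOMException: Failed to execute 'transaction'
// on 'IDBDatabase': The database connection is closing.
const siayaError = new Error(
  "Failed to execute 'transaction' on 'IDBDatabase': The database connection is closing."
);

// The failure from this issue, as PouchDB reports it in the console.
const thisIssueError = {
  status: 500,
  name: 'indexed_db_went_bad',
  message: 'unknown',
  error: true,
  reason: 'TimeoutError', // DOMException: Transaction timed out due to inactivity.
};

shouldPromptReload(siayaError);     // true: reload dialog shown
shouldPromptReload(thisIssueError); // false: no dialog for this issue's error
```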
@SCdF As the resident IDB and purging expert does this ring any bells?
Originally I thought it might be the view cleanup that happens after purge, but looking at the whole log there is a lot more going on.
A slightly more complete look at that log is:
```
Purge function has changed, running purge
Publishing 'start' event with: Object {totalContacts: 486}
user
Object {name: "filysa", roles: Array[5], locale: "en"}
Publishing 'progress' event with: Object {purged: 0, processed: 1, total: 486}
user
Object {name: "filysa", roles: Array[5], locale: "en"}
Publishing 'progress' event with: Object {purged: N, processed: M, total: 486}
^ 486 times
Purge complete, purged 2325 documents
Publishing 'optimise' event with: undefined
Database has a global failure DOMException: Transaction timed out due to inactivity.
Uncaught (in promise) n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Uncaught (in promise) DOMException: Failed to execute 'transaction' on 'IDBDatabase': The database connection is closing.(…)
^ * 2800 times
n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
db.type() is deprecated and will be removed in a future version of PouchDB
Initiating changes service
Database has a global failure DOMException: Transaction timed out due to inactivity.
Error updating icons n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error initialising watching for db changes n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Attempting changes initialisation in 5 seconds
Error initialising zscore util n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error fetching settings n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error loading facilities n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error fetching settings n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error getting tasks n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error checking authorization n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
^ * 25 times
n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"} "Possibly unhandled rejection: {"status":500,"name":"indexed_db_went_bad","message":"unknown","error":true,"reason":"TimeoutError"}"
n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"} "Possibly unhandled rejection: {"status":500,"name":"indexed_db_went_bad","message":"unknown","error":true,"reason":"TimeoutError"}"
n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"} "Possibly unhandled rejection: {"status":500,"name":"indexed_db_went_bad","message":"unknown","error":true,"reason":"TimeoutError"}"
Error fetching form definitions n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Failed to retrieve forms n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Error fetching contact n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Initiating changes service
Error initialising watching for db changes n {status: 500, name: "indexed_db_went_bad", message: "unknown", error: true, reason: "TimeoutError"}
Attempting changes initialisation in 5 seconds
^ * many more
```
The key numbers here are:
- 2325 documents purged
- 486 contacts processed
- ~2800 `The database connection is closing` errors

I can't get these to line up perfectly, but they get suspiciously close to each other.
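The arithmetic behind that suspicion, under the speculative assumption (mine, not confirmed by the log) that each purged document and each processed contact costs roughly one IDB transaction:

```javascript
// Counts pulled straight from the console log above.
const purged = 2325;              // "Purge complete, purged 2325 documents"
const contactsProcessed = 486;    // "totalContacts: 486"
const closingErrors = 2800;       // "^ * 2800 times"

// If each purge and each per-contact read were one failed transaction,
// we'd expect roughly this many "connection is closing" errors.
const expected = purged + contactsProcessed; // 2811

// Not an exact match, but suspiciously close (off by 11).
const delta = Math.abs(expected - closingErrors);
```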
@newtewt can you confirm that for this user the purging actually occurred? That is to say, were the documents no longer accessible?
My hunch is that IndexedDB has broken either before purge has run, or because we do an all_docs at the start and idb is "not ready" somehow, and so none of the purges were actually successful.
In terms of the actual error, it's hard to track down. Apart from one specific usage that we aren't going to get to, it's just used whenever a transaction fails.
One option is to inject failures into pdb's code and narrow it down to the places where the promise chain doesn't catch.
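A minimal sketch of what that injection could look like (hypothetical helper against a fake db object, not PouchDB's actual internals): wrap one async method so it always rejects, then watch the console for "Possibly unhandled rejection" warnings to find the call sites with no catch.

```javascript
// Hypothetical fault-injection helper: replace an async method on a
// db-like object with one that always rejects, and return a function
// that restores the original so the injection can be undone.
function injectFailure(obj, methodName, error) {
  const original = obj[methodName];
  obj[methodName] = () => Promise.reject(error);
  return () => { obj[methodName] = original; };
}

// Example against a fake db: a caller with a .catch() handles the
// injected failure; a caller without one surfaces as an unhandled
// rejection in the console, pinpointing the broken promise chain.
const fakeDb = { allDocs: () => Promise.resolve({ rows: [] }) };
const restore = injectFailure(fakeDb, 'allDocs', new Error('injected TimeoutError'));

fakeDb.allDocs().catch(err => console.log('caught:', err.message));
restore();
```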
I'll confirm today that the docs are unavailable. I was going based on the console log that purging completed.
@SCdF purging completes properly. The home_visit docs are not available anymore after refreshing and getting into the app.
I'm trying to work out if this is a blocker for 3.3.0 or if we should add it to 3.5.0.
@newtewt Do you have any idea how often this would happen?
@SCdF I'm leaning towards bumping it - what do you think?
This happens consistently for the filysa user on first replication and purging, so it is reproducible. However, I logged in as user madinagu and didn't experience that error, and this user has more docs than the filysa user.
Have you tried wiping the filysa user db, or logging in as them on another device? It's possible the db has been corrupted and isn't recovering.
I'm wiping all the data in the Android app. I would need to try another device to be sure as I can't remember off the top of my head right now.
That's messed up. @SCdF Would you mind investigating this user to see if they have a dodgy doc or purge configuration that's messing them up? The goal is just to get 3.3.0 to the point of go/no-go so if in your opinion this isn't required then bump it out to 3.5.0 and we'll pick it off the stack when we get to it.
I'm on it (well, once I get creds etc)
So it took four different devices, but I managed to replicate it, on our old friend the Tecno Y4.
It's unclear why it happens on that device over the others, though it is the worst device I have.
@newtewt can you post a user who does succeed compared to this one, and specifically:
I'm trying to work out if there is some kind of breaking point ratio.
In the mean time, I'm going to pull this user's data down locally and I will try mucking with the code to see if I can produce any interesting results.
@SCdF Here are the two users I'm using on the wx3 and seeing the error. I tried on my pixel 2 and tecno J5 with no errors. Let me know if you want more.
user | total | number of docs after purging | sees error | device
-- | -- | -- | -- | --
filysa | 4792 | 2467 | yes | tecno wx3
madinagu | 14751 | 7255 | no | tecno wx3
So I'm guessing that we're using a WX3 because Muso people have them deployed?
I believe so. @abbyad mentioned that is what the CHWs are using in Mali.
So I'm going to remove the compaction (optimise) step.
We are already supposedly doing this on every write (and the code seems to exist in PouchDB), and manual compaction is seq based, so it's going to run over the whole changes feed again, which is a lot.
There is no reason this should cause crashes, but there is also no reason for us to manually compact if we're always compacting when we write.
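The reasoning above can be sketched as a small guard. `auto_compaction` is a real PouchDB constructor option and `db.compact()` is its real manual API, but `shouldRunOptimise` and the options shape are illustrative assumptions, not the app's actual code:

```javascript
// Hypothetical guard for the post-purge "optimise" step. With PouchDB's
// auto_compaction option set, old revisions are compacted on every write,
// so a manual db.compact() pass, which is seq-based and re-walks the whole
// changes feed, adds load without adding value.
function shouldRunOptimise(dbOptions) {
  return !(dbOptions && dbOptions.auto_compaction);
}

shouldRunOptimise({ auto_compaction: true }); // false: writes already compact
shouldRunOptimise({});                        // true: manual compaction needed
```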
@newtewt this build is available in purge-no-optimise branch build.
I'm going to test this locally as well, but you should test it on your phones when you get a chance.
So far this has been working fine on the purge-upgrade instance. I haven't seen the crash today. I've done initial replication a few times in the browser, on the Tecno WX3, and on the Tecno J5. I'd say let's push a PR, get another 3.3 beta, and give it another once-over.
@SCdF I'll leave it for you to merge and backport accordingly. The diff looks fine - we should probably delete the translations for "optimize" at the same time.
I made enough changes after @garethbowen's comment that I made a PR and have linked it for Gareth.
@garethbowen leave it to me to merge / forward port.
@newtewt @ngaruko I've merged this into 3.3.x so AT (and hopefully RT) can start. Wait for the build to finish and you'll be all good.
@SCdF To get AT unblocked I've ignored your directive and merged the PR. I also forward ported to 3.4.x and created a PR for you to merge into master.
Coolio, so:
**breathes out**
Replicating and purging are occurring as expected. Closing.