Due to the way _all_docs is called, it wasn't obvious how the couch2pg adapter should take doc revisions into account. If a UUID was already in the database, its contents were not fetched again.
Since this was the case with the couch2pg adapter, the XML Forms parser marks parsed forms using only their UUID and does not take revision into account either.
With the possibility of migrating data, I was informed that the forms might be edited in place, which would change their form version, XML, and possibly other fields, while the UUID would remain the same. These changes would not be picked up.
I need to figure out the best way to add in support for revisions with the couch2pg piece. The XML forms parser piece should be trivial to add revision support for.
The most likely fix is to fetch _all_docs with revisions (if possible), look for revision mismatches, and then fetch _all_docs with a list of UUIDs as is currently done. That list would include docs with outdated revisions as well as docs that are missing entirely.
The question is whether _all_docs can be fetched with revision but not with full contents.
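It can: calling `_all_docs` without `include_docs=true` returns each row's winning rev in `value.rev` without the doc body. A minimal sketch of using that to find UUIDs whose stored rev is stale or missing (the `storedRevs` map and function name here are illustrative, not the adapter's actual API):

```javascript
// Rows from GET /db/_all_docs (no include_docs) look like:
//   { id: "...", key: "...", value: { rev: "2-abc..." } }
// Compare against a map of revs already stored in postgres to decide
// which docs need to be (re)fetched in the second _all_docs call.
function staleOrMissingIds(allDocsRows, storedRevs) {
  return allDocsRows
    .filter(row => storedRevs[row.id] !== row.value.rev)
    .map(row => row.id);
}

// Example: doc B was edited in Couch, doc C is new to us, A is unchanged.
const rows = [
  { id: 'A', key: 'A', value: { rev: '1-aaa' } },
  { id: 'B', key: 'B', value: { rev: '2-bbb' } },
  { id: 'C', key: 'C', value: { rev: '1-ccc' } },
];
const stored = { A: '1-aaa', B: '1-old' };
console.log(staleOrMissingIds(rows, stored)); // → [ 'B', 'C' ]
```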
This is going to require some annoying integration testing, since I'm using medic-data.
What I might have to do is pick a few documents by hand, then change their rev to something else and change the contents to something like "OH NO IT FAILED" within postgres. In theory, when fetching and finding different revs in Couch, those docs in postgres should be replaced with the expected ones.
The test would just go through the UUIDs that were chosen and ensure that they don't say "OH NO IT FAILED".
Testing deletes might work similarly, adding records into postgres and making sure that they are removed.
I might expand this ticket to also support deleting records. I think it's enough of a related problem that I might get some synergy bonus doing both together.
@abbyad Just pointed out this was still assigned to me and listed as Active, but I had to switch to other work and haven't looked at GitHub in some time.
The most robust way of solving this problem is by using the _changes feed instead of _all_docs. An example of this API:
```
scdf at SCdF in ~/Code/Medic/medic-api on develop
$ http "http://admin:pass@localhost:5984/medic/_changes?limit=10&since=22"

HTTP/1.1 200 OK
Cache-Control: must-revalidate
Content-Type: text/plain; charset=utf-8
Date: Mon, 09 May 2016 04:26:31 GMT
ETag: "6PLLHYO3I08OZ3EAC92ST6NTF"
Server: CouchDB/1.6.1 (Erlang OTP/18)
Transfer-Encoding: chunked

{"results":[
{"seq":23,"id":"E3779A17-7B9A-46EF-9981-288672496D9D","changes":[{"rev":"2-d3385a5c35dfb5f9a44711d50e7ac081"}]},
{"seq":24,"id":"1C3370F0-18AC-09A7-B7C8-7F6A24FC270B","changes":[{"rev":"1-7ba3bafba9e79e31e5a7bb6b0b4c4b05"}]},
{"seq":25,"id":"36B10CED-290A-3AA5-B980-2DEC1B3871BA","changes":[{"rev":"1-7661f050b2f15c2733d14b6b32d1606a"}]},
{"seq":26,"id":"477DF724-6013-CE42-9243-8B602E85BA34","changes":[{"rev":"1-0f5c1585f96b0f1388d3cfaaad63b5c0"}]},
{"seq":27,"id":"52D86FF5-7090-2FEF-8EFA-C726717513DA","changes":[{"rev":"1-c9308999b1a07344a73c6982da31135e"}]},
{"seq":28,"id":"5799AC27-7B69-652B-8F46-FDFB7024236D","changes":[{"rev":"4-c02644f269010dfec29d2bb099ffed9a"}]},
{"seq":29,"id":"5BCF2DF9-8C47-3006-BA0A-87FCA45F866D","changes":[{"rev":"3-b51c13cefd67296cd2b6bfda302b6217"}]},
{"seq":30,"id":"9325D4A0-1EBC-8C9D-AFD3-EA77B0769417","changes":[{"rev":"1-d8c3f04960aa8a4737c65b1dd49ed380"}]},
{"seq":31,"id":"AFDB6E09-04B8-E745-A6A5-98AB8E40AB11","changes":[{"rev":"1-30825e50868424f0a4e9a1182c12a196"}]},
{"seq":32,"id":"BE5DBA71-5FF2-227B-84FC-B82AD9822B3B","changes":[{"rev":"4-bd6e309fa4f1b979985c0527cb676347"}],"deleted":true}
],
"last_seq":32}
```
This API lets you get document changes in `limit`-sized chunks, starting from a given `since` value. It tells you which IDs have changed, which revision they've changed to, and whether or not the change is a deletion, which lets us say:
- if a doc has changed, insert or update it
- if `deleted: true`, then it's a delete from postgres
- (you can pass `include_docs=true` to have it give you docs as well, but you can also use bulk get with ids, which IIRC is better performance)
- once done, you store the `last_seq` value, which lets you start again from there next time

If you do that, you might want to add two separate stages: bootstrapping using _all_docs to fetch the current state into an empty database, and then _changes as an iterative step to keep the database up to date.
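The decision logic above can be sketched as a pure function over a `_changes` response (same shape as the example output earlier; the function and field names are illustrative, not the adapter's actual code):

```javascript
// Partition a _changes response into deletions and upserts, and pull out
// the checkpoint to persist for the next run.
function planFromChanges(changesResponse) {
  const deletes = [];
  const upserts = [];
  for (const row of changesResponse.results) {
    if (row.deleted) {
      deletes.push(row.id);  // remove the row from postgres
    } else {
      upserts.push(row.id);  // fetch the doc (or use include_docs) and upsert it
    }
  }
  return { deletes, upserts, lastSeq: changesResponse.last_seq };
}

// Using two rows from the sample output above:
const plan = planFromChanges({
  results: [
    { seq: 23, id: 'E3779A17-7B9A-46EF-9981-288672496D9D',
      changes: [{ rev: '2-d3385a5c35dfb5f9a44711d50e7ac081' }] },
    { seq: 32, id: 'BE5DBA71-5FF2-227B-84FC-B82AD9822B3B',
      changes: [{ rev: '4-bd6e309fa4f1b979985c0527cb676347' }], deleted: true },
  ],
  last_seq: 32,
});
console.log(plan.deletes); // → [ 'BE5DBA71-5FF2-227B-84FC-B82AD9822B3B' ]
console.log(plan.upserts); // → [ 'E3779A17-7B9A-46EF-9981-288672496D9D' ]
console.log(plan.lastSeq); // → 32
```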
While you might be able to get _changes from the dawn of time, anything that was later deleted or modified multiple times is probably a waste to fetch repeatedly when _all_docs would do fine for the initial load.
I think it's a good idea though.
Based on actual use that myself and @michaelkohn have experienced, bootstrapping a database and running the adapter to keep up with changes tend to be different problems. That might be worth a separate issue, but it's worth discussing here when considering whether to replace _all_docs or just augment it.
Yeah, so I believe _changes compresses changes down: if a doc gets changed 20 times and then deleted, you only get one row back saying it's deleted. I'm not sure if that applies when the changes occur outside of the since+limit window you set. Having said that, you don't actually need to set a limit; we can just have the thing run from the last successful last_seq all the way to the end. On my laptop it took a couple of seconds to generate all the JSON for every change from the beginning of time on my LG-prod DB (_changes is fast).
That sounds on par with what @garethbowen told me outside of this channel. So yeah, that sounds like a good solution.
So I've looked at what we convert from XML into JSON:
Code entry points are pretty much here to convert a report and here to convert a contact. These both follow a very similar formula which looks like it extracts everything out.
Against the contact we ignore some stuff, but it's all sensible:

- `meta` is meta junk
- `inputs` is stuff that gets injected into the form and is then represented and saved elsewhere, and so is redundant
- `repeat` etc. are where the children sit when you're adding a family: this data is extracted out into individual contacts and so isn't technically part of the finished contact

I think this means we're good for just using the JSON?
@sglangevin @abbyad @browndav Can you guys take a look at the above?
We're talking about letting the webapp do the XML parsing instead of having it performed in the adapter, to reduce duplicated effort. There was some debate a while ago that the adapter should parse its own XML, but I don't remember those arguments.
The strongest argument for adapter XML parsing that I recall is "if there's a change to XML formatting, we (the dev team) don't necessarily want to be responsible for publishing a whole new version of the webapp to manage that one change." My recollection is probably not entirely accurate.
Anyway, the upshot of relying on webapp XML parsing is that the adapter only needs to work with JSON. As it is now, we create a couple of tables to maintain parsed XML data, so upon update or delete we have to modify all of:

1. the original JSON record,
2. a metadata entry for an XML form, including its form name and form version,
3. the actual data entry for an XML form in its table, listed by form name and form version.

If we get rid of XML parsing from the adapter, that reduces the problem to just the first point: the original JSON record.
@michaelkohn I'm going to ping you to speak from an analytics side. Please take a look at the above and let us know what we might need to consider with respect to analytics and existing projects like LG. You and I might need to create some kind of experimental side project to ensure that we can find all of the expected data without parsing any XML, which could be done independently of any other effort in this ticket.
@btbonval Just so I'm clear, does this mean form_metadata and formview_XXXXX_9999-99-99 go away and we just have the high level views getting data directly from couchdb.doc?
@michaelkohn Not necessarily. It means form_metadata and formview_XXXX_9999-99-99 become views instead of tables, which means we only need to update or delete from the couchdb table and everything magically falls out from that table without any need for softnukes etc.
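To make that concrete, here's a sketch of what a form metadata view might look like if it were derived directly from the raw `couchdb` table. This is a hypothetical schema, not the adapter's actual one: the `doc` jsonb column, the `data_record` type filter, and the field paths are all assumptions for illustration.

```sql
-- Hypothetical: derive form metadata directly from the raw JSON table,
-- so updates and deletes to couchdb cascade automatically with no
-- separate tables to softnuke.
CREATE VIEW form_metadata AS
SELECT
  doc->>'_id'                     AS uuid,
  doc->>'form'                    AS form_name,
  doc->'fields'->>'form_version'  AS form_version,
  doc->>'reported_date'           AS reported
FROM couchdb
WHERE doc->>'type' = 'data_record';
```

The formview_XXXX_9999-99-99 views would follow the same pattern, each filtering on its form name and version.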
> I think this means we're good for just using the JSON?
@SCdF there are historical reasons why we needed the XML data brought into postgres (eg previously only a subset of info was kept in the JSON), but that's changed since https://github.com/medic/medic-webapp/issues/1938 (see _Make all the fields accessible_). That means that the webapp is already doing the parsing, and there is no need to parse the XML further in the adapter.
Oh I remember that, but I thought it was only covering a subset itself. Cool, so that is a positive response.
We have "high level form views" which map every (important) version of some form to what we expect to see in analytics. It might be worth looking over the field mapping that @sglangevin created for LG and making sure we are able to find every field for every form in the existing JSON that resulted from #1938 . It sounds tedious but it is probably worthwhile.
I'm not sure if I should link the mapping Spreadsheet in a public repo like this, as I think it might be share-by-link. I'll send the link to Stefan in private and I'll send anyone else a link on Slack if they want one.
Hey @btbonval so I've finished the couch2pg section of the changes. The PR is here: https://github.com/medic/medic-analytics/pull/30
I'll be working on making sure xmlforms understands edits and deletes shortly, but this change stands on its own, and can be reviewed.
It's reasonably massive, and hard to comprehend from a PR, so when you've got time we can hangout and talk about it :-)
@SCdF We did discuss quite a lot of the changes you had planned to make, which should help some of the comprehension. I will give it a once-over right now and we can schedule some time to talk about it soon.
You were not kidding about massive. "once over" is going to be a bit more of a commitment than I had originally anticipated with https://github.com/medic/medic-analytics/pull/30/commits/67f85cbffbce8e604137b6c8db985803940e1bce
Okay I got through all of it but the tests. I'll come back to those.
Created a PR for the xmlforms side of things: https://github.com/medic/medic-analytics/pull/31
This PR is compared to the existing PR, because it's an addition to that: https://github.com/medic/medic-analytics/pull/30
@btbonval the last step of this is infrastructural and administrative (getting a user for each project setup in postgres, writing a little startup script, getting it deployed etc) which I will create another ticket for. I'd like to close this ticket (and related tickets) before that so we can merge this code into master.
Let me know if you have any questions / confusion etc.
Hey @garethbowen can you check out medic/medic-analytics#31 ?
If you want to you can also take a look at medic/medic-analytics#30, specifically the testing or lack thereof. IMO there are enough tests (integration tests to cover the good cases, unit tests to cover the bad), but you may disagree.
(Oh, and don't merge those PRs, they are for diff reading only as 31 is actually comparing to 30. When you're happy with both I'll merge them manually.)
Merged.
Is there a way to AT this @btbonval @SCdF ?
This was done long ago. We can move it to ready.