Due to the way _all_docs is called, it wasn't obvious how the couch2pg adapter should take doc revisions into account. If a UUID was already in the database, its contents were not fetched again.
Since this was the case with the couch2pg adapter, the XML Forms parser marks parsed forms using only their UUID and does not take revision into account either.
With the possibility of migrating data, I was informed that the forms might be edited in place, which would change their form version, XML, and possibly other fields, while the UUID would remain the same. These changes would not be picked up.
I need to figure out the best way to add in support for revisions with the couch2pg piece. The XML forms parser piece should be trivial to add revision support for.
The most likely fix is to fetch _all_docs with revisions (if possible), look for revision mismatches, and then fetch _all_docs with a list of UUIDs as is currently done. That list would include docs with outdated revisions as well as docs that are missing entirely.
The question is whether _all_docs can be fetched with revision but not with full contents.
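It can: calling `_all_docs` without `include_docs=true` returns each row's winning rev in `value.rev` without the doc body. A minimal sketch of using that to find UUIDs whose stored rev is stale or missing (the `storedRevs` map and function name here are illustrative, not the adapter's actual API):

```javascript
// Rows from GET /db/_all_docs (no include_docs) look like:
//   { id: "...", key: "...", value: { rev: "2-abc..." } }
// Compare against a map of revs already stored in postgres to decide
// which docs need to be (re)fetched in the second _all_docs call.
function staleOrMissingIds(allDocsRows, storedRevs) {
  return allDocsRows
    .filter(row => storedRevs[row.id] !== row.value.rev)
    .map(row => row.id);
}

// Example: doc B was edited in Couch, doc C is new to us, A is unchanged.
const rows = [
  { id: 'A', key: 'A', value: { rev: '1-aaa' } },
  { id: 'B', key: 'B', value: { rev: '2-bbb' } },
  { id: 'C', key: 'C', value: { rev: '1-ccc' } },
];
const stored = { A: '1-aaa', B: '1-old' };
console.log(staleOrMissingIds(rows, stored)); // → [ 'B', 'C' ]
```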
This is going to require some annoying integration testing, since I'm using medic-data.
What I might have to do is pick a few documents by hand, then change their rev to something else and change the contents to something like "OH NO IT FAILED" within postgres. In theory, when fetching and finding different revs in Couch, those docs in postgres should be replaced with the expected ones.
The test would just go through the UUIDs that were chosen and ensure that they don't say "OH NO IT FAILED".
Testing deletes might work similarly, adding records into postgres and making sure that they are removed.
I might expand this ticket to also support deleting records. I think it's enough of a related problem that I might get some synergy bonus doing both together.
@abbyad Just pointed out this was still assigned to me and listed as Active, but I had to switch to other work and haven't looked at GitHub in some time.
The most robust way of solving this problem is by using the _changes feed instead of _all_docs. An example of this API:
```
scdf at SCdF in ~/Code/Medic/medic-api on develop
$ http "http://admin:pass@localhost:5984/medic/_changes?limit=10&since=22"

HTTP/1.1 200 OK
Cache-Control: must-revalidate
Content-Type: text/plain; charset=utf-8
Date: Mon, 09 May 2016 04:26:31 GMT
ETag: "6PLLHYO3I08OZ3EAC92ST6NTF"
Server: CouchDB/1.6.1 (Erlang OTP/18)
Transfer-Encoding: chunked

{"results":[
{"seq":23,"id":"E3779A17-7B9A-46EF-9981-288672496D9D","changes":[{"rev":"2-d3385a5c35dfb5f9a44711d50e7ac081"}]},
{"seq":24,"id":"1C3370F0-18AC-09A7-B7C8-7F6A24FC270B","changes":[{"rev":"1-7ba3bafba9e79e31e5a7bb6b0b4c4b05"}]},
{"seq":25,"id":"36B10CED-290A-3AA5-B980-2DEC1B3871BA","changes":[{"rev":"1-7661f050b2f15c2733d14b6b32d1606a"}]},
{"seq":26,"id":"477DF724-6013-CE42-9243-8B602E85BA34","changes":[{"rev":"1-0f5c1585f96b0f1388d3cfaaad63b5c0"}]},
{"seq":27,"id":"52D86FF5-7090-2FEF-8EFA-C726717513DA","changes":[{"rev":"1-c9308999b1a07344a73c6982da31135e"}]},
{"seq":28,"id":"5799AC27-7B69-652B-8F46-FDFB7024236D","changes":[{"rev":"4-c02644f269010dfec29d2bb099ffed9a"}]},
{"seq":29,"id":"5BCF2DF9-8C47-3006-BA0A-87FCA45F866D","changes":[{"rev":"3-b51c13cefd67296cd2b6bfda302b6217"}]},
{"seq":30,"id":"9325D4A0-1EBC-8C9D-AFD3-EA77B0769417","changes":[{"rev":"1-d8c3f04960aa8a4737c65b1dd49ed380"}]},
{"seq":31,"id":"AFDB6E09-04B8-E745-A6A5-98AB8E40AB11","changes":[{"rev":"1-30825e50868424f0a4e9a1182c12a196"}]},
{"seq":32,"id":"BE5DBA71-5FF2-227B-84FC-B82AD9822B3B","changes":[{"rev":"4-bd6e309fa4f1b979985c0527cb676347"}],"deleted":true}
],
"last_seq":32}
```
This API lets you get document changes in `limit`-sized chunks, starting from a given `since` value. It tells you which IDs have changed, which revision they've changed to, and whether or not the change is a deletion, which lets us say:
- if a doc has changed, insert or update it
- if `deleted: true`, then it's a delete from postgres
- (you can pass `include_docs=true` to have it give you docs as well, but you can also use bulk get with ids, which IIRC is better performance)
- once done, you store the `last_seq` value, which lets you start again from there next time

If you do that, you might want to add two separate stages: bootstrapping using _all_docs to fetch the current state into an empty database, and then _changes as an iterative step to keep the database up to date.
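The decision logic above can be sketched as a pure function over a `_changes` response (same shape as the example output earlier; the function and field names are illustrative, not the adapter's actual code):

```javascript
// Partition a _changes response into deletions and upserts, and pull out
// the checkpoint to persist for the next run.
function planFromChanges(changesResponse) {
  const deletes = [];
  const upserts = [];
  for (const row of changesResponse.results) {
    if (row.deleted) {
      deletes.push(row.id);  // remove the row from postgres
    } else {
      upserts.push(row.id);  // fetch the doc (or use include_docs) and upsert it
    }
  }
  return { deletes, upserts, lastSeq: changesResponse.last_seq };
}

// Using two rows from the sample output above:
const plan = planFromChanges({
  results: [
    { seq: 23, id: 'E3779A17-7B9A-46EF-9981-288672496D9D',
      changes: [{ rev: '2-d3385a5c35dfb5f9a44711d50e7ac081' }] },
    { seq: 32, id: 'BE5DBA71-5FF2-227B-84FC-B82AD9822B3B',
      changes: [{ rev: '4-bd6e309fa4f1b979985c0527cb676347' }], deleted: true },
  ],
  last_seq: 32,
});
console.log(plan.deletes); // → [ 'BE5DBA71-5FF2-227B-84FC-B82AD9822B3B' ]
console.log(plan.upserts); // → [ 'E3779A17-7B9A-46EF-9981-288672496D9D' ]
console.log(plan.lastSeq); // → 32
```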
While you might be able to get _changes from the dawn of time, anything that was later deleted or modified multiple times is probably a waste to fetch repeatedly when _all_docs would do fine for the initial load.
I think it's a good idea though.
Based on actual use that myself and @michaelkohn have experienced, bootstrapping a database and running the adapter to keep up with changes tend to be different problems. That might be worth a separate issue, but it's worth discussing here when considering whether to replace _all_docs or just augment it.
Yeah, so I believe _changes compresses changes down: if a doc gets changed 20 times and then deleted, you only get one row back saying it's deleted. I'm not sure if that applies when the changes occur outside of the since+limit window you set. Having said that, you don't actually need to set a limit; we can just have the thing run from the last successful last_seq all the way to the end. On my laptop it took a couple of seconds to generate all the JSON for every change from the beginning of time on my LG-prod DB (_changes is fast).
That sounds on par with what @garethbowen told me outside of this channel. So yeah, that sounds like a good solution.
So I've looked at what we convert from XML into JSON:
Code entry points are pretty much here to convert a report and here to convert a contact. These both follow a very similar formula which looks like it extracts everything out.
Against the contact we ignore some stuff, but it's all sensible:

- `meta` is meta junk
- `inputs` is stuff that gets injected into the form and is then represented and saved elsewhere, and so is redundant
- `repeat` etc. are where the children sit when you're adding a family: this data is extracted out into individual contacts and so isn't technically part of the finished contact

I think this means we're good for just using the JSON?
@sglangevin @abbyad @browndav Can you guys take a look at the above?
We're talking about letting the webapp do the XML parsing instead of having it performed in the adapter, to reduce duplicated effort. There was some debate a while ago that the adapter should parse its own XML, but I don't remember those arguments.
The strongest argument for adapter XML parsing that I recall is "if there's a change to XML formatting, we (the dev team) don't necessarily want to be responsible for publishing a whole new version of the webapp to manage that one change." My recollection is probably not entirely accurate.
Anyway, the upshot of relying on webapp XML parsing is that the adapter only needs to work with JSON. As it is now, we create a couple of tables to maintain parsed XML data, so upon update or delete we have to modify all of:

1. the original JSON record,
2. a metadata entry for an XML form, including its form name and form version,
3. the actual data entry for an XML form in its table, listed by form name and form version.

If we get rid of XML parsing from the adapter, that reduces the problem to just the first point: the original JSON record.
@michaelkohn I'm going to ping you to speak from an analytics side. Please take a look at the above and let us know what we might need to consider with respect to analytics and existing projects like LG. You and I might need to create some kind of experimental side project to ensure that we can find all of the expected data without parsing any XML, which could be done independently of any other effort in this ticket.
@btbonval Just so I'm clear, does this mean form_metadata and formview_XXXXX_9999-99-99 go away and we just have the high level views getting data directly from couchdb.doc?
@michaelkohn Not necessarily. It means form_metadata and formview_XXXX_9999-99-99 become views instead of tables, which means we only need to update or delete from the couchdb table and everything magically falls out from that table without any need for softnukes etc.
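To make that concrete, here's a sketch of what a form metadata view might look like if it were derived directly from the raw `couchdb` table. This is a hypothetical schema, not the adapter's actual one: the `doc` jsonb column, the `data_record` type filter, and the field paths are all assumptions for illustration.

```sql
-- Hypothetical: derive form metadata directly from the raw JSON table,
-- so updates and deletes to couchdb cascade automatically with no
-- separate tables to softnuke.
CREATE VIEW form_metadata AS
SELECT
  doc->>'_id'                     AS uuid,
  doc->>'form'                    AS form_name,
  doc->'fields'->>'form_version'  AS form_version,
  doc->>'reported_date'           AS reported
FROM couchdb
WHERE doc->>'type' = 'data_record';
```

The formview_XXXX_9999-99-99 views would follow the same pattern, each filtering on its form name and version.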
> I think this means we're good for just using the JSON?
@SCdF there are historical reasons why we needed the XML data brought into postgres (eg previously only a subset of info was kept in the JSON), but that's changed since https://github.com/medic/medic-webapp/issues/1938 (see _Make all the fields accessible_). That means that the webapp is already doing the parsing, and there is no need to parse the XML further in the adapter.
Oh I remember that, but I thought it was only covering a subset itself. Cool, so that is a positive response.
We have "high level form views" which map every (important) version of some form to what we expect to see in analytics. It might be worth looking over the field mapping that @sglangevin created for LG and making sure we are able to find every field for every form in the existing JSON that resulted from #1938 . It sounds tedious but it is probably worthwhile.
I'm not sure if I should link the mapping Spreadsheet in a public repo like this, as I think it might be share-by-link. I'll send the link to Stefan in private and I'll send anyone else a link on Slack if they want one.
Hey @btbonval so I've finished the couch2pg section of the changes. The PR is here: https://github.com/medic/medic-analytics/pull/30
I'll be working on making sure xmlforms understands edits and deletes shortly, but this change stands on its own, and can be reviewed.
It's reasonably massive, and hard to comprehend from a PR, so when you've got time we can hangout and talk about it :-)
@SCdF We did discuss quite a lot of the changes you had planned to make, which should help some of the comprehension. I will give it a once-over right now and we can schedule some time to talk about it soon.
You were not kidding about massive. "once over" is going to be a bit more of a commitment than I had originally anticipated with https://github.com/medic/medic-analytics/pull/30/commits/67f85cbffbce8e604137b6c8db985803940e1bce
Okay I got through all of it but the tests. I'll come back to those.
Created a PR for the xmlforms side of things: https://github.com/medic/medic-analytics/pull/31
This PR is compared to the existing PR, because it's an addition to that: https://github.com/medic/medic-analytics/pull/30
@btbonval the last step of this is infrastructural and administrative (getting a user for each project setup in postgres, writing a little startup script, getting it deployed etc) which I will create another ticket for. I'd like to close this ticket (and related tickets) before that so we can merge this code into master.
Let me know if you have any questions / confusion etc.
Hey @garethbowen can you check out medic/medic-analytics#31 ?
If you want to you can also take a look at medic/medic-analytics#30, specifically the testing or lack thereof. IMO there are enough tests (integration tests to cover the good cases, unit tests to cover the bad), but you may disagree.
(Oh, and don't merge those PRs, they are for diff reading only as 31 is actually comparing to 30. When you're happy with both I'll merge them manually.)
Merged.
Is there a way to AT this @btbonval @SCdF ?
This was done long ago. We can move it to ready.