Thought this might be generally useful: in addition to the reported_date property, could we also have a date for when the record was actually created in the database?
There is an ongoing conversation about this with Living Goods. They are very keen to have a created date field. If this is easy to implement, I would consider prioritizing this highly. Would also be good to know if this change would be a breaking change that would require a data migration for CHPs.
How does this differ from how we're using reported_date?
My understanding of reported date is that it is the date that the report was created on a CHP's phone. The partner request is to have two dates: the date the report was created on the phone and the date that report was replicated to the server. There are a number of reasons for this.
Ah, so it's more of a replication_date. One issue with storing this date is when the doc is replicated it'll be updated on the server and then replicated back to the phone which will use more bandwidth and slow the phone down some more. It'd be good to investigate what those "number of reasons are" before implementing this - I suspect there's a better way to achieve the desired solution.
Yes, reported_date is based on data in the sms network or on the gateway and has nothing to do with when the record was created/landed in the database. Working on a strongminds support request I just found it weird that I could not answer the question of when the record was actually created in the database.
Is it possible to only store the date that the report hit the DB on the server and not replicate it? Or perhaps have a date that the record was created in Postgres? This date would only be slightly behind the time that the report was replicated to the server as Postgres updates itself every 15 minutes.
@garethbowen the reason I have heard is that the partner doesn't want a device that syncs with old data to alter their analysis after the fact. Essentially, they want to be able to exclude data that comes in after the date they run their analysis. Having a replication_date could solve this, albeit with a performance hit if stored in the same doc.
Is it possible to only store the date that the report hit the DB on the server and not replicate it?
Yes but it would need to be stored outside of the doc itself so it could get messy.
Or perhaps have a date that the record was created in Postgres?
Yes this would be fine if all we're interested in is analytics.
... partner doesn't want a device that syncs with old data to alter their analysis after the fact.
Ok, I think there are better ways to achieve that. For example, we could make sure the postgres adapter only processes new records, ie: it would check the postgres db and if the doc id was found it would skip over the record.
This seems a little heavy handed as there could well be changes made to the document that you _do_ want updated in analytics (eg: sentinel transitions). There have been other discussions about limiting user's ability to edit records, for example, only allowing them to edit for a short time. This would mean admins and scripts could still update the record and have analytics up to date, but users couldn't fudge the numbers.
@abbyad , another related reason is for incentives payout to CHPs. The old data would reflect in the previous month(s) and the CHPs would miss out on the incentives payable in that reporting period.
@gareth, I think the main focus is analytics. How easy would it be to have this?
Based on our most recent call, the replication_date would be saved in postgres as the time the record was created in postgres. Doing this is blocked pending https://github.com/medic/medic-webapp/issues/2124, because we cannot rely on the records being updated properly. Instead the postgres db contents are deleted and repopulated, which would obviously reset the "creation date" for each record. Resolving that will allow us to store (and maintain) the date that the record hit the db.
PostgreSQL as a datastore for the adapter's output is just a read-only reflection of Couch data. When any problems occur, and due to ongoing changes and features missing from the present adapter, the postgresql database gets wiped and we start the data import process from couch from scratch.
We did this no less than 20 times yesterday for a particular project while removing old data that broke the adapter and messed up the state of the database.
We really can't store timestamps like that in Postgres. We open ourselves up to questions like "Where is all our old data?!" just because the import was re-run a day ago: the data is all still there, but every timestamp would be from yesterday's fresh reimport.
I think the aim would be to make the adapter robust enough that data never needs to be deleted. At the very least we should run the adapter against a "test" database before running it against "production" so production never fails.
Ok, but given that the adapter is meant to be a read-only reflection of Couch data, doesn't it make more sense to have this date stored in the webapp? It would also help with the issue that @mandric cited for another project which was the original reason for this ticket.
We could totally store it in the webapp, but as discussed above that's either expensive or difficult.
I think the two features are different and can be solved in different ways. For example, it's easy (I think) to store when the SMS was received in the doc as we're writing it. In this case there is no additional replication needed because the record is just being created. In the analytics case we're modifying the doc and so replication is needed, even though it's a field the CHP will never use.
Storing it in the doc has the negative consequence of requiring the doc to be replicated back to the device, as mentioned by @garethbowen:
so replication is needed, even though it's a field the CHP will never use.
Although this has performance/bandwidth issues, it does have the added benefit of being able to track which reports have been replicated (eg https://github.com/medic/medic-webapp/issues/1849).
Does that make it worth the performance hit?
We have recently discovered other ways to track which documents have been replicated without taking this performance hit. Yes, the replication date could be used instead, but it's not necessary to achieve #1849.
Ah, good to know!
OK I'm taking the blocked label off this: it was previously blocked on medic-analytics understanding edits and deletes, which it now does (not that it's deployed, but the majority of the work has been done).
To reinvigorate this discussion, and to sum up the current thoughts:
a _local document.

I think another variation might be:
Another variant is querying the audit data for the creation date.
True:
Can we get more information about "sync status", where does this data live, can we query it? Does it provide the creation time and last updated time of any record?
Do we plan on regularly wiping the audit db, or is it a permanent record? If the audit db were also in postgres, could we set it up so that it is easily queried (eg easy with SQL) for the replication date?
Can we get more information about "sync status",
When one uses double quotes one is usually quoting something. No one has used the phrase "sync status" anywhere on this ticket, so you may have to elaborate on what you're talking about. Are you asking whether you can query how well synced the current pg db is to couchdb? We store the seq in the DB because it's needed for the implementation, but apart from that we don't store anything.
Do we plan on regularly wiping the audit db, or is it a permanent record? If the audit db were also in postgres, could we set it up so that it is easily queried (eg easy with SQL) for the replication date?
I have always presumed it's permanent, but I could be wrong? In any case, if we pulled in audit logs we could write some code to find the audit log for a given document and extract its replication date. It would complicate the code a little (you're now syncing two databases for 2.6+) and you'd be pulling over a huge amount of extra data to gain a very small amount of usable data, but it's definitely feasible.
@SCdF Sorry, linked in the ticket: https://github.com/medic/medic-webapp/issues/1849
I guess it also depends how soon it will be before we get our next request to find something in the audit log, because that may make the case to make it more easily queried.
If we are storing the replica db locally the data transfer shouldn't be so much of an issue, but right now the replica is on a different machine.
I expect the audit log to be permanent.
We could easily add a view to the audit database that just returns the replicated date for any given doc id.
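Such a view could be a short map function over the audit db. A sketch, assuming an audit doc shape with a record_id and a history array of timestamped entries (these field names are assumptions for illustration, not the confirmed audit schema):

```javascript
// Hypothetical CouchDB map function for the audit db: emit the timestamp of
// the first history entry, keyed by the audited doc's id. The field names
// (record_id, history, timestamp) are assumptions about the audit schema.
function replicatedDate(doc) {
  if (doc.type === 'audit_record' && doc.history && doc.history.length) {
    emit(doc.record_id, doc.history[0].timestamp);
  }
}
```

Querying the view with ?key="<doc id>" would then return the replicated date for that doc.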
We need to decide how this created_date will be used. For example, if it's something we want to show in the webapp then the decision of where to store it is obvious.
@sglangevin @abbyad This needs more design work before we can start on it, in order to work out where this data should be stored. @SCdF has laid out 5 options above, each with pros and cons. In particular, it is essential to know where the "replication date" is going to be used/displayed.
Based on the requests to see the sync time in postgres as well as in the app by a mobile user I think we are down to two options, both involving saving the info in couchdb: saving it in the report doc, or an associated doc.
Given potential conflicts and needing to download a new/updated doc as soon as a report is replicated, it seems to make sense to save the info in an associated doc, something like ${UUID}_info. That way we avoid conflicts and the doc immediately downloaded is much smaller than the full report doc. Is that a fair assessment @garethbowen?
@sglangevin, does that satisfy the needs from your end?
@abbyad Yes the associated doc idea sounds most efficient if we need the info on the phone. Would this be the initial sync time or the latest sync time (eg: if an edit is made)?
We'd definitely want to know the initial replication time, and can also see it being useful to know the replication time of the most recent edit, along with the associated rev number. It would seem unnecessary to maintain a record of every edit in between since that could create a hugely bloated doc.
I think having an associated doc would accomplish what we need. Keeping track of the initial replication time and the most recent edit should be sufficient. As long as these associated docs are replicated to medic-analytics, that satisfies the initial need. I think it's likely there will also be other needs in the future to display this info in the webapp, so the associated doc could also help with that later on.
@michaelkohn does this sound like it will work for you?
As long as the associated doc is being sync'd to PostgreSQL and I have an easy way to link the associated doc to the doc it refers to, we should be OK from an analytics perspective.
Right now I think the only database that is being replicated to PostgreSQL is the medic DB, so if this associated doc is going to live somewhere else (ie not in the medic database), we'll need to account for that in couch2pg. I only mention this because there is talk above of things potentially living in some audit log, I'm not sure if audit log means records in the medic-audit DB, but if so, I don't believe those are being replicated to PostgreSQL currently.
In addition, we'd definitely want to update a few of the base views / matviews / indexes that couch2pg creates since this new field would be included for all projects.
Also, curious how we would deal with historical records that don't have this associated record (ie for projects that started before this change is done). From a PostgreSQL perspective, we could write our views in a way that looks for the associated record and, if it doesn't find one, just uses the reported_date. Though it would be useful to have some sort of script that goes in and creates an associated record (using the reported_date) for those that don't have one, so our queries are less complicated and more performant.
@michaelkohn the proposal is to have an associated document stored in the medic db that has an _id that references the original doc. Per @abbyad's proposal, the _id of this associated doc would be something like ${UUID}_info. Would that be enough for easy association on your end?
@sglangevin @abbyad Technically speaking we could handle that with SQL, but it would be preferable to have a separate field for the UUID of the doc it refers to so we don't have to do string manipulation before doing the JOIN.
So the doc would have the following fields (for example)...
TYPE
PARENT_UUID
PARENT_INITIAL_REPLICATION_DATETIME
...
And our JOIN could be as simple as pregnancy.uuid = associated_doc.parent_uuid instead of using a substring or wildcard in our query (both of which are much slower).
Since this is essentially doubling the number of records in our database, we are going to want to intelligently index, I'm thinking having the TYPE field may help facilitate this by allowing us to create a partial index using this field (but I'm no DBA and open to suggestions!).
@michaelkohn that seems feasible to me (including the parent_uuid in the _info doc). And we should consider some way to intelligently index these docs now so that we don't run into issues later.
OK here is what I'm going to do (probably):
For each document, a <<DOCID>>_info doc:

{
  _id: '<<DOCID>>_info',
  doc_id: '<<DOCID>>',
  replication_date: now(),
  last_updated_date: now(),
}
I don't think we need the rev, since it's just going to be whatever the latest version is, but we can store it as well as last_updated_rev just for completeness.
A few questions/comments....

- Every new doc in the medic DB would have one of these _info docs, right? Not just certain document types.
- For existing partners, I assume we aren't going to create _info docs for docs that existed before this feature was added (it would be my preference to do so and just use the reported_date, but we can handle it if not).
- Can we add a type field to these _info docs? (could just set it to _info.)
- Any chance couch2pg could add the info doc as an object in the object to which it refers?
- For one particularly high volume partner, here (see below) are the number of docs by type. My understanding is that we would essentially be doubling the number of docs in the couchdb table. If we can't add the info object to its parent object, perhaps this is a good time to have couch2pg start creating one table per doc->>'type' (or separating _info and data_record into their own tables) instead of jamming everything into the couchdb table.
doc->>'type' COUNT(*)
----------------------------
meta 1
NULL 6
form 17
usage_stats 18
district_hospital 19
health_center 2189
user-settings 2345
traffic_stats 2959
clinic 229080
feedback 547228
person 575308
data_record 736020
@ngamita @derickl... would be interested in your feedback/suggestions as well.
@SCdF
@garethbowen
Sorry @michaelkohn I missed your comment earlier:
Every new doc in the medic DB would have one of these _info docs, right? Not just certain document types.
Sure, we can do that.
For existing partners, I assume we aren't going to create _info docs for docs that existed before this feature was added (it would be my preference to do so and just use the reported_date, but we can handle it if not)
We could write a migration script / manual script to do this, but I would be concerned that it's effectively intentionally wrong. And that might cause us issues in the future when we start making presumptions about data based on these values.
Can we add a type field to these _info docs? (could just set it to _info)
Yes definitely.
Any chance couch2pg could add the info doc as an object in the object to which it refers?
I don't really know what you mean there? What is an "object" in this context?
perhaps this is a good time to have couch2pg start creating one table per doc->>'type'
Something to think about, definitely. Since mat views are also not scaling for larger clients, I wonder if this would be fixed in that fix: i.e., removing our use of mat views will change performance such that this is no longer a concern either.
After seeing comments from @garethbowen and @michaelkohn , here is what I'm actually going to do:
Check for the existence of a <<UUID>>_info doc. If it does exist we update the last_updated_date with now() (in ms, like other "system" dates in our app); otherwise we create one:

{
  "_id": "<<UUID>>_info",
  "type": "info",
  "doc_id": "<<UUID>>",
  "initial_replication_date": "now()",
  "latest_replication_date": "now()"
}
NB: from the previous example, I'm standardising on "replication_date" to make it clear that they are the first and last versions of the same piece of data.
Note as well, this means that unless we did some kind of initial migration before sentinel ran, a lot of the initial_replication_date values won't be accurate, as they will just be the first replication after the change. This does lend credence to the idea that we create these _info docs in a migration, which maybe intentionally sets them to "unknown" or something. Thoughts, @garethbowen?
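A rough sketch of that check-then-write flow, assuming a promise-based db client with get/put and a 404 error shape (the client and names are illustrative, not the actual sentinel code):

```javascript
// Sketch of the proposed logic: create the info doc on first sight of a
// document, otherwise bump latest_replication_date. The `db` client
// (promise-based get/put, errors carrying a status) is an assumption.
const updateInfoDoc = (db, docId) => {
  const now = Date.now(); // ms, like other "system" dates in the app
  return db.get(`${docId}_info`)
    .then(info => {
      info.latest_replication_date = now;
      return db.put(info);
    })
    .catch(err => {
      if (err.status !== 404) {
        throw err; // a real failure, not just "no info doc yet"
      }
      return db.put({
        _id: `${docId}_info`,
        type: 'info',
        doc_id: docId,
        initial_replication_date: now,
        latest_replication_date: now
      });
    });
};
```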
Sounds good.
I agree having "unknown" would be better but I'm not comfortable with a migration unless absolutely necessary. In CouchDB 1 we can test the rev for 1-xxx to determine if the doc is new or updated. If this holds for CDB2 then we can write the doc with initial_replication_date: undefined, latest_replication_date: "now()" (or whatever) when it's updated. This is similar to how auditing works.
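The rev check is cheap because a CouchDB _rev has the form "<generation>-<hash>". A sketch of the heuristic (with the caveat, raised below, that a doc edited on the phone before its first sync will already have a later rev):

```javascript
// Heuristic sketch: a generation of 1 in the _rev suggests the doc has
// never been updated. Not reliable for docs edited offline before syncing.
const isFirstRev = doc => /^1-/.test(doc._rev);
```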
From the sentinel docs, a transition should "not have side effects outside of the document passed in." So there goes using transitions, at least in their current form.
In CouchDB 1 we can test the rev for 1-xxx
Not necessarily right: if a document is created and then updated before it's synced, it will have a later rev.
Ugh, yeah, good point. I'm sure there's some way to detect if it's new, like some field that sentinel writes...
Any chance couch2pg could add the info doc as an object in the object to which it refers?
I don't really know what you mean there? What is an "object" in this context?
Nested objects. Sorry, not sure what the nomenclature is, but below is an example of what I was trying to get at: take the info doc and add it as another nested object within the data_record object (and we would do this for all types).
{
"_id": "2F057509-B503-C59E-9929-11F9BB339DD1",
"_rev": "1-1332dfa20db9411776a5661f37a9429d",
"form": "assessment",
"from": "+14159990000",
"type": "data_record",
"fields": {
"info": {
"initial_replication_date": 1478661752000,
"latest_replication_date": 1478661752000
},
"meta": {
"instanceID": "uuid:a2a0c289-cf71-4a8c-bf6f-c6b80b57f2d7"
},
"inputs": {
"meta": {
"location": {
"lat": "",
"long": "",
"error": "true",
"message": "Provider 'gps' did not provide a location."
},
"deprecatedID": ""
},
"source": "contact",
"contact": {
"_id": "4A971497-165F-C769-91A1-46C5F651C747",
"sex": "male",
"name": "Max Roach",
"date_of_birth": "2015-11-28"
},
"source_id": ""
}
}
}
@michaelkohn oic, right yeah the "xmlforms" chunk of our codebase could potentially look at doing that, to save you a join later on. Would need to think about it a little more.
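If couch2pg (or the xmlforms code) did nest the info doc into its parent before insert, the transform might look roughly like this. The function name, the in-memory lookup, and reusing the _info id suffix from the proposal above are all assumptions for illustration, not actual couch2pg code:

```javascript
// Hypothetical: merge a doc's info doc into the doc itself before it is
// written to Postgres, so analytics queries avoid a join. Falls back to
// the bare doc for records that predate the feature.
const attachInfo = (doc, infoById) => {
  const info = infoById[`${doc._id}_info`];
  if (!info) {
    return doc; // no info doc yet, e.g. pre-feature records
  }
  return Object.assign({}, doc, {
    info: {
      initial_replication_date: info.initial_replication_date,
      latest_replication_date: info.latest_replication_date
    }
  });
};
```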
As signalled by Sharon removing the priority label above, I'm parking this for now. This commit details what has and hasn't been done.
I did some force pushing to rebase, so for clarity here is the PR: https://github.com/medic/medic-sentinel/pull/127/files
@garethbowen can you review the PR above please?
Notes:
- rebase -i to clean up the different commits, merging some of them and not others, and then merge without squashing. Fun.
- (btw I'm not necessarily sold on the "system transitions" concept. I'll mull it over on the weekend)
I'm sold on system transitions! Love it! A bunch of comments on the commits (you may need to click "show outdated" to see them all). Back to you!
As this change is pretty transparent to users (I'm not even going to document that this transition exists and is disable-able) I think I'm happy to roll with it for now, as we can back out of it easily later.
However, thinking about it over the weekend, I wonder if we want it to be even more integrated into how transitions work, as opposed to being just a transition that runs by default.
Specifically (while trying not to YAGNI too much), if we are thinking ahead that we would want to move the transitions metadata into this document, as well as holding the schema version, then it would make sense if the creation and maintenance of the doc was part of the flow of running transitions.
I wonder if we really want something like this:
Merged.
To summarise this feature for AT:
The sister doc lives in the medic DB, and is named foo-info, where foo is the id of the document in question.

@SCdF A couple questions....
@ngaruko please make sure you liaise with @michaelkohn for AT on this issue.
What time zone is the sister doc in?
It's just new Date(), dependent on the server, so UTC.
If you delete a doc, does its sister doc get deleted?
No. This was less an intentional decision and more a result of how transitions work (they do not get run over deleted documents).
Did we end up doing #2180 (comment) as well? And if not, how much additional work is that?
No, medic-couch2pg didn't change. We can talk about exactly what you want, and what it achieves over doing a join.
Is there an environment I can have a look at that has some testing data on it?
alpha.dev is an environment that runs on anything that is merged to master. As long as sentinel is running on that instance, and it's migrated to the latest build, then as you start editing / creating documents their info documents will be created.
Just for future reference, here is an example of how this actually got implemented.
{
"_id": "f8cc78d0-31a7-44e8-8073-176adcc0db7b-info",
"_rev": "2-6e08756f62fa0595d87a3f50777758dc",
"type": "info",
"doc_id": "f8cc78d0-31a7-44e8-8073-176adcc0db7b",
"latest_replication_date": "2018-08-13T22:02:46.699Z",
"initial_replication_date": "2018-08-13T22:02:13.625Z"
}