Thought this might be generally useful: in addition to the reported_date property, could we also have a date for when the record was actually created in the database?
There is an ongoing conversation about this with Living Goods. They are very keen to have a created date field. If this is easy to implement, I would consider prioritizing this highly. Would also be good to know if this change would be a breaking change that would require a data migration for CHPs.
How does this differ from how we're using reported_date?
My understanding of reported date is that it is the date that the report was created on a CHP's phone. The partner request is to have two dates: the date the report was created on the phone and the date that report was replicated to the server. There are a number of reasons for this.
Ah, so it's more of a replication_date. One issue with storing this date is when the doc is replicated it'll be updated on the server and then replicated back to the phone which will use more bandwidth and slow the phone down some more. It'd be good to investigate what those "number of reasons are" before implementing this - I suspect there's a better way to achieve the desired solution.
Yes, reported_date is based on data in the sms network or on the gateway and has nothing to do with when the record was created/landed in the database. Working on a strongminds support request I just found it weird that I could not answer the question of when the record was actually created in the database.
Is it possible to only store the date that the report hit the DB on the server and not replicate it? Or perhaps have a date that the record was created in Postgres? This date would only be slightly behind the time that the report was replicated to the server as Postgres updates itself every 15 minutes.
@garethbowen the reason I have heard is that the partner doesn't want a device that syncs with old data to alter their analysis after the fact. Essentially, they want to be able to exclude data that comes in after the date they run their analysis. Having a replication_date could solve this, albeit with a performance hit if stored in the same doc.
Is it possible to only store the date that the report hit the DB on the server and not replicate it?
Yes but it would need to be stored outside of the doc itself so it could get messy.
Or perhaps have a date that the record was created in Postgres?
Yes this would be fine if all we're interested in is analytics.
... partner doesn't want a device that syncs with old data to alter their analysis after the fact.
Ok, I think there are better ways to achieve that. For example, we could make sure the postgres adapter only processes new records, ie: it would check the postgres db and if the doc id was found it would skip over the record.
This seems a little heavy handed as there could well be changes made to the document that you _do_ want updated in analytics (eg: sentinel transitions). There have been other discussions about limiting user's ability to edit records, for example, only allowing them to edit for a short time. This would mean admins and scripts could still update the record and have analytics up to date, but users couldn't fudge the numbers.
@abbyad , another related reason is for incentives payout to CHPs. The old data would reflect in the previous month(s) and the CHPs would miss out on the incentives payable in that reporting period.
@gareth, I think the main focus is analytics. How easy would it be to have this?
Based on our most recent call, the replication_date would be saved in postgres as the time the record was created in postgres. Doing this is blocked pending https://github.com/medic/medic-webapp/issues/2124, because we cannot rely on the records being updated properly. Instead the postgres db contents are deleted and repopulated, which would obviously reset the "creation date" for each record. Resolving that will allow us to store (and maintain) the date that the record hit the db.
PostgreSQL as a datastore for the adapter's output is just a read-only reflection of Couch data. When any problems occur, and due to ongoing changes and features missing from the present adapter, the postgresql database gets wiped and we start the data import process from couch from scratch.
We did this no less than 20 times yesterday for a particular project while removing old data that broke the adapter and messed up the state of the database.
We really can't store timestamps like that in Postgres. We open ourselves up to questions like "Where is all our old data?!" just because the import was re-run a day ago: the data is all still there, but every timestamp would be from yesterday's fresh reimport.
I think the aim would be to make the adapter robust enough that data never needs to be deleted. At the very least we should run the adapter against a "test" database before running it against "production" so production never fails.
Ok, but given that the adapter is meant to be a read-only reflection of Couch data, doesn't it make more sense to have this date stored in the webapp? It would also help with the issue that @mandric cited for another project which was the original reason for this ticket.
We could totally store it in the webapp, but as discussed above that's either expensive or difficult.
I think the two features are different and can be solved in different ways. For example, it's easy (I think) to store when the SMS was received in the doc as we're writing it. In this case there is no additional replication needed because the record is just being created. In the analytics case we're modifying the doc and so replication is needed, even though it's a field the CHP will never use.
Storing it in the doc has the negative consequence of requiring the doc to be replicated back to the device, as mentioned by @garethbowen:
so replication is needed, even though it's a field the CHP will never use.
Although this has performance/bandwidth issues, it does have the added benefit of being able to track which reports have been replicated (eg https://github.com/medic/medic-webapp/issues/1849).
Does that make it worth the performance hit?
We have recently discovered other ways to track which documents have been replicated without taking this performance hit. Yes, the replication date could be used instead, but it's not necessary to achieve #1849.
Ah, good to know!
OK I'm taking the blocked label off this: it was previously blocked on medic-analytics understanding edits and deletes, which it now does (not that it's deployed, but the majority of the work has been done).
To reinvigorate this discussion, and to sum up the current thoughts:
a _local document.

I think another variation might be:
Another variant is querying the audit data for the creation date.
True:
Can we get more information about "sync status", where does this data live, can we query it? Does it provide the creation time and last updated time of any record?
Do we plan on regularly wiping the audit db, or is it a permanent record? If the audit db were also in postgres, could we set it up so that it is easily queried (eg easy with SQL) for the replication date?
Can we get more information about "sync status",
When one uses double quotes one is usually quoting something. No one has used the phrase "sync status" anywhere on this ticket, so you may have to elaborate on what you're talking about. Are you asking whether you can query how well synced the current pg db is to couchdb? We store the seq in the DB because it's needed for the implementation, but apart from that we don't store anything.
Do we plan on regularly wiping the audit db, or is it a permanent record? If the audit db were also in postgres, could we set it up so that it is easily queried (eg easy with SQL) for the replication date?
I have always presumed it's permanent, but I could be wrong? In any case, if we pulled in audit logs we could write some code to find the audit log for a given document and extract its replication date. It would complicate the code a little (you're now syncing two databases for 2.6+) and you'd be pulling over a huge amount of extra data to gain a very small amount of usable data, but it's definitely feasible.
@SCdF Sorry, linked in the ticket: https://github.com/medic/medic-webapp/issues/1849
I guess it also depends how soon it will be before we get our next request to find something in the audit log, because that may make the case to make it more easily queried.
If we are storing the replica db locally the data transfer shouldn't be so much of an issue, but right now the replica is on a different machine.
I expect the audit log to be permanent.
We could easily add a view to the audit database that just returns the replicated date for any given doc id.
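Such a view could be a short map function over the audit db. A sketch, assuming an audit doc shape with a record_id and a history array of timestamped entries (these field names are assumptions for illustration, not the confirmed audit schema):

```javascript
// Hypothetical CouchDB map function for the audit db: emit the timestamp of
// the first history entry, keyed by the audited doc's id. The field names
// (record_id, history, timestamp) are assumptions about the audit schema.
function replicatedDate(doc) {
  if (doc.type === 'audit_record' && doc.history && doc.history.length) {
    emit(doc.record_id, doc.history[0].timestamp);
  }
}
```

Querying the view with ?key="<doc id>" would then return the replicated date for that doc.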
We need to decide how this created_date will be used. For example, if it's something we want to show in the webapp then the decision of where to store it is obvious.
@sglangevin @abbyad This needs more design work before we can start on it, in order to work out where this data should be stored. @SCdF has laid out 5 options above, each with pros and cons. In particular, it is essential to know where the "replication date" is going to be used/displayed.
Based on the requests to see the sync time in postgres as well as in the app by a mobile user I think we are down to two options, both involving saving the info in couchdb: saving it in the report doc, or an associated doc.
Given potential conflicts and needing to download a new/updated doc as soon as a report is replicated, it seems to make sense to save the info in an associated doc, something like ${UUID}_info. That way we avoid conflicts and the doc immediately downloaded is much smaller than the full report doc. Is that a fair assessment @garethbowen?
@sglangevin, does that satisfy the needs from your end?
@abbyad Yes the associated doc idea sounds most efficient if we need the info on the phone. Would this be the initial sync time or the latest sync time (eg: if an edit is made)?
We'd definitely want to know the initial replication time, and can also see it being useful to know the replication time of the most recent edit, along with the associated rev number. It would seem unnecessary to maintain a record of every edit in between since that could create a hugely bloated doc.
I think having an associated doc would accomplish what we need. Keeping track of the initial replication time and the most recent edit should be sufficient. As long as these associated docs are replicated to medic-analytics, that satisfies the initial need. I think it's likely there will also be other needs in the future to display this info in the webapp, so the associated doc could also help with that later on.
@michaelkohn does this sound like it will work for you?
As long as the associated doc is being sync'd to PostgreSQL and I have an easy way to link the associated doc to the doc it refers to, we should be OK from an analytics perspective.
Right now I think the only database that is being replicated to PostgreSQL is the medic DB, so if this associated doc is going to live somewhere else (ie not in the medic database), we'll need to account for that in couch2pg. I only mention this because there is talk above of things potentially living in some audit log, I'm not sure if audit log means records in the medic-audit DB, but if so, I don't believe those are being replicated to PostgreSQL currently.
In addition, we'd definitely want to update a few of the base views / matviews / indexes that couch2pg creates since this new field would be included for all projects.
Also, curious how we would deal with historical records that don't have this associated record (ie for projects that started before this change is done). From a PostgreSQL perspective, we could write our views in a way that looks for the associated record and, if it doesn't find one, just uses the reported_date. Though it would be useful to have some sort of script that goes in and creates an associated record (using the reported_date) for those that don't have one, so our queries are less complicated and more performant.
@michaelkohn the proposal is to have an associated document stored in the medic db that has an _id that references the original doc. Per @abbyad's proposal, the _id of this associated doc would be something like ${UUID}_info. Would that be enough for easy association on your end?
@sglangevin @abbyad Technically speaking we could handle that with SQL, but it would be preferable to have a separate field for the UUID of the doc it refers to so we don't have to do string manipulation before doing the JOIN.
So the doc would have the following fields (for example)...
TYPE
PARENT_UUID
PARENT_INITIAL_REPLICATION_DATETIME
...
And our JOIN could be as simple as pregnancy.uuid = associated_doc.parent_uuid instead of using a substring or wildcard in our query (both of which are much slower).
Since this is essentially doubling the number of records in our database, we are going to want to intelligently index, I'm thinking having the TYPE field may help facilitate this by allowing us to create a partial index using this field (but I'm no DBA and open to suggestions!).
@michaelkohn that seems feasible to me (including the parent_uuid in the _info doc). And we should consider some way to intelligently index these docs now so that we don't run into issues later.
OK here is what I'm going to do (probably):
For each document, a <<DOCID>>_info doc:

{
  _id: '<<DOCID>>_info',
  doc_id: '<<DOCID>>',
  replication_date: now(),
  last_updated_date: now(),
}
I don't think we need the rev, since it's just going to be whatever the latest version is, but we can store it as well as last_updated_rev just for completeness.
A few questions/comments....

- Every new doc in the medic DB would have one of these _info docs, right? Not just certain document types.
- For existing partners, I assume we aren't going to create _info docs for docs that existed before this feature was added (it would be my preference to do so and just use the reported_date, but we can handle it if not).
- Can we add a type field to these _info docs? (could just set it to _info.)
- Any chance couch2pg could add the info doc as an object in the object to which it refers?
- For one particularly high volume partner, here (see below) are the number of docs by type. My understanding is that we would essentially be doubling the number of docs in the couchdb table. If we can't add the info object to its parent object, perhaps this is a good time to have couch2pg start creating one table per doc->>'type' (or separating _info and data_record into their own tables) instead of jamming everything into the couchdb table.
doc->>'type' COUNT(*)
----------------------------
meta 1
NULL 6
form 17
usage_stats 18
district_hospital 19
health_center 2189
user-settings 2345
traffic_stats 2959
clinic 229080
feedback 547228
person 575308
data_record 736020
@ngamita @derickl... would be interested in your feedback/suggestions as well.
@SCdF
@garethbowen
Sorry @michaelkohn I missed your comment earlier:
Every new doc in the medic DB would have one of these _info docs, right? Not just certain document types.
Sure, we can do that.
For existing partners, I assume we aren't going to create _info docs for docs that existed before this feature was added (it would be my preference to do so and just use the reported_date, but we can handle it if not)
We could write a migration script / manual script to do this, but I would be concerned that it's effectively intentionally wrong. And that might cause us issues in the future when we start making presumptions about data based on these values.
Can we add a type field to these _info docs? (could just set it to _info)
Yes definitely.
Any chance couch2pg could add the info doc as an object in the object to which it refers?
I don't really know what you mean there? What is an "object" in this context?
perhaps this is a good time to have couch2pg start creating one table per doc->>'type'
Something to think about, definitely. Since mat views are also not scaling for larger clients, I wonder if this would be fixed in that fix: i.e., removing our use of mat views will change performance such that this is no longer a concern either.
After seeing comments from @garethbowen and @michaelkohn , here is what I'm actually going to do:
Check for the existence of a <<UUID>>_info doc. If it does exist we update the last_updated_date with now() (in ms, like other "system" dates in our app); otherwise we create one:

{
  "_id": "<<UUID>>_info",
  "type": "info",
  "doc_id": "<<UUID>>",
  "initial_replication_date": "now()",
  "latest_replication_date": "now()"
}
NB: from the previous example, I'm standardising on "replication_date" to make it clear that they are the first and last versions of the same piece of data.
Note as well, this means that unless we did some kind of initial migration before sentinel ran, a lot of the initial_replication_date values won't be accurate, as they will just be the first replication after the change. This does lend credence to the idea that we create these _info docs in a migration, which maybe intentionally sets them to "unknown" or something. Thoughts, @garethbowen?
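A rough sketch of that check-then-write flow, assuming a promise-based db client with get/put and a 404 error shape (the client and names are illustrative, not the actual sentinel code):

```javascript
// Sketch of the proposed logic: create the info doc on first sight of a
// document, otherwise bump latest_replication_date. The `db` client
// (promise-based get/put, errors carrying a status) is an assumption.
const updateInfoDoc = (db, docId) => {
  const now = Date.now(); // ms, like other "system" dates in the app
  return db.get(`${docId}_info`)
    .then(info => {
      info.latest_replication_date = now;
      return db.put(info);
    })
    .catch(err => {
      if (err.status !== 404) {
        throw err; // a real failure, not just "no info doc yet"
      }
      return db.put({
        _id: `${docId}_info`,
        type: 'info',
        doc_id: docId,
        initial_replication_date: now,
        latest_replication_date: now
      });
    });
};
```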
Sounds good.
I agree having "unknown" would be better but I'm not comfortable with a migration unless absolutely necessary. In CouchDB 1 we can test the rev for 1-xxx to determine if the doc is new or updated. If this holds for CDB2 then we can write the doc with initial_replication_date: undefined, latest_replication_date: "now()" (or whatever) when it's updated. This is similar to how auditing works.
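The rev check is cheap because a CouchDB _rev has the form "<generation>-<hash>". A sketch of the heuristic (with the caveat, raised below, that a doc edited on the phone before its first sync will already have a later rev):

```javascript
// Heuristic sketch: a generation of 1 in the _rev suggests the doc has
// never been updated. Not reliable for docs edited offline before syncing.
const isFirstRev = doc => /^1-/.test(doc._rev);
```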
From the sentinel docs, a transition should "not have side effects outside of the document passed in." So there goes using transitions, at least in their current form.
In CouchDB 1 we can test the rev for 1-xxx
Not necessarily right: if a document is created and then updated before it's synced, it will have a later rev.
Ugh, yeah, good point. I'm sure there's some way to detect if it's new, like some field that sentinel writes...
Any chance couch2pg could add the info doc as an object in the object to which it refers?
I don't really know what you mean there? What is an "object" in this context?
Nested objects. Sorry, not sure what the nomenclature is, but below is an example of what I was trying to get at: take the info doc and add it as another nested object within the data_record object (and we would do this for all types).
{
"_id": "2F057509-B503-C59E-9929-11F9BB339DD1",
"_rev": "1-1332dfa20db9411776a5661f37a9429d",
"form": "assessment",
"from": "+14159990000",
"type": "data_record",
"fields": {
"info": {
"initial_replication_date": 1478661752000,
"latest_replication_date": 1478661752000
},
"meta": {
"instanceID": "uuid:a2a0c289-cf71-4a8c-bf6f-c6b80b57f2d7"
},
"inputs": {
"meta": {
"location": {
"lat": "",
"long": "",
"error": "true",
"message": "Provider 'gps' did not provide a location."
},
"deprecatedID": ""
},
"source": "contact",
"contact": {
"_id": "4A971497-165F-C769-91A1-46C5F651C747",
"sex": "male",
"name": "Max Roach",
"date_of_birth": "2015-11-28"
},
"source_id": ""
}
}
}
@michaelkohn oic, right yeah the "xmlforms" chunk of our codebase could potentially look at doing that, to save you a join later on. Would need to think about it a little more.
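If couch2pg (or the xmlforms code) did nest the info doc into its parent before insert, the transform might look roughly like this. The function name, the in-memory lookup, and reusing the _info id suffix from the proposal above are all assumptions for illustration, not actual couch2pg code:

```javascript
// Hypothetical: merge a doc's info doc into the doc itself before it is
// written to Postgres, so analytics queries avoid a join. Falls back to
// the bare doc for records that predate the feature.
const attachInfo = (doc, infoById) => {
  const info = infoById[`${doc._id}_info`];
  if (!info) {
    return doc; // no info doc yet, e.g. pre-feature records
  }
  return Object.assign({}, doc, {
    info: {
      initial_replication_date: info.initial_replication_date,
      latest_replication_date: info.latest_replication_date
    }
  });
};
```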
As signalled by Sharon removing the priority label above, I'm parking this for now. This commit details what has and hasn't been done.
I did some force pushing to rebase, so for clarity here is the PR: https://github.com/medic/medic-sentinel/pull/127/files
@garethbowen can you review the PR above please?
Notes:
- rebase -i to clean up the different commits, merging some of them and not others, and then merge without squashing. Fun.
- (btw I'm not necessarily sold on the "system transitions" concept. I'll mull it over on the weekend)
I'm sold on system transitions! Love it! A bunch of comments on the commits (you may need to click "show outdated" to see them all). Back to you!
As this change is pretty transparent to users (I'm not even going to document that this transition exists and is disable-able) I think I'm happy to roll with it for now, as we can back out of it easily later.
However, thinking about it over the weekend, I wonder if we want it to be even more integrated into how transitions work, as opposed to being just a transition that runs by default.
Specifically (while trying not to YAGNI too much), if we are thinking ahead that we would want to move the transitions metadata into this document, as well as holding the schema version, then it would make sense if the creation and maintenance of the doc was part of the flow of running transitions.
I wonder if we really want something like this:
Merged.
To summarise this feature for AT:
The sister doc lives in the medic DB, and is named foo-info, where foo is the id of the document in question.

@SCdF A couple questions....
@ngaruko please make sure you liaise with @michaelkohn for AT on this issue.
What time zone is the sister doc in?
It's just new Date(), dependent on the server, so UTC.
If you delete a doc, does its sister doc get deleted?
No. This was less an intentional decision and more a result of how transitions work (they do not get run over deleted documents).
Did we end up doing #2180 (comment) as well? And if not, how much additional work is that?
No, medic-couch2pg didn't change. We can talk about exactly what you want, and what it achieves over doing a join.
Is there an environment I can have a look at that has some testing data on it?
alpha.dev is an environment that runs on anything that is merged to master. As long as sentinel is running on that instance, and it's migrated to the latest build, then as you start editing / creating documents their info documents will be created.
Just for future reference, here is an example of how this actually got implemented.
{
"_id": "f8cc78d0-31a7-44e8-8073-176adcc0db7b-info",
"_rev": "2-6e08756f62fa0595d87a3f50777758dc",
"type": "info",
"doc_id": "f8cc78d0-31a7-44e8-8073-176adcc0db7b",
"latest_replication_date": "2018-08-13T22:02:46.699Z",
"initial_replication_date": "2018-08-13T22:02:13.625Z"
}