Couchdb: [Sporadic] Views returning stale data

Created on 29 May 2019 · 17 Comments · Source: apache/couchdb

Description

As per the title, views in some of our production CouchDB databases return stale data. This is sporadic, but when it happens, our fix is to force a complete re-index of the database by deleting the index files found in /var/lib/couchdb/.db_design/mrview/*.view, which is a costly operation.

Our databases also go through a daily purge during a scheduled window, using the _purge endpoint to permanently delete expired documents (I understand that this is not recommended). The purge is followed by compaction to reclaim the purged disk space. We then access all the design documents, which I believe is sufficient to trigger a re-indexing after the purge. (?)

In the cases we have observed, the stale data corresponds to a purged document.

Steps to Reproduce

I have not been able to successfully reproduce this problem. We sporadically observe this in production and I urgently need to get to the bottom of this to sort things out.

Expected Behaviour

The expected behavior is for the views to return up-to-date data, not stale data.

Your Environment

  • CouchDB Version used: CouchDB 1.6.1
  • Browser name and version: Chrome
  • Operating System and version: Ubuntu

Additional context

I do have a couple of questions for which I have not been able to find satisfactory answers:

  1. Is touching the design documents after the purge sufficient to perform a re-indexing of the views? Is there a better approach here?

  2. When the indexer identifies that the purge sequence on a database has changed, it compares the purge sequence of the database with the one stored in the view index. If the difference between the stored sequence and the database sequence is only 1, the indexer uses a cached list of the most recently purged documents and removes these documents from the index individually. This avoids rebuilding the index completely from scratch.

a. Is the purge sequence of the view index the same one that is returned by the /_design/dd/_info endpoint?

b. Is a purge sequence diff the only way a purged document can get re-indexed? I.e., can the purged document id remain in the indexes forever if a diff is not identified?

c. Could an anomaly in the purge sequences between the database and the view index be a cause for purged documents to not get removed from the index?

  3. Could some form of data corruption due to the _purge endpoint (or something else) cause view indexing to fail for certain design documents, resulting in stale data?
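
For question 2a, one way to look at both numbers side by side is sketched below. This is only a minimal sketch under assumptions: it assumes CouchDB 1.x, where GET on the database returns an integer purge_seq and GET on /_design/dd/_info returns an integer view_index.purge_seq. Whether that second value is the one the indexer actually compares against is exactly the open question. The function names and the example URL are hypothetical.

```python
import json
import urllib.request


def get_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def purge_seq_gap(db_url, ddoc, fetch=get_json):
    """Return database purge_seq minus the purge_seq reported for one design doc.

    db_url -- database URL, e.g. "http://localhost:5984/mydb" (hypothetical)
    ddoc   -- design document name without the "_design/" prefix
    fetch  -- callable returning decoded JSON for a URL (replaceable for tests)

    Assumes both purge_seq fields are integers, as in CouchDB 1.x.
    """
    db_info = fetch(db_url)
    ddoc_info = fetch("%s/_design/%s/_info" % (db_url, ddoc))
    return db_info["purge_seq"] - ddoc_info["view_index"]["purge_seq"]
```

Logging this gap for every design document right after the purge window, and again after the views have been read, would at least show whether the two sequences ever drift apart when stale rows appear.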

I understand that 1.X releases are no longer supported and an exact root cause may not be identified. But I would appreciate pointers and potential reasons for this observed issue so that I can proceed with the debugging process.

wontfix

All 17 comments

In 1.x, if you purge more than a single document at a time, your views must all rebuild entirely from scratch. The recommended approach in 1.x is to purge a single document, then GET a view from each design document, then repeat until all documents are purged. The GET process is what heals the view from that single purge, and only a single purge can be handled by this mechanism.

Not sure if this is what you mean by "touch" a design document or not.

I don't remember anything about the purge sequences, sorry.
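
The purge-then-GET loop described above could be sketched roughly as follows. This is a hedged illustration, not a tested tool: the function names, the example URL, and the use of limit=0 to refresh a view without fetching rows are assumptions, and document ids or view names containing special characters would need URL encoding.

```python
import json
import urllib.request


def http_json(method, url, body=None):
    """Send one JSON request and return the decoded JSON response."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def purge_one_at_a_time(db_url, doc_revs, views, request=http_json):
    """Purge documents one by one, reading every view after each purge.

    db_url   -- database URL, e.g. "http://localhost:5984/mydb" (hypothetical)
    doc_revs -- dict mapping document id -> list of revisions to purge
    views    -- list of (design_doc_name, view_name), one view per design doc
    request  -- transport callable (replaceable for dry runs and tests)
    """
    for doc_id, revs in doc_revs.items():
        # A single document per _purge request, as recommended above.
        request("POST", db_url + "/_purge", {doc_id: revs})
        for ddoc, view in views:
            # Reading the view is what lets the index catch up with the purge;
            # limit=0 is assumed to update the index without returning rows.
            request("GET", "%s/_design/%s/_view/%s?limit=0" % (db_url, ddoc, view))
```

The cost is one round of view requests per purged document, so for thousands of documents this trades a full rebuild for many small index updates.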

We definitely purge more than a single document at a time. The upper bound is 20,000 documents (purged in batches) for a purge but on average it is around 100 - 1000 documents during a single purge window.

And yes that is what I meant by "touching" the design documents after a purge. So are you saying if we purge multiple documents, a simple GET call is not going to do a re-index or will it rebuild an entirely new index from scratch?

I do think it's the latter, since the purge implementation has been in production for over 2 years and, for the most part, has worked without any problems. I am also concerned that frequent use of the _purge endpoint could cause data corruption, which in turn would cause the view indexing to fail. (Reference: https://github.com/apache/couchdb/issues/758)

As @wohali mentioned, "In 1.x, if you purge more than a single document at a time, your views must all rebuild entirely from scratch." To avoid rebuilding the index, you have to purge documents one by one, as follows:

  • Stop mem3_sync on all nodes
  • Run the purge for each copy of the shard
  • Restart mem3_sync across the cluster

I understood that. My question is if multiple documents are purged at a time, will a GET perform this said rebuild from scratch?

My question is if multiple documents are purged at a time, will a GET perform this said rebuild from scratch?

My understanding is that whether the index is rebuilt depends on the condition here: https://github.com/apache/couchdb-couch-index/blob/master/src/couch_index_updater.erl#L204, i.e., on the values of DbPurgeSeq and IdxPurgeSeq when the GET is performed.

Based on the purge_index(Db, Mod, IdxState) function logic (forgive me if I am wrong, I am not familiar with Erlang syntax), it looks like if the purge sequence diff is greater than 1, a complete reset of the view indexes is done.

This would mean that even if multiple documents are purged, the views should be rebuilt from scratch which would ideally get rid of any stale data. This is consistent with our production behavior except when this issue happens.
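
To make that reading concrete, the decision as described in this thread can be paraphrased in a few lines of Python. This is a sketch of my understanding of the comparison between DbPurgeSeq and IdxPurgeSeq, not a translation of the Erlang source, and the function name and return labels are made up for illustration.

```python
def purge_action(db_purge_seq, idx_purge_seq):
    """Return what the indexer is understood to do for a pair of purge sequences.

    db_purge_seq  -- purge sequence of the database (DbPurgeSeq)
    idx_purge_seq -- purge sequence stored in the view index (IdxPurgeSeq)
    """
    if db_purge_seq == idx_purge_seq:
        # Nothing was purged since the index was last updated.
        return "up_to_date"
    if db_purge_seq == idx_purge_seq + 1:
        # Exactly one purge behind: remove the last purged docs from the index.
        return "purge_incrementally"
    # Any other difference: discard the index and rebuild it from scratch.
    return "reset"
```

Under this reading, purge_action(8, 5) gives "reset", which is the full rebuild we expect after a bulk purge.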

Is my understanding correct and based on that do you think the bulk purging of documents could cause issues with a complete rebuild of view indexes?

If multiple documents are purged and the purge-seq difference becomes greater than 1, the index will be rebuilt.

Thank you. And this will ideally get rid of purged documents from the index as well right?

Yes, as long as no mem3-sync happens during this phase. That is why I mentioned earlier that mem3-sync should be stopped.
BTW: recent versions of CouchDB ship clustered purge, which avoids rebuilding the index when multiple documents are purged.

mem3-sync

Sorry I am not familiar with mem3-sync. We use CouchDB 1.6.1 and our production setup for a given operator has a local CouchDB on a device which syncs with a cloud CouchDB through replication. Is this still applicable for our scenario?

It is about synchronization among the nodes of one cluster, which is different from replication from a local CouchDB to a remote CouchDB.

Appreciate your input on this matter. So it looks like our production setup should ideally work as expected. I am inclined to suspect some form of view/data corruption due to the bulk purging of documents that is performed daily (Reference: https://github.com/apache/couchdb/issues/758, another issue that we have faced), which in turn could cause the index rebuild to fail.

Still not sure how I can get to the bottom of this, unfortunately.

Apologies for the excessive questioning, but would an error during an index rebuild get thrown in the Couch log files or is it possible that a view re-index can silently fail without any indications? /var/log/couchdb/couch.log did not show any related errors when this happened.

Why do we expect the index rebuild to fail? I am afraid that, in some occasional cases, the index rebuild is not triggered at all, and this causes the stale data.

Why do we expect the index rebuild to fail?

Because, as far as I can see, there is no other explanation for the stale views: after the purge, a GET is performed on the views to trigger a re-indexing which (based on what we have discussed) rebuilds the indexes from scratch.

I am afraid that, in some occasional cases, the index rebuild is not triggered at all, and this causes the stale data.

Is there a way I could track if this is what's actually happened or even recreate it in a local setup?

Using recon_trace (https://ferd.github.io/recon/recon_trace.html), we can see what is happening underneath.

@wohali do you have any further input regarding this? Would appreciate some directions to get to the bottom of this.

This issue is being closed because it references CouchDB 1.x. We no longer support CouchDB 1.x.
