Replication is failing because a request fails, which seems related to an atts_since limit.
Requesting a document with 30+ atts_since entries should work, but I only get a valid response when I remove some entries from the request array.
Otherwise the request returns a 400 error.
The problem occurred on a database where one document had a lot of conflicts. It looks like an entry is added to atts_since for each conflict, to get new revisions since the conflict branches. Today I noticed the replication was hanging/looping and causing a lot of errors in the log. I saw errors mentioning the GET request with atts_since (logged as an error on Windows, and later as a notice when I tried on Linux).
emulator -------- Error in process <0.1903.10> on node couchdb@localhost with exit value:
{function_clause,[{couch_replicator_api_wrap,'-open_doc_revs/6-fun-1-',[400,[{"Content-Length","0"},{"Date","Tue, 08 Jan 2019 14:58:13 GMT"},{"Server","MochiWeb/1.0 (Any of you quaids got a smint?)"},{"Strict-Transport-Security","max-age=31536000"}],#Fun<couch_replicator_httpc.0.15481832>],[{file,"src/couch_replicator_api_wrap.erl"},{line,254}]},{couch_replicator_httpc,process_stream_response,5,[{file,"src/couch_replicator_httpc.erl"},{line,205}]},{couch_replicator_httpc,send_req,3,[{file,"src/couch_replicator_httpc.erl"},{line,76}]}]}
Replication crashing because GET https://servername/db/conflicteddoc?atts_since=%5B%222-1631949226f78c645dde972daa1318a3%22%2C%222-47dcd3afe7e16850ebec40ce2f190351%22%2C%222-804d2065909cfd135d7c2a69d81733b1%22%2C%222-91e9a8310a73757819f48f9035e17fc4%22%2C%222-9329e8bb6290a6e10322cba14839bb62%22%2C%222-ac0c261b873d070d1817bfea3e0d1e57%22%2C%222-d5743060c5dcb5882f4eba4d724a7b48%22%2C%222-e14ae63e9d12f2dff7d683f249bf7f86%22%2C%222-fd7099b4b667c55093268f2589028457%22%2C%222-fdf27027813c424d8a5dc65a48a1f9e7%22%2C%22502-0f6282c92965d792e5cfd0a218dc90e1%22%2C%22576-74f1baae10dc93634c8f664a7080792f%22%2C%22693-23827d51bb67119753a8d43a47d7d90c%22%2C%22817-706ca2dadd94dd05067924a4cf3d62ae%22%2C%22925-83f6a73f51e1b47b9c745a81de306463%22%2C%221435-4ec081205b9fd732f9d0db597a2d9d61%22%2C%221663-0e9ffd5403634b8e127c2ad496ef30bd%22%2C%221875-921c203e382a203b31831fa8235ee293%22%2C%222089-81ca3e663a9e8b4d1eafe5f0a7c139ef%22%2C%222224-c4cab42628e96345b34bb509062b1e6c%22%2C%222575-4f437ad237ed3a5c0e979d9f63168cbb%22%2C%222752-c7d6fb9febf1c463e758a89fce901df7%22%2C%222774-8fed8922aefa1262d7cb67798edd69cc%22%2C%222991-e55c6ff1a19604904d7e83cdba3db2f3%22%2C%223105-17b3e2194c6cf2d406fed05665883213%22%2C%223145-a752288c0690f54ace8f82f58015bef7%22%2C%223338-c17c0cf94b92b67bde2611744e5c05fc%22%2C%223572-514320d19c476f068db5065cd04c3553%22%2C%223754-21e842ffb0320b44103e04718aa8b671%22%2C%224262-19ccbdc6f71b8aabf18dd31cbd1e5cdf%22%2C%224698-c74d7c5b75c7d4f0004f3328251de391%22%2C%226485-f75dc4ea029b0934a9788f21bc911bbb%22%2C%226525-c332948ceeabde42097167f7800afc6f%22%2C%226624-48315fc928413903f57cb90fdbb13da4%22%2C%226719-1a26fb75b6cbdfa064a10464e67a46fd%22%2C%226812-fed431bf100b563e2a10cae6661e9bbf%22%2C%226843-de87dbff1483490cb2ecb89ab51197e4%22%2C%227154-cbb96c93c6e5585e45f7cd95f9e40f22%22%2C%227165-80b31ecaa0a3eaf70b9a33513cb209d3%22%2C%227604-c7e1cbd8067fca122b00c43de08a1878%22%2C%227953-e1abb0cf7967b46ef45c7a251a0fd938%22%2C%228023-141bf3a1762377883f9995f584030b84%22%2C%228125-82b72ebbfcda
41809f456b45f9c8f541%22%5D&revs=true&open_revs=%5B%228138-5dbae82b30eb49509cdf32f743c7ab58%22%5D&latest=true failed
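To get a feel for how quickly such a URL grows, here is a minimal sketch that builds a request shaped like the one in the log and prints its length for different numbers of atts_since entries. The helper name `open_revs_url` and the generated revision IDs are made up for illustration; only the query parameter names and the host/path are taken from the log above.

```python
import json
from urllib.parse import quote


def open_revs_url(base, doc_id, atts_since, open_revs):
    """Build a GET URL shaped like the failing request in the log."""

    def enc(value):
        # Compact JSON array, then percent-encode every reserved character.
        return quote(json.dumps(value, separators=(",", ":")), safe="")

    return (
        f"{base}/{quote(doc_id, safe='')}"
        f"?atts_since={enc(atts_since)}&revs=true"
        f"&open_revs={enc(open_revs)}&latest=true"
    )


# Hypothetical revision IDs: "2-" followed by a 32-character hex digest.
fake_revs = [f"2-{i:032x}" for i in range(43)]

for count in (1, 10, 43):
    url = open_revs_url(
        "https://servername/db",
        "conflicteddoc",
        fake_revs[:count],
        ["8138-5dbae82b30eb49509cdf32f743c7ab58"],
    )
    print(count, len(url))
```

Each extra revision costs its own length plus nine characters of percent-encoded quotes and comma, so the URL grows linearly with the number of conflict branches.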
Other documents seem to replicate fine. But the error remained even after I deleted the conflicting revs.
2.2.0 on Windows
2.3.0 on Linux
[edit]
I should add that the initial replication seems to work. It is only in continuous mode that changes to the document are not kept up to date. (I also think one-time replication fails/hangs if the previous replication point already contains the document's previous version.)
This is probably a duplicate of #1810. Can you test:
[chttpd]
server_options = [{recbuf, 8192}]
and see if this helps?
That sounds very plausible, will try it tomorrow and report back.
But if this is the solution, wouldn't that just postpone the error until there are enough branches to hit the next limit? I don't fully get why resolving the conflicts would not eliminate elements from atts_since. I guess you could still bring back a deleted branch, so it needs to be checked?
Would purging deleted_conflicts help here?
paging @nickva
@jlami "I guess you could still bring back a deleted branch so it needs to be checked?"
Exactly, deletions are replicated in the normal case. However, you could add a replication selector filter to not replicate deletions to the target.
Ok, so my fear of a receive buffer overflow in the future is kinda valid if this is the culprit. To be correct, CouchDB would actually need to split this up into multiple requests to work around the limitation. Does it do this currently?
@jlami The open_revs request that is failing above actually does have logic to reduce the URL size of the request. However, it needs to receive a 414 response (which is the marker for "URL too long") to trigger it.
In this case the response is 400, likely because of the #1810 issue (see details there; it is related to a bug in the URL parser, but setting a socket buffer fixes the issue).
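The idea of reducing the URL size can be illustrated with a small sketch that keeps only as many atts_since entries as fit within a length budget. This is not the replicator's actual code; `fit_atts_since`, `encode_revs`, and the budget values are hypothetical names and numbers chosen for illustration.

```python
import json
from urllib.parse import quote


def encode_revs(revs):
    """Percent-encode a compact JSON array of revision IDs."""
    return quote(json.dumps(revs, separators=(",", ":")), safe="")


def fit_atts_since(base_url_len, atts_since, max_url_len):
    """Keep as many leading atts_since entries as fit within max_url_len.

    base_url_len is the length of everything in the URL except the
    encoded atts_since value (an assumption of this sketch).
    """
    kept = []
    for rev in atts_since:
        candidate = kept + [rev]
        if base_url_len + len(encode_revs(candidate)) > max_url_len:
            break
        kept = candidate
    return kept


# Hypothetical revision IDs, 34 characters each.
revs = [f"2-{i:032x}" for i in range(43)]
print(len(fit_atts_since(100, revs, 500)))
print(len(fit_atts_since(100, revs, 8000)))
```

Dropping entries from atts_since should only mean the response includes some attachments the target may already have, so a shorter list is a trade of bandwidth for a URL that fits. That trade can only happen if the client learns that the URL was the problem, which a plain 400 does not tell it.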
Ok, that sounds better. Without knowing about the 414 I was expecting the normal 400 failure again.
Will let you know if a higher recbuf fixes it for now.
With recbuf set to 8192, the 400 error goes away and replication seems to work.
I don't see a 414 if I make a very large dummy request. It just gives a 400 again.
@jlami
I think you were hitting the same bug in the Erlang HTTP header parser (exposed through the MochiWeb web server that CouchDB uses) as #1810. Normally some proxies would send a 414, and CouchDB sends one as well if you set [httpd] max_uri_length = SomeNumber. So setting that may be another workaround: set it to 8000 or so, and when the request limit is hit a 414 is returned instead of a 400.
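A rough sketch of why the status code matters to the client: with a 414 it can shrink the request and retry, while a 400 gives it nothing to act on. Everything here is a stand-in: `fake_server` only imitates a server enforcing a URI length limit, and halving the list is a simplification, not what the replicator is confirmed to do.

```python
import json
from urllib.parse import quote

MAX_URI_LENGTH = 8000  # stand-in for the [httpd] max_uri_length setting


def fake_server(path):
    """Hypothetical server: answer 414 when the path exceeds the limit."""
    return 414 if len(path) > MAX_URI_LENGTH else 200


def get_with_retry(send, doc_path, atts_since):
    """Retry with fewer atts_since entries for as long as the answer is 414."""
    revs = list(atts_since)
    while True:
        encoded = quote(json.dumps(revs, separators=(",", ":")), safe="")
        status = send(f"{doc_path}?atts_since={encoded}")
        if status != 414 or not revs:
            # Any other status (including 400) is passed back unchanged.
            return status, len(revs)
        revs = revs[: len(revs) // 2]  # drop half of the entries and retry


# Hypothetical revision IDs, 34 characters each.
many_revs = [f"2-{i:032x}" for i in range(400)]

print(get_with_retry(fake_server, "/db/conflicteddoc", many_revs))
print(get_with_retry(lambda path: 400, "/db/conflicteddoc", many_revs))
```

With the 414-producing server the loop settles on a shorter list that fits; with a server that always answers 400 the call returns right away with the full list, which matches the behavior reported in this thread.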
Ok, sounds logical. But I'm hoping something like that would end up in the stable setup. Thanks for looking into this!
Yup, we're planning a 2.3.1 release with this fix.