- CouchDB v2.3 (master branch)
- Erlang OTP v21
- Elixir v1.8.2
The following error message appears in many places (in roughly 130 to 170 tests) when running the CouchDB test suite:
** (KeyError) key :status_code not found in: %HTTPotion.ErrorResponse{message: "req_timedout"}
Almost all of the occurrences have the following stack trace:
stacktrace:
(couchdbtest) test/elixir/lib/couch/db_test.ex:166: anonymous fn/2 in Couch.DBTest.create_db/2
(couchdbtest) test/elixir/lib/couch/db_test.ex:304: Couch.DBTest.retry_until/4
The same failures are seen with both the make check and make elixir commands.
Could any of the recent commits be causing this?
log_make_check_container_137failures.log
log_make_check_container_163Failures.log
req_timedout could be coming from the ibrowse HTTP client (which HTTPotion uses underneath). I think the timeout there is set to 5 seconds, so any request that takes longer than 5 seconds would fail like that.
I am not sure how easy it is to update the timeout and try again. Or perhaps add more resources, like a faster disk or more CPU, and try again.
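To illustrate why a timed-out request surfaces as a KeyError rather than a plain timeout: the error struct carries only a message, so reading status_code from it fails. The sketch below uses hypothetical stand-in structs (Sketch.Response and Sketch.ErrorResponse are assumptions, not the real HTTPotion modules) and shows one way a helper could tell the two cases apart by pattern matching.

```elixir
# Hypothetical stand-ins for the two response shapes seen in the error above.
defmodule Sketch.Response do
  defstruct status_code: nil, body: ""
end

defmodule Sketch.ErrorResponse do
  defstruct message: nil
end

defmodule Sketch.Check do
  # A normal response has a status code; a transport failure has only a message.
  def status(%Sketch.Response{status_code: code}), do: {:ok, code}
  def status(%Sketch.ErrorResponse{message: reason}), do: {:error, reason}
end

defmodule Sketch.Demo do
  def run do
    IO.inspect(Sketch.Check.status(%Sketch.Response{status_code: 201}))
    IO.inspect(Sketch.Check.status(%Sketch.ErrorResponse{message: "req_timedout"}))
  end
end

Sketch.Demo.run()
```

With a check like this, a slow request would be reported as a timeout instead of a missing key, which may make failures of this kind easier to read.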
@nickva thanks for the response. Is there a particular file I should look at to increase the timeout to, say, 10 seconds?
@sarveshtamba hmm, not sure but try changing this value:
https://github.com/apache/couchdb/blob/master/test/elixir/lib/couch.ex#L182
Make it 10000 and see if it works better. If it does, we could just bump the default.
I had made that change in my "fix all the CI timeouts" WIP branch and it helped.
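As a rough sketch of what bumping the default could look like: the helper below fills in a request timeout only when the caller has not set one. The module name, the function name, and the use of a :timeout key in a keyword list are assumptions for illustration; the actual code at the linked line in couch.ex may be structured differently.

```elixir
defmodule Sketch.Options do
  # Assumed default in milliseconds (the value suggested in this thread).
  @default_timeout 10_000

  # Adds :timeout to the request options unless the caller already set it.
  def with_timeout(options, timeout \\ @default_timeout) do
    Keyword.put_new(options, :timeout, timeout)
  end
end

IO.inspect(Sketch.Options.with_timeout([]))
IO.inspect(Sketch.Options.with_timeout(timeout: 30_000))
```

Using Keyword.put_new keeps any per-test override intact, so only requests that rely on the default are affected.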
Hi @nickva , @kocolosk , thanks a lot for your response.
I have set the timeouts in the following two places to 999_999_999 (a lower timeout of 10000, up from the existing 5000, still caused some tests to fail; I will try to find the optimum value after some more trials):
https://github.com/apache/couchdb/blob/master/test/elixir/lib/couch.ex#L182
https://github.com/apache/couchdb/blob/master/test/elixir/lib/couch/db_test.ex#L293
Right now the Elixir test suite is in progress, and all the tests except the couple below seem to pass. The run has been stuck at the following test for quite some time:
CompactTest
* test compaction reduces size of deleted docs
The failing tests are as follows:
ReshardAllDocsTest
* test all_docs after splitting the same range on all nodes (15901.7ms)
1) test all_docs after splitting the same range on all nodes (ReshardAllDocsTest)
test/elixir/test/reshard_all_docs_test.exs:39
** (RuntimeError) timed out after 10184 ms
code: |> Enum.each(fn id -> wait_job_completed(id) end)
stacktrace:
(couchdbtest) test/elixir/lib/couch/db_test.ex:301: Couch.DBTest.retry_until/4
(elixir) lib/enum.ex:769: Enum."-each/2-lists^foreach/1-0-"/2
(elixir) lib/enum.ex:769: Enum.each/2
test/elixir/test/reshard_all_docs_test.exs:51: (test)
* test all_docs after splitting all shards on node1 (13961.3ms)
2) test all_docs after splitting all shards on node1 (ReshardAllDocsTest)
test/elixir/test/reshard_all_docs_test.exs:21
** (RuntimeError) timed out after 10188 ms
code: wait_job_completed(jobid)
stacktrace:
(couchdbtest) test/elixir/lib/couch/db_test.ex:301: Couch.DBTest.retry_until/4
test/elixir/test/reshard_all_docs_test.exs:32: (test)
Is there anything that I should look out for?
Thanks once again! Appreciate it very much.
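For readers unfamiliar with the "timed out after ... ms" errors in the traces above, here is a minimal sketch of a polling helper in the same spirit as retry_until. It is an illustration under assumptions (the argument order, defaults, and module name are made up here), not the implementation in db_test.ex.

```elixir
defmodule Sketch.Retry do
  # Calls `condition` until it returns a truthy value, sleeping `sleep` ms
  # between attempts, and raises once `timeout` ms have elapsed.
  def retry_until(condition, sleep \\ 100, timeout \\ 5_000) do
    loop(condition, sleep, timeout, System.monotonic_time(:millisecond))
  end

  defp loop(condition, sleep, timeout, start) do
    if result = condition.() do
      result
    else
      elapsed = System.monotonic_time(:millisecond) - start

      if elapsed >= timeout do
        raise "timed out after #{elapsed} ms"
      else
        Process.sleep(sleep)
        loop(condition, sleep, timeout, start)
      end
    end
  end
end

# Usage: the condition becomes true on the third attempt.
{:ok, counter} = Agent.start_link(fn -> 0 end)
attempt = fn -> Agent.get_and_update(counter, fn n -> {n + 1, n + 1} end) end
IO.inspect(Sketch.Retry.retry_until(fn -> attempt.() >= 3 end, 10, 1_000))
```

A helper like this only reports that the condition never became true within the limit, which is why the stack trace alone says little about what the job was actually waiting on.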
After the long timeout, the following test finally fails:
CompactTest
* test compaction reduces size of deleted docs (1017513.1ms)
1) test compaction reduces size of deleted docs (CompactTest)
test/elixir/test/compact_test.exs:17
** (RuntimeError) timed out after 1000053 ms
code: retry_until(fn ->
stacktrace:
(couchdbtest) test/elixir/lib/couch/db_test.ex:301: Couch.DBTest.retry_until/4
test/elixir/test/compact_test.exs:38: (test)
This is the only test failure that I am seeing right now, and all the other tests pass successfully.
Hmm, yeah, that last one looks like something went screwy and was going to hang forever. Hard to tell what just from the stack trace there.
On a related note, I found a second HTTP-related timeout that seems to fire on rare occasions. I'm testing it now, but I think this is how you change it: e71eafd6875
Linking to #2088
@kocolosk Any input or pointers on what could be going wrong in CompactTest, and on how I can help resolve this?
@kocolosk any updates for me on this one?
The following error messages are seen intermittently while running the CouchDB test suite:
ReshardAllDocsTest
1) test all_docs after splitting all shards on node1 (ReshardAllDocsTest)
test/elixir/test/reshard_all_docs_test.exs:21
** (RuntimeError) timed out after 10162 ms
code: wait_job_completed(jobid)
stacktrace:
(couchdbtest) test/elixir/lib/couch/db_test.ex:301: Couch.DBTest.retry_until/4
test/elixir/test/reshard_all_docs_test.exs:32: (test)
test all_docs after splitting the same range on all nodes (15679.8ms)
2) test all_docs after splitting the same range on all nodes (ReshardAllDocsTest)
test/elixir/test/reshard_all_docs_test.exs:39
** (RuntimeError) timed out after 10164 ms
code: |> Enum.each(fn id -> wait_job_completed(id) end)
stacktrace:
(couchdbtest) test/elixir/lib/couch/db_test.ex:301: Couch.DBTest.retry_until/4
(elixir) lib/enum.ex:769: Enum."-each/2-lists^foreach/1-0-"/2
(elixir) lib/enum.ex:769: Enum.each/2
test/elixir/test/reshard_all_docs_test.exs:51: (test)
@sarveshtamba Are you up-to-date on master? I think 608caaf12904effc104fc86a8525eb51425e2311 fixed those ReshardAllDocsTest timeouts for me (specifically by also raising the inactivity_timeout beyond the default of 10 seconds)
Hi @kocolosk ,
Thanks, I just pulled the latest changes from master, and the ReshardAllDocsTest tests pass with the updates.
Now I am left with only the error below. I am trying to investigate it, without much luck so far. :-(
CompactTest
* test compaction reduces size of deleted docs (51576.1ms)
1) test compaction reduces size of deleted docs (CompactTest)
test/elixir/test/compact_test.exs:17
** (RuntimeError) timed out after 30015 ms
code: retry_until(fn ->
stacktrace:
(couchdbtest) test/elixir/lib/couch/db_test.ex:301: Couch.DBTest.retry_until/4
test/elixir/test/compact_test.exs:38: (test)
Is there a way to run just this one test case standalone? And can it be attached to a debugger such as gdb to help debug at runtime?
Hi @kocolosk ,
I debugged the failing CompactTest test case and have managed to find the root cause of the failure.
After understanding the logic of the test case and tracing the code flow, I realised that the failure was caused by an incorrect assert check at the following location:
https://github.com/apache/couchdb/blob/master/test/elixir/test/compact_test.exs#L46
This is because the final data size (after deletion and subsequent compaction) is larger than the deleted data size (after deletion only, before compaction).
The test asserted the opposite, which is why it failed consistently.
These are the values of the relevant variables that I traced:
CompactTest
* test compaction reduces size of deleted docs
Value of orig_data_size = 4436.
Value of orig_disk_size = 103907.
Value of deleted_data_size = 7455.
Value of final_data_size = 11924.
Value of final_disk_size = 218681.
* test compaction reduces size of deleted docs (18819.2ms)
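Plugging the traced values into both comparisons makes the failure easy to see. This is only a sketch using the numbers reported above; the exact form of the assertion in compact_test.exs is not shown in this thread.

```elixir
# Values traced on PowerPC64LE, as listed above.
sizes = %{deleted_data_size: 7455, final_data_size: 11_924}

# The comparison the test was effectively waiting for never holds here.
IO.inspect(sizes.final_data_size < sizes.deleted_data_size, label: "final < deleted")

# The opposite comparison holds for these values.
IO.inspect(sizes.final_data_size > sizes.deleted_data_size, label: "final > deleted")
```

Because the first comparison stays false no matter how long the test retries, the retry loop runs until its timeout, which matches the "timed out" errors seen earlier.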
I have made the necessary changes and submitted a PR:
https://github.com/apache/couchdb/pull/2127
The entire test suite for CouchDB v2.3 (current master) now passes with Erlang v21 and Elixir v1.8 on PowerPC64LE. Closing this issue. Thanks for all your help and support in getting this through.