CouchDB: Test suite `delayed_commits.js` unreliable.

Created on 29 Oct 2018 · 10 Comments · Source: apache/couchdb

JS tests have started failing quite often on Travis with the following error:

test/javascript/tests/delayed_commits.js                       
    Error: Failed to execute HTTP request: Failed to connect to 127.0.0.1 port 15984: Connection refused
Trace back (most recent call first):

  37: test/javascript/couch_http.js
      ("\"false\"\n")
 468: test/javascript/couch.js
      ("PUT","/_node/node1@127.0.0.1/_config/couchdb/delayed_commits",[object Object])
 408: test/javascript/couch_test_runner.js
      run_on_modified_server([object Array],(function () {sleep(15000);T(db.
  29: test/javascript/tests/delayed_commits.js
      ()
  37: test/javascript/cli_runner.js
      runTest()
  48: test/javascript/cli_runner.js

fail

Expected Behavior

JS tests should pass.

Current Behavior

JS tests often fail locally in the delayed_commits suite.

Possible Solution

Either the new purge functionality introduced a change, or something was tweaked in the JS suite itself. It needs a bit of digging into what exactly delayed_commits tests and how the latest changes affected that functionality.

Steps to Reproduce (for bugs)

Run make javascript repeatedly until it fails. Note that an isolated run, make javascript suites=delayed_commits, finishes fine most of the time.
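The "run until the failure" step can be automated. The following is only a rough Python sketch, not something from the CouchDB repository: the function name run_until_failure and the max_runs limit are made up for illustration, and it assumes the command is started from the root of a CouchDB checkout.

```python
import subprocess


def run_until_failure(cmd, max_runs=20):
    """Run cmd repeatedly and report the first failing run.

    Returns the 1-based number of the run that exited non-zero,
    or None if all max_runs runs passed. Both the name and the
    max_runs default are illustrative assumptions.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return attempt
    return None
```

Calling run_until_failure(["make", "javascript"]) repeats the full suite, while run_until_failure(["make", "javascript", "suites=delayed_commits"]) repeats the isolated run for comparison.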

bug testsuite

Source

eiri

All 10 comments

@eiri note that Jenkins has been passing all the time recently. If you're trying to reproduce this, maybe try on a heavily CPU-loaded or RAM-constrained system?

wohali on 29 Oct 2018

@wohali Annoyingly, this particular test fails for me locally 4 times out of 5, even on current master, and my MBP is not particularly starved for resources. Initially I thought this was a local issue, something in my env, but now that I've seen the same failure on Travis, my guess is that this test became finicky with some of the recent changes.

eiri on 29 Oct 2018

So, the minimal set of JS suites I can reproduce this with is make javascript suites=coffee,compact,delayed_commits, which leads me to believe it is triggered by some interaction between a compaction run on a freshly built view (from the coffee suite) and delayed_commits.

My guess is that there is a race somewhere that's _less_ visible on a setup with a slower HD, which is why Jenkins is not affected as much.

eiri on 30 Oct 2018

@eiri now that Elixir tests have landed, maybe focus on porting this to Elixir so we can ditch the JS test suite ASAP? :) Unless you think there's an actual race condition here that needs addressing.

wohali on 8 Nov 2018

@wohali I'm actually digging into it right now; I had to work on something else for the first half of the week.

The Elixir tests landing is good news, though it'll take me some time to get up to speed with them. My current plan is to confirm that this is a test issue rather than a real corner-case problem, and if it is, I'll switch to just porting it.

I'll keep posting my progress on this thread.

eiri on 8 Nov 2018

So, this turned out not to be about the delayed_commits tests, but some kind of race in the server restart code introduced in #1543. I haven't nailed down what exactly it is yet; so far it seems to be triggered when the restart kills a running compaction and/or when there is previous content in the system databases in dev/lib/node*. This explains why it's less frequent on CI (it starts from a clean state every time) and more pronounced in a dev env (I don't run make devclean on _each_ test run).

Since this is not the tests themselves but the actual restart API that we are using in the Elixir tests, I'll keep looking.

eiri on 12 Nov 2018

OK, I just can't reproduce this outside of the JavaScript test suite, i.e. by repeating the same steps in a bash script.

I'm going to concentrate on the Elixir port then, to see if the issue persists; if not, I'll just write it off as JavaScript voodoo and we'll have our port.

eiri on 13 Nov 2018

Implementation details for restart server in Elixir tests

Ping @jaydoane @dottorblaster @iilyak @davisp for discussion.

So, I'm porting the delayed_commits tests to Elixir, and one thing I'm adding is a restart_server helper that I need there. I've already implemented it in the fashion of the JavaScript helper, but I'm not thrilled with the idea of running retry_until to confirm that the server goes down and then comes back up. It's a time-based check and by definition unreliable.
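To make the concern concrete, here is a rough Python sketch of the polling pattern described above. It is not the Elixir or JavaScript helper from the repository: the signatures of retry_until and wait_for_restart, the is_up probe, and the timeout and interval defaults are all assumptions made for illustration.

```python
import time


def retry_until(condition, timeout=30.0, interval=0.5):
    """Poll condition() until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def wait_for_restart(is_up, timeout=30.0, interval=0.5):
    """Wait for the server to go down and then come back up.

    is_up is a hypothetical probe that returns True while the
    server answers requests.
    """
    went_down = retry_until(lambda: not is_up(), timeout, interval)
    if not went_down:
        # Either the restart never happened, or the server went down
        # and came back between two polls and was never seen as down.
        return False
    return retry_until(is_up, timeout, interval)
```

The weak spot is the first phase: a restart that completes faster than the polling interval is indistinguishable from no restart at all, so the outcome depends on timing rather than on an actual node-down event.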

I'm thinking of joining the Elixir node to the dev cluster and using global_group:monitor_nodes/1 instead, but since this seems to be the first time we would actually be leveraging Elixir's BEAM side, I don't want to dive into it without being sure it's not going to be rejected on principle.

Please share your opinion on the matter.

eiri on 3 Dec 2018

@eiri: I think it is OK to use Erlang distribution to control the SUT (system under test) as long as:

- the extra node is hidden (so the cluster does not consider it a real node)
- the APIs we call remotely from integration tests are encapsulated in a test helper module
  - we carefully add functions there on a case-by-case basis
  - if we allowed calling any function on the node, our tests would quickly become implementation-specific

iilyak on 3 Dec 2018

My main concern is that the current JS test suite uses HTTP, so adding a dependency on the Erlang distribution protocol may make it more difficult to replace the existing suite in those environments. That said, leveraging the BEAM does seem like a more elegant approach.