JS tests have started failing quite often on Travis with the following error:
test/javascript/tests/delayed_commits.js
Error: Failed to execute HTTP request: Failed to connect to 127.0.0.1 port 15984: Connection refused
Trace back (most recent call first):
37: test/javascript/couch_http.js
("\"false\"\n")
468: 127.0.0.1/_config/couchdb/delayed_commits",[object Object])@test/javascript/couch.js
("PUT","/_node/node1
408: test/javascript/couch_test_runner.js
run_on_modified_server([object Array],(function () {sleep(15000);T(db.
29: test/javascript/tests/delayed_commits.js
()
37: test/javascript/cli_runner.js
runTest()
48: test/javascript/cli_runner.js
fail
JS tests should pass.
JS tests often fail locally in the delayed_commits suite.
The cause is either a change introduced with the new purge functionality or some tweaking around the JS suite itself. It requires a bit of digging into what exactly delayed_commits is testing and how that functionality was affected by the latest changes.
Run make javascript repeatedly until it fails. Note that an isolated run, make javascript suites=delayed_commits, finishes fine most of the time.
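For anyone trying to reproduce this, the "run until the failure" step can be scripted. The following is only a sketch: run_until_failure is a hypothetical helper name, and the run limit is an arbitrary choice, not something from the test suite.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the CouchDB repo).
# run_until_failure MAX_RUNS CMD...
# Reruns CMD until it exits non-zero or MAX_RUNS passes have completed.
run_until_failure() {
  local max_runs=$1
  shift
  local run
  for ((run = 1; run <= max_runs; run++)); do
    if ! "$@"; then
      # The command failed: report which run reproduced the problem.
      echo "failed on run $run"
      return 0
    fi
  done
  # Every run passed, so nothing was reproduced.
  echo "no failure in $max_runs runs"
  return 1
}

# Example (assumes the current directory is a CouchDB checkout):
#   run_until_failure 20 make javascript
```

The helper returns 0 when a failure was reproduced and 1 when every run passed, so it can be chained with other commands.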
@eiri note that Jenkins is passing recently, all the time. If you're trying to reproduce this, maybe try on a heavily CPU-loaded or RAM-constrained system?
@wohali Annoyingly, this particular test fails for me locally 4 times out of 5, even on the current master, and my MBP is not particularly starved for resources. Initially I thought this was a local issue, something in my env, but now that I've seen the same failure on Travis, my guess is that this test became finicky after some of the recent changes.
So, the minimal set of JS suites I can reproduce this with is make javascript suites=coffee,compact,delayed_commits, which leads me to believe it is triggered by some interaction between a compaction run on a freshly built view (from the coffee suite) and delayed_commits.
My guess is that there is a race somewhere that's _less_ visible on a setup with a slower HD, which is why Jenkins is not affected that much.
@eiri now that Elixir tests have landed, maybe focus on porting this to Elixir so we can ditch the JS test suite ASAP? :) Unless you think there's an actual race condition here that needs addressing.
@wohali I'm actually digging into it right now; I had to work on something else for the first half of the week.
The Elixir tests landing is good news, though it'll take me some time to get up to speed with them. My current plan is to confirm that this is more of a test issue than a real corner-case problem, and if it's the former I'll switch to just porting it.
I'll keep posting my progress on this thread.
So, this turned out not to be about the delayed_commits tests, but some kind of race in the server restart code introduced in #1543. I haven't nailed down what exactly it is yet; so far it seems to be triggered when the restart kills a running compaction and/or when there is previous content in the system databases in dev/lib/node*. This explains why it's less frequent on CI (it starts from a clean state every time) and more pronounced in a dev env (I don't run make devclean on _each_ test run).
Since the problem is not in the tests themselves but in the actual restart API that we are using in the Elixir tests, I'll keep looking.
Ok, I just can't reproduce this outside of the JavaScript test suite, i.e. by repeating the same steps in a bash script.
I'm going to concentrate on the Elixir port then, to see if the issue persists; if not, I'll just write it off as JavaScript voodoo and we'll have our port.
Ping @jaydoane @dottorblaster @iilyak @davisp for discussion.
So, I'm porting the delayed_commits tests to Elixir, and one thing I'm adding is the restart_server helper I need there. I've already implemented it in the fashion of the JavaScript helper, but I'm not thrilled with the idea of running retry_until to confirm that the server goes down and then comes back up. It's a time-based check and by definition unreliable.
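To make the concern concrete, here is a rough bash sketch of the polling pattern being described. It is not the actual JavaScript or Elixir helper: retry_until here is a simplified attempt-counting version, and server_up, server_down, wait_for_restart, COUCH_URL, RESTART_ATTEMPTS and RESTART_INTERVAL are all hypothetical names. The only details taken from this issue are the node address 127.0.0.1:15984 and the "down, then up" sequence.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a poll-based restart check; not CouchDB's helper.

# retry_until MAX_ATTEMPTS INTERVAL CMD...
# Polls CMD every INTERVAL seconds until it succeeds or attempts run out.
retry_until() {
  local max_attempts=$1 interval=$2
  shift 2
  local attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
  done
  return 1
}

# Assumed probe: the node counts as "up" when its root URL answers over HTTP.
server_up() {
  curl --silent --fail --output /dev/null "${COUCH_URL:-http://127.0.0.1:15984}/"
}

server_down() {
  ! server_up
}

# Phase 1 waits to observe the node down, phase 2 waits for it to come back.
wait_for_restart() {
  local attempts=${RESTART_ATTEMPTS:-50} interval=${RESTART_INTERVAL:-0.1}
  retry_until "$attempts" "$interval" server_down || return 1
  retry_until "$attempts" "$interval" server_up
}
```

The weak spot is phase 1: if the node goes down and comes back between two polls, server_down never succeeds and the helper reports a failure even though the restart happened. Shrinking the interval narrows that window but does not close it.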
I'm thinking of joining the Elixir node to the dev cluster and using global_group:monitor_nodes/1 instead, but since this seems to be the first time we'd actually be leveraging Elixir's BEAM side, I don't want to dive into it without being sure it's not going to be rejected on principle.
Please share your opinion on the matter.
@eiri: I think it is ok to use erlang distribution to control SUT (system under test) as long as:
My main concern is that the current JS test suite uses HTTP, so adding a dependency on Erlang distribution protocol may make it more difficult to replace the existing suite in those environments. That said, leveraging the BEAM does seem like a more elegant approach.