I encountered an issue when trying to automate CouchDB 3.0 cluster setup with the cluster setup API, described at https://docs.couchdb.org/en/stable/setup/cluster.html#the-cluster-setup-api.
If the / path is not accessed before finish_cluster is called, {"error":"unknown_error","reason":"undef","ref":1124911208} is returned.
Pre-existing data directories are deleted and docker-compose up -d is executed on all nodes.
Docker compose configuration:
version: '3'
services:
  couchdb:
    image: couchdb:3
    environment:
      - COUCHDB_USER=user
      - COUCHDB_PASSWORD=pass
      - COUCHDB_SECRET=secret
      - NODENAME=<ip of node>
    command: "-setcookie cookie"
    ports:
      - "5984:5984"
      - "4369:4369"
      - "9100:9100"
    volumes:
      - ./data:/opt/couchdb/data
Assuming a 2-node setup, the following commands are executed on the setup node:
curl http://user:pass@localhost:5984/_cluster_setup
curl --request POST \
  --url http://user:pass@localhost:5984/_cluster_setup \
  --header 'content-type: application/json' \
  --data '{
    "action": "enable_cluster",
    "bind_address": "0.0.0.0",
    "username": "user",
    "password": "pass",
    "port": 5984,
    "node_count": 2,
    "remote_node": "<ip of remote node>",
    "remote_current_user": "user",
    "remote_current_password": "pass"
  }'
curl --request POST \
  --url http://user:pass@localhost:5984/_cluster_setup \
  --header 'content-type: application/json' \
  --data '{
    "action": "add_node",
    "host": "<ip of remote node>",
    "port": 5984,
    "username": "user",
    "password": "pass",
    "singlenode": false
  }'
# curl http://localhost:5984/
curl --request POST \
  --url http://user:pass@localhost:5984/_cluster_setup \
  --header 'content-type: application/json' \
  --data '{ "action": "finish_cluster" }'
curl http://user:pass@localhost:5984/_cluster_setup
Output (after running commands above, with / request omitted):
{"state":"cluster_enabled"}
{"ok":true}
{"ok":true}
{"error":"unknown_error","reason":"undef","ref":1124911208}
{"state":"cluster_enabled"}
The finish_cluster request should complete and the state should be cluster_finished.
{"couchdb":"Welcome","version":"3.0.0","git_sha":"03a77db6c","uuid":"80f48d682bf369f142e1bb3632cf20bf","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"],"vendor":{"name":"The Apache Software Foundation"}}
If a request to / is made before finish_cluster, finish_cluster returns {"ok": true} and the setup state becomes cluster_finished.
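For anyone scripting this, a minimal sketch of the setup-node sequence with that workaround applied might look like the following. It only reorders the curl calls from this report. COUCH_URL, REMOTE_NODE, cluster_setup and setup_cluster are names made up for illustration, the credentials are the placeholder ones used above, and whether the extra GET / is needed on other versions is not established here.

```shell
#!/bin/sh
# Sketch only: COUCH_URL and REMOTE_NODE are illustrative names, not CouchDB settings.
COUCH_URL="${COUCH_URL:-http://user:pass@localhost:5984}"
# Placeholder kept from the report; set REMOTE_NODE to the real address before running.
[ -n "$REMOTE_NODE" ] || REMOTE_NODE='<ip of remote node>'

# POST one JSON action to /_cluster_setup on the setup node.
cluster_setup() {
  curl --silent --request POST \
    --url "$COUCH_URL/_cluster_setup" \
    --header 'content-type: application/json' \
    --data "$1"
}

# Run the setup-node sequence from the report, with GET / before finish_cluster.
setup_cluster() {
  cluster_setup "$(printf '{"action":"enable_cluster","bind_address":"0.0.0.0","username":"user","password":"pass","port":5984,"node_count":2,"remote_node":"%s","remote_current_user":"user","remote_current_password":"pass"}' "$REMOTE_NODE")"
  cluster_setup "$(printf '{"action":"add_node","host":"%s","port":5984,"username":"user","password":"pass","singlenode":false}' "$REMOTE_NODE")"
  # Workaround observed in this report: request / once before finishing.
  curl --silent "$COUCH_URL/"
  cluster_setup '{"action":"finish_cluster"}'
  # Should report {"state":"cluster_finished"} if setup completed.
  curl --silent "$COUCH_URL/_cluster_setup"
}
```

Calling setup_cluster on the setup node, once both containers are up, runs the five requests in order. The block defines functions only, so nothing is sent until setup_cluster is invoked.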
Edit:
I realised that I had neglected to set the cookie, after observing that the nodes would not reconnect after a restart.
I've now added command: "-setcookie cookie" to the Docker Compose file, but the issue is still reproducible.
Heya, great report, thanks!
Can you share your couch.log from all connected nodes for the commands above? Ideally at debug level.
Please find attached.
Note that the logs are from a virtualbox vm (setup node) and my local machine (other node).
Apologies if I've done anything incorrectly as I'm new to couch.
Thanks.
Edit: On second thoughts, it may be better to have two actual nodes, given that 'other node' cannot reach 'setup node'. Let me know if you want me to grab the logs from a real environment.
Thanks!
The first thing I noticed in the setup_node log is that the order of requests doesn't match your listing with the curl commands above:
1. GET /_cluster_setup 200: L229
2. POST /_cluster_setup {add_node: 10.0.2.2} 201: L232
3. POST /_cluster_setup {finish_cluster} 201: L243
4. POST /_cluster_setup 500: L260
That last one doesn't dump out what action it is, but I'd assume it is the second add_node call, and that won't work after the finish_cluster in 3.
Can you verify the order of things?
I agree that a better error message would help, if this is the case, so happy to keep this issue open for this regardless.
Actually, it might be that 4. is a second call to finish_cluster which is also not correct, but also should not throw a weird error.
applying the patch in https://github.com/apache/couchdb/pull/2798/files should give you a better error message.
Yes, my bad, the script with the commands doesn't match what I've written above.
Instead, it's:
The 5th request, the GET, was not executed.
The strange thing is if I add a GET request to / between 3 and 4, the error seems to go away.
I'll try to get a better set of logs for you.
Here are 2 sets of test logs.
For the first set, the first line of run.sh was commented out, whereas it was uncommented for the second run.
run.sh
Set 1:
run_output.txt
setup_log.txt
other_log.txt
Set 2:
run_output2.txt
setup_log2.txt
other_log2.txt
I noticed that the second run (with the request to /) succeeded after logging the error "Request to create N=3 DB but only 2 node(s)", while the first run still(?) had this stack trace: [<<"log:error/2">>,<<"setup:sync_config/3 L265">>,<<"setup:finish_cluster/1 L203">>,<<"setup_httpd:handle_action/2 L99">>,<<"setup_httpd:handle_setup_req/1 L24">>,<<"chttpd:handle_req_after_auth/2 L322">>,<<"chttpd:process_request/1 L305">>,<<"chttpd:handle_request_int/1 L243">>].
Regarding the patch, I'm not quite sure about how I can apply it easily, given that I'm using docker.
If setup.erl is compiled when building from source, perhaps it's best if the issue is closed for now.
I can just GET / for now and investigate further once I have more time.
It took me three days to fix this problem... I followed almost every config and command posted by @YC. The official documentation is very ambiguous about which node should do what.
We welcome pull requests to improve documentation.
If you are looking for general support with using CouchDB, please try one of these other options:
If that results in a clear bug that needs to be resolved, we're happy to track it on GitHub in a new issue.