Couchdb: beam.smp spikes and eats all available CPU

Created on 6 Oct 2017 · 5 Comments · Source: apache/couchdb

This is a production system and I'm a bit frustrated, so sorry if this is the wrong place; I don't know where else to look. I have searched high and low for answers and asked on IRC, with no luck.

In a nutshell, I have a 2-node cluster running with CPU at normal levels. I do some inserts and queries, not a heavy load at all. The database(s) being used have ~39 million rows and some views, but mostly Mango indexes. After ~12-20 inserts/queries, the beam.smp process takes off. The insert request that caused the spike never returns and times out.

I have no idea where else to look for clues. Logging is at debug level and verbose, and everything looks pretty normal. The 2 nodes are very large 1TiB machines with 4cpu/4cores each, so resources are not an issue. Something is fundamentally wrong here, but I don't know where to look. I have tweaked every knob I could find in the couch config, with no results. If someone can tell me which additional places to look at or log, that would help; I just need to know which rock to look under. I have no problem putting in the work to debug.

Version used: 2.0
Operating System and version (desktop or mobile): Ubuntu 16

waiting on user

All 5 comments

Keep an eye on the logs for emfile errors; those might mean CouchDB is running out of file descriptors.

Also try increasing max_dbs_open if you see all_dbs_active in the logs.

In general, try to see if there is something in the logs around the time this behavior starts.

Look for things that look like stack traces (file names and line numbers) as well.
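The log checks suggested above can be sketched as a small Python script. This is only an illustration, not a CouchDB tool: the function names, the example log path, and the regular expression for stack-trace-like lines (based on the Erlang `{file,"..."}` tuples that show up in crash reports) are all assumptions.

```python
import re
from collections import Counter

# Markers named in the comment above. The stack-trace pattern is an
# assumption: it matches Erlang-style {file,"..."} tuples in crash reports.
PATTERNS = {
    "emfile": re.compile(r"\bemfile\b"),
    "all_dbs_active": re.compile(r"\ball_dbs_active\b"),
    "stack_trace": re.compile(r'\{file,\s*"[^"]+"\}'),
}


def scan_log(lines):
    """Scan an iterable of log lines.

    Returns (counts, hits): counts maps each marker label to the number of
    matching lines, hits is a list of (line_number, label, text) tuples.
    """
    counts = Counter()
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                hits.append((line_no, label, line.rstrip()))
    return counts, hits


def scan_file(path):
    """Convenience wrapper: scan a log file on disk (path is up to you)."""
    with open(path, errors="replace") as fh:
        return scan_log(fh)


# Example usage (the path is hypothetical; adjust it to your install):
#   counts, hits = scan_file("/var/log/couchdb/couch.log")
#   for line_no, label, text in hits:
#       print(line_no, label, text)
```

Comparing the line numbers (or timestamps) of the hits against the time the CPU spike starts is the point of the exercise; the counts alone only tell you whether a marker appears at all.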

@nickva I'm facing the same issue described above. I'm using CouchDB 1.6.1. In my case I'm running continuous replication between two CouchDB instances for ~20K databases of ~10MB each on average, in both directions. After a certain time CouchDB crashes and the beam process eats up all the available CPU. Restarting the couch process or deleting the replications didn't help. Could you tell me what information I should be looking at, or provide any pointers to solve this issue?

The output of the couchdb.stderr file:

heart_beat_kill_pid = 15996
heart_beat_timeout = 11
heart: Fri Dec 15 22:01:14 2017: Erlang has closed.
Terminated
sh: echo: I/O error
heart: Fri Dec 15 22:01:15 2017: Executed "/usr/bin/couchdb -k" -> 256. Terminating.

heart_beat_kill_pid = 16103
heart_beat_timeout = 11
heart: Fri Dec 15 22:03:52 2017: Erlang has closed.
Terminated
sh: echo: I/O error
heart: Fri Dec 15 22:03:53 2017: Executed "/usr/bin/couchdb -k" -> 256. Terminating.

heart_beat_kill_pid = 1202
heart_beat_timeout = 11

heart_beat_kill_pid = 21749
heart_beat_timeout = 11
Killed
inet_gethost[1711]: WARNING:Unable to write to child process.
inet_gethost[1711]: WARNING:Unable to select on dying child file descriptor, errno = 9.

heart_beat_kill_pid = 8077
heart_beat_timeout = 11
heart: Mon Dec 18 01:32:56 2017: Erlang has closed.
Terminated
sh: echo: I/O error
heart: Mon Dec 18 01:32:57 2017: Executed "/usr/bin/couchdb -k" -> 256. Terminating.

The last lines of output from the couchdb.stdout file:

=ERROR REPORT==== 18-Dec-2017::01:33:31 ===
** Generic server <0.13137.79> terminating 
** Last message in was {'EXIT',<0.13139.79>,
                           {badarg,
                               [{ets,lookup,
                                    [couch_rep_id_to_rep_state,
                                     {"622580889a5576440ff2e9c08454d3b7",
                                      "+continuous+create_target"}],
                                    []},
                                {couch_replicator_manager,rep_state,1,
                                    [{file,"src/couch_replicator_manager.erl"},
                                     {line,617}]},
                                {couch_replicator_manager,
                                    replication_started,1,
                                    [{file,"src/couch_replicator_manager.erl"},
                                     {line,65}]},
                                {couch_replicator,do_init,1,
                                    [{file,"src/couch_replicator.erl"},
                                     {line,329}]},
                                {couch_replicator,init,1,
                                    [{file,"src/couch_replicator.erl"},
                                     {line,231}]},
                                {gen_server,init_it,6,
                                    [{file,"gen_server.erl"},{line,304}]},
                                {proc_lib,init_p_do_apply,3,
                                    [{file,"proc_lib.erl"},{line,239}]}]}}
** When Server state == {state,"https://<uname>:<pwd>@<domain.name>/lg39e96df4-f71a-42dc-96f1-da90bd46d872/",
                               20,
                               [<0.13136.79>],
                               [],
                               {[],[]}}
** Reason for termination == 
** {badarg,
       [{ets,lookup,
            [couch_rep_id_to_rep_state,
             {"622580889a5576440ff2e9c08454d3b7","+continuous+create_target"}],
            []},
        {couch_replicator_manager,rep_state,1,
            [{file,"src/couch_replicator_manager.erl"},{line,617}]},
        {couch_replicator_manager,replication_started,1,
            [{file,"src/couch_replicator_manager.erl"},{line,65}]},
        {couch_replicator,do_init,1,
            [{file,"src/couch_replicator.erl"},{line,329}]},
        {couch_replicator,init,1,
            [{file,"src/couch_replicator.erl"},{line,231}]},
        {gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},
        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
[error] [<0.296.0>] Could not open file /var/lib/couchdb/lg37be7786-fad0-4dd2-ae69-603e2c69fc1d.couch: file already exists
[info] [<0.269.0>] 10.15.0.2 - - PUT /lg37be7786-fad0-4dd2-ae69-603e2c69fc1d/ 412
[info] [<0.270.0>] 10.15.0.2 - - HEAD /lg37be7786-fad0-4dd2-ae69-603e2c69fc1d/ 200

Just in case anyone is facing the same issue: I managed to bring CPU utilisation back to normal levels by shutting down the couchdb instance that was running as a service.

      sudo service couchdb stop

And then starting couchdb as a background process with

     sudo couchdb -b

Somehow, if the couchdb instance is started as a service again, it eats up all the available CPU. I didn't get enough time to debug this (I'm guessing the upstart script needs to be debugged).

@penkeysuresh

When I tried sudo couchdb -b, I got sudo: couchdb: command not found, even though I installed it with sudo apt install couchdb.
