Scylla: Command nodetool cfstats stucked for ~2days.

Created on 29 Jan 2021 · 59Comments · Source: scylladb/scylla

Installation details
Scylla version (or git commit hash): 4.4.rc0-0.20210125.f470c5d4d with build-id ca840e9f9c508f2b3fb3d4176d3f64127857285a
Cluster size: 4 nodes (i3.4xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0e8ece5acf9742edf (aws: eu-central-1)

Test: longevity-200gb-48h-test
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):

[longevity-200GB-48h-verifier-LimitedMonkey-tls.yaml] (https://github.com/scylladb/scylla-cluster-tests/blob/7fdf66f6fc344f523ad904cf1415cc9c207bd9dd/test-cases/longevity/longevity-200GB-48h-verifier-LimitedMonkey-tls.yaml)
Test id: 775008b8-e5cb-4f19-99c8-b0cf0895735b

Issue description

====================================

During job nemesis several nemesis was executed without errors:
HardRebootNode
Enospc
NoCorruptRepair
AbortRepairMonkey
ManagementRepair
ManagementBackupWithSpecificKeyspaces
SnapshotOperations
Drainer
ManagementRepair

During these nemesis nodetool command executed many times and with different subcommands

The latest nemesis was TopPartitions. This nemesis had been running for 2d.
The nemesis check that toppartition is available with command nodetool help toppartition. then execute command nodetool cfstats and then execute nodetool toppartition ... with randomly generated parameters
The problem was that command nodetool cfstats stucked for 2 days. I trigger coredump manually for scylla-jmx and scylla kill -6 scylla-jmx|scylla.
The command nodetool cfstats failed after that it was killed:

Command: '/usr/bin/nodetool -u cassandra -pw cassandra  cfstats '

Exit code: 2

Stdout:

Using /etc/scylla/scylla.yaml as the config file
Cassandra has shutdown.

Stderr:

    at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.queryNames(RMIConnector.java:847)
    at org.apache.cassandra.tools.ColumnFamilyStoreMBeanIterator.getCFSMBeans(NodeProbe.java:1685)
    at org.apache.cassandra.tools.ColumnFamilyStoreMBeanIterator.<init>(NodeProbe.java:1645)
    at org.apache.cassandra.tools.NodeProbe.getColumnFamilyStoreMBeanProxies(NodeProbe.java:571)
    at org.apache.cassandra.tools.nodetool.stats.TableStatsHolder.initializeKeyspaces(TableStatsHolder.java:124)
    at org.apache.cassandra.tools.nodetool.stats.TableStatsHolder.<init>(TableStatsHolder.java:41)
    at org.apache.cassandra.tools.nodetool.TableStats.execute(TableStats.java:56)
    at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:270)
    at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:180)

Coredumps:

scylla-jmx

2021-01-28 14:23:22.991: (CoreDumpEvent Severity.ERROR) node=Node longevity-200gb-48h-verify-limited--db-node-775008b8-2 [18.198.189.25 | 10.0.3.88] (seed: False)
corefile_url=https://storage.cloud.google.com/upload.scylladb.com/core.scylla-jmx.997.8e76911fdab34376ab341f44a516319b.3742.1611843732000000/core.scylla-jmx.997.8e76911fdab34376ab341f44a516319b.3742.1611843732000000.gz
backtrace=           PID: 3742 (scylla-jmx)
           UID: 997 (scylla)
           GID: 1001 (scylla)
        Signal: 6 (ABRT)
     Timestamp: Thu 2021-01-28 14:22:12 UTC (55s ago)
  Command Line: /opt/scylladb/jmx/symlinks/scylla-jmx -Xmx256m -XX:+UseSerialGC -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.host=localhost -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=7199 -Djava.rmi.server.hostname=localhost -Dcom.sun.management.jmxremote.rmi.port=7199 -Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder -jar /opt/scylladb/jmx/scylla-jmx-1.0.jar
    Executable: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre/bin/java
 Control Group: /scylla.slice/scylla-helper.slice/scylla-jmx.service
          Unit: scylla-jmx.service
         Slice: scylla-helper.slice
       Boot ID: 8e76911fdab34376ab341f44a516319b
    Machine ID: cc2c86fe566741e6a2ff6d399c5d5daa
      Hostname: longevity-200gb-48h-verify-limited--db-node-775008b8-2
      Coredump: /var/lib/systemd/coredump/core.scylla-jmx.997.8e76911fdab34376ab341f44a516319b.3742.1611843732000000
       Message: Process 3742 (scylla-jmx) of user 997 dumped core.

                Stack trace of thread 3742:
                #0  0x00007f180c20c017 pthread_join (libpthread.so.0)
                #1  0x00007f180bde65b5 ContinueInNewThread0 (libjli.so)
                #2  0x00007f180bde2bc2 ContinueInNewThread (libjli.so)
                #3  0x00007f180bde3942 JLI_Launch (libjli.so)
                #4  0x000055f209a0079c main (java)
                #5  0x00007f180b82e555 __libc_start_main (libc.so.6)
                #6  0x000055f209a007c7 _start (java)

                Stack trace of thread 3838:
                #0  0x00007f180c20ea35 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f180b0b056b _ZN2os13PlatformEvent4parkEv (libjvm.so)
                #2  0x00007f180b08da4d _ZN13ObjectMonitor4waitElbP6Thread (libjvm.so)
                #3  0x00007f180b231ace _ZN18ObjectSynchronizer4waitE6HandlelP6Thread (libjvm.so)
                #4  0x00007f180ae9f9d0 JVM_MonitorWait (libjvm.so)
                #5  0x00007f17f55ae3e8 n/a (n/a)
                #6  0x00007f17f599d2b8 n/a (n/a)

download_instructions=gsutil cp gs://upload.scylladb.com/core.scylla-jmx.997.8e76911fdab34376ab341f44a516319b.3742.1611843732000000/core.scylla-jmx.997.8e76911fdab34376ab341f44a516319b.3742.1611843732000000.gz .
gunzip /var/lib/systemd/coredump/core.scylla-jmx.997.8e76911fdab34376ab341f44a516319b.3742.1611843732000000.gz

Scyllla core

corefile_url=https://storage.cloud.google.com/upload.scylladb.com/core.scylla.997.8e76911fdab34376ab341f44a516319b.3678.1611843926000000/core.scylla.997.8e76911fdab34376ab341f44a516319b.3678.1611843926000000.gz
backtrace=           PID: 3678 (scylla)
           UID: 997 (scylla)
           GID: 1001 (scylla)
        Signal: 6 (ABRT)
     Timestamp: Thu 2021-01-28 14:25:26 UTC (29min ago)
  Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 500 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 1-7,9-15 --lock-memory=1
    Executable: /opt/scylladb/libexec/scylla
 Control Group: /scylla.slice/scylla-server.slice/scylla-server.service
          Unit: scylla-server.service
         Slice: scylla-server.slice
       Boot ID: 8e76911fdab34376ab341f44a516319b
    Machine ID: cc2c86fe566741e6a2ff6d399c5d5daa
      Hostname: longevity-200gb-48h-verify-limited--db-node-775008b8-2
      Coredump: /var/lib/systemd/coredump/core.scylla.997.8e76911fdab34376ab341f44a516319b.3678.1611843926000000
       Message: Process 3678 (scylla) of user 997 dumped core.

                Stack trace of thread 3678:
                #0  0x00007f88231c2ecd sendmsg (libpthread.so.0)
                #1  0x0000000003901a2b _ZN7seastar9file_desc7sendmsgEPK6msghdri (scylla)
                #2  0x00000000038ae2e3 _ZN7seastar7reactor13do_write_someERNS_17pollable_fd_stateERNS_3net6packetE (scylla)
                #3  0x0000000003910004 _ZN7seastar19reactor_backend_aio10write_someERNS_17pollable_fd_stateERNS_3net6packetE (scylla)
                #4  0x00000000038ae8c8 _ZN7seastar17pollable_fd_state9write_allERNS_3net6packetE (scylla)
                #5  0x0000000003bd4b43 _ZN7seastar3net20posix_data_sink_impl3putENS0_6packetE (scylla)
                #6  0x000000000389f441 _ZN7seastar9data_sink3putENS_3net6packetE (scylla)
                #7  0x0000000003c166e4 _ZN7seastar3tls7session8vec_pushEPK5ioveci (scylla)
                #8  0x00007f8823976d20 _gnutls_io_write_flush (libgnutls.so.30)
                #9  0x00007f8823978738 _gnutls_send_tlen_int (libgnutls.so.30)
                #10 0x00007f8823979cb0 gnutls_record_send2 (libgnutls.so.30)
                #11 0x0000000003c13fb7 _ZZZN7seastar3tls7session6do_putEPNS_3net8fragmentES4_ENKUlRS3_E_clES5_ENUlvE_clEv (scylla)
                #12 0x0000000003c13d6b _ZN7seastar6repeatIZZNS_3tls7session6do_putEPNS_3net8fragmentES5_ENKUlRS4_E_clES6_EUlvE_EENS_6futureIvEEOT_ (scylla)
                #13 0x0000000003c13a47 _ZN7seastar8internal16do_for_each_implIPNS_3net8fragmentEZNS_3tls7session6do_putES4_S4_EUlRS3_E_EENS_6futureIvEET_SB_T0_ (scylla)
                #14 0x0000000003c12d8f _ZN7seastar3tls7session6do_putEPNS_3net8fragmentES4_ (scylla)
                #15 0x0000000003c13477 _ZZN7seastar14with_semaphoreINS_35semaphore_default_exception_factoryESt5_BindIFMNS_3tls7sessionEFNS_6futureIvEEPNS_3net8fragmentES9_EPS4_S9_S9_EENSt6chrono3_V212steady_clockEEENS_8futurizeINSt9result_ofIFT0_vEE4typeEE4typeERNS_15basic_semaphoreIT_T1_EEmOSK_ENUlSR_E_clINS_15semaphore_unitsIS1_SH_EEEEDaSR_ (scylla)
                #16 0x0000000003c132fe _ZN7seastar6futureINS_15semaphore_unitsINS_35semaphore_default_exception_factoryENSt6chrono3_V212steady_clockEEEE9then_implIZNS_14with_semaphoreIS2_St5_BindIFMNS_3tls7sessionEFNS0_IvEEPNS_3net8fragmentESG_EPSC_SG_SG_EES5_EENS_8futurizeINSt9result_ofIFT0_vEE4typeEE4typeERNS_15basic_semaphoreIT_T1_EEmOSO_EUlSV_E_SD_EESO_OSV_ (scylla)
                #17 0x0000000003c12d04 _ZN7seastar14with_semaphoreINS_35semaphore_default_exception_factoryESt5_BindIFMNS_3tls7sessionEFNS_6futureIvEEPNS_3net8fragmentES9_EPS4_S9_S9_EENSt6chrono3_V212steady_clockEEENS_8futurizeINSt9result_ofIFT0_vEE4typeEE4typeERNS_15basic_semaphoreIT_T1_EEmOSK_ (scylla)
                #18 0x0000000003c12b0b _ZN7seastar3tls7session3putENS_3net6packetE (scylla)
                #19 0x0000000003c12911 _ZN7seastar3tls25tls_connected_socket_impl9sink_impl3putENS_3net6packetE (scylla)
                #20 0x000000000389f441 _ZN7seastar9data_sink3putENS_3net6packetE (scylla)
                #21 0x00000000038f1799 _ZN7seastar13output_streamIcE10poll_flushEv (scylla)
                #22 0x0000000003904527 _ZN7seastar7reactor18batch_flush_pollfn4pollEv (scylla)
                #23 0x00000000038e8209 _ZNSt17_Function_handlerIFbvEZN7seastar7reactor3runEvE4$_79E9_M_invokeERKSt9_Any_data (scylla)
                #24 0x00000000038ca9c6 _ZN7seastar7reactor3runEv (scylla)
                #25 0x0000000003868866 _ZN7seastar12app_template14run_deprecatedEiPPcOSt8functionIFvvEE (scylla)
                #26 0x0000000003867c87 _ZN7seastar12app_template3runEiPPcOSt8functionIFNS_6futureIiEEvEE (scylla)
                #27 0x0000000000df9e9d main (scylla)
                #28 0x00007f88224901e2 __libc_start_main (libc.so.6)
                #29 0x0000000000df6d8e _start (scylla)

                Stack trace of thread 3692:
                #0  0x00007f88231c20fc read (libpthread.so.0)
                #1  0x000000000390da66 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x000000000390e120 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEE3$_0E9_M_invokeERKSt9_Any_data (scylla)
                #3  0x00000000038950fb _ZN7seastar12posix_thread13start_routineEPv (scylla)
                #4  0x00007f88231b83f9 start_thread (libpthread.so.0)
                #5  0x00007f8822569903 __clone (libc.so.6)

                Stack trace of thread 3693:
                #0  0x00007f88231c20fc read (libpthread.so.0)
                #1  0x000000000390da66 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
                #2  0x000000000390e120 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC1EPNS1

download_instructions=gsutil cp gs://upload.scylladb.com/core.scylla.997.8e76911fdab34376ab341f44a516319b.3678.1611843926000000/core.scylla.997.8e76911fdab34376ab341f44a516319b.3678.1611843926000000.gz .
gunzip /var/lib/systemd/coredump/core.scylla.997.8e76911fdab34376ab341f44a516319b.3678.1611843926000000.gz

====================================

Restore Monitor Stack command: $ hydra investigate show-monitor 775008b8-e5cb-4f19-99c8-b0cf0895735b
Show all stored logs command: $ hydra investigate show-logs 775008b8-e5cb-4f19-99c8-b0cf0895735b

Jenkins job URL

Backport candidate bug nodetool

Source

aleksbykov

All 59 comments

@amnonh please have a look - if you need help with opening the cores - please see if someone from benny's team can help you.

slivne on 31 Jan 2021

Refs #7244

bhalevy on 31 Jan 2021

Cores will not help in this case, can we have a system that is stack so I can test what part is not reponding?

amnonh on 31 Jan 2021

@elcallio there are a few different reports of nodetool commands getting stack, is it possible that it relates to:
RegistrationChecker? It's one of the only places in the flow thta uses locks

amnonh on 31 Jan 2021

Cores will not help in this case, can we have a system that is stack so I can test what part is not reponding?

@amnonh it is not reproduced always. Once i got same situation, i will provide the node

aleksbykov on 2 Feb 2021

Might I suggest using jconsole/jps or even just ctrl-break (sigint) to get the scylla-jmx threaddump instead? It will tell you a lot more about whether anything is broked on the jmx side.
If there is a deadlock, this will tell you.

elcallio on 2 Feb 2021

👍1

@roydahan / @fruch - please change ccm so we will have the info calle requested that is, instead of sending sigkill, to send a sigint - so we will be able to collect the information calle is asking for. maybe that will be enough. If not we may need more

slivne on 4 Feb 2021

@roydahan / @fruch - please change ccm so we will have the info calle requested that is, instead of sending sigkill, to send a sigint - so we will be able to collect the information calle is asking for. maybe that will be enough. If not we may need more

we could add such automation to CCM (I think @bhalevy added something similar, at least the timeout to be working correctly), but the report here was from SCT.
Alex was killing it as we usually kill scylla to get a coredump (I think by my recommendation), we'll try keep in mind JMX should be "murdered" in a bit of different approach, or to send the connection information straight to @elcallio if reproduced.

fruch on 4 Feb 2021

@elcallio - please take over - @fruch will explain how to run this in BYO

slivne on 18 Feb 2021

When the nodetool cfstats program is stuck, you can obtain a Java stack trace as follows:

jstack $(jps | grep jmx | awk -F' ' '{ print $1 }')

Example output looks as follows:

2021-02-18 14:45:29
Full thread dump OpenJDK 64-Bit Server VM (11.0.9+11 mixed mode, sharing):

Threads class SMR info:
_java_thread_list=0x00007f4750000c30, length=14, elements={
0x00007f4788016000, 0x00007f47880b7800, 0x00007f47880b9800, 0x00007f47880c1000,
0x00007f47880c3000, 0x00007f47880c5800, 0x00007f47880c7800, 0x00007f4788128000,
0x00007f4788160800, 0x00007f4788289800, 0x00007f47882ca000, 0x00007f4788307800,
0x00007f4788376800, 0x00007f4750001000
}

"main" #1 prio=5 os_prio=0 cpu=504.94ms elapsed=176.05s tid=0x00007f4788016000 nid=0x3d9c8 waiting on condition  [0x00007f478f302000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep([email protected]/Native Method)
    at com.scylladb.jmx.main.Main.main(Main.java:58)

...

@elcallio Is it possible that the deadlock could be in nodetool itself?

penberg on 18 Feb 2021

@elcallio - please take over - @fruch will explain how to run this in BYO

just to clarify again, this issue is about something we don't have a clear reproducer (i.e. not in SCT), we seen stuck maybe similar in dtests for a while now, but again with no clear reproducer (as far as I understood from @bhalevy)

We did https://github.com/scylladb/scylla-ccm/pull/304 in CCM to try to help flush it out during dtests runs. it didn't helped since we still don't have the details we were expecting to get on the stdout (and we don't know why it's not available on those cases it does happen).

so again, what BYO are we talking about here ? (cause I'm not aware of a specific set of dtest that would reproduce it, nor an SCT test case that would)

fruch on 18 Feb 2021

dtests that tend to reproduce it quite frequently (at least on monster/godzilla and the parallel dtest fleet machines) are:

secondary_indexes_test:TestSecondaryIndexes.test_insert_data_after_recreating_cf
secondary_indexes_test:TestSecondaryIndexes.test_insert_data_after_recreating_ks
secondary_indexes_test:TestLocalIndexes.test_insert_data_after_recreating_ks_with_local_index
cfid_test:TestCFID.cfid_test

Ref #7244

bhalevy on 18 Feb 2021

@elcallio - can you please take this over - and potentially do a hack in CCM to get the info you need (@fruch can instruct you on how t run the BYO)

slivne on 22 Feb 2021

@elcallio - can you please take this over - and potentially do a hack in CCM to get the info you need (@fruch can instruct you on how t run the BYO)

this is running the tests @bhalevy pointed
https://jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/458/

@elcallio rerun and replace CCM_BRANCH and CCM_REPO to you own fork and branch of ccm, with any changes you'll need it in.

fruch on 22 Feb 2021

How do I know if a timeout has occurred? I almost assume this is a property of the test runner (jenkins)?

elcallio on 22 Feb 2021

nvm. should be an actual node.nodetool timeout, right?

elcallio on 22 Feb 2021

I think we are sure we hitting this code, on those cases:
https://github.com/scylladb/scylla-ccm/blob/next/ccmlib/scylla_node.py#L661

and on my desk sending SIGQUIT was enough to generate a threaddump, when setting the timeout of python subprocess to a very small number. see https://github.com/scylladb/scylla-ccm/blob/next/tests/test_scylla_relocatable_cluster.py#L20

but for some reason I can't explain, when the issue is actually reproduced (on CI) we don't see the threaddump on the log of jmx (as I can see it in the ccm unittest)

fruch on 22 Feb 2021

@bhalevy I can't see how any of the aforementioned tests can reach nodetool cfstats. It is afaict not called directly, nor it the only method leading to it in ccm - scylla_node::flush() called from what I can see?

@fruch - do you see your other printout (the error print in link)? Otherwise perhapsyou are not actually running the library you hoped? Fwiw, the change there is more or less what I intended, though for cfstats you actually have _two_ timeouts - one for the actual process, one for the log parsing (scylla_node.py:1198) - you could be hitting one before the other?

elcallio on 22 Feb 2021

The stack trace of e.g. https://jenkins.scylladb.com/view/scylla-4.4/job/scylla-4.4/job/dtest-release/20/testReport/secondary_indexes_test/TestSecondaryIndexes/test_insert_data_after_recreating_cf/ shows how we indirectly call nodetool (info in this case)

Traceback (most recent call last):
  File "/usr/lib64/python3.7/unittest/case.py", line 60, in testPartExecutor
    yield
  File "/usr/lib64/python3.7/unittest/case.py", line 645, in run
    testMethod()
  File "/jenkins/workspace/scylla-4.4/dtest-release/scylla-dtest/secondary_indexes_test.py", line 443, in test_insert_data_after_recreating_cf
    self.config_keyspace(session, ks_name, table_name, index, ks_create=False)
  File "/jenkins/workspace/scylla-4.4/dtest-release/scylla-dtest/secondary_indexes_test.py", line 247, in config_keyspace
    index['index_column'], index['index_name'], compaction=self.compaction_strategy),
  File "/jenkins/workspace/scylla-4.4/dtest-release/scylla-dtest/secondary_indexes_test.py", line 54, in create_and_build_index
    return index_is_built(cluster, session, ks_name, table_name, index_name)
  File "/jenkins/workspace/scylla-4.4/dtest-release/scylla-dtest/scylla_tools.py", line 983, in index_is_built
    wait_for_view(cluster, session, ks_name, get_index_view_name(index_name), raise_exception=raise_exception)
  File "/jenkins/workspace/scylla-4.4/dtest-release/scylla-dtest/scylla_tools.py", line 1019, in wait_for_view
    if _view_build_finished_on_live_nodes():
  File "/jenkins/workspace/scylla-4.4/dtest-release/scylla-dtest/scylla_tools.py", line 1006, in _view_build_finished_on_live_nodes
    if node.is_live() and not (UUID(node.hostid()) in done):
  File "/jenkins/workspace/scylla-4.4/dtest-release/scylla-ccm/ccmlib/node.py", line 1334, in hostid
    info = self.nodetool('info', capture_output=True, timeout=timeout)[0]
  File "/jenkins/workspace/scylla-4.4/dtest-release/scylla-ccm/ccmlib/scylla_node.py", line 654, in nodetool
    return super().nodetool(*args, **kwargs)
  File "/jenkins/workspace/scylla-4.4/dtest-release/scylla-ccm/ccmlib/node.py", line 769, in nodetool
    stdout, stderr = p.communicate(timeout=timeout)
  File "/usr/lib64/python3.7/subprocess.py", line 964, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib64/python3.7/subprocess.py", line 1716, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib64/python3.7/subprocess.py", line 1011, in _check_timeout
    stderr=b''.join(stderr_seq) if stderr_seq else None)
subprocess.TimeoutExpired: Command '['/jenkins/workspace/scylla-4.4/dtest-release/scylla/.ccm/scylla-repository/e11ae8c58fc142e020595698b2f1723dc8c3ea15/scylla-tools-java/bin/nodetool', '-h', '127.0.32.3', '-p', '7332', 'info']' timed out after 60 seconds

https://jenkins.scylladb.com/view/scylla-4.4/job/scylla-4.4/job/dtest-debug/20/testReport/junit/cfid_test/TestCFID/cfid_test/ shows a path to nodetool cfstats:

Traceback (most recent call last):
  File "/usr/lib64/python3.7/unittest/case.py", line 60, in testPartExecutor
    yield
  File "/usr/lib64/python3.7/unittest/case.py", line 645, in run
    testMethod()
  File "/jenkins/workspace/scylla-4.4/dtest-debug/scylla-dtest/cfid_test.py", line 25, in cfid_test
    node1.flush()
  File "/jenkins/workspace/scylla-4.4/dtest-debug/scylla-ccm/ccmlib/scylla_node.py", line 1198, in flush
    self._wait_no_pending_flushes()
  File "/jenkins/workspace/scylla-4.4/dtest-debug/scylla-ccm/ccmlib/scylla_node.py", line 1191, in _wait_no_pending_flushes
    result = wait_for(no_pending_flushes, timeout=wait_timeout, step=1.0)
  File "/jenkins/workspace/scylla-4.4/dtest-debug/scylla-ccm/ccmlib/scylla_node.py", line 54, in wait_for
    output = func()
  File "/jenkins/workspace/scylla-4.4/dtest-debug/scylla-ccm/ccmlib/scylla_node.py", line 1181, in no_pending_flushes
    stdout, _ = self.nodetool('cfstats', timeout=wait_timeout)
  File "/jenkins/workspace/scylla-4.4/dtest-debug/scylla-ccm/ccmlib/scylla_node.py", line 654, in nodetool
    return super().nodetool(*args, **kwargs)
  File "/jenkins/workspace/scylla-4.4/dtest-debug/scylla-ccm/ccmlib/node.py", line 769, in nodetool
    stdout, stderr = p.communicate(timeout=timeout)
  File "/usr/lib64/python3.7/subprocess.py", line 964, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib64/python3.7/subprocess.py", line 1716, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib64/python3.7/subprocess.py", line 1011, in _check_timeout
    stderr=b''.join(stderr_seq) if stderr_seq else None)
subprocess.TimeoutExpired: Command '['/jenkins/workspace/scylla-4.4/dtest-debug/scylla/.ccm/scylla-repository/e11ae8c58fc142e020595698b2f1723dc8c3ea15/scylla-tools-java/bin/nodetool', '-h', '127.0.44.1', '-p', '7144', 'cfstats']' timed out after 60 seconds

bhalevy on 22 Feb 2021

@bhalevy - that still makes it look like test_insert_data_after_recreating_cf does not call cfstats. It calls nodeid(), which is a different nodetool command. So is the issue now that we have general nodetool hangs - not limited to cfstats?

elcallio on 22 Feb 2021

@bhalevy - that still makes it look like test_insert_data_after_recreating_cf does not call cfstats. It calls nodeid(), which is a different nodetool command. So is the issue now that we have general nodetool hangs - not limited to cfstats?

@elcallio, yes, I think so. Both nodetool info and cfstats hang and we currently suspect both may be caused by scylla-jmx hanging.

bhalevy on 22 Feb 2021

@bhalevy if that is the case, then the change pointed to by @fruch should have printed the jmx stacktraces. Unless the dtest is running with wrong CCM, I don't know what else to add to the concoction.
My best guess is that it is querying all column family beans that hangs, since that is a common denominator between info and cfstats. But I have no idea why that would be.

elcallio on 22 Feb 2021

Why don't we see anything in the scylla-jmx log when it's killed with SIGQUIT?

I suggest you try to reproduce the issue with an instrumented ccm that will either leave the system running for further manual examination or somehow gather more information automatically.

bhalevy on 22 Feb 2021

My guess again is that you did not actually run the code. Which is why I asked if the logs showed the explicit error message stating "I will kill with SIGQUIT".

elcallio on 22 Feb 2021

Or does this even contain any of the output in the scylla-jmx log file btw? That is where the printout would end up. Not that I see the scylla_node printout in the log either though.

elcallio on 22 Feb 2021

@elcallio

My guess again is that you did not actually run the code. Which is why I asked if the logs showed the explicit error message stating "I will kill with SIGQUIT".
For the printout from ccm, see https://jenkins.scylladb.com/view/scylla-4.4/job/scylla-4.4/job/dtest-release/20/artifact/logs-release.2/dtest.log:

2021-02-19 03:06:20,612 23840   ccm                            ERROR    | node3: nodetool timeout, going to kill scylla-jmx with SIGQUIT
...
2021-02-19 03:07:02,168 24742   ccm                            ERROR    | node2: nodetool timeout, going to kill scylla-jmx with SIGQUIT

4.4 doesn't have the 5 seconds wait after that (@fruch please backport it), but it didn't help in master either.

Or does this even contain any of the output in the scylla-jmx log file btw? That is where the printout would end up. Not that I see the scylla_node printout in the log either though.
The scylla-jmx log contains just the normal contents.
E.g. https://jenkins.scylladb.com/view/scylla-4.4/job/scylla-4.4/job/dtest-release/20/artifact/logs-release.2/1613703983659_secondary_indexes_test.TestSecondaryIndexes.test_insert_data_after_recreating_cf/node3_system.log.jmx

Using config file: /jenkins/workspace/scylla-4.4/dtest-release/scylla/.dtest/dtest-2zu4hq7w/test/node3/conf/scylla.yaml
Connecting to http://127.0.32.3:10000
Starting the JMX server
JMX is enabled to receive remote connections on port: 7332

In other cases ion which scylla-jmx reported errors, they did appear in this log, which captures both stdout and stderr.

bhalevy on 22 Feb 2021

Yeah, I read that too finally. Tried to do a special jenkins run of this, but it broke on "No valid crumb" in the pipeline. Will try again tomorrow.

elcallio on 22 Feb 2021

So I noticed that while the kill -QUIT did run for a test in a test run, there are _no_ artifacts generated from that tests. Which is probably because self.jmx_pid might not be set yet (it is only updated on shutdown).
I made a version using jstack instead, and also refreshing jmx pid first. But I can't figure out how/if I can run a CCM fork in the dtest jenkins job? I can only set tag/commit/branch? @fruch - please advise...

elcallio on 24 Feb 2021

So I noticed that while the kill -QUIT did run for a test in a test run, there are _no_ artifacts generated from that tests. Which is probably because self.jmx_pid might not be set yet (it is only updated on shutdown).
I made a version using jstack instead, and also refreshing jmx pid first. But I can't figure out how/if I can run a CCM fork in the dtest jenkins job? I can only set tag/commit/branch? @fruch - please advise...

Git me the repo url and the branch name, I'll trigger and share it the link

fruch on 24 Feb 2021

The good news is I managed to get a thread dump. The bad news is that it is impossible.
Right now I am wondering if we're seeing a hotspot bug of some sorts. Either a runtime error being hidden and screwing up java concurrent locks, or an inliner bug.
What determines which jdk the dtest suite runs against? Is it in the docker image, or...? Can one say try forcing it to run using jdk 11?

elcallio on 24 Feb 2021

It's using the dtest docker image (which is based on old version of dbuild)

If you can hack the dockerfile to suit what you suggest, we can take it for
a spin.

On Wed, Feb 24, 2021, 17:35 Calle Wilund notifications@github.com wrote:

The good news is I managed to get a thread dump. The bad news is that it
is impossible.
Right now I am wondering if we're seeing a hotspot bug of some sorts.
Either a runtime error being hidden and screwing up java concurrent locks,
or an inliner bug.
What determines which jdk the dtest suite runs against? Is it in the
docker image, or...? Can one say try forcing it to run using jdk 11?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/7991#issuecomment-785161450,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AACTH4YF2FKPRH6FAWMQTJ3TAUMDBANCNFSM4WYR2A4A
.

fruch on 24 Feb 2021

what's wrong with https://jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/build?delay=0sec ?
one can specify the ccm repository and branch, isn't it what we need?

bhalevy on 24 Feb 2021

We're past the ccm part. @fruch was kind enough to point me to the right dtest job. But seemingly, using a different JDK is more difficult. Not certain anything is even related to this of course, but...

elcallio on 24 Feb 2021

From what I can tell anyway, the tools Dockerimage is utlimately based fedora-29, which from what I can tell has a slightly older jdk8 even. But this is still grasping at straws.

For the interested, the dumps I got both look like this:

17:05:45  "RMI TCP Connection(11)-172.17.0.2" #17 daemon prio=5 os_prio=0 tid=0x00007f4e18001800 nid=0x4a58 waiting on condition [0x00007f4e0fff0000]
17:05:45     java.lang.Thread.State: WAITING (parking)
17:05:45    at sun.misc.Unsafe.park(Native Method)
17:05:45    - parking to wait for  <0x00000000f5731c68> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
17:05:45    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
17:05:45    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
17:05:45    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
17:05:45    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
17:05:45    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
17:05:45    at com.scylladb.jmx.utils.APIBuilder$TableRepository.query(APIBuilder.java:232)
17:05:45    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.queryNamesImpl(DefaultMBeanServerInterceptor.java:562)
17:05:45    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.queryNames(DefaultMBeanServerInterceptor.java:554)
17:05:45    at com.sun.jmx.mbeanserver.JmxMBeanServer.queryNames(JmxMBeanServer.java:619)
17:05:45    at com.scylladb.jmx.metrics.APIMBean.queryNames(APIMBean.java:103)
17:05:45    at com.scylladb.jmx.metrics.APIMBean.checkRegistration(APIMBean.java:65)
17:05:45    at org.apache.cassandra.db.ColumnFamilyStore$1.doCheck(ColumnFamilyStore.java:201)
17:05:45    at com.scylladb.jmx.metrics.RegistrationChecker.check(RegistrationChecker.java:61)
17:05:45    at com.scylladb.jmx.metrics.RegistrationChecker.check(RegistrationChecker.java:44)
17:05:45    at com.scylladb.jmx.utils.APIMBeanServer.checkRegistrations(APIMBeanServer.java:318)
17:05:45    at com.scylladb.jmx.utils.APIMBeanServer.queryNames(APIMBeanServer.java:135)
17:05:45    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
17:05:45    at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
17:05:45    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309)
17:05:45    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401)
17:05:45    at javax.management.remote.rmi.RMIConnectionImpl.queryNames(RMIConnectionImpl.java:572)
17:05:45    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
17:05:45    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
17:05:45    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
17:05:45    at java.lang.reflect.Method.invoke(Method.java:498)
17:05:45    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
17:05:45    at sun.rmi.transport.Transport$1.run(Transport.java:200)
17:05:45    at sun.rmi.transport.Transport$1.run(Transport.java:197)
17:05:45    at java.security.AccessController.doPrivileged(Native Method)
17:05:45    at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
17:05:45    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
17:05:45    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
17:05:45    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
17:05:45    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$49/1128551206.run(Unknown Source)
17:05:45    at java.security.AccessController.doPrivileged(Native Method)
17:05:45    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
17:05:45    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
17:05:45    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
17:05:45    at java.lang.Thread.run(Thread.java:748)
17:05:45  
17:05:45     Locked ownable synchronizers:
17:05:45    - <0x00000000f564cff0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
17:05:45    - <0x00000000f5731d80> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

---

17:05:45  "pool-1-thread-1" #10 prio=5 os_prio=0 tid=0x00007f4e54371800 nid=0x49d5 waiting on condition [0x00007f4e44cfb000]
17:05:45     java.lang.Thread.State: TIMED_WAITING (parking)
17:05:45    at sun.misc.Unsafe.park(Native Method)
17:05:45    - parking to wait for  <0x00000000f564d840> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
17:05:45    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
17:05:45    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
17:05:45    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
17:05:45    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
17:05:45    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
17:05:45    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
17:05:45    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
17:05:45    at java.lang.Thread.run(Thread.java:748)
17:05:45  
17:05:45     Locked ownable synchronizers:
17:05:45    - <0x00000000f5731c68> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)

For those familiar with scylla-jmx, the first thread is a RMI response thread, processing a JMX call. It is doing a CF population check, since we are looking at CF:s/CF metrics. It ultimately needs to read-lock the JMX registry object.
This lock is held by the second thread, which is a background worker doing reaping of dead CF objects etc. But note that this thread is in wait state. It has done its stuff, but seemingly holds the (write) lock to registry. Which is bananas. Because if you look at the code in question, you can clearly see these lock/unlock clauses are correctly guarded by finally clauses.

Which is why I wonder if we've managed to get (silent) either stack overflow in lock call, causing a lock to think it is locked before actually returning to the guarded code block, or if hotspot managed to foobar control flow (compiler bug).

Of course, occams razor would still suggest something is wrong with my reading here, but...

elcallio on 24 Feb 2021

@fruch can we update the docker running dtest to use java 11? (Do we require that also in documentation for installing scylla)
Also, moving to fedora 33 is fine.

roydahan on 28 Feb 2021

I remain confused by jenkins. Can I run the byo-dtest with a modified scylla/scylla-tools-java?

elcallio on 1 Mar 2021

I remain confused by jenkins. Can I run the byo-dtest with a modified scylla/scylla-tools-java?

Currently not. @hagitsegev / @yaronkaikov how are would it be to add support for specifying also repo and branch for scylla-tools-java and scylla-jmx?

bhalevy on 1 Mar 2021

If you are using master - then you can reference it from your scylla repo -
scylla-jmx and scylla-tools are submodules of scylla.

On Mon, Mar 1, 2021 at 12:34 PM Benny Halevy notifications@github.com
wrote:

I remain confused by jenkins. Can I run the byo-dtest with a modified
scylla/scylla-tools-java?

Currently not. @hagitsegev https://github.com/hagitsegev / @yaronkaikov
https://github.com/yaronkaikov how are would it be to add support for
specifying also repo and branch for scylla-tools-java and scylla-jmx?

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/7991#issuecomment-787841389,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AA2OCCERO5D7Y2X6QYPMJD3TBNUR7ANCNFSM4WYR2A4A
.

slivne on 1 Mar 2021

(yet I have not done that - I hope it works ....) - you will probably need
to make sure you to use absolute references and not relative

[shlomi@localhost scylla]$ cat .gitmodules
[submodule "seastar"]
path = seastar
url = ../seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui
url = ../scylla-swagger-ui
ignore = dirty
[submodule "libdeflate"]
path = libdeflate
url = ../libdeflate
[submodule "abseil"]
path = abseil
url = ../abseil-cpp
[submodule "scylla-jmx"]
path = tools/jmx
url = ../scylla-jmx
[submodule "scylla-tools"]
path = tools/java
url = ../scylla-tools-java
[submodule "scylla-python3"]
path = tools/python3
url = ../scylla-python3

On Mon, Mar 1, 2021 at 12:37 PM Shlomi Livne shlomi@scylladb.com wrote:

If you are using master - then you can reference it from your scylla repo

scylla-jmx and scylla-tools are submodules of scylla.

On Mon, Mar 1, 2021 at 12:34 PM Benny Halevy notifications@github.com
wrote:

I remain confused by jenkins. Can I run the byo-dtest with a modified
scylla/scylla-tools-java?

Currently not. @hagitsegev https://github.com/hagitsegev / @yaronkaikov
https://github.com/yaronkaikov how are would it be to add support for
specifying also repo and branch for scylla-tools-java and scylla-jmx?

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/7991#issuecomment-787841389,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AA2OCCERO5D7Y2X6QYPMJD3TBNUR7ANCNFSM4WYR2A4A
.

slivne on 1 Mar 2021

I know how submodules work. But dtest-byo runs from an existing jenkins scylla build from what I can see, and I can't see how I make one of those from a scylla fork.

elcallio on 1 Mar 2021

There is a package-byo, but I cannot see any place to put a scylla git url there either.

elcallio on 1 Mar 2021

@elcallio what you will need to do is to to use https://jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/byo_build_tests_dtest with your fork scylla repo (including the changes in submodules) . then you will need to build scylla from your fork

yaronkaikov on 1 Mar 2021

It's a bitch to provide your own scylla repo to byo_build_tests_dtest since unless you overhaul .gitmodules, you need all the sibling submodule repositories forked in your account.
It should be much easier to override the dtest scripts/run_test.sh TOOLS_JAVA_DIR and JMX_DIR environment variables, and symlink in the scylla repository TOOLS_JAVA_DIR to `resources/cassandra'.

bhalevy on 1 Mar 2021

Well, I got a build+dtest with an augmented jmx agent, droppping the RW-lock in favour of plain old synchronized. It has not deadlocked yet (been running the dtest portion for ~3.5h - just the index tests). However, this does not really say anything very useful, esp. since I am not convinced it is running on the same machine as the previous, deadlocked dtest run (even though I set the same node name/criteria).
So, very slim info so far. Still wish I could repro this somewhere I have ssh + X...

elcallio on 1 Mar 2021

Well, I got a build+dtest with an augmented jmx agent, droppping the RW-lock in favour of plain old synchronized. It has not deadlocked yet (been running the dtest portion for ~3.5h - just the index tests). However, this does not really say anything very useful, esp. since I am not convinced it is running on the same machine as the previous, deadlocked dtest run (even though I set the same node name/criteria).
So, very slim info so far. Still wish I could repro this somewhere I have ssh + X...

It is reproduced about 1 out of 3-4 runs on my local machine when I'm running the test with scripts/run_test.sh (that uses the dtest docker image). If you give me a url to your jmx repo and branch I can try testing it too.

bhalevy on 2 Mar 2021

jmx repo:
seastar-dev: calle/cpdtest

elcallio on 2 Mar 2021

👍1

CASSANDRA_DIR=../scylla/build/release JMX_DIR=~/dev/seastar-dev NOSE_PROCESSES=2 ./scripts/run_test.sh secondary_indexes_test:TestSecondaryIndexes.test_insert_data_after_recreating_{ks,cf} passed 3 consecutive times times with scylladb/seastar-dev@d5da598063 (calle/cpdtest), seem to look good so far.

bhalevy on 2 Mar 2021

@bhalevy so whats the verdict - should calle send the patch ?

slivne on 2 Mar 2021

It seem to work so I'm in favor.
But at the same time we should upgrade the dtest docker image to match the AMI. (@penberg would that be based on Ubuntu 20.04 LTS?)

bhalevy on 2 Mar 2021

Even if the problem magically goes away by the change I made, we still don't have a reasonable explanation for the problem. The RW-lock should _not_ fail, at least not without any visible side effects (i.e. dead threads and/or exception printouts). At the very least, I'd like the issue reproed and be able to inspect it, maybe check the machine code generated (ugh for doing that on hotspot).

elcallio on 2 Mar 2021

@elcallio
Seem like I have a cluster that reproduces this if it helps:

ssh -i ~/.ssh/scylla-qa-ec2 [email protected]

[scyllaadm@longevity-200gb-48h-verify-limited--db-node-268b9c73-4 ~]$ ps aux | grep node
scyllaa+  87407  0.0  0.0 113300  3196 ?        Ss   12:43   0:00 /bin/sh /opt/scylladb/share/cassandra/bin/nodetool -u cassandra -pw cassandra cfstats keyspace1.standard1

feel free to get any info you want out of this one

fruch on 3 Mar 2021

@elcallio
I've took a threaddump out of it:
https://pastebin.com/bKG8b4A4

and I'm killing that cluster

fruch on 3 Mar 2021

That dump is more interesting; Here we have a thread waiting for the RW-lock, but nobody owns it. That at least would be more consistent with an exception or something.
You do not happen to have recovered the console log for the jmx process?

elcallio on 3 Mar 2021

@elcallio if you have a workaround - I think it would be good to merge it to make sure that we will not hit in the field when / if someone runs with the problematic env (whatever that maybe).

slivne on 3 Mar 2021

I think it is very bad to add a "fix" for which you do not know the reason. But I've made a PR with the change @bhalevy ran. It makes me cry.
/scylladb/scylla-jmx/pull/161

elcallio on 3 Mar 2021

I verified the fix locally with scylladb/scylla@e12e57c91 with scripts/run_test.sh secondary_indexes_test:TestSecondaryIndexes.test_insert_data_after_recreating_{ks,cf}

bhalevy on 4 Mar 2021

That dump is more interesting; Here we have a thread waiting for the RW-lock, but nobody owns it. That at least would be more consistent with an exception or something.
You do not happen to have recovered the console log for the jmx process?

you are look at the sudo journalctl -t scylla-jmx there, as far as I know all the stdout goes there
I've killed it with SIGQUIT since the other mean where available out of the box in our AMIs.

fruch on 4 Mar 2021

We got this issue with 4.4.rc1-0.20210223.9fc582ee8.
I don't know if it's relevant or already fixed.

Installation details
Scylla version (or git commit hash): 4.4.rc1-0.20210223.9fc582ee8 with build-id a6ce2528451d7c29e1555c15960085dab0751b78
Cluster size: 4 nodes (i3en.3xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0b8e9fcc7bfa8fab3 (aws: eu-north-1)

Test: longevity-large-partition-4days-test
Test name: longevity_large_partition_test.LargePartitionLongevityTest.test_large_partition_longevity
Test config file(s):

longevity-large-partition-4days.yaml

Issue description

During ShowTopPartitions nemesis was run "nodetool cfstats". This command stuck for about 2.5 days.

nodetool help toppartitions was running (successfully) before cfstats:

< t:2021-02-26 15:07:51,712 f:remote_base.py  l:520  c:RemoteCmdRunner      p:DEBUG > Running command "/usr/bin/nodetool  help toppartitions"...

After that:

< t:2021-02-26 15:07:52,464 f:remote_base.py  l:520  c:RemoteCmdRunner      p:DEBUG > Running command "/usr/bin/nodetool  cfstats "...

This command never back.

Restore Monitor Stack command: $ hydra investigate show-monitor 292de59d-840c-493c-a164-263a3ea59ea3
Show all stored logs command: $ hydra investigate show-logs 292de59d-840c-493c-a164-263a3ea59ea3

Test id: 292de59d-840c-493c-a164-263a3ea59ea3

Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/292de59d-840c-493c-a164-263a3ea59ea3/20210301_040748/grafana-screenshot-longevity-large-partition-4days-test-scylla-per-server-metrics-nemesis-20210301_041130-longevity-large-partitions-4d-4-4-monitor-node-292de59d-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/292de59d-840c-493c-a164-263a3ea59ea3/20210301_040748/grafana-screenshot-overview-20210301_040748-longevity-large-partitions-4d-4-4-monitor-node-292de59d-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/292de59d-840c-493c-a164-263a3ea59ea3/20210301_041635/db-cluster-292de59d.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/292de59d-840c-493c-a164-263a3ea59ea3/20210301_041635/loader-set-292de59d.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/292de59d-840c-493c-a164-263a3ea59ea3/20210301_041635/monitor-set-292de59d.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/292de59d-840c-493c-a164-263a3ea59ea3/20210301_041635/sct-runner-292de59d.zip

Jenkins job URL