Scylla: tuned undoes scylla-kernel-conf sysctl tuning (Slowdown of inserts during LWT longevity)

Created on 25 Nov 2020  Â·  67Comments  Â·  Source: scylladb/scylla

Installation details
Scylla version (or git commit hash): 4.1.10-0.20201119.01c822301f
Cluster size: 4 nodes (i3.2xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0e2cbb2c9e1e42eea

longevity-lwt-3h runs more time then planned.

"pre-regression" test

Last ("pre-regression") test was run with ami-0b36d79ff69153396 (Scylla version: 4.1.9-0.20201109.fb14fae79b)

From https://jenkins.scylladb.com/job/scylla-4.1/job/longevity/job/longevity-lwt-3h/76 (test was finished in 4h15m) found that prepare (prefill data by c-s inserts) part duration was 30 min.
Prepare started at 20:46 and finished 21:15

C-s statistics

From cassandra-stress stats:

'op rate': '5960', 
'partition rate': '5960', 
'row rate': '6748', 
'latency mean': '5.0', 
'latency median': '3.7', 
'latency 95th percentile': '10.0', 
'latency 99th percentile': '32.7',
'total partitions': '10000000'

Get monitor for "pre-regression" run: hydra investigate show-monitor 4f2613dc-9f5a-481c-88ba-5b282cd5a844
Logs:

db-cluster  | https://cloudius-jenkins-test.s3.amazonaws.com/4f2613dc-9f5a-481c-88ba-5b282cd5a844/20201111_003708/db-cluster-4f2613dc.zip  |
monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/4f2613dc-9f5a-481c-88ba-5b282cd5a844/20201111_003708/monitor-set-4f2613dc.zip |
loader-set  | https://cloudius-jenkins-test.s3.amazonaws.com/4f2613dc-9f5a-481c-88ba-5b282cd5a844/20201111_003708/loader-set-4f2613dc.zip  |
sct-runner  | https://cloudius-jenkins-test.s3.amazonaws.com/4f2613dc-9f5a-481c-88ba-5b282cd5a844/20201111_003708/sct-runner-4f2613dc.zip  |

"Regression" test

"Regression" run with ami-0e2cbb2c9e1e42eea(Scylla version 4.1.10-0.20201119.01c822301f, test_id=d79dd949-53b5-49b9-9eef-2a2e8a675f58)
From https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/yulia/job/staging-4.1-longevity-lwt-3h/1/console (test was aborted after 5 hours) found that prepare part duration was about 2 hours.

Prepare started at 07:05 and finished 09:03

C-s statistics

From cassandra-stress stats:

'op rate': '1430', 
'partition rate': '1430',
'row rate': '1620', 
'latency mean': '20.9', 
'latency median': '21.5', 
'latency 95th percentile': '24.7', 
'latency 99th percentile': '44.3', 
'total partitions': '10000000'

"Regression" run live monitor

Get monitor for "regression" run: hydra investigate show-monitor d79dd949-53b5-49b9-9eef-2a2e8a675f58
All logs:
https://cloudius-jenkins-test.s3.amazonaws.com/d79dd949-53b5-49b9-9eef-2a2e8a675f58/20201125_152600/20201125-064419-816674.tar.gz

  • 'op rate' is higher in "pre-regression" run x4.1 time
  • 'latency mean' is higher in "regression" run also x4.1

No error, reactor stalls found in the log

First time slowdown received with Scylla version 4.1.9-0.20201111.7325996510d (ami-0c71cce348923c8b9 ).
Jenkins job
Test id: 610ec4a5-ead5-4241-9d19-5e5260c6fce4

Regression bug high lwt performance

Most helpful comment

Here's a machine I did not yet upgrade:

[centos@ip-172-31-3-11 ~]$ cat /proc/sys/kernel/sched_wakeup_granularity_ns
15000000

and an upgraded machine:

$ cat /proc/sys/kernel/sched_wakeup_granularity_ns
450000

All 67 comments

@juliayakovlev is that a regression? what is the last version that was good?

@juliayakovlev is that a regression? what is the last version that was good?

last version that was good: 4.1.9-0.20201109.fb14fae79b (ami-0b36d79ff69153396)

@slivne @avikivity

Do you need I'll keep the monitor or can terminate it?

@gleb-cloudius can you please scan the diffs in 4.1.9...4.1.10 in scylla / seastar and see if you have a suspect

As we all know it may be a kernel issue - we need to extract the kernels from the AMIs and see if they are the same or not - if they aren't (and gleb doesn't have a suspect) it may be kernel related

this is blocking 4.1.10 from going out

On Thu, Nov 26, 2020 at 02:07:37AM -0800, Shlomi Livne wrote:

@gleb-cloudius can you please scan the diffs in 4.1.9...4.1.10 in scylla / seastar and see if you have a suspect

Why scan doff between those two and not fb14fae79b and 01c822301f which
are two versions that are compared. The diff between those is:

Botond Dénes (2):
types: validate(): linearize values lazily
mutation_reader: queue_reader: don't set EOS flag on abort

Hagit Segev (1):
release: prepare for 4.1.10

Piotr Dulikowski (1):
hints: don't read hint files when it's not allowed to send

Raphael S. Carvalho (1):
compaction: Make sure a partition is filtered out only by producer

So it is unlikely Scylla related.

--
Gleb.

4.1.10  ami-0e2cbb2c9e1e42eea - 5.9.9-1.el7.elrepo.x86_64 #1 SMP Tue Nov 17 19:12:43 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

4.1.9  ami-0b36d79ff69153396  5.9.8-1.el7.elrepo.x86_64 #1 SMP Wed Nov 11 09:49:09 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

@slivne
I am going to terminate the monitor. Any objections?

I am not able to reproduce this with a single server and running cassadnra-stress

even when attempting to use the same profile

cassandra-stress user profile=a.yaml n=10000000 ops'(insert=1)' -rate threads=30 -node ...

the prepare part doesn;t use any LWT

this is the profile

# LWT test: create and update data using LWT.
### DML ###

# Keyspace Name
keyspace: cqlstress_lwt_example

# The CQL for creating a keyspace (optional if it already exists)
keyspace_definition: |
  CREATE KEYSPACE IF NOT EXISTS cqlstress_lwt_example WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

# Table name
table: blogposts

# The CQL for creating a table you wish to stress (optional if it already exists)
table_definition: |
  CREATE TABLE IF NOT EXISTS blogposts (
        domain int,
        published_date int,
        lwt_indicator int,
        url text,
        author text,
        title text,
        PRIMARY KEY(domain, published_date)
  ) WITH compaction = { 'class':'LeveledCompactionStrategy' }
    AND comment='A table to hold blog posts'

extra_definitions:
  - create MATERIALIZED VIEW blogposts_update_one_column_lwt_indicator as select domain, lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator >=-150000 and lwt_indicator < 0 PRIMARY KEY(lwt_indicator, domain, published_date);
  - create MATERIALIZED VIEW blogposts_update_one_column_lwt_indicator_after_update as select lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator = 30000000 PRIMARY KEY(lwt_indicator, domain, published_date);
  - create MATERIALIZED VIEW blogposts_update_2_columns_lwt_indicator as select lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator > 0 and lwt_indicator <= 150000 PRIMARY KEY(lwt_indicator, domain, published_date);
  - create MATERIALIZED VIEW blogposts_update_2_columns_lwt_indicator_after_update as select lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator = 20000000 PRIMARY KEY(lwt_indicator, domain, published_date);
  - create MATERIALIZED VIEW blogposts_not_updated_lwt_indicator as select lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator > 1000000 and lwt_indicator < 20000000 PRIMARY KEY(lwt_indicator, domain, published_date);

### Column Distribution Specifications ###

columnspec:
  - name: domain
    population: seq(1..10000000)  #10M possible domains to pick from

  - name: published_date
    cluster: uniform(1..1000)         #under each domain we will have max 1000 posts

  - name: lwt_indicator
    population: seq(1..1001000)

  - name: url
    size: uniform(30..30)

  - name: title                  #titles shouldn't go beyond 200 chars
    size: gaussian(10..20)

  - name: author
    size: uniform(5..20)         #author names should be short

### Batch Ratio Distribution Specifications ###

insert:
  partitions: fixed(1)            # Our partition key is the domain so only insert one per batch

  select:    fixed(1)/1000        # We have 1000 posts per domain so 1/1000 will allow 1 post per batch

  batchtype: UNLOGGED             # Unlogged batches


#
# A list of queries you wish to run against the schema
#
queries:
   insert_query:
      cql: insert into blogposts (domain, published_date, lwt_indicator, url, author, title) values (?,?,?,?,?,?) if not exists
      fields: samerow
   select:
      cql: select * from blogposts where domain = ? LIMIT 1
      fields: samerow
   lwt_update_one_column:
      cql: update blogposts set lwt_indicator = 30000000 where domain = ? and published_date = ? if lwt_indicator >=-150000 and lwt_indicator < 0
      fields: samerow
   lwt_update_two_columns:
      cql: update blogposts set lwt_indicator = 20000000, author = 'text' where domain = ? and published_date = ? if lwt_indicator > 0 and lwt_indicator <= 150000 and author != 'text'
      fields: samerow

and the prepare stage does

cassandra-stress user profile=/tmp/c-s_lwt_basic.yaml n=10000000 ops'(insert=1)' cl=QUORUM -mode native cql3 -rate threads=30 -node ... -errors skip-unsupported-columns

******************** Stress Settings ********************
Command:
  Type: user
  Count: 10,000,000
  No Warmup: false
  Consistency Level: LOCAL_ONE
  Target Uncertainty: not applicable
  Command Ratios: {insert=1.0}
  Command Clustering Distribution: clustering=gaussian(1..10)
  Profile File: a.yaml
Rate:
  Auto: false
  Thread Count: 100
  OpsPer Sec: 0
Population:
  Sequence: 1..10000000
  Order: ARBITRARY
  Wrap: true
Insert:
  Revisits: Uniform:  min=1,max=1000000
  Visits: Fixed:  key=1
  Row Population Ratio: Ratio: divisor=1.000000;delegate=Fixed:  key=1
  Batch Type: not batching
Errors:
  Ignore: false
  Tries: 10
Log:
  No Summary: false
  No Settings: false
  File: null
  Interval Millis: 1000
  Level: NORMAL
Mode:
  API: JAVA_DRIVER_NATIVE
  Connection Style: CQL_PREPARED
  CQL Version: CQL3
  Protocol Version: V4
  Username: null
  Password: null
  Auth Provide Class: null
  Max Pending Per Connection: 128
  Connections Per Host: 8
  Compression: NONE
Node:
  Nodes: [172.31.9.202]
  Is White List: false
  Datacenter: null
Schema:
  Keyspace: cqlstress_lwt_example
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Replication Strategy Pptions: {replication_factor=1}
  Table Compression: null
  Table Compaction Strategy: null
  Table Compaction Strategy Options: {}
Transport:
  factory=org.apache.cassandra.thrift.TFramedTransportFactory; truststore=null; truststore-password=null; keystore=null; keystore-password=null; ssl-protocol=TLS; ssl-alg=SunX509; store-type=JKS; ssl-ciphers=TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA; 
Port:
  Native Port: 9042
  Thrift Port: 9160
  JMX Port: 7199
Send To Daemon:
  *not set*
Graph:
  File: null
  Revision: unknown
  Title: null
  Operation: USER
TokenRange:
  Wrap: false
  Split Factor: 1

******************** Profile ********************
  Keyspace Name: cqlstress_lwt_example
  Keyspace CQL: 
***
CREATE KEYSPACE IF NOT EXISTS cqlstress_lwt_example WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
***

  Table Name: blogposts
  Table CQL: 
***
CREATE TABLE IF NOT EXISTS blogposts (
      domain int,
      published_date int,
      lwt_indicator int,
      url text,
      author text,
      title text,
      PRIMARY KEY(domain, published_date)
) WITH compaction = { 'class':'LeveledCompactionStrategy' }
  AND comment='A table to hold blog posts'
***

  Extra Schema Definitions: [create MATERIALIZED VIEW blogposts_update_one_column_lwt_indicator as select domain, lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator >=-150000 and lwt_indicator < 0 PRIMARY KEY(lwt_indicator, domain, published_date);, create MATERIALIZED VIEW blogposts_update_one_column_lwt_indicator_after_update as select lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator = 30000000 PRIMARY KEY(lwt_indicator, domain, published_date);, create MATERIALIZED VIEW blogposts_update_2_columns_lwt_indicator as select lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator > 0 and lwt_indicator <= 150000 PRIMARY KEY(lwt_indicator, domain, published_date);, create MATERIALIZED VIEW blogposts_update_2_columns_lwt_indicator_after_update as select lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator = 20000000 PRIMARY KEY(lwt_indicator, domain, published_date);, create MATERIALIZED VIEW blogposts_not_updated_lwt_indicator as select lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator > 1000000 and lwt_indicator < 20000000 PRIMARY KEY(lwt_indicator, domain, published_date);]
  Generator Configs:
    lwt_indicator: Identity: Sequence:  start=1,end=1001000;
    author: Size: Uniform:  min=5,max=20;
    domain: Identity: Sequence:  start=1,end=10000000;
    title: Size: Gaussian:  min=10,max=20,mean=15.000000,stdev=1.666667;
    published_date: Clustering: Uniform:  min=1,max=1000;
    url: Size: Fixed:  key=30;
  Query Definitions:
    insert_query: CQL:insert into blogposts (domain, published_date, lwt_indicator, url, author, title) values (?,?,?,?,?,?) if not exists;Fields:samerow;
    select: CQL:select * from blogposts where domain = ? LIMIT 1;Fields:samerow;
    lwt_update_one_column: CQL:update blogposts set lwt_indicator = 30000000 where domain = ? and published_date = ? if lwt_indicator >=-150000 and lwt_indicator < 0;Fields:samerow;
    lwt_update_two_columns: CQL:update blogposts set lwt_indicator = 20000000, author = 'text' where domain = ? and published_date = ? if lwt_indicator > 0 and lwt_indicator <= 150000 and author != 'text';Fields:samerow;
  Token Range Queries:
  Insert Settings:
    partitions: fixed(1)
    select: fixed(1)/1000
    batchtype: UNLOGGED
===== Using optimized driver!!! =====
Connected to cluster: scylladb-cluster-1606775542, max pending requests per connection 128, max connections per host 8
Datatacenter: eu-west; Host: /172.31.9.202; Rack: 1a
Created schema. Sleeping 1s for propagation.
Created extra schema. Sleeping 1s for propagation.
Generating batches with [1..1] partitions and [0..1] rows (of [1..1000] total rows in the partitions)

So the prepare phase is doing inserts into a keyspace rf=3 with a simple table with 5 MVs.

There is no LWT related code

On Mon, Nov 30, 2020 at 10:52:58PM -0800, Shlomi Livne wrote:

the prepare part doesn;t use any LWT

I think you are using incorrect one. The one in the branch-4.1 has:

insert:
partitions: fixed(1) # Our partition key is the domain so only insert one per batch

select: fixed(1)/1000 # We have 1000 posts per domain so 1/1000 will allow 1 post per batch

batchtype: UNLOGGED # Unlogged batches

condition: IF title = NULL # LWT: Do not override

--
Gleb.

using the following ami for loader (ami-0bf19545bc7bc9e1f) scylla-qa-loader-ami-lwt-v5 on a 4 nodecluster with LWT in inserts I am able to reproduce the issue

Starts from the good 4.1.9 AMI ami-0b36d79ff69153396 did an upgrade to 4.1.10

http://downloads.scylladb.com.s3.amazonaws.com/rpm/unstable/centos/branch-4.1/2020-11-19T18%3A17%3A45Z/scylla.repo

scylla --version 4.1.10-0.20201119.01c822301f

And the performance is ok

Results:
Op rate                   :    6,732 op/s  [insert: 6,778 op/s]
Partition rate            :    6,732 pk/s  [insert: 6,778 pk/s]
Row rate                  :    7,623 row/s [insert: 7,675 row/s]
Latency mean              :    4.4 ms [insert: 4.4 ms]
Latency median            :    4.0 ms [insert: 4.0 ms]
Latency 95th percentile   :    7.5 ms [insert: 7.5 ms]
Latency 99th percentile   :   10.3 ms [insert: 10.3 ms]
Latency 99.9th percentile :   17.1 ms [insert: 17.1 ms]
Latency max               :   46.9 ms [insert: 46.9 ms]
Total partitions          :  1,000,000 [insert: 1,000,000]
Total errors              :          0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:02:28

So the issue is not in scylla code

which means the issue is (again) in kernel (or ami) somehow.

start from bad ami: ami-0e2cbb2c9e1e42eea

verified bad performance

downgraded to 4.1.9 and started to check which version is cuasing the issue

scylla version: 4.1.9-0.20201026.b80dab6d58

Results:
Op rate                   :    6,308 op/s  [insert: 6,308 op/s]
Partition rate            :    6,308 pk/s  [insert: 6,308 pk/s]
Row rate                  :    7,137 row/s [insert: 7,137 row/s]
Latency mean              :    4.5 ms [insert: 4.5 ms]
Latency median            :    3.6 ms [insert: 3.6 ms]
Latency 95th percentile   :    7.7 ms [insert: 7.7 ms]
Latency 99th percentile   :   30.4 ms [insert: 30.4 ms]
Latency 99.9th percentile :   44.8 ms [insert: 44.8 ms]
Latency max               :   53.0 ms [insert: 53.0 ms]
Total partitions          :    100,000 [insert: 100,000]
Total errors              :          0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:00:15

END
version:  4.1.9-0.20201106.db6303dba0d

Results:
Op rate                   :    6,412 op/s  [insert: 6,412 op/s]
Partition rate            :    6,412 pk/s  [insert: 6,412 pk/s]
Row rate                  :    7,255 row/s [insert: 7,255 row/s]
Latency mean              :    4.5 ms [insert: 4.5 ms]
Latency median            :    3.6 ms [insert: 3.6 ms]
Latency 95th percentile   :    7.5 ms [insert: 7.5 ms]
Latency 99th percentile   :   31.2 ms [insert: 31.2 ms]
Latency 99.9th percentile :   44.6 ms [insert: 44.6 ms]
Latency max               :   58.4 ms [insert: 58.4 ms]
Total partitions          :    100,000 [insert: 100,000]
Total errors              :          0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:00:15

END
scylla version:
repo: http:////downloads.scylladb.com.s3.amazonaws.com/rpm/unstable/centos/branch-4.1/2020-11-09T16%3A32%3A49Z/scylla.repo

Results:
Op rate                   :    6,272 op/s  [insert: 6,272 op/s]
Partition rate            :    6,272 pk/s  [insert: 6,272 pk/s]
Row rate                  :    7,095 row/s [insert: 7,095 row/s]
Latency mean              :    4.5 ms [insert: 4.5 ms]
Latency median            :    3.6 ms [insert: 3.6 ms]
Latency 95th percentile   :    7.3 ms [insert: 7.3 ms]
Latency 99th percentile   :   30.5 ms [insert: 30.5 ms]
Latency 99.9th percentile :   44.9 ms [insert: 44.9 ms]
Latency max               :   61.8 ms [insert: 61.8 ms]
Total partitions          :    100,000 [insert: 100,000]
Total errors              :          0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:00:15

END
scylla version: 4.1.9-0.20201111.7325996510d
repo //downloads.scylladb.com.s3.amazonaws.com/rpm/unstable/centos/branch-4.1/2020-11-11T18%3A43%3A35Z/scylla.repo

Results:
Op rate                   :    6,492 op/s  [insert: 6,492 op/s]
Partition rate            :    6,492 pk/s  [insert: 6,492 pk/s]
Row rate                  :    7,333 row/s [insert: 7,333 row/s]
Latency mean              :    4.6 ms [insert: 4.6 ms]
Latency median            :    3.5 ms [insert: 3.5 ms]
Latency 95th percentile   :    8.0 ms [insert: 8.0 ms]
Latency 99th percentile   :   32.4 ms [insert: 32.4 ms]
Latency 99.9th percentile :   45.4 ms [insert: 45.4 ms]
Latency max               :   52.8 ms [insert: 52.8 ms]
Total partitions          :    100,000 [insert: 100,000]
Total errors              :          0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:00:15

scylla version: 4.1.9-0.20201115.b144b93cd8
repo: //downloads.scylladb.com.s3.amazonaws.com/rpm/unstable/centos/branch-4.1/2020-11-15T11%3A20%3A00Z/scylla.repo


Results:
Op rate                   :    6,276 op/s  [insert: 6,276 op/s]
Partition rate            :    6,276 pk/s  [insert: 6,276 pk/s]
Row rate                  :    7,112 row/s [insert: 7,112 row/s]
Latency mean              :    4.7 ms [insert: 4.7 ms]
Latency median            :    3.5 ms [insert: 3.5 ms]
Latency 95th percentile   :    9.0 ms [insert: 9.0 ms]
Latency 99th percentile   :   33.7 ms [insert: 33.7 ms]
Latency 99.9th percentile :   45.0 ms [insert: 45.0 ms]
Latency max               :   50.8 ms [insert: 50.8 ms]
Total partitions          :    100,000 [insert: 100,000]
Total errors              :          0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:00:15

scylla version: 4.1.10-0.20201119.01c822301f
repo: //downloads.scylladb.com.s3.amazonaws.com/rpm/unstable/centos/branch-4.1/2020-11-19T18%3A17%3A45Z/scylla.repo


Results:
Op rate                   :    6,129 op/s  [insert: 6,129 op/s]
Partition rate            :    6,129 pk/s  [insert: 6,129 pk/s]
Row rate                  :    6,935 row/s [insert: 6,935 row/s]
Latency mean              :    4.6 ms [insert: 4.6 ms]
Latency median            :    3.6 ms [insert: 3.6 ms]
Latency 95th percentile   :    8.2 ms [insert: 8.2 ms]
Latency 99th percentile   :   30.7 ms [insert: 30.7 ms]
Latency 99.9th percentile :   44.2 ms [insert: 44.2 ms]
Latency max               :   50.7 ms [insert: 50.7 ms]
Total partitions          :    100,000 [insert: 100,000]
Total errors              :          0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:00:16
scylla version:  4.1.10-0.20201122.01c822301f

repo: //downloads.scylladb.com.s3.amazonaws.com/rpm/unstable/centos/branch-4.1/2020-11-22T13%3A32%3A10Z/scylla.repo
Results:
Op rate                   :    6,448 op/s  [insert: 6,448 op/s]
Partition rate            :    6,448 pk/s  [insert: 6,448 pk/s]
Row rate                  :    7,299 row/s [insert: 7,299 row/s]
Latency mean              :    4.5 ms [insert: 4.5 ms]
Latency median            :    3.6 ms [insert: 3.6 ms]
Latency 95th percentile   :    7.3 ms [insert: 7.3 ms]
Latency 99th percentile   :   29.8 ms [insert: 29.8 ms]
Latency 99.9th percentile :   44.8 ms [insert: 44.8 ms]
Latency max               :   52.8 ms [insert: 52.8 ms]
Total partitions          :    100,000 [insert: 100,000]
Total errors              :          0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:00:15

We are back in 4.1.10 and the performance is good ?


this is the upgrade code - lets see if its causing this to work well.

rm scylla.repo*;  wget http://downloads.scylladb.com.s3.amazonaws.com/rpm/unstable/centos/branch-4.1/2020-11-22T13%3A32%3A10Z/scylla.repo ;  sudo cp scylla.repo /etc/yum.repos.d/

sudo yum upgrade scylla*4.1.10; sudo service scylla-server stop; sudo rm -Rf /var/lib/scylla/data/* /var/lib/scylla/commitlog/* 

The bottom line I suspect if we restart / upgrade - he issue will be gone - need to check whats the smallest number of steps that will cause performance to be good

May be affinity setup is broken on the ami and upgrade fixes it?

Or maybe its an issue that scylla starts to early when the node starts we will know soon.

Restarting scylla on all servers did not work ....

Results:
Op rate                   :    1,487 op/s  [insert: 1,487 op/s]
Partition rate            :    1,487 pk/s  [insert: 1,487 pk/s]
Row rate                  :    1,683 row/s [insert: 1,683 row/s]
Latency mean              :   20.1 ms [insert: 20.1 ms]
Latency median            :   20.9 ms [insert: 20.9 ms]
Latency 95th percentile   :   24.3 ms [insert: 24.3 ms]
Latency 99th percentile   :   40.2 ms [insert: 40.2 ms]
Latency 99.9th percentile :   55.2 ms [insert: 55.2 ms]
Latency max               :   79.7 ms [insert: 79.7 ms]
Total partitions          :    100,000 [insert: 100,000]
Total errors              :          0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:01:07

Trying this did not help as well

sudo service scylla-server stop; sudo rm -Rf /var/lib/scylla/data/* /var/lib/scylla/commitlog/* ; sudo service scylla-server start

Next reinstall

sudo yum reinstall scylla*
sudo service scylla-server stop; sudo rm -Rf /var/lib/scylla/data/* /var/lib/scylla/commitlog/* ; sudo service scylla-server start

this fixes the issue.

So @gleb-cloudius - you are correct - reinstall fixed it - yet what did it do - and how do we find whats broken

So @gleb-cloudius - you are correct - reinstall fixed it - yet what did it do - and how do we find whats broken

Runs scylla-setup may be?

--
Gleb.

4.1.10 has the revert the code of the unpinning of the kernel @penberg can you please check the patch in scylla-machine-image for the unpinning

@avikivity as discussed the regression is visible between

4.1.10  ami-0e2cbb2c9e1e42eea - 5.9.9-1.el7.elrepo.x86_64 #1 SMP Tue Nov 17 19:12:43 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

4.1.9  ami-0b36d79ff69153396  5.9.8-1.el7.elrepo.x86_64 #1 SMP Wed Nov 11 09:49:09 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

reinstall of scylla "fixes" the issue.

On Sun, Dec 06, 2020 at 04:46:04AM -0800, Avi Kivity wrote:

Candidate:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ab39aed544fd6b590d606e7b7db6f4f8a4e1fd1c

How reinstalling Scylla on the same kernel would have helped?

--
Gleb.

@avikivity
Monitor of "regression" run:
http://52.215.0.102:3000/d/yW_A6WJGz/scylla-per-server-metrics-nemesis-master?orgId=1&from=1606285776157&to=1606307608372&var-by=instance&var-cluster=&var-dc=All&var-node=All&var-shard=All

@juliayakovlev can I have a monitor for a good run? the bad run actually looks good, so I'd like to compare to a good run.

@juliayakovlev can I have a monitor for a good run? the bad run actually looks good, so I'd like to compare to a good run.

@avikivity
Do you need both of them or just for good run?

I need both (now looking at the bad run).

Some rates (per shard), bad run:

coordinator ops: ~28-64/s
replica reads: ~390-450/s
replica writes: ~650-740/s
cache reads: ~520-590/s
view local update: 14-19/s
view remote update: 4-6/s
lwt writes: 42-47/s
lwt average write latency: 19-27ms
lwt condition not met: rises to 7-9/s
lwt prune: 42-49/s
commitlog iops: 300/s - 370/s
commitlog bandwidth: ~1.3-3.1 MB/s

Average commitlog write is 8k.

I need both (now looking at the bad run).

@avikivity

"Good" run monitor: http://52.16.7.182:3000/d/fQUrzMJMk/scylla-per-server-metrics-nemesis-master?orgId=1&from=1605040696854&to=1605054283344&var-by=instance&var-cluster=&var-dc=All&var-node=All&var-shard=All

For the bad run, c-s latency 95% was ~25ms. For the good run, ~10ms.

I've seen this many times when system.paxos needs compaction, so its memtable is empty, every round has to do a disk read.

The question would be though why is this a regression?

Some rates (per shard), good run:

  • coordinator ops: ~700/s
  • replica reads: ~7000/s
  • replica writes: ~12050/s
  • cache reads: ~10000/s
  • view local update: 270/s
  • view remote update: 90/s
  • lwt writes: 770/s
  • lwt average write latency: 3.3ms
  • lwt condition not met: rises to 128/s
  • lwt prune: 770/s
  • commitlog iops: 1200/s - 2500/s
  • commitlog bandwidth: ~10-20 MB/s

The question would be though why is this a regression?

4.1.10 results are ~5X worse than 4.1.9.

I've seen this many times when system.paxos needs compaction, so its memtable is empty, every round has to do a disk read.

I don't see disk reads during the populate phase.

Commitlog I/O queue delay is low in both runs.

Somehow the latency is much higher for the bad run, although I/O is fast for both.

@gleb-cloudius any idea?

On Tue, Dec 08, 2020 at 06:41:58AM -0800, Avi Kivity wrote:

@gleb-cloudius any idea?

Not really. I already looked at bad results back when the bug was posted
and also did not see anything strange. Cpu utilization was low as well.
Look at Shlomi's findings. Reinstalling the package seams to help, so
may be there are problems with cpu affinity or something bad runs on the
host and steals cpu. If we have host metrics we can compare context
switches.

--
Gleb.

We do have host metrics. But look at reactor load, it is always low.

In the bad run, we take 7%-10% in statement group. In the good run, 38%-69%.

I did reinstall 3 times and it "fixed" the issue

I'll provide instructions on how to reproduce - @penberg / @avi - will take a look.

reproducing the setup

please note there are params you will need to specify

setup cluster <your-ec2-ireland-key>

aws cloudformation create-stack --stack-name sl-t --template-url http://downloads.scylladb.com.s3.amazonaws.com/downloads/scylla/rpm/centos/scylladb-4.1/scylla_cluster_4.1.9.yaml  --parameters ParameterKey=AvailabilityZone,ParameterValue=eu-west-1b ParameterKey=ScyllaAmi,ParameterValue=ami-0e2cbb2c9e1e42eea ParameterKey=ClusterName,ParameterValue=sl-t-4-1-10 ParameterKey=InstanceType,ParameterValue=i3.2xlarge ParameterKey=InstanceCount,ParameterValue=4 ParameterKey=KeyName,ParameterValue=<your-ec2-ireland-key>

setup loader (need an instance from the cluster) <subnet-id-from-one-instance>,<security-group-ids-from-one-instance>,<your-ec2-ireland-key>

aws ec2 run-instances --image=ami-0bf19545bc7bc9e1f --instance-type c5.4xlarge --subnet-id <subnet-id-from-one-instance> --security-group-ids <security-group-ids-from-one-instance> --key-name <your-ec2-ireland-key> 

copy the cassandra-stress profile need <loader-ip>

scylla-cluster-tests
git checkout branch-4.1

scp   -i <your-ec2-ireland-key> centos@<loader-ip>:/home/centos/

on loader-ip run load

cassandra-stress user profile=./c-s_lwt_basic.yaml n=10000000 ops'(insert=1)' cl=QUORUM -mode native cql3 -rate threads=30 -errors skip-unsupported-columns -node 172.31.0.1

```

@slivne those setup 4.1.9, no? need to install 4.1.10 on top, and then reinstall?

@avikivity the template is 4.1.9 as its the last released version the ami is 4.1.10 (0e2cbb2c9e1e42eea)

ah!

reinstallling just the scylla-server subpackage and restarting did not help.

reinstalling scylla-kernel-conf helped immediately, even without restart.

Here's a machine I did not yet upgrade:

[centos@ip-172-31-3-11 ~]$ cat /proc/sys/kernel/sched_wakeup_granularity_ns
15000000

and an upgraded machine:

$ cat /proc/sys/kernel/sched_wakeup_granularity_ns
450000

Rebooting a reinstalled node restores the bad value.

So maybe we are not persisting the setting and only changing it for the
runtime?

On Wed, Dec 9, 2020 at 6:20 PM Avi Kivity notifications@github.com wrote:

Rebooting a reinstalled node restores the bad value.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/7696#issuecomment-741881531,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AA2OCCAU66EUTPWRQKNGXO3ST6PTXANCNFSM4UCLJ6XQ
.

I straced the systemd-sysctl.service invocation, which is responsible for loading the value at boot:

open("/proc/sys/kernel/sched_wakeup_granularity_ns", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa8d473c000
write(4, "450000\n", 7)                 = 7
close(4)

So something is overwriting it afterwards.

/usr/lib/tuned/throughput-performance/tuned.conf:kernel.sched_wakeup_granularity_ns = 15000000
Name        : tuned
Version     : 2.11.0
Release     : 10.el7
Architecture: noarch
Install Date: 2020-11-20T03:23:48 UTC
Group       : Unspecified
Size        : 815813
License     : GPLv2+
Signature   : RSA/SHA256, 2020-11-17T14:19:32 UTC, Key ID 24c6a8a7f4a80eb5
Source RPM  : tuned-2.11.0-10.el7.src.rpm
Build Date  : 2020-11-16T16:46:25 UTC
Build Host  : x86-02.bsys.centos.org
Relocations : (not relocatable)
Packager    : CentOS BuildSystem <http://bugs.centos.org>
Vendor      : CentOS
URL         : http://www.tuned-project.org/
Summary     : A dynamic adaptive system tuning daemon
Description :
The tuned package contains a daemon that tunes system settings dynamically.
It does so by monitoring the usage of several system components periodically.
Based on that information components will then be put into lower or higher
power saving modes to adapt to the current usage. Currently only ethernet
network and ATA harddisk devices are implemented.
* Wed Aug 12 2020 Jaroslav Å karvada <[email protected]> - 2.11.0-10
- realtime-virtual-host: remove lapic advancement calculation and
  related qemu-kvm-tools-rhev requirement
  Resolves: rhbz#1852253

* Thu Mar 05 2020 Jaroslav Å karvada <[email protected]> - 2.11.0-9
- Fixed SIGHUP handling
  Resolves: rhbz#1702724
- Tune irqbalance service
  Resolves: rhbz#1720042
- Added netcat requirement
  Resolves: rhbz#1746436
- sysctl: made reapply_sysctl ignore configs from /usr
  Resolves: rhbz#1776149
- profiles: define variables before use
  Resolves: rhbz#1781664

"- sysctl: made reapply_sysctl ignore configs from /usr"

So we have two options:

  • disable tuned (and lose any good things it does, if any)
  • set reapply_sysctl = 0 in /etc/tuned/tuned.conf

Looking at /usr/lib/tuned/throughput-performance/tuned.conf, we are better off disabling tuned.

Perhaps we should add Obsoletes: tuned to scylla-kernel-conf.

@slivne bottom line, it is not a regression in scylla, but a change in tuned behavior that we got by running yum update somehere.

from yum.log:

Nov 20 03:23:48 Updated: tuned-2.11.0-10.el7.noarch

@avikivity
Do you need the monitors?

@juliayakovlev no, we found the cause and a fix is on the way. Please kill the monitors.

Backported to 4.1, 4.2, 4.3.

Was this page helpful?
0 / 5 - 0 ratings