Scylla: nodetool status returns wrong IPv6 addresses

Created on 13 Feb 2020 · 28 Comments · Source: scylladb/scylla

I'm running 3.3 (ami-0b19be243453bd8e8 in eu-west-1), version 3.3.rc1-0.20200209.0d0c1d43188.

Both the AWS console and the local network manager show the same IPv6 address (and it is pingable using ping6):

2a05:d018:223:f00:b111:3a2:aa5f:2295 is the correct IPv6 address of my instance

ip addr  | grep inet6
inet6 ::1/128 scope host 
inet6 2a05:d018:223:f00:b111:3a2:aa5f:2295/128 scope global dynamic 
inet6 fe80::88d:a6ff:fe32:6cce/64 scope link

while nodetool status returns something else (2a05:d018:223:f00:b111:ffa2:aa5f:ff95):

nodetool status
Datacenter: UNKNOWN_DC
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                Load       Tokens       Owns    Host ID                               Rack
UN  2a05:d018:223:f00:1c55:d40e:5a7d:ffb7  461.13 KB  256          ?       8fd87d4c-9fb6-44d9-b016-48de9529f453  UNKNOWN_RACK
UN  2a05:d018:223:f00:b111:ffa2:aa5f:ff95  572.09 KB  256          ?       05a39bc1-3fbb-481a-93c8-e96b1b1f2222  UNKNOWN_RACK
UN  2a05:d018:223:f00:ff85:be21:de55:ffbf  461.97 KB  256          ?       bdac46b6-0069-49ba-8acd-ff6db13a2ddd  UNKNOWN_RACK

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
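
The mismatch above is easy to miss by eye, so here is a minimal Python sketch that checks mechanically whether the address from `ip addr` appears among the addresses in the `nodetool status` rows. The helper name `reported_addresses` is hypothetical, and the parsing assumes node rows start with a two-letter state such as `UN`, as in the output above.

```python
import ipaddress

# Node rows copied from the nodetool status output quoted in this report.
NODETOOL_OUTPUT = """\
UN  2a05:d018:223:f00:1c55:d40e:5a7d:ffb7  461.13 KB  256  ?  8fd87d4c-9fb6-44d9-b016-48de9529f453  UNKNOWN_RACK
UN  2a05:d018:223:f00:b111:ffa2:aa5f:ff95  572.09 KB  256  ?  05a39bc1-3fbb-481a-93c8-e96b1b1f2222  UNKNOWN_RACK
UN  2a05:d018:223:f00:ff85:be21:de55:ffbf  461.97 KB  256  ?  bdac46b6-0069-49ba-8acd-ff6db13a2ddd  UNKNOWN_RACK
"""


def reported_addresses(status_text):
    """Collect the addresses from node rows (first field like UN or DN)."""
    found = set()
    for line in status_text.splitlines():
        fields = line.split()
        # Assumption: a node row has a two-letter state starting with U or D.
        if len(fields) >= 2 and len(fields[0]) == 2 and fields[0][0] in "UD":
            try:
                found.add(ipaddress.ip_address(fields[1]))
            except ValueError:
                continue  # second field is not an IP address, skip the row
    return found


local = ipaddress.ip_address("2a05:d018:223:f00:b111:3a2:aa5f:2295")
print(local in reported_addresses(NODETOOL_OUTPUT))
```

Comparing parsed `ipaddress` objects rather than strings avoids false mismatches caused by leading zeros or `::` compression.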

I captured a pcap using tshark (I will try to attach it here, or will provide a link to some S3 storage).

bug ipv6

All 28 comments

capture.txt

capture.txt is the pcap captured using tshark on one of the nodes

I have tested the same scenario using 3.2 and it doesn't show the issue.

Let's try to reproduce on Monday morning and get the machine up - @elcallio will be able to check this out on the machine.

What machine? And check out what? Since server address is not exactly automatically resolved, I would assume that whatever is showing in nodetool is also in scylla.yaml?
ifconfig?
netstat -nlp?
raw api output?

Wrong: scylla.yaml has the correct IPv6 address, hence the cluster is up and running.
I can create an environment for you, from one of our manager sanity tests, if you like.

Yes please.

nodetool status

Datacenter: UNKNOWN_DC
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                Load       Tokens       Owns    Host ID                               Rack
UN  2a05:d018:223:f00:ff95:6149:ffb5:3266  433.04 KB  256          ?       a2c865a5-1309-4cc0-a4cb-29c7492a18a6  UNKNOWN_RACK
UN  2a05:d018:223:f00:ff99:921c:2362:ffd6  502.62 KB  256          ?       39d5ffc2-5340-4c0c-9357-9963be6a1d3b  UNKNOWN_RACK
UN  2a05:d018:223:f00:ffaf:ffd9:ffc2:6a0f  547.12 KB  256          ?       eb70f0e1-d880-444d-9bec-121db27d0601  UNKNOWN_RACK

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
This EC2 instance is optimized for Scylla.

[centos@manager-regression-mermaid--db-node-9b3bd7e9-1 ~]$ ip addr | grep inet6
    inet6 ::1/128 scope host 
    inet6 2a05:d018:223:f00:97af:f4d9:eac2:6a0f/128 scope global dynamic 
    inet6 fe80::8c1:15ff:fe82:65f0/64 scope link 
[centos@manager-regression-mermaid--db-node-9b3bd7e9-1 ~]$ #2a05:d018:223:f00:97af:f4d9:eac2:6a0f
[centos@manager-regression-mermaid--db-node-9b3bd7e9-1 ~]$ #2a05:d018:223:f00:ffaf:ffd9:ffc2:6a0f
[centos@manager-regression-mermaid--db-node-9b3bd7e9-1 ~]$ grep 6a0f /etc/scylla/scylla.yaml 
          - seeds: "2a05:d018:223:f00:97af:f4d9:eac2:6a0f"
listen_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
rpc_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
broadcast_rpc_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
prometheus_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f

@elcallio , node can be accessed on 34.241.115.182

I don't see what is wrong here.

~]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.0.164.48  netmask 255.255.0.0  broadcast 10.0.255.255
        inet6 2a05:d018:223:f00:97af:f4d9:eac2:6a0f  prefixlen 128  scopeid 0x0<global>
        inet6 fe80::8c1:15ff:fe82:65f0  prefixlen 64  scopeid 0x20<link>
...
~]$ grep 2a05:d018:223:f00:97af:f4d9:eac2:6a0f /etc/scylla/scylla.yaml 
          - seeds: "2a05:d018:223:f00:97af:f4d9:eac2:6a0f"
listen_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
rpc_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
broadcast_rpc_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
prometheus_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f

~]$ nodetool status
Datacenter: UNKNOWN_DC
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                Load       Tokens       Owns    Host ID                               Rack
UN  2a05:d018:223:f00:ff95:6149:ffb5:3266  507.87 KB  256          ?       a2c865a5-1309-4cc0-a4cb-29c7492a18a6  UNKNOWN_RACK
UN  2a05:d018:223:f00:ff99:921c:2362:ffd6  507.71 KB  256          ?       39d5ffc2-5340-4c0c-9357-9963be6a1d3b  UNKNOWN_RACK
UN  2a05:d018:223:f00:ffaf:ffd9:ffc2:6a0f  622.49 KB  256          ?       eb70f0e1-d880-444d-9bec-121db27d0601  UNKNOWN_RACK

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

The eth0 interface is 2a05:d018:223:f00:97af:f4d9:eac2:6a0f, scylla is bound to the address, and it shows up in the output.

All local info looks consistent.

nope, look closely... they are similar, but the addresses are different:

ffaf:ffd9:ffc2:6a0f
97af:f4d9:eac2:6a0f
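
To make the difference explicit, here is a small Python sketch that compares the two addresses group by group, using the configured address from scylla.yaml and the one nodetool reports for the same node. The helper names `hextets` and `differing_groups` are hypothetical.

```python
import ipaddress


def hextets(addr):
    """Return the eight 16-bit groups of an IPv6 address as 4-digit hex strings."""
    return ipaddress.IPv6Address(addr).exploded.split(":")


def differing_groups(a, b):
    """List (group index, group in a, group in b) for every group that differs."""
    pairs = zip(hextets(a), hextets(b))
    return [(i, x, y) for i, (x, y) in enumerate(pairs) if x != y]


configured = "2a05:d018:223:f00:97af:f4d9:eac2:6a0f"  # from scylla.yaml
reported = "2a05:d018:223:f00:ffaf:ffd9:ffc2:6a0f"    # from nodetool status

for index, want, got in differing_groups(configured, reported):
    print(f"group {index}: configured {want}, reported {got}")
```

For this pair, groups 4, 5 and 6 differ, and in each of them only the high byte changed (to `ff`); the low byte is intact.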

Sorry, missed that (confused by the talk of aa5f:2295 above).
Looks like somehow the accumulator mask in print routine is not cleared...?

So, a simple test to parse and re-print the addresses in question yields no errors. There have been no changes in the address formatting code since 2019-12-04, which should be 3.3, no?
On the other hand, the pattern above does not indicate any consistent "leak" of byte mask/accumulator, so I am not exactly eager to yell "compiler error" either.
But I cannot repro any issue parsing/printing locally. And you say this node/cluster _works_ properly, so we should assume all addresses are ok in memory? (netstat supports this).
Which gcc was this built with? Did we change gcc version somewhere?
I think _maybe_ a different gcc is doing a wild sign extension/type promotion that does not happen with mine... It would explain this (need high bit in high nibble -> 9)
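
One way to sanity-check the sign-extension idea against the addresses in this thread is the Python sketch below. It is not Scylla's formatting code, and `sign_extend_byte` and `buggy_format` are hypothetical names. It assumes a formatter that builds each 16-bit group as `(high << 8) | low`, where `low` is first widened as a signed 8-bit value, and prints each group without leading zeros or `::` compression.

```python
import ipaddress


def sign_extend_byte(b):
    """Reinterpret an 8-bit value as signed (two's complement), as a signed char would be."""
    return b - 0x100 if b & 0x80 else b


def buggy_format(addr):
    """Format an IPv6 address while OR-ing a sign-extended low byte into each group."""
    raw = ipaddress.IPv6Address(addr).packed  # 16 bytes, network order
    groups = []
    for i in range(0, 16, 2):
        hi, lo = raw[i], raw[i + 1]
        # A negative low byte has all upper bits set, so the OR wipes out `hi`.
        value = ((hi << 8) | sign_extend_byte(lo)) & 0xFFFF
        groups.append(f"{value:x}")
    return ":".join(groups)


print(buggy_format("2a05:d018:223:f00:97af:f4d9:eac2:6a0f"))
print(buggy_format("2a05:d018:223:f00:b111:3a2:aa5f:2295"))
```

Under that assumption the sketch turns both real addresses into the values nodetool printed: in both pairs quoted in this thread, the altered groups are exactly those whose low byte is 0x80 or greater, and their high byte becomes `ff`.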

I've pushed a patch to https://github.com/elcallio/scylla.git. Is it possible for you to build this with the same compiler as the image was built with and test?

In general, the test uses an AMI pre-created with a specific version; in my case, I'm using the latest AMI created for 3.3.
I will check later how I can create an AMI with your patch and give it a run.

@fgelcer ping

@slivne @elcallio , after a lot of fighting I managed to build an AMI with @elcallio's code, but it looks bogus:
/tmp/tmp.TVQD8HscRO/cassandra-env.sh: line 90: [: -ge: unary operator expected

Then the whole SCT execution fails, as it was unable to get nodetool status.

@fgelcer I think you used a scylla-tools-java that was broken - can you please provide the commit you used for scylla-tools-java?

If you used scylladb/scylla-tools-java@6d76e51472ae3350cbe9b9adc2134eec616d9318 or scylladb/scylla-tools-java@d3e5bbe208e1fd3dea7f4bdf82489c14b2f3e210, then it's clear why it was broken.

There is a fix for this in master: scylladb/scylla-tools-java@2f73d3ae37718ec9f7f1e1cd4bffc96aa4db5d9c

Now that scylla-tools-java is fixed, I'm building a new AMI; hopefully it will be ready to try tomorrow.

@slivne / @elcallio , I can now confirm (after a lot of effort to create and run the AMI with @elcallio's commit) that it works (the test actually failed, but on some unrelated issues).
@elcallio , can you please submit your change as a patch?
Thank you.

@slivne was this fix backported to 4.0? I'm asking because I am hitting it with version 4.0.rc3-0.20200501.eee4c00e29.

Gossip info also returns wrong IP (like nodetool status)

nope, this commit isn't in branch-4.0

Backported to 4.0, 3.3 (4.1 is safe).

Is there any plan for new package releases of 3.3 and 4.0 to address this?

It appears this bug also affects all currently packaged versions, and appears to prevent initializing a cluster with IPv6.

Yes, new 3.3 and 4.0 point releases will be issued with this fix.

Is there a planned date for the point releases @avikivity?

(Asking only because we're trying to schedule a deployment on our side that needs this fix. )

@lseelenbinder we'll likely start rolling them out tomorrow (we were waiting for #6627, which just landed).

Perfect, thank you! Definitely makes sense to wait on #6627.
