i'm running with 3.3 (ami-0b19be243453bd8e8 on eu-west-1) - 3.3.rc1-0.20200209.0d0c1d43188
both AWS console and local network manager show the same IPv6 address (and it is pingable using ping6):
2a05:d018:223:f00:b111:3a2:aa5f:2295 is the correct IPv6 address of my instance
ip addr | grep inet6
inet6 ::1/128 scope host
inet6 2a05:d018:223:f00:b111:3a2:aa5f:2295/128 scope global dynamic
inet6 fe80::88d:a6ff:fe32:6cce/64 scope link
while nodetool status returns something else (2a05:d018:223:f00:b111:ffa2:aa5f:ff95):
nodetool status
Datacenter: UNKNOWN_DC
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 2a05:d018:223:f00:1c55:d40e:5a7d:ffb7 461.13 KB 256 ? 8fd87d4c-9fb6-44d9-b016-48de9529f453 UNKNOWN_RACK
UN 2a05:d018:223:f00:b111:ffa2:aa5f:ff95 572.09 KB 256 ? 05a39bc1-3fbb-481a-93c8-e96b1b1f2222 UNKNOWN_RACK
UN 2a05:d018:223:f00:ff85:be21:de55:ffbf 461.97 KB 256 ? bdac46b6-0069-49ba-8acd-ff6db13a2ddd UNKNOWN_RACK
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
I captured a pcap using tshark (I will try to attach it here, or will provide a link to some S3 storage).
capture.txt is the pcap captured using tshark on one of the nodes
i have tested the same scenario using 3.2 and it doesn't show the issue
Let's try to reproduce on Monday morning and get the machine up - @elcallio will be able to check this out on the machine
What machine? And check out what? Since server address is not exactly automatically resolved, I would assume that whatever is showing in nodetool is also in scylla.yaml?
ifconfig?
netstat -nlp?
raw api output?
wrong, in scylla.yaml we see the correct IPv6 address, hence the cluster is up and running.
I can create an environment for you, from one of our manager sanity tests, if you will.
Yes please.
nodetool status
Datacenter: UNKNOWN_DC
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 2a05:d018:223:f00:ff95:6149:ffb5:3266 433.04 KB 256 ? a2c865a5-1309-4cc0-a4cb-29c7492a18a6 UNKNOWN_RACK
UN 2a05:d018:223:f00:ff99:921c:2362:ffd6 502.62 KB 256 ? 39d5ffc2-5340-4c0c-9357-9963be6a1d3b UNKNOWN_RACK
UN 2a05:d018:223:f00:ffaf:ffd9:ffc2:6a0f 547.12 KB 256 ? eb70f0e1-d880-444d-9bec-121db27d0601 UNKNOWN_RACK
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
This EC2 instance is optimized for Scylla.
[centos@manager-regression-mermaid--db-node-9b3bd7e9-1 ~]$ ip addr | grep inet6
inet6 ::1/128 scope host
inet6 2a05:d018:223:f00:97af:f4d9:eac2:6a0f/128 scope global dynamic
inet6 fe80::8c1:15ff:fe82:65f0/64 scope link
[centos@manager-regression-mermaid--db-node-9b3bd7e9-1 ~]$ #2a05:d018:223:f00:97af:f4d9:eac2:6a0f
[centos@manager-regression-mermaid--db-node-9b3bd7e9-1 ~]$ #2a05:d018:223:f00:ffaf:ffd9:ffc2:6a0f
[centos@manager-regression-mermaid--db-node-9b3bd7e9-1 ~]$ grep 6a0f /etc/scylla/scylla.yaml
- seeds: "2a05:d018:223:f00:97af:f4d9:eac2:6a0f"
listen_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
rpc_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
broadcast_rpc_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
prometheus_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
@elcallio , node can be accessed on 34.241.115.182
I don't see what is wrong here.
~]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 10.0.164.48 netmask 255.255.0.0 broadcast 10.0.255.255
inet6 2a05:d018:223:f00:97af:f4d9:eac2:6a0f prefixlen 128 scopeid 0x0<global>
inet6 fe80::8c1:15ff:fe82:65f0 prefixlen 64 scopeid 0x20<link>
...
~]$ grep 2a05:d018:223:f00:97af:f4d9:eac2:6a0f /etc/scylla/scylla.yaml
- seeds: "2a05:d018:223:f00:97af:f4d9:eac2:6a0f"
listen_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
rpc_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
broadcast_rpc_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
prometheus_address: 2a05:d018:223:f00:97af:f4d9:eac2:6a0f
~]$ nodetool status
Datacenter: UNKNOWN_DC
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 2a05:d018:223:f00:ff95:6149:ffb5:3266 507.87 KB 256 ? a2c865a5-1309-4cc0-a4cb-29c7492a18a6 UNKNOWN_RACK
UN 2a05:d018:223:f00:ff99:921c:2362:ffd6 507.71 KB 256 ? 39d5ffc2-5340-4c0c-9357-9963be6a1d3b UNKNOWN_RACK
UN 2a05:d018:223:f00:ffaf:ffd9:ffc2:6a0f 622.49 KB 256 ? eb70f0e1-d880-444d-9bec-121db27d0601 UNKNOWN_RACK
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
The eth0 interface is 2a05:d018:223:f00:97af:f4d9:eac2:6a0f, scylla is bound to the address, and it shows up in the output.
All local info is consistent.
nope, look closely... they are similar, but the addresses are different:
ffaf:ffd9:ffc2:6a0f
97af:f4d9:eac2:6a0f
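To make the difference easier to spot, here is a minimal sketch that compares the two addresses group by group. It uses only Python's standard `ipaddress` module; the helper name `differing_groups` is made up for illustration and is not part of any Scylla tooling.

```python
import ipaddress


def differing_groups(actual, reported):
    """Return (index, actual_group, reported_group) for each 16-bit group that differs."""
    # .exploded yields all eight groups, zero-padded to four hex digits
    a = ipaddress.IPv6Address(actual).exploded.split(":")
    r = ipaddress.IPv6Address(reported).exploded.split(":")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, r)) if x != y]


# Address from `ip addr` / scylla.yaml vs. address shown by `nodetool status`
for index, actual, reported in differing_groups(
    "2a05:d018:223:f00:97af:f4d9:eac2:6a0f",
    "2a05:d018:223:f00:ffaf:ffd9:ffc2:6a0f",
):
    print(f"group {index}: actual={actual} reported={reported}")
```

For this pair only groups 4, 5 and 6 differ, and in each of them the low byte is preserved while the high byte is replaced by `ff`.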
Sorry, missed that (confused by the talk of aa5f:2295 above).
Looks like somehow the accumulator mask in print routine is not cleared...?
So, a simple test to parse and re-print the addresses in question yields no errors. There have been no changes in the address formatting code since 2019-12-04, which should be 3.3, no?
On the other hand, the pattern above does not indicate any consistent "leak" of byte mask/accumulator, so I am not exactly eager to yell "compiler error" either.
But I cannot repro any issue parsing/printing locally. And you say this node/cluster _works_ properly, so we should assume all addresses are ok in memory? (netstat supports this).
Which gcc was this built with? Did we change gcc version somewhere?
I think _maybe_ it might be different gcc doing a wild sign extension/type promotion that does not happen with mine... It would explain this (need high bit in high nibble -> 9)
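As a rough way to check the sign-extension hypothesis against the addresses in this thread, the sketch below simulates it in Python. This is an assumption about the failure mode, not the actual Scylla formatting code: `format_groups` and `sign_extend_byte` are hypothetical helpers, and the "buggy" path simply treats the low byte of each 16-bit group as a signed 8-bit value before OR-ing it with the high byte.

```python
import ipaddress


def sign_extend_byte(b):
    """Interpret an unsigned byte (0..255) as a signed 8-bit value."""
    return b - 256 if b & 0x80 else b


def format_groups(addr, buggy=False):
    """Print an IPv6 address as eight hex groups, without '::' compression."""
    raw = ipaddress.IPv6Address(addr).packed  # 16 bytes, network order
    groups = []
    for i in range(0, 16, 2):
        hi, lo = raw[i], raw[i + 1]
        if buggy:
            # Hypothetical bug: the low byte is promoted as a signed char,
            # so a set high bit smears 1s over the high byte of the group.
            lo = sign_extend_byte(lo)
        groups.append(format(((hi << 8) | lo) & 0xFFFF, "x"))
    return ":".join(groups)


for addr in (
    "2a05:d018:223:f00:b111:3a2:aa5f:2295",
    "2a05:d018:223:f00:97af:f4d9:eac2:6a0f",
):
    print(format_groups(addr), "->", format_groups(addr, buggy=True))
```

Under this assumption, a group turns into `ffXX` exactly when its low byte has the high bit set. That matches both pairs quoted in this issue, for example `3a2` becoming `ffa2` while `b111` and `6a0f` stay intact.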
I've pushed a patch to https://github.com/elcallio/scylla.git. Is it possible for you to build this with the same compiler as the image was built with and test?
In general, the test uses an AMI pre-created with a specific version; in my case, I'm using the latest AMI created for 3.3.
I will check later how I can create an AMI with your patch and give it a try.
@fgelcer ping
@slivne @elcallio , after a lot of fighting I managed to build an AMI with @elcallio's code, but it looks bogus:
/tmp/tmp.TVQD8HscRO/cassandra-env.sh: line 90: [: -ge: unary operator expected
and then the whole SCT execution fails because it was unable to get nodetool status
@fgelcer I think you used a scylla-tools-java that was broken. Can you please provide the commit you used for scylla-tools-java?
If you used scylladb/scylla-tools-java@6d76e51472ae3350cbe9b9adc2134eec616d9318 or scylladb/scylla-tools-java@d3e5bbe208e1fd3dea7f4bdf82489c14b2f3e210, then it's clear why it was broken.
There is a fix for this in master: scylladb/scylla-tools-java@2f73d3ae37718ec9f7f1e1cd4bffc96aa4db5d9c
Now that scylla-tools-java is fixed, I'm building a new AMI; hopefully it will be ready to try tomorrow.
@slivne / @elcallio , I confirm now (after lots of efforts to create and execute the AMI with @elcallio 's commit) that it works (actually the test failed with some unrelated issues).
@elcallio , can you please patch your change?
Thank you.
@slivne was this fix backported to 4.0? I'm asking because I am seeing it with version 4.0.rc3-0.20200501.eee4c00e29.
Gossip info also returns wrong IP (like nodetool status)
nope, this commit isn't in branch-4.0
Backported to 4.0, 3.3 (4.1 is safe).
Is there any plan for a new release on packages for 3.3 and 4.0 to address this?
It appears this bug also affects all currently packaged versions, and appears to prevent initializing a cluster with IPv6.
Yes, new 3.3 and 4.0 point releases will be issued with this fix.
Is there a planned date for the point releases @avikivity?
(Asking only because we're trying to schedule a deployment on our side that needs this fix. )
@lseelenbinder we'll likely start rolling them tomorrow (we were waiting for #6627, which just landed).
Perfect, thank you! Definitely makes sense to wait on #6627.