Nomad v0.10.0 (25ee121d951939504376c70bf8d7950c1ddb6a82)
CentOS 7 running inside a Vagrant environment on VirtualBox with two network interfaces attached. I have a three-node cluster deployed, but I set two of the nodes to draining so that I could debug this issue on a single node.
Edit: I also tested stock CentOS 7 with a single interface in a single-node cluster, with the latest Consul and Nomad running in dev / dev-connect modes, and had the same result. I tested both CNI plugin builds 0.8.1 and 0.8.2 and the latest Docker from get.docker.com.
My sidecar containers are unable to communicate with their upstreams using the external IP of the VM. The connection attempt times out and is never received by the upstream proxy. I can verify with curl that requests are not getting through to the upstream Envoy proxy.
I can make requests directly to the upstream proxy's internal IP, and I can make requests to the upstream proxy from the host machine, but it does not work from inside the container.
I have used the same job on the provided Ubuntu Xenial Vagrant VM, and the same test works there. I suspect the issue lies in the routing or in iptables, but I am not sure exactly where. I have attached iptables trace logs for my requests that are being lost.
not-working-trace.log
ips_and_firewall_rules.txt
working-trace.log
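A minimal sketch of how this kind of per-packet trace can be captured with the iptables TRACE target (the destination port is illustrative, and this is not necessarily the exact procedure used for the attachments above):
```
# Trace packets headed for the upstream proxy port (23795 is illustrative)
sudo iptables -t raw -A PREROUTING -p tcp --dport 23795 -j TRACE
sudo iptables -t raw -A OUTPUT     -p tcp --dport 23795 -j TRACE

# A netfilter log backend (ipt_LOG or nf_log_ipv4, depending on the kernel)
# must be loaded for the TRACE entries to reach the kernel log
sudo journalctl -k -f | grep 'TRACE:'

# Remove the rules when done
sudo iptables -t raw -D PREROUTING -p tcp --dport 23795 -j TRACE
sudo iptables -t raw -D OUTPUT     -p tcp --dport 23795 -j TRACE
```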
I am essentially deploying the example job from https://www.nomadproject.io/guides/integrations/consul-connect/index.html with modifications and then making a request to the "dashboard proxy" which should be routed to the "api service".
```
job "countdash" {
  datacenters = ["dc1"]

  group "api" {
    network {
      mode = "bridge"
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_task {
          config {
            args = [
              "-c",
              "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
              "-l",
              "debug"
            ]
          }
        }

        sidecar_service {}
      }
    }

    task "web" {
      driver = "docker"

      config {
        image          = "hashicorpnomad/counter-api:v1"
        auth_soft_fail = true
      }
    }
  }

  group "dashboard" {
    network {
      mode = "bridge"

      port "http" {
        static = 9002
        to     = 9002
      }
    }

    service {
      name = "count-dashboard"
      port = "9002"

      connect {
        sidecar_task {
          config {
            args = [
              "-c",
              "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
              "-l",
              "debug"
            ]
          }
        }

        sidecar_service {
          proxy {
            upstreams {
              destination_name = "count-api"
              local_bind_port  = 8080
            }
          }
        }
      }
    }

    task "dashboard" {
      driver = "docker"

      env {
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }

      config {
        image          = "hashicorpnomad/counter-dashboard:v1"
        auth_soft_fail = true
      }
    }
  }
}
```
Hi @dneray and thanks for reporting this! I'm going to look into seeing if I can replicate this, but seeing as how you're seeing it work on an Ubuntu vagrant environment but not CentOS, I'm wondering if there's something environment specific. Any chance you could share a (redacted) Vagrantfile and/or the Nomad client logs?
Thanks so much for looking into it @tgross .
Vagrantfile is attached in a zip; basically it's just minimal CentOS with Docker, Nomad, Consul, and the CNI plugins, with everything running in dev mode through systemd.
My test was just to SSH in, exec into the dashboard container, and run "apk add curl; curl ${COUNTING_SERVICE_URL}".
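Spelled out, that amounts to roughly the following (the container ID is a placeholder):
```
# Find the dashboard container and open a shell in it
sudo docker ps --format '{{.ID}}  {{.Names}}' | grep dashboard
sudo docker exec -it <dashboard-container-id> /bin/sh

# Inside the container (the image is Alpine-based, hence apk)
apk add curl
curl -v "${COUNTING_SERVICE_URL}"
```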
Thanks, that helps a lot. I see the following log lines in the nomad.log file:
Oct 29 18:37:06 test-centos nomad[5629]: 2019-10-29T18:37:06.232Z [WARN ] client.alloc_runner.runner_hook: failed to configure bridge network: alloc_id=ab916589-89f4-a78d-4f58-7cbbc13519e1 err="unable to create chain CNI-HOSTPORT-SETMARK: running [/usr/sbin/iptables -t nat -N CNI-HOSTPORT-SETMARK --wait]: exit status 1: iptables: Chain already exists.
...
Oct 29 18:37:07 test-centos nomad[5629]: 2019-10-29T18:37:07.364Z [WARN ] client.alloc_runner.runner_hook: failed to configure bridge network: alloc_id=ab916589-89f4-a78d-4f58-7cbbc13519e1 err="container veth name provided (eth0) already exists" attempt=2
I suspect that you've encountered the same issue reported in https://github.com/hashicorp/nomad/issues/6567. Let me tag in @nickethier and see whether the logs he was emailed match up.
Thanks,
Indeed, it does appear the first allocation of the second task group failed as mentioned in #6567, which looks like the same bug. However, it was rescheduled on the same node and the second attempt started successfully, yet the communication is still broken.
Ok, I've spent some time digging into this and don't yet have an answer. But I thought I'd dump some of the results of my investigation for discussion here so that I can pull in some of my colleagues @nickethier @shoenig and/or @schmichael as well.
My current theory is that there are three distinct issues:
container veth name provided (eth0) already exists error
First, I was able to reproduce the behavior using your Vagrantfile and job file. The resulting Nomad alloc status:
Allocations
ID Node ID Task Group Version Desired Status Created Modified
3e214065 0f4617ba api 0 run running 42m57s ago 42m45s ago
0738c4c0 0f4617ba api 0 stop failed 43m33s ago 42m57s ago
ececf324 0f4617ba dashboard 0 run running 43m33s ago 43m13s ago
And then from inside the dashboard container:
/app # env | grep COUNTING_SERVICE_URL
COUNTING_SERVICE_URL=http://127.0.0.1:8080
/app # curl -v http://127.0.0.1:8080
* Rebuilt URL to: http://127.0.0.1:8080/
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8080 (#0)
> GET / HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/7.61.1
> Accept: */*
>
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer
The next thing I did was verify that this job works with our Vagrant Ubuntu environment, using 0.10.0 with Consul 1.6.1 just as you have, so this isn't a regression we missed since the last time I looked at it a couple of weeks ago. Connect works fine there... however, it turns out I can occasionally get the container veth name provided (eth0) already exists error even on Ubuntu! The allocation fails, but on Ubuntu, when Nomad reschedules the allocation, everything works just fine afterwards.
Then I dug into the logs to match against the ones you provided. Here's what this looks like from the CentOS machine:
Nomad logs excerpt
Oct 30 13:59:01 test-centos nomad[5390]: 2019-10-30T13:59:01.079Z [WARN ] client.alloc_runner.runner_hook: failed to configure bridge network: alloc_id=0738c4c0-0b51-930b-3d15-2db7bf11a0e1 err="u
nable to create chain CNI-HOSTPORT-SETMARK: running [/usr/sbin/iptables -t nat -N CNI-HOSTPORT-SETMARK --wait]: exit status 1: iptables: Chain already exists.
Oct 30 13:59:01 test-centos nomad[5390]: " attempt=1
...
Oct 30 13:59:01 test-centos nomad[5390]: 2019-10-30T13:59:01.204Z [DEBUG] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: bootstrapping Connect proxy sidecar: alloc_id=ececf324-125a-9a56-f283-37f76c9a5ec7 task=connect-proxy-count-dashboard task=connect-proxy-count-dashboard service=count-dashboard
Oct 30 13:59:01 test-centos nomad[5390]: 2019-10-30T13:59:01.204Z [DEBUG] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: bootstrapping envoy: alloc_id=ececf324-125a-9a56-f283-37f76c9a5ec7 task=connect-proxy-count-dashboard sidecar_for=count-dashboard boostrap_file=/tmp/NomadClient094443659/ececf324-125a-9a56-f283-37f76c9a5ec7/connect-proxy-count-dashboard/secrets/envoy_bootstrap.json sidecar_for_id=_nomad-task-ececf324-125a-9a56-f283-37f76c9a5ec7-group-dashboard-count-dashboard-9002 grpc_addr=unix://alloc/tmp/consul_grpc.sock
...
Oct 30 13:59:02 test-centos nomad[5390]: 2019-10-30T13:59:02.882Z [WARN ] client.alloc_runner.runner_hook: failed to configure bridge network: alloc_id=0738c4c0-0b51-930b-3d15-2db7bf11a0e1 err="container veth name provided (eth0) already exists" attempt=2
...
Oct 30 13:59:04 test-centos nomad[5390]: 2019-10-30T13:59:04.663Z [WARN ] client.alloc_runner.runner_hook: failed to configure bridge network: alloc_id=0738c4c0-0b51-930b-3d15-2db7bf11a0e1 err="container veth name provided (eth0) already exists" attempt=3
The exact initial error message varies slightly on each attempt, but the result ends up being the same. For example, sometimes it looks like this:
Oct 30 15:22:29 test-centos nomad[4179]: 2019-10-30T15:22:29.421Z [WARN ] client.alloc_runner.runner_hook: failed to configure bridge network: alloc_id=90d5ab91-cfea-8cab-c698-7b9e1a67a8ea err="failed to set bridge addr: could not add IP address to "nomad": file exists" attempt=1
My hypothesis here is that when we call out to CNI to set up the veth pair, the veth pair is being set up correctly but then we're getting an error for bridge or iptables rules setup, which bubbles up as an error to networking_bridge_linux.go#L157-L159. In the second and third attempts, we find that the veth pair is already created, which causes the allocation to fail and get rescheduled. Maybe we're handling what the CNI plugin is giving us incorrectly or maybe there's a bug in the CNI plugin, but it's definitely a bug either way.
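A rough sketch of poking at that leftover state by hand (the chain name comes from the log above; the namespace path and names are assumptions about this particular setup):
```
# Any veth pairs left on the host by the failed attempt?
ip -d link show type veth

# Does the CNI port-mapping chain already exist in the nat table?
sudo iptables -t nat -S CNI-HOSTPORT-SETMARK

# Network namespaces created for bridge-mode allocations
# (assuming they live under /var/run/netns on this client)
ls -l /var/run/netns/
sudo ip netns exec <netns-name> ip addr show   # <netns-name> is a placeholder
```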
Next I wanted to verify that the two processes and their Envoy proxies are in the appropriate network namespaces, and that I can reach the Envoy proxy from the dashboard container via curl localhost:19000/. Unfortunately nothing here looks unexpected.
Unremarkable output of various networking tools
# dashboard process
[vagrant@test-centos ~]$ sudo ls -lah /proc/6594/ns/net
lrwxrwxrwx. 1 root root 0 Oct 30 18:13 /proc/6594/ns/net -> net:[4026532112]
[vagrant@test-centos ~]$ sudo ls -lah /proc/6948/ns/net
lrwxrwxrwx. 1 root root 0 Oct 30 18:26 /proc/6948/ns/net -> net:[4026532112]
# api process
[vagrant@test-centos ~]$ sudo ls -lah /proc/7791/ns/net
lrwxrwxrwx. 1 root root 0 Oct 30 18:15 /proc/7791/ns/net -> net:[4026532254]
[vagrant@test-centos ~]$ sudo ls -lah /proc/7602/ns/net
lrwxrwxrwx. 1 root root 0 Oct 30 18:36 /proc/7602/ns/net -> net:[4026532254]
Inside the dashboard container:
/app # curl localhost:19000/listeners
public_listener:0.0.0.0:25327::0.0.0.0:25327
count-api:127.0.0.1:8080::127.0.0.1:8080
/app # netstat -antp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:24615 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8080 0.0.0.0:* LISTEN -
ARP table:
[vagrant@test-centos ~]$ arp
Address HWtype HWaddress Flags Mask Iface
172.26.64.4 ether de:c0:6b:dd:cc:02 C nomad
172.26.64.3 ether 62:fc:2b:0d:20:d0 C nomad
gateway ether 52:54:00:12:35:02 C eth0
10.0.2.3 ether 52:54:00:12:35:03 C eth0
Route table:
[vagrant@test-centos ~]$ ip route
default via 10.0.2.2 dev eth0 proto dhcp metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.26.64.0/20 dev nomad proto kernel scope link src 172.26.64.1
Next I looked into the iptables rules. I can see that there are rules tagged with the alloc ID for the failed allocation, as well as ones routing for an IP that doesn't belong to either of the two running allocations (and therefore belongs to the failed one). As long as Consul, Envoy, and the dashboard don't try to route to a missing IP (which they shouldn't), these rules should be ok to leave in place. However, in the interest of experimenting, I removed these rules as follows. But no joy. This is another case of #6385 but a red herring for this issue.
# note there are no mangle, security, raw rules, or any filter rules for this alloc/IP
[vagrant@test-centos ~]$ sudo iptables -L -v -n --line-number
[vagrant@test-centos ~]$ sudo iptables -D CNI-FORWARD <number>
[vagrant@test-centos ~]$ sudo iptables -t nat -L -v -n --line-number
[vagrant@test-centos ~]$ sudo iptables -t nat -D POSTROUTING <number>
[vagrant@test-centos ~]$ sudo iptables -t nat -F CNI-<chain ID>
[vagrant@test-centos ~]$ sudo iptables -t nat -X CNI-<chain ID>
Next I checked the Envoy logs for the dashboard and can see there are connection timeouts. The IP address 10.0.2.15 is the correct address for the count-api container.
Envoy logs
[2019-10-31 15:39:29.663][1][debug][main] [source/server/server.cc:170] flushing stats
[2019-10-31 15:39:30.055][11][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:201] [C17] new tcp proxy session
[2019-10-31 15:39:30.055][11][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:344] [C17] Creating connection to cluster count-api.default.dc1.internal.9de3a058-b965-a307-1980-c306821f5de4.consul
[2019-10-31 15:39:30.055][11][debug][pool] [source/common/tcp/conn_pool.cc:82] creating a new connection
[2019-10-31 15:39:30.055][11][debug][pool] [source/common/tcp/conn_pool.cc:362] [C18] connecting
[2019-10-31 15:39:30.055][11][debug][connection] [source/common/network/connection_impl.cc:704] [C18] connecting to 10.0.2.15:25461
[2019-10-31 15:39:30.055][11][debug][connection] [source/common/network/connection_impl.cc:713] [C18] connection in progress
[2019-10-31 15:39:30.055][11][debug][pool] [source/common/tcp/conn_pool.cc:108] queueing request due to no available connections
[2019-10-31 15:39:30.055][11][debug][main] [source/server/connection_handler_impl.cc:280] [C17] new connection
[2019-10-31 15:39:33.341][12][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:201] [C19] new tcp proxy session
[2019-10-31 15:39:33.341][12][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:344] [C19] Creating connection to cluster local_app
[2019-10-31 15:39:33.341][12][debug][pool] [source/common/tcp/conn_pool.cc:82] creating a new connection
[2019-10-31 15:39:33.341][12][debug][pool] [source/common/tcp/conn_pool.cc:362] [C20] connecting
[2019-10-31 15:39:33.341][12][debug][connection] [source/common/network/connection_impl.cc:704] [C20] connecting to 127.0.0.1:9002
[2019-10-31 15:39:33.341][12][debug][connection] [source/common/network/connection_impl.cc:713] [C20] connection in progress
[2019-10-31 15:39:33.341][12][debug][pool] [source/common/tcp/conn_pool.cc:108] queueing request due to no available connections
[2019-10-31 15:39:33.341][12][debug][main] [source/server/connection_handler_impl.cc:280] [C19] new connection
[2019-10-31 15:39:33.341][12][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:168] [C19] handshake error: 5
[2019-10-31 15:39:33.341][12][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:201] [C19]
[2019-10-31 15:39:33.341][12][debug][connection] [source/common/network/connection_impl.cc:190] [C19] closing socket: 0
[2019-10-31 15:39:33.341][12][debug][pool] [source/common/tcp/conn_pool.cc:204] canceling pending request
[2019-10-31 15:39:33.341][12][debug][pool] [source/common/tcp/conn_pool.cc:212] canceling pending connection
[2019-10-31 15:39:33.341][12][debug][connection] [source/common/network/connection_impl.cc:101] [C20] closing data_to_write=0 type=1
[2019-10-31 15:39:33.341][12][debug][connection] [source/common/network/connection_impl.cc:190] [C20] closing socket: 1
[2019-10-31 15:39:33.341][12][debug][pool] [source/common/tcp/conn_pool.cc:123] [C20] client disconnected
[2019-10-31 15:39:33.341][12][debug][main] [source/server/connection_handler_impl.cc:80] [C19] adding to cleanup list
[2019-10-31 15:39:33.341][12][debug][pool] [source/common/tcp/conn_pool.cc:236] [C20] connection destroyed
[2019-10-31 15:39:34.663][1][debug][main] [source/server/server.cc:170] flushing stats
[2019-10-31 15:39:35.055][11][debug][pool] [source/common/tcp/conn_pool.cc:401] [C18] connect timeout
[2019-10-31 15:39:35.055][11][debug][connection] [source/common/network/connection_impl.cc:101] [C18] closing data_to_write=0 type=1
[2019-10-31 15:39:35.055][11][debug][connection] [source/common/network/connection_impl.cc:190] [C18] closing socket: 1
[2019-10-31 15:39:35.055][11][debug][pool] [source/common/tcp/conn_pool.cc:123] [C18] client disconnected
[2019-10-31 15:39:35.055][11][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:450] [C17] connect timeout
[2019-10-31 15:39:35.055][11][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:344] [C17] Creating connection to cluster count-api.default.dc1.internal.9de3a058-b965-a307-1980-c306821f5de4.consul
[2019-10-31 15:39:35.055][11][debug][connection] [source/common/network/connection_impl.cc:101] [C17] closing data_to_write=0 type=1
[2019-10-31 15:39:35.055][11][debug][connection] [source/common/network/connection_impl.cc:190] [C17] closing socket: 1
[2019-10-31 15:39:35.055][11][debug][main] [source/server/connection_handler_impl.cc:80] [C17] adding to cleanup list
[2019-10-31 15:39:35.055][11][debug][pool] [source/common/tcp/conn_pool.cc:236] [C18] connection destroyed
However, that made me think to check the Consul service catalog for our service, and this is where I found at least some noticeable difference between Ubuntu and CentOS.
On CentOS:
[
{
"ID": "7d9be3d7-b88c-f61d-8a43-3e542a393f01",
"Node": "test-centos",
"Address": "127.0.0.1",
"Datacenter": "dc1",
"TaggedAddresses": {
"lan": "127.0.0.1",
"wan": "127.0.0.1"
},
"NodeMeta": {
"consul-network-segment": ""
},
"ServiceKind": "connect-proxy",
"ServiceID": "_nomad-task-7cc62290-db29-3194-a73a-bc0f1cafa9cd-group-api-count-api-9001-sidecar-proxy",
"ServiceName": "count-api-sidecar-proxy",
"ServiceTags": [],
"ServiceAddress": "10.0.2.15",
"ServiceWeights": {
"Passing": 1,
"Warning": 1
},
"ServiceMeta": {
"external-source": "nomad"
},
"ServicePort": 26527,
"ServiceEnableTagOverride": false,
"ServiceProxy": {
"DestinationServiceName": "count-api",
"DestinationServiceID": "_nomad-task-7cc62290-db29-3194-a73a-bc0f1cafa9cd-group-api-count-api-9001",
"LocalServiceAddress": "127.0.0.1",
"LocalServicePort": 9001,
"Config": {
"bind_address": "0.0.0.0",
"bind_port": 26527
},
"MeshGateway": {}
},
"ServiceConnect": {},
"CreateIndex": 37,
"ModifyIndex": 37
}
]
On Ubuntu:
[
{
"ID": "7bebf72b-2c06-a140-270b-feb878a5fde0",
"Node": "linux",
"Address": "10.0.2.15",
"Datacenter": "dc1",
"TaggedAddresses": {
"lan": "10.0.2.15",
"wan": "10.0.2.15"
},
"NodeMeta": {
"consul-network-segment": ""
},
"ServiceKind": "connect-proxy",
"ServiceID": "_nomad-task-85e5658f-ece4-6717-2705-4e9d8b57bf55-group-api-count-api-9001-sidecar-proxy",
"ServiceName": "count-api-sidecar-proxy",
"ServiceTags": [],
"ServiceAddress": "10.0.2.15",
"ServiceWeights": {
"Passing": 1,
"Warning": 1
},
"ServiceMeta": {
"external-source": "nomad"
},
"ServicePort": 29496,
"ServiceEnableTagOverride": false,
"ServiceProxy": {
"DestinationServiceName": "count-api",
"DestinationServiceID": "_nomad-task-85e5658f-ece4-6717-2705-4e9d8b57bf55-group-api-count-api-9001",
"LocalServiceAddress": "127.0.0.1",
"LocalServicePort": 9001,
"Config": {
"bind_address": "0.0.0.0",
"bind_port": 29496
},
"MeshGateway": {}
},
"ServiceConnect": {},
"CreateIndex": 364,
"ModifyIndex": 364
}
]
The key difference here is that in the Ubuntu case, the ServiceAddress and Address fields match, whereas in the CentOS case they do not. So my hypothesis for the difference in behavior between CentOS and Ubuntu is currently that the underlying CNI tooling is doing something that isn't cross-distro compatible, and so we're getting the wrong IP to send up to Consul. There's some precedent for that with Debian and CNI plugins relying on iptables vs nftables binaries.
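For reference, the same data can be pulled straight from the Consul HTTP API; a quick way to compare the node Address and ServiceAddress on both machines (assuming the agent listens on the default port and jq is available):
```
# Catalog entry for the count-api sidecar proxy
curl -s http://127.0.0.1:8500/v1/catalog/service/count-api-sidecar-proxy \
  | jq '.[0] | {Node, Address, ServiceAddress, ServicePort}'

# The agent's own advertised (node) address, for comparison
curl -s http://127.0.0.1:8500/v1/agent/self | jq '.Member.Addr'
```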
I checked the Consul catalog API and I think the address vs service address is a red herring as well. The Address is the address of the _node_, which in the Ubuntu case I have here is the bind_addr from my Consul configuration: "{{ GetInterfaceIP \"eth0\" }}"
I just took a pass at building bcc-tools on both CentOS and Ubuntu to check the network flows. It looks like we're trying to connect to the right IP address but never getting an accept from the other side. That narrows it down quite a bit. I'm currently grubbing through some execsnoop output to see if I can figure out what might have been set up differently (there's definitely some differences I'm seeing right off the bat but I need to figure out if those are at all relevant). These are here https://gist.github.com/tgross/41f7a2800d847cc5423a719f6e439d3c
[vagrant@test-centos build]$ sudo /usr/share/bcc/tools/tcpconnect
Tracing connect ... Hit Ctrl-C to end
PID COMM IP SADDR DADDR DPORT
...
29644 curl 4 127.0.0.1 127.0.0.1 8080
2652 envoy 4 172.26.64.15 10.0.2.15 23795
[vagrant@test-centos build]$ sudo /usr/share/bcc/tools/tcpaccept
PID COMM IP RADDR RPORT LADDR LPORT
...
2652 envoy 4 127.0.0.1 41918 127.0.0.1 8080
[vagrant@test-centos ~]$ sudo docker exec -it 965 netstat -antp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:23795 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:19000 0.0.0.0:* LISTEN -
...
tcp 0 0 :::9001 :::* LISTEN 1/counting-service
Yup, I can see the same! Looks like we're getting closer.
Ubuntu:
20:43:58.327532 IP 172.26.64.13.59866 > 10.0.2.15.26120: Flags [S], seq 1290541388, win 29200, options [mss 1460,sackOK,TS val 9569942 ecr 0,nop,wscale 7], length 0
20:43:58.327558 IP 172.26.64.13.59866 > 172.26.64.14.26120: Flags [S], seq 1290541388, win 29200, options [mss 1460,sackOK,TS val 9569942 ecr 0,nop,wscale 7], length 0
20:43:58.327576 IP 172.26.64.14.26120 > 172.26.64.13.59866: Flags [S.], seq 3566183920, ack 1290541389, win 28960, options [mss 1460,sackOK,TS val 9569942 ecr 9569942,nop,wscale 7], length 0
20:43:58.327580 IP 10.0.2.15.26120 > 172.26.64.13.59866: Flags [S.], seq 3566183920, ack 1290541389, win 28960, options [mss 1460,sackOK,TS val 9569942 ecr 9569942,nop,wscale 7], length 0
20:43:58.327590 IP 172.26.64.13.59866 > 10.0.2.15.26120: Flags [.], ack 1, win 229, options [nop,nop,TS val 9569942 ecr 9569942], length 0
20:43:58.327594 IP 172.26.64.13.59866 > 172.26.64.14.26120: Flags [.], ack 1, win 229, options [nop,nop,TS val 9569942 ecr 9569942], length 0
20:43:58.327790 IP 172.26.64.13.59866 > 10.0.2.15.26120: Flags [P.], seq 1:1127, ack 1, win 229, options [nop,nop,TS val 9569942 ecr 9569942], length 1126
CentOS:
20:53:54.094624 IP 172.26.64.15.37040 > test-centos.23795: Flags [S], seq 189422371, win 29200, options [mss 1460,sackOK,TS val 6218508 ecr 0,nop,wscale 7], length 0
20:53:54.094624 IP 172.26.64.15.37040 > test-centos.23795: Flags [S], seq 189422371, win 29200, options [mss 1460,sackOK,TS val 6218508 ecr 0,nop,wscale 7], length 0
20:53:54.094660 IP 172.26.64.15.37040 > 172.26.64.16.23795: Flags [S], seq 189422371, win 29200, options [mss 1460,sackOK,TS val 6218508 ecr 0,nop,wscale 7], length 0
20:53:54.094662 IP 172.26.64.15.37040 > 172.26.64.16.23795: Flags [S], seq 189422371, win 29200, options [mss 1460,sackOK,TS val 6218508 ecr 0,nop,wscale 7], length 0
20:53:54.094680 IP 172.26.64.16.23795 > 172.26.64.15.37040: Flags [S.], seq 3597319214, ack 189422372, win 28960, options [mss 1460,sackOK,TS val 6218508 ecr 6218508,nop,wscale 7], length 0
20:53:54.094683 IP 172.26.64.16.23795 > 172.26.64.15.37040: Flags [S.], seq 3597319214, ack 189422372, win 28960, options [mss 1460,sackOK,TS val 6218508 ecr 6218508,nop,wscale 7], length 0
20:53:54.094691 IP 172.26.64.15.37040 > 172.26.64.16.23795: Flags [R], seq 189422372, win 0, length 0
20:53:54.094692 IP 172.26.64.15.37040 > 172.26.64.16.23795: Flags [R], seq 189422372, win 0, length 0
20:53:54.604277 IP test-centos.51618 > 172.26.64.16.23795: Flags [S], seq 3110955261, win 43690, options [mss 65495,sackOK,TS val 6219018 ecr 0,nop,wscale 7], length 0
20:53:54.604286 IP test-centos.51618 > 172.26.64.16.23795: Flags [S], seq 3110955261, win 43690, options [mss 65495,sackOK,TS val 6219018 ecr 0,nop,wscale 7], length 0
20:53:54.604309 IP 172.26.64.16.23795 > test-centos.51618: Flags [S.], seq 3975660752, ack 3110955262, win 28960, options [mss 1460,sackOK,TS val 6219018 ecr 6219018,nop,wscale 7], length 0
Ah, I deleted my reply to update it because I had my tcpdumps mixed up, but you beat me. Indeed, the public IP seems to be replaced by the time the packets reach the upstream proxy on CentOS. I am looking back at the iptables traces that I posted originally and those also look quite different.
There is no mention of PHYSIN and PHYSOUT referring to the virtual interface in the CentOS traces. I'll see if I can figure out what that means, because I'm not too familiar with iptables.
Edit: Given that the rules look the same, I'm wondering if this could be a difference in behaviour between iptables versions. CentOS is using 1.4.21 and Ubuntu is using 1.6.0.
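A couple of read-only checks that might help compare the two environments (a sketch; nothing here changes state):
```
# Userspace iptables version (1.4.21 on this CentOS 7 box vs 1.6.0 on Xenial)
iptables --version

# PHYSIN/PHYSOUT only show up in trace output when traffic crossing a Linux
# bridge is handed to iptables, i.e. when bridge netfilter is in effect
lsmod | grep br_netfilter
sysctl net.bridge.bridge-nf-call-iptables   # key is absent if br_netfilter is not loaded
```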
It seems to be the same issue described here: https://bugzilla.redhat.com/show_bug.cgi?id=1703261#c18. I can confirm that their fix works: 'echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables' resolves the issue.
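A sketch of applying that workaround and re-testing (the container ID is a placeholder; the curl target mirrors the earlier test):
```
# Make sure the bridge netfilter module is loaded (it may already be)
sudo modprobe br_netfilter

# Let traffic crossing the nomad bridge be evaluated by iptables
sudo sysctl -w net.bridge.bridge-nf-call-iptables=1

# Re-run the curl test from inside the dashboard container; the upstream
# listener is bound at 127.0.0.1:8080 inside that network namespace
sudo docker exec -it <dashboard-container-id> sh -c 'apk add curl && curl -v http://127.0.0.1:8080'
```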
Thanks a lot for helping investigate @tgross
Maybe a pre-deployment check would make sense for this?
I've verified that toggle fixes the connectivity problem. Thanks @dneray!
After a quick test, I then tore down the CentOS VM I was testing on and started over from scratch, and flipped the toggle. I do still see the problem where the API container is initially failing with the container veth name provided (eth0) already exists error. But now when Nomad reschedules it, connectivity is good. So #6567 turns out to be a separate issue entirely.
Next steps: document this requirement, and look into making this class of problem part of a pre-flight check (more on that below).
Thanks again @dneray for sticking it out and helping through the investigation!
Aside, now that I know what to look for, I found this in dmesg. It was telling us right there the whole time! 😊
[ 6.140229] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
And this is part of the CentOS sysctl configuration:
$ cat /usr/lib/sysctl.d/00-system.conf
# Kernel sysctl configuration file
#
# For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and
# sysctl.conf(5) for more details.
# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
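For reference, a sketch of making the override persistent on CentOS 7 so it survives a reboot (the file names are just suggestions):
```
# Load br_netfilter at boot
echo br_netfilter | sudo tee /etc/modules-load.d/br_netfilter.conf

# Override the distro default above; sysctl.d files are applied in lexical
# order, so 90-bridge-nf.conf is read after 00-system.conf and wins
sudo tee /etc/sysctl.d/90-bridge-nf.conf <<'EOF'
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF

# Apply immediately without a reboot
sudo sysctl --system
```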
@dneray for now we're going to document this and that's just been merged as #6607, but I've added some notes to #6557 so we can make checking for this class of problem part of "pre-flight" checking as you've suggested here. Thanks again so much for your help on this!