$ nomad --version
Nomad v0.10.0 (25ee121d951939504376c70bf8d7950c1ddb6a82)
Amazon Linux 2
An allocation being scheduled failed with:
failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure bridge network: container veth name provided (eth0) already exists
Nothing special: an update to a job was submitted to increase its count and change some meta to force the allocations to be recreated. The rest of the allocations scheduled fine; we just saw this one fail. It did reschedule just fine on the same host; I just figured this should be logged.
$ cat /var/log/messages | grep nomad | grep 917cdc42-b4ba-2004-876a-0aafdbdabf98
Oct 25 20:08:18 ip-10-178-43-106.compute.internal nomad[6003]: client.alloc_runner.runner_hook: failed to configure bridge network: alloc_id=917cdc42-b4ba-2004-876a-0aafdbdabf98 err="unable to create chain CNI-HOSTPORT-SETMARK: running [/usr/sbin/iptables -t nat -N CNI-HOSTPORT-SETMARK --wait]: exit status 1: iptables: Chain already exists.#012" attempt=1
Oct 25 20:08:20 ip-10-178-43-106.compute.internal nomad[6003]: client.alloc_runner.runner_hook: failed to configure bridge network: alloc_id=917cdc42-b4ba-2004-876a-0aafdbdabf98 err="container veth name provided (eth0) already exists" attempt=2
Oct 25 20:08:21 ip-10-178-43-106.compute.internal nomad[6003]: client.alloc_runner.runner_hook: failed to configure bridge network: alloc_id=917cdc42-b4ba-2004-876a-0aafdbdabf98 err="container veth name provided (eth0) already exists" attempt=3
Oct 25 20:08:21 ip-10-178-43-106.compute.internal nomad[6003]: client.alloc_runner: prerun failed: alloc_id=917cdc42-b4ba-2004-876a-0aafdbdabf98 error="pre-run hook "network" failed: failed to configure networking for alloc: failed to configure bridge network: container veth name provided (eth0) already exists"
Oct 25 20:08:21 ip-10-178-43-106.compute.internal nomad[6003]: client.gc: marking allocation for GC: alloc_id=917cdc42-b4ba-2004-876a-0aafdbdabf98
Oct 25 20:08:21 ip-10-178-43-106.compute.internal nomad[6003]: client.alloc_runner.runner_hook: failed to cleanup network for allocation, resources may have leaked: alloc_id=917cdc42-b4ba-2004-876a-0aafdbdabf98 alloc=917cdc42-b4ba-2004-876a-0aafdbdabf98 error="missing prevResult from earlier plugin"
More logs sent via email.
Hi @djenriquez and thank you for reporting this! Just wanted to let you know that this looks very similar to #6580 and I'm investigating this currently.
Thanks @tgross. I saw it a few more times, but again, not _really_ an issue since it seems to resolve right away on reschedule. Definitely seems to be some kind of race condition.
Ok, so as described in https://github.com/hashicorp/nomad/issues/6580#issuecomment-548596608, it turns out this is separate from the connectivity problem described there.
My theory from #6580 which I'll investigate here:
My hypothesis here is that when we call out to CNI to set up the veth pair, the veth pair is being set up correctly, but then we're getting an error during bridge or iptables rule setup, which bubbles up as an error to
networking_bridge_linux.go#L163-L166. In the second and third attempts, we find that the veth pair has already been created, which causes the allocation to fail and get rescheduled. Maybe we're handling what the CNI plugin is giving us incorrectly, or maybe there's a bug in the CNI plugin, but it's definitely a bug either way.
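To make that failure mode concrete, here's a tiny, self-contained Go model of the hypothesis (hypothetical helpers, not Nomad's or CNI's actual code): the first attempt leaves the veth pair behind when a later plugin fails, so every retry of the whole plugin chain trips over it.

```go
package main

import (
	"errors"
	"fmt"
)

// vethExists models host state (the veth pair) that survives a failed CNI run.
var vethExists bool

// runPluginChain is a stand-in for the bridge -> firewall -> portmap chain
// that go-cni invokes; it is NOT real Nomad or CNI code.
func runPluginChain() error {
	if vethExists {
		// Later attempts: the bridge plugin refuses to recreate the veth.
		return errors.New("container veth name provided (eth0) already exists")
	}
	vethExists = true // the bridge plugin created the veth pair...
	// ...but a later plugin lost the race creating a shared iptables chain.
	return errors.New("unable to create chain CNI-HOSTPORT-SETMARK: chain already exists")
}

func main() {
	// Nomad retries the network hook, but each retry re-runs the chain from
	// the top, so every attempt after the first fails on the leftover veth.
	for attempt := 1; attempt <= 3; attempt++ {
		if err := runPluginChain(); err != nil {
			fmt.Printf("attempt %d: %v\n", attempt, err)
			continue
		}
		fmt.Println("network configured")
		return
	}
}
```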
tl;dr
There are two things to do here:
1. Submit an upstream fix for the race condition in the `firewall` and `portmap` plugins. I'll start on this immediately.
2. Have Nomad create the `nomad` bridge and the iptables chains we need at startup, to avoid any undiscovered or future race conditions in the CNI plugins. (See https://github.com/hashicorp/nomad/issues/6618)

Nomad's Connect integration uses CNI plugins to create the network namespaces, the bridge network, container veth pairs, and iptables rules. In the happy path case, we configure CNI using the `go-cni` library and pass a chain of CNI plugin configurations (an illustrative config list is sketched below). When Nomad gets a Connect-enabled job, it hands off control to the `go-cni` library, which invokes the three plugins as external binaries. These external binaries in turn shell out to tools like `iptables` on the host. When things go well:

1. The `bridge` plugin creates the `nomad` bridge and the veth pair for the container, placing one end inside the container as `eth0`.
2. The `firewall` plugin creates some iptables chains and associated rules.
3. The `portmap` plugin creates some iptables chains and associated rules.

The problem we're seeing is a race condition in creating iptables chains when multiple Connect-enabled task groups are starting on the same client node, which is of course the common case for folks who are trying out our Connect integration in a dev environment.
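For readers unfamiliar with CNI, here is an illustrative config list for that three-plugin chain, embedded as a Go string the way Nomad embeds its own. This is not Nomad's exact configuration; the plugin options and subnet shown are assumptions for illustration only.

```go
package bridgeexample

// exampleConflist sketches the shape of the CNI config list that go-cni
// loads: the bridge, firewall, and portmap plugins are executed in this
// order for each Connect-enabled allocation. Options here are illustrative,
// not Nomad's embedded configuration.
const exampleConflist = `{
  "cniVersion": "0.4.0",
  "name": "nomad",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "nomad",
      "isGateway": true,
      "ipMasq": true,
      "ipam": { "type": "host-local", "subnet": "172.26.64.0/20" }
    },
    { "type": "firewall" },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true },
      "snat": true
    }
  ]
}`
```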
In order to reproduce this, you need to flush all firewall chains/rules (it's easiest to simply restart the node) and then run a job with multiple Connect-enabled task groups on the same client. One of the allocations will fail. Then Nomad will retry the allocation a few seconds later. At that point everything proceeds as planned, and we'll never hit the race condition again on the same client node.
The error messages we get back from CNI are unhelpful but I was able to determine what was happening by using execsnoop to trace CNI's invocations of iptables.
Here are relevant excerpts from a log where I ran execsnoop -tx | tee execsnoop.log:
# in these logs, PID 2886 is Nomad itself
# this is the portmap execution run for task group 1
$ head -1 execsnoop.log && grep 3332 execsnoop.log
TIME(s) PCOMM PID PPID RET ARGS
2.082 portmap 3332 2886 0 "/opt/cni/bin/portmap"
2.087 iptables 3342 3332 0 "/usr/sbin/iptables --version"
2.089 iptables 3343 3332 0 "/usr/sbin/iptables -t nat -S --wait"
2.090 iptables 3345 3332 0 "/usr/sbin/iptables -t nat -N CNI-HOSTPORT-SETMARK --wait"
2.091 iptables 3346 3332 0 "/usr/sbin/iptables -t nat -C CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.101 iptables 3353 3332 0 "/usr/sbin/iptables -t nat -A CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.103 iptables 3356 3332 0 "/usr/sbin/iptables -t nat -N CNI-HOSTPORT-MASQ --wait"
2.105 iptables 3358 3332 0 "/usr/sbin/iptables -t nat -C CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE --wait"
2.107 iptables 3359 3332 0 "/usr/sbin/iptables -t nat -A CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE --wait"
...
# this is the portmap execution run for task group 2
$ head -1 execsnoop.log && grep 3334 execsnoop.log
2.082 portmap 3334 2886 0 "/opt/cni/bin/portmap"
2.089 iptables 3341 3334 0 "/usr/sbin/iptables --version"
2.091 iptables 3344 3334 0 "/usr/sbin/iptables -t nat -S --wait"
2.092 iptables 3347 3334 0 "/usr/sbin/iptables -t nat -C CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.100 iptables 3352 3334 0 "/usr/sbin/iptables -t nat -A CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.102 iptables 3355 3334 0 "/usr/sbin/iptables -t nat -S --wait"
2.104 iptables 3357 3334 0 "/usr/sbin/iptables -t nat -N CNI-HOSTPORT-MASQ --wait"
# this portmap invocation stops here!
So in this situation, we're seeing a timeline like the following:
| Task group 1 | Task group 2 |
| --- | --- |
| `bridge` plugin creates the `nomad` bridge and the veth pair for the container, placing one end inside the container as `eth0`. | `bridge` plugin tries to create the `nomad` bridge and creates the veth pair for its own container, placing one end inside the container as `eth0`. |
| `firewall` plugin creates some iptables rules (eliding details here as we missed the race in this run). | `firewall` plugin creates some iptables rules. |
| `portmap` plugin checks for the existence of the `CNI-HOSTPORT-MASQ` chain and doesn't find it. | `portmap` plugin checks for the existence of the `CNI-HOSTPORT-MASQ` chain and doesn't find it. |
| `portmap` plugin creates the `CNI-HOSTPORT-MASQ` chain. | `portmap` plugin tries to create the `CNI-HOSTPORT-MASQ` chain, but it exists, so it fails! |
| | The error bubbles up through `go-cni` to Nomad, which retries and starts over _at the top_. |
| | `bridge` plugin tries to create the `nomad` bridge and safely no-ops. |
| | `bridge` plugin tries to create the veth pair and finds it already exists, so it fails! |

It turns out (through a bunch of experimenting) that there are two opportunities to hit the race: one in the `firewall` plugin and one in the `portmap` plugin.
Unfortunately, because we're handing off control to the chain of CNI plugins at this point, there's not really a good way for us to handle this condition inside Nomad itself. We don't have enough information at the call site to determine what's happened. Any error message that bubbles up from the plugin comes from the iptables call and those aren't really part of the CNI spec API contract. And once we get into this state, no matter how many times we retry we'll get the same result; the only way Nomad can handle this is by giving up on the allocation and starting over.
I dug into the code for the CNI plugins and found that `portmap` claims to create its iptables chain idempotently but has a TOCTOU (time-of-check to time-of-use) race condition. The `firewall` plugin doesn't even make that claim. I've verified my theory above by running CNI plugins built from the following patch, which is, uh, "not production ready" :grinning: With this patch applied, the jobs run every time.
diff --git a/plugins/meta/firewall/iptables.go b/plugins/meta/firewall/iptables.go
index faae35c..ded69ce 100644
--- a/plugins/meta/firewall/iptables.go
+++ b/plugins/meta/firewall/iptables.go
@@ -43,7 +43,9 @@ func ensureChain(ipt *iptables.IPTables, table, chain string) error {
}
}
- return ipt.NewChain(table, chain)
+ // DEBUG(tgross): ignore errors
+ ipt.NewChain(table, chain)
+ return nil
}
func generateFilterRule(privChainName string) []string {
diff --git a/plugins/meta/portmap/chain.go b/plugins/meta/portmap/chain.go
index bca8214..129e1c0 100644
--- a/plugins/meta/portmap/chain.go
+++ b/plugins/meta/portmap/chain.go
@@ -41,9 +41,8 @@ func (c *chain) setup(ipt *iptables.IPTables) error {
return err
}
if !exists {
- if err := ipt.NewChain(c.table, c.name); err != nil {
- return err
- }
+ // DEBUG(tgross): ignore errors
+ ipt.NewChain(c.table, c.name)
}
// Add the rules to the chain
I'm going to try to work this up into a production-ready patch for submission to the CNI project.
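As a rough sketch of what "production-ready" might look like (my own take, not necessarily the exact approach the upstream patch will use): attempt the chain creation unconditionally and treat the "chain already exists" failure as success, which closes the check-then-create window. The exit status 1 check below matches the `iptables: Chain already exists.` error in the client logs at the top of this issue.

```go
package main

import (
	"fmt"
	"log"

	"github.com/coreos/go-iptables/iptables"
)

// ensureChain creates an iptables chain idempotently: if another process
// created it before us (or between a check and our create), the
// "already exists" error is swallowed instead of failing the plugin.
func ensureChain(ipt *iptables.IPTables, table, chain string) error {
	err := ipt.NewChain(table, chain)
	if err == nil {
		return nil
	}
	// iptables exits with status 1 and "Chain already exists." when the
	// chain is present (see the nomad client logs above); anything else
	// is a real failure.
	if e, ok := err.(*iptables.Error); ok && e.ExitStatus() == 1 {
		return nil
	}
	return fmt.Errorf("failed to create %s chain %s: %w", table, chain, err)
}

func main() {
	ipt, err := iptables.New()
	if err != nil {
		log.Fatalf("iptables init: %v", err)
	}
	if err := ensureChain(ipt, "nat", "CNI-HOSTPORT-SETMARK"); err != nil {
		log.Fatal(err)
	}
}
```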
Also, while the iptables _rules_ for each task group are specific to an allocation, the _chains_ are a client-global resource. So from the Nomad side we could create the `nomad` bridge and all expected CNI iptables chains on startup, rather than risking concurrent creation by CNI and playing whack-a-mole with their implementation. There's probably some design discussion to be had about whether that's overreaching in terms of changing the client's state without being explicitly asked, but I'd lean towards doing it. (See https://github.com/hashicorp/nomad/issues/6618 for further discussion.)
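For reference, here's a minimal sketch of that startup pre-creation idea (hypothetical, not something Nomad ships; the chain names are just the ones seen in the logs above): create the `nomad` bridge and the client-global chains once, before any allocations run, so the CNI plugins never race to create them.

```go
// Hypothetical sketch of the idea in #6618; not Nomad's implementation.
package main

import (
	"log"

	"github.com/coreos/go-iptables/iptables"
	"github.com/vishvananda/netlink"
)

func main() {
	// Create the nomad bridge if it doesn't already exist.
	if _, err := netlink.LinkByName("nomad"); err != nil {
		br := &netlink.Bridge{LinkAttrs: netlink.LinkAttrs{Name: "nomad"}}
		if err := netlink.LinkAdd(br); err != nil {
			log.Fatalf("creating bridge: %v", err)
		}
		if err := netlink.LinkSetUp(br); err != nil {
			log.Fatalf("bringing bridge up: %v", err)
		}
	}

	// Pre-create the client-global iptables chains the CNI plugins expect,
	// tolerating "Chain already exists" (iptables exit status 1).
	ipt, err := iptables.New()
	if err != nil {
		log.Fatalf("iptables init: %v", err)
	}
	for _, chain := range []string{"CNI-HOSTPORT-SETMARK", "CNI-HOSTPORT-MASQ"} {
		if err := ipt.NewChain("nat", chain); err != nil {
			if e, ok := err.(*iptables.Error); ok && e.ExitStatus() == 1 {
				continue // already exists
			}
			log.Fatalf("creating chain %s: %v", chain, err)
		}
	}
}
```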
cc @nickethier @shoenig and @schmichael as a heads up and for discussion
I've opened https://github.com/containernetworking/plugins/pull/408 for the upstream fix.
https://github.com/containernetworking/plugins/pull/408 was merged but just missed 0.8.3, so we'll want to watch for their 0.8.4 release and then update the documentation for our Connect integration.
I'll keep this issue open until that's been released and we have our docs updated.
Glad to hear it, @tgross. So this fix would require an update to the CNI binary; would it require any changes on the Nomad side?
Right. No changes to Nomad.
https://github.com/containernetworking/plugins/releases/tag/v0.8.4 was just released. I'm going to get that into testing and then we'll update the docs to recommend that version.