Flannel: Flannel not picking up etcd_endpoints flag

Created on 27 Apr 2015  ·  20 comments  ·  Source: coreos/flannel

I am installing flannel on a CoreOS machine on AWS using a cloud-config file.
I provide the etcd_endpoints flag and point it at my etcd cluster. Yet journalctl shows flannel trying to retrieve from http://localhost:4001, and the flannel install fails.

All 20 comments

The documentation regarding command-line flags (including the built-in help) is incorrect. I got this to work:

flanneld -ip-masq -etcd-endpoints https://127.0.0.1:2379 -etcd-cafile /etc/etcd/ca.crt -etcd-certfile /etc/etcd/host.crt -etcd-keyfile /etc/etcd/host.key

Note the _lack_ of equal signs. A lot of CoreOS' tools have similar problems. A lot of it has to do with the fact that people are used to GNU getopt-style command line parsing where "--opt=value" is equivalent to "--opt value". The Go "flag" library claims to do the same thing but clearly doesn't.

The problem is that flanneld.service doesn't read env-vars from the /run/flannel/options.env file before it starts the flanneld container.

/opt/bin/flanneld --ip-masq=true (missing additional flags here)

@tanmaybinaykiya @jcollie we tested ./flanneld -etcd-endpoints="http://192.168.198.130:2379" with go1.4.2 and it worked fine. Which version of Go are you using?
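
For reference, Go's standard flag package documents both the "-flag=value" and "-flag value" forms for non-boolean flags, with either one or two leading dashes. The following is a minimal sketch that exercises all four spellings. The helper parseEndpoints and the flag set name are hypothetical; only the flag name is borrowed from flanneld, and the empty default is a placeholder, not flanneld's real default.

```go
package main

import (
	"flag"
	"fmt"
)

// parseEndpoints is a hypothetical helper: it registers a single string flag
// named like flanneld's -etcd-endpoints and parses the given arguments.
func parseEndpoints(args []string) (string, error) {
	fs := flag.NewFlagSet("flanneld-sketch", flag.ContinueOnError)
	// The empty default is a placeholder, not flanneld's actual default.
	endpoints := fs.String("etcd-endpoints", "", "comma-separated etcd endpoints")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *endpoints, nil
}

func main() {
	const url = "http://192.168.198.130:2379"
	forms := [][]string{
		{"-etcd-endpoints=" + url},
		{"-etcd-endpoints", url},
		{"--etcd-endpoints=" + url},
		{"--etcd-endpoints", url},
	}
	for _, args := range forms {
		got, err := parseEndpoints(args)
		fmt.Printf("%v -> %q (err: %v)\n", args, got, err)
	}
}
```

If all four spellings yield the same value, a flag that appears to be ignored is more likely lost before it reaches flanneld (for example in the unit file) than mis-parsed.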

@mnemotiv is this the flanneld.service that is shipped with coreos?

@MohdAhmad yes, with latest alpha 695

This problem exists if I deploy the server with cloud-config:

...
  flannel:
    etcd-endpoints: https://127.0.0.1:2379
    etcd-cafile: /var/run/...../ca.crt
    etcd-certfile: /var/run/...../master.crt
    etcd-keyfile: /var/run/...../master.key
...

flanneld.service output from journalctl:

Jun 10 13:44:01 k8s-master-001 systemd[1]: Starting Network fabric for containers...
Jun 10 13:44:02 k8s-master-001 etcdctl[644]: Error:  cannot sync with the cluster using endpoints http://127.0.0.1:4001, http://127.0.0.1:2379
Jun 10 13:44:02 k8s-master-001 systemd[1]: flanneld.service: control process exited, code=exited status=2
Jun 10 13:44:02 k8s-master-001 systemd[1]: Failed to start Network fabric for containers.
Jun 10 13:44:02 k8s-master-001 systemd[1]: Unit flanneld.service entered failed state.
Jun 10 13:44:02 k8s-master-001 systemd[1]: flanneld.service failed.
Jun 10 13:44:07 k8s-master-001 systemd[1]: flanneld.service holdoff time over, scheduling restart.

systemctl status flanneld

● flanneld.service - Network fabric for containers
   Loaded: loaded (/usr/lib64/systemd/system/flanneld.service; static; vendor preset: disabled)
  Drop-In: /etc/systemd/system/flanneld.service.d
           └─50-certs-config.conf, 51-network-config.conf
   Active: activating (auto-restart) (Result: exit-code) since Thu 2015-06-11 06:30:30 UTC; 1s ago
     Docs: https://github.com/coreos/flannel
  Process: 31186 ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config {"Network":"10.244.0.0/16", "Backend": {"Type": "vxlan"}} (code=exited, status=4)
  Process: 31182 ExecStartPre=/usr/bin/touch /run/flannel/options.env (code=exited, status=0/SUCCESS)
  Process: 31180 ExecStartPre=/usr/bin/mkdir -p ${ETCD_SSL_DIR} (code=exited, status=0/SUCCESS)
  Process: 31176 ExecStartPre=/usr/bin/mkdir -p /run/flannel (code=exited, status=0/SUCCESS)
  Process: 31175 ExecStartPre=/sbin/modprobe ip_tables (code=exited, status=0/SUCCESS)

Jun 11 06:30:30 k8s-master-001 systemd[1]: Unit flanneld.service entered failed state.
Jun 11 06:30:30 k8s-master-001 systemd[1]: flanneld.service failed.

P.S. None of the flanneld drop-ins in my cloud-config set any env-vars that could override flanneld's native env-vars. flanneld generates the correct env-vars in its env-file, but it doesn't use them as argument flags for its container.

I'm starting to think that this is more of an etcdctl problem than a flanneld one...

etcd2 works fine, fleet works fine, flannel with etcdctl does not.

@mnemotiv Sorry to have dropped the ball on this. cloud-config values get written to the /run/flannel/options.env file, which gets passed to docker via --env-file. flannel knows to use FLANNELD_* env-vars together with command-line args (it basically merges them). Unfortunately, any env-vars set in drop-ins will not propagate into the flannel daemon. They can only be used to influence the flanneld.service file itself (e.g. FLANNEL_VER and ETCD_SSL_DIR).
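
The merge of FLANNELD_* env-vars with command-line args described above can be sketched roughly as follows. This is an illustration, not flannel's actual code: setFlagsFromEnv and resolve are hypothetical names, the rule "upper-case the flag name and replace '-' with '_'" is assumed from the FLANNELD_ETCD_ENDPOINTS example in this thread, and letting an explicit flag win over the environment is likewise an assumption.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// setFlagsFromEnv fills every flag that was NOT given on the command line
// from a variable named prefix + upper-cased flag name with "-" replaced
// by "_". lookup is injected so the sketch does not touch the real environment.
func setFlagsFromEnv(fs *flag.FlagSet, prefix string, lookup func(string) (string, bool)) error {
	explicit := map[string]bool{}
	fs.Visit(func(f *flag.Flag) { explicit[f.Name] = true }) // flags set by Parse

	var firstErr error
	fs.VisitAll(func(f *flag.Flag) {
		if explicit[f.Name] || firstErr != nil {
			return
		}
		key := prefix + strings.ToUpper(strings.ReplaceAll(f.Name, "-", "_"))
		if val, ok := lookup(key); ok {
			if err := fs.Set(f.Name, val); err != nil {
				firstErr = fmt.Errorf("%s: %w", key, err)
			}
		}
	})
	return firstErr
}

// resolve is a hypothetical wrapper: parse args first, then fall back to env.
func resolve(args []string, env map[string]string) (endpoints string, ipMasq bool, err error) {
	fs := flag.NewFlagSet("flanneld-sketch", flag.ContinueOnError)
	ep := fs.String("etcd-endpoints", "", "etcd endpoints (placeholder default)")
	masq := fs.Bool("ip-masq", false, "set up IP masquerading")
	if err = fs.Parse(args); err != nil {
		return
	}
	lookup := func(k string) (string, bool) { v, ok := env[k]; return v, ok }
	if err = setFlagsFromEnv(fs, "FLANNELD_", lookup); err != nil {
		return
	}
	return *ep, *masq, nil
}

func main() {
	// Mirrors the unit file: only --ip-masq=true is on the command line,
	// the endpoint arrives through the (simulated) env-file.
	env := map[string]string{"FLANNELD_ETCD_ENDPOINTS": "https://127.0.0.1:2379"}
	ep, masq, err := resolve([]string{"--ip-masq=true"}, env)
	fmt.Println(ep, masq, err)
}
```

Under these assumptions, the endpoint only reaches the daemon if the variable is present inside the container, which is why variables set in systemd drop-ins have no effect on it.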

@eyakubovich OK, cool. As I saw in the CoreOS documentation, flannel is packaged as a docker container to save disk space on the OS "for those who won't use flannel" - who won't use flannel? :)

@mnemotiv it's partially to save space (only pay for what you use) but also to demonstrate that you can run infrastructure pieces in a container (although currently with Docker, it does require some acrobatics). So if somebody wants to run an overlay other than flannel, they can swap it out.

@eyakubovich so how do I tell flannel that my etcd endpoint is on https://localhost:2379 ?

@vaijab

FLANNELD_ETCD_ENDPOINTS=http://127.0.0.1:2379

Just supply this env-var from a file, or set it directly as a variable in flanneld.service / docker run...

https://github.com/nodetemple/nodetemple-deprecated/blob/master/daemon/environment-setup
https://github.com/nodetemple/nodetemple-deprecated/blob/master/fleet/flanneld.service

@mnemotiv I am not sure that actually works. You probably got the impression that it works because flannel defaults to http://127.0.0.1:2379 anyway.

core@ip-10-50-0-221 ~ $ systemctl cat flanneld
# /usr/lib64/systemd/system/flanneld.service
[Unit]
Description=Network fabric for containers
Documentation=https://github.com/coreos/flannel
Requires=early-docker.service
After=etcd.service etcd2.service early-docker.service
Before=early-docker.target

[Service]
Type=notify
Restart=always
RestartSec=5
Environment="TMPDIR=/var/tmp/"
Environment="DOCKER_HOST=unix:///var/run/early-docker.sock"
Environment="FLANNEL_VER=0.5.1"
Environment="ETCD_SSL_DIR=/etc/ssl/etcd"
LimitNOFILE=40000
LimitNPROC=1048576
ExecStartPre=/sbin/modprobe ip_tables
ExecStartPre=/usr/bin/mkdir -p /run/flannel
ExecStartPre=/usr/bin/mkdir -p ${ETCD_SSL_DIR}
ExecStartPre=/usr/bin/touch /run/flannel/options.env

ExecStart=/usr/libexec/sdnotify-proxy /run/flannel/sd.sock \
  /usr/bin/docker run --net=host --privileged=true --rm \
  --volume=/run/flannel:/run/flannel \
  --env=NOTIFY_SOCKET=/run/flannel/sd.sock \
  --env=AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
  --env=AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
  --env-file=/run/flannel/options.env \
  --volume=/usr/share/ca-certificates:/etc/ssl/certs:ro \
  --volume=${ETCD_SSL_DIR}:/etc/ssl/etcd:ro \
  quay.io/coreos/flannel:${FLANNEL_VER} /opt/bin/flanneld --ip-masq=true

# Update docker options
ExecStartPost=/usr/bin/docker run --net=host --rm -v /run:/run \
  quay.io/coreos/flannel:${FLANNEL_VER} \
  /opt/bin/mk-docker-opts.sh -d /run/flannel_docker_opts.env -i

# /etc/systemd/system/flanneld.service.d/10-configuration.conf
[Service]
Environment="FLANNELD_INTERFACE=10.50.0.221"
Environment="FLANNELD_ETCD_ENDPOINTS=https://127.0.0.1:2379"
# /etc/systemd/system/flanneld.service.d/50-network-config.conf
[Service]
ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config '{"Network":"10.10.0.0/16","Backend":{"Type":"vxlan"}}'


core@ip-10-50-0-221 ~ $ journalctl -fn -u flanneld.service
-- Logs begin at Thu 2015-07-30 13:53:22 UTC. --
Jul 30 13:54:36 ip-10-50-0-221.eu-west-1.compute.internal systemd[1]: flanneld.service: Unit entered failed state.
Jul 30 13:54:36 ip-10-50-0-221.eu-west-1.compute.internal systemd[1]: flanneld.service: Failed with result 'exit-code'.
Jul 30 13:54:36 ip-10-50-0-221.eu-west-1.compute.internal etcdctl[3843]: Error:  cannot sync with the cluster using endpoints http://127.0.0.1:4001, http://127.0.0.1:2379
Jul 30 13:54:41 ip-10-50-0-221.eu-west-1.compute.internal systemd[1]: flanneld.service: Service hold-off time over, scheduling restart.
Jul 30 13:54:41 ip-10-50-0-221.eu-west-1.compute.internal systemd[1]: Starting Network fabric for containers...
Jul 30 13:54:42 ip-10-50-0-221.eu-west-1.compute.internal systemd[1]: flanneld.service: Control process exited, code=exited status=2
Jul 30 13:54:42 ip-10-50-0-221.eu-west-1.compute.internal systemd[1]: Failed to start Network fabric for containers.
Jul 30 13:54:42 ip-10-50-0-221.eu-west-1.compute.internal systemd[1]: flanneld.service: Unit entered failed state.
Jul 30 13:54:42 ip-10-50-0-221.eu-west-1.compute.internal systemd[1]: flanneld.service: Failed with result 'exit-code'.
Jul 30 13:54:42 ip-10-50-0-221.eu-west-1.compute.internal etcdctl[3887]: Error:  cannot sync with the cluster using endpoints http://127.0.0.1:4001, http://127.0.0.1:2379

My files don't include env-vars for flannel's docker container, so you should add the env-vars to the container runtime. You can also run flannel outside a container; then it will pick up the env-vars directly.
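
One way to picture "adding env-vars to the container runtime" is to turn every FLANNELD_* entry of the service environment into an explicit --env argument for docker run, in the same --env=KEY=VALUE form the stock unit already uses for NOTIFY_SOCKET. A minimal sketch; dockerEnvArgs is a hypothetical helper and the sample values are copied from the drop-in shown earlier in this thread.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// dockerEnvArgs converts KEY=VALUE entries (the format os.Environ returns)
// whose key starts with prefix into "--env=KEY=VALUE" arguments.
// The result is sorted so the generated command line is stable.
func dockerEnvArgs(environ []string, prefix string) []string {
	var args []string
	for _, kv := range environ {
		if strings.HasPrefix(kv, prefix) && strings.Contains(kv, "=") {
			args = append(args, "--env="+kv)
		}
	}
	sort.Strings(args)
	return args
}

func main() {
	// Sample environment; a real caller would pass os.Environ().
	environ := []string{
		"TMPDIR=/var/tmp/",
		"FLANNELD_INTERFACE=10.50.0.221",
		"FLANNELD_ETCD_ENDPOINTS=https://127.0.0.1:2379",
	}
	fmt.Println(strings.Join(dockerEnvArgs(environ, "FLANNELD_"), " "))
}
```

Writing the same KEY=VALUE lines to the file passed via --env-file would be an equivalent route; which one fits better depends on how the unit is managed.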

@mnemotiv gotcha! Thanks, I have it working now. Flannel config and startup on CoreOS seems convoluted and rather complicated. But flannel is still awesome anyway.

@vaijab If you're using CoreOS, specifying etcd-endpoints is easy with cloud-config. See flannel section under https://coreos.com/os/docs/latest/cloud-config.html#coreos

But it's like this:

#cloud-config

coreos:
  flannel:
      etcd_endpoints: http://1.2.3.4:2379

@eyakubovich that's right, but I also had to add an extra drop-in to make this all work. Maybe the documentation needs to be updated, or flanneld.service should add EnvironmentFile=/run/flannel/options.env:

- name: 10-env-config.conf
  content: |
    [Service]
    EnvironmentFile=/run/flannel/options.env

I am not quite sure if this is applicable, but I had the same issue on a CentOS Atomic system. What ultimately resolved it was setting these parameters in /etc/etcd/etcd.conf to the following values:

ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"

After restarting the etcd service, this seemed to fix my problem. Hopefully this is helpful to someone.

I believe this is fixed now and can be closed.
