Consul: Azure: consul cluster start fails during auto discovery

Created on 26 Jun 2017 · 15 Comments · Source: hashicorp/consul

consul version for both Client and Server

Client: v0.8.4
Server: v0.8.4

Operating system and Environment details

ubuntu 16.04

Description of the Issue (and unexpected/desired result)

Consul panics shortly after startup, while trying to join the cluster.
This is how I start it:

./consul agent -server -bind 0.0.0.0 -client 0.0.0.0 -bootstrap-expect 3 -raft-protocol 3 -ui -retry-join-azure-tag-name purpose -retry-join-azure-tag-value consulcluster -config-file /opt/consul/consul.json

this is my config:

{
  "log_level": "TRACE",
  "data_dir": "/opt/consul/data",
  "server": true,
  "retry_join_azure": {
    "tenant_id": "xxx",
    "subscription_id": "xxx",
    "client_id": "xxx",
    "secret_access_key": "xxx"
  }
}

and this is what happens:

==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting Consul agent...
==> Consul agent running!
           Version: 'v0.8.4'
           Node ID: '5babf506-db72-24ca-5600-d572f3b30c46'
         Node name: 'vm2'
        Datacenter: 'dc1'
            Server: true (bootstrap: false)
       Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600)
      Cluster Addr: 10.0.0.5 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2017/06/26 13:33:18 [INFO] raft: Initial configuration (index=0): []
    2017/06/26 13:33:18 [INFO] raft: Node at 10.0.0.5:8300 [Follower] entering Follower state (Leader: "")
    2017/06/26 13:33:18 [INFO] serf: EventMemberJoin: vm2 10.0.0.5
    2017/06/26 13:33:18 [INFO] serf: EventMemberJoin: vm2.dc1 10.0.0.5
    2017/06/26 13:33:18 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
    2017/06/26 13:33:18 [WARN] serf: Failed to re-join any previously known node
    2017/06/26 13:33:18 [INFO] consul: Adding LAN server vm2 (Addr: tcp/10.0.0.5:8300) (DC: dc1)
    2017/06/26 13:33:18 [WARN] serf: Failed to re-join any previously known node
    2017/06/26 13:33:18 [INFO] consul: Handled member-join event for server "vm2.dc1" in area "wan"
    2017/06/26 13:33:18 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
    2017/06/26 13:33:18 [INFO] agent: Started HTTP server on [::]:8500
    2017/06/26 13:33:18 [INFO] agent: Joining cluster...
    2017/06/26 13:33:18 Sending GET https://management.azure.com/subscriptions/b6f483f3-1f96-4d11-9087-18afcd9fd22a/providers/Microsoft.Network/networkInterfaces?api-version=2016-09-01
    2017/06/26 13:33:20 GET https://management.azure.com/subscriptions/b6f483f3-1f96-4d11-9087-18afcd9fd22a/providers/Microsoft.Network/networkInterfaces?api-version=2016-09-01 received 200 OK
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe822df]

goroutine 71 [running]:
github.com/hashicorp/consul/command/agent.(*Config).discoverAzureHosts(0xc420220900, 0xc420224780, 0x20, 0x0, 0x0, 0x0, 0x2b4cc720004c0358)
    /gopath/src/github.com/hashicorp/consul/command/agent/config_azure.go:44 +0x42f
github.com/hashicorp/consul/command/agent.(*Agent).retryJoin(0xc4201c66c0)
    /gopath/src/github.com/hashicorp/consul/command/agent/retry_join.go:40 +0x748
created by github.com/hashicorp/consul/command/agent.(*Agent).Start
    /gopath/src/github.com/hashicorp/consul/command/agent/agent.go:315 +0x9b2

I'm getting the same error if I compile consul from source btw...

Reproduction steps

Install Consul on Azure VMs, add the right credentials to the config file, and run the command above to start it.

Thanks,
Brande


All 15 comments

I'll have a look.

@brande I was able to create the Azure VM and get some of the credentials but I could use some help. How do I get the client id and secret access key?

I found out that the tag used by Consul has to be set on the network interface resource, not on the VM. However, it would be great if Consul didn't crash when a tag is unset. It should ignore any resource without the tag instead of crashing as it does right now. Thoughts?

@magiconair Thanks for looking into it. Here is a nice explanation about how to get all these together: https://www.terraform.io/docs/providers/azurerm/

@brande yes, consul shouldn't crash :) I'll have another look

I'm having the same issue as @brande. This happens on both Windows and Linux servers. I had already set the tags on the network interfaces, and when I test the account used, it retrieves them correctly.

However, Consul keeps failing to start. Below you can find the config and the error:

consul agent -config-dir=C:\HashiCorp\Consul\consul.conf
{
    "datacenter": "datacenter",
    "log_level": "INFO",
    "server": true,
    "ui": true,
    "data_dir": "C:\\HashiCorp\\Consul\\consul.data",
    "bind_addr": "0.0.0.0",
    "client_addr": "0.0.0.0",
    "ports": {
        "https": 8501
    },
    "key_file": "C:\\HashiCorp\\Consul\\consul.cert\\server.key",
    "cert_file": "C:\\HashiCorp\\Consul\\consul.cert\\server.crt",
    "ca_file": "C:\\HashiCorp\\Consul\\consul.cert\\ca_bundle.crt",
    "protocol": 3,
    "retry_join_azure": {
        "tag_name": "service",
        "tag_value": "consulCluster",
        "subscription_id": "xxx",
        "tenant_id": "yyy",
        "client_id": "zzz",
        "secret_access_key": "password"
    }
}

The output error:

==> Starting Consul agent...
==> Consul agent running!
           Version: 'v0.8.5'
           Node ID: 'e520793c-9dca-8bd2-6d6a-f086921f248a'
         Node name: 'vm03'
        Datacenter: 'datacenter'
            Server: true (bootstrap: false)
       Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: 8501, DNS: 8600)
      Cluster Addr: 10.8.2.5 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2017/07/06 16:25:18 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.8.2.5:8300 Address:10.8.2.5:8300}]
    2017/07/06 16:25:18 [INFO] raft: Node at 10.8.2.5:8300 [Follower] entering Follower state (Leader: "")
    2017/07/06 16:25:18 [INFO] serf: EventMemberJoin: vm03 10.8.2.5
    2017/07/06 16:25:18 [WARN] serf: Failed to re-join any previously known node
    2017/07/06 16:25:18 [INFO] consul: Adding LAN server vm03 (Addr: tcp/10.8.2.5:8300) (DC: datacenter)
    2017/07/06 16:25:18 [INFO] serf: EventMemberJoin: vm03.datacenter 10.8.2.5
    2017/07/06 16:25:18 [WARN] serf: Failed to re-join any previously known node
    2017/07/06 16:25:18 [INFO] consul: Handled member-join event for server "vm03.datacenter" in area "wan"
    2017/07/06 16:25:18 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
    2017/07/06 16:25:18 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
    2017/07/06 16:25:18 [INFO] agent: Started HTTP server on [::]:8500
    2017/07/06 16:25:18 [INFO] agent: Started HTTPS server on [::]:8501
    2017/07/06 16:25:18 [INFO] agent: Joining cluster...
    2017/07/06 16:25:19 Sending GET https://management.azure.com/subscriptions/xxx/providers/Microsoft.Network/networkInterfaces?api-version=2016-09-01
    2017/07/06 16:25:19 GET https://management.azure.com/subscriptions/xxx/providers/Microsoft.Network/networkInterfaces?api-version=2016-09-01 received 200 OK
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0xeb1f56]

goroutine 38 [running]:
github.com/hashicorp/consul/agent.(*Config).discoverAzureHosts(0xc04224e000, 0xc0421d4d70, 0x20, 0x0, 0x0, 0x0, 0x327938d0004d5d0f)
        /gopath/src/github.com/hashicorp/consul/agent/config_azure.go:44 +0x436
github.com/hashicorp/consul/agent.(*Agent).retryJoin(0xc042212a00)
        /gopath/src/github.com/hashicorp/consul/agent/retry_join.go:40 +0x74f
created by github.com/hashicorp/consul/agent.(*Agent).Start
        /gopath/src/github.com/hashicorp/consul/agent/agent.go:328 +0x9fc

I'm having the same issue as @brande and @draggeta. The NIC tag is set.

Same issue

I've moved the code to the new https://github.com/hashicorp/go-discover repo, which I'll merge back into Consul once it is tested. There is also a command line client that you can use. I'm waiting for a colleague to recover from jet lag to help me, but if someone wants to verify that the command line client does not crash on Azure, that would already help.

We found the issue here. It reproduces when you have network interfaces that have tags but don't have the tag configured in Consul. The hot fix on our end was to add the tag to all the NICs in Azure. Consider using different tag values for different Consul clusters within the same subscription. By the way, are we going to have breaking changes in 0.9.0?

@sdluxeon thanks for the info. I'll have a look at how to make that more robust.

Re breaking changes: some smaller ones. Please keep an eye on the Changelog.

@magiconair: I can confirm @sdluxeon's findings. With both the tool and Consul, the error occurs only when not all NICs have the tag name set. The value doesn't matter in this case.
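
The failure mode described above can be illustrated with a minimal Go sketch. It assumes the discovery code dereferenced the tag value straight out of the map without checking it, which the thread implies but does not show. The `nic` type and the helpers `unsafeTagValue`, `panics`, and `strPtr` are hypothetical names for illustration and are not taken from Consul or the Azure SDK.

```go
package main

import "fmt"

// nic is a hypothetical stand-in for an Azure network interface record.
// Only the shape of Tags (*map[string]*string) comes from the thread.
type nic struct {
	Name string
	Tags *map[string]*string
}

// unsafeTagValue dereferences the tag value without any checks.
// Looking up a missing key returns a nil *string, and dereferencing
// that nil pointer panics.
func unsafeTagValue(n nic, name string) string {
	return *(*n.Tags)[name]
}

// panics reports whether calling f triggers a panic.
func panics(f func()) (p bool) {
	defer func() {
		if r := recover(); r != nil {
			p = true
		}
	}()
	f()
	return false
}

// strPtr returns a pointer to a copy of s.
func strPtr(s string) *string { return &s }

func main() {
	tagged := nic{Name: "nic-a", Tags: &map[string]*string{"purpose": strPtr("consulcluster")}}
	other := nic{Name: "nic-b", Tags: &map[string]*string{"env": strPtr("test")}}

	// The NIC carrying the configured tag name is read without trouble.
	fmt.Println(panics(func() { unsafeTagValue(tagged, "purpose") })) // false
	// A NIC with other tags but not the configured one triggers the panic.
	fmt.Println(panics(func() { unsafeTagValue(other, "purpose") })) // true
}
```

This matches the observation that the tag value is irrelevant: the panic depends only on whether the tag name is present on every NIC that gets inspected.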

Got it. The Azure API data structures are somewhat unusual. You don't see *map[string]*string often used in Go. I've pushed fixes for both go-discover and consul.

@draggeta I'd appreciate it if you could test the go-discover code one more time.
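
A nil-safe lookup over a `*map[string]*string` could look roughly like the sketch below. This is not the actual go-discover or Consul patch, which the thread does not show. The `nic` type, its fields, and the functions `tagValue` and `matchingAddrs` are hypothetical and serve only to illustrate the three cases that need guarding: a nil map pointer, a missing key, and a nil value.

```go
package main

import "fmt"

// nic is a hypothetical stand-in for an Azure network interface record.
type nic struct {
	Name      string
	PrivateIP string
	Tags      *map[string]*string
}

// tagValue returns the value of the named tag and whether it was present.
// It guards against a nil map pointer, a missing key, and a nil value.
func tagValue(tags *map[string]*string, name string) (string, bool) {
	if tags == nil {
		return "", false
	}
	v, ok := (*tags)[name]
	if !ok || v == nil {
		return "", false
	}
	return *v, true
}

// matchingAddrs returns the private IPs of all NICs whose tag `name`
// equals `value`. NICs without the tag are skipped rather than crashing.
func matchingAddrs(nics []nic, name, value string) []string {
	var addrs []string
	for _, n := range nics {
		if v, ok := tagValue(n.Tags, name); ok && v == value {
			addrs = append(addrs, n.PrivateIP)
		}
	}
	return addrs
}

// strPtr returns a pointer to a copy of s.
func strPtr(s string) *string { return &s }

func main() {
	nics := []nic{
		{Name: "vm1-nic", PrivateIP: "10.0.0.4", Tags: &map[string]*string{"purpose": strPtr("consulcluster")}},
		{Name: "vm2-nic", PrivateIP: "10.0.0.5", Tags: &map[string]*string{"env": strPtr("test")}},
		{Name: "vm3-nic", PrivateIP: "10.0.0.6", Tags: nil},
		{Name: "vm4-nic", PrivateIP: "10.0.0.7", Tags: &map[string]*string{"purpose": nil}},
	}
	fmt.Println(matchingAddrs(nics, "purpose", "consulcluster")) // [10.0.0.4]
}
```

Returning a `(string, bool)` pair follows the usual Go "comma ok" idiom, so callers can tell an absent tag apart from an empty tag value.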

@magiconair That indeed fixed it. Thanks for the fix. Now to wait for the next release :)

We're planning one next week. So the consul version of the fix should go in there.
