Terraform: DNS resolution issues when connected to a VPN

Created on 17 Oct 2015 · 33 Comments · Source: hashicorp/terraform

Using 0.6.3:

± tf --version
Terraform v0.6.3

± TF=INFO tf plan
Refreshing Terraform state prior to plan...

openstack_lb_pool_v1.clusters_preprod_pool: Refreshing state... (ID: a7b1aac5-7e24-4010-be81-c9f278729468)
openstack_lb_vip_v1.clusters-preprod: Refreshing state... (ID: f46a09a0-4666-4f98-bd30-2b8ce871f411)

The Terraform execution plan has been generated and is shown below.
Resources are shown in alphabetical order for quick scanning. Green resources
will be created (or destroyed and then created if an existing resource
exists), yellow resources are being changed in-place, and red resources
will be destroyed.

Note: You didn't specify an "-out" parameter to save this plan, so when
"apply" is called, Terraform can't guarantee this is what will execute.

I upgraded to 0.6.4, ran with the same OpenStack env vars, and got:

± tf --version
Terraform v0.6.4

± tf plan
Refreshing Terraform state prior to plan...

Error refreshing state: 1 error(s) occurred:

* Post https://os-identity.vip.foo.com:5443/v2.0/tokens: dial tcp: lookup os-identity.vip.foo.com: no such host

Labels: bug, build, core, v0.10, v0.11, v0.12, v0.7, v0.8, v0.9

All 33 comments

Is https://os-identity.vip.foo.com a resolvable domain? Additionally, is Keystone running on port 5443? It usually runs on port 5000.

edit: err... I see. You simply swapped out versions of Terraform. Can you still confirm that the Keystone URL is resolvable? Is the domain an entry in /etc/hosts (or equivalent) and not necessarily a real domain name?

If Terraform isn't resolving it, that might be a problem with Terraform core and not specifically OpenStack.

Let me know :smile:

I was also reading your description of the issue in #3345. Can you verify the problem either exists or doesn't exist with the latest, unmodified 0.6.6 binaries?

I ran into this issue on a private OpenStack instance. I'm logging in over a VPN on OS X, and while I can hit the various endpoints in the browser (and resolve them correctly via ping and other utilities), Terraform does not seem to resolve the IP addresses of the various endpoints correctly (I think the DNS lookups fail, but it's hard to tell).

If I hardcode the IP addresses of the various OpenStack domain names, I can get it to work by editing my /etc/hosts. Notably Packer does not have this issue in spawning up an instance and building an image. This is on Terraform 0.6.8 through 0.6.10.

@bluk Thank you for the info!

To confirm: When you are logged in over a VPN, you are then using a VPN-specific DNS resolver in order to resolve hosts/domains that are only accessible over the VPN?

@jtopjian Yes, it's a VPN specific DNS resolver.

@bluk OK, thanks. Does the VPN software update your /etc/resolv.conf file so that all DNS requests now go through your VPN? Or are lookups done by some other means?

I noticed there hasn't been activity on this issue in a while, but I am experiencing the same issue and @jtopjian I can confirm that the VPN software does NOT update /etc/resolv.conf (at least not in my case). The VPN software I am using is Sonicwall Mobile Connect and I'm on OS X El Capitan. I understand that the old NetExtender software does update /etc/resolv.conf; however, there are issues with it on El Cap, so we're stuck with the mobile connect client.

@btyler97 Thanks for the information. To confirm: this is only happening when you're connected to the VPN? Are you able to use the OpenStack command line tools while you're connected to the VPN?

@jtopjian Here is a link on how DNS works with Mobile Connect. It might help with diagnosing the issue. https://support.software.dell.com/kb/sw11559

@pryorda Thanks!

At this point, I'm trying to make a confident determination that the issue everyone is seeing is only happening when they are connected to a VPN. If so, then I believe this issue isn't local to just the OpenStack provider, but possibly Terraform core and/or Golang.

I think the main reason this problem is manifesting within the OpenStack provider is that it's one of the few providers within Terraform that communicates with a non-public cloud provider. DNS resolution behavior might differ depending on how the DNS infrastructure that contains the OpenStack endpoint records is configured along with the VPN. The link @pryorda gave seems to support that theory.

@jtopjian I can confirm that this is only an issue when connected via the VPN. After some late night research I'm also of the belief that the issue isn't local to the Openstack provider. The Golang docs on the "net" package hint at a possible cause (https://golang.org/pkg/net/) under the "Name Resolution" heading. I tried setting the ENV variables they suggested, but I'm probably doing something wrong as I didn't notice any change. Unfortunately, I just don't have enough familiarity with Go to know if they aren't applicable in this situation or if I'm missing something.

@jtopjian Here is what we found... The issue is that Go's native net DNS resolver on Mac OS X goes directly to resolv.conf, and our VPN client does not update resolv.conf since it split-tunnels the queries based on DNS suffix. We fixed the issue by building with this command:

export CGO_ENABLED=1; XC_OS="darwin" XC_ARCH="amd64" make bin

A packet capture confirmed that it was traversing the VPN rather than going directly to the servers in resolv.conf.
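To illustrate why suffix-based split DNS matters here, below is a rough conceptual sketch in Go of how a split-tunnel setup chooses a DNS server per query. It is not how macOS or any VPN client is actually implemented. The rule type, the pickServer function, and the server addresses are all hypothetical and used only for illustration. A resolver that reads only resolv.conf effectively always takes the "default" branch.

```go
package main

import (
	"fmt"
	"strings"
)

// rule maps a DNS suffix to the server that should answer for it.
// Both fields are hypothetical illustration values.
type rule struct {
	suffix string
	server string
}

// pickServer returns the server of the longest matching suffix rule,
// or def when no rule matches.
func pickServer(host string, rules []rule, def string) string {
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	best, server := "", def
	for _, r := range rules {
		s := strings.ToLower(strings.Trim(r.suffix, "."))
		matches := host == s || strings.HasSuffix(host, "."+s)
		if matches && len(s) > len(best) {
			best, server = s, r.server
		}
	}
	return server
}

func main() {
	// Assumed setup: the VPN serves vip.foo.com, everything else
	// goes to the server listed in resolv.conf.
	rules := []rule{{suffix: "vip.foo.com", server: "192.0.2.53:53"}}
	def := "198.51.100.1:53"

	for _, h := range []string{"os-identity.vip.foo.com", "example.org"} {
		fmt.Printf("%s -> %s\n", h, pickServer(h, rules, def))
	}
}
```

In this picture, a query for the OpenStack endpoint only reaches the VPN's DNS server if the resolver knows about the suffix rule, which is the part resolv.conf does not carry.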

@pryorda @btyler97 Nice! Thank you for the investigation.

I'm going to label this as a Core bug to get some other eyes on it.

Seems like you need this upstream change to Go's networking for this to work as expected:
https://github.com/golang/go/issues/12524

We tried that, and it doesn't work well with split-horizon DNS.

I had mentioned this in passing in #14781, but want to put it here too for posterity:

Currently we use Go's native cross-compilation support to build the release binaries for all supported platforms, but that approach doesn't give us the OS-specific libraries and headers needed to use CGo on OS X, and thus we aren't able to use the libc resolver. In future we may be able to use xgo to work around this, but we won't have time to do this in the immediate term, unfortunately.

Just going to throw my $0.02 in here in case it helps someone else.

I currently have a Vault installation sitting in AWS in a VPC using a private Route53 Hosted Zone. This means that the zone is not publicly distributed and can only be accessed within the VPC with which it is associated. To access resources in this VPC I have EC2 instances in the VPC that are used as VPN connectors. I'm running OS X and the VPN software does not update /etc/hosts, rather the OS-level DNS hooks which can be inspected via scutil --dns.

When configuring Terraform's Vault provider I get the dial tcp: lookup vault.internal.company.com on 192.168.130.1:53: no such host error. The quick way around this for me was to run route get vault.internal.company.com (again, on OS X) and put that IP into my /etc/hosts file. I may be way off but it _seems_ like if we just let the OS do the resolution (rather than do it explicitly) it should work. But I'm sure it's not that simple.

Not sure what happened, but it looks like this was resolved? I can't replicate this anymore.

Nothing specific has changed within Terraform itself to support this, but we did switch to Go 1.9 for the latest two releases, so possibly there is some new behavior in Go 1.9 that is making this smoother.

I didn't see anything in the release notes specifically about this, but there were some DNS-related changes in the 1.9 timeframe that may have changed the situation here. Versions 0.10.3 and 0.10.4 were built with Go 1.9, while 0.10.2 was built with 1.8. If someone has the time to compare the behavior on 0.10.2 vs. 0.10.4, that could help confirm whether this got resolved by changes in Go 1.9.

I can confirm that this is still a major issue with Terraform 0.10.8. See man 5 resolver on macOS for complete background and this note from /etc/resolv.conf:

# Mac OS X Notice
#
# This file is not used by the host name and address resolution
# or the DNS query routing mechanisms used by most processes on
# this Mac OS X system.

Terraform really must be built so that it will use macOS's native resolver, as /etc/resolv.conf is not sufficient and is documented by Apple as not the supported method for doing DNS resolution. Yes, macOS is UNIXish, but it definitely has its own ways of doing various things that are not UNIXish.

This is a major problem in our environment, as access to our cloud provider is not allowed via their public Internet addresses. Instead, DNS queries for their management systems are answered by non-public DNS servers that hand out different, internal addresses that take our traffic over a private connection with the cloud provider. DNS queries for these domains will only get sent to the correct DNS servers when the macOS-native resolver is used. The DNS servers in /etc/resolv.conf are just plain-jane DNS servers that know nothing of the special addresses. As a result, Terraform on macOS is completely unusable for us.

Please enable the cgo netdns support so that the macOS-native resolver will be used.

I have been able to work around this issue by rebuilding the aws and nomad providers (my use case requires them) as described in https://github.com/terraform-providers/terraform-provider-aws/issues/1392

This is a huge issue for us. We use OpenDNS, which rewrites /etc/resolv.conf to point to localhost for the Umbrella client, and this breaks Terraform. The workarounds are all painful to work with.

Noticing this when DNS resolves now. Do the providers get built with the same options as the terraform bin?

This is still a bug in Terraform v0.11.3 with the Consul and RabbitMQ providers.

* consul_keys.press_release_crawler_properties: 1 error(s) occurred:

* consul_keys.press_release_crawler_properties: consul_keys.press_release_crawler_properties: Failed to read Consul key 'config/application/data': Get http://consul..../v1/kv/config/application/data?dc=fhaid-dc: dial tcp: lookup consul....internal on 10.84.1.41:53: no such host

Meanwhile curl works fine:

 curl http://consul..../v1/kv/config/application/data?dc=fhaid-dc
[{"LockIndex":0,"Key":"config/application/data","Flags":0,"Value":"....

This will continue to be a problem for any Terraform binary (in fact, any Go program) which does not include cgo. Usually running with GODEBUG=netdns=9 in the environment will output something like:

go package net: built with netgo build tag; using Go's DNS resolver

This doesn't seem to work with Terraform, perhaps because it's the provider binaries doing the name resolution, and GODEBUG is not passed through to them?
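For a check that does not depend on Terraform or its provider binaries, a small standalone Go program can run the same lookup through both code paths. This is only a sketch: newResolver is a hypothetical helper, and the default host name is a placeholder. Note that Resolver.PreferGo only has an effect when the binary was built with cgo support; in a binary built without cgo, both lines are expected to use Go's resolver.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

// newResolver returns a resolver that either prefers Go's built-in DNS
// client (preferGo=true) or leaves the choice to the net package, which
// may use the system resolver when the binary was built with cgo.
func newResolver(preferGo bool) *net.Resolver {
	return &net.Resolver{PreferGo: preferGo}
}

func main() {
	host := "example.com" // placeholder; pass a VPN-only name as the first argument
	if len(os.Args) > 1 {
		host = os.Args[1]
	}

	for _, preferGo := range []bool{true, false} {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		addrs, err := newResolver(preferGo).LookupHost(ctx, host)
		cancel()
		fmt.Printf("PreferGo=%v addrs=%v err=%v\n", preferGo, addrs, err)
	}
}
```

Building this once with CGO_ENABLED=1 and once with CGO_ENABLED=0, then running it with GODEBUG=netdns=1 while connected to the VPN, should show whether the two resolvers disagree for a given name.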

Another way is to check with what libraries the binaries are linked. For example:

$ otool -L .terraform/plugins/darwin_amd64/terraform-provider-aws_v1.13.0_x4 
.terraform/plugins/darwin_amd64/terraform-provider-aws_v1.13.0_x4:

Nothing is listed, meaning this binary isn't linked with anything. In particular, it's not linked with libc, where the resolver is implemented, so it cannot use the macOS resolver.

Building Terraform from source will "fix" it. Though since most people install binary releases, ideally the release process would produce a cgo resolver. I haven't tried it, but apparently it's possible to cross-compile while using a cgo enabled net module without much difficulty.

Note this issue isn't limited to macOS, either. The net package will fall back to the cgo resolver under a number of conditions on non-macOS platforms where it can detect the native go resolver's behavior isn't compatible with expected semantics.

Thanks for sharing that link to "gonative", @bitglue!

If I'm understanding correctly, it seems like that works because the binary distributions of Go for other platforms ship _already-compiled_ package library files (.a files) that include the C library bindings, and so they can just be linked into the final executable without requiring access to the target system's C library headers, toolchain, etc.

If so, that seems like a nice way to get around the requirement of having the OS X SDK available at build time. The Terraform Core team at HashiCorp is currently focused on the configuration language improvements for the next major release, but I'll make a note to investigate this further and see what it'd take to weave this into our build process for a later release.

Hi.
I have faced the same problem, have investigated it, and want to share the results with anyone looking for an answer.
@bitglue is right that this is actually not a problem with Terraform. The main issue is in Go itself.

Sometimes I got error:

13:22 [master] n.shalnov:~/cloudflare/tf-fff.ru$ terraform plan

Error: Error loading state: Failed to open state file at gs://terraform-cf/cloudflare/semrush.ru/default.tfstate: Get https://storage.googleapis.com/terraform-cf/cloudflare/semrush.ru/default.tfstate: dial tcp: lookup storage.googleapis.com on 192.168.1.1:53: read udp 192.168.3.66:33635->192.168.1.1:53: i/o timeout

192.168.1.1 is my office DNS server (actually a Mikrotik router). Flushing the DNS cache helps resolve this problem, but if someone in the office makes a request to the root name servers (e.g. dig something.com +trace), the Mikrotik will cache that answer and then include all the root servers and so on in its answer to every DNS request. You can see it in a tcpdump capture on port 53.

Moreover, it answers over UDP with packets longer than 520 bytes. That is not RFC compliant: the Mikrotik must answer over TCP if a response is too large. As a result, the Go library used to resolve names cannot handle this response correctly.
Switching to the cgo resolver lets Go handle such responses.

So, if you're facing the same issue with terraform, you can:

  • change your DNS server in /etc/resolv.conf
  • flush DNS cache on your DNS server
  • compile terraform with cgo (?)

For more info see:
https://github.com/golang/go/issues/21160
https://golang.org/pkg/net/

I think we all agree that this is a Go problem, but since HashiCorp is giving us compiled binaries to use on a Mac, it seems that they should be compiling them in a way that works when people are VPNed into their (likely corporate) environments. The days of being directly attached to your production environments are going bye-bye as we all shift to cloud providers (and hence use more complicated DNS resolver chains).

5 more days to the 3rd anniversary of this ticket -- @mitchellh can we get some love on this already? You guys are enterprise-ready post 1.0 now ;) Let's put the last nail in this coffin already...

It looks like golang/go#12524 is moving again... so maybe there is hope?

Looks like the Go ticket isn't heading anywhere. We've suffered this same problem on Vault as well. Wouldn't it be possible to just build a CGO-enabled binary using the Travis macOS environment? https://docs.travis-ci.com/user/reference/osx/

I understand that there are benefits to Go's resolver.
However, I am missing the technical reason why Terraform does not switch to cgo for its macOS binaries to satisfy the users who are impacted by the behavior described in this GH issue.
Is this driven purely by licensing concerns?

Adding some more fuel to this fire. On macOS Mojave (10.14.6) with _no VPN installed_ I am getting this behaviour attempting to perform a stock terraform init with only the AWS provider in the main.tf file. The /etc/resolv.conf file that the golang network stack expects to exist, does not exist. Other go programs seem fine, and I can curl the well known path fine. curl -k https://registry.terraform.io/.well-known/terraform.json prints out {"modules.v1":"/v1/modules/","providers.v1":"/v1/providers/"}. So it's not a network stack issue.

With these debug options...

export TF_LOG=TRACE
export GODEBUG=netdns=cgo+1

... in the shell environment, the terraform init output logs are as follows:

2020/04/06 15:38:04 [INFO] Terraform version: 0.12.24  
2020/04/06 15:38:04 [INFO] Go runtime version: go1.13.8
2020/04/06 15:38:04 [INFO] CLI args: []string{"/usr/local/bin/terraform", "init"}
2020/04/06 15:38:04 [DEBUG] Attempting to open CLI config file: /Users/sam/.terraformrc
2020/04/06 15:38:04 [DEBUG] File doesn't exist, but doesn't need to. Ignoring.
2020/04/06 15:38:04 [INFO] CLI command args: []string{"init"}
2020/04/06 15:38:04 [TRACE] Meta.Backend: no config given or present on disk, so returning nil config
2020/04/06 15:38:04 [TRACE] Meta.Backend: backend has not previously been initialized in this working directory
2020/04/06 15:38:04 [DEBUG] New state was assigned lineage "41f50108-c09f-fdd9-5be7-05053b1380b3"
2020/04/06 15:38:04 [TRACE] Meta.Backend: using default local state only (no backend configuration, and no existing initialized backend)
2020/04/06 15:38:04 [TRACE] Meta.Backend: instantiated backend of type <nil>
2020/04/06 15:38:04 [DEBUG] checking for provider in "."
go package net: built with netgo build tag; using Go's DNS resolver

Initializing the backend...
2020/04/06 15:38:04 [ERR] Checkpoint error: Get https://checkpoint-api.hashicorp.com/v1/check/terraform?arch=amd64&os=darwin&signature=fbac2f67-67bf-ee92-f08d-ab394075ba45&version=0.12.24: dial tcp: lookup checkpoint-api.hashicorp.com on [::1]:53: read udp [::1]:61518->[::1]:53: read: connection refused
2020/04/06 15:38:04 [DEBUG] checking for provider in "/usr/local/bin"
2020/04/06 15:38:04 [DEBUG] checking for provisioner in "."
2020/04/06 15:38:04 [DEBUG] checking for provisioner in "/usr/local/bin"
2020/04/06 15:38:04 [INFO] Failed to read plugin lock file .terraform/plugins/darwin_amd64/lock.json: open .terraform/plugins/darwin_amd64/lock.json: no such file or directory
2020/04/06 15:38:04 [TRACE] Meta.Backend: backend <nil> does not support operations, so wrapping it in a local backend
2020/04/06 15:38:04 [TRACE] backend/local: state manager for workspace "default" will:
 - read initial snapshot from terraform.tfstate
 - write new snapshots to terraform.tfstate
 - create any backup at terraform.tfstate.backup
2020/04/06 15:38:04 [TRACE] statemgr.Filesystem: reading initial snapshot from terraform.tfstate
2020/04/06 15:38:04 [TRACE] statemgr.Filesystem: snapshot file has nil snapshot, but that's okay
2020/04/06 15:38:04 [TRACE] statemgr.Filesystem: read nil snapshot
2020/04/06 15:38:04 [DEBUG] checking for provider in "."
2020/04/06 15:38:04 [DEBUG] checking for provider in "/usr/local/bin"
2020/04/06 15:38:04 [DEBUG] plugin requirements: "aws"=""
2020/04/06 15:38:04 [DEBUG] Service discovery for registry.terraform.io at https://registry.terraform.io/.well-known/terraform.json
2020/04/06 15:38:04 [TRACE] HTTP client GET request to https://registry.terraform.io/.well-known/terraform.json

Initializing provider plugins...
- Checking for available provider plugins...

2020/04/06 15:38:04 [DEBUG] Failed to request discovery document: Get https://registry.terraform.io/.well-known/terraform.json: dial tcp: lookup registry.terraform.io on [::1]:53: read udp [::1]:54136->[::1]:53: read: connection refused
Registry service unreachable.

This issue should probably be renamed, as it's quite clear by now that this is not a VPN-related issue. This is as simple as _macOS + ( go without cgo ) = DNS issues in some cases_.

The linked upstream issues in the core Go project tracker do not indicate this is a priority for them, and I don't really blame them, as it appears to be relatively easy for affected projects to work around by using cgo as part of their macOS builds. Unfortunately (because an upstream fix would be the best outcome), it looks like it will be necessary for HashiCorp to make this workaround part of their build process somehow if we want this fix to happen in a timely manner.
