Grpc-go: DNS resolution does not work on Docker

Created on 27 Apr 2020  ·  20Comments  ·  Source: grpc/grpc-go

What version of gRPC are you using?

v1.28.0

What version of Go are you using (go version)?

1.14

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

When trying to use chirpstack on Docker DNS resolution does not seem to work.

https://github.com/brocaar/chirpstack-docker

What did you expect to see?

chirpstack-application-server and chirpstack-network-server connect and talk to each other.

What did you see instead?

DNS name resolution failures.

chirpstack-network-server_1 | time="2020-04-27T05:06:18Z" level=warning msg="creating insecure application-server client" server="chirpstack-application-server:8001"
chirpstack-network-server_1 | time="2020-04-27T05:06:23Z" level=warning msg="ccResolverWrapper: reporting error to cc: dns: A record lookup error: lookup chirpstack-application-server on 127.0.0.11:53: dial udp 127.0.0.11:53: operation was canceled"
chirpstack-network-server_1 | time="2020-04-27T05:06:23Z" level=error msg="gateway: handle gateway stats error" ctx_id=68623603-d2d2-4f2c-a6e4-34f56eae1dad error="get application-server client error: get application-server client error: create application-server api client error: dial application-server api error: context deadline exceeded"

Chirpstack-docker sets up a couple of containers that talk to each other over the network. When run the chirpstack apps connect to the redis, postgresql and mosquitto servers just fine using their DNS names, but the 2 chirpstack apps cannout talk to each other using gRPC due to DNS resolution failures. I verified the DNS names resolve in each container using nslookup on the container's cli. I also verified that the apps work correctly when using the IP addresses rather than the DNS names.

P2 Bug

Most helpful comment

This is still an issue here, hopefully the original poster has time to work out some repro code. 👍

All 20 comments

Seeing the same issues

The dns resolver failed with: dns: A record lookup error: lookup chirpstack-application-server on 127.0.0.11:53: dial udp 127.0.0.11:53: operation was canceled"

This error doesn't show much. Do you have the dns server side logs?

gRPC doesn't implement its dns client, it uses the official package. Can you also try to net.Dial() the dns name and see if that works?

I placed a call to net.Dial() right before the grpc.DialContext() that is throwing this error, and it worked. net.Dial() was able resolve the hostname passed to it but grpc.DialContext() failed.

I'm going to see if I can work up some sample code this weekend that reproduces this issue.

Is there anyway to enable additional logging from grpc or the underlying net library to further troubleshoot this issue?

gRPC logs can be turned on by setting env variables: https://github.com/grpc/grpc-go#how-to-turn-on-logging

DNS logs can be turned on by setting GODEBUG=netdns=1 (https://golang.org/pkg/net/#hdr-Name_Resolution, this I personally haven't tried)

This issue is labeled as requiring an update from the reporter, and no update has been received after 7 days. If no update is provided in the next 7 days, this issue will be automatically closed.

This is still an issue here, hopefully the original poster has time to work out some repro code. 👍

@blocknonip if you are experiencing the same issue and you can provide some information, it might help resolve this faster.

any progress?

I've done some more investigating and this seems even stranger now.

I have some more logging:

chirpstack-application-server_1  | time="2020-05-01T16:44:30Z" level=warning msg="creating insecure network-server client" server="chirpstack-network-server:8000"
chirpstack-application-server_1  | time="2020-05-01T16:44:30Z" level=debug msg="parsed scheme: \"\""
chirpstack-application-server_1  | time="2020-05-01T16:44:30Z" level=debug msg="scheme \"\" not registered, fallback to default scheme"
chirpstack-application-server_1  | time="2020-05-01T16:44:31Z" level=debug msg="Channel Connectivity change to SHUTDOWN"
chirpstack-application-server_1  | time="2020-05-01T16:44:35Z" level=debug msg="dns: SRV record lookup error: lookup _grpclb._tcp.chirpstack-network-server on 127.0.0.11:53: dial udp 127.0.0.11:53: operation was canceled"
chirpstack-application-server_1  | time="2020-05-01T16:44:35Z" level=debug msg="dns: A record lookup error: lookup chirpstack-network-server on 127.0.0.11:53: dial udp 127.0.0.11:53: operation was canceled"
chirpstack-application-server_1  | time="2020-05-01T16:44:35Z" level=warning msg="ccResolverWrapper: reporting error to cc: dns: A record lookup error: lookup chirpstack-network-server on 127.0.0.11:53: dial udp 127.0.0.11:53: operation was canceled"
chirpstack-application-server_1  | time="2020-05-01T16:44:35Z" level=error msg="finished unary call with code Unknown" ctx_id=f420a2bb-6f03-48a8-9ac9-36849f82fb45 error="rpc error: code = Unknown desc = context deadline exceeded" grpc.code=Unknown grpc.method=Create grpc.service=api.NetworkServerService grpc.start_time="2020-05-01T16:44:30Z" grpc.time_ms=5004.83 peer.address="127.0.0.1:52666" span.kind=server system=grpc

At this point I need to provide a little more background to explain the weird part I discovered.
I have Chirpstack setup to create 4 services each running in their own container (redis, postgresql, chirpstack-network-server, and chirpstack-application-server). These containers communicate over a virtual network that does not have access to the internet. I also have chirpstack-network-server and chirpstack-application-server connected to another virtual network to allow my nginx proxy to serve the frontend for chirpstack and for access to my mqtt container. I want to reiterate that neither of these two internal virtual networks have access to the internet.

Here is the interesting part. If I include any network that has internet access to the container then the grpc.DialContext() call resolves the IP and connects right away. I did some experiments and it still seems to me to be a bug in either gprc (likely) or go itself (less likely). I figure this as if it was a bug in go then other libraries would be affected like mqtt, redis, and postgresql. But since those all work with or without internet access, I think this is something deep inside of grpc.

One experiment I did was to start up the container, try adding the network-server address to the application server and it failed as before. I then added and removed a network with internet access without trying to resolve the address. And after that I tried adding the network-server again and it continued to fail. This shows that it's not simply the fact of adding the internet connected network kick starts something in docker. It shows that there is something that is failing because there is not an internet connection.

Another experiment I tried was increasing the timeout set for the grpc.DialContext() call from chirpstack. This was set at 500ms and I changed it to 60s, however, it appeares that the call still fails at about the same time. So based on the so based on the message "operation was canceled" I'm assuming there is another timeout somewhere I didn't see when looking at the grpc resolver code.

I'm still trying to setup some example code to repro this, but I'm just a little time limited right now.

I guess the interim workaround for this would be to setup the docker containers with an internet connected virtual network rather than an isolated one. Or, use an isolated virtual network with static IPs assigned to the container and use the IPs in the software.

I can confirm that the workaround @evil-dog suggested by setting static IPs for the containers works.

For those using docker-compose, here is a useful guide that helps if you're not already familiar:
https://blog.alejandrocelaya.com/2017/04/21/set-specific-ip-addresses-to-docker-containers-created-with-docker-compose/

chirpstack-application-server_1  | time="2020-05-01T16:44:30Z" level=debug msg="parsed scheme: \"\""
chirpstack-application-server_1  | time="2020-05-01T16:44:30Z" level=debug msg="scheme \"\" not registered, fallback to default scheme"

chirpstack-application-server_1  | time="2020-05-01T16:44:31Z" level=debug msg="Channel Connectivity change to SHUTDOWN"

chirpstack-application-server_1  | time="2020-05-01T16:44:35Z" level=debug msg="dns: SRV record lookup error: lookup _grpclb._tcp.chirpstack-network-server on 127.0.0.11:53: dial udp 127.0.0.11:53: operation was canceled"

Among the logs, the line "Channel Connectivity change to SHUTDOWN" looks very weird. Channel (the ClientConn) switching to SHUTDOWN only happens when the ClientConn is closed: https://github.com/grpc/grpc-go/blob/a6ab4473c5a469332c1bdee691293affeaaece25/clientconn.go#L992

This happened 1 second after the scheme was parsed, 4 seconds before the SRV lookup failure. This seems like the ClientConn is closed almost immediately after it's created.
Can anyone provide more details of how the ClientConn is created and when it's closed?

This is the line of code that is called to create the grpc connection in chirpstack.

All DNS errors are "operation canceled", instead of not found or rejected or timeout. So it's very likely that the ClientConn is actively closed.

Next to the DNS error, the client received another __RPC error__ from another server. Since the DNS error doesn't have much information, I looked a bit at the RPC errors.

  • in the original report, the error is chirpstack-network-server_1 | time="2020-04-27T05:06:23Z" level=error msg="gateway: handle gateway stats error" ctx_id=68623603-d2d2-4f2c-a6e4-34f56eae1dad error="get application-server client error: get application-server client error: create application-server api client error: dial application-server api error: context deadline exceeded"
  • in the reply, the error is "rpc error: code = Unknown desc = context deadline exceeded". (Which is weird itself, because code should be Canceled. But this isn't as helpful as the other one.)

The error is printed by chirpstack-application-server, but happens in server chirpstack-network-server, part of stats. This failure line makes a blocking grpc.DialContext(). This connection cannot be created for some reason, thus the grpc.DialContext() timeout.

Any clue what the stats service is? Could it be something that needs internet?

Oops, I didn't realize that in the original post I put the logs from the network server.

The network-server and application-server basically work as a pair where the network-server acts as the backend and the application-server works as the frontend. There can be multiple network-servers feeding into the application-server. They each open grpc services on the other for communications.

I have been approaching this from the application-server to network-server side as that is the user interactive portion I can easily test. The general setup is to open a webpage on the application-server and input the DNS name of the network-server to talk to. This is where I first encountered the issue and where I have been focusing my troubleshooting as it appears to me that the application server never actually connects and talks with the network-server. I think in my original post that I accidentally copied the network-server error log from when I was playing with my workaround using the IP Addresses. When I was doing the testing with IP addresses I put the IP address in the webpage, but the application-server told the network server to talk back to it using the DNS name and that's why it got the error. Once I changed the config of the application-server to use it's IP Address rather than DNS the network-server was able to connect back.

As for that Stats.go, it appears to be sending the stats that it's gathering from the LoRA gateways to the application-server using grpc. So this seems to be that same issue just in the reverse (e.g. network-server talking to the application-server) from what I have been talking about mostly (e.g. application-server talking to the network-server).

I didn't have other findings other than the stats RPC failure.

With the current information, it's still not clear why (or how) the DNS stops working. We will need more details.

Another thing could be, run the helleworld example between the containers. That should generate easier to follow logs, and show the problem in a simpler setup.

I've pulled together some sample code based on the helloworld example that reproduces some of this issue. I've uploaded it to github here:

https://github.com/evil-dog/grpc-go-issue-3572-repro

While pulling this together I was only able to repro part of the issue. It appears that with the sample code in the repo I linked it is able to resolve DNS, but it is not able to connect to the server. It has a similar deadline exceeded error like the original. And from @menghanl description of the client entering the "SHUTDOWN" state, that is happening in the sample code too.

I have included a README in the describing how to execute the sample code in docker. This is a fork of the grpc-go repo and I removed all the non-relevant files. You can view the commit history to see what I changed from the original.

In the modified client I noted where in the chirpstack-application-server source I copied the lines from. This should help anyone that is trying to cross reference the code.

This issue is labeled as requiring an update from the reporter, and no update has been received after 7 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@evil-dog unfortunately, we don't have much experience with docker, and so I don't think we can help based on the example you've provided. So far we're unable to isolate any specific grpc bug. If you are able to investigate further to find the root cause, it would be helpful. Otherwise, you might want to try another forum for help, like stackoverflow. Thank you!

I experience the same problems. Is there a solution already?

same here.
seems to be related to: https://github.com/gliderlabs/docker-alpine/issues/539
a apline/busybox problem.

Was this page helpful?
0 / 5 - 0 ratings