Virtual-environments: Ubuntu 18.04 unable to resolve cognitiveservices DNS names

Created on 28 Apr 2020 · 26 comments · Source: actions/virtual-environments

Describe the bug
Ubuntu 18.04 is unable to resolve *.cognitiveservices.azure.com DNS names by default. As a workaround, we are bypassing the local (stub) DNS server using the following command:

sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf

This may be related to https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1822416 and/or https://github.com/actions/virtual-environments/issues/397.
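A minimal sketch of how the stub check could be automated before applying the workaround (the helper name and the parameterized path are mine, for illustration; the symlink command is the workaround from above):

```shell
#!/usr/bin/env bash
# Returns success when the given resolv.conf routes queries through the
# systemd-resolved stub listener at 127.0.0.53 (the Ubuntu 18.04 default).
uses_stub_resolver() {
    grep -q '^nameserver 127\.0\.0\.53' "$1"
}

# Workaround from above, applied only when the stub is active (needs root):
# uses_stub_resolver /etc/resolv.conf &&
#     sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
```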

Area for Triage:
Servers

Question, Bug, or Feature?:
Bug

Virtual environments affected

  • [ ] macOS 10.15
  • [ ] Ubuntu 16.04 LTS
  • [x] Ubuntu 18.04 LTS
  • [ ] Windows Server 2016 R2
  • [ ] Windows Server 2019

Expected behavior
Ubuntu 18.04 should be able to resolve *.cognitiveservices.azure.com DNS names by default.

Actual behavior
If we try to resolve a *.cognitiveservices.azure.com DNS name, it fails with SERVFAIL:

https://dev.azure.com/mharder/public/_build/results?buildId=634&view=logs&j=3dc411e8-b5bf-57f2-a8a7-b25d565c86b1&t=f636eda2-37c8-5cad-c3dc-807f9e9ed0bb&l=59

However, if we bypass the local (stub) DNS server using the following command:

sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf

Then the DNS name can be resolved successfully:

https://dev.azure.com/mharder/public/_build/results?buildId=634&view=logs&j=3dc411e8-b5bf-57f2-a8a7-b25d565c86b1&t=aef3c3f0-973b-547b-f96d-ab903995d1d8&l=59

This doesn't repro on Ubuntu 16.04:

https://dev.azure.com/mharder/public/_build/results?buildId=634&view=logs&j=88c4e28e-b89e-5514-cbb8-a3c153cbe716&t=a85e08da-6704-59a1-1759-4e62f4964eb3&l=59

Pipeline Sources: https://github.com/mikeharder/AzurePipelineTests/blob/f920bf50f72fe45c1d653a3bbaac9dcaf3df7682/azure-pipelines.yml

Ubuntu bug

Most helpful comment

After much digging, I believe this is relevant: https://github.com/systemd/systemd/issues/10672

I see that with EDNS extensions, we are specifying a maximum 512-byte response size. When Azure responds, the response does not include the final A record. If the EDNS extension is removed, Azure then responds with the final A record. I find this strange since neither response goes over 512 bytes. I'm still following up with Azure here, but perhaps we may be able to work around it by disabling the EDNS extension or attempting to raise the max response size.

All 26 comments

Also, this does not repro on a new Azure Ubuntu 18.04 VM. It only repros on a DevOps Ubuntu 18.04 Hosted Agent.

@mikeharder Hello, thank you for the details and the investigation. I was able to reproduce the issue on an Ubuntu 18.04 agent, and it seems something is configured incorrectly here, since the output below shows that the systemd-resolved service should be responsible for the nameservers, but it is not:

systemd-resolve --status
.........
Link 2 (eth0)
          DNS Servers: 168.63.129.16
          DNS Domain: mqljooayxuiuncg2ys3siqhtpb.xx.internal.cloudapp.net

Locally, we still use the 127.0.0.53 address recorded in the /etc/resolv.conf file. It seems you are right: the local /etc/resolv.conf file needs to be linked to the systemd-resolved file.
I will keep you posted.

@mikeharder I have created a pull request with the suggested workaround. As soon as all verification processes are complete, the workaround will be applied to the Ubuntu 18 agents; until then, your workaround is the best option.
I will let you know when the PR is merged.

@mikeharder We have added a fix for the issue to the image, and it will be rolled out next week.
In case of any questions, please let us know; we will be glad to assist you further.

@mikeharder I have created a pull request with the suggested workaround. As soon as all verification processes are complete, the workaround will be applied to the Ubuntu 18 agents; until then, your workaround is the best option.
I will let you know when the PR is merged.

The PR breaks agent build for self hosted agent pools:

Create resolv.conf link.
ln: failed to create symbolic link '/etc/resolv.conf': Device or resource busy

@nerijusk
Agreed. We didn't take into account that Docker self-hosted agents occupy the /etc/resolv.conf file and use it inside containers. We have rolled the pull request back.
@mikeharder It seems the suggested solution cannot be applied to the Ubuntu 18 image due to the limitations described above. I will try to find another solution for the issue, one that hopefully does not affect anything else.

@Darleev: Sounds good. I don't think my workaround is the correct long-term solution; it's just sufficient to unblock our builds. The local (stub) DNS server should be able to resolve all DNS names.

As I mentioned earlier, this doesn't repro on an Azure Ubuntu 18 VM, so you might want to start by figuring out which additional component on a DevOps Hosted Ubuntu 18 VM is causing this behavior difference. Maybe Docker?

Last time I tested, I am pretty sure it did not repro on an Azure Ubuntu 18 VM. However, I just created a new Azure Ubuntu 18 VM, and now it does repro until I apply the same workaround.

However, it does not repro on a Hyper-V VM created from ubuntu-18.04.4-live-server-amd64.iso.

@mikeharder I have created a pull request with the suggested workaround. As soon as all verification processes are complete, the workaround will be applied to the Ubuntu 18 agents; until then, your workaround is the best option.
I will let you know when the PR is merged.

The PR breaks agent build for self hosted agent pools:

Create resolv.conf link.
ln: failed to create symbolic link '/etc/resolv.conf': Device or resource busy

@nerijusk, could you please update and validate the script with these small changes?

if [[ -f /run/systemd/resolve/resolv.conf ]]; then
    echo "Create resolv.conf link."
    ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
fi
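A possible refinement, sketched by me rather than taken from the actual PR: Docker bind-mounts /etc/resolv.conf inside containers, which is why ln fails there with "Device or resource busy", so the guard could also check the mount table before linking:

```shell
#!/usr/bin/env bash
# Returns success when the given path appears as a mount point in the
# given mounts table (e.g. /proc/mounts). This is true for the
# bind-mounted /etc/resolv.conf inside Docker containers.
is_bind_mounted() {
    grep -qs " $1 " "$2"
}

# Hypothetical combined guard (needs root):
# if [[ -f /run/systemd/resolve/resolv.conf ]] &&
#         ! is_bind_mounted /etc/resolv.conf /proc/mounts; then
#     echo "Create resolv.conf link."
#     ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
# fi
```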

@mikeharder @nerijusk We have applied a new fix per the comment above, and it will be rolled out next week.

However, it does not repro on a Hyper-V VM created from ubuntu-18.04.4-live-server-amd64.iso.

Are we pulling in kernel upgrades with each new image? Do we suspect a DNS bug that was recently introduced into systemd? If so, we should probably file an issue upstream.

@chkimes: It might be this bug, but I am not an expert in this area so I am not certain:

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1822416

While it does not repro on a new VM created from the ISO, it does repro on a new Azure Ubuntu 18.04 image, so the issue does appear to be upstream from DevOps.

One thing I know recently changed in the Azure Ubuntu 18.04 image is the default culture was changed from en-US to C (invariant), which caused changes like the enumeration order of filesystem items. I wouldn't expect this to be related to this DNS issue, but maybe?

It's possible that they're related, but the behaviors appear to be different from the bug description. I'm seeing the DNS resolver return SERVFAIL while the linked ticket describes NXDOMAIN responses.

I took a packet capture and, interestingly, I see the systemd resolver making an external DNS query and the response making it back to systemd; however, it appears to completely ignore the response and re-issue the query 1 and 2 seconds later (likely due to a configured timeout).

#   TIMESTAMP   Source      Destination Protocol Length Details
176 17:02:13.621323 127.0.0.1   127.0.0.53  DNS  91 Standard query 0xbd0c A sandbox.app.blackduck.com
177 17:02:13.621535 10.1.0.4    168.63.129.16   DNS  102    Standard query 0x289f A sandbox.app.blackduck.com OPT
178 17:02:13.724710 168.63.129.16   10.1.0.4    DNS  474    Standard query response 0x289f A sandbox.app.blackduck.com A 34.66.31.136 A 34.67.84.124 A 35.184.46.39 A 35.184.79.99 A 35.184.144.251 A 35.184.217.249 A 35.188.45.49 A 35.188.165.96 A 35.192.71.147 A 35.193.54.152 A 35.193.180.118 A 35.194.28.222 A 35.202.66.146 A 35.222.40.6 A 35.222.109.216 A 35.222.135.89 A 35.222.182.121 A 35.232.109.144 A 35.232.252.6 A 35.238.225.91 A 35.239.6.69 A 35.239.172.178 OPT
271 17:02:14.430476 10.1.0.4    168.63.129.16   DNS  102    Standard query 0x289f A sandbox.app.blackduck.com OPT
272 17:02:14.445568 168.63.129.16   10.1.0.4    DNS  474    Standard query response 0x289f A sandbox.app.blackduck.com A 34.66.31.136 A 34.67.84.124 A 35.184.46.39 A 35.184.79.99 A 35.184.144.251 A 35.184.217.249 A 35.188.45.49 A 35.188.165.96 A 35.192.71.147 A 35.193.54.152 A 35.193.180.118 A 35.194.28.222 A 35.202.66.146 A 35.222.40.6 A 35.222.109.216 A 35.222.135.89 A 35.222.182.121 A 35.232.109.144 A 35.232.252.6 A 35.238.225.91 A 35.239.6.69 A 35.239.172.178 OPT
283 17:02:16.180575 10.1.0.4    168.63.129.16   DNS  102    Standard query 0x289f A sandbox.app.blackduck.com OPT
284 17:02:16.217957 168.63.129.16   10.1.0.4    DNS  474    Standard query response 0x289f A sandbox.app.blackduck.com A 34.66.31.136 A 34.67.84.124 A 35.184.46.39 A 35.184.79.99 A 35.184.144.251 A 35.184.217.249 A 35.188.45.49 A 35.188.165.96 A 35.192.71.147 A 35.193.54.152 A 35.193.180.118 A 35.194.28.222 A 35.202.66.146 A 35.222.40.6 A 35.222.109.216 A 35.222.135.89 A 35.222.182.121 A 35.232.109.144 A 35.232.252.6 A 35.238.225.91 A 35.239.6.69 A 35.239.172.178 OPT
349 17:02:18.621333 127.0.0.1   127.0.0.53  DNS  91 Standard query 0xbd0c A sandbox.app.blackduck.com
350 17:02:19.430426 10.1.0.4    168.63.129.16   DNS  102    Standard query 0x289f A sandbox.app.blackduck.com OPT
351 17:02:19.432270 168.63.129.16   10.1.0.4    DNS  474    Standard query response 0x289f A sandbox.app.blackduck.com A 34.67.84.124 A 35.184.46.39 A 35.184.79.99 A 35.184.144.251 A 35.184.217.249 A 35.188.45.49 A 35.188.165.96 A 35.192.71.147 A 35.193.54.152 A 35.193.180.118 A 35.194.28.222 A 35.202.66.146 A 35.222.40.6 A 35.222.109.216 A 35.222.135.89 A 35.222.182.121 A 35.232.109.144 A 35.232.252.6 A 35.238.225.91 A 35.239.6.69 A 35.239.172.178 A 35.239.196.223 OPT
481 17:02:23.621557 127.0.0.1   127.0.0.53  DNS  91 Standard query 0xbd0c A sandbox.app.blackduck.com
482 17:02:23.930485 127.0.0.53  127.0.0.1   DNS  91 Standard query response 0xbd0c Server failure A sandbox.app.blackduck.com
483 17:02:23.930567 127.0.0.53  127.0.0.1   DNS  91 Standard query response 0xbd0c Server failure A sandbox.app.blackduck.com
484 17:02:23.930587 127.0.0.53  127.0.0.1   DNS  91 Standard query response 0xbd0c Server failure A sandbox.app.blackduck.com
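For reference, the "Server failure" responses in rows 482-484 are SERVFAIL, which is carried in the low four bits (the RCODE) of the DNS flags word. A tiny decoder, with sample flag values that are typical rather than taken from this capture:

```shell
#!/usr/bin/env bash
# Extracts the 4-bit RCODE from a 16-bit DNS flags word.
# 0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN (RFC 1035).
dns_rcode() {
    echo $(( $1 & 0x000F ))
}
```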

@mikeharder The fix has been applied to the current images, and the initial DNS issue should no longer reproduce. Could you please verify?
@chkimes If you believe the issue needs further investigation and should be reported to the systemd team, please let us know.

I think at least reporting the bug to systemd is the responsible thing to do here. It clearly regressed in a recent release, so something broke, and we shouldn't have to work around it.

@Darleev @chkimes This workaround breaks things for some users; we have to roll back the changes:
https://github.com/actions/virtual-environments/issues/929#issuecomment-634233788

@chkimes @mikeharder That looks like a systemd-resolved bug that cannot be fixed on our side due to the possible unpredictable impact on other customers (example). As a workaround, I suggest the approach described in the initial message.
To find the root cause, please file a question with the systemd-resolved team or report the bug directly in their bug tracker.
In case of any questions, feel free to contact us.

@Darleev, @chkimes: Last time I tested, I could repro this on a new Azure Ubuntu 18 VM, but not on a new Hyper-V VM created from the latest Ubuntu Server ISO. And I believe both VMs were using the same version of systemd-resolve.

So I am not sure if this issue is in the base Ubuntu Server image or specific to the Azure Ubuntu image. Do you know how to report issues against the Azure Ubuntu images?

@mikeharder Could you please provide the output of the following commands:

sudo systemd-resolve --status
sudo systemd-resolve --version
apt-cache policy libnss-resolve

from the Hyper-V VM? That will help us understand the difference.
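For convenience, the three commands can be collected in one labeled pass; this is only a sketch (sudo is omitted, and a missing tool is reported rather than aborting the run):

```shell
#!/usr/bin/env bash
# Runs each diagnostic command and labels its output, so the Azure and
# Hyper-V results can be diffed side by side.
collect_dns_diagnostics() {
    local cmd
    for cmd in 'systemd-resolve --status' \
               'systemd-resolve --version' \
               'apt-cache policy libnss-resolve'; do
        echo "== $cmd =="
        $cmd 2>&1 || echo "(unavailable)"
    done
}
```

Usage: `collect_dns_diagnostics > dns-diagnostics.log` on each VM, then diff the two logs.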

@mikeharder,
A gentle reminder: we are looking forward to your reply regarding the Ubuntu virtual machine where the systemd issue does not reproduce.
Could you please provide the output of the aforementioned commands?

@Darleev: The output of the latter two commands appears to be identical on an Azure Ubuntu 18 VM and a Hyper-V Ubuntu 18 VM (created from the Ubuntu Server ISO).

$ sudo systemd-resolve --version

*** Azure ***
systemd 237
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP
+GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN
-PCRE2 default-hierarchy=hybrid

*** Hyper-V ***
systemd 237
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP
+GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN
-PCRE2 default-hierarchy=hybrid
$ apt-cache policy libnss-resolve

*** Azure ***
libnss-resolve:
  Installed: (none)
  Candidate: 237-3ubuntu10.41
  Version table:
     237-3ubuntu10.41 500
        500 http://azure.archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages
     237-3ubuntu10.38 500
        500 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages
     237-3ubuntu10 500
        500 http://azure.archive.ubuntu.com/ubuntu bionic/universe amd64 Packages

*** Hyper-V ***
libnss-resolve:
  Installed: (none)
  Candidate: 237-3ubuntu10.41
  Version table:
     237-3ubuntu10.41 500
        500 http://us.archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages
     237-3ubuntu10.38 500
        500 http://us.archive.ubuntu.com/ubuntu bionic-security/universe amd64 Packages
     237-3ubuntu10 500
        500 http://us.archive.ubuntu.com/ubuntu bionic/universe amd64 Packages

The output of the first command is identical in the "Global" section, with slight differences in the "Link" sections:

$ sudo systemd-resolve --status

*** Azure ***
Global
          DNSSEC NTA: 10.in-addr.arpa
                      16.172.in-addr.arpa
                      168.192.in-addr.arpa
                      17.172.in-addr.arpa
                      18.172.in-addr.arpa
                      19.172.in-addr.arpa
                      20.172.in-addr.arpa
                      21.172.in-addr.arpa
                      22.172.in-addr.arpa
                      23.172.in-addr.arpa
                      24.172.in-addr.arpa
                      25.172.in-addr.arpa
                      26.172.in-addr.arpa
                      27.172.in-addr.arpa
                      28.172.in-addr.arpa
                      29.172.in-addr.arpa
                      30.172.in-addr.arpa
                      31.172.in-addr.arpa
                      corp
                      d.f.ip6.arpa
                      home
                      internal
                      intranet
                      lan
                      local
                      private
                      test

Link 3 (rename3)
      Current Scopes: none
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no

Link 2 (eth0)
      Current Scopes: DNS
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
         DNS Servers: 168.63.129.16
          DNS Domain: vy4dqvqijknelj0nz0uufugejc.xx.internal.cloudapp.net

*** Hyper-V ***
Global
          DNSSEC NTA: 10.in-addr.arpa
                      16.172.in-addr.arpa
                      168.192.in-addr.arpa
                      17.172.in-addr.arpa
                      18.172.in-addr.arpa
                      19.172.in-addr.arpa
                      20.172.in-addr.arpa
                      21.172.in-addr.arpa
                      22.172.in-addr.arpa
                      23.172.in-addr.arpa
                      24.172.in-addr.arpa
                      25.172.in-addr.arpa
                      26.172.in-addr.arpa
                      27.172.in-addr.arpa
                      28.172.in-addr.arpa
                      29.172.in-addr.arpa
                      30.172.in-addr.arpa
                      31.172.in-addr.arpa
                      corp
                      d.f.ip6.arpa
                      home
                      internal
                      intranet
                      lan
                      local
                      private
                      test

Link 2 (eth0)
      Current Scopes: DNS
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
         DNS Servers: <redacted>
                      <redacted>
          DNS Domain: <redacted>

Just for reference #191441649

After much digging, I believe this is relevant: https://github.com/systemd/systemd/issues/10672

I see that with EDNS extensions, we are specifying a maximum 512-byte response size. When Azure responds, the response does not include the final A record. If the EDNS extension is removed, Azure then responds with the final A record. I find this strange since neither response goes over 512 bytes. I'm still following up with Azure here, but perhaps we may be able to work around it by disabling the EDNS extension or attempting to raise the max response size.

Successful query:

1135    16:27:25.964386 10.1.0.4    168.63.129.16   DNS 128 Standard query 0xc2d4 A mharder-formrec.cognitiveservices.azure.com OPT

Domain Name System (query)
    Transaction ID: 0xc2d4
    Flags: 0x0120 Standard query
        0... .... .... .... = Response: Message is a query
        .000 0... .... .... = Opcode: Standard query (0)
        .... ..0. .... .... = Truncated: Message is not truncated
        .... ...1 .... .... = Recursion desired: Do query recursively
        .... .... .0.. .... = Z: reserved (0)
        .... .... ..1. .... = AD bit: Set
        .... .... ...0 .... = Non-authenticated data: Unacceptable
    Questions: 1
    Answer RRs: 0
    Authority RRs: 0
    Additional RRs: 1
    Queries
        mharder-formrec.cognitiveservices.azure.com: type A, class IN
    Additional records
        <Root>: type OPT
            Name: <Root>
            Type: OPT (41)
            UDP payload size: 4096
            Higher bits in extended RCODE: 0x00
            EDNS0 version: 0
            Z: 0x0000
                0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                .000 0000 0000 0000 = Reserved: 0x0000
            Data length: 12
            Option: COOKIE

Unsuccessful query:

1128    16:27:25.713886 10.1.0.4    168.63.129.16   DNS 116 Standard query 0x198d A mharder-formrec.cognitiveservices.azure.com OPT

Domain Name System (query)
    Transaction ID: 0x198d
    Flags: 0x0100 Standard query
        0... .... .... .... = Response: Message is a query
        .000 0... .... .... = Opcode: Standard query (0)
        .... ..0. .... .... = Truncated: Message is not truncated
        .... ...1 .... .... = Recursion desired: Do query recursively
        .... .... .0.. .... = Z: reserved (0)
        .... .... ...0 .... = Non-authenticated data: Unacceptable
    Questions: 1
    Answer RRs: 0
    Authority RRs: 0
    Additional RRs: 1
    Queries
        mharder-formrec.cognitiveservices.azure.com: type A, class IN
    Additional records
        <Root>: type OPT
            Name: <Root>
            Type: OPT (41)
            UDP payload size: 512
            Higher bits in extended RCODE: 0x00
            EDNS0 version: 0
            Z: 0x0000
                0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                .000 0000 0000 0000 = Reserved: 0x0000
            Data length: 0

Notable difference:

Success:
            UDP payload size: 4096

Failure:
            UDP payload size: 512

And notable differences in the responses:

Success:
    Flags: 0x8180 Standard query response, No error
        .... ..0. .... .... = Truncated: Message is not truncated

Failure:
    Flags: 0x8380 Standard query response, No error
        .... ..1. .... .... = Truncated: Message is truncated
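The truncation difference is carried by the TC bit (mask 0x0200) of the DNS flags word: 0x8180 has it clear, 0x8380 has it set. A one-line check:

```shell
#!/usr/bin/env bash
# Succeeds when the TC (truncated) bit is set in a 16-bit DNS flags word.
dns_is_truncated() {
    (( ($1 & 0x0200) != 0 ))
}
```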

Interestingly, systemd-resolved is setting the maximum payload size to 512 regardless of whether EDNS0 is configured and regardless of what payload size is sent to it. I'm reasonably sure that the way to fix this is to increase the payload size that systemd-resolved is using, but I can't find any details about how to do that in the docs.

This explains why bypassing the local resolver was effective as a workaround.

Hello @mikeharder,
In the end, we did not find a way to change the UDP payload size for virtual machines; it seems it can only be changed on the systemd-resolved side. I have filed a bug in the Ubuntu issue tracker.

In case of any questions or issues, feel free to contact us.
