Datadog-agent: There was an error querying the ntp host

Created on 27 Mar 2018 · 21 Comments · Source: DataDog/datadog-agent

Describe what happened:
Launched the agent and the logs look like:

[ AGENT ] 2018-03-27 18:42:06 UTC | INFO | (transaction.go:129 in Process) | Successfully posted payload to "https://6-1-0-app.agent.datadoghq.com/intake/?api_key=*************************370fe"
[ AGENT ] 2018-03-27 18:42:12 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:49888->45.76.244.202:123: i/o timeout
[ AGENT ] 2018-03-27 18:42:27 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:58747->198.137.202.56:123: i/o timeout
[ AGENT ] 2018-03-27 18:42:37 UTC | INFO | (serializer.go:196 in SendJSONToV1Intake) | Sent processes metadata payload, size: 410 bytes.
[ AGENT ] 2018-03-27 18:42:42 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:44762->38.229.71.1:123: i/o timeout
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check cpu
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check cpu
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check disk
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check disk
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check docker
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check docker
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check file_handle
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check file_handle
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check io
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check io
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check load
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check load
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check memory
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check memory
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check network
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check network
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check ntp
[ AGENT ] 2018-03-27 18:42:57 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:56134->162.210.111.4:123: i/o timeout
[ AGENT ] 2018-03-27 18:42:57 UTC | INFO | (runner.go:302 in work) | Done running check ntp
[ AGENT ] 2018-03-27 18:42:57 UTC | INFO | (runner.go:246 in work) | Running check uptime
[ AGENT ] 2018-03-27 18:42:57 UTC | INFO | (runner.go:302 in work) | Done running check uptime
[ AGENT ] 2018-03-27 18:43:12 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:43058->162.210.111.4:123: i/o timeout
[ AGENT ] 2018-03-27 18:43:27 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:46164->107.161.29.207:123: i/o timeout
[ AGENT ] 2018-03-27 18:43:42 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:39099->162.210.111.4:123: i/o timeout
[ AGENT ] 2018-03-27 18:43:57 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:52643->171.66.97.126:123: i/o timeout
[ AGENT ] 2018-03-27 18:44:12 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:34427->162.210.111.4:123: i/o timeout
[ AGENT ] 2018-03-27 18:44:21 UTC | INFO | (transaction.go:129 in Process) | Successfully posted payload to "https://6-1-0-app.agent.datadoghq.com/api/v1/check_run?api_key=*************************370fe"
[ AGENT ] 2018-03-27 18:44:27 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:53556->204.11.201.10:123: i/o timeout
[ AGENT ] 2018-03-27 18:44:42 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:59500->172.98.193.44:123: i/o timeout
[ AGENT ] 2018-03-27 18:44:57 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:33902->96.244.96.19:123: i/o timeout
[ AGENT ] 2018-03-27 18:45:12 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:36223->198.137.202.56:123: i/o timeout

Additional environment details (Operating System, Cloud provider, etc):
Docker, AWS, Amazon Linux

adamgotterer

👍18

All 21 comments

I am also facing the same issue. Is there any update on this? Is a Datadog engineer working on it? Please prioritize this.

2018-04-14 06:17:25 UTC | INFO | (runner.go:246 in work) | Running check ntp
2018-04-14 06:17:30 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 25.128.37.38:37261->216.218.220.101:123: i/o timeout
2018-04-14 06:17:30 UTC | INFO | (runner.go:302 in work) | Done running check ntp
2018-04-14 06:17:30 UTC | INFO | (runner.go:246 in work) | Running check uptime
2018-04-14 06:17:30 UTC | INFO | (runner.go:302 in work) | Done running check uptime
2018-04-14 06:17:45 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 25.128.37.38:49068->162.210.111.4:123: i/o timeout
2018-04-14 06:18:00 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 25.128.37.38:57120->216.218.220.101:123: i/o timeout

I am using the Datadog Docker image datadog/agent:6.1.2-jmx as a sidecar container in K8s.

It seems that, because of this time sync issue, I am not seeing the JMX metrics in Datadog dashboards.
https://github.com/DataDog/datadog-agent/blob/68f10761cbcc3b541f645fc4f5cefc65036c3794/pkg/collector/corechecks/network/ntp.go#L122

shine17 on 14 Apr 2018

This is what I get when I run the Datadog Agent NTP check from the sidecar:

root@something*********-hj2hx:/opt/datadog-agent/bin/agent# ./agent check ntp -r -l INFO
%!s(int=442840760) | INFO | (tagger.go:78 in Init) | starting the tagging system
%!s(int=442840760) | INFO | (runner.go:92 in NewRunner) | Runner started with 1 workers.
%!s(int=442840760) | INFO | (collector.go:51 in NewCollector) | Embedding Python 2.7.14 (default, Apr  4 2018, 16:58:02) [GCC 4.7.2]
%!s(int=442840760) | INFO | (file.go:69 in Collect) | File Configuration Provider: searching for configuration files at: /etc/datadog-agent/conf.d
%!s(int=442840760) | INFO | (file.go:69 in Collect) | File Configuration Provider: searching for configuration files at: /opt/datadog-agent/bin/agent/dist/conf.d
%!s(int=442840760) | WARN | (file.go:73 in Collect) | Skipping, open /opt/datadog-agent/bin/agent/dist/conf.d: no such file or directory
%!s(int=442840760) | WARN | (check.go:243 in Configure) | could not get a check instance with the new api: __init__() takes at least 4 arguments (4 given)
%!s(int=442840760) | WARN | (check.go:244 in Configure) | trying to instantiate the check with the old api, passing agentConfig to the constructor
%!s(int=442840760) | WARN | (check.go:269 in Configure) | passing `agentConfig` to the constructor is deprecated, please use the `get_config` function from the 'datadog_agent' package (disk).
%!s(int=442845760) | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 25.128.37.38:55411->162.210.111.4:123: i/o timeout
%!s(int=442850760) | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 25.128.37.38:43615->192.155.90.13:123: i/o timeout
=== Service Checks ===
[
  {
    "check": "ntp.in_sync",
    "host_name": "something**********hj2hx",
    "timestamp": 1523688165,
    "status": 3,
    "message": "",
    "tags": null
  },
  {
    "check": "ntp.in_sync",
    "host_name": "something*************-7b64fd57fb-hj2hx",
    "timestamp": 1523688170,
    "status": 3,
    "message": "",
    "tags": null
  }
]
%!s(int=442850760) | ERROR | (host.go:168 in getCPUInfo) | failed to retrieve cpu info at init time
=========
Collector
=========

  Running Checks
  ==============
    ntp
    ---
      Total Runs: 2
      Metrics: 0, Total Metrics: 0
      Events: 0, Total Events: 0
      Service Checks: 1, Total Service Checks: 2

shine17 on 14 Apr 2018

@adamgotterer @shine17 Check that your firewalls/security groups allow outgoing traffic from high source ports to target port 123 (not only from source port 123).
Regular NTP sends from port 123 to port 123, but the Datadog NTP check sends from high ports (example in the log above: 25.128.37.38:55411->162.210.111.4:123). I had the same error logged when my firewall was restricting high ports (and only allowing 123<->123).
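
To reproduce the behaviour described above outside the Agent, here is a minimal sketch in Python (not the Agent's Go code). It sends one SNTP request over UDP from whatever source port the OS picks, which is normally a high port, and waits for the reply. The function names, the 5-second default timeout, and the protocol version are assumptions chosen for illustration.

```python
import socket
import struct

NTP_PORT = 123
NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 (NTP) and 1970-01-01 (Unix)


def build_request(version: int = 3) -> bytes:
    """Return a 48-byte SNTP client request (LI=0, VN=version, Mode=3)."""
    first_byte = (0 << 6) | (version << 3) | 3
    return bytes([first_byte]) + bytes(47)


def parse_transmit_time(packet: bytes) -> float:
    """Extract the server's transmit timestamp as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def query(host: str, timeout: float = 5.0) -> float:
    """Send one request from an OS-chosen (ephemeral) source port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_request(), (host, NTP_PORT))
        # Raises socket.timeout if the reply never arrives, e.g. when a
        # firewall drops inbound UDP to the ephemeral source port.
        data, _ = sock.recvfrom(512)
        return parse_transmit_time(data)
```

Calling `query("169.254.169.123")` from inside the container either returns the server's time or raises `socket.timeout`, which corresponds to the `i/o timeout` seen in the Agent log.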

pbudzon on 15 Apr 2018

@pbudzon I'm on AWS with DD running on an ECS cluster. The machines running those containers have egress rules for TCP open on ports 0 - 65535. So I don't think it's a security group issue.

adamgotterer on 16 Apr 2018

@adamgotterer what about your Network ACL, though?

acmcelwee on 16 Apr 2018

Just double-checked the network and ACL, and it allows all traffic on all protocols outbound.

adamgotterer on 16 Apr 2018

Remember that network ACLs are stateless, so you need to allow traffic in
as well as out.
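
A small Python sketch of why a stateless filter needs both directions: the outbound request goes to UDP port 123, and the reply comes back as inbound UDP to the client's high source port. The `Rule` model and the 1024-65535 ephemeral range are hypothetical simplifications, not the AWS API.

```python
from dataclasses import dataclass

EPHEMERAL_PORTS = range(1024, 65536)  # assumed ephemeral range for illustration


@dataclass(frozen=True)
class Rule:
    """One hypothetical allow rule for a stateless filter."""
    direction: str  # "out" or "in"
    protocol: str   # "udp" or "tcp"
    ports: range    # destination ports the rule allows


def allowed(rules, direction, protocol, port):
    """True if any rule allows this direction/protocol/destination port."""
    return any(
        r.direction == direction and r.protocol == protocol and port in r.ports
        for r in rules
    )


def ntp_round_trip_ok(rules, source_port):
    """A stateless filter must pass the request and, separately, the reply."""
    request_ok = allowed(rules, "out", "udp", 123)
    reply_ok = allowed(rules, "in", "udp", source_port)
    return request_ok and reply_ok
```

With only an outbound rule for UDP 123, `ntp_round_trip_ok(rules, 55411)` is `False`; adding an inbound UDP rule covering the ephemeral range makes it `True`. Rules that only cover TCP never help, because NTP runs over UDP.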

pbudzon on 16 Apr 2018

... why doesn't datadog just trust the host's time?

SleepyBrett on 16 Apr 2018

👍1

This doesn’t have much to do with trust. It’s one of the default checks
(like collecting CPU and memory utilization) that validates that your
system’s time hasn’t drifted. You can see the time difference (between the
system’s time and the NTP-reported time) in Datadog metrics, just like you
can see CPU, memory, and a bunch of other stuff out of the box.
If you have ntpd or a similar service enabled and working on your server,
then this check is usually one you can get by without, but it’s still nice
to have, especially if you’re doing any time-sensitive stuff on the server,
like crypto or some kinds of authentication.
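
For reference, the time difference between a host and an NTP server is conventionally derived from four timestamps taken during one request/reply exchange. The sketch below shows the standard NTP formulas in Python. Whether the Agent computes its offset metric in exactly this way is an assumption, and the function names are illustrative.

```python
def clock_offset(t0, t1, t2, t3):
    """Estimated offset of the local clock relative to the server, in seconds.

    t0: client send time      (local clock)
    t1: server receive time   (server clock)
    t2: server transmit time  (server clock)
    t3: client receive time   (local clock)
    A positive result means the server clock is ahead of the local clock.
    """
    return ((t1 - t0) + (t2 - t3)) / 2


def round_trip_delay(t0, t1, t2, t3):
    """Network round-trip time, excluding the server's processing time."""
    return (t3 - t0) - (t2 - t1)
```

For example, with `t0=100, t1=112, t2=113, t3=103` the server is estimated to be 11 seconds ahead, with a round-trip delay of 2 seconds.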

pbudzon on 16 Apr 2018

I'm getting the same error. Dedicated server. Ports opened.

2018-05-15 20:55:10 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp x.x.x.x:60440->64.113.44.55:123: i/o timeout

bompus on 15 May 2018

Any update on this?

bruno-carrier-lookout on 10 Jul 2018

On a fresh installation:

agent.log INFO UTC | INFO | (ntp.go:123 in Run) | There was an error querying the ntp host: read udp x.x.x.x:52870->138.96.64.10:123: i/o timeout

still broken.

mad42 on 13 Jul 2018

We're experiencing this as well. This seems to be causing the dd agent to stop reporting momentarily and as a result our monitors start to fire for No Data alerts. Any ETA on the fix?

simar7 on 17 Jul 2018

Could we look at supporting the ability to pass a runtime variable to define an NTP host for the Docker container within AWS, so we could leverage their internal endpoint (169.254.169.123)?

A number of organisations aren't always keen to allow NTP in/out of their VPC/NAT instances.

neoghostz on 26 Jul 2018

👍3

Isn't there an option to just turn this off?

ChrisMcKee on 1 Aug 2018

We're also having this issue, and it appears to be intermittent. The agent will suddenly recover and start posting metrics briefly, then later fail again for a while.

asmoran on 15 Aug 2018

It looks as though there are configuration options for NTP (https://docs.datadoghq.com/integrations/ntp/#configuration).

The page reads "The Agent enables the NTP check by default, but if you want to configure the check yourself, edit the file ntp.d/conf.yaml in the conf.d/ folder at the root of your Agent’s configuration directory."

There doesn't seem to be an option to disable NTP checks (although stating that the check is enabled by default does seem to imply there is a way to turn it off). For organizations that want to keep NTP internal, NTP server locations are configurable.

micahsmith on 25 Sep 2018

Hi all,

Apologies for the late reply. As @micahsmith mentioned, you can configure the Agent's NTP check to query your own/your service provider's NTP servers by following the instructions at https://docs.datadoghq.com/integrations/ntp/#configuration.
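
As a rough illustration of what such a configuration could look like, the Python helper below renders the text of an ntp.d/conf.yaml that points the check at specific servers. It assumes the check accepts a `hosts` list and an `offset_threshold` per instance. Verify both option names against the linked documentation for your Agent version. The helper name is hypothetical.

```python
def render_ntp_conf(hosts, offset_threshold=60):
    """Return YAML text for ntp.d/conf.yaml using the given NTP servers.

    Assumes the `hosts` and `offset_threshold` instance options;
    check the NTP integration docs before relying on them.
    """
    if not hosts:
        raise ValueError("at least one NTP host is required")
    lines = [
        "init_config:",
        "",
        "instances:",
        f"  - offset_threshold: {offset_threshold}",
        "    hosts:",
    ]
    lines += [f"      - {host}" for host in hosts]
    return "\n".join(lines) + "\n"
```

For instance, `render_ntp_conf(["169.254.169.123"])` produces a file that targets the AWS-internal endpoint mentioned earlier in this thread. The Agent needs a restart to pick up the change.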

Also, as with any other Agent check that's enabled by default, you can disable the check completely by removing the file at <agent_conf_dir>/conf.d/ntp.d/conf.yaml.default and restarting the Agent (on a standard Linux install, <agent_conf_dir> is /etc/datadog-agent).

That said, we don't recommend disabling the NTP check entirely as:
1) it allows you to know when your host's clock is skewed
2) an Agent running on a host on which the system clock is significantly skewed (approx. >60s) may report data so much in the past or future that it'll affect your graphs and monitors on Datadog.
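
To make the two points above concrete, here is a hedged Python sketch of how an `ntp.in_sync` result could be derived from a measured offset. The status codes follow Datadog's service check convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The mapping itself, including returning UNKNOWN when the query fails, is an assumption that is consistent with the `"status": 3` entries posted earlier in this thread, not a copy of the Agent's logic.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def in_sync_status(offset_seconds, threshold=60):
    """Map a clock offset to a service check status.

    offset_seconds is None when the NTP query failed (e.g. i/o timeout).
    The 60-second default mirrors the approximate skew quoted above.
    """
    if offset_seconds is None:
        return UNKNOWN
    if abs(offset_seconds) > threshold:
        return CRITICAL
    return OK
```

Under this sketch, a timed-out query yields UNKNOWN rather than CRITICAL, so the timeouts in this issue would not by themselves say anything about whether the host's clock is skewed.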

We've implemented significant improvements to the NTP check in v6.4.0 and 6.5.0. If you're having issues with the NTP check, please upgrade to Agent >= 6.5.0. If you still run into issues with the NTP check after upgrading, please reach out to our support team and send them your Agent's logs.

olivielpeau on 3 Oct 2018

Having this issue on a K8s Deployment on AWS, Agent version 6.10.2.

ahharu on 4 Apr 2019

@ahharu Thanks for the heads up, could you send us a note via [email protected] with a flare and some details about your configuration? We could then assess the situation.

dabcoder on 6 May 2019

We have merged a couple PRs for the upcoming 6.14 release that should make this better. Please let us know if this is still an issue for you after upgrading.