Tendermint: light: rpctest.Tendermint instance's clock drifts +7min

Created on 27 Feb 2020 · 34 comments · Source: tendermint/tendermint

Discovered in: https://github.com/tendermint/tendermint/pull/4487

I think it's because we're stripping the monotonic part (see types/time/time.go) and using only a single instance (so median time isn't really in play). But it would be good to confirm this.
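For reference, "stripping the monotonic part" in Go means calling Round(0) on a time.Time, which drops the monotonic clock reading. A minimal illustration of the idea (it mirrors what a canonical-time helper like the one in types/time/time.go does, but is not a copy of it):

package main

import (
	"fmt"
	"time"
)

// canonical returns t in UTC with the monotonic clock reading stripped.
// Per the time package docs, Round(0) is the documented way to drop it.
func canonical(t time.Time) time.Time {
	return t.Round(0).UTC()
}

func main() {
	t := time.Now()
	fmt.Println(t)            // includes an "m=+..." monotonic reading
	fmt.Println(canonical(t)) // wall-clock time only, in UTC
}

Without the monotonic reading, later comparisons fall back to the wall clock, which can jump (NTP adjustments, VM sleep/resume), so small drifts are not self-correcting.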

Labels: bug, test

All 34 comments

Any idea if this is related to https://github.com/informalsystems/tendermint-rs/issues/512 ? When running tendermint in a docker container on Mac it seems the node's clock drifts ahead weirdly which makes it a challenge to test the light client against it.

UPDATE: might just be a docker for mac issue: https://github.com/informalsystems/tendermint-rs/issues/512#issuecomment-690882655

It's likely not an issue with Docker for Mac:

In the error above, the header_time is parsed from the JSON returned over RPC by the Go implementation, while now is queried via the chrono crate, which may explain the difference in precision. I don't think that this really matters, as the times reported by

a) chrono on my machine
b) $ date on my machine
c) $ date within the container

all agree with each other. It's only the time seen by the Tendermint node that is drifting.

This doesn't sound good. It should be possible to confirm that by running a local node without Docker.

> This doesn't sound good. It should be possible to confirm that by running a local node without Docker.

A local node does not exhibit the same behavior AFAICT: https://github.com/informalsystems/tendermint-rs/issues/512#issuecomment-673416280

So it's not an issue with Docker on Mac, and it's not a local node issue. How's that possible? 😄

Maybe the combination of the two?

I'm having similar issues with Tendermint 0.34 in a Docker guest on macOS. Versions I am testing are

  • v0.33.5 from DockerHub
  • latest from DockerHub

with --proxy_app=kvstore.

The time of the latest block is more than one hour in the future. The following command shows how I see the problem:

  1. Time on my host
  2. Time inside the guest
  3. Time reported by Tendermint RPC (/status)
date && docker exec tendermint-34 date && (curl -sS http://localhost:11134/status | jq .result.sync_info)       
Thu Sep 17 18:03:40 CEST 2020
Thu Sep 17 16:03:41 UTC 2020
{
  "latest_block_hash": "CBD00A672A29CAC1DCC481026B880FCC1B59B5F1D1555AB35A30EF0C88BF7F67",
  "latest_app_hash": "A001000000000000",
  "latest_block_height": "10327",
  "latest_block_time": "2020-09-17T17:21:45.4951156Z",
  "earliest_block_hash": "C4A7DD2C1F5CEA6F13C2A42D506313A37782E4753EC1542741455414B50EEDAB",
  "earliest_app_hash": "",
  "earliest_block_height": "1",
  "earliest_block_time": "2020-09-17T14:29:38.7926075Z",
  "catching_up": false
}

As you can see, at 16:03 UTC I have a block from 17:21 UTC.

This problem does not happen with Tendermint 0.33.5 in the same environment:

$ date && docker exec tendermint-33 date && (curl -sS http://localhost:11133/status | jq .result.sync_info)
Thu Sep 17 17:43:29 CEST 2020
Thu Sep 17 15:43:30 UTC 2020
{
  "latest_block_hash": "52894231AB5545A115326715D9000B7D6FB60EE0D73E8A1F1132574154D48DB8",
  "latest_app_hash": "7800000000000000",
  "latest_block_height": "4209",
  "latest_block_time": "2020-09-17T15:43:28.8252145Z",
  "earliest_block_hash": "ADF18D97F9F23E0B7387CE7C2BECFCD643E6947BC21F44B521A35F03E9172BDC",
  "earliest_app_hash": "",
  "earliest_block_height": "1",
  "earliest_block_time": "2020-09-17T14:29:30.9058132Z",
  "catching_up": false
}

Thanks for the details! I haven't been able to reproduce this locally, but will let a Docker testnet run for a few hours and see.

Can you send the output of /block please? I'd like to have a look at the signature timestamps of the block. Also, is this with a single node or a network?

> Can you send the output of /block please? I'd like to have a look at the signature timestamps of the block.

Sure, this is block 10327 of the Tendermint 0.34 instance I mentioned before:

$ curl -sS "http://localhost:11134/block?height=10327" | jq .result
{
  "block_id": {
    "hash": "CBD00A672A29CAC1DCC481026B880FCC1B59B5F1D1555AB35A30EF0C88BF7F67",
    "parts": {
      "total": 1,
      "hash": "61A950C86F1B8433F93C4C0E7892CA66A8D04336D72248212C0220E235F87EF2"
    }
  },
  "block": {
    "header": {
      "version": {
        "block": "11",
        "app": "1"
      },
      "chain_id": "dockerchain",
      "height": "10327",
      "time": "2020-09-17T17:21:45.4951156Z",
      "last_block_id": {
        "hash": "436132FA314D04F56BB52B2EF09A31164977C194C488DEDCA83F8A0A4F98C5E9",
        "parts": {
          "total": 1,
          "hash": "528832A26FAC85D5AB8D7937756F65E1A3DC5D241B73C3A2C5FBCCA4399A4FFA"
        }
      },
      "last_commit_hash": "D6A17FFC16AC2BB4B1489931BD47B1CA044C089D3750932F25C12C1D1D0C4F42",
      "data_hash": "E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855",
      "validators_hash": "B2A6CC880A471433B7CDA48AA4ED0724A05BB6CF8C7705F3A37FEAEA1AA51821",
      "next_validators_hash": "B2A6CC880A471433B7CDA48AA4ED0724A05BB6CF8C7705F3A37FEAEA1AA51821",
      "consensus_hash": "048091BC7DDC283F77BFBF91D73C44DA58C3DF8A9CBC867405D8B7F3DAADA22F",
      "app_hash": "A001000000000000",
      "last_results_hash": "E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855",
      "evidence_hash": "E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855",
      "proposer_address": "72D0CAACF4564BE66DD44A697DB9A740090618AB"
    },
    "data": {
      "txs": []
    },
    "evidence": {
      "evidence": []
    },
    "last_commit": {
      "height": "10326",
      "round": 0,
      "block_id": {
        "hash": "436132FA314D04F56BB52B2EF09A31164977C194C488DEDCA83F8A0A4F98C5E9",
        "parts": {
          "total": 1,
          "hash": "528832A26FAC85D5AB8D7937756F65E1A3DC5D241B73C3A2C5FBCCA4399A4FFA"
        }
      },
      "signatures": [
        {
          "block_id_flag": 2,
          "validator_address": "72D0CAACF4564BE66DD44A697DB9A740090618AB",
          "timestamp": "2020-09-17T17:21:45.4951156Z",
          "signature": "yJTaxi25WbkdWOjKczdQrk3rT0bcnMPkB52BP+shgjcwT8o2R9KqqxL1j8WZu7cdxaJAWLAVaqGraT+BgqlPDQ=="
        }
      ]
    }
  }
}

> Also, is this with a single node or a network?

A single node.

Another observation I just made: the node has been running for 5 hours now, but my machine was asleep for some hours in between. When it woke up, the drift was somehow reset: first it was less than 2 minutes ahead, now it is already 9 minutes ahead.

Tendermint 0.33.8 from Docker Hub does not have the issue and behaves the same as 0.33.5.

I can reproduce the same behaviour (v0.33.8 OK, latest produces blocks in the future) using Docker on a Linux host.

Did something happen to the way we manage time in v0.34 that could explain this?

> Did something happen to the way we manage time in v0.34 that could explain this?

Nobody has touched the median time calculation or the types/time package, to my knowledge.

Was the Go compiler or standard library upgraded between the two versions?

> Was the Go compiler or standard library upgraded between the two versions?

yep

I can reproduce this when I use the image from Docker Hub, but not when running a Docker testnet via make localnet-start. Most likely an artifact of the environment, build, or configuration - will dig into it.

The Docker Hub image sets timeout_commit = "500ms" - when I instead build an image with the default value of 1s, the blocks have correct timestamps. Any thoughts on what may cause this?

For reference, the Docker image is configured here: https://github.com/tendermint/tendermint/blob/master/DOCKER/docker-entrypoint.sh

I also see the incorrect behavior with 0.33.5, 0.33.6 and 0.33.8 from Docker Hub.

It's caused by using the default time_iota_ms: 1000 setting along with timeout_commit = "500ms". Setting time_iota_ms to 500 as well fixes the drift. Opened https://github.com/tendermint/tendermint/pull/5385 for this.
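For intuition on why this pair of settings drifts: with a single validator, each new block's timestamp must be at least the previous block time plus time_iota_ms, but a block is committed roughly every timeout_commit. Header time therefore advances ~1s per block while the wall clock advances only ~0.5s, and the surplus accumulates. A rough simulation of that rule (a sketch, not Tendermint's actual proposal logic):

package main

import (
	"fmt"
	"time"
)

func main() {
	timeoutCommit := 500 * time.Millisecond // block interval in the Docker image config
	timeIota := 1000 * time.Millisecond     // default time_iota_ms

	wallClock := time.Now()
	blockTime := wallClock
	for height := 1; height <= 10327; height++ { // height from the /status output above
		wallClock = wallClock.Add(timeoutCommit) // real time advances ~500ms per block
		next := wallClock
		if minTime := blockTime.Add(timeIota); next.Before(minTime) {
			next = minTime // timestamp forced forward to prev block time + time_iota_ms
		}
		blockTime = next
	}
	fmt.Println("drift:", blockTime.Sub(wallClock)) // ≈ +1h26m at height 10327
}

That is in the same ballpark as the ~78 minutes of drift observed at height 10327 above (the real blocks took slightly longer than 500ms each).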

Wow nice find! Did something about this change since v0.33?

No, the 0.33 images have the same problem. It's possible some users have old versions of the images cached locally or something.

@melekes As for the original problem with rpctest, it's likely to be a similar cause. config.TestConfig() uses a commit timeout of 10ms, but ConsensusParams.TimeIotaMs needs to be set to something similar as it defaults to 1s.
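A hedged sketch of that alignment (config.TestConfig and the TimeIotaMs consensus parameter exist in Tendermint around 0.33/0.34; the exact wiring inside rpctest may differ):

package main

import (
	"fmt"

	"github.com/tendermint/tendermint/config"
	"github.com/tendermint/tendermint/types"
)

func main() {
	cfg := config.TestConfig() // uses a 10ms commit timeout
	params := types.DefaultConsensusParams()
	// TimeIotaMs defaults to 1000ms; if it exceeds timeout_commit,
	// block timestamps are forced ahead of the wall clock.
	params.Block.TimeIotaMs = cfg.Consensus.TimeoutCommit.Milliseconds()
	fmt.Println("time_iota_ms:", params.Block.TimeIotaMs)
}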

Cool, great find!

> No, the 0.33 images have the same problem. It's possible some users have old versions of the images cached locally or something.

Are you sure? Those are the config values I see:

0.33.5: timeout_commit = "1s" (image downloaded long ago)
0.33.6: timeout_commit = "1s" (image downloaded just now)
0.33.7: timeout_commit = "1s" (image downloaded just now)
0.33.8: timeout_commit = "1s" (image downloaded in the last week)
latest: timeout_commit = "500ms" (image downloaded today)

Looks like this problem occurred before: https://github.com/tendermint/tendermint/issues/3723

> Are you sure? Those are the config values I see

Yes:

$ docker pull tendermint/tendermint:v0.33.5
v0.33.5: Pulling from tendermint/tendermint
Digest: sha256:e81ef363944ef272408f0a2828b4fd5511eca7f0e53afd9b07a4f615e95f71da
Status: Image is up to date for tendermint/tendermint:v0.33.5
docker.io/tendermint/tendermint:v0.33.5

$ docker run --entrypoint '' tendermint/tendermint:v0.33.5 grep -E ^timeout_commit config/config.toml
timeout_commit = "500ms"

Are you using a different image?

Eh wait, that command shouldn't even have worked since the config is generated by the image entrypoint at runtime, and I'm explicitly disabling the entrypoint here. The image specifies a volume in the Dockerfile, and I think an implicit volume is set up unless one is provided when running the command - this stateful volume ends up being reused:

$ docker run --entrypoint '' -v $(PWD)/tendermint:/tendermint tendermint/tendermint:v0.33.5 grep -E ^timeout_commit config/config.toml
grep: config/config.toml: No such file or directory

This seems like a real footgun. Is there a reason why we're using an implicit volume in the Dockerfile rather than having it be explicit and opt-in?

Ah, found the difference: In my setup I always run init on an empty directory instead of using the default config:

$ mkdir empty
$ docker run --rm \
  --user="$UID" \
  -v "$PWD/empty:/tendermint" \
  "tendermint/tendermint:v0.33.5" \
  init
$ grep -E ^timeout_commit empty/config/config.toml
timeout_commit = "1s"

# Then run tendermint by mounting `$PWD/empty` into `/tendermint`

Yeah, that explains it - Tendermint's default config has 1000ms, but the Docker entrypoint will change that to 500ms if it doesn't find an existing config. However, since you generate your own config, that one's used instead.

Perfect, then this explains the different observations for 0.33

> Tendermint's default config has 1000ms

Oh, but then there is another issue with this. This default seems to have changed between 0.33 and 0.34. When you init in an empty directory, you get the problematic combo 500ms/1000ms:

$ mkdir empty
$ docker run --rm \
  --user="$UID" \
  -v "$PWD/empty:/tendermint" \
  "tendermint/tendermint:latest" \
  init
$ grep -E ^timeout_commit empty/config/config.toml
timeout_commit = "500ms"
$ jq .consensus_params.block.time_iota_ms < empty/config/genesis.json
"1000"

> Oh, but then there is another issue with this. This default seems to have changed between 0.33 and 0.34. When you init in an empty directory, you get the problematic combo 500ms/1000ms:

No, but in #5087 the config generation was moved from the Dockerfile (image build time) to an entrypoint (runtime). When your command above is executed, it first runs the entrypoint, which runs tendermint init itself and modifies the config, and _then_ runs _your_ tendermint init. tendermint init preserves files that already exist, so you're getting the config generated by the image.

I suppose this also interacts with the volume. Before, mounting the volume would hide the build-time-generated configuration, so your init would run without any existing files; but since the config is now generated at runtime, it runs with the volume already mounted, so your init sees the existing files.

That's a complicated flow for a default config, but everything you say makes perfect sense, @erikgrinaker. Thanks a lot. All questions answered then.

Yeah, it could definitely be simpler. I think it was moved to runtime because otherwise encryption keys, genesis times, and similar parameters would be embedded in the image and identical for all nodes, while we really want them to be generated on the first run.

Good find ❤️

q: why do we have 500ms in Docker anyway?
a: https://github.com/tendermint/tendermint/pull/4577#discussion_r393745244
