Tendermint: light: rpctest.Tendermint instance's clock drifts +7min

Created on 27 Feb 2020 · 34 comments · Source: tendermint/tendermint

Discovered in: https://github.com/tendermint/tendermint/pull/4487

I think it's because we're stripping the monotonic part (see types/time/time.go) and using only a single instance (so median time isn't really in play). But it would be good to confirm this.
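For reference, "stripping the monotonic part" in Go means calling Round(0) on a time.Time, which drops the monotonic clock reading. A minimal illustration of the idea (it mirrors what a canonical-time helper like the one in types/time/time.go does, but is not a copy of it):

package main

import (
	"fmt"
	"time"
)

// canonical returns t in UTC with the monotonic clock reading stripped.
// Per the time package docs, Round(0) is the documented way to drop it.
func canonical(t time.Time) time.Time {
	return t.Round(0).UTC()
}

func main() {
	t := time.Now()
	fmt.Println(t)            // includes an "m=+..." monotonic reading
	fmt.Println(canonical(t)) // wall-clock time only, in UTC
}

Without the monotonic reading, later comparisons fall back to the wall clock, which can jump (NTP adjustments, VM sleep/resume), so small drifts are not self-correcting.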

Labels: bug, test

All 34 comments

Any idea if this is related to https://github.com/informalsystems/tendermint-rs/issues/512 ? When running tendermint in a docker container on Mac it seems the node's clock drifts ahead weirdly which makes it a challenge to test the light client against it.

UPDATE: might just be a docker for mac issue: https://github.com/informalsystems/tendermint-rs/issues/512#issuecomment-690882655

It's likely not an issue with Docker for Mac:

In the error above, the header_time is parsed from the JSON returned over RPC by the Go implementation, while now is queried via the chrono crate, which may explain the difference in precision. I don't think that this really matters, as the times reported by

a) chrono on my machine
b) $ date on my machine
c) $ date within the container

all agree with each other. It's only the time seen by the Tendermint node that is drifting.

This doesn't sound good. It should be possible to confirm that by running a local node without Docker.

> This doesn't sound good. It should be possible to confirm that by running a local node without Docker.

A local node does not exhibit the same behavior AFAICT: https://github.com/informalsystems/tendermint-rs/issues/512#issuecomment-673416280

So it's not an issue with Docker on Mac, and it's not a local node issue. How's that possible? 😄

Maybe the combination of the two?

I'm having similar issues with Tendermint 0.34 in a Docker guest on macOS. Versions I am testing are

  • v0.33.5 from DockerHub
  • latest from DockerHub

with --proxy_app=kvstore.

The time of the latest block is more than one hour in the future. The following command shows how I see the problem:

  1. Time on my host
  2. Time inside the guest
  3. Time reported by Tendermint RPC (/status)
date && docker exec tendermint-34 date && (curl -sS http://localhost:11134/status | jq .result.sync_info)       
Thu Sep 17 18:03:40 CEST 2020
Thu Sep 17 16:03:41 UTC 2020
{
  "latest_block_hash": "CBD00A672A29CAC1DCC481026B880FCC1B59B5F1D1555AB35A30EF0C88BF7F67",
  "latest_app_hash": "A001000000000000",
  "latest_block_height": "10327",
  "latest_block_time": "2020-09-17T17:21:45.4951156Z",
  "earliest_block_hash": "C4A7DD2C1F5CEA6F13C2A42D506313A37782E4753EC1542741455414B50EEDAB",
  "earliest_app_hash": "",
  "earliest_block_height": "1",
  "earliest_block_time": "2020-09-17T14:29:38.7926075Z",
  "catching_up": false
}

As you can see, at 16:03 UTC I have a block from 17:21 UTC.

This problem does not happen with Tendermint 0.33.5 in the same environment:

$ date && docker exec tendermint-33 date && (curl -sS http://localhost:11133/status | jq .result.sync_info)
Thu Sep 17 17:43:29 CEST 2020
Thu Sep 17 15:43:30 UTC 2020
{
  "latest_block_hash": "52894231AB5545A115326715D9000B7D6FB60EE0D73E8A1F1132574154D48DB8",
  "latest_app_hash": "7800000000000000",
  "latest_block_height": "4209",
  "latest_block_time": "2020-09-17T15:43:28.8252145Z",
  "earliest_block_hash": "ADF18D97F9F23E0B7387CE7C2BECFCD643E6947BC21F44B521A35F03E9172BDC",
  "earliest_app_hash": "",
  "earliest_block_height": "1",
  "earliest_block_time": "2020-09-17T14:29:30.9058132Z",
  "catching_up": false
}

Thanks for the details! I haven't been able to reproduce this locally, but will let a Docker testnet run for a few hours and see.

Can you send the output of /block please? I'd like to have a look at the signature timestamps of the block. Also, is this with a single node or a network?

> Can you send the output of /block please? I'd like to have a look at the signature timestamps of the block.

Sure, this is block 10327 of the Tendermint 0.34 instance I mentioned before:

$ curl -sS "http://localhost:11134/block?height=10327" | jq .result
{
  "block_id": {
    "hash": "CBD00A672A29CAC1DCC481026B880FCC1B59B5F1D1555AB35A30EF0C88BF7F67",
    "parts": {
      "total": 1,
      "hash": "61A950C86F1B8433F93C4C0E7892CA66A8D04336D72248212C0220E235F87EF2"
    }
  },
  "block": {
    "header": {
      "version": {
        "block": "11",
        "app": "1"
      },
      "chain_id": "dockerchain",
      "height": "10327",
      "time": "2020-09-17T17:21:45.4951156Z",
      "last_block_id": {
        "hash": "436132FA314D04F56BB52B2EF09A31164977C194C488DEDCA83F8A0A4F98C5E9",
        "parts": {
          "total": 1,
          "hash": "528832A26FAC85D5AB8D7937756F65E1A3DC5D241B73C3A2C5FBCCA4399A4FFA"
        }
      },
      "last_commit_hash": "D6A17FFC16AC2BB4B1489931BD47B1CA044C089D3750932F25C12C1D1D0C4F42",
      "data_hash": "E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855",
      "validators_hash": "B2A6CC880A471433B7CDA48AA4ED0724A05BB6CF8C7705F3A37FEAEA1AA51821",
      "next_validators_hash": "B2A6CC880A471433B7CDA48AA4ED0724A05BB6CF8C7705F3A37FEAEA1AA51821",
      "consensus_hash": "048091BC7DDC283F77BFBF91D73C44DA58C3DF8A9CBC867405D8B7F3DAADA22F",
      "app_hash": "A001000000000000",
      "last_results_hash": "E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855",
      "evidence_hash": "E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855",
      "proposer_address": "72D0CAACF4564BE66DD44A697DB9A740090618AB"
    },
    "data": {
      "txs": []
    },
    "evidence": {
      "evidence": []
    },
    "last_commit": {
      "height": "10326",
      "round": 0,
      "block_id": {
        "hash": "436132FA314D04F56BB52B2EF09A31164977C194C488DEDCA83F8A0A4F98C5E9",
        "parts": {
          "total": 1,
          "hash": "528832A26FAC85D5AB8D7937756F65E1A3DC5D241B73C3A2C5FBCCA4399A4FFA"
        }
      },
      "signatures": [
        {
          "block_id_flag": 2,
          "validator_address": "72D0CAACF4564BE66DD44A697DB9A740090618AB",
          "timestamp": "2020-09-17T17:21:45.4951156Z",
          "signature": "yJTaxi25WbkdWOjKczdQrk3rT0bcnMPkB52BP+shgjcwT8o2R9KqqxL1j8WZu7cdxaJAWLAVaqGraT+BgqlPDQ=="
        }
      ]
    }
  }
}

> Also, is this with a single node or a network?

A single node.

Another observation I just made: the node has been running for 5 hours now, but my machine was asleep for some hours in between. When it woke up, the drift was somehow reset: first it was less than 2 minutes ahead, now it is already 9 minutes ahead.

Tendermint 0.33.8 from Docker Hub does not have the issue and behaves the same as 0.33.5.

I can reproduce the same behaviour (v0.33.8 OK, latest produces blocks in the future) using Docker on a Linux host.

Did something happen to the way we manage time in v0.34 that could explain this?

> Did something happen to the way we manage time in v0.34 that could explain this?

Nobody has touched the median time calculation or the types/time package, to my knowledge.

Was the Go compiler or standard library upgraded between the two versions?

> Was the Go compiler or standard library upgraded between the two versions?

yep

I can reproduce this when I use the image from Docker Hub, but not when running a Docker testnet via make localnet-start. Most likely an artifact of the environment, build, or configuration - will dig into it.

The Docker Hub image sets timeout_commit = "500ms" - when I instead build an image with the default value of 1s, the blocks have correct timestamps. Any thoughts on what may cause this?

For reference, the Docker image is configured here: https://github.com/tendermint/tendermint/blob/master/DOCKER/docker-entrypoint.sh

I also see the incorrect behavior with 0.33.5, 0.33.6 and 0.33.8 from Docker Hub.

It's caused by using the default time_iota_ms: 1000 setting along with timeout_commit = "500ms". Setting time_iota_ms to 500 as well fixes the drift. Opened https://github.com/tendermint/tendermint/pull/5385 for this.
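For intuition on why this pair of settings drifts: with a single validator, each new block's timestamp must be at least the previous block time plus time_iota_ms, but a block is committed roughly every timeout_commit. Header time therefore advances ~1s per block while the wall clock advances only ~0.5s, and the surplus accumulates. A rough simulation of that rule (a sketch, not Tendermint's actual proposal logic):

package main

import (
	"fmt"
	"time"
)

func main() {
	timeoutCommit := 500 * time.Millisecond // block interval in the Docker image config
	timeIota := 1000 * time.Millisecond     // default time_iota_ms

	wallClock := time.Now()
	blockTime := wallClock
	for height := 1; height <= 10327; height++ { // height from the /status output above
		wallClock = wallClock.Add(timeoutCommit) // real time advances ~500ms per block
		next := wallClock
		if minTime := blockTime.Add(timeIota); next.Before(minTime) {
			next = minTime // timestamp forced forward to prev block time + time_iota_ms
		}
		blockTime = next
	}
	fmt.Println("drift:", blockTime.Sub(wallClock)) // ≈ +1h26m at height 10327
}

That is in the same ballpark as the ~78 minutes of drift observed at height 10327 above (the real blocks took slightly longer than 500ms each).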

Wow nice find! Did something about this change since v0.33?

No, the 0.33 images have the same problem. It's possible some users have old versions of the images cached locally or something.

@melekes As for the original problem with rpctest, it's likely to be a similar cause. config.TestConfig() uses a commit timeout of 10ms, but ConsensusParams.TimeIotaMs needs to be set to something similar as it defaults to 1s.
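A hedged sketch of that alignment (config.TestConfig and the TimeIotaMs consensus parameter exist in Tendermint around 0.33/0.34; the exact wiring inside rpctest may differ):

package main

import (
	"fmt"

	"github.com/tendermint/tendermint/config"
	"github.com/tendermint/tendermint/types"
)

func main() {
	cfg := config.TestConfig() // uses a 10ms commit timeout
	params := types.DefaultConsensusParams()
	// TimeIotaMs defaults to 1000ms; if it exceeds timeout_commit,
	// block timestamps are forced ahead of the wall clock.
	params.Block.TimeIotaMs = cfg.Consensus.TimeoutCommit.Milliseconds()
	fmt.Println("time_iota_ms:", params.Block.TimeIotaMs)
}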

Cool, great find!

> No, the 0.33 images have the same problem. It's possible some users have old versions of the images cached locally or something.

Are you sure? Those are the config values I see:

0.33.5: timeout_commit = "1s" (image downloaded long ago)
0.33.6: timeout_commit = "1s" (image downloaded just now)
0.33.7: timeout_commit = "1s" (image downloaded just now)
0.33.8: timeout_commit = "1s" (image downloaded in the last week)
latest: timeout_commit = "500ms" (image downloaded today)

Looks like this problem occurred before: https://github.com/tendermint/tendermint/issues/3723

> Are you sure? Those are the config values I see

Yes:

$ docker pull tendermint/tendermint:v0.33.5
v0.33.5: Pulling from tendermint/tendermint
Digest: sha256:e81ef363944ef272408f0a2828b4fd5511eca7f0e53afd9b07a4f615e95f71da
Status: Image is up to date for tendermint/tendermint:v0.33.5
docker.io/tendermint/tendermint:v0.33.5

$ docker run --entrypoint '' tendermint/tendermint:v0.33.5 grep -E ^timeout_commit config/config.toml
timeout_commit = "500ms"

Are you using a different image?

Eh wait, that command shouldn't even have worked since the config is generated by the image entrypoint at runtime, and I'm explicitly disabling the entrypoint here. The image specifies a volume in the Dockerfile, and I think an implicit volume is set up unless one is provided when running the command - this stateful volume ends up being reused:

$ docker run --entrypoint '' -v $(PWD)/tendermint:/tendermint tendermint/tendermint:v0.33.5 grep -E ^timeout_commit config/config.toml
grep: config/config.toml: No such file or directory

This seems like a real footgun. Is there a reason why we're using an implicit volume in the Dockerfile rather than having it be explicit and opt-in?

Ah, found the difference: In my setup I always run init on an empty directory instead of using the default config:

$ mkdir empty
$ docker run --rm \
  --user="$UID" \
  -v "$PWD/empty:/tendermint" \
  "tendermint/tendermint:v0.33.5" \
  init
$ grep -E ^timeout_commit empty/config/config.toml
timeout_commit = "1s"

# Then run tendermint by mounting `$PWD/empty` into `/tendermint`

Yeah, that explains it - Tendermint's default config has 1000ms, but the Docker entrypoint will change that to 500ms if it doesn't find an existing config. However, since you generate your own config, that one's used instead.

Perfect, then this explains the different observations for 0.33

> Tendermint's default config has 1000ms

Oh, but then there is another issue with this. This default seems to have changed between 0.33 and 0.34. When you init in an empty directory, you get the problematic combo 500ms/1000ms:

$ mkdir empty
$ docker run --rm \
  --user="$UID" \
  -v "$PWD/empty:/tendermint" \
  "tendermint/tendermint:latest" \
  init
$ grep -E ^timeout_commit empty/config/config.toml
timeout_commit = "500ms"
$ jq .consensus_params.block.time_iota_ms < empty/config/genesis.json
"1000"

> Oh, but then there is another issue with this. This default seems to have changed between 0.33 and 0.34. When you init in an empty directory, you get the problematic combo 500ms/1000ms:

No, but in #5087 the config generation was moved from the Dockerfile (image build time) to an entrypoint (runtime). When your command above is executed, it first runs the entrypoint, which runs tendermint init itself and modifies the config, and _then_ runs _your_ tendermint init. tendermint init preserves files that already exist, so you're getting the config generated by the image.

I suppose this also interacts with the volume. Before, mounting the volume would hide the build-time-generated configuration, so your init would run without any existing files; but since the config is now generated at runtime, it runs with the volume already mounted, so your init sees the existing files.

That's a complicated flow for a default config, but everything you say makes perfect sense, @erikgrinaker. Thanks a lot. All questions answered then.

Yeah, it could definitely be simpler. I think it was moved to runtime because otherwise encryption keys, genesis times, and similar parameters would be embedded in the image and identical for all nodes, while we really want them to be generated on the first run.

Good find ❤️

q: why do we have 500ms in Docker anyway?
a: https://github.com/tendermint/tendermint/pull/4577#discussion_r393745244
