Thanos: Store Gateway: S3 very slow when using IAM Roles for Service Accounts on EKS

Created on 5 Feb 2020 · 13 comments · Source: thanos-io/thanos

@bwplotka

Following on from: https://github.com/thanos-io/thanos/issues/1494

Thanos Docker image used: master-2020-02-04-7c02430c

Object Storage Provider: S3

What happened:
When using the service account IAM role, store gateway response times are slow.

What you expected to happen:
Expected response times similar to those with access keys provided via env vars.

Full logs to relevant components:

Logs

# Using service account:
thanos-store-0-5dcf4b6946-42fxg thanos-store level=debug ts=2020-02-04T22:37:40.815241796Z caller=bucket.go:913 msg="stats query processed" stats="&{blocksQueried:6 postingsTouched:12 postingsTouchedSizeSum:27344 postingsToFetch:0 postingsFetched:12 postingsFetchedSizeSum:27344 postingsFetchCount:12 postingsFetchDurationSum:340447630158 seriesTouched:64 seriesTouchedSizeSum:4256 seriesFetched:64 seriesFetchedSizeSum:405920 seriesFetchCount:6 seriesFetchDurationSum:98626502318 chunksTouched:64 chunksTouchedSizeSum:21411 chunksFetched:64 chunksFetchedSizeSum:138459 chunksFetchCount:6 chunksFetchDurationSum:111593524051 getAllDuration:65278873883 mergedSeriesCount:32 mergedChunksCount:64 mergeDuration:202935}" err=null
# Using access keys:
thanos-store-0-5dcf4b6946-wljrv thanos-store level=debug ts=2020-02-04T23:55:32.675050158Z caller=bucket.go:913 msg="stats query processed" stats="&{blocksQueried:9 postingsTouched:9 postingsTouchedSizeSum:21644 postingsToFetch:0 postingsFetched:9 postingsFetchedSizeSum:21644 postingsFetchCount:9 postingsFetchDurationSum:1495567555 seriesTouched:5402 seriesTouchedSizeSum:403248 seriesFetched:5402 seriesFetchedSizeSum:1067920 seriesFetchCount:9 seriesFetchDurationSum:701591264 chunksTouched:5398 chunksTouchedSizeSum:1537799 chunksFetched:5398 chunksFetchedSizeSum:1716993 chunksFetchCount:9 chunksFetchDurationSum:641095967 getAllDuration:352031207 mergedSeriesCount:2018 mergedChunksCount:5398 mergeDuration:23976320}" err=null
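Reading the duration sums as nanoseconds makes the gap explicit: with the service account, postingsFetchDurationSum ≈ 340 s and getAllDuration ≈ 65 s; with access keys they are ≈ 1.5 s and ≈ 0.35 s, even though the access-key query touched far more series (5402 vs 64) and chunks (5398 vs 64).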

Labels: store, help wanted

Most helpful comment

_This issue is still valid and looks quite critical for production setups on AWS. We should fix it._

All 13 comments

I think the access mode is unrelated. Those are literally two totally different queries, as the numbers of postings/chunks fetched differ.

Firstly, thanks for the prompt reply. I have captured logs at startup, which show a clear difference. The first block is with access keys set via env vars; the second is exactly the same deployment but with the env vars removed, falling back to IAM Roles for Service Accounts.

Access keys set on env:

thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:16.71367574Z caller=main.go:152 msg="Tracing will be disabled"
thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:16.713798158Z caller=factory.go:46 msg="loading bucket configuration"
thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:16.714146391Z caller=inmemory.go:170 msg="created in-memory index cache" maxItemSizeBytes=131072000 maxSizeBytes=262144000 maxItems=math.MaxInt64
thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:16.71431954Z caller=store.go:232 msg="index-header instead of index-cache.json enabled"
thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:16.714465814Z caller=options.go:23 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:16.714734556Z caller=store.go:306 msg="starting store node"
thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:16.714908233Z caller=intrumentation.go:64 msg="changing probe status" status=healthy
thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:16.714930825Z caller=http.go:56 service=http/server component=store msg="listening for requests and metrics" address=0.0.0.0:10902
thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:16.715097855Z caller=store.go:261 msg="initializing bucket store"
thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:16.967332253Z caller=fetcher.go:364 component=block.MetaFetcher msg="successfully fetched block metadata" duration=252.199915ms cached=23 returned=23 partial=0
thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:17.634427891Z caller=store.go:267 msg="bucket store ready" init_duration=919.298284ms
thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:17.634768764Z caller=intrumentation.go:52 msg="changing probe status" status=ready
thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:17.63486302Z caller=grpc.go:105 service=gRPC/server component=store msg="listening for StoreAPI gRPC" address=0.0.0.0:10901
thanos-store-0-7bc784969-6wmkv thanos-store level=info ts=2020-02-05T22:47:17.723047383Z caller=fetcher.go:364 component=block.MetaFetcher msg="successfully fetched block metadata" duration=88.569239ms cached=23 returned=23 partial=0

IAM roles for service accounts:

thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:09.955157672Z caller=main.go:152 msg="Tracing will be disabled"
thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:09.95530467Z caller=factory.go:46 msg="loading bucket configuration"
thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:09.955629396Z caller=inmemory.go:170 msg="created in-memory index cache" maxItemSizeBytes=131072000 maxSizeBytes=262144000 maxItems=math.MaxInt64
thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:09.955792707Z caller=store.go:232 msg="index-header instead of index-cache.json enabled"
thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:09.955932457Z caller=options.go:23 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:09.956217016Z caller=store.go:306 msg="starting store node"
thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:09.956424525Z caller=intrumentation.go:64 msg="changing probe status" status=healthy
thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:09.956468535Z caller=http.go:56 service=http/server component=store msg="listening for requests and metrics" address=0.0.0.0:10902
thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:09.95662651Z caller=store.go:261 msg="initializing bucket store"
thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:22.012691239Z caller=fetcher.go:364 component=block.MetaFetcher msg="successfully fetched block metadata" duration=12.056020621s cached=23 returned=23 partial=0
thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:50.884455974Z caller=store.go:267 msg="bucket store ready" init_duration=40.927788181s
thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:50.884835132Z caller=intrumentation.go:52 msg="changing probe status" status=ready
thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:50.884913591Z caller=grpc.go:105 service=gRPC/server component=store msg="listening for StoreAPI gRPC" address=0.0.0.0:10901
thanos-store-0-c6d6968bf-zbsbq thanos-store level=info ts=2020-02-05T22:49:55.964910877Z caller=fetcher.go:364 component=block.MetaFetcher msg="successfully fetched block metadata" duration=5.080413443s cached=23 returned=23 partial=0
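Both runs are against the same bucket with the same 23 cached blocks; only the credential source differs, yet the block metadata fetch goes from ~252 ms to ~12 s and bucket store init from ~0.9 s to ~41 s.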

I have observed exactly the same behaviour:

_Access keys_ startup log:

level=info ts=2020-02-17T11:37:43.63859407Z caller=http.go:56 service=http/server component=store msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2020-02-17T11:37:43.988328185Z caller=fetcher.go:368 component=block.MetaFetcher msg="successfully fetched block metadata" duration=349.725951ms cached=132 returned=132 partial=0
level=info ts=2020-02-17T11:37:46.45878283Z caller=store.go:268 msg="bucket store ready" init_duration=2.820185702s
level=info ts=2020-02-17T11:37:46.459126881Z caller=intrumentation.go:52 msg="changing probe status" status=ready
level=info ts=2020-02-17T11:37:46.459185787Z caller=grpc.go:105 service=gRPC/server component=store msg="listening for StoreAPI gRPC" address=0.0.0.0:10901
level=info ts=2020-02-17T11:37:46.711520906Z caller=fetcher.go:368 component=block.MetaFetcher msg="successfully fetched block metadata" duration=252.696663ms cached=132 returned=132 partial=0

_IAM Roles for Service Accounts_ startup log:

level=info ts=2020-02-17T11:32:49.434800692Z caller=store.go:262 msg="initializing bucket store"
level=info ts=2020-02-17T11:32:49.434820401Z caller=intrumentation.go:64 msg="changing probe status" status=healthy
level=info ts=2020-02-17T11:32:49.434860109Z caller=http.go:56 service=http/server component=store msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2020-02-17T11:33:11.005699772Z caller=fetcher.go:368 component=block.MetaFetcher msg="successfully fetched block metadata" duration=21.570840469s cached=132 returned=132 partial=0
level=info ts=2020-02-17T11:34:14.487258483Z caller=store.go:268 msg="bucket store ready" init_duration=1m25.052403205s
level=info ts=2020-02-17T11:34:14.487430951Z caller=intrumentation.go:52 msg="changing probe status" status=ready
level=info ts=2020-02-17T11:34:14.487500796Z caller=grpc.go:105 service=gRPC/server component=store msg="listening for StoreAPI gRPC" address=0.0.0.0:10901
level=info ts=2020-02-17T11:34:24.753085112Z caller=fetcher.go:368 component=block.MetaFetcher msg="successfully fetched block metadata" duration=10.265772769s cached=132 returned=132 partial=0

Tried with version: master-2020-02-14-66921a3d

Also, queries are super slow with _IAM Roles for Service Accounts_.

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

Moved Thanos to EKS but holding off on moving to IAM roles.

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

Closing for now as promised, let us know if you need this to be reopened!

I ran into the same issue and tracked it down to the STS credentials not being cached by minio.
https://github.com/minio/minio-go/issues/1321

minio 7.0.0 should fix this issue (it includes the above patch for caching STS Web Identity Credentials).

7.0.0 is a major API version change, so I'm not sure how much work it is to upgrade.
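For anyone wondering what the fix amounts to: before that patch, the web-identity provider re-ran the STS exchange on effectively every credential lookup, i.e. every S3 request. Below is a minimal sketch of the caching idea, written against the `credentials.Provider` interface as it looks in minio-go v7.0.0; this is an illustration, not the actual upstream patch, and the 50-minute refresh window is an assumption.

```go
package stscache

import (
	"sync"
	"time"

	"github.com/minio/minio-go/v7/pkg/credentials"
)

// cachingProvider wraps any minio-go credentials.Provider and reuses the
// last credentials.Value until it is due to expire, so the expensive
// sts:AssumeRoleWithWebIdentity round trip happens once per refresh window
// instead of once per S3 request.
type cachingProvider struct {
	mu      sync.Mutex
	inner   credentials.Provider
	cached  credentials.Value
	expires time.Time
}

func (c *cachingProvider) Retrieve() (credentials.Value, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cached.AccessKeyID != "" && time.Now().Before(c.expires) {
		return c.cached, nil // cache hit: no network call
	}
	v, err := c.inner.Retrieve() // slow path: token file read + STS exchange
	if err != nil {
		return credentials.Value{}, err
	}
	c.cached = v
	// Assumption: IRSA tokens typically last about an hour; refresh early.
	c.expires = time.Now().Add(50 * time.Minute)
	return v, nil
}

func (c *cachingProvider) IsExpired() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.cached.AccessKeyID == "" || time.Now().After(c.expires)
}
```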

_This issue is still valid and looks quite critical for production setups on AWS. We should fix it._

Let's update minio - that could be our first step (:

We also ran into this issue, and after upgrading minio and adjusting the calling code accordingly it looks better (we've been running it for a week now without any major issues); a rough sketch of the construction change is below.
Happy to submit a PR @bwplotka, though not sure if it's 100% correct and I'm not at all familiar with the codebase 😅
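For reference, the part of the v6 → v7 upgrade that affected us most was client construction; a rough sketch (the v6 signature in the comment is from memory, so treat it as an assumption):

```go
package main

import (
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// v6 style was roughly:
	//   client, err := minio.NewWithCredentials(endpoint, creds, secure, region)
	// v7 moves everything into an Options struct and threads context.Context
	// through the object operations.
	client, err := minio.New("s3.eu-west-1.amazonaws.com", &minio.Options{
		// NewIAM falls back to instance-profile / web-identity (IRSA) credentials.
		Creds:  credentials.NewIAM(""),
		Secure: true,
		Region: "eu-west-1",
	})
	if err != nil {
		log.Fatal(err)
	}
	_ = client
}
```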

> Happy to submit a PR @bwplotka, though not sure if it's 100% correct and I'm not at all familiar with the codebase 😅

Yes, please! Submit a PR. I will be very happy to review it and help you get it merged in case there's anything to change / improve.
