Describe the bug
I'm unable to read logs from Loki older than 1h. In grafana interface or using logcli I can see only logs no older than 1h. If searching for older - get empty response.
I've tried this scenario with few configuration types:
- chunks in azure SA; - index in Cassandra( scylladb)
- chunks and index in azure SA
- chunks in local filesystem; index in Cassandra( scylladb)
Every one of those scenarios acts the same.
To Reproduce
v1.5.0v1.5.0{job="rc/api-service"}Expected behavior
Ability to search logs from the past, ie from yesterday 1:00 PM - 1:30 PM.
Environment:
- Infrastructure:
* Kubernetes,
* promtail as daemonset
* loki as separated roles - ingester( 3 instances), distributor( 3 instances), table manager( 1 instance), querier( 3 instances)
* grafana: v7.0.0
* consul
* azure storage account to store chunks
* scylladb to store index - scylla has the same API as cassandra
scylla --version
4.0.0-0.20200505.d95aa77b62
* promtail and Loki deployed from helm charts
Screenshots, Promtail config, or terminal output
Promtail config:
apiVersion: v1
kind: ConfigMap
metadata:
name: promtail
labels:
app: promtail
data:
promtail.yaml: |
client:
backoff_config:
max_period: 5s
max_retries: 20
min_period: 100ms
batchsize: 102400
batchwait: 1s
external_labels: {}
timeout: 10s
positions:
filename: /run/promtail/positions.yaml
server:
http_listen_port: 3101
target_config:
sync_period: 10s
scrape_configs:
- job_name: kubernetes-pods-name
pipeline_stages:
- docker: {}
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels:
- __meta_kubernetes_pod_label_name
target_label: __service__
- source_labels:
- __meta_kubernetes_pod_node_name
target_label: __host__
- action: drop
regex: ''
source_labels:
- __service__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
replacement: $1
separator: /
source_labels:
- __meta_kubernetes_namespace
- __service__
target_label: job
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: pod
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
target_label: container
- replacement: /var/log/pods/*$1/*.log
separator: /
source_labels:
- __meta_kubernetes_pod_uid
- __meta_kubernetes_pod_container_name
target_label: __path__
- job_name: kubernetes-pods-app
pipeline_stages:
- docker: {}
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: drop
regex: .+
source_labels:
- __meta_kubernetes_pod_label_name
- source_labels:
- __meta_kubernetes_pod_label_app
target_label: __service__
- source_labels:
- __meta_kubernetes_pod_node_name
target_label: __host__
- action: drop
regex: ''
source_labels:
- __service__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
replacement: $1
separator: /
source_labels:
- __meta_kubernetes_namespace
- __service__
target_label: job
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: pod
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
target_label: container
- replacement: /var/log/pods/*$1/*.log
separator: /
source_labels:
- __meta_kubernetes_pod_uid
- __meta_kubernetes_pod_container_name
target_label: __path__
- job_name: kubernetes-pods-direct-controllers
pipeline_stages:
- docker: {}
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: drop
regex: .+
separator: ''
source_labels:
- __meta_kubernetes_pod_label_name
- __meta_kubernetes_pod_label_app
- action: drop
regex: '[0-9a-z-.]+-[0-9a-f]{8,10}'
source_labels:
- __meta_kubernetes_pod_controller_name
- source_labels:
- __meta_kubernetes_pod_controller_name
target_label: __service__
- source_labels:
- __meta_kubernetes_pod_node_name
target_label: __host__
- action: drop
regex: ''
source_labels:
- __service__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
replacement: $1
separator: /
source_labels:
- __meta_kubernetes_namespace
- __service__
target_label: job
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: pod
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
target_label: container
- replacement: /var/log/pods/*$1/*.log
separator: /
source_labels:
- __meta_kubernetes_pod_uid
- __meta_kubernetes_pod_container_name
target_label: __path__
- job_name: kubernetes-pods-indirect-controller
pipeline_stages:
- docker: {}
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: drop
regex: .+
separator: ''
source_labels:
- __meta_kubernetes_pod_label_name
- __meta_kubernetes_pod_label_app
- action: keep
regex: '[0-9a-z-.]+-[0-9a-f]{8,10}'
source_labels:
- __meta_kubernetes_pod_controller_name
- action: replace
regex: '([0-9a-z-.]+)-[0-9a-f]{8,10}'
source_labels:
- __meta_kubernetes_pod_controller_name
target_label: __service__
- source_labels:
- __meta_kubernetes_pod_node_name
target_label: __host__
- action: drop
regex: ''
source_labels:
- __service__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
replacement: $1
separator: /
source_labels:
- __meta_kubernetes_namespace
- __service__
target_label: job
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: pod
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
target_label: container
- replacement: /var/log/pods/*$1/*.log
separator: /
source_labels:
- __meta_kubernetes_pod_uid
- __meta_kubernetes_pod_container_name
target_label: __path__
- job_name: kubernetes-pods-static
pipeline_stages:
- docker: {}
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: drop
regex: ''
source_labels:
- __meta_kubernetes_pod_annotation_kubernetes_io_config_mirror
- action: replace
source_labels:
- __meta_kubernetes_pod_label_component
target_label: __service__
- source_labels:
- __meta_kubernetes_pod_node_name
target_label: __host__
- action: drop
regex: ''
source_labels:
- __service__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
replacement: $1
separator: /
source_labels:
- __meta_kubernetes_namespace
- __service__
target_label: job
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: pod
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
target_label: container
- replacement: /var/log/pods/*$1/*.log
separator: /
source_labels:
- __meta_kubernetes_pod_annotation_kubernetes_io_config_mirror
- __meta_kubernetes_pod_container_name
target_label: __path__
Loki config:
apiVersion: v1
kind: ConfigMap
metadata:
name: loki-config
labels:
app: loki-config
data:
loki.yaml: |-
auth_enabled: false
server:
http_listen_port: 3100
register_instrumentation: true
graceful_shutdown_timeout: 10s
distributor:
ring:
kvstore:
store: "consul"
consul:
host: "consul-loki-server:8500"
consistent_reads: true
ingester:
lifecycler:
ring:
kvstore:
store: "consul"
consul:
host: "consul-loki-server:8500"
http_client_timeout: 30s
replication_factor: 2
num_tokens: 512
heartbeat_period: 15s
join_after: 30s
querier:
query_timeout: 2m
tail_max_duration: 12h
engine:
timeout: 30m
max_look_back_period: 360h
table_manager:
retention_deletes_enabled: true
retention_period: 360h
poll_interval: 2m
ingester_client:
pool_config:
health_check_ingesters: true
client_cleanup_period: 30s
remote_timeout: 120s
storage_config:
index_cache_validity: 30m
max_chunk_batch_size: 10
cassandra:
addresses: xxxx
auth: false
keyspace: index_logs_s
port: 9042
SSL: false
azure:
container_name: xxxx
account_name: xxxx
account_key: xxxx
schema_config:
configs:
- from: "2020-05-24"
index:
period: 120h
prefix: index_
chunks:
period: 120h
prefix: chunk_
store: cassandra
object_store: azure
schema: v11
chunk_store_config:
max_look_back_period: 360h
limits_config:
ingestion_rate_strategy: global
ingestion_rate_mb: 16
ingestion_burst_size_mb: 32
max_label_value_length: 4096
max_query_length: 0
max_chunks_per_query: 200000
cardinality_limit: 200000
max_streams_matchers_per_query: 2000
max_entries_limit_per_query: 5000
Response body from the query:
"stats": {
"summary": {
"bytesProcessedPerSecond": 0,
"linesProcessedPerSecond": 0,
"totalBytesProcessed": 0,
"totalLinesProcessed": 0,
"execTime": 0.08042639
},
"store": {
"totalChunksRef": 8,
"totalChunksDownloaded": 4,
"chunksDownloadTime": 0.065408829,
"headChunkBytes": 0,
"headChunkLines": 0,
"decompressedBytes": 0,
"decompressedLines": 0,
"compressedBytes": 0,
"totalDuplicates": 0
},
"ingester": {
"totalReached": 3,
"totalChunksMatched": 4,
"totalBatches": 0,
"totalLinesSent": 0,
"headChunkBytes": 0,
"headChunkLines": 0,
"decompressedBytes": 0,
"decompressedLines": 0,
"compressedBytes": 0,
"totalDuplicates": 0
}
}
I've found some suspicious log in loki-ingester instances:
+ loki-ingester-557fdb9c66-s4rkc › loki-ingester
loki-ingester-557fdb9c66-s4rkc loki-ingester level=info ts=2020-05-29T09:29:42.634873244Z caller=server.go:179 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
loki-ingester-557fdb9c66-s4rkc loki-ingester level=warn ts=2020-05-29T09:29:42.635036744Z caller=experimental.go:19 msg="experimental feature in use" feature="Cassandra Backend"
loki-ingester-557fdb9c66-s4rkc loki-ingester level=warn ts=2020-05-29T09:29:42.662352491Z caller=experimental.go:19 msg="experimental feature in use" feature="Azure Blob Storage"
loki-ingester-557fdb9c66-s4rkc loki-ingester level=info ts=2020-05-29T09:29:42.66280409Z caller=main.go:79 msg="Starting Loki" version="(version=1.5.0, branch=HEAD, revision=12c7eab8b)"
loki-ingester-557fdb9c66-s4rkc loki-ingester level=info ts=2020-05-29T09:29:42.66296119Z caller=module_service.go:58 msg=initialising module=store
loki-ingester-557fdb9c66-s4rkc loki-ingester level=info ts=2020-05-29T09:29:42.663457389Z caller=module_service.go:58 msg=initialising module=runtime-config
loki-ingester-557fdb9c66-s4rkc loki-ingester level=info ts=2020-05-29T09:29:42.663452189Z caller=module_service.go:58 msg=initialising module=memberlist-kv
loki-ingester-557fdb9c66-s4rkc loki-ingester level=info ts=2020-05-29T09:29:42.663481789Z caller=manager.go:109 msg="runtime config disabled: file not specified"
loki-ingester-557fdb9c66-s4rkc loki-ingester level=info ts=2020-05-29T09:29:42.663538489Z caller=module_service.go:58 msg=initialising module=ingester
loki-ingester-557fdb9c66-s4rkc loki-ingester level=info ts=2020-05-29T09:29:42.663615689Z caller=lifecycler.go:485 msg="not loading tokens from file, tokens file path is empty"
loki-ingester-557fdb9c66-s4rkc loki-ingester level=info ts=2020-05-29T09:29:42.664348387Z caller=loki.go:228 msg="Loki started"
loki-ingester-557fdb9c66-s4rkc loki-ingester level=info ts=2020-05-29T09:29:42.669340278Z caller=lifecycler.go:509 msg="instance not found in ring, adding with no tokens" ring=ingester
loki-ingester-557fdb9c66-s4rkc loki-ingester level=info ts=2020-05-29T09:30:12.719286177Z caller=lifecycler.go:361 msg="auto-joining cluster after timeout" ring=ingester
loki-ingester-557fdb9c66-s4rkc loki-ingester level=error ts=2020-05-29T09:30:42.830200357Z caller=flush.go:198 org_id=fake msg="failed to flush user" err="-> github.com/Azure/azure-storage-blob-go/azblob.newStorageError, /src/loki/vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42\n===== RESPONSE ERROR (ServiceCode=InvalidBlockList) =====\nDescription=The specified block list is invalid.\nRequestId:697b4f86-301e-0039-3d9b-35106a000000\nTime:2020-05-29T09:30:42.8278021Z, Details: \n Code: InvalidBlockList\n PUT https://xxxxx.blob.core.windows.net/xxxxx/fake/11a35aa0e520d8e4-1725fa374d3-1725fa7194e-d678575b?comp=blocklist&timeout=10\n Authorization: REDACTED\n Content-Length: [233]\n Content-Type: [application/xml]\n User-Agent: [Azure-Storage/0.7 (go1.13.11; linux)]\n X-Ms-Blob-Cache-Control: []\n X-Ms-Blob-Content-Disposition: []\n X-Ms-Blob-Content-Encoding: []\n X-Ms-Blob-Content-Language: []\n X-Ms-Blob-Content-Type: []\n X-Ms-Client-Request-Id: [0ce7b614-4c1c-4902-5c8b-ad61bc50bbd0]\n X-Ms-Date: [Fri, 29 May 2020 09:30:42 GMT]\n X-Ms-Version: [2018-11-09]\n --------------------------------------------------------------------------------\n RESPONSE Status: 400 The specified block list is invalid.\n Content-Length: [221]\n Content-Type: [application/xml]\n Date: [Fri, 29 May 2020 09:30:42 GMT]\n Server: [Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0]\n X-Ms-Error-Code: [InvalidBlockList]\n X-Ms-Request-Id: [697b4f86-301e-0039-3d9b-35106a000000]\n X-Ms-Version: [2018-11-09]\n\n\n"
loki-ingester-557fdb9c66-fskjh loki-ingester level=info ts=2020-05-29T09:29:42.450816313Z caller=server.go:179 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
loki-ingester-557fdb9c66-fskjh loki-ingester level=warn ts=2020-05-29T09:29:42.451034815Z caller=experimental.go:19 msg="experimental feature in use" feature="Cassandra Backend"
loki-ingester-557fdb9c66-fskjh loki-ingester level=warn ts=2020-05-29T09:29:42.479427921Z caller=experimental.go:19 msg="experimental feature in use" feature="Azure Blob Storage"
loki-ingester-557fdb9c66-fskjh loki-ingester level=info ts=2020-05-29T09:29:42.479704724Z caller=main.go:79 msg="Starting Loki" version="(version=1.5.0, branch=HEAD, revision=12c7eab8b)"
loki-ingester-557fdb9c66-fskjh loki-ingester level=info ts=2020-05-29T09:29:42.479833125Z caller=module_service.go:58 msg=initialising module=runtime-config
loki-ingester-557fdb9c66-fskjh loki-ingester level=info ts=2020-05-29T09:29:42.479849626Z caller=module_service.go:58 msg=initialising module=store
loki-ingester-557fdb9c66-fskjh loki-ingester level=info ts=2020-05-29T09:29:42.479865326Z caller=manager.go:109 msg="runtime config disabled: file not specified"
loki-ingester-557fdb9c66-fskjh loki-ingester level=info ts=2020-05-29T09:29:42.479890326Z caller=module_service.go:58 msg=initialising module=memberlist-kv
loki-ingester-557fdb9c66-fskjh loki-ingester level=info ts=2020-05-29T09:29:42.479952727Z caller=module_service.go:58 msg=initialising module=ingester
loki-ingester-557fdb9c66-fskjh loki-ingester level=info ts=2020-05-29T09:29:42.48117764Z caller=lifecycler.go:485 msg="not loading tokens from file, tokens file path is empty"
loki-ingester-557fdb9c66-fskjh loki-ingester level=info ts=2020-05-29T09:29:42.481242541Z caller=loki.go:228 msg="Loki started"
loki-ingester-557fdb9c66-fskjh loki-ingester level=info ts=2020-05-29T09:29:42.48487268Z caller=lifecycler.go:509 msg="instance not found in ring, adding with no tokens" ring=ingester
loki-ingester-557fdb9c66-fskjh loki-ingester level=info ts=2020-05-29T09:30:12.52683684Z caller=lifecycler.go:361 msg="auto-joining cluster after timeout" ring=ingester
loki-ingester-557fdb9c66-fskjh loki-ingester level=warn ts=2020-05-29T09:48:53.50285832Z caller=grpc_logging.go:55 method=/logproto.Querier/Query duration=45.236991ms err="rpc error: code = Internal desc = transport: transport: the stream is done or WriteHeader was already called" msg="gRPC\n"
Screens:


Thanks for the issue; let's see if we can get this sorted. First off, one hour is the default max duration that the ingester will buffer logs before flushing it to backend storage.
I also see the error
loki-ingester-557fdb9c66-s4rkc loki-ingester
level=error
ts=2020-05-29T09:30:42.830200357Z
caller=flush.go:198 org_id=fake
msg="failed to flush user"
err="-> github.com/Azure/azure-storage-blob-go/azblob.newStorageError, /src/loki/vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42
===== RESPONSE ERROR (ServiceCode=InvalidBlockList) =====
Description=The specified block list is invalid.
RequestId:697b4f86-301e-0039-3d9b-35106a000000
Time:2020-05-29T09:30:42.8278021Z,
Details:
Code: InvalidBlockList
So it looks like we're seeing issues flushing chunks to azure storage. This is why you're only able to see the last 1h of logs -- we're failing to persist them to azure.
I'm unfamiliar with this client and it's a recent addition upstreamed from thanos -> cortex -> loki. This appears to be a bug.
/cc @cyriltovena @slim-bean
It's hard for me tell you exactly what's wrong, but I can tell you that I run a production cluster for Grafana Labs in Azure (we have a marketplace offering there) using (cosmodb/storage blob) without any issue.
chunks:
period: 120h
prefix: chunk_
You should remove this section from your schema config.
Loki does not use the index client to store chunks, this is only valid for Cortex. This may be the cause of your issues.
If that fixes the issue then this is a duplicate of https://github.com/grafana/loki/issues/2140
@slim-bean @cyriltovena thanks, I removed the section you provided but without any results - loki still acts the same, with the same error log
loki-ingester-557fdb9c66-mdhxx loki-ingester level=error ts=2020-06-09T03:23:44.611435072Z caller=flush.go:198 org_id=fake msg="failed to flush user" err="-> github.com/Azure/azure-storage-blob-go/azblob.newStorageError, /src/loki/vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42\n===== RESPONSE ERROR (ServiceCode=InvalidBlockList) =====\nDescription=The specified block list is invalid.\nRequestId:31fcd0e0-501e-00aa-760d-3e8621000000\nTime:2020-06-09T03:23:44.6125031Z, Details: \n Code: InvalidBlockList\n PUT https://lokilogsstage.blob.core.windows.net/xxxxx/fake/436f8e09bfe189fa-17296f46ecd-172971ab9c6-5d923f98?comp=blocklist&timeout=10\n Authorization: REDACTED\n Content-Length: [233]\n Content-Type: [application/xml]\n User-Agent: [Azure-Storage/0.7 (go1.13.11; linux)]\n X-Ms-Blob-Cache-Control: []\n X-Ms-Blob-Content-Disposition: []\n X-Ms-Blob-Content-Encoding: []\n X-Ms-Blob-Content-Language: []\n X-Ms-Blob-Content-Type: []\n X-Ms-Client-Request-Id: [3852472d-f1b8-4e9d-4601-355925c73270]\n X-Ms-Date: [Tue, 09 Jun 2020 03:23:44 GMT]\n X-Ms-Version: [2018-11-09]\n --------------------------------------------------------------------------------\n RESPONSE Status: 400 The specified block list is invalid.\n Content-Length: [221]\n Content-Type: [application/xml]\n Date: [Tue, 09 Jun 2020 03:23:43 GMT]\n Server: [Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0]\n X-Ms-Error-Code: [InvalidBlockList]\n X-Ms-Request-Id: [31fcd0e0-501e-00aa-760d-3e8621000000]\n X-Ms-Version: [2018-11-09]\n\n\n"
maybe the issue is related with too many ingesters trying to do the same push request to azure storage?
@cyriltovena
If that fixes the issue then this is a duplicate of #2140
That didn't fix the issue in #2140. Issue is still not fixed.
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
This still happens with 1.5.
Reading a bit more something is up with Azure SDK, I’ll dig a bit more
@khaines We're not experiencing this on our side at Grafana (we're running one cluster on azure), not sure what this error is.
The specified block list is invalid
EDIT: I found one instance of this error in our cluster too.
^^ a little more context on this, this error is not uncommon but appears transient, we typically don't see it more than once or twice and the operation succeeds.
Can you confirm cassandra and block storage have data in your environment, so that we can exclude this possibility.
It would also be worth checking to see what's in the ring by port-forwarding one of your distributors and making an http request from a browser to the /ring endpoint
@cyriltovena confirming other comments here, this is an error that happens when there are attempts to write to the same blob location by multiple requests. It is most likely a side effect of the replicated flushes if all the ingesters are attempting to flush the same block simultaneously.
On its own, it shouldn't be the cause of query issues as the retry operation should succeed.
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Anyone who made it work?
I am facing a similar issue with Azure Blob and Cosmos(Cassandra API). In my case for some labels, I can go back and search it but not for all.
Loki Version: 1.6.0
Here is my Loki config.
auth_enabled: false
chunk_store_config:
max_look_back_period: 672h
ingester:
chunk_block_size: 262144
chunk_idle_period: 3m
chunk_retain_period: 1m
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
max_transfer_retries: 0
limits_config:
enforce_metric_name: false
max_entries_limit_per_query: 12000
reject_old_samples: true
reject_old_samples_max_age: 168h
schema_config:
configs:
- from: "2020-09-03"
index:
period: 168h
prefix: loki
object_store: azure
schema: v11
store: cassandra
server:
http_listen_port: 3100
storage_config:
azure:
account_key: XXXX
account_name: XXXX
container_name: loki
boltdb:
directory: /data/loki/index
cassandra:
SSL: true
addresses: XXXXX.cassandra.cosmos.azure.com
auth: true
disable_initial_host_lookup: true
host_verification: false
keyspace: lokiindex
password: XXXXX
port: 10350
username: XXXXXX
filesystem:
directory: /data/loki/chunks
table_manager:
retention_deletes_enabled: true
retention_period: 672h
I am seeing similar issue, with specific labels alone not showing up in the search after 5 mins. I have logs written with log level as one of labels from fluentbit to loki. Logs with level=ERROR which is multiline alone don't show up in the search after 5 minutes. Others like INFO, DEBUG show up.
I had updated my config using this comment last Friday and till now it's working fine for me.
FYI @arvindiway @krewi1
Most helpful comment
I am facing a similar issue with Azure Blob and Cosmos(Cassandra API). In my case for some labels, I can go back and search it but not for all.
Loki Version:
1.6.0Here is my
Loki config.