Google-cloud-go: logging: fatal error: concurrent map iteration and map write

Created on 5 Nov 2020  路  10Comments  路  Source: googleapis/google-cloud-go

Client
cloud.google.com/go/logging v1.1.0

Environment
Used in Fluent Bit GKE exporter docker image deployed in GKE

Go Environment
Binary is built inside of golang:1.15 docker image.

Code
We are calling mainly two methods from the library

for _, e := range entries {
    measurements.append(e.Timestamp)
    l.Log(*e)
}
err := l.Flush()

Expected behavior
Entries are correctly send to Cloud Logging API.

Actual behavior
Calls are failing with: fatal error: concurrent map iteration and map write

Screenshots

goroutine 2992851 [running]:
runtime.throw(0x115f98a, 0x26)
/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc000af7358 sp=0xc000af7328 pc=0x4376d2
runtime.mapiternext(0xc000b54360)
/usr/local/go/src/runtime/map.go:853 +0x554 fp=0xc000af73d8 sp=0xc000af7358 pc=0x410d14
reflect.mapiternext(0xc000b54360)
/usr/local/go/src/runtime/map.go:1337 +0x2b fp=0xc000af73f0 sp=0xc000af73d8 pc=0x467c8b
reflect.(MapIter).Next(0xc00072bc20, 0xc00041fcea)
/usr/local/go/src/reflect/value.go:1276 +0xd6 fp=0xc000af7418 sp=0xc000af73f0 pc=0x4a2376
google.golang.org/protobuf/internal/impl.sizeMap(0xfd8600, 0xc000309c40, 0x195, 0xc0006a2500, 0xc0001b0f10, 0x100, 0x539700)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/protobuf/internal/impl/codec_map.go:98 +0xd1 fp=0xc000af7478 sp=0xc000af7418 pc=0xccab11
google.golang.org/protobuf/internal/impl.encoderFuncsForMap.func1(0xc000309c40, 0xc0001b0f10, 0xc000af7500, 0xc8)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/protobuf/internal/impl/codec_map.go:54 +0xab fp=0xc000af74c8 sp=0xc000af7478 pc=0xcf812b
google.golang.org/protobuf/internal/impl.(
MessageInfo).sizePointerSlow(0xc00017e138, 0xc000309c00, 0xfe0200, 0x1990d18)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/protobuf/internal/impl/encode.go:79 +0xc5 fp=0xc000af7518 sp=0xc000af74c8 pc=0xcddee5
google.golang.org/protobuf/internal/impl.(MessageInfo).sizePointer(0xc00017e138, 0xc000309c00, 0xc000999700, 0xfe02c0)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/protobuf/internal/impl/encode.go:56 +0x70 fp=0xc000af7548 sp=0xc000af7518 pc=0xcdddb0
google.golang.org/protobuf/internal/impl.(
MessageInfo).size(0xc00017e138, 0x12cea00, 0xc000309c00, 0xc00057c400, 0x7f6d7c12d108)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/protobuf/internal/impl/encode.go:40 +0x55 fp=0xc000af7578 sp=0xc000af7548 pc=0xcddcd5
google.golang.org/protobuf/internal/impl.(MessageInfo).size-fm(0x12cea00, 0xc000309c00, 0x129d200, 0x418905)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/protobuf/internal/impl/encode.go:33 +0x47 fp=0xc000af75b0 sp=0xc000af7578 pc=0xd04487
google.golang.org/protobuf/proto.MarshalOptions.marshal(0xc000000001, 0xc000488480, 0x0, 0x439, 0x12cea00, 0xc000309c00, 0x129d280, 0xc000309c00, 0x480, 0xc000488480, ...)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/protobuf/proto/encode.go:156 +0x304 fp=0xc000af7668 sp=0xc000af75b0 pc=0xc6e044
google.golang.org/protobuf/proto.MarshalOptions.MarshalAppend(0x1000001, 0xc000488480, 0x0, 0x439, 0x129d280, 0xc000309c00, 0x7f6d7c12d108, 0x0, 0x8, 0xc0008d7c00, ...)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/protobuf/proto/encode.go:125 +0x98 fp=0xc000af76d0 sp=0xc000af7668 pc=0xc6dbb8
github.com/golang/protobuf/proto.marshalAppend(0xc000488480, 0x0, 0x439, 0x7f6d550fc380, 0xc000309c00, 0xc000af7700, 0x478994, 0x1943100, 0x0, 0x110dfa0, ...)
/go/src/gke-logmon/fluent-bit/vendor/github.com/golang/protobuf/proto/wire.go:40 +0xc7 fp=0xc000af7770 sp=0xc000af76d0 pc=0xd0e447
github.com/golang/protobuf/proto.(
Buffer).Marshal(...)
/go/src/gke-logmon/fluent-bit/vendor/github.com/golang/protobuf/proto/buffer.go:111
google.golang.org/grpc/encoding/proto.marshal(0x10aa060, 0xc000309c00, 0xc0006f4d50, 0x0, 0x0, 0x0, 0xc000af7878, 0xda4ac5)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/grpc/encoding/proto/proto.go:59 +0x111 fp=0xc000af77f0 sp=0xc000af7770 pc=0xd749b1
google.golang.org/grpc/encoding/proto.codec.Marshal(0x10aa060, 0xc000309c00, 0x203000, 0x1, 0x11b4e48, 0x203000, 0x203000)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/grpc/encoding/proto/proto.go:74 +0xb3 fp=0xc000af7870 sp=0xc000af77f0 pc=0xd74b53
google.golang.org/grpc/encoding/proto.(codec).Marshal(0x1990d18, 0x10aa060, 0xc000309c00, 0x1909028, 0xc000af7978, 0x40d950, 0xc00072bbe0, 0x20)
:1 +0x46 fp=0xc000af78b8 sp=0xc000af7870 pc=0xd75426
google.golang.org/grpc.encode(0x7f6d550faad8, 0x1990d18, 0x10aa060, 0xc000309c00, 0x0, 0xc00057c400, 0x7f6d7c12d108, 0x0, 0x10000914780)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/grpc/rpc_util.go:538 +0x52 fp=0xc000af7918 sp=0xc000af78b8 pc=0xd9ea12
google.golang.org/grpc.prepareMsg(0x10aa060, 0xc000309c00, 0x7f6d550faad8, 0x1990d18, 0x0, 0x0, 0x0, 0x0, 0x9, 0xc000186730, ...)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/grpc/stream.go:1541 +0x107 fp=0xc000af79b8 sp=0xc000af7918 pc=0xda9f67
google.golang.org/grpc.(
clientStream).SendMsg(0xc0007e1560, 0x10aa060, 0xc000309c00, 0x0, 0x0)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/grpc/stream.go:722 +0x165 fp=0xc000af7b48 sp=0xc000af79b8 pc=0xda60a5
google.golang.org/grpc.invoke(0x12bd1a0, 0xc000999560, 0x1170a09, 0x33, 0x10aa060, 0xc000309c00, 0x1038580, 0xc000999590, 0xc00055c380, 0xc000186790, ...)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/grpc/call.go:70 +0xe4 fp=0xc000af7bd0 sp=0xc000af7b48 pc=0xd8fd24
google.golang.org/grpc.(ClientConn).Invoke(0xc00055c380, 0x12bd1a0, 0xc000999560, 0x1170a09, 0x33, 0x10aa060, 0xc000309c00, 0x1038580, 0xc000999590, 0x0, ...)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/grpc/call.go:37 +0x1b3 fp=0xc000af7c70 sp=0xc000af7bd0 pc=0xd8fb13
google.golang.org/genproto/googleapis/logging/v2.(
loggingServiceV2Client).WriteLogEntries(0xc000186d60, 0x12bd1a0, 0xc000999560, 0xc000309c00, 0x0, 0x0, 0x0, 0x10, 0x20, 0xc00072ba00)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/genproto/googleapis/logging/v2/logging.pb.go:1294 +0xd4 fp=0xc000af7cf8 sp=0xc000af7c70 pc=0xe4d934
cloud.google.com/go/logging/apiv2.(Client).WriteLogEntries.func1(0x12bd1a0, 0xc000999560, 0x11b3090, 0x0, 0x0, 0x0, 0xc0008c8dc0, 0x40e2b8)
/go/src/gke-logmon/fluent-bit/vendor/cloud.google.com/go/logging/apiv2/logging_client.go:226 +0x84 fp=0xc000af7d60 sp=0xc000af7cf8 pc=0xe5a544
github.com/googleapis/gax-go/v2.invoke(0x12bd1a0, 0xc000999560, 0xc000af7eb0, 0x11b3090, 0x0, 0x0, 0x0, 0x11b4180, 0x0, 0x0)
/go/src/gke-logmon/fluent-bit/vendor/github.com/googleapis/gax-go/v2/invoke.go:70 +0x90 fp=0xc000af7dd0 sp=0xc000af7d60 pc=0xdafdf0
github.com/googleapis/gax-go/v2.Invoke(0x12bd1a0, 0xc000999560, 0xc0008c8eb0, 0xc000186c60, 0x1, 0x1, 0xc000999560, 0xc0008a06d0)
/go/src/gke-logmon/fluent-bit/vendor/github.com/googleapis/gax-go/v2/invoke.go:48 +0xf6 fp=0xc000af7e48 sp=0xc000af7dd0 pc=0xdafb76
cloud.google.com/go/logging/apiv2.(
Client).WriteLogEntries(0xc0002037d0, 0x12bd1a0, 0xc000999560, 0xc000309c00, 0x0, 0x0, 0x1, 0xc0008a0758, 0xf6da00, 0xc00072b9c0)
/go/src/gke-logmon/fluent-bit/vendor/cloud.google.com/go/logging/apiv2/logging_client.go:224 +0x1b7 fp=0xc000af7ee0 sp=0xc000af7e48 pc=0xe58977
cloud.google.com/go/logging.(Logger).writeLogEntries(0xc000d64240, 0xc00072b9a0, 0x4, 0x4)
/go/src/gke-logmon/fluent-bit/vendor/cloud.google.com/go/logging/logging.go:860 +0x151 fp=0xc000af7f60 sp=0xc000af7ee0 pc=0xe614b1
cloud.google.com/go/logging.(
Client).Logger.func2(0xf6da00, 0xc00072b9c0)
/go/src/gke-logmon/fluent-bit/vendor/cloud.google.com/go/logging/logging.go:464 +0x5a fp=0xc000af7f90 sp=0xc000af7f60 pc=0xe621ba
google.golang.org/api/support/bundler.(Bundler).handle(0xc0006b3e40, 0xc000999470)
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/api/support/bundler/bundler.go:324 +0x71 fp=0xc000af7fd0 sp=0xc000af7f90 pc=0xe5d4f1
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc000af7fd8 sp=0xc000af7fd0 pc=0x46e701
created by google.golang.org/api/support/bundler.(
Bundler).enqueueCurBundle
/go/src/gke-logmon/fluent-bit/vendor/google.golang.org/api/support/bundler/bundler.go:179 +0x1ad

So far we have only one report of this problem, but it seems that similar issue was already opened some time ago: https://github.com/googleapis/google-cloud-go/issues/1810

logging p3 bug

All 10 comments

@codyoss it is possible bundler package is somehow calling the handler function on the same log bundles concurrently? I'm not as familiar with the bundle library.

Without seeing more information, my guess so far is: bundler executes concurrent routines calling WriteLogEntries on the same log entries. Note: a log entry does contain maps (e.g. the label field). So if we're writing/touching the same log entries, then this fatal error could occur. But it could just as likely be caused by how we're using the bundle library.

@loburm can you provide more of the client's code so I can reproduce this? Or, can they set up a Data Race Detector and send us the report? I'll ask internally as well in case you don't directly communicate with the user.

I have seen an issue similar to this in the past and it was an issue with editing of labels by the calling code I believe. I will look for the issue.

_Thanks to @codyoss_

@loburm can you check with the client if they are passing in the label field using a pointer?

Previously, we've found that Log() is asynchronous, so sharing pointers to Labels like this between different log entries is not valid.

This would panic:

l.Logger.Log(cloudlogging.Entry {
    Timestamp: time.Now(), 
    Severity: severity, 
    Labels: l.Labels, 
    Payload: formattedPayload, 
    Resource: &r, 
}) 

More context
When l.Labels is passed to Stackdriver Log API, it's passed by reference, and there is a thread running in parallel that might manipulate l.Labels. If instead of l.Labels, its copy is passed, then it will be thread safe.

Next steps:
Since this has been experienced by many people now, I'll update docs/samples to reflect this caveat.

@codyoss @tbpg do you think we should handle this for users in the client lib, rather than just warning them about not passing in label pointers? i.e. have writeLogEntries make a copy of label, but it's not memory efficient...

I have verified our code and for each log entry we are creating a unique map with labels and it's not possible that something else modifies that field during flushing.

Another option might be a commonLabels that are passed to Logger. Code is a little bit more complicated there, but I also haven't found any evidence that it might be modified in different goroutine.

@nicoleczhu I would avoid make defensive copies here. Logging is something I think of as on a hot path in a lot of applications, so extra allocations could be costly.

@loburm Is it easy for you to test not setting any common logs to try to isolate issue?

I can do it locally, but we can't release such logging agent to our customers. Internal tests has shown 20-30% decrease in CPU usage while using common labels.

For now I'm going to create a copy of common labels when creating Logger. Also I'll add recover and new metric that will notify us when this issue repeats. And if we are going to see it, then I assume problem is hidden somewhere in logging client library.

Unfortunately rollout process is very slow. So it will take near a month to get reasonable amount of clients switching to the new version.

I reported a similar issue to the fluent-bit project and they closed it saying it was a GKE-specific problem. Linking here in for context since the error I see is identical https://github.com/fluent/fluent-bit/issues/2956

Hi Daniel,
Yes this problem is not related to OSS Fluent Bit. Issue is supposed to be fixed on clusters starting from 1.17.14-gke.1700 or 1.18.12-gke.100. Could you verify if your are running the latest cluster version?

@loburm Hey, you're right! I had silenced our alarm for this because it was so noisy, but the silence expired and we are no longer experiencing the problem. While the alarm was silenced the cluster was auto-upgraded to v1.18.12-gke.1206, and so far does not appear to be exhibiting the problem. 馃帀

Was this page helpful?
0 / 5 - 0 ratings