Go: sync: Map has poor performance

Created on 25 Nov 2018 · 44 comments · Source: golang/go

What version of Go are you using (go version)?

1.11.2

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

darwin, osx, core i7

What did you do?

Please see the full test analysis here, but in summary: the performance of sync.Map is far worse than expected, even with read-only workloads.

Note that all of the maps are pre-populated, so a get will always succeed and a put will always replace an existing element.
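
For context, the "lock" variant in the numbers below is presumably a mutex-guarded map along these lines; a minimal sketch with illustrative names, not the repository's actual code:

    package cache

    import "sync"

    // LockMap is a sketch of an RWMutex-guarded map[int]int cache.
    type LockMap struct {
        mu sync.RWMutex
        m  map[int]int
    }

    func NewLockMap() *LockMap { return &LockMap{m: make(map[int]int)} }

    func (c *LockMap) Get(k int) int {
        c.mu.RLock()
        v := c.m[k]
        c.mu.RUnlock()
        return v
    }

    func (c *LockMap) Put(k, v int) {
        c.mu.Lock()
        c.m[k] = v
        c.mu.Unlock()
    }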

BenchmarkMain/unshared.get-8            20000000            91.8 ns/op
BenchmarkMain/unshared.put-8            20000000           100 ns/op
BenchmarkMain/unshared.putget-8         10000000           178 ns/op
BenchmarkMain/unshared.multiget-8       20000000            87.0 ns/op
BenchmarkMain/lock.get-8                20000000           110 ns/op
BenchmarkMain/lock.put-8                10000000           122 ns/op
BenchmarkMain/lock.putget-8             10000000           233 ns/op
BenchmarkMain/lock.multiget-8           10000000           146 ns/op
BenchmarkMain/lock.multiput-8            5000000           304 ns/op
BenchmarkMain/lock.multiputget-8         2000000           596 ns/op
BenchmarkMain/sync.get-8                 5000000           297 ns/op
BenchmarkMain/sync.put-8                 5000000           327 ns/op
BenchmarkMain/sync.putget-8              2000000           688 ns/op
BenchmarkMain/sync.multiget-8            5000000           382 ns/op
BenchmarkMain/sync.multiput-8            5000000           395 ns/op
BenchmarkMain/sync.multiputget-8         2000000           740 ns/op

The "multi" above is when 2 Go routines are concurrently accessing the map.

What did you expect to see?

The performance of sync.Map for reads should be nearly equal to that of the standard built-in map. This is the demonstrated behavior of the comparable Java ConcurrentHashMap.

What did you see instead?

sync.Map is nearly 3x slower in all cases, including when compared to using an RWMutex to control access to a shared standard map, and even in uncontended usage.

Since the sync.Map read path is lock-free, a read operation should cost at most one extra atomic ("volatile") read to load the map reference. This points to a possible problem with atomic.Value?

Labels: FrozenDueToAge, NeedsInvestigation, Performance

Most helpful comment

@robaho A program with a data race in it is not a valid Go program.

Even if the data races don't impact your benchmark results (which may be possible), I suggest you fix them, just to make your benchmarks valid Go code.

All 44 comments

/cc @bcmills

For anyone looking, the code is actually located here

I'm not sure how valid these benchmarks are, but removing the remainder operation (%) baked into almost all key-index operations yields more expected results:

goos: linux
goarch: amd64
pkg: github.com/robaho/go-concurrency-test

BenchmarkMain/unshared.get-4            20000000           115 ns/op
BenchmarkMain/unshared.put-4            20000000           124 ns/op
BenchmarkMain/unshared.putget-4         10000000           224 ns/op
BenchmarkMain/unshared.multiget-4       20000000           123 ns/op
BenchmarkMain/lock.get-4                50000000            37.5 ns/op
BenchmarkMain/lock.put-4                30000000            48.2 ns/op
BenchmarkMain/lock.putget-4             20000000            83.5 ns/op
BenchmarkMain/lock.multiget-4           10000000           137 ns/op
BenchmarkMain/lock.multiput-4           10000000           134 ns/op
BenchmarkMain/lock.multiputget-4        10000000           220 ns/op
BenchmarkMain/sync.get-4                20000000            63.7 ns/op
BenchmarkMain/sync.put-4                10000000           130 ns/op
BenchmarkMain/sync.putget-4             10000000           207 ns/op
BenchmarkMain/sync.multiget-4           20000000            67.0 ns/op
BenchmarkMain/sync.multiput-4           10000000           182 ns/op
BenchmarkMain/sync.multiputget-4         5000000           267 ns/op

Benchcmp before/after % removal.

BenchmarkMain/unshared.get-4            116           115           -0.86%
BenchmarkMain/unshared.put-4            118           124           +5.08%
BenchmarkMain/unshared.putget-4         229           224           -2.18%
BenchmarkMain/unshared.multiget-4       123           123           +0.00%
BenchmarkMain/lock.get-4                160           37.5          -76.56%
BenchmarkMain/lock.put-4                188           48.2          -74.36%
BenchmarkMain/lock.putget-4             362           83.5          -76.93%
BenchmarkMain/lock.multiget-4           206           137           -33.50%
BenchmarkMain/lock.multiput-4           367           134           -63.49%
BenchmarkMain/lock.multiputget-4        728           220           -69.78%
BenchmarkMain/sync.get-4                510           63.7          -87.51%
BenchmarkMain/sync.put-4                468           130           -72.22%
BenchmarkMain/sync.putget-4             1132          207           -81.71%
BenchmarkMain/sync.multiget-4           633           67.0          -89.42%
BenchmarkMain/sync.multiput-4           487           182           -62.63%
BenchmarkMain/sync.multiputget-4        1177          267           -77.32%

Additionally, I suggest that anyone investigating examine how these benchmarks are implemented and whether they are even valid: b.N is being updated by multiple goroutines, for instance.

    m.Run(names[i]+".multiputget", func(b *testing.B) {
            wg := sync.WaitGroup{}
            for g := 0; g < NGOS; g++ {
                wg.Add(1)
                go func() {
                    r := time.Now().Nanosecond()

                    var sum int
                    for i := 0; i < b.N; i++ {
                        r = rand(r)
                        sum += impl.Get(r)
                        r = rand(r)
                        impl.Put(r, r)
                    }
                    Sink = sum
                    wg.Done()
                }()
            }
            wg.Wait()
        })

Of course, after doing this, I noticed the data race in the benchmarks and ran it with the race detector. So all bets are off:

goos: linux
goarch: amd64
pkg: github.com/robaho/go-concurrency-test
BenchmarkRand-4     100000000           10.4 ns/op
populating maps...
BenchmarkMain/unshared.get-4             5000000           328 ns/op
BenchmarkMain/unshared.put-4             5000000           289 ns/op
BenchmarkMain/unshared.putget-4          2000000           631 ns/op
==================
WARNING: DATA RACE
Write at 0x0000007100d8 by goroutine 23:
  github.com/robaho/go-concurrency-test_test.BenchmarkMain.func4.1()
      /g/src/github.com/robaho/go-concurrency-test/maps_test.go:99 +0xf9

Previous write at 0x0000007100d8 by goroutine 22:
  github.com/robaho/go-concurrency-test_test.BenchmarkMain.func4.1()
      /g/src/github.com/robaho/go-concurrency-test/maps_test.go:99 +0xf9

Goroutine 23 (running) created at:
  github.com/robaho/go-concurrency-test_test.BenchmarkMain.func4()
      /g/src/github.com/robaho/go-concurrency-test/maps_test.go:92 +0xeb
  testing.(*B).runN()
      /go/src/testing/benchmark.go:141 +0x129
  testing.(*B).run1.func1()
      /go/src/testing/benchmark.go:214 +0x6b

Goroutine 22 (finished) created at:
  github.com/robaho/go-concurrency-test_test.BenchmarkMain.func4()
      /g/src/github.com/robaho/go-concurrency-test/maps_test.go:92 +0xeb
  testing.(*B).runN()
      /go/src/testing/benchmark.go:141 +0x129
  testing.(*B).run1.func1()
      /go/src/testing/benchmark.go:214 +0x6b
==================
--- FAIL: BenchmarkMain/unshared.multiget
    benchmark.go:147: race detected during execution of benchmark

I am fairly certain the tests are valid. b.N is not being updated - it is only read by multiple goroutines - and the testing method does not exit until the goroutines complete, hence the WaitGroup. Still, the "multi" benchmarks are not even the ones referenced by this bug - the non-multi tests use the single goroutine provided by the testing harness.

The reason the mod is there is to limit the number of entries in the map - a cap.

Yes, 'unshared' has a data race when run under multi - but since the map is pre-populated and doesn't change in size during a read, it is a benign data race (and that is also why it doesn't run the multi put tests - an exception would be thrown).

And @as, your "changed tests" are not valid: there is no way a test WITH locks (lock) should outperform the same test WITHOUT locks (unshared) - especially by 4x! So something is wrong... maybe if you linked to the code I could point it out.

To make the point stronger: the issue is the performance difference of the non-multi "sync" results compared to "lock" and "unshared".

The "unshared" benchmarks still contain the remainder operation. I should have pointed that out -- I didn't remove it for all of your types.

This was originally to illustrate how high-cycle-cost division instructions dominate the CPU time in the benchmark itself.

There is no such thing as a benign data race. It is counterproductive to further discuss a racy benchmark (which is why I did not publish any code changes).

@as I'm sorry but you are not correct on a couple of counts:

  1. If you read the complete read-me, it states that "unshared" is not designed for, or valid under, concurrent use - it works because the map is pre-populated; in fact, the built-in map code detects concurrent writes. The supporting evidence for this bug is not racy - you can remove those tests and the claim still holds.

  2. Your "high cost of division" claim is nonsense. It is a single mod operation. The built-in map itself uses many... In fact, I changed the mod to a single AND with a mask, capping at 1024*1024. I posted the updated code to the repo. Here are the results:

BenchmarkMain/unshared.get-8            20000000            91.8 ns/op
BenchmarkMain/unshared.put-8            20000000            95.7 ns/op
BenchmarkMain/unshared.putget-8         10000000           174 ns/op
BenchmarkMain/unshared.multiget-8       20000000            89.9 ns/op
BenchmarkMain/lock.get-8                20000000           112 ns/op
BenchmarkMain/lock.put-8                10000000           123 ns/op
BenchmarkMain/lock.putget-8             10000000           231 ns/op
BenchmarkMain/lock.multiget-8           10000000           144 ns/op
BenchmarkMain/lock.multiput-8            5000000           305 ns/op
BenchmarkMain/lock.multiputget-8         3000000           495 ns/op
BenchmarkMain/sync.get-8                 5000000           287 ns/op
BenchmarkMain/sync.put-8                 5000000           340 ns/op
BenchmarkMain/sync.putget-8              2000000           748 ns/op
BenchmarkMain/sync.multiget-8            3000000           407 ns/op
BenchmarkMain/sync.multiput-8            5000000           354 ns/op
BenchmarkMain/sync.multiputget-8         2000000           773 ns/op

@as your code changes are probably incorrect: if you remove the mod/and, there is a very good chance that a random read will look up a value not in the map, which is very fast... that is why you are seeing the "improved" results. I think you should be less hostile in the future.

edit: this is especially true since the Go map uses a 'top hash' as a quick check - many of the reads would terminate quickly

@as lastly, your "racy" assessment is incorrect as well - if you examine the code, the race is the use of the "sink" to avoid dead-code elimination (not a race in using the map) - which, by the way, most of the internal Go benchmark tests don't do, and they are invalid because of it... only some have been fixed to use a package-level sink.

@as I would kindly ask that you delete your comments, as they are invalid, and only bring noise to this issue.

@robaho A program with a data race in it is not a valid Go program.

Even if the data races don't impact your benchmark results (which may be possible), I suggest you fix them, just to make your benchmarks valid Go code.

@ALTree The data race is only the setting of the sink in the benchmark code - it is debatable whether the race detector should even report this as a data race, since the sink is only there to prevent dead-code removal. Still, I went ahead and changed it to an atomic.Value, so I assume it will be OK now.

Also, I moved the "masking" into the test harness and out of the core code to make it even simpler.
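
A minimal sketch of what masking in the harness can look like, assuming a 2^20-entry pre-populated map; the next helper stands in for the repository's rand function and is illustrative only:

    package cache_test

    import "testing"

    const keyMask = 1<<20 - 1 // cap keys to the pre-populated 2^20 range

    // next is an xorshift-style step, standing in for the repo's rand helper.
    func next(r int) int {
        r ^= r << 13
        r ^= r >> 7
        r ^= r << 17
        return r
    }

    // benchGet masks the key in the harness, so the map code itself stays
    // free of any mod/AND and each key costs only a single AND.
    func benchGet(b *testing.B, get func(int) int) {
        r := 1
        var sum int
        for i := 0; i < b.N; i++ {
            r = next(r)
            sum += get(r & keyMask)
        }
        _ = sum // keep the result live without a racy global sink
    }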

After all of the changes, the results still show an issue:

BenchmarkMain/unshared.get-8            20000000            85.5 ns/op
BenchmarkMain/unshared.put-8            20000000            92.3 ns/op
BenchmarkMain/unshared.putget-8         10000000           177 ns/op
BenchmarkMain/unshared.multiget-8       20000000            88.2 ns/op
BenchmarkMain/lock.get-8                10000000           109 ns/op
BenchmarkMain/lock.put-8                10000000           125 ns/op
BenchmarkMain/lock.putget-8              5000000           233 ns/op
BenchmarkMain/lock.multiget-8           10000000           158 ns/op
BenchmarkMain/lock.multiput-8            5000000           293 ns/op
BenchmarkMain/lock.multiputget-8         3000000           571 ns/op
BenchmarkMain/sync.get-8                 5000000           299 ns/op
BenchmarkMain/sync.put-8                 5000000           344 ns/op
BenchmarkMain/sync.putget-8              2000000           678 ns/op
BenchmarkMain/sync.multiget-8            5000000           375 ns/op
BenchmarkMain/sync.multiput-8            5000000           337 ns/op
BenchmarkMain/sync.multiputget-8         2000000           707 ns/op

Thanks for the report. The line seems correct to me.

While it's true the rand.NewSource(g) return value cannot be used by multiple goroutines concurrently, in the map_test.go code the rand.NewSource() calls are inside the loop that spawns the goroutines, so each goroutine ends up with a different random source.

for g := int64(runtime.GOMAXPROCS(0)); g > 0; g-- {
    r := rand.New(rand.NewSource(g))
    wg.Add(1)
    go func(g int64) {
        [...]

goroutine #1 uses a random source instantiated as r := rand.New(rand.NewSource(1)), goroutine #2 has r := rand.New(rand.NewSource(2)), and so on. Each goroutine has its own randomness source, and those are not shared.

If you move the randomness source instantiation outside of the loop, the race detector immediately complains about a data race; it doesn't on the current code.
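
For illustration, the broken variant being described would look something like this (a sketch only, not code from the repository):

    package main

    import (
        "math/rand"
        "runtime"
        "sync"
    )

    // racyVariant hoists one *rand.Rand out of the loop and shares it across
    // goroutines; go test -race flags this immediately.
    func racyVariant() {
        r := rand.New(rand.NewSource(1)) // shared source: don't do this
        var wg sync.WaitGroup
        for g := int64(runtime.GOMAXPROCS(0)); g > 0; g-- {
            wg.Add(1)
            go func() {
                defer wg.Done()
                _ = r.Int63() // unsynchronized concurrent use of r: data race
            }()
        }
        wg.Wait()
    }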

@ALTree You are of course correct - when I read it, I was thinking of Go closure gotchas and didn't see the new variable declaration... sorry. That you need to pass g as an arg but create a new var for r is pretty silly... it tripped me up. :)

@ALTree you can delete these comments to clean this up if you wish... point understood, and I had already corrected the testing code

~You are not disagreeing with me, you are disagreeing with the race detector. The justification for having the race and other provided assumptions are of little interest to me.~

After rewriting the benchmark in a way I could understand, I obtained the results below. My variables of interest are:

  • The size of the map: fixed to 2^20
  • Number of writers to the cache: 0 - 2
  • Access pattern (random vs. iteration): Incr vs. Rand (XorShift)
  • Access-pattern variance (if applicable): a simple bitmask applied to the XorShift output

With just one concurrent writer, the sync map already beats the lock.

goos: linux
goarch: amd64
pkg: github.com/as/goissues/28938
BenchmarkMap/0Writer/Sync/Incr/Mask0-4          50000000            31.0 ns/op
BenchmarkMap/0Writer/Sync/Incr/Maskf-4          50000000            31.3 ns/op
BenchmarkMap/0Writer/Sync/Incr/Maskff-4         30000000            35.7 ns/op
BenchmarkMap/0Writer/Sync/Incr/Maskfff-4        30000000            51.1 ns/op
BenchmarkMap/0Writer/Sync/Incr/Maskffff-4       10000000           203 ns/op
BenchmarkMap/0Writer/Sync/Incr/Maskfffff-4      10000000           194 ns/op
BenchmarkMap/0Writer/Sync/Rand/Mask0-4          50000000            31.4 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskf-4          50000000            33.4 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskff-4         30000000            36.4 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskfff-4        20000000            80.6 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskffff-4        5000000           343 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskfffff-4       3000000           509 ns/op
BenchmarkMap/0Writer/Lock/Incr/Mask0-4          50000000            24.5 ns/op
BenchmarkMap/0Writer/Lock/Incr/Maskf-4          50000000            24.0 ns/op
BenchmarkMap/0Writer/Lock/Incr/Maskff-4         50000000            27.0 ns/op
BenchmarkMap/0Writer/Lock/Incr/Maskfff-4        50000000            31.1 ns/op
BenchmarkMap/0Writer/Lock/Incr/Maskffff-4       20000000           124 ns/op
BenchmarkMap/0Writer/Lock/Incr/Maskfffff-4      10000000           177 ns/op
BenchmarkMap/0Writer/Lock/Rand/Mask0-4          50000000            23.8 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskf-4          50000000            23.8 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskff-4         50000000            25.1 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskfff-4        50000000            30.3 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskffff-4       20000000           121 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskfffff-4      10000000           183 ns/op
BenchmarkMap/1Writer/Lock/Incr/Mask0-4           3000000           522 ns/op
BenchmarkMap/1Writer/Lock/Incr/Maskf-4           3000000           526 ns/op
BenchmarkMap/1Writer/Lock/Incr/Maskff-4          3000000           539 ns/op
BenchmarkMap/1Writer/Lock/Incr/Maskfff-4         3000000           593 ns/op
BenchmarkMap/1Writer/Lock/Incr/Maskffff-4        2000000           718 ns/op
BenchmarkMap/1Writer/Lock/Incr/Maskfffff-4       2000000           738 ns/op
BenchmarkMap/1Writer/Lock/Rand/Mask0-4           3000000           525 ns/op
BenchmarkMap/1Writer/Lock/Rand/Maskf-4           3000000           519 ns/op
BenchmarkMap/1Writer/Lock/Rand/Maskff-4          3000000           532 ns/op
BenchmarkMap/1Writer/Lock/Rand/Maskfff-4         3000000           541 ns/op
BenchmarkMap/1Writer/Lock/Rand/Maskffff-4        2000000           639 ns/op
BenchmarkMap/1Writer/Lock/Rand/Maskfffff-4       2000000           742 ns/op
BenchmarkMap/1Writer/Sync/Incr/Mask0-4          20000000            52.7 ns/op
BenchmarkMap/1Writer/Sync/Incr/Maskf-4          30000000            47.2 ns/op
BenchmarkMap/1Writer/Sync/Incr/Maskff-4         30000000            54.8 ns/op
BenchmarkMap/1Writer/Sync/Incr/Maskfff-4        20000000            66.7 ns/op
BenchmarkMap/1Writer/Sync/Incr/Maskffff-4       10000000           154 ns/op
BenchmarkMap/1Writer/Sync/Incr/Maskfffff-4      10000000           185 ns/op
BenchmarkMap/1Writer/Sync/Rand/Mask0-4          20000000            53.6 ns/op
BenchmarkMap/1Writer/Sync/Rand/Maskf-4          30000000            46.6 ns/op
BenchmarkMap/1Writer/Sync/Rand/Maskff-4         30000000            50.0 ns/op
BenchmarkMap/1Writer/Sync/Rand/Maskfff-4        20000000            75.5 ns/op
BenchmarkMap/1Writer/Sync/Rand/Maskffff-4       10000000           279 ns/op
BenchmarkMap/1Writer/Sync/Rand/Maskfffff-4       5000000           398 ns/op
BenchmarkMap/2Writer/Sync/Incr/Mask0-4          20000000            70.3 ns/op
BenchmarkMap/2Writer/Sync/Incr/Maskf-4          30000000            58.3 ns/op
BenchmarkMap/2Writer/Sync/Incr/Maskff-4         20000000            60.4 ns/op
BenchmarkMap/2Writer/Sync/Incr/Maskfff-4        20000000            94.8 ns/op
BenchmarkMap/2Writer/Sync/Incr/Maskffff-4       10000000           185 ns/op
BenchmarkMap/2Writer/Sync/Incr/Maskfffff-4      10000000           203 ns/op
BenchmarkMap/2Writer/Sync/Rand/Mask0-4          20000000            72.7 ns/op
BenchmarkMap/2Writer/Sync/Rand/Maskf-4          30000000            66.7 ns/op
BenchmarkMap/2Writer/Sync/Rand/Maskff-4         30000000            69.2 ns/op
BenchmarkMap/2Writer/Sync/Rand/Maskfff-4        20000000           104 ns/op
BenchmarkMap/2Writer/Sync/Rand/Maskffff-4        5000000           308 ns/op
BenchmarkMap/2Writer/Sync/Rand/Maskfffff-4       5000000           403 ns/op
BenchmarkMap/2Writer/Lock/Incr/Mask0-4           2000000           766 ns/op
BenchmarkMap/2Writer/Lock/Incr/Maskf-4           2000000           757 ns/op
BenchmarkMap/2Writer/Lock/Incr/Maskff-4          2000000           785 ns/op
BenchmarkMap/2Writer/Lock/Incr/Maskfff-4         2000000           799 ns/op
BenchmarkMap/2Writer/Lock/Incr/Maskffff-4        2000000           883 ns/op
BenchmarkMap/2Writer/Lock/Incr/Maskfffff-4       2000000           912 ns/op
BenchmarkMap/2Writer/Lock/Rand/Mask0-4           2000000           768 ns/op
BenchmarkMap/2Writer/Lock/Rand/Maskf-4           2000000           778 ns/op
BenchmarkMap/2Writer/Lock/Rand/Maskff-4          2000000           782 ns/op
BenchmarkMap/2Writer/Lock/Rand/Maskfff-4         2000000           799 ns/op
BenchmarkMap/2Writer/Lock/Rand/Maskffff-4        2000000           865 ns/op
BenchmarkMap/2Writer/Lock/Rand/Maskfffff-4       2000000           941 ns/op
PASS
ok      github.com/as/goissues/28938    213.446s

Why are these results so different from the other benchmark?

https://github.com/as/goissues/tree/master/28938

@as They are not that different. Your results agree with mine. Look at these two lines in your results.

BenchmarkMap/0Writer/Sync/Rand/Maskfffff-4       3000000           509 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskfffff-4      10000000           183 ns/op

The Lock map is 2x faster. This should never happen. In a read-only test, sync.Map should behave like a standard map, so that should be the lower bound, with "lock" coming in higher (lower performance) because of its locking overhead.

@as Actually, it starts even earlier:

BenchmarkMap/0Writer/Sync/Rand/Maskfff-4        20000000            80.6 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskffff-4        5000000           343 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskfffff-4       3000000           509 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskfff-4        50000000            30.3 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskffff-4       20000000           121 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskfffff-4      10000000           183 ns/op

In all of these cases, the lock is faster than the sync.

@as your results don't seem consistent even within a category, as in

BenchmarkMap/0Writer/Sync/Rand/Maskfffff-4       3000000           509 ns/op
BenchmarkMap/1Writer/Sync/Rand/Maskfffff-4       5000000           398 ns/op
BenchmarkMap/2Writer/Sync/Rand/Maskfffff-4       5000000           403 ns/op

I don't think that adding writers should make the readers perform faster... I think this is only possible if the writers are putting more elements into the L1/L2/L3 cache... unless something in the read-only workload is causing mass cache evictions - maybe an interaction between atomic.Value and the GC? Not sure, but there is definitely a problem here.

An easy way to test this is to use multiple readers only... my "multiget" tests do that, and still show the "incorrect" results.

@as as an aside, I'm sorry that you needed to rewrite my tests - the latest version I posted is far easier to follow and maintain - simpler, I think, than even yours, but to each his own. I'm happy you were able to duplicate my findings.

I'm not sure I understand the benchmarks. It's not clear to me that they are testing the case that sync.Map is intended for: multiple goroutines concurrently reading from the map, while permitting an occasional goroutine to write to the map. It is not particularly surprising that sync.Map is not as fast as the builtin map type when using a single goroutine; in that case there is no contention for the lock. sync.Map is a specialized type, as described in the doc comment; it's not a general replacement for a lock around a map. See the benchmarks in sync/map_bench_test.go; see how those benchmarks perform with your alternative implementations.

They are, IMO. The 'sync' map should not perform worse than 'lock' in a primarily-read workload (in this case it is read-only, but you could sprinkle a few writes in there if you want).

The implementation of sync.Map is an atomic read of a value (holding the standard map), followed by access to that standard map. So the only performance difference between sync.Map and the std map in a read-only scenario should be the cost of the atomic read, and in a read-only environment that cost is negligible, since the value will already be in the cache.
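
A sketch of the read path being described: an atomic.Value holding a plain map, so a reader pays one atomic load plus one ordinary lookup. This is the model in the comment above; the real sync.Map adds a per-entry pointer indirection on top of this, which comes up later in the thread.

    package cache

    import (
        "sync"
        "sync/atomic"
    )

    // COWMap: readers do one atomic load of the current map, then a normal
    // map lookup. Writers copy the whole map, so writes are expensive.
    type COWMap struct {
        mu sync.Mutex   // serializes writers
        v  atomic.Value // holds the current map[int]int
    }

    func NewCOWMap() *COWMap {
        c := &COWMap{}
        c.v.Store(map[int]int{})
        return c
    }

    func (c *COWMap) Get(k int) int {
        return c.v.Load().(map[int]int)[k] // one atomic load + one lookup
    }

    func (c *COWMap) Put(k, val int) {
        c.mu.Lock()
        defer c.mu.Unlock()
        old := c.v.Load().(map[int]int)
        next := make(map[int]int, len(old)+1)
        for ok, ov := range old { // copy-on-write
            next[ok] = ov
        }
        next[k] = val
        c.v.Store(next)
    }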

There is definitely something wrong. That is why I included the "equivalent" Java tests - in a read-only (or rarely written) workload, sync.Map should perform the same as an unsynchronized std map - that is the point of lock-free data structures.

@ianlancetaylor the tests by @as cover "read only" access in the 0Writer cases, which demonstrate the problem. His tests also show that the readers get faster as writers are added - which should not be the case according to the documentation and intended usage of sync.Map either...

The point of sync.Map is to avoid lock contention. You need to test multiple concurrent readers. Otherwise you aren't going to see any lock contention.

@ianlancetaylor that is what my 'multiget' test does, and it shows the same problem - but even so, a single reader with no writer should not perform worse than a lock-based solution - it is a single atomic read. If it does, there is something wrong with atomic.Value.

I guess the powers that be can just ignore this, and claim this is the way it is supposed to work, but I've been working with lock-free structures for quite a while, and this is not a traditional performance profile.

Modified the test so that

  • Each writer executes a Put with 1/8 probability (i.e., an occasional write).
  • For each writer, there are an additional 10 concurrent readers
  • The main benchmarking routine executes a Get, as usual.

The behavior is expected and conforms to the documentation.
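
A rough sketch of that mix (not the actual harness from the linked repo; names and the key mask are illustrative):

    package cache_test

    import (
        "sync"
        "sync/atomic"
        "testing"
    )

    const mask = 1<<20 - 1 // keep keys inside the pre-populated range

    // benchMixed: each writer Puts on roughly 1 of every 8 iterations, 10
    // background readers run per writer, and the benchmark goroutine itself
    // measures Get.
    func benchMixed(b *testing.B, writers int, get func(int) int, put func(k, v int)) {
        var stop int32
        var wg sync.WaitGroup
        worker := func(seed, writeEvery int) {
            defer wg.Done()
            for i := seed; atomic.LoadInt32(&stop) == 0; i++ {
                if writeEvery > 0 && i%writeEvery == 0 {
                    put(i&mask, i)
                } else {
                    _ = get(i & mask)
                }
            }
        }
        for w := 0; w < writers; w++ {
            wg.Add(1)
            go worker(w, 8) // occasional writer
            for r := 0; r < 10; r++ {
                wg.Add(1)
                go worker(r, 0) // pure reader
            }
        }
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            _ = get(i & mask) // the measured operation
        }
        atomic.StoreInt32(&stop, 1)
        wg.Wait()
    }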


Results

; cd /g/src/github.com/as/goissues/28938
; go test -bench .
goos: linux
goarch: amd64
pkg: github.com/as/goissues/28938
BenchmarkMap/0W1R/Sync/Incr/Mask0-4             50000000            31.6 ns/op
BenchmarkMap/0W1R/Sync/Incr/Maskf-4             50000000            32.7 ns/op
BenchmarkMap/0W1R/Sync/Incr/Maskff-4            30000000            36.0 ns/op
BenchmarkMap/0W1R/Sync/Incr/Maskfff-4           30000000            51.5 ns/op
BenchmarkMap/0W1R/Sync/Incr/Maskffff-4          10000000           190 ns/op
BenchmarkMap/0W1R/Sync/Incr/Maskfffff-4         10000000           215 ns/op
BenchmarkMap/0W1R/Sync/Rand/Mask0-4             50000000            32.4 ns/op
BenchmarkMap/0W1R/Sync/Rand/Maskf-4             50000000            32.9 ns/op
BenchmarkMap/0W1R/Sync/Rand/Maskff-4            30000000            35.3 ns/op
BenchmarkMap/0W1R/Sync/Rand/Maskfff-4           20000000            57.0 ns/op
BenchmarkMap/0W1R/Sync/Rand/Maskffff-4           5000000           320 ns/op
BenchmarkMap/0W1R/Sync/Rand/Maskfffff-4          3000000           500 ns/op
BenchmarkMap/0W1R/Lock/Incr/Mask0-4             50000000            23.5 ns/op
BenchmarkMap/0W1R/Lock/Incr/Maskf-4             50000000            23.4 ns/op
BenchmarkMap/0W1R/Lock/Incr/Maskff-4            50000000            23.6 ns/op
BenchmarkMap/0W1R/Lock/Incr/Maskfff-4           50000000            31.3 ns/op
BenchmarkMap/0W1R/Lock/Incr/Maskffff-4          20000000           121 ns/op
BenchmarkMap/0W1R/Lock/Incr/Maskfffff-4         10000000           171 ns/op
BenchmarkMap/0W1R/Lock/Rand/Mask0-4             50000000            23.7 ns/op
BenchmarkMap/0W1R/Lock/Rand/Maskf-4             50000000            24.0 ns/op
BenchmarkMap/0W1R/Lock/Rand/Maskff-4            50000000            23.8 ns/op
BenchmarkMap/0W1R/Lock/Rand/Maskfff-4           50000000            29.8 ns/op
BenchmarkMap/0W1R/Lock/Rand/Maskffff-4          20000000            99.6 ns/op
BenchmarkMap/0W1R/Lock/Rand/Maskfffff-4         10000000           185 ns/op
BenchmarkMap/1W11R/Lock/Incr/Mask0-4             1000000          1412 ns/op
BenchmarkMap/1W11R/Lock/Incr/Maskf-4             1000000          1427 ns/op
BenchmarkMap/1W11R/Lock/Incr/Maskff-4            1000000          1464 ns/op
BenchmarkMap/1W11R/Lock/Incr/Maskfff-4           1000000          1619 ns/op
BenchmarkMap/1W11R/Lock/Incr/Maskffff-4          1000000          2184 ns/op
BenchmarkMap/1W11R/Lock/Incr/Maskfffff-4          500000          2229 ns/op
BenchmarkMap/1W11R/Lock/Rand/Mask0-4             1000000          1421 ns/op
BenchmarkMap/1W11R/Lock/Rand/Maskf-4             1000000          1485 ns/op
BenchmarkMap/1W11R/Lock/Rand/Maskff-4            1000000          1380 ns/op
BenchmarkMap/1W11R/Lock/Rand/Maskfff-4           1000000          1428 ns/op
BenchmarkMap/1W11R/Lock/Rand/Maskffff-4          1000000          2151 ns/op
BenchmarkMap/1W11R/Lock/Rand/Maskfffff-4          500000          2564 ns/op
BenchmarkMap/1W11R/Sync/Incr/Mask0-4            10000000           219 ns/op
BenchmarkMap/1W11R/Sync/Incr/Maskf-4            10000000           224 ns/op
BenchmarkMap/1W11R/Sync/Incr/Maskff-4           10000000           242 ns/op
BenchmarkMap/1W11R/Sync/Incr/Maskfff-4           5000000           352 ns/op
BenchmarkMap/1W11R/Sync/Incr/Maskffff-4          3000000           653 ns/op
BenchmarkMap/1W11R/Sync/Incr/Maskfffff-4         2000000           658 ns/op
BenchmarkMap/1W11R/Sync/Rand/Mask0-4            10000000           213 ns/op
BenchmarkMap/1W11R/Sync/Rand/Maskf-4            10000000           226 ns/op
BenchmarkMap/1W11R/Sync/Rand/Maskff-4            5000000           304 ns/op
BenchmarkMap/1W11R/Sync/Rand/Maskfff-4           2000000           596 ns/op
BenchmarkMap/1W11R/Sync/Rand/Maskffff-4          1000000          1095 ns/op
BenchmarkMap/1W11R/Sync/Rand/Maskfffff-4         1000000          1559 ns/op
BenchmarkMap/2W21R/Lock/Incr/Mask0-4              500000          2734 ns/op
BenchmarkMap/2W21R/Lock/Incr/Maskf-4              500000          2621 ns/op
BenchmarkMap/2W21R/Lock/Incr/Maskff-4             500000          2972 ns/op
BenchmarkMap/2W21R/Lock/Incr/Maskfff-4            500000          3232 ns/op
BenchmarkMap/2W21R/Lock/Incr/Maskffff-4           500000          3334 ns/op
BenchmarkMap/2W21R/Lock/Incr/Maskfffff-4          300000          3745 ns/op
BenchmarkMap/2W21R/Lock/Rand/Mask0-4             3000000          2564 ns/op
BenchmarkMap/2W21R/Lock/Rand/Maskf-4              500000          2499 ns/op
BenchmarkMap/2W21R/Lock/Rand/Maskff-4             500000          2637 ns/op
BenchmarkMap/2W21R/Lock/Rand/Maskfff-4            500000          3160 ns/op
BenchmarkMap/2W21R/Lock/Rand/Maskffff-4           500000          3586 ns/op
BenchmarkMap/2W21R/Lock/Rand/Maskfffff-4         1000000          3150 ns/op
BenchmarkMap/2W21R/Sync/Incr/Mask0-4             5000000           419 ns/op
BenchmarkMap/2W21R/Sync/Incr/Maskf-4             5000000           431 ns/op
BenchmarkMap/2W21R/Sync/Incr/Maskff-4            5000000           430 ns/op
BenchmarkMap/2W21R/Sync/Incr/Maskfff-4           3000000           632 ns/op
BenchmarkMap/2W21R/Sync/Incr/Maskffff-4          1000000          1206 ns/op
BenchmarkMap/2W21R/Sync/Incr/Maskfffff-4         1000000          1400 ns/op
BenchmarkMap/2W21R/Sync/Rand/Mask0-4             3000000           476 ns/op
BenchmarkMap/2W21R/Sync/Rand/Maskf-4             3000000           450 ns/op
BenchmarkMap/2W21R/Sync/Rand/Maskff-4            5000000           488 ns/op
BenchmarkMap/2W21R/Sync/Rand/Maskfff-4           3000000           625 ns/op
BenchmarkMap/2W21R/Sync/Rand/Maskffff-4          1000000          1807 ns/op
BenchmarkMap/2W21R/Sync/Rand/Maskfffff-4         1000000          2809 ns/op

However there is an interesting (but likely irrelevant) observation:

A large access window (i.e., the entire 2^20 keyspace) results in a 2x slower sync.Map when that access is random (and there are no writers, and only one reader). However, when the access is sequential, we see performance on par with the lock.

BenchmarkMap/0W1R/Lock/Incr/Maskfffff-4         10000000           171 ns/op
BenchmarkMap/0W1R/Lock/Rand/Maskfffff-4         10000000           185 ns/op
BenchmarkMap/0W1R/Sync/Incr/Maskfffff-4         10000000           215 ns/op
BenchmarkMap/0W1R/Sync/Rand/Maskfffff-4          3000000           500 ns/op

@as that is exactly the problem... but also, it is not on par for Incr - it is 30% slower. That hardly matters, though, because a map is NOT designed to be read incrementally - that is an invalid use case - it is designed for random access, so being 3x slower than the lock version means a broken implementation somewhere... 30% is a problem for a common structure; 300% is a severe issue.

What no one seems to accept is that even 1/8 writes is far too many - a common data structure in many HFT and HPC systems is a lock-free, (nearly) read-only concurrent cache, i.e. very, very few writes. It should be on par with an unsynchronized map when doing reads (which is why I also included the Java ConcurrentHashMap numbers), but still allow some updates.

I think your tests show that it doesn't match the documentation, to quote:

// The Map type is optimized for two common use cases: (1) when the entry for a given
// key is only ever written once but read many times, as in caches that only grow,
// or ...

so clearly this implementation is not optimized for goal 1, and your tests with 1/8 writes are not testing goal 1 either.

My tests are designed to test goal 1; that is why the map is pre-populated and the 'get' tests are the most important.

I really can't see how multiple people are arguing against this - it is almost like you are afraid to admit there is an underlying problem, and it is easier to just try and argue that "well, it works according to the design", but even that is an invalid argument.

I don't think you are benchmarking your shareshard.

https://github.com/robaho/go-concurrency-test/commit/aed8afc93795047fb30cb4337b4010e637752956

    impls := []go_concurrency.Cache{um, lm, sm, cm, sc, im, im2}
    names := []string{"unshared", "lock", "sync", "channel", "shard", "shareshard", "intmap", "intmap2"}

@as correct - bummer... I knew it was too good to be true... still trying to figure out where the slowness is coming from. I wrote a different "shared shard" originally, with my own atomic.Value of a standard map, and it worked well for reads, but the writes were incredibly slow.

** I deleted the comment with timings as it was noise...

OK, so I think I figured out why the performance of sync.Map is poor in this case. Because Map uses the standard map implementation, which isn't designed for CAS/concurrency, it has no choice but to add an extra layer of indirection in order to perform the CAS on an individual key/value pair. As the map working set increases, this indirection adds hundreds of nanoseconds to each lookup: the entry must be found (a map lookup), then the pointer loaded atomically from the entry, then that pointer dereferenced into the value - whereas the standard map pays only the first map retrieval. There is the 3x slowness.
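
A stripped-down sketch of the indirection being described, assuming the read-only map has already been loaded (the real sync.Map entry also encodes deleted/expunged states):

    package cache

    import (
        "sync/atomic"
        "unsafe"
    )

    // entry holds an atomically swapped pointer to the value, mirroring the
    // extra indirection sync.Map needs to CAS individual key/value pairs.
    type entry struct {
        p unsafe.Pointer // *int
    }

    type indirectMap struct {
        read map[int]*entry // the read-only map, loaded atomically in sync.Map
    }

    func (m *indirectMap) Get(k int) (int, bool) {
        e, ok := m.read[k] // cost 1: map lookup to find the entry
        if !ok {
            return 0, false
        }
        p := atomic.LoadPointer(&e.p) // cost 2: atomic load of the value pointer
        if p == nil {
            return 0, false
        }
        return *(*int)(p), true // cost 3: dereference into the value
    }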

This is why the newly created "shared sharded" map, which is just multiple map[int]int shards, performs the same as the lock map for concurrent reads and outperforms it for concurrent writes (the first shard lookup will always be in the cache, followed by a single map lookup).
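
A minimal sketch of such a sharded lock map (names and shard count are illustrative, not the repository's actual code):

    package cache

    import "sync"

    const nShards = 16

    // ShardedMap: many small locked maps; the shard is chosen by a cheap
    // function of the key, so concurrent readers rarely share a mutex.
    type ShardedMap struct {
        shards [nShards]struct {
            sync.RWMutex
            m map[int]int
        }
    }

    func NewShardedMap() *ShardedMap {
        s := &ShardedMap{}
        for i := range s.shards {
            s.shards[i].m = make(map[int]int)
        }
        return s
    }

    func (s *ShardedMap) Get(k int) int {
        sh := &s.shards[uint(k)%nShards]
        sh.RLock()
        v := sh.m[k]
        sh.RUnlock()
        return v
    }

    func (s *ShardedMap) Put(k, v int) {
        sh := &s.shards[uint(k)%nShards]
        sh.Lock()
        sh.m[k] = v
        sh.Unlock()
    }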

Here are the performance numbers:

With 2 concurrent goroutines (multi):
BenchmarkMain/shareshard.get-8          20000000           112 ns/op
BenchmarkMain/shareshard.put-8          10000000           127 ns/op
BenchmarkMain/shareshard.putget-8       10000000           239 ns/op
BenchmarkMain/shareshard.multiget-8     10000000           123 ns/op
BenchmarkMain/shareshard.multiput-8     10000000           147 ns/op
BenchmarkMain/shareshard.multiputget-8           5000000           294 ns/op

and with 8 concurrent goroutines (multi):
BenchmarkMain/shareshard.get-8          20000000           114 ns/op
BenchmarkMain/shareshard.put-8          10000000           130 ns/op
BenchmarkMain/shareshard.putget-8        5000000           252 ns/op
BenchmarkMain/shareshard.multiget-8     10000000           188 ns/op
BenchmarkMain/shareshard.multiput-8      5000000           284 ns/op
BenchmarkMain/shareshard.multiputget-8           2000000           846 ns/op

which is far superior to the standard "lock" and "sync" maps, but still 20% slower in the common read-only case.

Which is why I am going to write a shared intmap that uses CAS at the table-row level... stay tuned.
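
For reference, a rough sketch of what a fixed-size, CAS-per-slot int map can look like (open addressing, zero reserved as the empty key, no resizing or deletion; illustrative only, not the code later added to the repo):

    package cache

    import "sync/atomic"

    const tableSize = 1 << 21 // fixed power-of-two capacity (sketch only)

    // CASIntMap claims each slot's key with CompareAndSwap and reads/writes
    // values with atomic loads/stores. Assumes keys are non-zero and the
    // table never fills; a reader may briefly see a freshly claimed key
    // before its first value lands, which is fine for a pre-populated cache.
    type CASIntMap struct {
        keys [tableSize]int64
        vals [tableSize]int64
    }

    func (m *CASIntMap) Put(k, v int64) {
        for i := uint64(k); ; i++ { // identity "hash", linear probing
            kp := &m.keys[i&(tableSize-1)]
            cur := atomic.LoadInt64(kp)
            if cur == k || (cur == 0 && atomic.CompareAndSwapInt64(kp, 0, k)) {
                atomic.StoreInt64(&m.vals[i&(tableSize-1)], v)
                return
            }
            if cur == 0 {
                i-- // lost the CAS race; re-check the same slot
            }
        }
    }

    func (m *CASIntMap) Get(k int64) (int64, bool) {
        for i := uint64(k); ; i++ {
            switch atomic.LoadInt64(&m.keys[i&(tableSize-1)]) {
            case k:
                return atomic.LoadInt64(&m.vals[i&(tableSize-1)]), true
            case 0:
                return 0, false // hit an empty slot: key is absent
            }
        }
    }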

The "shared intmap" shows the best true concurrent performance so far:

with 2 goroutines:
BenchmarkMain/sharedint.get-8                   20000000            61.4 ns/op
BenchmarkMain/sharedint.put-8                   10000000           166 ns/op
BenchmarkMain/sharedint.putget-8                10000000           199 ns/op
BenchmarkMain/sharedint.multiget-8              20000000            64.0 ns/op
BenchmarkMain/sharedint.multiput-8              10000000           170 ns/op
BenchmarkMain/sharedint.multiputget-8           10000000           218 ns/op

with 8 goroutines:
BenchmarkMain/sharedint.get-8                   20000000            59.0 ns/op
BenchmarkMain/sharedint.put-8                   10000000           167 ns/op
BenchmarkMain/sharedint.putget-8                10000000           198 ns/op
BenchmarkMain/sharedint.multiget-8              20000000           116 ns/op
BenchmarkMain/sharedint.multiput-8               5000000           233 ns/op
BenchmarkMain/sharedint.multiputget-8            5000000           404 ns/op

which suggests that the performance problems with sync.Map are fundamental. I believe the only way to fix them would be to write a much lower-level sync.Map that mimics the internal map but supports concurrent access, with behavior similar to Java's ConcurrentHashMap - probably the most often used structure in Java when developing concurrent applications.

The current sync.Map implementation, which layers on top of the existing internal map, is not going to work, since that map was not designed for concurrent access and modification.

Writing your own concurrent maps is also very cumbersome, as the internal "hash code" functionality is not available, which means all hashing code needs to be duplicated.

Lastly, I wrote a true CAS-based shared intmap that is fully concurrent (although fixed size, no removals, etc.). The performance numbers:

with 2 goroutines:
BenchmarkMain/sharedint.get-8           30000000            53.5 ns/op
BenchmarkMain/sharedint.put-8           10000000           126 ns/op
BenchmarkMain/sharedint.putget-8        10000000           198 ns/op
BenchmarkMain/sharedint.multiget-8      30000000            56.3 ns/op
BenchmarkMain/sharedint.multiput-8      10000000           128 ns/op
BenchmarkMain/sharedint.multiputget-8   10000000           208 ns/op

with 8 goroutines:
BenchmarkMain/sharedint.get-8           30000000            54.1 ns/op
BenchmarkMain/sharedint.put-8           10000000           128 ns/op
BenchmarkMain/sharedint.putget-8        10000000           202 ns/op
BenchmarkMain/sharedint.multiget-8      20000000            97.2 ns/op
BenchmarkMain/sharedint.multiput-8      10000000           158 ns/op
BenchmarkMain/sharedint.multiputget-8    5000000           224 ns/op

The performance numbers show what is possible with a true, fully concurrent map. There is almost no degradation as concurrency rises (most of what remains can be attributed to the test machine having only 4 real cores, 4 hyper-threaded).

The sync.Map implementation needs to be updated to use a similar methodology, leveraging the internal map/hash support functions.

As it is now, a "sharded lock map" is a better solution than sync.Map for read-heavy use cases, as the "shared intmap" would require lots of custom code for general usage.

For summary analysis, here are the complete timings using 8 goroutines for the multi tests:

BenchmarkMain/unshared.get-8            20000000            87.5 ns/op
BenchmarkMain/unshared.put-8            20000000            97.6 ns/op
BenchmarkMain/unshared.putget-8         10000000           179 ns/op
BenchmarkMain/unshared.multiget-8       10000000           131 ns/op
BenchmarkMain/lock.get-8                20000000           111 ns/op
BenchmarkMain/lock.put-8                10000000           125 ns/op
BenchmarkMain/lock.putget-8             10000000           235 ns/op
BenchmarkMain/lock.multiget-8            5000000           379 ns/op
BenchmarkMain/lock.multiput-8            1000000          2485 ns/op
BenchmarkMain/lock.multiputget-8         1000000          2401 ns/op
BenchmarkMain/sync.get-8                 5000000           294 ns/op
BenchmarkMain/sync.put-8                 5000000           317 ns/op
BenchmarkMain/sync.putget-8              2000000           699 ns/op
BenchmarkMain/sync.multiget-8            3000000           504 ns/op
BenchmarkMain/sync.multiput-8            3000000           593 ns/op
BenchmarkMain/sync.multiputget-8         1000000          1035 ns/op
BenchmarkMain/channel.get-8              2000000           955 ns/op
BenchmarkMain/channel.put-8              3000000           564 ns/op
BenchmarkMain/channel.putget-8           1000000          1460 ns/op
BenchmarkMain/channel.multiget-8          200000          6345 ns/op
BenchmarkMain/channel.multiput-8          300000          4755 ns/op
BenchmarkMain/channel.multiputget-8       200000          9575 ns/op
BenchmarkMain/shard.get-8               20000000           100 ns/op
BenchmarkMain/shard.put-8               20000000           101 ns/op
BenchmarkMain/shard.putget-8            10000000           197 ns/op
BenchmarkMain/shard.multiget-8          10000000           146 ns/op
BenchmarkMain/shareshard.get-8          20000000           112 ns/op
BenchmarkMain/shareshard.put-8          10000000           128 ns/op
BenchmarkMain/shareshard.putget-8        5000000           241 ns/op
BenchmarkMain/shareshard.multiget-8     10000000           178 ns/op
BenchmarkMain/shareshard.multiput-8      5000000           276 ns/op
BenchmarkMain/shareshard.multiputget-8           2000000           879 ns/op
BenchmarkMain/intmap.get-8                      20000000           105 ns/op
BenchmarkMain/intmap.put-8                      10000000           212 ns/op
BenchmarkMain/intmap.putget-8                    5000000           299 ns/op
BenchmarkMain/intmap.multiget-8                 10000000           183 ns/op
BenchmarkMain/intmap.multiput-8                  5000000           258 ns/op
BenchmarkMain/intmap.multiputget-8               3000000           456 ns/op
BenchmarkMain/intmap2.get-8                     30000000            53.7 ns/op
BenchmarkMain/intmap2.put-8                     10000000           130 ns/op
BenchmarkMain/intmap2.putget-8                  10000000           200 ns/op
BenchmarkMain/intmap2.multiget-8                20000000            99.4 ns/op
BenchmarkMain/intmap2.multiput-8                10000000           151 ns/op
BenchmarkMain/intmap2.multiputget-8             10000000           228 ns/op
BenchmarkMain/sharedint.get-8                   30000000            54.5 ns/op
BenchmarkMain/sharedint.put-8                   10000000           127 ns/op
BenchmarkMain/sharedint.putget-8                10000000           206 ns/op
BenchmarkMain/sharedint.multiget-8              20000000            96.1 ns/op
BenchmarkMain/sharedint.multiput-8              10000000           152 ns/op
BenchmarkMain/sharedint.multiputget-8           10000000           221 ns/op

I don't really understand what a thumbs-down means in this context. Why is that even an option on bug reports? If you disagree with the conclusions, then offer an alternative. I stand by them. The only way to fix sync.Map requires a different data structure, with reuse of the standard map internals.

The performance of sync.Map for reads should be nearly equal to that of the standard built-in map. This is the demonstrated behavior of the comparable Java ConcurrentHashMap.

Parity with the built-in map wouldn't tell us much: would it mean that sync.Map is sufficiently fast, or that the built-in map is much too slow?

Have you run benchstat on these numbers? It will help.

@bcmills I am assuming that the built-in map has been sufficiently tuned, given its pervasiveness, that parity with it would be the goal. @rlh not sure what benchstat would offer. The results have been reproduced by @as. They are valid.

The sync.Map implementation needs to be updated to use a similar methodology, leveraging the internal map/hash support functions.

That's #21031.

Also note that your benchmarks add one more layer of pointer indirection when using sync.Map, but not when using built-in maps: the built-in map operates on typed values (and you are using int values for your benchmarks), but sync.Map operates on interface{} keys and values, which imposes an additional allocation and pointer indirection. A more efficient solution should avoid interface overhead, but — without making sync.Map a built-in type — that would require language support for generic containers (#15292).
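
A sketch of that difference, assuming int keys and values as in the benchmarks above:

    package cache_test

    import (
        "sync"
        "testing"
    )

    // sync.Map traffics in interface{} keys and values: each Store/Load boxes
    // the int, and each read needs a type assertion to get the int back.
    func BenchmarkSyncMapInt(b *testing.B) {
        var m sync.Map
        m.Store(12345, 67890)
        for i := 0; i < b.N; i++ {
            v, _ := m.Load(12345)
            _ = v.(int)
        }
    }

    // The built-in map is typed, so keys and values stay unboxed.
    func BenchmarkTypedMapInt(b *testing.B) {
        m := map[int]int{12345: 67890}
        for i := 0; i < b.N; i++ {
            _ = m[12345]
        }
    }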

Also note that much of the overhead that sync.Map avoids comes from cache contention, which can make individual operations O(N) with the number of physical cores: with N cores executing those operations concurrently, you can get O(N²) behavior.

N=4, as you are using for benchmarking, is low enough that the constant factors likely still dominate.

(The original sync.Map implementation was benchmarked on a machine with 24 physical cores, and the tradeoff point for many of the concrete call sites occurred at around 8 cores.)

Writing your own concurrent maps is also very cumbersome, as the internal "hash code" functionality is not available, which means all hashing code needs to be duplicated.

See #21195.

As it is now, a "sharded lock map" is a better solution than sync.Map for read-heavy use cases,

Sharding is one of several possible solutions proposed in #21035.

@bcmills thanks for your input - it seems the Go team has a good handle on the issues. I'm not sure about the cache-contention claim - I tested with 8x concurrency - and I'm more concerned with the "read only" workload, but that is minor.

My bad for not finding these issues before filing this one - I just figured that for such an important, common data structure a lot of this would have already been addressed, so I was looking for more esoteric causes.

I would suggest that the internal Go map benchmarks be enhanced to test larger maps, where cache locality and the number of indirections surface as a performance problem, as shown here.

I guess this can be closed as a duplicate, given @bcmills's references. Given the lack of generics, I think the cited issues should be given higher priority - this is just too important a data structure to exhibit such poor performance.

Thanks. Closing as a duplicate of #21031, #21035, and #15292.

If you're interested in concurrent data structures, you're certainly welcome to take a run at those.

(Personally, I use sync.Map so infrequently that I still have to look up the API every time, even though I implemented it. It's an important use-case when it does occur, but it doesn't seem to occur all that frequently in Go programs.)

@bcmills as I mentioned earlier, large "mainly read" shared maps are an important structure in many financial applications.

As a final closing to this, I'll report the standard cases (idiomatic locks in Go, sync.Map, and Java's ConcurrentHashMap) with 6x concurrency for multi and a 2^20 map using the full key range:

BenchmarkMain/lock.get-8            20000000           110 ns/op
BenchmarkMain/lock.put-8            10000000           129 ns/op
BenchmarkMain/lock.putget-8          5000000           240 ns/op
BenchmarkMain/lock.multiget-8        5000000           328 ns/op
BenchmarkMain/lock.multiput-8        1000000          1549 ns/op
BenchmarkMain/lock.multiputget-8     1000000          1819 ns/op
BenchmarkMain/sync.get-8             5000000           287 ns/op
BenchmarkMain/sync.put-8             5000000           327 ns/op
BenchmarkMain/sync.putget-8          2000000           706 ns/op
BenchmarkMain/sync.multiget-8        3000000           486 ns/op
BenchmarkMain/sync.multiput-8        3000000           535 ns/op
BenchmarkMain/sync.multiputget-8     2000000           957 ns/op

Benchmark                            (arg)  Mode  Cnt    Score    Error  Units
TestJavaCache.Test0Get          concurrent  avgt    5   72.301 ±  2.706  ns/op
TestJavaCache.Test2Put          concurrent  avgt    5  168.011 ± 12.797  ns/op
TestJavaCache.Test3PutGet       concurrent  avgt    5  293.249 ± 31.050  ns/op
TestJavaCache.Test4MultiGet     concurrent  avgt    5  121.128 ±  5.543  ns/op
TestJavaCache.Test5MultiPut     concurrent  avgt    5  261.543 ± 14.179  ns/op
TestJavaCache.Test6MultiPutGet  concurrent  avgt    5  497.539 ± 34.685  ns/op