Guava: Cache performance

Created on 29 May 2015 · 7Comments · Source: google/guava

Guava's cache performance has a lot of potential for improvement, at the cost of a higher initial memory footprint. The benchmark, with the raw numbers below, shows Guava's performance on a 16-core Xeon E5-2698B v3 @ 2.00GHz (hyperthreading disabled) running Ubuntu 15.04 with JDK 1.8.0_45.

Of note is that Guava's default concurrency level (4) results in similar performance to a synchronized LinkedHashMap. This is due to the cache's overhead, in particular the GC thrashing from ConcurrentLinkedQueue and a hot read counter. It should also be noted that at the time of development synchronization was slow in JDK5, but significantly improved in JDK6_22 and beyond. This and other improvements help LinkedHashMap's performance compared to what was observed when Guava's cache was in development.

The easiest way to double Guava's performance would be to replace the recencyQueue and readCount with a ring buffer. This would replace an unbounded linked queue with a 16-element array. When the buffer is full then a clean-up is triggered and subsequent additions are skipped (no CAS) until slots become available. This could be adapted from this implementation.

Given the increase in server memory, it may also be time to reconsider the default concurrency level. Early design concerns included the memory overhead, as most caches were not highly contended and the value could be adjusted as needed. As the number of cores and system memory has increased, it may be worth revisiting the default setting.

Read (100%)

| Unbounded | ops/s (8 threads) | ops/s (16 threads) |
| :-: | :-: | :-: |
| ConcurrentHashMap (v8) | 560,367,163 | 1,171,389,095 |
| ConcurrentHashMap (v7) | 301,331,240 | 542,304,172 |
| | | |
| Bounded | | |
| Caffeine | 181,703,298 | 365,508,966 |
| ConcurrentLinkedHashMap | 154,771,582 | 313,892,223 |
| LinkedHashMap_Lru | 9,209,065 | 13,598,576 |
| Guava (default) | 12,434,655 | 10,647,238 |
| Guava (64) | 24,533,922 | 43,101,468 |
| Ehcache2_Lru | 11,252,172 | 20,750,543 |
| Ehcache3_Lru | 11,415,248 | 17,611,169 |
| Infinispan_Old_Lru | 29,073,439 | 49,719,833 |
| Infinispan_New_Lru | 4,888,027 | 4,749,506 |

Read (75%) / Write (25%)

| Unbounded | ops/s (8 threads) | ops/s (16 threads) |
| :-: | :-: | :-: |
| ConcurrentHashMap (v8) | 441,965,711 | 790,602,730 |
| ConcurrentHashMap (v7) | 196,215,481 | 346,479,582 |
| | | |
| Bounded | | |
| Caffeine | 112,622,075 | 235,178,775 |
| ConcurrentLinkedHashMap | 63,968,369 | 122,342,605 |
| LinkedHashMap_Lru | 8,668,785 | 12,779,625 |
| Guava (default) | 11,782,063 | 11,886,673 |
| Guava (64) | 22,782,431 | 37,332,090 |
| Ehcache2_Lru | 9,472,810 | 8,471,016 |
| Ehcache3_Lru | 10,958,697 | 17,302,523 |
| Infinispan_Old_Lru | 22,663,359 | 37,270,102 |
| Infinispan_New_Lru | 4,753,313 | 4,885,061 |

Write (100%)

| Unbounded | ops/s (8 threads) | ops/s (16 threads) |
| :-: | :-: | :-: |
| ConcurrentHashMap (v8) | 60,477,550 | 50,591,346 |
| ConcurrentHashMap (v7) | 46,204,091 | 36,659,485 |
| | | |
| Bounded | | |
| Caffeine | 55,281,751 | 47,482,019 |
| ConcurrentLinkedHashMap | 23,819,597 | 39,797,969 |
| LinkedHashMap_Lru | 10,179,891 | 10,859,549 |
| Guava (default) | 4,764,056 | 5,446,282 |
| Guava (64) | 8,128,024 | 7,483,986 |
| Ehcache2_Lru | 4,205,936 | 4,697,745 |
| Ehcache3_Lru | 10,051,020 | 13,939,317 |
| Infinispan_Old_Lru | 7,538,859 | 7,332,973 |
| Infinispan_New_Lru | 4,797,502 | 5,086,305 |

P3 package=cache type=performance

Source

ben-manes

👍3

Most helpful comment

A quick and dirty hack to Guava resulted in up to 25x read performance gain at the default concurrency level (4).

| Benchmark | ops/s (8 threads) | ops/s (16 threads) |
| :-: | :-: | :-: |
| Read (100%) | 121,808,437 | 260,593,210 |
| Read (75%) / Write (25%) | 34,770,671 | 48,538,555 |
| Write (100%) | 5,064,388 | 6,045,980 |

We can increase the concurrency level (64) for better write throughput, but unsurprisingly this decreases the read throughput. This is because the cpu caching effects work against us. ConcurrentHashMap reads are faster with fewer segments (576M ops/s) and Guava's recencyQueue is associated to the segment instead of the processor. The queue fills up slower, making it more often contended and less often in the L1/L2 cache. We still see a substantial gain, though.

| Benchmark | ops/s (8 threads) | ops/s (16 threads) |
| :-: | :-: | :-: |
| Read (100%) | 40,935,985 | 84,109,039 |
| Read (75%) / Write (25%) | 40,854,227 | 64,462,500 |
| Write (100%) | 6,757,178 | 6,570,233 |

Because we are no longer thrashing on readCounter, the compute performance increases substantially as well. Here we use 32 threads with Guava at the default concurrency level.

| Computer | sameKey ops/s | spread ops/s |
| :-: | :-: | :-: |
| ConcurrentHashMap (v8) | 30,321,032 | 64,427,143 |
| Guava (current) | 23,149,556 | 71,423,966 |
| Guava (patched) | 329,220,794 | 166,448,754 |
| Caffeine | 1,558,302,420 | 533,181,707 |

I have been advocating this improvement since we first began, originally assumed that we'd get to it in a performance iteration. After leaving I didn't have a good threaded benchmark to demonstrate the improvements, so my hand wavy assertions were ignored. There are still significant gains when moving to the Java 8 rewrite, but these changes are platform compatible and non-invasive.

I don't know why I even bother anymore...

ben-manes on 31 May 2015

👍4

All 7 comments

Is the "op" a read, update or insert? The latest stuff I did in cache2k only have a bottleneck for the insert operation, which implies structural changes and therefor needs locking. All other operations run fully concurrently. Can you maybe rerun your benchmarks with a cache2k setup I give you?

cruftex on 29 May 2015

You can send a pull request to add a new cache type to the GetPutBenchmark. I looked at cache2k a while back and am not confident in it being threadsafe, so it may perform well but corrupt itself.

ben-manes on 29 May 2015

I looked at cache2k a while back and am not confident in it being threadsafe, so it may perform well but corrupt itself.

Yes, quite normal reaction. I regularly have these concerns, when I look at a code area that I had not touched for some months. I walk through the design and the JMM again and finally find good sleep. So far we had no corruption in production, even for releases marked as experimental.

cruftex on 30 May 2015

I know adding to an unfamiliar code base is confusing, so I made an addition that I'll send you a pull request to review. Stepping through I see where some of my initial concerns regarding races are addressed for the LruCache.

On my laptop I get the following results, which matches the Desktop-class benchmarks section.

| Benchmark | ops/s (8 threads) |
| :-: | :-: |
| Read (100%) | 8,864,382 |
| Read (75%) / Write (25%) | 8,064,139 |
| Write (100%) | 2,489,101 |

The performance matches what I'd expect from top-level lock, which appears to be how the LRU is implemented. I haven't looked at why writes are slower, since they should take the same amount of time. This may be due to synchronization or degradation in the custom hash table.

ben-manes on 31 May 2015

A quick and dirty hack to Guava resulted in up to 25x read performance gain at the default concurrency level (4).

Because we are no longer thrashing on readCounter, the compute performance increases substantially as well. Here we use 32 threads with Guava at the default concurrency level.

I don't know why I even bother anymore...

ben-manes on 31 May 2015

👍4

I'm going to try to find time to take a crack at this, though I'm not 100% confident when I'll be able to.