Elasticsearch: Disable TCP compression when shipping Lucene segments during peer recovery

Created on 19 Sep 2018 · 19 comments · Source: elastic/elasticsearch

The code that ships Lucene segments during peer recovery explicitly sets the transport compress option to false:
https://github.com/elastic/elasticsearch/blob/7f473b683d4f7426d3aa13cb618b6330a09ba309/server/src/main/java/org/elasticsearch/indices/recovery/RemoteRecoveryTargetHandler.java#L70-L71

Unfortunately this is overridden when transport.tcp.compress is enabled, discarding the request-level option:
https://github.com/elastic/elasticsearch/blob/7f473b683d4f7426d3aa13cb618b6330a09ba309/server/src/main/java/org/elasticsearch/transport/TcpTransport.java#L873-L875

This means that when transport.tcp.compress is enabled, the (already compressed) Lucene files are unnecessarily compressed a second time, slowing peer recovery.

My preference would be that transport options that were explicitly set (withCompress(true) or withCompress(false)) take precedence over transport.tcp.compress.
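To make the proposal concrete, here is a minimal sketch of that precedence rule (this is not the actual TcpTransport code; the Compression enum and resolveCompress helper are illustrative names). The key point is that the per-request option needs to be tri-state, so an explicit withCompress(false) can be distinguished from "not set":

```java
// Illustrative sketch only: let an explicitly chosen per-request option win over
// the node-wide transport.tcp.compress setting. Names here are hypothetical.
enum Compression { TRUE, FALSE, UNSET }

final class CompressionResolver {
    /**
     * @param requestOption value set via withCompress(...) on the request, or UNSET
     * @param nodeSetting   value of transport.tcp.compress
     * @return whether this particular message should be compressed
     */
    static boolean resolveCompress(Compression requestOption, boolean nodeSetting) {
        // Only fall back to the global setting when nothing was set on the request;
        // an explicit true or false on the request always wins.
        if (requestOption == Compression.UNSET) {
            return nodeSetting;
        }
        return requestOption == Compression.TRUE;
    }
}
```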

Relates to https://discuss.elastic.co/t/elasticsearch-6-3-0-shard-recovery-is-slow/140940/

:Distributed/Network :Distributed/Recovery >enhancement Distributed


All 19 comments

Pinging @elastic/es-core-infra

Pinging @elastic/es-distributed

+1 to make transport options have precedence over the setting.

It's not completely obvious to me that Lucene files don't need further compression: Lucene files are indeed compressed, but in a way that still provides quick (often random) access, which limits the effectiveness of the compression techniques that can be used. Re-compressing on top of Lucene might bring further space savings. It might still be the wrong trade-off, especially on a local network, but I wanted to clarify this point.

I agree that transport options set explicitly on the request should override the transport.tcp.compress setting. Like @jpountz, I am unsure that disabling compression on file-based recovery transfers is the right trade-off. I would like to see some experiments here. So we might end up with two changes:

  • request options override transport.tcp.compress
  • stop disabling compression on recovery

cc @original-brownbear @tbrooks8

@ywelsch @jasontedor I experimented with compression performance today, and there's hard data from @DaveCTurner in a linked issue (more details there). Both my experiments and that data suggest that compression severely limits the network throughput to around 20-30MB/s (~160-240Mbit/s) while saving only trivial amounts of network traffic (this was the result of some miscommunication, see next comment in a few min) => it seems reasonable to just fix the request options to override transport.tcp.compress and keep disabling compression on recovery.
Also, there is anecdotal evidence in https://discuss.elastic.co/t/elasticsearch-6-3-0-shard-recovery-is-slow/140940/14 suggesting that compression is an issue here in practice.

@ywelsch thanks for pointing out the issue with the data I worked from :)
There's definitely a traffic saving here from compression. I measured this with the example log dataset we provide in the docs, and it's about a 50% reduction in traffic from compression for the whole recovery process end-to-end.

This makes the situation a little closer I guess :)
Still, given the data we gathered on the runtime of recoveries and the reported issues, I think turning off compression wins overall (easier on the CPU and much faster on a 1Gbit link). It seems enabling compression really only helps in situations where the available bandwidth for recovery is limited.

We discussed this during FixitFriday today. While re-compression on top of Lucene can bring further space savings (needs more benchmarks, but numbers between 10% and 50% reduction in size were mentioned), it comes at the cost of additional CPU usage, which 1) can have a negative impact on indexing / search performance, as recovery is supposed to be a background task, and 2) due to the way we sequentially send files during recovery, can limit the throughput of single shard recoveries. While we're less enthusiastic about adding a new configuration option just for recoveries, we want to explore always disabling compression for phase 1 of recoveries, even if transport.tcp.compress is enabled. This also looks to be what the code currently intends to do, even though it's not actually working that way. The first step will therefore be to figure out whether this behavior was changed in a previous ES version.

@ywelsch I looked into 2.4 and it seems the same issue pointed out here is present there as well:

The global setting overrides the per-request option exactly as it still does; see https://github.com/elastic/elasticsearch/blob/2.4/core/src/main/java/org/elasticsearch/transport/netty/NettyTransport.java#L832.

So while we did at one point have the option (removed in https://github.com/elastic/elasticsearch/pull/15235) to turn compression off for recovery, it had no effect in practice with compression globally enabled, as far as I can tell.

=> seems like this issue isn't new at all.


That said, in the real world I'd still vote for turning off compression when shipping the segments, even when compression is globally enabled:

  • Keeping compression on here at best saves 50% of the traffic or speeds up recovery by 50%, assuming performance is gated by bandwidth.
  • Turning it off, on the other hand, gives you a ~10x speedup if bandwidth is not an issue.

    • And there have been a number of complaints now about the performance of recovery with compression enabled.

=> IMO it's consistent to turn off compression for data that gets little benefit out of it in this special case.
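As a rough back-of-envelope using only figures already mentioned in this thread (assumptions: a compressing sender limited to ~25MB/s as measured above, a 1Gbit link at ~120MB/s, a 10GB shard, and ~50% size reduction): the uncompressed transfer takes about 10GB / 120MB/s ≈ 85s, while the compressed transfer is gated by the sender at roughly 5GB / 25MB/s ≈ 200s (or worse, if the 20-30MB/s figure was measured in raw rather than on-wire bytes). On links faster than 1Gbit the gap only widens, which is where the ~10x figure comes from.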

Also:

request options override transport.tcp.compress

Regardless of what we decide about how we want to transfer the segment data, shouldn't we fix this so that the options actually override the global setting (or remove the option)?

Created a PR with the above suggestion #34959

I ran a compression test with different datasets and chunk sizes. These indices use the default codec and were not force-merged. Below are the results.

compress nyc_taxis chunk_size=512kb from 38.7gb to 27.8gb diff 28% took 32.4m
compress nyc_taxis chunk_size=1mb from 38.7gb to 27.7gb diff 28% took 32.5m
compress nyc_taxis chunk_size=2mb from 38.7gb to 27.7gb diff 28% took 32.7m
compress nyc_taxis chunk_size=4mb from 38.7gb to 27.7gb diff 28% took 32.7m

compress pmc chunk_size=512kb from 20.2gb to 16.9gb diff 16% took 19.2m
compress pmc chunk_size=1mb from 20.2gb to 16.9gb diff 16% took 19.4m
compress pmc chunk_size=2mb from 20.2gb to 16.9gb diff 16% took 19.4m
compress pmc chunk_size=4mb from 20.2gb to 16.9gb diff 16% took 19.1m

compress geonames chunk_size=512kb from 3.1gb to 2.1gb diff 30% took 2.2m
compress geonames chunk_size=1mb from 3.1gb to 2.1gb diff 30% took 2.3m
compress geonames chunk_size=2mb from 3.1gb to 2.1gb diff 30% took 2.2m
compress geonames chunk_size=4mb from 3.1gb to 2.1gb diff 30% took 2.2m

compress osmgeopoints chunk_size=512kb from 3.3gb to 2.6gb diff 22% took 2.6m
compress osmgeopoints chunk_size=1mb from 3.3gb to 2.6gb diff 22% took 2.5m
compress osmgeopoints chunk_size=2mb from 3.3gb to 2.6gb diff 22% took 2.7m
compress osmgeopoints chunk_size=4mb from 3.3gb to 2.6gb diff 22% took 2.7m

compress logs-191998 chunk_size=512kb from 677.4mb to 461mb diff 31% took 30.9s
compress logs-191998 chunk_size=1mb from 677.4mb to 460.8mb diff 31% took 29.6s
compress logs-191998 chunk_size=2mb from 677.4mb to 460.7mb diff 31% took 29.5s
compress logs-191998 chunk_size=4mb from 677.4mb to 460.6mb diff 32% took 29.9s

compress logs-201998 chunk_size=512kb from 909.6mb to 622mb diff 31% took 41.4s
compress logs-201998 chunk_size=1mb from 909.6mb to 621.7mb diff 31% took 39.7s
compress logs-201998 chunk_size=2mb from 909.6mb to 621.5mb diff 31% took 38.4s
compress logs-201998 chunk_size=4mb from 909.6mb to 621.4mb diff 31% took 40.5s

compress logs-231998 chunk_size=512kb from 890.1mb to 609.9mb diff 31% took 39.2s
compress logs-231998 chunk_size=1mb from 890.1mb to 609.6mb diff 31% took 39s
compress logs-231998 chunk_size=2mb from 890.1mb to 609.4mb diff 31% took 39.6s
compress logs-231998 chunk_size=4mb from 890.1mb to 609.3mb diff 31% took 39.3s

compress logs-211998 chunk_size=512kb from 1.1gb to 811.2mb diff 32% took 52.8s
compress logs-211998 chunk_size=1mb from 1.1gb to 810.8mb diff 32% took 52.2s
compress logs-211998 chunk_size=2mb from 1.1gb to 810.6mb diff 32% took 50.8s
compress logs-211998 chunk_size=4mb from 1.1gb to 810.5mb diff 32% took 52.4s

compress logs-181998 chunk_size=512kb from 162.2mb to 106.3mb diff 34% took 7s
compress logs-181998 chunk_size=1mb from 162.2mb to 106.3mb diff 34% took 6.8s
compress logs-181998 chunk_size=2mb from 162.2mb to 106.3mb diff 34% took 6.7s
compress logs-181998 chunk_size=4mb from 162.2mb to 106.3mb diff 34% took 6.9s

compress logs-241998 chunk_size=512kb from 9.3gb to 5.8gb diff 37% took 6.9m
compress logs-241998 chunk_size=1mb from 9.3gb to 5.8gb diff 37% took 6.9m
compress logs-241998 chunk_size=2mb from 9.3gb to 5.8gb diff 37% took 6.9m
compress logs-241998 chunk_size=4mb from 9.3gb to 5.8gb diff 37% took 6.9m

compress logs-221998 chunk_size=512kb from 740.6mb to 501.4mb diff 32% took 33.5s
compress logs-221998 chunk_size=1mb from 740.6mb to 501.2mb diff 32% took 31.2s
compress logs-221998 chunk_size=2mb from 740.6mb to 501mb diff 32% took 32.2s
compress logs-221998 chunk_size=4mb from 740.6mb to 501mb diff 32% took 32.6s

The details can be found here.
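The benchmark harness itself isn't shown in the thread; purely for reference, a minimal standalone sketch of a chunked DEFLATE compression test along these lines (the ChunkCompressionTest name, chunk size, and compression level are assumptions, not the harness actually used for the numbers above) could look like this:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.Deflater;

/**
 * Rough sketch: read an index file in fixed-size chunks, DEFLATE each chunk
 * independently (as the transport layer would for file chunk requests), and
 * report the total compressed size and elapsed time.
 */
public class ChunkCompressionTest {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]);
        int chunkSize = 512 * 1024;               // e.g. 512kb, 1mb, 2mb, 4mb
        byte[] chunk = new byte[chunkSize];
        byte[] out = new byte[chunkSize + 1024];  // headroom for incompressible chunks
        long rawBytes = 0, compressedBytes = 0;
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        long start = System.nanoTime();
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(chunk)) > 0) {
                rawBytes += read;
                deflater.reset();                 // reuse the deflater per chunk
                deflater.setInput(chunk, 0, read);
                deflater.finish();
                while (deflater.finished() == false) {
                    compressedBytes += deflater.deflate(out);
                }
            }
        } finally {
            deflater.end();
        }
        long tookMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("from %d to %d bytes, diff %d%%, took %d ms%n",
            rawBytes, compressedBytes,
            rawBytes == 0 ? 0 : 100 - (compressedBytes * 100 / rawBytes), tookMillis);
    }
}
```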

/cc @ywelsch

@dnhatn thanks for running this test. A few observations:

  • Increasing chunk_size does not appear to help achieve better compression.
  • There is considerable CPU overhead with compression enabled relative to the network bandwidth saved (around 30%) when sending file chunk requests. On a fast network connection, this can significantly slow down peer recoveries when compression is enabled.

Another observation is that the compression ratio varies a lot depending on the extension.

Out of curiosity, would it be possible to decide dynamically whether to enable compression, e.g. by starting by compressing half the chunks and then decreasing or increasing this ratio depending on whichever performs faster?
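No such mechanism exists in Elasticsearch; purely as an illustration of the idea above, a sketch of an adaptive policy (AdaptiveCompressionPolicy and its method names are hypothetical) that compresses a fraction of the chunks and nudges that fraction toward whichever path currently yields higher effective throughput might look like:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the adaptive-compression idea: track smoothed throughput
// (raw bytes per unit of wall-clock time) for compressed and uncompressed chunks,
// and shift the compressed fraction toward whichever path is currently faster.
final class AdaptiveCompressionPolicy {
    private double compressedFraction = 0.5;     // start by compressing half the chunks
    private double compressedThroughput = 0;     // smoothed bytes-per-nano estimates
    private double uncompressedThroughput = 0;
    private static final double STEP = 0.05;     // how quickly the fraction moves
    private static final double ALPHA = 0.2;     // exponential smoothing factor

    /** Decide whether the next chunk should be compressed. */
    synchronized boolean shouldCompressNextChunk() {
        return ThreadLocalRandom.current().nextDouble() < compressedFraction;
    }

    /** Report how long it took to (optionally compress and) send one chunk. */
    synchronized void onChunkSent(boolean compressed, long rawBytes, long elapsedNanos) {
        double sample = (double) rawBytes / Math.max(1, elapsedNanos);
        if (compressed) {
            compressedThroughput = smooth(compressedThroughput, sample);
        } else {
            uncompressedThroughput = smooth(uncompressedThroughput, sample);
        }
        if (compressedThroughput > uncompressedThroughput) {
            compressedFraction = Math.min(1.0, compressedFraction + STEP);
        } else if (uncompressedThroughput > compressedThroughput) {
            compressedFraction = Math.max(0.0, compressedFraction - STEP);
        }
    }

    private static double smooth(double current, double sample) {
        return current == 0 ? sample : (1 - ALPHA) * current + ALPHA * sample;
    }
}
```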

@dnhatn What was the network bandwidth between the nodes in your benchmark? Or was it localhost?

@jasontedor It's a 12GiB connection between two GCP instances.

Have you done benchmarks on slower networks or networks with high-latency?

@jasontedor Not yet. The above result is a pure compression test.

I think an option to disable transport.compress for peer recovery ONLY would be very useful, especially now that ILM uses peer recovery to handle rollover.

Some metrics: we index about 500K docs per second, 24/7, and rely on transport.compress in order not to go bankrupt on EC2 network transfer costs during online replication. This saves us nearly 2/3 on transfer!! But during rollover, transport.compress brings half the cluster to a standstill due to DEFLATE pegging CPU at ~90%, while not reducing the overall network transfer by any more than Lucene compression already does.

Our use case would benefit greatly from separate options for compression during online/offline replication and recovery.
