Presto: Workers not releasing idle S3 connections to the pool when reading Avro files

Created on 15 Oct 2020  路  16Comments  路  Source: prestosql/presto

Presto 343 - 1 coordinator + 4 workers
hive metastore+postgresql - avro files on s3
We've modified the following hive.properties

hive.s3.max-connections = 10000
hive.s3select-pushdown.max-connections = 10000
hive.s3.connect-timeout=3m
hive.s3.socket-timeout=3m

We see the connection pool progressively use all connections, never releasing them (after several days). Eventually the pool is depleted and no further work with s3 files is possible until the node service is restarted.

bug

Most helpful comment

Btw I validated that cherry-picking HIVE-22981 fixes the issue. So big thanks @rdsr for pointing towards that issue.

All 16 comments

Do you observe this with any file format other than Avro?

cc @electrum @losipiuk

We only see this with avro files. ORC files we do no t see the connection counts change at all.

Are there possible fixes in the hive and hadoop repos here on the prestosql github that we should be aware of? Currently we are using the hadoop 3.3.0 and hive 3.1.2 tarballs from apache.

EDIT: ignore this, I did not realize this was for the presto jar files.

I was able to reproduce this both in our dev AWS environment (on accident while trying to deploy a sync of our fork with 343) and locally.

For the local instance I setup an S3 bucket using MinIO and a local hive metastore instance, created two tables where the only difference was the format (ORC vs. Avro), inserted junk data into both, then hooked up jConsole to observe the value of presto.plugin.hive.s3.PrestoS3FileSystem.hive.hive.Attributes.ActiveConnections.TotalCount.

Here are the commands I used to setup the tables, insert data, and issue reads.


Steps to reproduce

CREATE SCHEMA IF NOT EXISTS hive.test WITH (location = 's3a://a-bucket/');

CREATE TABLE IF NOT EXISTS hive.test.avro (id bigint, txt varchar) WITH (format = 'AVRO');

INSERT INTO hive.test.avro VALUES (1, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (2, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (3, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (4, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (5, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (6, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (7, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (8, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (9, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (10, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (11, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (12, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (13, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (14, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (15, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (16, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (17, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (18, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (19, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (20, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (21, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (22, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (23, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (24, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (25, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (26, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (27, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (28, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (29, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (30, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (31, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (32, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (33, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (34, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (35, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (36, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (37, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (38, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (39, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (40, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (41, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (42, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (43, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (44, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (45, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (46, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (47, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (48, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (49, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (50, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (51, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (52, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (53, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (54, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (55, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (56, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (57, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (58, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.avro VALUES (59, 'asdflkjasdf;lkjasfd');

select * from hive.test.avro;
# Check ActiveConnections.TotalCount in jConsole
select * from hive.test.avro;
# Check ActiveConnections.TotalCount in jConsole and compare before


CREATE TABLE IF NOT EXISTS hive.test.orc (id bigint, txt varchar) WITH (format = 'ORC');

INSERT INTO hive.test.orc VALUES (1, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (2, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (3, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (4, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (5, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (6, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (7, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (8, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (9, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (10, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (11, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (12, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (13, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (14, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (15, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (16, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (17, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (18, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (19, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (20, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (21, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (22, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (23, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (24, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (25, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (26, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (27, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (28, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (29, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (30, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (31, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (32, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (33, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (34, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (35, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (36, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (37, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (38, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (39, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (40, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (41, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (42, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (43, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (44, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (45, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (46, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (47, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (48, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (49, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (50, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (51, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (52, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (53, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (54, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (55, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (56, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (57, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (58, 'asdflkjasdf;lkjasfd');
INSERT INTO hive.test.orc VALUES (59, 'asdflkjasdf;lkjasfd');

select * from hive.test.orc;
# Check ActiveConnections.TotalCount in jConsole
select * from hive.test.orc;
# Check ActiveConnections.TotalCount in jConsole and compare before

A few interesting observations:

  1. The leak only happens with Reads. Writes do not appear to leak connections.
  2. The specific Avro schema doesn't seem to matter, just that its Avro.
  3. Amount of data in the bucket and size of the files does not seem to make any difference.

I've tested this against several current releases. This bug appears to have been introduced in 341. All previous releases including 340 do not have this bug. This bug also continues to exist as of release 346.

I'm also including the stack trace we receive once the connection pool runs out of available connections. The specific exception is what you would expect to see after running out of connections. However, after spending some time running with a debugger attached, I believe this might also be the code path where the leaks are happening. Hopefully it can provide some clues as to the root cause.


Stack Trace

com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1201)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1147)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:796)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:764)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:738)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:698)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:680)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:544)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:524)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5054)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5000)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1335)
    at io.prestosql.plugin.hive.s3.PrestoS3FileSystem.lambda$getS3ObjectMetadata$5(PrestoS3FileSystem.java:704)
    at io.prestosql.plugin.hive.util.RetryDriver.run(RetryDriver.java:130)
    at io.prestosql.plugin.hive.s3.PrestoS3FileSystem.getS3ObjectMetadata(PrestoS3FileSystem.java:701)
    at io.prestosql.plugin.hive.s3.PrestoS3FileSystem.getS3ObjectMetadata(PrestoS3FileSystem.java:685)
    at io.prestosql.plugin.hive.s3.PrestoS3FileSystem.getFileStatus(PrestoS3FileSystem.java:378)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
    at org.apache.avro.mapred.FsInput.<init>(FsInput.java:43)
    at org.apache.avro.mapred.FsInput.<init>(FsInput.java:38)
    at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.<init>(AvroGenericRecordReader.java:96)
    at org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat.getRecordReader(AvroContainerInputFormat.java:51)
    at io.prestosql.plugin.hive.util.HiveUtil.createRecordReader(HiveUtil.java:252)
    at io.prestosql.plugin.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:88)
    at io.prestosql.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
    at io.prestosql.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:96)
    at io.prestosql.plugin.hive.GenericHiveRecordCursorProvider.createRecordCursor(GenericHiveRecordCursorProvider.java:87)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:206)
    at io.prestosql.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:101)
    at io.prestosql.spi.connector.ConnectorPageSourceProvider.createPageSource(ConnectorPageSourceProvider.java:68)
    at io.prestosql.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:66)
    at io.prestosql.split.PageSourceManager.createPageSource(PageSourceManager.java:64)
    at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:298)
    at io.prestosql.operator.Driver.processInternal(Driver.java:379)
    at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
    at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
    at io.prestosql.operator.Driver.processFor(Driver.java:276)
    at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
    at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
    at io.prestosql.$gen.Presto_325_3279_g2ba46c009____20201110_231256_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
    Suppressed: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
        ... 44 more
    Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:314)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:280)
        at jdk.internal.reflect.GeneratedMethodAccessor821.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
        at com.amazonaws.http.conn.$Proxy275.get(Unknown Source)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1323)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
        ... 42 more
    Suppressed: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
        ... 44 more
    Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:314)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:280)
        at jdk.internal.reflect.GeneratedMethodAccessor821.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
        at com.amazonaws.http.conn.$Proxy275.get(Unknown Source)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1323)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
        ... 42 more
    Suppressed: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
        ... 44 more
    Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:314)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:280)
        at jdk.internal.reflect.GeneratedMethodAccessor821.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
        at com.amazonaws.http.conn.$Proxy275.get(Unknown Source)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1323)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
        ... 42 more
    Suppressed: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
        ... 44 more
    Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:314)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:280)
        at jdk.internal.reflect.GeneratedMethodAccessor821.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
        at com.amazonaws.http.conn.$Proxy275.get(Unknown Source)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1323)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
        ... 42 more
    Suppressed: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
        ... 44 more
    Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:314)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:280)
        at jdk.internal.reflect.GeneratedMethodAccessor821.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
        at com.amazonaws.http.conn.$Proxy275.get(Unknown Source)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1323)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
        ... 42 more
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:314)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:280)
    at jdk.internal.reflect.GeneratedMethodAccessor821.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
    at com.amazonaws.http.conn.$Proxy275.get(Unknown Source)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1323)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
    ... 42 more

The issue seems to be HIVE-22981 which is fixed in the latest Hive master. This method AvroGenericRecordReader#extractWriterTimezoneFromMetadata is calling Avro's DataFileReader but is not closing it . The said method is being called from AvroGenericRecordReader#init . See here

In 341 we bumped Hive library from 3.0.6 to 3.1.2. I expect that is the reason for regression. I will see if we can backport the HIVE-22981 fix.

@losipiuk You mean backporting HIVE-22981 to branch-3.1 in apache/hive? or within prestosql/presto-hive-apache?
(i guess we could try both approaches in parallel)

I was thinking latter. But I can send backport request to apache/hive too (idk what is the process there).

for apache/hive you create a JIRA ("backport xxxx") and then do a PR. You put the JIRA id at the start of the cmt message.

@losipiuk You mean backporting HIVE-22981 to branch-3.1 in apache/hive? or within prestosql/presto-hive-apache?
(i guess we could try both approaches in parallel)

@findepi what does fixing it 'within prestosql/presto-hive-apache' entail?

prestosql/presto-hive-apache is repackaging of hive client libraries Presto uses. While repackaging we can also do point fixes like this one here.

Btw I validated that cherry-picking HIVE-22981 fixes the issue. So big thanks @rdsr for pointing towards that issue.

@csunwold I looked here a bit more and I am a bit confused.

Without the fixed Hive client, I reproduced the problem with JMX metric going up on each query which reads avro, and saying up. But now I see I cannot reproduce the actual error with depleted http connection pool.
Even with JMX counter going up, connection pool seems fine. How did you reproduce that (more important) part of the problem.

@losipiuk I wish I had a good answer for you, but it appears you've discovered a flaw in my testing process. I am now just as confused as you.

Looking closer at what I did, I did not actually reproduce the the timeout error locally. I focused in too much on the active connections and reproduced it leaving it as an assumption that when Active Connections reached the value of hive.s3.max-connections the error would _probably_ occur. But it appears this is a not so classic case of what happens when you assume.

I focused in on the Active Connections because that was exactly the metric that correlated perfectly with our cluster timeout exceptions. See graph:

Screen Shot 2020-11-19 at 2 43 48 PM

I went back and re-ran my ad-hoc test and observed the exact same behavior as you described. While the ActiveConnections goes up steadily, at no point am I able to reproduce depleted http connection pool.

One potential difference that may be worth looking closer: My test (and I assume how you reproduced as well) was to use MinIO as a stand in for S3. Could you try using a real s3 bucket instead of MinIO?

I remain optimistic that the updated Hive client does fix the issue given the strong correlation between ActiveConnections and the pool timeout in production, as well as this anecdotal report that using the new jar resolves the issue: https://prestosql.slack.com/archives/CGB0QHWSW/p1605814574075500?thread_ts=1605807972.073400&cid=CGB0QHWSW

Was this page helpful?
0 / 5 - 0 ratings