Presto: HiveMetastore outputFormat should not be accessed from a null StorageFormat

Created on 27 Dec 2016 · 9 comments · Source: prestodb/presto

With this commit, starting with Presto 0.152, we can't query the System Connector's JDBC/HiveMetastore tables if Hive tables were created through StorageHandlers, where INPUT/OUTPUT formats may not exist.

Repro:

  1. Create an external table in Hive backed by DynamoDB:
show create table hivet2;

WARNING: Configured write throughput of the dynamodb table SeattleBikes is less than the cluster map capacity. ClusterMapCapacity: 40 WriteThroughput: 5
WARNING: Writes to this table might result in a write outage on the table.
OK

CREATE EXTERNAL TABLE `hivet2`(
  `col1` string COMMENT 'from deserializer', 
  `col2` string COMMENT 'from deserializer', 
  `col3` string COMMENT 'from deserializer', 
  `col4` string COMMENT 'from deserializer', 
  `col5` string COMMENT 'from deserializer', 
  `col6` string COMMENT 'from deserializer', 
  `col7` string COMMENT 'from deserializer', 
  `col8` string COMMENT 'from deserializer', 
  `col9` string COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.dynamodb.DynamoDBSerDe' 
STORED BY 
  'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
WITH SERDEPROPERTIES ( 
  'serialization.format'='1')
LOCATION
  'hdfs://ip-172-31-21-252.us-west-2.compute.internal:8020/user/hive/warehouse/hivet2'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}', 
  'dynamodb.column.mapping'='col1:BikeId,col2:BikeModel,col3:BikeType,col4:City,col5:Condition,col6:Latitude,col7:Location,col8:Longitude,col9:Manufacturer', 
  'dynamodb.table.name'='SeattleBikes', 
  'numFiles'='0', 
  'numRows'='0', 
  'rawDataSize'='0', 
  'totalSize'='0', 
  'transient_lastDdlTime'='1482875308')
Time taken: 1.076 seconds, Fetched: 27 row(s)
  2. Open the CLI and run a query to pull metastore/table details:
presto-cli --catalog hive --schema default --debug
presto:default> 
presto:default> SELECT TABLE_CAT, TABLE_SCHEM, TABLE_NAME, COLUMN_NAME, DATA_TYPE, TYPE_NAME, COLUMN_SIZE, BUFFER_LENGTH, 
    DECIMAL_DIGITS, NUM_PREC_RADIX, NULLABLE, REMARKS, COLUMN_DEF, SQL_DATA_TYPE, SQL_DATETIME_SUB, CHAR_OCTET_LENGTH, 
    ORDINAL_POSITION, IS_NULLABLE, SCOPE_CATALOG, SCOPE_SCHEMA, SCOPE_TABLE, SOURCE_DATA_TYPE, IS_AUTOINCREMENT 
    FROM system.jdbc.columns WHERE TABLE_CAT='hive';

Query 20161227_220527_00074_urbnw, FAILED, 2 nodes
http://ip-172-31-21-252.us-west-2.compute.internal:8889/query.html?20161227_220527_00074_urbnw
Splits: 2 total, 0 done (0.00%)
CPU Time: 0.0s total,     0 rows/s,     0B/s, 33% active
Per Node: 0.0 parallelism,     0 rows/s,     0B/s
Parallelism: 0.0
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20161227_220527_00074_urbnw failed: outputFormat should not be accessed from a null StorageFormat
java.lang.IllegalStateException: outputFormat should not be accessed from a null StorageFormat
        at com.facebook.presto.hive.metastore.StorageFormat.getOutputFormat(StorageFormat.java:59)
        at com.facebook.presto.hive.HiveMetadata.extractHiveStorageFormat(HiveMetadata.java:1554)
        at com.facebook.presto.hive.HiveMetadata.getTableMetadata(HiveMetadata.java:268)
        at com.facebook.presto.hive.HiveMetadata.listTableColumns(HiveMetadata.java:370)
        at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorMetadata.listTableColumns(ClassLoaderSafeConnectorMetadata.java:169)
        at com.facebook.presto.metadata.MetadataManager.listTableColumns(MetadataManager.java:420)
        at com.facebook.presto.connector.system.jdbc.ColumnJdbcTable.cursor(ColumnJdbcTable.java:125)
        at com.facebook.presto.connector.system.SystemPageSourceProvider$1.cursor(SystemPageSourceProvider.java:126)
        at com.facebook.presto.split.MappedRecordSet.cursor(MappedRecordSet.java:53)
        at com.facebook.presto.spi.RecordPageSource.<init>(RecordPageSource.java:37)
        at com.facebook.presto.connector.system.SystemPageSourceProvider.createPageSource(SystemPageSourceProvider.java:105)
        at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:48)
        at com.facebook.presto.operator.ScanFilterAndProjectOperator.createSourceIfNecessary(ScanFilterAndProjectOperator.java:292)
        at com.facebook.presto.operator.ScanFilterAndProjectOperator.isFinished(ScanFilterAndProjectOperator.java:180)
        at com.facebook.presto.operator.Driver.processInternal(Driver.java:375)
        at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
        at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
        at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:529)
        at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:665)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Ran on Presto 0.152.3 / emr-5.2.0 (the Hive metastore can be external or internal).
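For readers unfamiliar with the failure mode, the accessor named in the stack trace can be modeled with a minimal sketch. The class and method names below are chosen to mirror the trace; this is not Presto's actual code, only an illustration of why tables created via STORED BY (a storage handler) break metadata listings:

```java
import java.util.Optional;

// Illustrative model: a StorageHandler-backed table (DynamoDB,
// Elasticsearch, ...) has no input/output format in the metastore,
// so a throwing accessor fails any metadata listing that touches it.
public class StorageFormatSketch {
    // Throwing accessor, as in the stack trace.
    static String getOutputFormat(String outputFormat) {
        if (outputFormat == null) {
            throw new IllegalStateException(
                    "outputFormat should not be accessed from a null StorageFormat");
        }
        return outputFormat;
    }

    // Defensive alternative: expose the format as an Optional so callers
    // that list many tables can skip the unreadable ones.
    static Optional<String> tryGetOutputFormat(String outputFormat) {
        return Optional.ofNullable(outputFormat);
    }

    public static void main(String[] args) {
        // A regular Hive table carries a concrete output format...
        System.out.println(getOutputFormat(
                "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"));
        // ...while a DynamoDB StorageHandler table carries none.
        System.out.println(tryGetOutputFormat(null).isPresent());
    }
}
```

The Optional-returning variant reflects the general direction of the eventual fix (tolerating absent formats during listing) rather than its exact implementation.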

backport

All 9 comments

Same issue on EMR. Creating an external Hive table on S3 or DynamoDB works fine, but when trying to access the DynamoDB table, Presto throws the same error: Query 20170410_121048_00008_sdbf4 failed: outputFormat should not be accessed from a null StorageFormat. The S3-backed table, however, continues to work fine in Presto.

Any update on this issue? We ran into it with S3-backed tables on Hive, on Presto 0.173.

I have the same error on EMR Presto 0.166:
presto:information_schema> SELECT table_schema, table_name, column_name FROM information_schema.columns WHERE table_schema IN ('default', 'test');

Query 20170425_112726_00155_nfp6g failed: outputFormat should not be accessed from a null StorageFormat
java.lang.IllegalStateException: outputFormat should not be accessed from a null StorageFormat
at com.facebook.presto.hive.metastore.StorageFormat.getOutputFormat(StorageFormat.java:61)
at com.facebook.presto.hive.HiveMetadata.extractHiveStorageFormat(HiveMetadata.java:1266)
at com.facebook.presto.hive.HiveMetadata.getTableMetadata(HiveMetadata.java:251)
at com.facebook.presto.hive.HiveMetadata.listTableColumns(HiveMetadata.java:329)
at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorMetadata.listTableColumns(ClassLoaderSafeConnectorMetadata.java:166)
at com.facebook.presto.metadata.MetadataManager.listTableColumns(MetadataManager.java:397)
at com.facebook.presto.connector.informationSchema.InformationSchemaPageSourceProvider.getColumnsList(InformationSchemaPageSourceProvider.java:182)
at com.facebook.presto.connector.informationSchema.InformationSchemaPageSourceProvider.buildColumns(InformationSchemaPageSourceProvider.java:156)
at com.facebook.presto.connector.informationSchema.InformationSchemaPageSourceProvider.getInformationSchemaTable(InformationSchemaPageSourceProvider.java:135)
at com.facebook.presto.connector.informationSchema.InformationSchemaPageSourceProvider.getInternalTable(InformationSchemaPageSourceProvider.java:129)
at com.facebook.presto.connector.informationSchema.InformationSchemaPageSourceProvider.createPageSource(InformationSchemaPageSourceProvider.java:87)
at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
at com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:222)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:378)
at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:555)
at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:691)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

but the following queries work correctly:
presto> SELECT table_schema, table_name, column_name FROM information_schema.columns WHERE table_schema IN ('presto');
or
presto> SELECT table_schema, table_name, column_name FROM information_schema.columns WHERE table_schema IN ('test');

As of 0.198, I am also experiencing this error pretty regularly against S3 backed Parquet tables.

Most tables have been created through Hive, though we also have tables that have been created through Presto. Offending tables have the following formats:

InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat 

We have BI tools and data exploration tools that query system.jdbc.columns and information_schema.columns and regularly receive this error.

This issue also shows up for a Hive external table created on AWS EMR using AWS Elasticsearch as the source; again, this format uses a storage handler.

The table below can be queried using Hive on AWS EMR but cannot be queried using Presto on EMR, and similarly fails on AWS Athena (which appears to use Presto as well).

Hive DDL :

CREATE EXTERNAL TABLE prod_elastic_search.insights (
  reportinstanceid string,
  ip string,
  sg_event_id string,
  sg_message_id string,
  useragent string,
  event string,
  email string,
  `timestamp` bigint,
  reportsubject string,
  category ARRAY<string>,
  `@id` string,
  `@timestamp` timestamp)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.mapping.id' = 'sendgridevents',
  'es.resource' = 'cwl-*',
  'es.nodes' = '',
  'es.port' = '80',
  'es.nodes.wan.only' = 'true',
  'es.index.auto.create' = 'true',
  'es.read.field.as.array.exclude' = 'true'
);

Error Details:

presto> select * from prod_elastic_search.insights limit 10;
Query 20180419_203320_00002_juv3q failed: outputFormat should not be accessed from a null StorageFormat

We have tables using Hive StorageHandlers living in the same metastore as regular, Presto-queryable tables. External tools such as JetBrains DataGrip and BI tools, which use information_schema and system.jdbc.columns to display tables and columns, regularly fail because of the StorageHandler tables, resulting in various problems in those tools.

Presto should handle this exception and ignore unqueryable tables.
Is there a roadmap/timeline for merging the fix #9837 for this issue?
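The "ignore unqueryable tables" behavior the comment asks for can be sketched as a per-table try/catch during listing. The helper names below are hypothetical and stand in for the metastore calls; this is not Presto's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a metadata listing that skips tables whose storage
// format cannot be read, instead of failing the whole query.
public class SkipUnreadableTables {
    // Stands in for StorageFormat.getOutputFormat: null means a
    // StorageHandler-backed table with no output format.
    static String outputFormatOrThrow(String format) {
        if (format == null) {
            throw new IllegalStateException(
                    "outputFormat should not be accessed from a null StorageFormat");
        }
        return format;
    }

    // Returns only the tables that can be described; StorageHandler tables
    // are skipped rather than aborting the information_schema query.
    static List<String> listQueryableTables(Map<String, String> tableToFormat) {
        List<String> queryable = new ArrayList<>();
        for (Map.Entry<String, String> e : tableToFormat.entrySet()) {
            try {
                outputFormatOrThrow(e.getValue());
                queryable.add(e.getKey());
            }
            catch (IllegalStateException ignored) {
                // e.g. a DynamoDB or Elasticsearch StorageHandler table
            }
        }
        return queryable;
    }

    public static void main(String[] args) {
        Map<String, String> tables = new java.util.LinkedHashMap<>();
        tables.put("parquet_table",
                "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat");
        tables.put("dynamodb_table", null);
        System.out.println(listQueryableTables(tables)); // only parquet_table
    }
}
```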

Here is a simple makeshift workaround: I found the source of the error in com.facebook.presto.hive.metastore.StorageFormat and changed the code:

public String getOutputFormat()
{
    if (outputFormat == null) {
        // Return a placeholder instead of failing the whole metadata listing.
        return "null";
        // Original: throw new IllegalStateException("outputFormat should not be accessed from a null StorageFormat");
    }
    return outputFormat;
}

We should backport the fix from: https://github.com/prestosql/presto/pull/568

I've opened a PR to backport the fix here: https://github.com/prestodb/presto/pull/12973
