Presto: Scala backend can't load a DataFrame when using the Presto driver (java.sql.SQLException)

Created on 19 Dec 2019 · 15 Comments · Source: prestosql/presto

This is where the test breaks:

    df = spark.read.format("jdbc")
      .option("url", url_)
      .option("dbtable", dbtable_)
      .option("driver", "com.facebook.presto.jdbc.PrestoDriver")
      .load()

**The error that I got:**

java.sql.SQLException: Unsupported type ARRAY

AssouliDFK

All 15 comments

@AssouliDFK Please include the full stack trace of the error you get when running Spark with Presto JDBC 326.

findepi on 19 Dec 2019

👍1

[info] java.sql.SQLException: Unsupported type ARRAY
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:251)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:316)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:316)
[info] at scala.Option.getOrElse(Option.scala:121)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:315)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
[info] at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
[info] at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
[info] ...

AssouliDFK on 19 Dec 2019

👍1

Could you also share the DDL? As far as I have tested, this issue happens when the table has an ARRAY column, and the exception is thrown on the Spark side (so I suppose this isn't a Presto bug).

https://github.com/apache/spark/blob/a834dba120e3569e44c5e4b9f8db9c6eef58161b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L205

ebyhr on 19 Dec 2019

👍1

| Column | Type |
| --- | --- |
| col_1 | string |
| col_2 | double |
| col_3 | double |
| col_4 | double |
| col_5 | double |
| col_6 | string |
| col_7 | string |
| col_8 | string |
| col_9 | string |
| col_10 | double |
| col_11 | string |
| col_12 | string |
| col_13 | double |
| col_14 | string |
| col_15 | double |
| col_16 | array<string> |
| col_17 | string |

AssouliDFK on 19 Dec 2019

@AssouliDFK Could you tell me why you gave a thumbs down? Please correct me if my understanding is wrong.

ebyhr on 19 Dec 2019

❤2 😄2

@ebyhr No offense, but I don't think it's a Spark bug. I implemented the same thing with other distributed query engines and it worked without any problems, which is why I think it's not a Spark bug. And I'm sorry for the thumbs down O:)

AssouliDFK on 19 Dec 2019

👀1

@AssouliDFK However, JdbcUtils.scala#L205 shows that Spark doesn't support the java.sql.Types.ARRAY type. Also, the same issue with PostgreSQL and Spark was reported in https://stackoverflow.com/questions/50613977/unsupported-array-error-when-reading-jdbc-source-in-pyspark. What do you think about it?

ebyhr on 19 Dec 2019

👍1

The same call works in PG:

    df = spark.read
      .format("jdbc")
      .option("url", URL_)
      .option("dbtable", TABLE_NAME)
      .option("driver", "org.postgresql.Driver")
      .load()

Thank you! O:)

AssouliDFK on 19 Dec 2019

Please try with .option("dbtable", "pg_type").

df = spark.read
               .format("jdbc")
               .option("url", URL_ )
               .option("dbtable", "pg_type")
               .option("driver", "org.postgresql.Driver")
               .load()

scala> var df = spark.read.format("jdbc").option("url", "jdbc:postgresql://localhost:15432/test?user=test&password=test").option("dbtable", "pg_type").option("driver", "org.postgresql.Driver").load()
java.sql.SQLException: Unsupported type ARRAY
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:251)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:316)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:316)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:315)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
  ... 49 elided

ebyhr on 19 Dec 2019

👍2 😕1

Yo, the bug doesn't occur with PG; it shows up when loading the DataFrame from Hive through Presto.
PS: the DataFrame is loaded the same way as with Postgres.

Oshimada on 19 Dec 2019

👍1

@ebyhr So, as @Oshimada told you, the problem does not occur when loading the DataFrame from PG (that works successfully). With the same logic against Presto, it fails without loading anything.
Thanks!

AssouliDFK on 19 Dec 2019

👀1

I guess the reason is that Spark has a PostgresDialect, so the logic isn't exactly the same as in the Presto case.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L320-L321

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L221

PostgresDialect has special logic for the ARRAY type at
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L42-L45

ebyhr on 19 Dec 2019

👍2

Thank you. What I'm trying to fix now is why my DataFrame can't load data when using the Spark reader with the Presto driver. That is my problem, not a PG problem.

AssouliDFK on 19 Dec 2019

It appears Spark doesn't support arrays generically for JDBC, but only for specific databases like PostgreSQL. This is why it isn't working for Presto.

Other than changing Spark to add support for Presto, you might be able to work around this by converting the array column to JSON text in the SQL query: json_format(cast(x as json))
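
As a rough illustration of the `json_format(cast(x as json))` workaround, here is a minimal Scala sketch that builds a subquery in which the array columns are serialized to JSON text. The helper name `jsonSafeQuery`, the object `PrestoArrayWorkaround`, the alias `t`, and the table name `my_table` are all hypothetical and not part of Spark or Presto. The sketch does no identifier quoting and assumes the column names are known up front.

```scala
// Hypothetical helper (not a Spark or Presto API): builds a subquery string
// in which every ARRAY column is converted to JSON text, so that the JDBC
// metadata reports a varchar column instead of java.sql.Types.ARRAY.
object PrestoArrayWorkaround {
  def jsonSafeQuery(
      table: String,
      columns: Seq[String],
      arrayColumns: Set[String]): String = {
    val projections = columns.map { c =>
      // Assumption: column names are plain identifiers that need no quoting.
      if (arrayColumns.contains(c)) s"json_format(cast($c AS json)) AS $c"
      else c
    }
    // Parenthesized subquery with an alias, usable where a table name is expected.
    s"(SELECT ${projections.mkString(", ")} FROM $table) AS t"
  }
}

// Example with the columns from the table description above ("my_table" is a placeholder).
val dbtable = PrestoArrayWorkaround.jsonSafeQuery(
  "my_table", Seq("col_15", "col_16", "col_17"), Set("col_16"))
```

The resulting string could then be passed as `.option("dbtable", dbtable)` in place of the plain table name. With that change `col_16` should arrive in the DataFrame as a JSON string rather than an array, so it would still need to be parsed on the Spark side if an array type is required.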