Presto: Scala backend can't load a DataFrame while using the Presto driver (java.sql.SQLException)

Created on 19 Dec 2019 · 15 Comments · Source: prestosql/presto

This is where the test breaks:

    val df = spark.read.format("jdbc")
        .option("url", url_)
        .option("dbtable", dbtable_)
        .option("driver", "com.facebook.presto.jdbc.PrestoDriver")
        .load()

**The error that I got:**

java.sql.SQLException: Unsupported type ARRAY

All 15 comments

@AssouliDFK Please include the full stack trace of the error you get when running Spark with Presto JDBC 326.

[info] java.sql.SQLException: _Unsupported type ARRAY_
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:251)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:316)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:316)
[info] at scala.Option.getOrElse(Option.scala:121)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:315)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
[info] at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
[info] at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
[info] ...

Could you also share the DDL? As far as I tested, this issue happens if the table has an ARRAY column, and the exception is thrown on the Spark side (I suppose this isn't a Presto bug).

https://github.com/apache/spark/blob/a834dba120e3569e44c5e4b9f8db9c6eef58161b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L205

col | type
col_1 | string
col_2 | double
col_3 | double
col_4 | double
col_5 | double
col_6 | string
col_7 | string
col_8 | string
col_9 | string
col_10 | double
col_11 | string
col_12 | string
col_13 | double
col_14 | string
col_15 | double
col_16 | array<string>
col_17 | string

@AssouliDFK Could you tell me the reason why you did thumbs down...? Please correct me if my understanding is wrong.

@ebyhr No offense, but I don't think it's a Spark bug. I implemented the same thing with other distributed query engines and it worked without any problems, so that's why I think it's not a Spark bug. And I'm sorry for the thumbs down O:)

@AssouliDFK However, JdbcUtils.scala#L205 shows that Spark doesn't support the java.sql.Types.ARRAY type. The same issue with PostgreSQL and Spark was also reported in https://stackoverflow.com/questions/50613977/unsupported-array-error-when-reading-jdbc-source-in-pyspark. What do you think about it?

@ebyhr Look, I tried the same thing with PostgreSQL and had no issue.
This is how the DDL looks when I select it in PostgreSQL:
col | Type
col_1 | text
col_2 | double precision
col_3 | double precision
col_4 | double precision
col_5 | double precision
col_6 | text
col_7 | text
col_8 | text
col_9 | text
col_10 | double precision
col_11 | text
col_12 | text
col_13 | double precision
col_14 | text
col_15 | double precision
col_16 | text[]
col_17 | text
And when I try to load the data into the DataFrame, it works without any problems.

The same call works in PG:

    df = spark.read
        .format("jdbc")
        .option("url", URL_)
        .option("dbtable", TABLE_NAME)
        .option("driver", "org.postgresql.Driver")
        .load()

Thank you! O:)

Please try with .option("dbtable", "pg_type").

df = spark.read
               .format("jdbc")
               .option("url", URL_ )
               .option("dbtable", "pg_type")
               .option("driver", "org.postgresql.Driver")
               .load()
scala> var df = spark.read.format("jdbc").option("url", "jdbc:postgresql://localhost:15432/test?user=test&password=test").option("dbtable", "pg_type").option("driver", "org.postgresql.Driver").load()
java.sql.SQLException: Unsupported type ARRAY
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:251)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:316)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:316)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:315)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
  ... 49 elided

Yo, the bug doesn't show up on PG; it shows up when loading the DataFrame from Hive through Presto.
PS: the DataFrame is loaded the same way as for Postgres.

@ebyhr So, as @Oshimada told you, the problem is not with loading the DataFrame using PG (that works successfully). When trying the same logic with Presto, it fails without loading anything.
Thanks

Thank you. What I'm trying to fix now is why my DataFrame can't load data when using the Spark reader and the Presto driver. That is my problem, not a PG problem.

It appears Spark doesn't support arrays generically for JDBC, but only for specific databases like PostgreSQL. This is why it isn't working for Presto.
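
To make that behavior concrete, here is a simplified model of the lookup order, not Spark's actual code: a database-specific dialect is asked first, and only if it has no answer does a generic mapping run, which has no entry for ARRAY. The object name, the `Dialect` alias, and the string type names are all assumptions for illustration; the `_text` type name is what the PostgreSQL driver reports for `text[]`.

```scala
import java.sql.{JDBCType, SQLException, Types}

// Simplified, hypothetical model of the type resolution; names are illustrative only.
object JdbcTypeMappingSketch {
  // A dialect may map (JDBC type code, database type name) to a Spark-like type name.
  type Dialect = (Int, String) => Option[String]

  // No database-specific knowledge at all: always defers to the generic mapping.
  val noDialect: Dialect = (_, _) => None

  // A PostgreSQL-like dialect that knows how to map a text[] column.
  val postgresLike: Dialect = (sqlType, typeName) =>
    if (sqlType == Types.ARRAY && typeName == "_text") Some("ArrayType(StringType)")
    else None

  // Generic fallback: anything without a mapping, including ARRAY, is rejected.
  def genericType(sqlType: Int): String = sqlType match {
    case Types.VARCHAR => "StringType"
    case Types.DOUBLE  => "DoubleType"
    case other =>
      throw new SQLException("Unsupported type " + JDBCType.valueOf(other).getName)
  }

  // The dialect is consulted first, then the generic mapping.
  def resolve(dialect: Dialect, sqlType: Int, typeName: String): String =
    dialect(sqlType, typeName).getOrElse(genericType(sqlType))
}
```

In this model, an array column resolves only when the dialect recognizes its element type. That would be consistent with `text[]` loading from PostgreSQL while `pg_type`, which contains arrays of other types, and any Presto array column both end in "Unsupported type ARRAY".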

Other than changing Spark to add support for Presto, you might be able to work around this by converting the array column to JSON text in the SQL query: json_format(cast(x as json))
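
A minimal sketch of how that workaround could be wired into the reader, assuming the query is passed as a subquery through the `dbtable` option. The helper name `jsonSafeDbtable`, the alias `t`, and the table name in the usage note are hypothetical; only `json_format(cast(x as json))` comes from the suggestion above.

```scala
// Hypothetical helper: builds a subquery for the "dbtable" option in which every
// listed ARRAY column is serialized to JSON text on the Presto side, so the
// JDBC reader only ever sees a string column.
object PrestoArrayWorkaround {
  def jsonSafeDbtable(table: String, columns: Seq[String], arrayColumns: Set[String]): String = {
    val projections = columns.map { c =>
      if (arrayColumns.contains(c)) s"json_format(cast($c AS json)) AS $c"
      else c
    }
    val selectList = projections.mkString(", ")
    // The subquery is parenthesized and aliased so it can stand in for a table name.
    s"(SELECT $selectList FROM $table) AS t"
  }
}
```

For the table in this thread, something like `PrestoArrayWorkaround.jsonSafeDbtable("my_table", Seq("col_15", "col_16", "col_17"), Set("col_16"))` would be passed as `.option("dbtable", ...)`. `col_16` then arrives as a JSON string such as `["a","b"]`, which Spark's `from_json` may be able to parse back into an array, depending on the Spark version.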

I'm going to close this issue because it depends on the Spark implementation. Please reopen it or leave a comment on Slack if you need help.
