Presto: Map PostgreSQL arrays to JSON

Created on 26 Apr 2019  Â·  12Comments  Â·  Source: prestosql/presto

PostgreSQL support for arrays (added recently in https://github.com/prestosql/presto/pull/317) suffers from conceptual problem.

Postgres type system does not distinguish between array types with different
number of dimensions. _int4, int[], int[][] and int[][][][][][] are all equivalent
to the type system.

psql (11.2 (Debian 11.2-1.pgdg90+1))

test=# create table t(a int[], b int[][], c int[][][]);
CREATE TABLE
test=# insert into t values (array[1], array[1], array[1]),
          (array[array[1]], array[array[1]], array[array[1]]);
INSERT 0 2
test=# select * from t;
   a   |   b   |   c
-------+-------+-------
 {1}   | {1}   | {1}
 {{1}} | {{1}} | {{1}}
(2 rows)

Currently we map these types as follows:

  • _int4 → not supported
  • int[] → array(integer)
  • int[][] → array(array(integer))

The execution will fail when it encounters an int[] array in int[][] column.

proposed solution

Change PostgreSQL arrays mapping -- map them to Presto's JSON instead of Presto's arrays.

When we stop inspecting pg_attribute.attndims, we automatically get _int4 covered (see https://github.com/prestosql/presto/pull/659 for discussion about _<type> types in PostgreSQL)

TODO: figure out _writing_

Most helpful comment

@dain and I discussed this and had an idea. We can have a flag that says how to treat arrays:

  • fail -- map to Presto arrays and fail the query if the dimensionality of a value doesn't match
  • null -- like fail but return null
  • json -- map arrays to the json type

Having fail be the default seems reasonable, since it seems unlikely that (1) users would use mismatched dimensionality and (2) users would use multi-dimensional arrays at all.

All 12 comments

cc @vincentpoon @guyco33

I don't like backing off a feature developed so nicely. I see lots of advantages of mapping to Presto arrays, like:

  • better Presto type system than with JSON
  • easier access (via subscripts)
  • more performant

However, i am convinced we should not enforce strict types on something that has no strict types.
This will backfire. And it's much easier to change this now than later.

I will try to rephrase the problem:

  • We thought PostgreSQL supports arrays and multi-dimensional arrays
    just like Presto so we mapped them to Presto arrays.
  • However, PostgreSQL support has one type of arrays: "arrays of some number
    dimensions" where the number of array dimension is _not_ property of the type,
    it's a property of particular _value_.

    • Within a single table column I can have values array[1,2,3], array[array[1,2,3], array[4,5,6]]

      and array[array[array[1]]].

  • In Presto, this cannot be expressed with an array type (Presto array type has well-defined number
    of dimensions). However, it can be expressed with Presto JSON type.

@guyco33, does it make more sense now?

Yes. Thanks @findepi for the clarification!
Mapping Postgres Arrays to Postgres Json makes sense.

Yes, the dimensions property of the array type in Postgres is not enforced, it's up to the user if they want to maintain consistency. However, since Presto enforces dimensions implicitly, what that means is if you write only from Presto, the current solution works fine. But we do have the potential for problematic data written through Postgres directly.

However mapping to JSON sounds quite messy based on my limited understanding of the JSON feature - though Postgres doesn't enforce dimensionality, it will still enforce the base type. So you would cast the JSON into the Postgres base type? Besides the performance cost, there's the potential for type mismatch since JSON allows mixed types but ARRAY doesn't.

Perhaps we could just limit the ARRAY functionality to single dimensions arrays (don't bother fetching dimension count), and state that as a known limitation?

Not sure how feasible this is, but perhaps another option would be to write a custom data type which does exactly what we want - i.e. a BlockBuilder that can hold either an array or a baseType value. That way we can return results of mixed dimensions, of a given baseType.

There doesn't seem to be many examples of writing a custom data type, though, so I'm not sure what limitations there are.

How do people use such arrays in PostgreSQL from SQL? It seems like this would be really hard to use.

How do people use such arrays in PostgreSQL from SQL? It seems like this would be really hard to use.

@electrum Though I've never encountered such tables with mixed dimension arrays, I guess that they are used only by code after retrieving the dimension count with array_ndims function

@dain and I discussed this and had an idea. We can have a flag that says how to treat arrays:

  • fail -- map to Presto arrays and fail the query if the dimensionality of a value doesn't match
  • null -- like fail but return null
  • json -- map arrays to the json type

Having fail be the default seems reasonable, since it seems unlikely that (1) users would use mismatched dimensionality and (2) users would use multi-dimensional arrays at all.

+1, fwiw I also have not encountered use of mixed dimension arrays.

Maybe rename null to ignore?

We can use Postgres array_to_json function to easily do the JSON mapping. It can help especially for multi-dimensional arrays tables.
For example:
with tab(array_col) as (values (array[array[1,2]]),(array[2])) select array_to_json(array_col) from tab

Was this page helpful?
0 / 5 - 0 ratings