PostgreSQL support for arrays (added recently in https://github.com/prestosql/presto/pull/317) suffers from conceptual problem.
Postgres type system does not distinguish between array types with different
number of dimensions. _int4, int[], int[][] and int[][][][][][] are all equivalent
to the type system.
psql (11.2 (Debian 11.2-1.pgdg90+1))
test=# create table t(a int[], b int[][], c int[][][]);
CREATE TABLE
test=# insert into t values (array[1], array[1], array[1]),
(array[array[1]], array[array[1]], array[array[1]]);
INSERT 0 2
test=# select * from t;
a | b | c
-------+-------+-------
{1} | {1} | {1}
{{1}} | {{1}} | {{1}}
(2 rows)
Currently we map these types as follows:
_int4 → not supportedint[] → array(integer)int[][] → array(array(integer))The execution will fail when it encounters an int[] array in int[][] column.
Change PostgreSQL arrays mapping -- map them to Presto's JSON instead of Presto's arrays.
When we stop inspecting pg_attribute.attndims, we automatically get _int4 covered (see https://github.com/prestosql/presto/pull/659 for discussion about _<type> types in PostgreSQL)
TODO: figure out _writing_
cc @vincentpoon @guyco33
I don't like backing off a feature developed so nicely. I see lots of advantages of mapping to Presto arrays, like:
However, i am convinced we should not enforce strict types on something that has no strict types.
This will backfire. And it's much easier to change this now than later.
I will try to rephrase the problem:
array[1,2,3], array[array[1,2,3], array[4,5,6]]array[array[array[1]]].@guyco33, does it make more sense now?
Yes. Thanks @findepi for the clarification!
Mapping Postgres Arrays to Postgres Json makes sense.
Yes, the dimensions property of the array type in Postgres is not enforced, it's up to the user if they want to maintain consistency. However, since Presto enforces dimensions implicitly, what that means is if you write only from Presto, the current solution works fine. But we do have the potential for problematic data written through Postgres directly.
However mapping to JSON sounds quite messy based on my limited understanding of the JSON feature - though Postgres doesn't enforce dimensionality, it will still enforce the base type. So you would cast the JSON into the Postgres base type? Besides the performance cost, there's the potential for type mismatch since JSON allows mixed types but ARRAY doesn't.
Perhaps we could just limit the ARRAY functionality to single dimensions arrays (don't bother fetching dimension count), and state that as a known limitation?
Not sure how feasible this is, but perhaps another option would be to write a custom data type which does exactly what we want - i.e. a BlockBuilder that can hold either an array or a baseType value. That way we can return results of mixed dimensions, of a given baseType.
There doesn't seem to be many examples of writing a custom data type, though, so I'm not sure what limitations there are.
How do people use such arrays in PostgreSQL from SQL? It seems like this would be really hard to use.
How do people use such arrays in PostgreSQL from SQL? It seems like this would be really hard to use.
@electrum Though I've never encountered such tables with mixed dimension arrays, I guess that they are used only by code after retrieving the dimension count with array_ndims function
@dain and I discussed this and had an idea. We can have a flag that says how to treat arrays:
fail -- map to Presto arrays and fail the query if the dimensionality of a value doesn't matchnull -- like fail but return nulljson -- map arrays to the json typeHaving fail be the default seems reasonable, since it seems unlikely that (1) users would use mismatched dimensionality and (2) users would use multi-dimensional arrays at all.
+1, fwiw I also have not encountered use of mixed dimension arrays.
Maybe rename null to ignore?
We can use Postgres array_to_json function to easily do the JSON mapping. It can help especially for multi-dimensional arrays tables.
For example:
with tab(array_col) as (values (array[array[1,2]]),(array[2]))
select array_to_json(array_col) from tab
Most helpful comment
@dain and I discussed this and had an idea. We can have a flag that says how to treat arrays:
fail-- map to Presto arrays and fail the query if the dimensionality of a value doesn't matchnull-- likefailbut return nulljson-- map arrays to thejsontypeHaving
failbe the default seems reasonable, since it seems unlikely that (1) users would use mismatched dimensionality and (2) users would use multi-dimensional arrays at all.