Presto: Union of char and varchar creates char(65536) as output type

Created on 9 Jul 2020  ·  5Comments  ·  Source: prestosql/presto

 select typeof(x) from (select char 'abc' x union all select varchar 'xyz' x);
    _col0    
-------------
 char(65536) 
 char(65536) 
(2 rows)
bug

Most helpful comment

What would be surprising is that one would not expect a varchar to implicitly coerce to another type in a way that changes the value. The expectation is that that kind of operation would fail with a “type mismatch” error

All 5 comments

This might be part of the problem:

select typeof(x) from (select varchar 'abc' x);
  _col0  
---------
 varchar 
(1 row)

It seems like the type constructor should be producing a varchar(3) since the length is known.

There's also the question of why the output type is char and not varchar, and if that is the correct behavior.

The char vs varchar coercions in Presto are currently broken. In standard SQL, literals are supposed to be of type char, not varchar, and char should implicitly coerce to varchar. I was looking at this a few weeks ago in the context of implicit coercions for the || operator. This is what I wrote as part of the analysis:

A conversion from varchar to char adds padding, so it's unsafe to
do it automatically. For example, given a table t with a column v
of type varchar(100), and function that performs the following operation:

    function foo(x::char(100)) := '<' || x || '>'

A query such as this would produce surprising results:

    SELECT foo(v) from t

[...]

Moreover, the implicit coercion between varchar and char is lossy.
The way it's currently defined, if the string is larger than the
max allowed size for the char type, it will be silently truncated.

Unfortunately, changing the literals to be of type char and flipping the order of coercions is not trivial. There are some complications around how comparisons are currently implemented, but I haven't analyzed this in depth.

A query such as this would produce surprising results:

SELECT foo(v) from t

Can you fill this example with the output and what wouldn't be surprising?

For instance, I expect <x⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵⎵>.

What would be surprising is that one would not expect a varchar to implicitly coerce to another type in a way that changes the value. The expectation is that that kind of operation would fail with a “type mismatch” error

A related issue is that we currently treat varchar(n) and varchar as the same type, but this is unsound. Unlike decimal or timestamp, where decimal(5) is syntactic sugar for decimal(5, 0), and timestamp is syntactic sugar for timestamp(3), there's no equivalent transformation for varchar. There is no value of n where varchar and varchar(n) are equivalent.

This causes a few problems. One of them, as I alluded above, is the potential lossy conversion between varchar and char(n) due to the way the coercions are currently set up:

Given that varchar and varchar(n) are currently considered the same type, and varchar(n) implicitly coerces to char(n), the following relations are true:

  • varchar(a) implicitly coerces to char(b) if a <= b
  • varchar(a) implicitly coerces to varchar
  • Therefore, varchar has to implicitly coerce to char(b). Since b is capped at 65536, any string larger than that length will be silently truncated.

I've been thinking about what it would take to break the linkage between varchar and varchar(n) (in retrospect, we should've called varchar simply string or text). That would allow us to, as a start, change the coercions to behave like this:

varchar(a) -> char(b) -> varchar

In the future, we need to keep investigating what it will take to change them to:

char(a) -> varchar(b) -> varchar

Was this page helpful?
0 / 5 - 0 ratings