Presto: Substr to a CHAR(n) column returns padded string

Created on 17 Apr 2020  Â·  4Comments  Â·  Source: prestosql/presto

Problem

When I do substr to a fixed length CHAR(n) column, it returns white-space-padded string. But I think it should return a truncated string without padding.

I've done the same thing in MySQL and it returned a truncated string which I expected for.

How to reproduce

  1. Make a CHAR type column (e.g. CHAR(8)) to a table
  2. Insert a row with text (e.g. abcd1234)
  3. Select it with substr (e.g. select substr(col1, 1, 4) from T)

Expected return

varchar(n) abcd (truncated text)

Actual return

char(n) abcd (four white spaces padded)

Presto version

331

bug syntax-needs-review

All 4 comments

I need to take another look at the SQL specification, but this seems like the correct and expected behavior.

CHAR(n) is a fixed-length type (as opposed to VARCHAR(n). The result of applying substr to a CHAR(n) is, by necessity, of the same type — the type system is not powerful enough to derive a new type based on the actual arguments to the function and infer that because you asked for a substring of length k, the result should be CHAR(k).

I think substr(char(n), ...) should return varchar(n) (instead of char(n)). Let's check the spec though.

substr doesn't actually exist in the spec. There's a SUBSTRING construct that behaves in this way:

<character substring function> ::=
  SUBSTRING <left paren> <character value expression> FROM <start position>
      [ FOR <string length> ] [ USING <char length units> ] <right paren>
If <character substring function> CSF is specified, then let DTCVE be the declared type of the 
<character value expression> immediately contained in CSF. The maximum length, character set, 
and collation of the declared type DTCSF of CSF are determined as follows:
   a) Case:
      i) If the declared type of <character value expression> is fixed-length character string or 
       variable- length character string, then DTCSF is a variable-length character string type 
       with maximum length equal to the length or maximum length of DTCVE.
      ii) Otherwise, the DTCSF is a large object character string type with maximum length equal to 
       the maximum length of DTCVE.

To paraphrase, if the argument type is CHAR(n), the result of SUBSTRING(string FROM start FOR length) is supposed to be a variable-length string type such as VARCHAR(n)

On the other hand, substr in other systems behaves like this:

  • Oracle: the return type matches the input type (https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions162.htm)
  • PostgreSQL: the return type is text, a variable length type (https://www.postgresql.org/docs/9.1/functions-string.html)
  • DB2: it's complicated. T return type depends on the input type and whether the start/length arguments are constants (https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0000854.html?pos=2)
  • SQL Server: the function is named SUBSTRING. The return type is a variable-length type (https://docs.microsoft.com/en-us/sql/t-sql/functions/substring-transact-sql?view=sql-server-ver15)

Returning varchar(n) in our implementation seems like a reasonable choice.

thanks @martint for detailed analysis!

Returning varchar(n) in our implementation seems like a reasonable choice.

Unless @martint @kokosing @electrum you have any other objections, I am gonna remove the syntax-needs-review.

Was this page helpful?
0 / 5 - 0 ratings