Tidb: Let the timestamp type column can be directly compared with the ISO 8601 format string

Created on 6 Oct 2020  路  20Comments  路  Source: pingcap/tidb

Feature Request

Is your feature request related to a problem? Please describe:
TiDB currently does not support the comparison between the timestamp type column and the ISO 8601 format string.

Example:

QUERY contains ISO 8601 formatted timestamps:
---------------
SELECT count(*) as count
FROM `table_name` `t`
WHERE
`t`.`historytime` >= '2020-04-16T08:32:22.132Z'
AND `t`.`historytime` <= '2020-05-30T01:32:32.578Z'

ERROR:
---------------
SQL Error [1105] [HY000]: strconv.ParseInt: parsing "293Z": invalid syntax
========================================================================================================================================================
featuraccepted typfeature-request

Most helpful comment

Why MySQL doesn't support ISO 8601?

I suspect that it is just lack of demand. There are a lot of missing features around timestamp/timezones that need addressing: Y2K38, TIMESTAMP WITH TIMEZONE.

Does TiDB needs to support ISO 8601? What are the benefits and drawbacks?

I think this is a safe extension to add. It is at least a standard format (ISO 8601), so the typical risk (MySQL adding it later but differently) is minimal.

All 20 comments

It seems that MySQL doesn't support ISO 8601 directly, while PostgreSQL supports. For simplificatin, let's consider the following example:

drop table if exists t;
create table t(a timestamp);
insert into t values ('2020-05-30T01:32:32.578Z');
select * from t where a = '2020-05-30T01:32:32.578Z';

In MySQL 8.0.21, the insert statement result in error:

MySQL([email protected]:test) > insert into t values ('2020-05-30T01:32:32.578Z');
ERROR 1292 (22007): Incorrect datetime value: '2020-05-30T01:32:32.578Z' for column 'a' at row 1

MySQL([email protected]:test) > select version();
+-----------+
| version() |
+-----------+
| 8.0.21    |
+-----------+
1 row in set (0.00 sec)

While in PostgreSQL 12.4, the insert and select statements are both succeed:

PG([email protected]:test) > insert into t values('2020-04-16T08:32:22.132Z');
INSERT 0 1
PG([email protected]:test) > select * from t;
            a
-------------------------
 2020-04-16 08:32:22.132
(1 row)

PG([email protected]:test) > select * from t where a = '2020-04-16T08:32:22.132Z';
            a
-------------------------
 2020-04-16 08:32:22.132
(1 row)

PG([email protected]:test) > select version();
                                                      version
-------------------------------------------------------------------------------------------------------------------
 PostgreSQL 12.4 on x86_64-apple-darwin19.5.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
(1 row)

PG([email protected]:test) >

It's related to the compatibility of ISO 8601 date and time representation. Seems the main effect to achieve this is the type conversion between ISO 8601 strings and DateTime/Timestamp values.

But it may break the MySQL compatibility because MySQL doesn't support ISO 8601. There are several questions we need to investigate or discuss:

  • Why MySQL doesn't support ISO 8601?
  • Does TiDB needs to support ISO 8601? What are the benefits and drawbacks?
  • If TiDB decides to support ISO 8601, how to maintain the MySQL compatibility? Maybe a session variable like enable_iso_8601 may help.

Why MySQL doesn't support ISO 8601?

I suspect that it is just lack of demand. There are a lot of missing features around timestamp/timezones that need addressing: Y2K38, TIMESTAMP WITH TIMEZONE.

Does TiDB needs to support ISO 8601? What are the benefits and drawbacks?

I think this is a safe extension to add. It is at least a standard format (ISO 8601), so the typical risk (MySQL adding it later but differently) is minimal.

/assign @ichn-hu

For more information,

A summary of the international standard date and time notation

summarizes the ISO-8601 pretty neatly.

also https://pypi.org/project/iso8601/ and https://github.com/btubbs/datetime are two packages that implemented the parsing for time string with ISO-8601 format, of which the latter one is in Go with MIT license, which means we could learn from its implementation.

The internal representation of time in TiDB is in type CoreTime, which is a uint64 that stores a timestamp without time zone information, the 64 bits of this structure are all used up, which means there is no room for adding time zone information without large impact and refactoring on the current code base.

To support parsing time string literals in ISO-8601 format, which contains time zone information, a compromised implementation would yield the following behaviors

  • for datetime, time zone is parsed and ignored, an additional warning can be returned if the time zone contained in the string literal is different than the system time zone variable.
  • for timestamp, time zone is parsed and interpreted as an offset, for example, if the stored timestamp are in UTC, while the string literal contains time zone Z+08:00, there additonal 8 hours offset will be added to the timestamp the string literal specified.

for datetime, time zone is parsed and ignored, an additional warning can be returned if the time zone contained in the string literal is different than the system time zone variable.

@ichn-hu would it cause any correctness issue?

  • for datetime, time zone is parsed and ignored, an additional warning can be returned if the time zone contained in the string literal is different than the system time zone variable.

It is better to return an error if the data can't be accepted. There are a lot of clients that ignore warnings, and storing values in the incorrect timezone can cause a lot of problems.

for timestamp, time zone is parsed and interpreted as an offset, for example, if the stored timestamp are in UTC, while the string literal contains time zone Z+08:00, there additonal 8 hours offset will be added to the timestamp the string literal specified.

For timestamps it should be safe to interpret the value and convert it immediately. This is because of the following behavior:

"MySQL converts TIMESTAMP values from the current time zone to UTC for storage, and back from UTC to the current time zone for retrieval. (This does not occur for other types such as DATETIME".

Behavior in PostgreSQL

Postgres has date, time, timestamp, timestamptz (and interval) for temporal types, for time and timestamp, timezone is parsed and ignored (not even generating warnings, because timestamp is by default without timezone, and PG doesn't differentiate type by literal value, therefore the zone information is ignored see https://www.postgresql.org/docs/8.4/datatype-datetime.html), while for timestamptz, timezone is interpreted before storing the data.

postgres=# create table t (dt date, tm time, ts timestamp, tsz timestamptz);
CREATE TABLE
postgres=# insert into t (tm, ts, tsz) values ('10:10:10+1000', '2020-10-10T10:10:10+1000', '2020-10-10T10:10:10+0800');
INSERT 0 1
postgres=# select * from t;
 dt |    tm    |         ts          |          tsz
----+----------+---------------------+------------------------
    | 10:10:10 | 2020-10-10 10:10:10 | 2020-10-10 02:10:10+00
(1 row)

Also according to https://dev.mysql.com/doc/refman/8.0/en/datetime.html , MySQL 8.0's datetime & timestamp actuall all supports timezone, the only difference is that timezone for datetime is only interpreted upon writing to storage, but is not interpreted upon reading, while currently TiDB can't handle such syntax.

-- mysql:
create table t (dt datetime, ts timestamp);
insert into t values ('2020-10-10 10:10:10+08:00', '2020-10-10 10:10:10+08:00');
insert into t values ('2020-10-10 10:10:10+09:00', '2020-10-10 10:10:10+09:00');
select * from t;
+---------------------+---------------------+
| dt                  | ts                  |
+---------------------+---------------------+
| 2020-10-10 10:10:10 | 2020-10-10 10:10:10 |
| 2020-10-10 09:10:10 | 2020-10-10 09:10:10 |
+---------------------+---------------------+
set @@time_zone = '+00:00';
select * from t;
+---------------------+---------------------+
| dt                  | ts                  |
+---------------------+---------------------+
| 2020-10-10 10:10:10 | 2020-10-10 02:10:10 |
| 2020-10-10 09:10:10 | 2020-10-10 01:10:10 |
+---------------------+---------------------+
-- tidb:
create table t (dt datetime, ts timestamp);
insert into t values ('2020-10-10 10:10:10+08:00', '2020-10-10 10:10:10+08:00');
ERROR 1366 (HY000): Incorrect datetime value: '2020-10-10 10:10:10+08:00' for column 'dt' at row 1

Note that the above syntax MySQL 8.0 supports is not the same as ISO 8601, the following SQL won't work with MySQL but works with PG.

mysql> insert into t values ('2020-10-10T10:10:10Z+08:00', '2020-10-10T10:10:10Z+08:00');
ERROR 1292 (22007): Incorrect datetime value: '2020-10-10T10:10:10Z+08:00' for column 'dt' at row 1

Therefore, I propose the following strategy, we will first make the MySQL's timezone syntax work and extend it to be compatible with ISO 8601, and we will handle timezone in the way MySQL does, e.g.

  • for datetime, the timezone information in the string literal is interpreted when writing to storage with respect to the system's time_zone variable. But when reading from storage, the timezone information is ignored and we will not adapt the time with respect to the time_zone variable. This is inline with the notion that datetime is for human to read, and it is not a phyiscal time. The stored value is the interpreted time with respect to time_zone system variable.
  • for timestamp, the timezone information is interpreted upon writing with respect to the system's time_zone variable, and upon reading, we also take into consideration the system time_zone variable. The stored value is the interpreted time with respect to UTC.

@zz-jason @nullnotnil PTAL about the proposed strategy.

-- mysql:
create table t (dt datetime, ts timestamp);
insert into t values ('2020-10-10 10:10:10+08:00', '2020-10-10 10:10:10+08:00');
insert into t values ('2020-10-10 10:10:10+09:00', '2020-10-10 10:10:10+09:00');
select * from t;
+---------------------+---------------------+
| dt                  | ts                  |
+---------------------+---------------------+
| 2020-10-10 10:10:10 | 2020-10-10 10:10:10 |
| 2020-10-10 09:10:10 | 2020-10-10 09:10:10 |
+---------------------+---------------------+

According to the MySQL behavior:

  • for datetime, the timezone in the string literal is ignored
  • for timestamp, the timezone in the string literal should be used

Am I correct?

-- mysql:
create table t (dt datetime, ts timestamp);
insert into t values ('2020-10-10 10:10:10+08:00', '2020-10-10 10:10:10+08:00');
insert into t values ('2020-10-10 10:10:10+09:00', '2020-10-10 10:10:10+09:00');
select * from t;
+---------------------+---------------------+
| dt                  | ts                  |
+---------------------+---------------------+
| 2020-10-10 10:10:10 | 2020-10-10 10:10:10 |
| 2020-10-10 09:10:10 | 2020-10-10 09:10:10 |
+---------------------+---------------------+

According to the MySQL behavior:

  • for datetime, the timezone in the string literal is ignored
  • for timestamp, the timezone in the string literal should be used

Am I correct?

Nope, for datetime, the timezone in the string literal is also interpreted, otherwise the dt column in the second row should be 2020-10-10 10:10:10.

OK.

  • for datetime, the timezone information in the string literal is interpreted when writing to storage with respect to the system's time_zone variable.
  • for timestamp, the timezone information is interpreted upon writing with respect to the system's time_zone variable.

So they should both be interpreted with respect to the timezone info in the string literal?

@zz-jason @nullnotnil PTAL about the proposed strategy.

LGTM. It it an aside, but I prefer the semantics of timestamp to datetime, but timestamp is also problematic because it suffers Y2K38. I will create a separate FR issue.

So they should both be interpreted with respect to the timezone info in the string literal?

Correct, the only difference appears in reading these values. The datetime is stored as a literal value, and will be retrieved as is, therefore you get what you have stored. While timestamp is stored as a physical timestamp, it will be adapted to system timezone upon reading.

So they should both be interpreted with respect to the timezone info in the string literal?

Correct, the only difference appears in reading these values. The datetime is stored as a literal value, and will be retrieved as is, therefore you get what you have stored. While timestamp is stored as a physical timestamp, it will be adapted to system timezone upon reading.

got, LGTM

Test with TiKV

Currently the parsing logic is only implemented on TiDB's side not on TiKV, and given string-to-datetime cast function pushdown has already being turned on, the user will face the following problem, see the following logs.

mysql> drop table if exists t;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> create table t (str varchar(30));
Query OK, 0 rows affected (0.07 sec)

mysql> insert into t values ('2020-10-10T10:10:10Z'), ('2020-10-10T10:10:10+10:00');
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> explain select * from t where cast(str as datetime) < '2020-10-10T10:12:10Z';
+-------------------------+----------+-----------+---------------+--------------------------------------------------+
| id                      | estRows  | task      | access object | operator info                                    |
+-------------------------+----------+-----------+---------------+--------------------------------------------------+
| TableReader_7           | 8000.00  | root      |               | data:Selection_6                                 |
| 鈹斺攢Selection_6           | 8000.00  | cop[tikv] |               | lt(cast(test.t.str), 2020-10-10 18:12:10.000000) |
|   鈹斺攢TableFullScan_5     | 10000.00 | cop[tikv] | table:t       | keep order:false, stats:pseudo                   |
+-------------------------+----------+-----------+---------------+--------------------------------------------------+
3 rows in set (0.00 sec)

mysql> select * from t where cast(str as datetime) < '2020-10-10T10:12:10Z';
Empty set, 2 warnings (0.00 sec)

mysql> show warnings;
+---------+------+--------------------------------------------------------------------------+
| Level   | Code | Message                                                                  |
+---------+------+--------------------------------------------------------------------------+
| Warning | 1292 | evaluation failed: Incorrect datetime value: '2020-10-10T10:10:10Z'      |
| Warning | 1292 | evaluation failed: Incorrect datetime value: '2020-10-10T10:10:10+10:00' |
+---------+------+--------------------------------------------------------------------------+
2 rows in set (0.00 sec)

mysql> insert into mysql.expr_pushdown_blacklist values ('cast', 'tikv', 'test');
Query OK, 1 row affected (0.01 sec)

mysql> admin reload expr_pushdown_blacklist;
Query OK, 0 rows affected (0.00 sec)

mysql> explain select * from t where cast(str as datetime) < '2020-10-10T10:12:10Z';
+-------------------------+---------+-----------+---------------+--------------------------------------------------+
| id                      | estRows | task      | access object | operator info                                    |
+-------------------------+---------+-----------+---------------+--------------------------------------------------+
| Selection_7             | 2.00    | root      |               | lt(cast(test.t.str), 2020-10-10 18:12:10.000000) |
| 鈹斺攢TableReader_6         | 2.00    | root      |               | data:TableFullScan_5                             |
|   鈹斺攢TableFullScan_5     | 2.00    | cop[tikv] | table:t       | keep order:false, stats:pseudo                   |
+-------------------------+---------+-----------+---------------+--------------------------------------------------+
3 rows in set (0.00 sec)

mysql> select * from t where cast(str as datetime) < '2020-10-10T10:12:10Z';
+---------------------------+
| str                       |
+---------------------------+
| 2020-10-10T10:10:10Z      |
| 2020-10-10T10:10:10+10:00 |
+---------------------------+
2 rows in set (0.00 sec)

By default, the cast function is pushed down to TiKV, and because TiKV can't parse string in ISO8601 format, it will generate warnings and return wrong results.

Note if we block the push down of cast function to TiKV, making the cast happening on TiDB, it will work correctly.

I am afraid that it is risky in time to also make TiKV work before the deadline of release 4.0.8, I am wondering if it is possible to explicitly tell the user about this behavior that for now ISO8601 is not supported on TiKV and still ship this feature partially on 4.0.8.

Note that this feature is not a breaking change--what worked before will continue to work, it is that you have to pay more attention if you want to use it.

No matter what, I'll make the feature work on TiKV as well soon. The only question is that if this is definitely wanted before the deadline of 4.0.8.

@iamxy would you please evaluate about this and let me know if it is acceptable?

LGTM. As long as the user doesn鈥檛 use cast(str as datetime), it will work well.

Was this page helpful?
0 / 5 - 0 ratings