Incubator-superset: Error with postgres columns of type "TIMESTAMP WITH TIME ZONE"

Created on 9 Jan 2017 · 16Comments · Source: apache/incubator-superset

[x] I have checked the superset logs for python stacktraces and included it here as text if any
[X] I have reproduced the issue with at least the latest released version of superset
[X] I have checked the issue tracker for the same issue and I haven't found one similar

Hey! I'm using SQL Lab with a postgres backend. I can't select from some tables, specifically when the column has a type of TIMESTAMP WITH TIME ZONE. An example of the contents of such a cell is 2014-11-24 11:30:02.140381+00

I have other tables with the column of type TIMESTAMP WITHOUT TIME ZONE, and they work fine. Ex: 2010-01-28T16:59:52.900963

Superset version

0.15.1

Expected results

Query should run and return results.

Actual results

I get flashed a 1 second error message saying "Datatype not understood" which then disappears, and then get shown the loading-spinner.

Stack trace:

2017-01-09 17:09:52,437:INFO:root:Running query:
SELECT product_id,
       base_product_id,
       updated
FROM public.product_id_mapping LIMIT 100
2017-01-09 17:09:52,513:ERROR:root:data type not understood
Traceback (most recent call last):
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/superset/views.py", line 2500, in sql_json
    data = sql_lab.get_sql_results(query_id, return_results=True)
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/celery/local.py", line 188, in __call__
    return self._get_current_object()(*a, **kw)
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/celery/app/task.py", line 420, in __call__
    return self.run(*args, **kwargs)
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/superset/sql_lab.py", line 171, in get_sql_results
    payload['columns'] = cdf.columns_dict if cdf else []
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/superset/dataframe.py", line 54, in columns_dict
    agg = agg_func(self.__df.dtypes[col], col)
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/superset/dataframe.py", line 110, in agg_func
    if np.issubdtype(dtype, np.number):
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/numpy/core/numerictypes.py", line 755, in issubdtype
    return issubclass(dtype(arg1).type, arg2)
TypeError: data type not understood

Steps to reproduce

Run a SELECT query against a postgresql db table that has a column of type TIMEZONE WITH TIME STAMP

cant-reproduce help-wanted

Source

ama-ableton

👍1

Most helpful comment

After poking at this a bit more, it appears the issue is not entirely resolved by updating Pandas. The behavior I've observed is as follows:

Prior to updating, an exception would be thrown on any selection from a table containing a timestamp with time zone column, regardless of whether or not that column was selected. After updating, it's possible to select from such a table, as long as the results themselves do not contain any timestamp with time zone type values. It does appear to be possible to select a timestamp with time zone column as long as it's you cast to a timestamp without time zone. For example...

SELECT (timevalue_column_name at time zone 'UTC')::timestamp without time zone AS "Timestamp"
FROM table_name;

Applying some function that returns a different type also seems to work fine - converting to date for example:

SELECT date(timevalue_column_name at time zone 'UTC') AS "Date"
FROM table_name;

tfellison on 22 Feb 2017

👍5

All 16 comments

This looks like a dup of #1900

xrmx on 9 Jan 2017

Good call, sorry xrmx!

ama-ableton on 9 Jan 2017

@ama-ableton np, could you please add some debug as requested in https://github.com/airbnb/superset/issues/1900#issuecomment-270713457 ?

xrmx on 9 Jan 2017

Sure, does that help?

ama-ableton on 9 Jan 2017

I have the same issue for a PostgreSQL db on superset==0.15.4 .

srugano on 29 Jan 2017

I also have this issue.
Fetched last master but the same.
Thanks

btzsoft on 3 Feb 2017

From ama-ableton:

Here's the arg1 and arg2 passed into the issubdtype function mentioned in the stack trace:

arg1: datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=0, name=None)]
arg2: <type 'numpy.number'>

xrmx on 10 Feb 2017

We don't use Postgres here at Airbnb and we would love to have a contributor that irons out all of those Postgres specific bugs!

There's a module called db_engine_specs.py that is used to define all of the engine specific logic. We could have a new method in there alter_dataframe that would apply some casting on Pandas dataframes after receiving the data. The base class would do nothing, and the Postgres one would somehow change things like psycopg2.tz.FixedOffsetTimezone to a numpy compatible date.

mistercrunch on 10 Feb 2017

Updating Pandas (from 0.18.1 to 0.19.2 in my case) resolved this issue. I would recommend giving that a shot.

My understanding is that Pandas itself is supposed to store results using a time zone unaware type (by converting to UTC). Perhaps a defect in this behavior was introduced and subsequently resolved?

tfellison on 22 Feb 2017

@tfellison thanks for the update. Can anyone else please confirm it?

xrmx on 22 Feb 2017

After poking at this a bit more, it appears the issue is not entirely resolved by updating Pandas. The behavior I've observed is as follows:

SELECT (timevalue_column_name at time zone 'UTC')::timestamp without time zone AS "Timestamp"
FROM table_name;

Applying some function that returns a different type also seems to work fine - converting to date for example:

SELECT date(timevalue_column_name at time zone 'UTC') AS "Date"
FROM table_name;

tfellison on 22 Feb 2017

👍5

Short update on superset 0.20.5:

In SQLLab the issue ist still present:

[2017-11-12 21:25:03,039: ERROR/ForkPoolWorker-18] zip argument #2 must support iteration
Traceback (most recent call last):
  File "/opt/apps/superset/venv_superset_0.20.5/lib64/python3.4/site-packages/superset/sql_lab.py", line 94, in get_sql_results
    ctask, query_id, return_results, store_results, user_name)
  File "/opt/apps/superset/venv_superset_0.20.5/lib64/python3.4/site-packages/superset/sql_lab.py", line 241, in execute_sql
    'data': cdf.data if cdf.data else [],
  File "/opt/apps/superset/venv_superset_0.20.5/lib64/python3.4/site-packages/superset/dataframe.py", line 52, in data
    return self.__df.to_dict(orient='records')
  File "/opt/apps/superset/venv_superset_0.20.5/lib64/python3.4/site-packages/pandas/core/frame.py", line 898, in to_dict
    for row in self.values]
  File "/opt/apps/superset/venv_superset_0.20.5/lib64/python3.4/site-packages/pandas/core/frame.py", line 898, in <listcomp>
    for row in self.values]
TypeError: zip argument #2 must support iteration

How to reproduce

select current_timestamp

or equivalently

select now()

It is definitely caused by the combination of date and timezone, since the following selects work without any problems:

Time with timezone:

select current_time

Timestamp without timezone:

select localtimestamp

select now()::timestamp

Impact

In our configuration (Postgres, Celery backend for SQLLab, Redis Cache) this bug manages to paralyze SQLLab for quite some time, hindering subsequent queries to succeed.

rumbin on 12 Nov 2017

This likely is a bug in Pandas. When using a Postgres field with time zone a DateTimeIndex is created with “freq=None” which is not iterable. If using naive date times this is not present.

^^^ this is not entirely correct

bolkedebruin on 18 Nov 2017

There's probably a way to convert the column in the dataframe on our side to work around the bug. I spent 10-15 minutes trying a few days ago and gave up.

mistercrunch on 19 Nov 2017

This is the reported issue: https://github.com/pandas-dev/pandas/issues/18372

This is the, confirmed, fix:

import numpy as np

from pandas.core.common import _maybe_box_datetimelike

    @property
    def data(self):
        # workaround pandas issue
        # see: https://github.com/pandas-dev/pandas/issues/18372
        return [dict((k, _maybe_box_datetimelike(v))
                     for k, v in zip(self.__df.columns, np.atleast_1d(row)))
                for row in self.__df.values]]

Let me see if I can create a patch for this tomorrow.

bolkedebruin on 19 Nov 2017

aaaa eeeee lysiy muzhik v nakolkah