Incubator-superset: Error with postgres columns of type "TIMESTAMP WITH TIME ZONE"

Created on 9 Jan 2017  ·  16Comments  ·  Source: apache/incubator-superset

  • [x] I have checked the superset logs for python stacktraces and included it here as text if any
  • [X] I have reproduced the issue with at least the latest released version of superset
  • [X] I have checked the issue tracker for the same issue and I haven't found one similar

Hey! I'm using SQL Lab with a postgres backend. I can't select from some tables, specifically when the column has a type of TIMESTAMP WITH TIME ZONE. An example of the contents of such a cell is 2014-11-24 11:30:02.140381+00

I have other tables with the column of type TIMESTAMP WITHOUT TIME ZONE, and they work fine. Ex: 2010-01-28T16:59:52.900963

Superset version

0.15.1

Expected results

Query should run and return results.

Actual results

I get flashed a 1 second error message saying "Datatype not understood" which then disappears, and then get shown the loading-spinner.

Stack trace:

2017-01-09 17:09:52,437:INFO:root:Running query:
SELECT product_id,
       base_product_id,
       updated
FROM public.product_id_mapping LIMIT 100
2017-01-09 17:09:52,513:ERROR:root:data type not understood
Traceback (most recent call last):
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/superset/views.py", line 2500, in sql_json
    data = sql_lab.get_sql_results(query_id, return_results=True)
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/celery/local.py", line 188, in __call__
    return self._get_current_object()(*a, **kw)
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/celery/app/task.py", line 420, in __call__
    return self.run(*args, **kwargs)
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/superset/sql_lab.py", line 171, in get_sql_results
    payload['columns'] = cdf.columns_dict if cdf else []
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/superset/dataframe.py", line 54, in columns_dict
    agg = agg_func(self.__df.dtypes[col], col)
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/superset/dataframe.py", line 110, in agg_func
    if np.issubdtype(dtype, np.number):
  File "/Users/user/src/superset/venv/lib/python2.7/site-packages/numpy/core/numerictypes.py", line 755, in issubdtype
    return issubclass(dtype(arg1).type, arg2)
TypeError: data type not understood

Steps to reproduce

Run a SELECT query against a postgresql db table that has a column of type TIMEZONE WITH TIME STAMP

cant-reproduce help-wanted

Most helpful comment

After poking at this a bit more, it appears the issue is not entirely resolved by updating Pandas. The behavior I've observed is as follows:

Prior to updating, an exception would be thrown on any selection from a table containing a timestamp with time zone column, regardless of whether or not that column was selected. After updating, it's possible to select from such a table, as long as the results themselves do not contain any timestamp with time zone type values. It does appear to be possible to select a timestamp with time zone column as long as it's you cast to a timestamp without time zone. For example...

SELECT (timevalue_column_name at time zone 'UTC')::timestamp without time zone AS "Timestamp"
FROM table_name;

Applying some function that returns a different type also seems to work fine - converting to date for example:

SELECT date(timevalue_column_name at time zone 'UTC') AS "Date"
FROM table_name;

All 16 comments

This looks like a dup of #1900

Good call, sorry xrmx!

@ama-ableton np, could you please add some debug as requested in https://github.com/airbnb/superset/issues/1900#issuecomment-270713457 ?

Sure, does that help?

I have the same issue for a PostgreSQL db on superset==0.15.4 .

I also have this issue.
Fetched last master but the same.
Thanks

From ama-ableton:

Here's the arg1 and arg2 passed into the issubdtype function mentioned in the stack trace:

arg1: datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=0, name=None)]
arg2: <type 'numpy.number'>

We don't use Postgres here at Airbnb and we would love to have a contributor that irons out all of those Postgres specific bugs!

There's a module called db_engine_specs.py that is used to define all of the engine specific logic. We could have a new method in there alter_dataframe that would apply some casting on Pandas dataframes after receiving the data. The base class would do nothing, and the Postgres one would somehow change things like psycopg2.tz.FixedOffsetTimezone to a numpy compatible date.

Updating Pandas (from 0.18.1 to 0.19.2 in my case) resolved this issue. I would recommend giving that a shot.

My understanding is that Pandas itself is supposed to store results using a time zone unaware type (by converting to UTC). Perhaps a defect in this behavior was introduced and subsequently resolved?

@tfellison thanks for the update. Can anyone else please confirm it?

After poking at this a bit more, it appears the issue is not entirely resolved by updating Pandas. The behavior I've observed is as follows:

Prior to updating, an exception would be thrown on any selection from a table containing a timestamp with time zone column, regardless of whether or not that column was selected. After updating, it's possible to select from such a table, as long as the results themselves do not contain any timestamp with time zone type values. It does appear to be possible to select a timestamp with time zone column as long as it's you cast to a timestamp without time zone. For example...

SELECT (timevalue_column_name at time zone 'UTC')::timestamp without time zone AS "Timestamp"
FROM table_name;

Applying some function that returns a different type also seems to work fine - converting to date for example:

SELECT date(timevalue_column_name at time zone 'UTC') AS "Date"
FROM table_name;

Short update on superset 0.20.5:

In SQLLab the issue ist still present:

[2017-11-12 21:25:03,039: ERROR/ForkPoolWorker-18] zip argument #2 must support iteration
Traceback (most recent call last):
  File "/opt/apps/superset/venv_superset_0.20.5/lib64/python3.4/site-packages/superset/sql_lab.py", line 94, in get_sql_results
    ctask, query_id, return_results, store_results, user_name)
  File "/opt/apps/superset/venv_superset_0.20.5/lib64/python3.4/site-packages/superset/sql_lab.py", line 241, in execute_sql
    'data': cdf.data if cdf.data else [],
  File "/opt/apps/superset/venv_superset_0.20.5/lib64/python3.4/site-packages/superset/dataframe.py", line 52, in data
    return self.__df.to_dict(orient='records')
  File "/opt/apps/superset/venv_superset_0.20.5/lib64/python3.4/site-packages/pandas/core/frame.py", line 898, in to_dict
    for row in self.values]
  File "/opt/apps/superset/venv_superset_0.20.5/lib64/python3.4/site-packages/pandas/core/frame.py", line 898, in <listcomp>
    for row in self.values]
TypeError: zip argument #2 must support iteration

How to reproduce

select current_timestamp

or equivalently

select now()

It is definitely caused by the combination of date and timezone, since the following selects work without any problems:

Time with timezone:

select current_time

Timestamp without timezone:

select localtimestamp

or

select now()::timestamp

Impact

In our configuration (Postgres, Celery backend for SQLLab, Redis Cache) this bug manages to paralyze SQLLab for quite some time, hindering subsequent queries to succeed.

This likely is a bug in Pandas. When using a Postgres field with time zone a DateTimeIndex is created with “freq=None” which is not iterable. If using naive date times this is not present.

^^^ this is not entirely correct

There's probably a way to convert the column in the dataframe on our side to work around the bug. I spent 10-15 minutes trying a few days ago and gave up.

This is the reported issue: https://github.com/pandas-dev/pandas/issues/18372

This is the, confirmed, fix:

import numpy as np

from pandas.core.common import _maybe_box_datetimelike

    @property
    def data(self):
        # workaround pandas issue
        # see: https://github.com/pandas-dev/pandas/issues/18372
        return [dict((k, _maybe_box_datetimelike(v))
                     for k, v in zip(self.__df.columns, np.atleast_1d(row)))
                for row in self.__df.values]]

Let me see if I can create a patch for this tomorrow.

aaaa eeeee lysiy muzhik v nakolkah

Was this page helpful?
0 / 5 - 0 ratings

Related issues

gbrian picture gbrian  ·  3Comments

eliab picture eliab  ·  3Comments

kalimuthu123 picture kalimuthu123  ·  3Comments

fly-high-bj picture fly-high-bj  ·  3Comments

XiaodiKong picture XiaodiKong  ·  3Comments