Cudf: [BUG] cudf.DataFrame.merge should support implicit type conversions on join columns

Created on 12 Jul 2019 · 15 comments · Source: rapidsai/cudf

When merging on columns with different dtypes, Pandas implicitly converts them to matching types and runs the merge:

>>> df = pd.DataFrame()
>>> df['id'] = [0, 1, 2]
>>> df['val'] = [9, 9, 9]

>>> df_2 = pd.DataFrame()
>>> df_2['id'] = [0, 1, 2]

>>> df.dtypes
id     int64
val    int64
dtype: object

>>> df_2.dtypes
id    int64
dtype: object
>>> df_2['id'] = df_2['id'].astype('float64')

>>> df.merge(df_2, on=['id'])
   id  val
0   0    9
1   1    9
2   2    9

cudf doesn't, and fails with a type mismatch:

>>> import cudf
>>> df = cudf.from_pandas(df)
>>> df_2 = cudf.from_pandas(df_2)
>>> df.dtypes
id     int64
val    int64
dtype: object
>>> df_2.dtypes
id    float64
dtype: object
>>> df.merge(df_2, on=['id']).to_pandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.9.0a0+1094.gdcdf3596a.dirty-py3.7-linux-x86_64.egg/cudf/dataframe/dataframe.py", line 1934, in merge
    lhs._cols, rhs._cols, left_on, right_on, how, method
  File "cudf/bindings/join.pyx", line 26, in cudf.bindings.join.join
  File "cudf/bindings/join.pyx", line 124, in cudf.bindings.join.join
  File "cudf/bindings/cudf_cpp.pyx", line 487, in cudf.bindings.cudf_cpp.check_gdf_error
cudf.bindings.GDFError.GDFError: b'GDF_DTYPE_MISMATCH'
Labels: bug, cuDF (Python)

All 15 comments

@harrism is this something we can handle down in libcudf or would you want us to typecast in Python before passing to libcudf?

@jrhemstad are you more familiar with the join logic and able to answer?

One thing we would need from you @kkraus14 is the expected casting rules, so we can discuss differences from expected C++ casting and make sure we do the right thing...

> @harrism is this something we can handle down in libcudf or would you want us to typecast in Python before passing to libcudf?

I think it is the caller's responsibility to make the types the same.

> One thing we would need from you @kkraus14 is the expected casting rules, so we can discuss differences from expected C++ casting and make sure we do the right thing...

This is a chief reason why it should be the caller's responsibility.
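In practice that means the caller aligns the key dtypes before calling merge. A minimal workaround sketch, reusing the frames from the issue above (casting to float64 is just one choice of common dtype):

>>> # cast the key column so both sides of the join see the same dtype
>>> df['id'] = df['id'].astype('float64')
>>> merged = df.merge(df_2, on=['id'])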

@jrhemstad if callers have to do this, then they always have to run a kernel and allocate more memory to cast before the join. But for compatible types, it seems like the cast could be fused into the join kernels if it were supported in libcudf.

> But for compatible types, it seems like the cast could be fused into the join kernels if it were supported in libcudf.

In theory, yes, but now instead of doing a single type dispatch per column, you'd have to do _double_ type dispatch per column:

  • Dispatch the lhs type
  • Dispatch the rhs type
  • Compare the two elements

And this would be for _every_ element comparison.

Why for every element comparison? The types don't vary within the columns, so you should be able to dispatch on the column types once per pair of columns, right?

> Why for every element comparison? The types don't vary within the columns, so you should be able to dispatch on the _column_ types once per pair of columns, right?

So to be clear, you're asking if for n columns we can do only n type dispatches? If so, that's not possible without some kind of runtime compilation.

Instead, for n columns and m rows, we have to do m*n dispatches, because there's no way to figure out the type of each column just once. That would require something like invoking a variadic template, which isn't currently possible: we'd have to explicitly instantiate an infinite number of template instantiations (for all possible n and for all combinations of the types of columns [0, n)).

This is how we currently check if two rows are equal between two tables.

https://github.com/rapidsai/cudf/blob/branch-0.9/cpp/src/table/device_table_row_operators.cuh#L75
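A rough Python analogy of the cost being described (illustrative only; the real comparator is the C++ linked above). Each element comparison has to dispatch on the runtime dtype of both sides, so for n columns and m row comparisons the dispatch count is proportional to m*n, not n:

import numpy as np

# simulated runtime type dispatch: pick a typed reader from the dtype,
# loosely analogous to a type_dispatcher invocation in libcudf
READERS = {
    np.dtype('int32'): lambda col, i: int(col[i]),
    np.dtype('int64'): lambda col, i: int(col[i]),
    np.dtype('float32'): lambda col, i: float(col[i]),
    np.dtype('float64'): lambda col, i: float(col[i]),
}

def rows_equal(lhs_cols, rhs_cols, i, j):
    # two dispatches per column pair, repeated for every row comparison,
    # because the (lhs, rhs) type combination can't be baked in once per
    # column without something like runtime compilation
    return all(
        READERS[l.dtype](l, i) == READERS[r.dtype](r, j)
        for l, r in zip(lhs_cols, rhs_cols)
    )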

working on this 👍

So I went ahead and wrote some tests for this, on the premise that we want to match the result pandas produces when the join keys on the left and right have various datatypes. In doing so I discovered some asymmetries in the way pandas handles this which I think warrant some discussion, as well as a decision on how far we actually want to follow the pandas behavior here.

Numerical Issues:

In general, for numeric datatypes (as well as some others), the datatype of the key column in the resulting merged table differs depending on which table is on the left and which is on the right:

>>> import pandas as pd
>>> pd.__version__
'0.24.2'
>>> df1 = pd.DataFrame.from_dict({'A': [1, 2, 3]}, dtype='int32')
>>> df2 = pd.DataFrame.from_dict({'A': [1, 2, 3]}, dtype='float64')
>>> df1.merge(df2, on='A')['A'].dtype
dtype('int32')
>>> df2.merge(df1, on='A')['A'].dtype
dtype('float64')

For a left join one might argue that the left table has some 'priority', but one could equally argue that inner joins are by definition symmetric and we might consider promoting to a common dtype that can hold all the data from both. Moreover, pandas won't cast for like-kinded datatypes, and instead will just attempt to perform the join, dropping the 'mismatching' entries. In the example below pandas drops some entries because float32(x) != float64(x).

>>> import pandas as pd
>>> df1 = pd.DataFrame.from_dict({'A': [1, 2, 3.1, 4.9]}, dtype='float32')
>>> df2 = pd.DataFrame.from_dict({'A': [1, 2, 3.1, 4.9]}, dtype='float64')
>>> df1.merge(df2, on='A')
     A
0  1.0
1  2.0

This causes an issue for us, as we'll get GDF_DTYPE_MISMATCH when passing this case to libcudf. If we cast both columns to the same datatype, all the values will merge and we won't get the same result as pandas; if we don't cast, we'll error. To match pandas we'd have to take the key column in both tables, filter up front by where df[key_col] == df[key_col].astype(to_dtype), and then perform the merge on the filtered result, as in the sketch below.
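A minimal pandas-only sketch of that filter-then-cast idea, reusing df1 (float32) and df2 (float64) from above (illustrative, not what cudf implements):

>>> # keep only the float64 keys that survive a float32 round trip
>>> # unchanged; those are the only values that can compare equal
>>> # across the two dtypes
>>> mask = df2['A'].astype('float32').astype('float64') == df2['A']
>>> rhs = df2[mask].copy()
>>> rhs['A'] = rhs['A'].astype('float32')
>>> df1.merge(rhs, on='A')
     A
0  1.0
1  2.0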

Categorical Issues:

When the key column in the left table is of categorical type, and the key column on the right is not, the join column in the resulting table will be cast to the underlying datatype of the categories themselves:

>>> df1 = pd.DataFrame.from_dict({'A': [1, 2, 3]}, dtype='category')
>>> df2 = pd.DataFrame.from_dict({'A': [1, 2, 3]}, dtype='float32')
>>> df1['A'].cat.categories.dtype
dtype('int64')
>>> df1.merge(df2, on='A')['A'].dtype
dtype('int64')

But if you do the merge the other way around, you'll actually get an object dtype, due to the specifics of the loop in pandas that coerces the merge keys:

>>> df2.merge(df1, on='A')['A'].dtype
dtype('O')

This again creates a situation where we need to write some kind of asymmetric code to match.
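One way to make this symmetric would be to strip the categorical wrapper from whichever side has it before joining, so the key always ends up in the categories' underlying dtype regardless of argument order. A hypothetical helper (decategorize is made up for illustration; any numeric promotion between the resulting dtypes would still apply afterwards):

import pandas as pd

def decategorize(df, key):
    # if the key column is categorical, replace it with a column of the
    # categories' underlying dtype so both sides compare directly
    if isinstance(df[key].dtype, pd.CategoricalDtype):
        df = df.assign(**{key: df[key].astype(df[key].cat.categories.dtype)})
    return df

lhs, rhs = decategorize(df1, 'A'), decategorize(df2, 'A')
merged = lhs.merge(rhs, on='A')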

Date issues:

Pandas only actually supports nanosecond resolution, while we can handle a broader range of resolutions. This gives us the freedom to decide what we want to do here, noting that libcudf currently refuses joins on columns whose resolutions differ:

>>> from cudf import DataFrame, Series
>>> s1 = Series(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ms]')
>>> df1 = DataFrame({'A': s1})
>>> df2 = df1.copy()
>>> df2['A'] = df2['A'].astype('datetime64[ns]')
>>> df1['A'].dtype
dtype('<M8[ms]')
>>> df2['A'].dtype
dtype('<M8[ns]')
>>> df1.merge(df2, on='A')
*snip*
RuntimeError: cuDF failure at: [...]/cudf/cpp/src/join/legacy/joining.cu:210: Timestamp resolution mismatch
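Since casting the coarser column up to the finer resolution is lossless, one option (a sketch, not current libcudf behavior) would be to unify resolutions before handing the columns to the join:

>>> # upcast ms -> ns so both key columns share a resolution; the
>>> # upcast loses no information
>>> df1['A'] = df1['A'].astype('datetime64[ns]')
>>> merged = df1.merge(df2, on='A')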

These issues combined make it tricky to write a general function that matches Pandas without essentially coding up all these cases individually, and in some cases casting multiple times or performing extra work.

In this case I would not be opposed to just writing our own, more symmetric casting rules and adhering to those instead of pandas', but I would like to hear what others think as well.

@beckernick @kkraus14 @randerzander

I agree we should just define a sane set of rules as opposed to following Pandas semantics here.

From my perspective, for left/right joins we should typecast the key columns to the type in the left/right table. For inner and outer joins we should find a common dtype and cast both columns to it (see the sketch below). This should handle all numerical types as well as timestamps, but how do we avoid issues from overflow? I.e., if my left table's key is int8 and my right table's key is int64, does typecasting error or silently overflow?
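A minimal sketch of the inner/outer rule using NumPy's promotion table (np.promote_types is an assumption here, not necessarily the rule cudf would adopt; left, right, and the 'key' column name are placeholders):

import numpy as np

# promote both key columns to a common dtype before the join, e.g.
# int8 + int64 -> int64, int64 + float64 -> float64
common = np.promote_types(left['key'].dtype, right['key'].dtype)
left['key'] = left['key'].astype(common)
right['key'] = right['key'].astype(common)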

For category dtype, we should keep the output as a category dtype, but generate a new synchronized dictionary for use by the output. I believe that's already what we're doing.

Thanks @kkraus14 this is what I was leaning towards too.

Is this worth filing on the pandas board?

> Thanks @kkraus14 this is what I was leaning towards too.
>
> Is this worth filing on the pandas board?

Yes it would be good to at least kick off a discussion surrounding the behavior.

I think the aspect that still needs discussion is: how do we handle overflows in typecasting?

@kkraus14 I think it would be fair to error there as this seems like something one is more likely to run into unintentionally.

That said, (I think) if the left table only covers the range of int8, then anything on the right-hand side that falls outside the int8 range can't possibly match during the merge anyway, so we could also filter by the min/max, then perform the join, and still get the same result without ever overflowing. A sketch of that filter is below.
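Sketched in pandas-style Python (left, right, and 'key' are placeholders for the two tables and their int8/int64 join column):

import numpy as np

# rows whose int64 key falls outside int8's range can never match an
# int8 key, so drop them, downcast the survivors, and join
bounds = np.iinfo(np.int8)
in_range = right['key'].between(bounds.min, bounds.max)
rhs = right[in_range].copy()
rhs['key'] = rhs['key'].astype('int8')
result = left.merge(rhs, on='key')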

> That said, (I think) if the left table only covers the range of int8, then anything on the right-hand side that falls outside the int8 range can't possibly match during the merge anyway, so we could also filter by the min/max, then perform the join, and still get the same result without ever overflowing.

Yes, that sounds like an option if downcasting, but it sounds potentially expensive. It may honestly be cheaper to upcast the smaller type in this situation, perform the join, and then downcast the result column. It may be worth trying both and checking performance / profiles?

@kkraus14 I will try it out.
