Cudf: [QST] regarding pandas version dependency for cuDF 0.9.0

Created on 27 Sep 2019 · 15 Comments · Source: rapidsai/cudf

What is your question?
I have observed that a pandas>=0.24.2,<0.25 version dependency has been added for cuDF 0.9.0.

https://github.com/rapidsai/cudf/blob/a0a3a359cf87db855e81bdb3ae20d80a238d1607/conda/recipes/cudf/meta.yaml#L30

Is it a mandatory requirement for cuDF to depend on pandas < 0.25? This might force a downgrade of the default pandas version available in a conda environment. Please advise! Thank you!

cuDF (Python) cuIO libcudf question

pradghos
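
A minimal sketch of what the `pandas>=0.24.2,<0.25` pin means for a given pandas version, not code from the thread. The helper names `parse_release` and `satisfies_pin` are hypothetical, and the comparison is deliberately simplified: it looks only at the numeric release components and ignores pre-release suffixes, which real tooling such as conda or the `packaging` library handles properly.

```python
import re


def parse_release(version):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def satisfies_pin(version, lower=(0, 24, 2), upper=(0, 25)):
    """True if lower <= version < upper, comparing release tuples only."""
    release = parse_release(version)
    # Pad to equal length so that "0.25" and "0.25.0" compare as equal.
    width = max(len(release), len(lower), len(upper))

    def pad(parts):
        return parts + (0,) * (width - len(parts))

    return pad(lower) <= pad(release) < pad(upper)


for candidate in ("0.24.1", "0.24.2", "0.25.0", "0.25.1"):
    print(candidate, satisfies_pin(candidate))
```

Under this pin only the 0.24.2 release line qualifies, which is why an environment that already has pandas 0.25 would be downgraded.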

All 15 comments

@kkraus14: Any pointers on the above would really help. Thank you!

pradghos on 27 Sep 2019

@pradghos There's an incompatibility with Pandas 0.25+ somewhere in the Python codebase that we just haven't had time to triage and resolve. We'll likely tackle in 0.11, but would welcome a contribution that resolves the problem 😄

kkraus14 on 1 Oct 2019

👍2

Thanks for the response!

pradghos on 1 Oct 2019

https://pandas.pydata.org/pandas-docs/version/0.25/whatsnew/v0.25.0.html#column-order-is-preserved-when-passing-a-list-of-dicts-to-dataframe

libcudf.avro.read_avro does not preserve the order of dictionaries. Seems reasonably solvable, but I'm going to save this for last because I'm not familiar with avro.

groupby/repr have what appears to be the same problem with DataFrame.from_pandas() and column order. This seems more fixable with dirty Python hacks, but my heart tells me that's the wrong way to go.

As far as our test suite reports, this is the extent of the pandas 0.25 issues. Because pandas now complies with Python 3.7's dict ordering, the column order of a DataFrame is guaranteed. This seems like something that can be fixed in one place and is probably better solved in libcudf, so I'm going in that direction.

@kkraus14 Do you have any feelings about this? Any feedback is welcome, but I think I've got a way through the woods if you don't have an opinion.

millerhooks on 13 Nov 2019
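
A minimal sketch of the pandas behavior described in the release note linked in the comment above, not code from the thread. It assumes pandas is installed. Per that note, pandas 0.25 on Python 3.6+ keeps the key order of the dicts when building a DataFrame from a list of dicts, while older versions could return the columns in a different (sorted) order, so the first printed column order depends on the installed version. The sample `records` are made up for illustration.

```python
import pandas as pd

# The keys are deliberately not in alphabetical order.
records = [{"b": 1, "a": 2}, {"b": 3, "a": 4}]

# Column order here depends on the pandas version (see the lead-in).
df = pd.DataFrame(records)
print(pd.__version__, list(df.columns))

# Passing `columns` explicitly pins the order regardless of pandas version.
ordered = pd.DataFrame(records, columns=["b", "a"])
print(list(ordered.columns))
```

Tests that compare cuDF output against pandas output built this way are sensitive to the change, which matches the failures reported above.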

> libcudf.avro.read_avro does not preserve the order of dictionaries. Seems reasonably solvable, but I'm going to save this for last because I'm not familiar with avro.

What dictionaries are you referring to? Is this a test issue or an issue in the actual read_avro code?

> groupby/repr have what appears to be the same problem with DataFrame.from_pandas() and column order. This seems more fixable with dirty Python hacks, but my heart tells me that's the wrong way to go.

In general, column ordering is all dirty Python hacks that rearrange the ordering returned by C++ to match the behavior Pandas expects. As long as we're not copying actual data and are just reordering the references, it should still be cheap and fine.

> As far as our test suite reports, this is the extent of the pandas 0.25 issues. Because pandas now complies with Python 3.7's dict ordering, the column order of a DataFrame is guaranteed. This seems like something that can be fixed in one place and is probably better solved in libcudf, so I'm going in that direction.

libcudf shouldn't match Pandas behavior here as Python / cuDF aren't the only consumers of it. libcudf should strive to have a C++ style API and sane behaviors and making them "Pandas-like" should be done in the Python / Cython.

kkraus14 on 15 Nov 2019
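
A minimal sketch of the "reorder the references, not the data" idea from the comment above, not code from the thread. Plain Python lists stand in for cuDF columns, and `reorder_columns` is a hypothetical helper, not a cuDF API.

```python
def reorder_columns(columns, expected_order):
    """Return a new name -> column mapping in `expected_order`.

    Only the mapping is rebuilt; the column objects themselves are reused,
    so no column data is copied.
    """
    missing = [name for name in expected_order if name not in columns]
    if missing:
        raise KeyError(f"columns not present: {missing}")
    return {name: columns[name] for name in expected_order}


# Stand-in for columns handed back by the C++ layer in its own order.
returned = {"b": [3, 4], "a": [1, 2]}
reordered = reorder_columns(returned, ["a", "b"])

print(list(reordered))                    # column names in the requested order
print(reordered["a"] is returned["a"])    # same object, not a copy
```

The cost is proportional to the number of columns rather than the number of rows, which is the reason such reordering stays cheap.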

It's an issue in the libcudf avro reader (not fastavro) returning an unordered map of the dictionary index. This problem is mirrored but more complex in the return of groupby from libcudf. Because of this tests fail for avro, groupby, and repr. The failure for the repr test is also related to the same groupby call.

To the best of my knowledge, none of the other backwards-incompatible API changes in pandas>=0.25 cause any issues or are incompatible with anything else in cudf. While this can be "fixed" by changing the tests, that doesn't really address the underlying issue. No amount of Python reordering hacks would work either, because the functions that reorder columns to match the pandas output would need to know what they are being compared against.

I have located where these issues sit in libcudf, but I'm not sure that making those changes is in the scope of this ticket, and I'd certainly need some expert consultation before making changes, as it's a pretty deep cut. I talked to @trxcllnt about it and walked him through what I was seeing; he seemed to agree and said it might be worth looping in someone from the IO part of the team.

millerhooks on 15 Nov 2019

@mjsamoht @OlivierNV to comment on Avro.

harrism on 11 Dec 2019

@jrhemstad @devavret FYI regarding groupby. @millerhooks perhaps you could file specific issues for the libcudf feature incompatibilities?

harrism on 11 Dec 2019

I'm not sure I fully understand the question about preserving the order of dictionaries. I'm assuming that this refers to just "preserving the dictionaries": AVRO includes an "enum" type where the schema may include a list of named symbols, which is effectively a column-level dictionary.
For such columns (enums), the cudf avro reader will either create an 'int' column if no symbols are present (raw enum values), or create a string column if the enum symbols are named.
Once libcudf++ supports string dictionaries, it would be relatively easy to preserve the dictionary encoding, though I'm not sure what the 'unordered map of the dictionary index' is referring to.

OlivierNV on 11 Dec 2019
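
A minimal sketch of the enum handling described in the comment above, not the libcudf reader itself. `decode_enum_column` is a hypothetical helper and plain lists stand in for columns: with no symbols the raw integer values are kept, and with named symbols each index is mapped to its string.

```python
def decode_enum_column(indices, symbols=None):
    """Turn raw enum indices into a column of values.

    indices: iterable of integer enum values read from the file.
    symbols: optional list of symbol names from the schema.
    """
    if not symbols:
        # No symbol names available: keep the raw enum values (int column).
        return list(indices)
    # Symbol names available: map each index to its name (string column).
    return [symbols[index] for index in indices]


print(decode_enum_column([0, 2, 1]))
print(decode_enum_column([0, 2, 1], ["RED", "GREEN", "BLUE"]))
```

The symbol list acts as the column-level dictionary mentioned above; preserving it as a dictionary encoding would mean keeping `indices` and `symbols` side by side instead of materializing the strings.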

@OlivierNV I think the issue is the C++ avro reader is returning the schema fields in a different order than the fastavro library. @millerhooks is talking about seeing the fields in the Python schema dictionary (i.e. OrderedDict) in an unexpected order, not anything to do with dictionary-typed columns. Schema field order should be deterministic, right? When I looked into it I think I saw libcudf using unordered_map; could that be the issue?

trxcllnt on 11 Dec 2019

Oh, the order of the columns themselves? I remember there was something very weird about fastavro: it returns the columns in the reverse of the order in which they appear in the JSON schema, so the cudf reader was modified to explicitly do the same reversal so that it would match fastavro on the pandas side (see reader_impl.cu at line 123).

Afaik the order should definitely be deterministic and the reverse order of the column names in the avro schema. Since this weirdness was specially made to match the fastavro results (python tests check that the order is the same as fastavro afaik), it would be good to know what's the difference here.

As @kkraus14 said in a previous post, that reversal should be moved to the python side rather than within the libcudf reader.

OlivierNV on 11 Dec 2019

I don't quite understand the issue regarding groupby. On the libcudf side, the order of the columns returned is the same as the order of the input values.

devavret on 11 Dec 2019

I think the avro issue is really a side effect of fastavro returning columns in an order that depends on the pandas/python version, as described in this fastavro issue.
I'd like to remove the reordering from the libcudf reader and modify the avro test so that it's independent of the order of columns returned by fastavro. I'm no Python expert, though, so if anyone has suggestions on how to compare two dataframes whose columns may be in a different order, I'd love to hear them. I'm thinking of indexing one of the dataframes using the column names of the other dataframe, but apparently it's not trivial to get a column's name given a column. If I can't figure it out, I might just reverse the order of the dataframe returned by fastavro using df.iloc[:, ::-1] and let the Python folks remove that reversal from the test when upgrading to a newer version of Python.

OlivierNV on 11 Dec 2019

👍1
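
A minimal sketch of one way to compare two dataframes whose columns may be in a different order, as asked in the comment above; it is not code from the thread. It assumes pandas is installed, and `assert_equal_ignoring_column_order` is a hypothetical helper. The idea is the one floated above: index one dataframe by the column names of the other, then compare.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def assert_equal_ignoring_column_order(expected, actual):
    """Raise AssertionError unless both frames hold the same named columns."""
    assert set(expected.columns) == set(actual.columns), "column names differ"
    # Select the columns of `actual` in the order used by `expected`.
    assert_frame_equal(expected, actual[list(expected.columns)])


expected = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": [0.5, 1.5]})

# Same data with the columns reversed, using the slicing mentioned above.
reversed_frame = expected.iloc[:, ::-1]

assert_equal_ignoring_column_order(expected, reversed_frame)
print(list(reversed_frame.columns))
```

`assert_frame_equal` also accepts `check_like=True`, which ignores the order of both the index and the columns; the explicit selection above ignores only column order.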

Really hope this gets priority, and good to see it's being worked on 👍 😄
Having just one version of pandas to choose from makes it a bit frustrating to use in our model-build images.

NegatioN on 9 Jan 2020

Hi @NegatioN, our plan right now is to enable 0.25+ only. When Pandas 1.0 is released in the near future, we will move to supporting 1.0+ only, but will do our best to support multiple minor versions of 1.x.

kkraus14 on 9 Jan 2020

👍1
