Describe the bug
We are seeing some test failures in `test_multiindex_from_product` in PR ( https://github.com/rapidsai/cudf/pull/4567 ), which appear unrelated to that change. It looks like there is some non-determinism in the `cudf.MultiIndex.from_product` constructor. Here's an example failure that demonstrates this:
```
=================================== FAILURES ===================================
_________________________ test_multiindex_from_product _________________________

    def test_multiindex_from_product():
        arrays = [["a", "a", "b", "b"], ["house", "store", "house", "store"]]
        pmi = pd.MultiIndex.from_product(arrays, names=["alpha", "location"])
        gmi = cudf.MultiIndex.from_product(arrays, names=["alpha", "location"])
>       assert_eq(pmi, gmi)
E       AssertionError: MultiIndex level [0] are different
E
E       MultiIndex level [0] values are different (50.0 %)
E       [left]:  Index(['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b',
E                       'b', 'b'],
E                      dtype='object', name='alpha')
E       [right]: Index(['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b',
E                       'a', 'b'],
E                      dtype='object', name='alpha')

cudf/tests/test_multiindex.py:384: AssertionError
```
Steps/Code to reproduce bug
Running `test_multiindex_from_product` a few times appears to be sufficient.
Expected behavior
That cudf.MultiIndex.from_product behaves deterministically.
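For reference, pandas defines `from_product` as the cartesian product of the input iterables in row-major order (the first level varies slowest), so the expected result is fully determined by the inputs. A minimal pure-Python sketch of that ordering:

```python
from itertools import product

arrays = [["a", "b"], ["house", "store"]]

# Row-major cartesian product: the first level varies slowest, which
# matches the deterministic ordering pd.MultiIndex.from_product produces.
tuples = list(product(*arrays))
level0 = [t[0] for t in tuples]  # first level, repeated per block
level1 = [t[1] for t in tuples]  # second level, cycling fastest
```

Here `level0` is `['a', 'a', 'b', 'b']` and `level1` is `['house', 'store', 'house', 'store']`, which is the ordering the test asserts against.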
Environment overview
Environment details
Other details should be in the log. If not, maybe we can ask OPS for more details.
Additional context
NA
FWIW I tried running this locally, but was not able to reproduce it. Though I may not have the most recent cuDF version.
```
In [1]: import pandas as pd

In [2]: import cudf

In [3]: import rmm

In [4]: from cudf.tests.utils import assert_eq

In [5]: rmm.reinitialize(pool_allocator=True,
   ...:                  initial_pool_size=int(1 * 2**20))
Out[5]: 0

In [6]: for i in range(100):
   ...:     arrays = [
   ...:         ["a", "a", "b", "b"],
   ...:         ["house", "store", "house", "store"]
   ...:     ]
   ...:     pmi = pd.MultiIndex.from_product(arrays,
   ...:                                      names=["alpha", "location"])
   ...:     gmi = cudf.MultiIndex.from_product(arrays,
   ...:                                        names=["alpha", "location"])
   ...:     assert_eq(pmi, gmi)

In [7]: cudf.__version__
Out[7]: '0.13.0a+4269.g11903d2'
```
One source of nondeterminism might be the call to join in the MultiIndex constructor:
```python
level = DataFrame(index=codes).join(level)
```
Although this could totally be a red herring.
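As a CPU-side illustration of why that pattern can be order-sensitive: a hash join in general gives no guarantee about output row order, which is why a decode-via-join usually has to carry a generated order column and sort on it afterwards. A sketch of that pattern using pandas (the column names here are hypothetical, not cuDF internals):

```python
import pandas as pd

levels = pd.DataFrame({"level": ["a", "b", "c", "d"]})
codes = [2, 0, 2, 1, 0]

# Keep an explicit "order" column so rows can be restored to input order
# after the join; a (GPU) hash join does not guarantee any output order.
df = pd.DataFrame({"code": codes, "order": range(len(codes))})
joined = df.merge(levels, left_on="code", right_index=True)
decoded = joined.sort_values("order")["level"].tolist()
```

With the trailing sort, `decoded` is `['c', 'a', 'c', 'b', 'a']` regardless of the order the join happens to emit; without it, the result depends on join internals.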
Also something that might be relevant is the get_level_values method that was added somewhat recently:
https://github.com/rapidsai/cudf/blob/branch-0.13/python/cudf/cudf/core/multiindex.py#L540
Although it looks like that isn't used outside of groupby.
The mentioned test seems to pass locally for me with the latest changes + #4567:
```
(cudf) root@dt07:/cudf# pytest python/cudf/cudf/tests/test_multiindex.py::test_multiindex_from_product
============================== test session starts ===============================
platform linux -- Python 3.7.6, pytest-5.3.5, py-1.8.1, pluggy-0.13.0
rootdir: /cudf/python/cudf
plugins: hypothesis-5.5.4
collected 1 item

python/cudf/cudf/tests/test_multiindex.py .                                [100%]

=============================== 1 passed in 1.40s ================================
```
Was it failing on a specific OS?
All details should be in the linked log in the OP. From what I can see, it says ubuntu16.04.
Is determinism truly necessary? Note that you're paying a heavy cost in having to sort to get deterministic ordering.
I don't know. Maybe someone else can comment 🙂
In either case, there was a test for it and it was broken. If we don't want the determinism, maybe we can relax the test. If we do want the determinism, then there is a bug fix needed.
It's worth noting that if we do relax the determinism our results will differ from Pandas. Not sure how important that is to us, but figured it should be highlighted.
> It's worth noting that if we do relax the determinism our results will differ from Pandas. Not sure how important that is to us, but figured it should be highlighted.
Yeah, this has come up before.
> Is determinism truly necessary? Note that you're paying a heavy cost in having to sort to get deterministic ordering.
Yes, this is basically asking for a Table construction from a set of dictionary encoded values and then us returning that Table in a completely different order than asked for.
@kkraus14 I don't understand your explanation. Why is the ordering of the table crucial?
Also, is determinism necessary in #4570 as well?
> Is determinism truly necessary? Note that you're paying a heavy cost in having to sort to get deterministic ordering.

> Yes, this is basically asking for a Table construction from a set of dictionary encoded values and then us returning that Table in a completely different order than asked for.
Okay, that seems like a case to expect deterministic ordering.
How is this being implemented that it _isn't_ deterministic?
> How is this being implemented that it isn't deterministic?
We're essentially passed two Python lists of potentially long arrays.
We iterate over each list, generate an additional array of 0 --> N where N is the length of the codes, run a join which outputs a decoded column and a column of the corresponding generated value, then sort by the generated value column. In practice it could look something like:
Input levels:

| levels |
| - |
| 'a' |
| 'b' |
| 'c' |
| 'd' |

Input codes:

| codes |
| - |
| 2 |
| 0 |
| 2 |
| 1 |
| 0 |

Codes with a generated `order` column:

| codes | order |
| - | - |
| 2 | 0 |
| 0 | 1 |
| 2 | 2 |
| 1 | 3 |
| 0 | 4 |

Levels with an `enumerated_codes` column:

| levels | enumerated_codes |
| - | - |
| 'a' | 0 |
| 'b' | 1 |
| 'c' | 2 |
| 'd' | 3 |

Result of joining `codes` with `enumerated_codes` (the join's output order is arbitrary):

| levels | order |
| - | - |
| 'b' | 3 |
| 'a' | 1 |
| 'c' | 0 |
| 'c' | 2 |
| 'a' | 4 |

Final levels after sorting by `order`:

| levels |
| - |
| 'c' |
| 'a' |
| 'c' |
| 'b' |
| 'a' |
@kkraus14 I assume that a real use case is more complex than your example since your example is just a simple gather. Is there a more complex example that can't just be done with a gather?
> @kkraus14 I assume that a real use case is more complex than your example since your example is just a simple gather. Is there a more complex example that can't just be done with a gather?
I believe so but I may be mistaken. Digging to figure out.
It totally ended up being a gather where we overengineered / overthought the problem. Thanks @jrhemstad and @harrism!
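For the record, the decode step above reduces to a single gather (fancy indexing) of the levels by the codes, which is inherently deterministic: the output order is the order of the codes. A sketch with NumPy:

```python
import numpy as np

levels = np.array(["a", "b", "c", "d"])
codes = np.array([2, 0, 2, 1, 0])

# A gather is just indexing the levels array by the codes; the output
# order matches the codes, so no join, order column, or sort is needed.
decoded = levels[codes]
```

This produces `['c', 'a', 'c', 'b', 'a']` directly, matching the final table in the worked example above.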