Cudf: [BUG] Non-determinism in `cudf.MultiIndex.from_product`

Created on 18 Mar 2020  路  15Comments  路  Source: rapidsai/cudf

Describe the bug

We are seeing some test failures in test_multiindex_from_product in PR ( https://github.com/rapidsai/cudf/pull/4567 ), which appear unrelated to that change. Looks like some non-determinism in the cudf.MultiIndex.from_product constructor. Here's an example failure that demonstrates this:

=================================== FAILURES ===================================
_________________________ test_multiindex_from_product _________________________

    def test_multiindex_from_product():
        arrays = [["a", "a", "b", "b"], ["house", "store", "house", "store"]]
        pmi = pd.MultiIndex.from_product(arrays, names=["alpha", "location"])
        gmi = cudf.MultiIndex.from_product(arrays, names=["alpha", "location"])
>       assert_eq(pmi, gmi)
E       AssertionError: MultiIndex level [0] are different
E       
E       MultiIndex level [0] values are different (50.0 %)
E       [left]:  Index(['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b',
E              'b', 'b'],
E             dtype='object', name='alpha')
E       [right]: Index(['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b',
E              'a', 'b'],
E             dtype='object', name='alpha')

cudf/tests/test_multiindex.py:384: AssertionError

https://gpuci.gpuopenanalytics.com/blue/rest/organizations/jenkins/pipelines/rapidsai/pipelines/gpuci/pipelines/cudf/pipelines/prb/pipelines/cudf-gpu-build/runs/19222/log/?start=0

Steps/Code to reproduce bug

Appears running test_multiindex_from_product a few times is sufficient.

Expected behavior

That cudf.MultiIndex.from_product behaves deterministically.

Environment overview (please complete the following information)

  • Environment location: gpuCI

Environment details

Other details should be in the log. If not, maybe we can ask OPS for more details.

Additional context

NA

bug cuDF (Python)

All 15 comments

FWIW I tried running this locally, but was not able to reproduce it. Though I may not have the most recent cuDF version.

In [1]: import pandas as pd

In [2]: import cudf

In [3]: import rmm

In [4]: from cudf.tests.utils import assert_eq

In [5]: rmm.reinitialize(pool_allocator=True,
   ...:                  initial_pool_size=int(1 * 2**20))
Out[5]: 0

In [6]: for i in range(100):
   ...:     arrays = [
   ...:         ["a", "a", "b", "b"],
   ...:         ["house", "store", "house", "store"]
   ...:     ]
   ...:     pmi = pd.MultiIndex.from_product(arrays,
   ...:                                      names=["alpha", "location"])
   ...:     gmi = cudf.MultiIndex.from_product(arrays,
   ...:                                        names=["alpha", "location"])
   ...:     assert_eq(pmi, gmi)

In [7]: cudf.__version__
Out[7]: '0.13.0a+4269.g11903d2'

One source of nondeterminism might be the call to join in the MultiIndex constructor:

            level = DataFrame(index=codes).join(level)

Although this could totally be a red herring.

Also something that might be relevant is the get_level_values method that was added somewhat recently:

https://github.com/rapidsai/cudf/blob/branch-0.13/python/cudf/cudf/core/multiindex.py#L540

Although it looks like that isn't used outside of groupby.

The mentioned test seem to pass in my local with latest changes + #4567 :

(cudf) root@dt07:/cudf# pytest python/cudf/cudf/tests/test_multiindex.py::test_multiindex_from_product
============================== test session starts ===============================
platform linux -- Python 3.7.6, pytest-5.3.5, py-1.8.1, pluggy-0.13.0
rootdir: /cudf/python/cudf
plugins: hypothesis-5.5.4
collected 1 item                                                                 

python/cudf/cudf/tests/test_multiindex.py .                                [100%]

=============================== 1 passed in 1.40s ================================

Was it failing in a specific OS?

All details should be in the linked log in the OP. From what I can see, it says ubuntu16.04.

Is determinism truly necessary? Note that you're paying a heavy cost in having to sort to get deterministic ordering.

I don't know. Maybe someone else can comment 馃檪

In either the case, there was a test for it and it was broken. If we don't want the determinism, maybe we can relax the test. If we do want the determinism, then there is a bug fix needed.

It's worth noting that if we do relax the determinism our results will differ from Pandas. Not sure how important that is to us, but figured it should be highlighted.

It's worth noting that if we do relax the determinism our results will differ from Pandas. Not sure how important that is to us, but figured it should be highlighted.

Yeah, this has come up before.

https://github.com/rapidsai/cudf/issues/1781

Is determinism truly necessary? Note that you're paying a heavy cost in having to sort to get deterministic ordering.

Yes, this is basically asking for a Table construction from a set of dictionary encoded values and then us returning that Table in a completely different order than asked for.

@kkraus14 I don't understand your explanation. Why is the ordering of the table crucial?

Also, is determinism necessary in #4570 as well?

Is determinism truly necessary? Note that you're paying a heavy cost in having to sort to get deterministic ordering.

Yes, this is basically asking for a Table construction from a set of dictionary encoded values and then us returning that Table in a completely different order than asked for.

Okay, that seems like a case to expect deterministic ordering.

How is this being implemented that it _isn't_ deterministic?

How is this being implemented that it isn't deterministic?

We're essentially passed two Python lists of potentially long arrays:

  1. Codes: A Python list of codes arrays
  2. Levels: A Python list of dictionary arrays

We iterate over each list, generate an additional array of 0 --> N where N is the length of the codes, run a join which outputs a decoded column and a column of the corresponding generated value, then sort by the generated value column. In practice it could look something like:

Inputs

| levels |
| - |
| 'a' |
| 'b' |
| 'c' |
| 'd' |

| codes |
| - |
| 2 |
| 0 |
| 2 |
| 1 |
| 0 |

Intermediate Tables

lhs

| codes | order |
| - | - |
| 2 | 0 |
| 0 | 1 |
| 2 | 2 |
| 1 | 3 |
| 0 | 4 |

rhs

| levels | enumerated_codes |
| - | - |
| 'a' | 0 |
| 'b' | 1 |
| 'c' | 2 |
| 'd' | 3 |

Joined on codes and enumerated_codes

| levels | order |
| - | - |
| 'b' | 3 |
| 'a' | 1 |
| 'c' | 0 |
| 'c' | 2 |
| 'a' | 4 |

Sorted by order and then drop it to get final output

| levels |
| - |
| 'c' |
| 'a' |
| 'c' |
| 'b' |
| 'a' |

@kkraus14 I assume that a real use case is more complex than your example since your example is just a simple gather. Is there a more complex example that can't just be done with a gather?

@kkraus14 I assume that a real use case is more complex than your example since your example is just a simple gather. Is there a more complex example that can't just be done with a gather?

I believe so but I may be mistaken. Digging to figure out.

Totally ended up being a gather where we totally overengineered / overthought the problem. Thanks @jrhemstad and @harrism!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ericmjl picture ericmjl  路  3Comments

jrhemstad picture jrhemstad  路  3Comments

jmkim picture jmkim  路  3Comments

henningpeters picture henningpeters  路  3Comments

randerzander picture randerzander  路  3Comments