Describe the bug
We are seeing some test failures in `test_multiindex_from_product` in PR ( https://github.com/rapidsai/cudf/pull/4567 ), which appear unrelated to that change. It looks like there is some non-determinism in the `cudf.MultiIndex.from_product` constructor. Here's an example failure that demonstrates this:
```
=================================== FAILURES ===================================
_________________________ test_multiindex_from_product _________________________

    def test_multiindex_from_product():
        arrays = [["a", "a", "b", "b"], ["house", "store", "house", "store"]]
        pmi = pd.MultiIndex.from_product(arrays, names=["alpha", "location"])
        gmi = cudf.MultiIndex.from_product(arrays, names=["alpha", "location"])
>       assert_eq(pmi, gmi)
E       AssertionError: MultiIndex level [0] are different
E
E       MultiIndex level [0] values are different (50.0 %)
E       [left]:  Index(['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b',
E                       'b', 'b'],
E                      dtype='object', name='alpha')
E       [right]: Index(['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b',
E                       'a', 'b'],
E                      dtype='object', name='alpha')

cudf/tests/test_multiindex.py:384: AssertionError
```
Steps/Code to reproduce bug
Running `test_multiindex_from_product` a few times appears to be sufficient.
Expected behavior
That cudf.MultiIndex.from_product behaves deterministically.
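For reference, pandas defines `from_product` as the cartesian product of the input iterables in row-major order (the first level varies slowest), so the expected result is fully determined by the inputs. A minimal pure-Python sketch of that ordering:

```python
from itertools import product

arrays = [["a", "b"], ["house", "store"]]

# Row-major cartesian product: the first level varies slowest, which
# matches the deterministic ordering pd.MultiIndex.from_product produces.
tuples = list(product(*arrays))
level0 = [t[0] for t in tuples]  # first level, repeated per block
level1 = [t[1] for t in tuples]  # second level, cycling fastest
```

Here `level0` is `['a', 'a', 'b', 'b']` and `level1` is `['house', 'store', 'house', 'store']`, which is the ordering the test asserts against.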
Environment overview
Environment details
Other details should be in the log. If not, maybe we can ask OPS for more details.
Additional context
NA
FWIW I tried running this locally, but was not able to reproduce it. Though I may not have the most recent cuDF version.
```
In [1]: import pandas as pd

In [2]: import cudf

In [3]: import rmm

In [4]: from cudf.tests.utils import assert_eq

In [5]: rmm.reinitialize(pool_allocator=True,
   ...:                  initial_pool_size=int(1 * 2**20))
Out[5]: 0

In [6]: for i in range(100):
   ...:     arrays = [
   ...:         ["a", "a", "b", "b"],
   ...:         ["house", "store", "house", "store"]
   ...:     ]
   ...:     pmi = pd.MultiIndex.from_product(arrays,
   ...:                                      names=["alpha", "location"])
   ...:     gmi = cudf.MultiIndex.from_product(arrays,
   ...:                                        names=["alpha", "location"])
   ...:     assert_eq(pmi, gmi)

In [7]: cudf.__version__
Out[7]: '0.13.0a+4269.g11903d2'
```
One source of nondeterminism might be the call to join in the MultiIndex constructor:
```python
level = DataFrame(index=codes).join(level)
```
Although this could totally be a red herring.
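As a CPU-side illustration of why that pattern can be order-sensitive: a hash join in general gives no guarantee about output row order, which is why a decode-via-join usually has to carry a generated order column and sort on it afterwards. A sketch of that pattern using pandas (the column names here are hypothetical, not cuDF internals):

```python
import pandas as pd

levels = pd.DataFrame({"level": ["a", "b", "c", "d"]})
codes = [2, 0, 2, 1, 0]

# Keep an explicit "order" column so rows can be restored to input order
# after the join; a (GPU) hash join does not guarantee any output order.
df = pd.DataFrame({"code": codes, "order": range(len(codes))})
joined = df.merge(levels, left_on="code", right_index=True)
decoded = joined.sort_values("order")["level"].tolist()
```

With the trailing sort, `decoded` is `['c', 'a', 'c', 'b', 'a']` regardless of the order the join happens to emit; without it, the result depends on join internals.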
Also something that might be relevant is the get_level_values method that was added somewhat recently:
https://github.com/rapidsai/cudf/blob/branch-0.13/python/cudf/cudf/core/multiindex.py#L540
Although it looks like that isn't used outside of groupby.
The mentioned test seems to pass locally for me with the latest changes + #4567:
```
(cudf) root@dt07:/cudf# pytest python/cudf/cudf/tests/test_multiindex.py::test_multiindex_from_product
============================== test session starts ===============================
platform linux -- Python 3.7.6, pytest-5.3.5, py-1.8.1, pluggy-0.13.0
rootdir: /cudf/python/cudf
plugins: hypothesis-5.5.4
collected 1 item

python/cudf/cudf/tests/test_multiindex.py .                                [100%]

=============================== 1 passed in 1.40s ================================
```
Was it failing on a specific OS?
All details should be in the linked log in the OP. From what I can see, it says ubuntu16.04.
Is determinism truly necessary? Note that you're paying a heavy cost in having to sort to get deterministic ordering.
I don't know. Maybe someone else can comment 🙂
In either case, there was a test for it and it was broken. If we don't want the determinism, maybe we can relax the test. If we do want the determinism, then there is a bug fix needed.
It's worth noting that if we do relax the determinism our results will differ from Pandas. Not sure how important that is to us, but figured it should be highlighted.
> It's worth noting that if we do relax the determinism our results will differ from Pandas. Not sure how important that is to us, but figured it should be highlighted.
Yeah, this has come up before.
> Is determinism truly necessary? Note that you're paying a heavy cost in having to sort to get deterministic ordering.
Yes, this is basically asking for a Table construction from a set of dictionary encoded values and then us returning that Table in a completely different order than asked for.
@kkraus14 I don't understand your explanation. Why is the ordering of the table crucial?
Also, is determinism necessary in #4570 as well?
> Is determinism truly necessary? Note that you're paying a heavy cost in having to sort to get deterministic ordering.

> Yes, this is basically asking for a Table construction from a set of dictionary encoded values and then us returning that Table in a completely different order than asked for.
Okay, that seems like a case to expect deterministic ordering.
How is this being implemented that it _isn't_ deterministic?
> How is this being implemented that it isn't deterministic?
We're essentially passed two Python lists of potentially long arrays.
We iterate over each list, generate an additional array of 0 --> N where N is the length of the codes, run a join which outputs a decoded column and a column of the corresponding generated value, then sort by the generated value column. In practice it could look something like:
Input levels:

| levels |
| - |
| 'a' |
| 'b' |
| 'c' |
| 'd' |

Input codes:

| codes |
| - |
| 2 |
| 0 |
| 2 |
| 1 |
| 0 |

Codes with a generated `order` column:

| codes | order |
| - | - |
| 2 | 0 |
| 0 | 1 |
| 2 | 2 |
| 1 | 3 |
| 0 | 4 |

Levels with an `enumerated_codes` column:

| levels | enumerated_codes |
| - | - |
| 'a' | 0 |
| 'b' | 1 |
| 'c' | 2 |
| 'd' | 3 |

Result of joining `codes` with `enumerated_codes` (the join's output order is arbitrary):

| levels | order |
| - | - |
| 'b' | 3 |
| 'a' | 1 |
| 'c' | 0 |
| 'c' | 2 |
| 'a' | 4 |

Final levels after sorting by `order`:

| levels |
| - |
| 'c' |
| 'a' |
| 'c' |
| 'b' |
| 'a' |
@kkraus14 I assume that a real use case is more complex than your example since your example is just a simple gather. Is there a more complex example that can't just be done with a gather?
> @kkraus14 I assume that a real use case is more complex than your example since your example is just a simple gather. Is there a more complex example that can't just be done with a gather?
I believe so but I may be mistaken. Digging to figure out.
It totally ended up being a gather where we overengineered / overthought the problem. Thanks @jrhemstad and @harrism!
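For the record, the decode step above reduces to a single gather (fancy indexing) of the levels by the codes, which is inherently deterministic: the output order is the order of the codes. A sketch with NumPy:

```python
import numpy as np

levels = np.array(["a", "b", "c", "d"])
codes = np.array([2, 0, 2, 1, 0])

# A gather is just indexing the levels array by the codes; the output
# order matches the codes, so no join, order column, or sort is needed.
decoded = levels[codes]
```

This produces `['c', 'a', 'c', 'b', 'a']` directly, matching the final table in the worked example above.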