Cudf: [BUG] GROUPBY.as_df() not giving correct results

Created on 12 Mar 2019  ·  11 Comments  ·  Source: rapidsai/cudf

Bug Description
cuDF's GROUPBY.as_df() does not appear to return the correct beginning offsets for each group.

Code to reproduce bug


import cudf
import pandas as pd
import numpy as np

def sample_gdf(n_rows=1_000, n_keys_id_1=1000, n_keys_id_2=100, n_keys_id_3=10):
    '''
    Return a cuDF DataFrame with three random integer key columns.
    '''

    col_1 = np.random.randint(0, n_keys_id_1, size=n_rows)
    col_2 = np.random.randint(0, n_keys_id_2, size=n_rows)
    col_3 = np.random.randint(0, n_keys_id_3, size=n_rows)
    df = pd.DataFrame({'id_1':col_1,'id_2':col_2,'id_3':col_3})
    return cudf.from_pandas(df)

gdf = sample_gdf(1_000_000)
pd_df = gdf.to_pandas()

grouped_df, sr_segs = gdf.groupby(by=['id_1', 'id_2', 'id_3'], method='cudf', as_index=False).as_df()

# Number of groups according to cuDF (one segment offset per group)
print("Cudf number of groups {}".format(len(sr_segs)))

# Number of groups according to pandas
panda_groups = pd_df.groupby(by=['id_1', 'id_2', 'id_3'])
print("Pandas number of groups {}".format(len(panda_groups)))

# Cross-check: count the distinct rows using pandas
print("Number of distict values in df {}".format(len(pd_df.drop_duplicates())))

Current Output

Cudf number of groups 449035
Pandas number of groups 632450
Number of distict values in df 632450

Expected behavior

The correct number of groups is the one given by pandas
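
To make the expected invariant concrete: if as_df() returns the rows sorted by key together with one start offset per group (as the description above suggests), then the number of offsets has to equal the number of distinct key tuples. Below is a minimal pure-Python sketch of that check. It does not use cuDF at all, and `segment_offsets` is a hypothetical helper name, not a cuDF function.

```python
import random

def segment_offsets(sorted_rows):
    """Return the start index of each run of equal rows in an already sorted list."""
    offsets = []
    previous = object()  # sentinel that never compares equal to a real row
    for i, row in enumerate(sorted_rows):
        if row != previous:
            offsets.append(i)
            previous = row
    return offsets

random.seed(0)
rows = [
    (random.randrange(10), random.randrange(5), random.randrange(3))
    for _ in range(1000)
]
sorted_rows = sorted(rows)
offsets = segment_offsets(sorted_rows)

# One offset per group, so the two counts must agree.
n_groups = len(offsets)
n_distinct = len(set(rows))
print(n_groups == n_distinct)
```

In the report, the cuDF count (449035) is lower than the distinct-row count (632450), which is exactly the kind of mismatch this check would flag.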

Additional context
This was working in earlier versions. I was using the groupby to get distinct values from a cudf frame, but the groupby somehow started giving incorrect results.

Please let me know if there is an alternative way to get distinct values

Environment details (please complete the following information):

  • Environment location: Docker
  • Method of cuDF install: Docker
    • docker pull: docker pull rapidsai/rapidsai-nightly:latest
    • docker run: docker run --runtime=nvidia --rm -it -p 9888:9888 -p 9787:9787 -p 9786:9786 rapidsai/rapidsai-nightly:latest
  • Please run and attach the output of the cudf/print_env.sh script to gather relevant environment details
***git***
Not inside a git repository

***OS Information***
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Linux 47da0a2e98d7 4.4.0-134-generic #160-Ubuntu SMP Wed Aug 15 14:58:00 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

***GPU Information***
Tue Mar 12 07:16:35 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   34C    P0    42W / 300W |    698MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0    34W / 300W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   30C    P0    32W / 300W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   31C    P0    31W / 300W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   32C    P0    33W / 300W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   30C    P0    34W / 300W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   31C    P0    35W / 300W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0    32W / 300W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

***CPU***
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               2113.460
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              4391.80
Virtualization:        VT-x
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              51200K
NUMA node0 CPU(s):     0-19,40-59
NUMA node1 CPU(s):     20-39,60-79
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d

***CMake***
/conda/envs/rapids/bin/cmake
cmake version 3.12.4

CMake suite maintained and supported by Kitware (kitware.com/cmake).

***g++***
/usr/bin/g++
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


***nvcc***
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148

***Python***
/conda/envs/rapids/bin/python
Python 3.6.6

***Environment Variables***
PATH                            : /conda/envs/rapids/bin:/conda/condabin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/conda/bin
LD_LIBRARY_PATH                 : 
NUMBAPRO_NVVM                   : /usr/local/cuda/nvvm/lib64/libnvvm.so
NUMBAPRO_LIBDEVICE              : /usr/local/cuda/nvvm/libdevice
CONDA_PREFIX                    : /conda/envs/rapids
PYTHON_PATH                     : 

***conda packages***
/conda/condabin/conda
# packages in environment at /conda/envs/rapids:
#
# Name                    Version                   Build  Channel
arrow-cpp                 0.12.1           py36h0e61e49_0    conda-forge
atomicwrites              1.3.0                      py_0    conda-forge
attrs                     19.1.0                     py_0    conda-forge
backcall                  0.1.0                      py_0    conda-forge
blas                      1.0                         mkl  
bleach                    3.1.0                      py_0    conda-forge
bokeh                     1.0.4                 py36_1000    conda-forge
boost-cpp                 1.68.0            h11c811c_1000    conda-forge
bzip2                     1.0.6             h14c3975_1002    conda-forge
ca-certificates           2018.11.29           ha4d7672_0    conda-forge
certifi                   2018.11.29            py36_1000    conda-forge
cffi                      1.11.5          py36h9745a5d_1001    conda-forge
click                     7.0                      pypi_0    pypi
cloudpickle               0.8.0                      py_0    conda-forge
cmake                     3.12.4            h8d4ced6_1000    conda-forge
cuda92                    1.0                           0    pytorch
cudf                      0+unknown                pypi_0    pypi
cuml                      0+unknown                pypi_0    pypi
curl                      7.63.0            h646f8bb_1000    conda-forge
cython                    0.29.6           py36hf484d3e_0    conda-forge
cytoolz                   0.9.0.1         py36h14c3975_1001    conda-forge
dask                      1.1.1                      py_0    conda-forge
dask-core                 1.1.1                      py_0    conda-forge
dask-cudf                 0+untagged.1.ge3d3350          pypi_0    pypi
dask-xgboost              0.1.5                    pypi_0    pypi
decorator                 4.3.2                      py_0    conda-forge
defusedxml                0.5.0                      py_1    conda-forge
distributed               1.25.3                   py36_0    conda-forge
entrypoints               0.3                   py36_1000    conda-forge
expat                     2.2.5             hf484d3e_1002    conda-forge
faiss-gpu                 1.5.0            py36_cuda9.2_1  [cuda92]  pytorch
freetype                  2.9.1             h94bbf69_1005    conda-forge
heapdict                  1.0.0                 py36_1000    conda-forge
icu                       58.2              hf484d3e_1000    conda-forge
intel-openmp              2019.1                      144  
ipykernel                 5.1.0           py36h24bf2e0_1002    conda-forge
ipython                   7.3.0            py36h24bf2e0_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jedi                      0.13.3                   py36_0    conda-forge
jinja2                    2.10                       py_1    conda-forge
jpeg                      9c                h14c3975_1001    conda-forge
jsonschema                3.0.1                    py36_0    conda-forge
jupyter_client            5.2.4                      py_3    conda-forge
jupyter_core              4.4.0                      py_0    conda-forge
jupyterlab                0.35.4                   py36_0    conda-forge
jupyterlab_server         0.2.0                      py_0    conda-forge
krb5                      1.16.2            hc83ff2d_1000    conda-forge
libcurl                   7.63.0            h01ee5af_1000    conda-forge
libedit                   3.1.20181209         hc058e9b_0  
libffi                    3.2.1                hd88cf55_4  
libgcc-ng                 8.2.0                hdf63c60_1  
libgdf-cffi               0.6.0                    pypi_0    pypi
libgfortran-ng            7.2.0                hdf63c60_3    conda-forge
libpng                    1.6.36            h84994c4_1000    conda-forge
libprotobuf               3.6.1             hdbcaa40_1001    conda-forge
librmm-cffi               0.5.0                    pypi_0    pypi
libsodium                 1.0.16            h14c3975_1001    conda-forge
libssh2                   1.8.0             h1ad7b7a_1003    conda-forge
libstdcxx-ng              8.2.0                hdf63c60_1  
libtiff                   4.0.10            h648cc4a_1001    conda-forge
libuv                     1.26.0               h14c3975_0    conda-forge
llvmlite                  0.27.0           py36hf484d3e_0    numba
locket                    0.2.0                      py_2    conda-forge
markupsafe                1.1.1            py36h14c3975_0    conda-forge
mistune                   0.8.4           py36h14c3975_1000    conda-forge
mkl                       2019.1                      144  
mkl_fft                   1.0.10           py36h14c3975_1    conda-forge
mkl_random                1.0.2            py36h637b7d7_2    conda-forge
more-itertools            4.3.0                 py36_1000    conda-forge
msgpack-python            0.6.1            py36h6bb024c_0    conda-forge
nbconvert                 5.4.1                      py_2    conda-forge
nbformat                  4.4.0                      py_1    conda-forge
ncurses                   6.1                  he6710b0_1  
notebook                  5.7.5                    py36_0    conda-forge
numba                     0.42.0          np115py36hf484d3e_0    numba
numpy                     1.15.4           py36h7e9f1db_0  
numpy-base                1.15.4           py36hde5b4d6_0  
nvstrings                 0.2.0            cuda9.2_py36_0    nvidia/label/cuda9.2
olefile                   0.46                       py_0    conda-forge
openssl                   1.0.2r               h14c3975_0    conda-forge
packaging                 19.0                       py_0    conda-forge
pandas                    0.23.4          py36h637b7d7_1000    conda-forge
pandoc                    2.6                           1    conda-forge
pandocfilters             1.4.2                      py_1    conda-forge
parquet-cpp               1.5.1                         4    conda-forge
parso                     0.3.4                      py_0    conda-forge
partd                     0.3.9                      py_0    conda-forge
pexpect                   4.6.0                 py36_1000    conda-forge
pickleshare               0.7.5                 py36_1000    conda-forge
pillow                    5.3.0           py36h00a061d_1000    conda-forge
pip                       19.0.3                   py36_0  
pluggy                    0.9.0                      py_0    conda-forge
prometheus_client         0.6.0                      py_0    conda-forge
prompt_toolkit            2.0.9                      py_0    conda-forge
psutil                    5.5.1            py36h14c3975_0    conda-forge
ptyprocess                0.6.0                 py36_1000    conda-forge
py                        1.8.0                      py_0    conda-forge
pyarrow                   0.12.1           py36hbbcf98d_0    conda-forge
pycparser                 2.19                       py_0    conda-forge
pygments                  2.3.1                      py_0    conda-forge
pyparsing                 2.3.1                      py_0    conda-forge
pyrsistent                0.14.11          py36h14c3975_0    conda-forge
pytest                    4.3.0                    py36_0    conda-forge
python                    3.6.6             hd21baee_1003    conda-forge
python-dateutil           2.8.0                      py_0    conda-forge
pytz                      2018.9                     py_0    conda-forge
pyyaml                    3.13            py36h14c3975_1001    conda-forge
pyzmq                     18.0.1           py36h0e1adb2_0    conda-forge
readline                  7.0                  h7b6447c_5  
rhash                     1.3.6             h14c3975_1001    conda-forge
scikit-learn              0.20.2           py36hd81dba3_0  
scipy                     1.2.1            py36h7c811a0_0  
send2trash                1.5.0                      py_0    conda-forge
setuptools                40.8.0                   py36_0  
six                       1.12.0                py36_1000    conda-forge
sortedcontainers          2.1.0                      py_0    conda-forge
sqlite                    3.26.0               h7b6447c_0  
tblib                     1.3.2                    pypi_0    pypi
terminado                 0.8.1                 py36_1001    conda-forge
testpath                  0.4.2                 py36_1000    conda-forge
thrift-cpp                0.12.0            h23e226f_1001    conda-forge
tk                        8.6.8                hbc83047_0  
toolz                     0.9.0                    pypi_0    pypi
tornado                   6.0.1            py36h14c3975_0    conda-forge
traitlets                 4.3.2                 py36_1000    conda-forge
wcwidth                   0.1.7                      py_1    conda-forge
webencodings              0.5.1                      py_1    conda-forge
wheel                     0.33.1                   py36_0  
xgboost                   0.80                     pypi_0    pypi
xz                        5.2.4                h14c3975_4  
yaml                      0.1.7             h14c3975_1001    conda-forge
zeromq                    4.2.5             hf484d3e_1006    conda-forge
zict                      0.1.3                      py_0    conda-forge
zlib                      1.2.11               h7b6447c_3  


bug cuDF (Python) libcudf

All 11 comments

@VibhuJawa can you create a simpler repro? Is it cudf's native groupby that is giving incorrect results, or dask-cudf groupbys?

If dask-cudf, I suggest moving this issue over to the dask-cudf repo.

Cleaned up the code to make it a simpler repro without any dask parts.

The cudf method will be going away in the near future to fix this and streamline groupbys in general.

Dependent on #982

What is the libcudf need from this issue?

@harrism this was before the gdf_groupby_without_aggregation function was implemented and that was the need.

@shwina @devavret is this an issue that is easy to fix with the new groupby?

as_df() doesn't seem to be a Pandas groupby function. We currently support iterating over the groups in the same way as Pandas, which is what it looks like as_df() was used for.

@VibhuJawa could you confirm?

Yup, as_df was our legacy API, which was used for getting all the groups and their boundaries, IIRC.

I think the .groups API might be a better equivalent to the older as_df we used to support, but it is still a pending feature (according to the tracker here: https://github.com/rapidsai/cudf/wiki/cuDF-groupby ).

That said, I think we can close this issue, as as_df no longer exists.

We don't have a .groups, and likely won't be able to support it in the short term. But we do have iteration over the grouped dataframe via __iter__, i.e., something like:

for name, group in df.groupby('a'):
    print(name)
    print(group)

I think this is closer to what as_df() was doing earlier.
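
For anyone who relied on the old return value, the iteration pattern above can be used to rebuild something similar: the rows ordered by group plus the start offset of each group. This is a rough sketch written against pandas so that it runs without a GPU. The assumption (taken from the reply above, not verified here) is that cuDF's groupby iteration behaves the same way. `groups_with_offsets` is a hypothetical helper name, not part of either library.

```python
import pandas as pd

def groups_with_offsets(df, keys):
    """Concatenate the groups in iteration order and record where each one starts."""
    parts, offsets, start = [], [], 0
    for _, group in df.groupby(keys):
        offsets.append(start)
        parts.append(group)
        start += len(group)
    return pd.concat(parts, ignore_index=True), offsets

df = pd.DataFrame({"id_1": [1, 0, 1, 0, 2], "id_2": [5, 5, 5, 6, 5]})
grouped, offsets = groups_with_offsets(df, ["id_1", "id_2"])

# The number of offsets should match the number of distinct key rows.
n_distinct = len(df.drop_duplicates())
print(len(offsets), n_distinct)
```

Looping over groups in Python will be slow for hundreds of thousands of groups, so this is only meant to show the shape of the result, not a replacement with comparable performance.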

Thanks for the example here 👍
