Bug Description
cudf's `Groupby.as_df()` does not appear to return the correct beginning offsets for each group.
Code to reproduce bug
import cudf
import pandas as pd
import numpy as np
def sample_gdf(n_rows=1_000, n_keys_id_1=1000, n_keys_id_2=100, n_keys_id_3=10):
    '''
    Returns a sampled dataframe
    '''
    col_1 = np.random.randint(0, n_keys_id_1, size=n_rows)
    col_2 = np.random.randint(0, n_keys_id_2, size=n_rows)
    col_3 = np.random.randint(0, n_keys_id_3, size=n_rows)
    df = pd.DataFrame({'id_1': col_1, 'id_2': col_2, 'id_3': col_3})
    return cudf.from_pandas(df)
gdf = sample_gdf(1_000_000)
pd_df = gdf.to_pandas()
grouped_df, sr_segs = gdf.groupby(by=['id_1', 'id_2', 'id_3'], method='cudf', as_index=False).as_df()
# cuDF number of groups
print("Cudf number of groups {}".format(len(sr_segs)))
# Grouping using pandas
panda_groups = pd_df.groupby(by=['id_1', 'id_2', 'id_3'])
print("Pandas number of groups {}".format(len(panda_groups)))
# Checking all distinct values using pandas
print("Number of distinct values in df {}".format(len(pd_df.drop_duplicates())))
Current Output
Cudf number of groups 449035
Pandas number of groups 632450
Number of distinct values in df 632450
Expected behavior
The correct number of groups is the one given by pandas (632450), since every distinct key combination should form exactly one group.
Additional context
This was working in earlier versions. I was using the groupby to get distinct values from a cudf dataframe, but the groupby somehow started giving incorrect results.
Please let me know if there is an alternative way to get distinct values.
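For context, a minimal pandas-only sketch of the invariant the bug violates: the number of groupby groups must equal the number of distinct key rows. (The column names and sizes below are illustrative, not from the repro above; `drop_duplicates()` is the pandas workaround used in the repro, and a cudf equivalent may or may not exist in a given cudf version.)

```python
import numpy as np
import pandas as pd

# Small frame with many duplicate key rows (illustrative sizes)
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'id_1': rng.randint(0, 10, size=1000),
    'id_2': rng.randint(0, 5, size=1000),
})

n_groups = len(df.groupby(['id_1', 'id_2']))
n_distinct = len(df.drop_duplicates())

# Every distinct (id_1, id_2) combination forms exactly one group,
# so these two counts must always agree
assert n_groups == n_distinct
```

The cudf output above breaks this invariant: 449035 groups versus 632450 distinct rows.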
Environment details (please complete the following information):
Output of the cudf/print_env.sh script:
***git***
Not inside a git repository
***OS Information***
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Linux 47da0a2e98d7 4.4.0-134-generic #160-Ubuntu SMP Wed Aug 15 14:58:00 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
***GPU Information***
Tue Mar 12 07:16:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... On | 00000000:06:00.0 Off | 0 |
| N/A 34C P0 42W / 300W | 698MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... On | 00000000:07:00.0 Off | 0 |
| N/A 33C P0 34W / 300W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-SXM2... On | 00000000:0A:00.0 Off | 0 |
| N/A 30C P0 32W / 300W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-SXM2... On | 00000000:0B:00.0 Off | 0 |
| N/A 31C P0 31W / 300W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla P100-SXM2... On | 00000000:85:00.0 Off | 0 |
| N/A 32C P0 33W / 300W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla P100-SXM2... On | 00000000:86:00.0 Off | 0 |
| N/A 30C P0 34W / 300W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla P100-SXM2... On | 00000000:89:00.0 Off | 0 |
| N/A 31C P0 35W / 300W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla P100-SXM2... On | 00000000:8A:00.0 Off | 0 |
| N/A 31C P0 32W / 300W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
***CPU***
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2113.460
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
BogoMIPS: 4391.80
Virtualization: VT-x
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 51200K
NUMA node0 CPU(s): 0-19,40-59
NUMA node1 CPU(s): 20-39,60-79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d
***CMake***
/conda/envs/rapids/bin/cmake
cmake version 3.12.4
CMake suite maintained and supported by Kitware (kitware.com/cmake).
***g++***
/usr/bin/g++
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
***nvcc***
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148
***Python***
/conda/envs/rapids/bin/python
Python 3.6.6
***Environment Variables***
PATH : /conda/envs/rapids/bin:/conda/condabin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/conda/bin
LD_LIBRARY_PATH :
NUMBAPRO_NVVM : /usr/local/cuda/nvvm/lib64/libnvvm.so
NUMBAPRO_LIBDEVICE : /usr/local/cuda/nvvm/libdevice
CONDA_PREFIX : /conda/envs/rapids
PYTHON_PATH :
***conda packages***
/conda/condabin/conda
# packages in environment at /conda/envs/rapids:
#
# Name Version Build Channel
arrow-cpp 0.12.1 py36h0e61e49_0 conda-forge
atomicwrites 1.3.0 py_0 conda-forge
attrs 19.1.0 py_0 conda-forge
backcall 0.1.0 py_0 conda-forge
blas 1.0 mkl
bleach 3.1.0 py_0 conda-forge
bokeh 1.0.4 py36_1000 conda-forge
boost-cpp 1.68.0 h11c811c_1000 conda-forge
bzip2 1.0.6 h14c3975_1002 conda-forge
ca-certificates 2018.11.29 ha4d7672_0 conda-forge
certifi 2018.11.29 py36_1000 conda-forge
cffi 1.11.5 py36h9745a5d_1001 conda-forge
click 7.0 pypi_0 pypi
cloudpickle 0.8.0 py_0 conda-forge
cmake 3.12.4 h8d4ced6_1000 conda-forge
cuda92 1.0 0 pytorch
cudf 0+unknown pypi_0 pypi
cuml 0+unknown pypi_0 pypi
curl 7.63.0 h646f8bb_1000 conda-forge
cython 0.29.6 py36hf484d3e_0 conda-forge
cytoolz 0.9.0.1 py36h14c3975_1001 conda-forge
dask 1.1.1 py_0 conda-forge
dask-core 1.1.1 py_0 conda-forge
dask-cudf 0+untagged.1.ge3d3350 pypi_0 pypi
dask-xgboost 0.1.5 pypi_0 pypi
decorator 4.3.2 py_0 conda-forge
defusedxml 0.5.0 py_1 conda-forge
distributed 1.25.3 py36_0 conda-forge
entrypoints 0.3 py36_1000 conda-forge
expat 2.2.5 hf484d3e_1002 conda-forge
faiss-gpu 1.5.0 py36_cuda9.2_1 [cuda92] pytorch
freetype 2.9.1 h94bbf69_1005 conda-forge
heapdict 1.0.0 py36_1000 conda-forge
icu 58.2 hf484d3e_1000 conda-forge
intel-openmp 2019.1 144
ipykernel 5.1.0 py36h24bf2e0_1002 conda-forge
ipython 7.3.0 py36h24bf2e0_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jedi 0.13.3 py36_0 conda-forge
jinja2 2.10 py_1 conda-forge
jpeg 9c h14c3975_1001 conda-forge
jsonschema 3.0.1 py36_0 conda-forge
jupyter_client 5.2.4 py_3 conda-forge
jupyter_core 4.4.0 py_0 conda-forge
jupyterlab 0.35.4 py36_0 conda-forge
jupyterlab_server 0.2.0 py_0 conda-forge
krb5 1.16.2 hc83ff2d_1000 conda-forge
libcurl 7.63.0 h01ee5af_1000 conda-forge
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 8.2.0 hdf63c60_1
libgdf-cffi 0.6.0 pypi_0 pypi
libgfortran-ng 7.2.0 hdf63c60_3 conda-forge
libpng 1.6.36 h84994c4_1000 conda-forge
libprotobuf 3.6.1 hdbcaa40_1001 conda-forge
librmm-cffi 0.5.0 pypi_0 pypi
libsodium 1.0.16 h14c3975_1001 conda-forge
libssh2 1.8.0 h1ad7b7a_1003 conda-forge
libstdcxx-ng 8.2.0 hdf63c60_1
libtiff 4.0.10 h648cc4a_1001 conda-forge
libuv 1.26.0 h14c3975_0 conda-forge
llvmlite 0.27.0 py36hf484d3e_0 numba
locket 0.2.0 py_2 conda-forge
markupsafe 1.1.1 py36h14c3975_0 conda-forge
mistune 0.8.4 py36h14c3975_1000 conda-forge
mkl 2019.1 144
mkl_fft 1.0.10 py36h14c3975_1 conda-forge
mkl_random 1.0.2 py36h637b7d7_2 conda-forge
more-itertools 4.3.0 py36_1000 conda-forge
msgpack-python 0.6.1 py36h6bb024c_0 conda-forge
nbconvert 5.4.1 py_2 conda-forge
nbformat 4.4.0 py_1 conda-forge
ncurses 6.1 he6710b0_1
notebook 5.7.5 py36_0 conda-forge
numba 0.42.0 np115py36hf484d3e_0 numba
numpy 1.15.4 py36h7e9f1db_0
numpy-base 1.15.4 py36hde5b4d6_0
nvstrings 0.2.0 cuda9.2_py36_0 nvidia/label/cuda9.2
olefile 0.46 py_0 conda-forge
openssl 1.0.2r h14c3975_0 conda-forge
packaging 19.0 py_0 conda-forge
pandas 0.23.4 py36h637b7d7_1000 conda-forge
pandoc 2.6 1 conda-forge
pandocfilters 1.4.2 py_1 conda-forge
parquet-cpp 1.5.1 4 conda-forge
parso 0.3.4 py_0 conda-forge
partd 0.3.9 py_0 conda-forge
pexpect 4.6.0 py36_1000 conda-forge
pickleshare 0.7.5 py36_1000 conda-forge
pillow 5.3.0 py36h00a061d_1000 conda-forge
pip 19.0.3 py36_0
pluggy 0.9.0 py_0 conda-forge
prometheus_client 0.6.0 py_0 conda-forge
prompt_toolkit 2.0.9 py_0 conda-forge
psutil 5.5.1 py36h14c3975_0 conda-forge
ptyprocess 0.6.0 py36_1000 conda-forge
py 1.8.0 py_0 conda-forge
pyarrow 0.12.1 py36hbbcf98d_0 conda-forge
pycparser 2.19 py_0 conda-forge
pygments 2.3.1 py_0 conda-forge
pyparsing 2.3.1 py_0 conda-forge
pyrsistent 0.14.11 py36h14c3975_0 conda-forge
pytest 4.3.0 py36_0 conda-forge
python 3.6.6 hd21baee_1003 conda-forge
python-dateutil 2.8.0 py_0 conda-forge
pytz 2018.9 py_0 conda-forge
pyyaml 3.13 py36h14c3975_1001 conda-forge
pyzmq 18.0.1 py36h0e1adb2_0 conda-forge
readline 7.0 h7b6447c_5
rhash 1.3.6 h14c3975_1001 conda-forge
scikit-learn 0.20.2 py36hd81dba3_0
scipy 1.2.1 py36h7c811a0_0
send2trash 1.5.0 py_0 conda-forge
setuptools 40.8.0 py36_0
six 1.12.0 py36_1000 conda-forge
sortedcontainers 2.1.0 py_0 conda-forge
sqlite 3.26.0 h7b6447c_0
tblib 1.3.2 pypi_0 pypi
terminado 0.8.1 py36_1001 conda-forge
testpath 0.4.2 py36_1000 conda-forge
thrift-cpp 0.12.0 h23e226f_1001 conda-forge
tk 8.6.8 hbc83047_0
toolz 0.9.0 pypi_0 pypi
tornado 6.0.1 py36h14c3975_0 conda-forge
traitlets 4.3.2 py36_1000 conda-forge
wcwidth 0.1.7 py_1 conda-forge
webencodings 0.5.1 py_1 conda-forge
wheel 0.33.1 py36_0
xgboost 0.80 pypi_0 pypi
xz 5.2.4 h14c3975_4
yaml 0.1.7 h14c3975_1001 conda-forge
zeromq 4.2.5 hf484d3e_1006 conda-forge
zict 0.1.3 py_0 conda-forge
zlib 1.2.11 h7b6447c_3
@VibhuJawa can you create a simpler repro? Is it cudf's native groupby that is giving incorrect results, or dask-cudf groupbys?
If dask-cudf, I suggest moving this issue over to the dask-cudf repo.
Cleaned up the code to make it a simpler repro without any dask parts.
The `method='cudf'` code path will be going away in the near future to fix this and to streamline groupbys in general.
Dependent on #982
What is the libcudf need from this issue?
@harrism this was before the gdf_groupby_without_aggregation function was implemented and that was the need.
@shwina @devavret is this an issue that is easy to fix with the new groupby?
as_df() doesn't seem to be a Pandas groupby function. We currently support iterating over the groups in the same way as Pandas, which is what it looks like as_df() was used for.
@VibhuJawa could you confirm?
Yup, as_df was our legacy API, used for getting all the groups and their boundaries, IIRC.
I think the .groups API might be a better equivalent to the older as_df we used to support; it is still a pending feature (according to the tracker here: https://github.com/rapidsai/cudf/wiki/cuDF-groupby).
That said, I think we can close this issue as as_df no longer exists.
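For reference, a minimal sketch of what pandas' `.groups` attribute returns (this is purely the pandas behavior being pointed to as the equivalent; the data here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [10, 20, 30]})

# .groups maps each group key to the row labels belonging to that group,
# which recovers the group-membership information as_df() used to provide
groups = df.groupby('a').groups

assert list(groups[1]) == [0, 1]
assert list(groups[2]) == [2]
```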
We don't have a .groups, and likely won't be able to support it in the short term. But we do have iteration over the grouped dataframe via __iter__, i.e., something like:
for name, group in df.groupby('a'):
    print(name)
    print(group)
I think this is closer to what as_df() was doing earlier.
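Expanding that snippet into a self-contained pandas sketch (column names and data are made up; cudf's `__iter__` is described above as matching this pandas behavior):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [10, 20, 30]})

names = []
sizes = []
# Iterating a groupby yields one (name, sub-dataframe) pair per group
for name, group in df.groupby('a'):
    names.append(name)
    sizes.append(len(group))

assert names == [1, 2]   # one entry per distinct key
assert sizes == [2, 1]   # rows belonging to each group
```

Collecting the names this way also gives the distinct key values, which covers the original use case of extracting distinct values via groupby.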
Thanks for the example here 👍