Cudf: [BUG] Slow performance with `groupby.count()`

Created on 22 Jun 2020 · 14 comments · Source: rapidsai/cudf

I am getting roughly 10 times worse performance with cudf compared to pandas for the following code segment. What could be the issue?

code:

import numpy as np
import pandas as pd
import cudf as cd
import cupy as cp
import time

pdf = pd.read_csv('HEPARTWO10k.csv')
pdf.apply(lambda x: pd.factorize(x)[0])

cdf = cd.from_pandas(pdf)
res = cdf.groupby(['alcoholism']).count()
res1 = pdf.groupby(['alcoholism']).count()

start_time = time.time()
res = cdf.groupby(['alcoholism','vh_amn', 'hepatotoxic', 'THepatitis', 'ChHepatitis']).count()
print("--- %s seconds ---" % (time.time() - start_time))

start_time = time.time()
res1 = pdf.groupby(['alcoholism','vh_amn', 'hepatotoxic', 'THepatitis', 'ChHepatitis']).count()
print("--- %s seconds ---" % (time.time() - start_time))

Output:

--- 0.3670494556427002 seconds ---
--- 0.0398714542388916 seconds ---

As can be seen, pandas shows almost 10 times better groupby performance. What could be going wrong? What am I missing?

Output of nvidia-smi:
$ nvidia-smi
Mon Jun 22 09:38:06 2020

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   28C    P0    52W / 300W |  18775MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   28C    P0    38W / 300W |     10MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   26C    P0    38W / 300W |     10MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   30C    P0    51W / 300W |  31289MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Labels: bug, cuDF (Python)

All 14 comments

@arghyakusumdas6163 how many rows are in your pdf / cdf dataframe? What are the dtypes of the columns? What version of cudf are you using?

Thanks for your prompt response.
The CSV dataset can be downloaded from here (all columns are categorical): http://www.ccd.pitt.edu/wiki/images/HEPARTWO10k.csv

I converted it to numeric values using pandas' factorize method and am doing the groupby on that. Here are the details of the converted table:

>>> pdf.shape
(10000, 70)
>>> pdf.dtypes
alcoholism             object
vh_amn                 object
hepatotoxic            object
THepatitis             object
hospital               object
surgery                object
gallstones             object
choledocholithotomy    object
injections             object
transfusion            object
ChHepatitis            object
sex                    object
age                    object
PBC                    object
fibrosis               object
diabetes               object
obesity                object
Steatosis              object
Cirrhosis              object
Hyperbilirubinemia     object
triglycerides          object
RHepatitis             object
fatigue                object
bilirubin              object
itching                object
upper_pain             object
fat                    object
pain_ruq               object
pressure_ruq           object
phosphatase            object
                        ...  
flatulence             object
alcohol                object
encephalopathy         object
urea                   object
ascites                object
hepatomegaly           object
hepatalgia             object
density                object
ESR                    object
alt                    object
ast                    object
amylase                object
ggtp                   object
cholesterol            object
hbsag                  object
hbsag_anti             object
anorexia               object
nausea                 object
spleen                 object
consciousness          object
spiders                object
jaundice               object
albumin                object
edge                   object
irregular_liver        object
hbc_anti               object
hcv_anti               object
palms                  object
hbeag                  object
carcinoma              object
Length: 70, dtype: object

Do I need to convert everything to int/float? Is that the problem?
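For context, pd.factorize maps each column's categories to integer codes; note that DataFrame.apply returns a new frame rather than modifying in place, so the result must be assigned back (which the snippet above does not do). A minimal sketch on made-up toy data:

```python
import pandas as pd

# Toy frame with object (string) columns, mimicking the categorical CSV.
pdf = pd.DataFrame({"alcoholism": ["absent", "present", "absent"],
                    "sex": ["male", "female", "male"]})

# factorize returns (codes, uniques); taking [0] keeps the integer codes.
# apply returns a new DataFrame, so assign the result back.
pdf = pdf.apply(lambda x: pd.factorize(x)[0])
print(pdf.dtypes.unique())  # every column is now an integer dtype
```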

I changed the dataframe to int64. Pandas performance improved by another order of magnitude, but cudf still shows a similar time:

New Code (added to_numeric):

import numpy as np
import pandas as pd
import cudf as cd
import cupy as cp
import time

pdf = pd.read_csv('HEPARTWO10k.csv')
pdf = pdf.apply(lambda x: pd.factorize(x)[0])
pdf = pdf.apply(pd.to_numeric)

cdf = cd.from_pandas(pdf)

res = cdf.groupby(['alcoholism']).count()
res1 = pdf.groupby(['alcoholism']).count()

start_time = time.time()
res = cdf.groupby(['alcoholism','vh_amn', 'hepatotoxic', 'THepatitis', 'ChHepatitis']).count()
print("--- %s seconds ---" % (time.time() - start_time))

start_time = time.time()
res1 = pdf.groupby(['alcoholism','vh_amn', 'hepatotoxic', 'THepatitis', 'ChHepatitis']).count()
print("--- %s seconds ---" % (time.time() - start_time))

And here is the new dtypes:

>>> cdf.dtypes
alcoholism             int64
vh_amn                 int64
hepatotoxic            int64
THepatitis             int64
hospital               int64
surgery                int64
gallstones             int64
choledocholithotomy    int64
injections             int64
transfusion            int64
ChHepatitis            int64
sex                    int64
age                    int64
PBC                    int64
fibrosis               int64
diabetes               int64
obesity                int64
Steatosis              int64
Cirrhosis              int64
Hyperbilirubinemia     int64
triglycerides          int64
RHepatitis             int64
fatigue                int64
bilirubin              int64
itching                int64
upper_pain             int64
fat                    int64
pain_ruq               int64
pressure_ruq           int64
phosphatase            int64
                       ...  
flatulence             int64
alcohol                int64
encephalopathy         int64
urea                   int64
ascites                int64
hepatomegaly           int64
hepatalgia             int64
density                int64
ESR                    int64
alt                    int64
ast                    int64
amylase                int64
ggtp                   int64
cholesterol            int64
hbsag                  int64
hbsag_anti             int64
anorexia               int64
nausea                 int64
spleen                 int64
consciousness          int64
spiders                int64
jaundice               int64
albumin                int64
edge                   int64
irregular_liver        int64
hbc_anti               int64
hcv_anti               int64
palms                  int64
hbeag                  int64
carcinoma              int64
Length: 70, dtype: object

And here are the times:
--- 0.34274744987487793 seconds ---
--- 0.008851051330566406 seconds ---

I believe the problem is likely a combination of having a small number of rows (10,000) and a fairly wide table (70 columns). In general, if your computation only takes ~9 ms in pandas, it's likely not a good candidate for GPU acceleration, as the fixed overhead of GPU execution is harder to amortize.

Could you try turning on the pool allocator to see if that would help? This allocates a block of GPU memory up front to avoid overheads from having to call into the CUDA driver to allocate memory. Try this code:

import numpy as np
import pandas as pd
import cudf as cd
import cupy as cp
import time

cd.set_allocator(pool=True, initial_pool_size=int(30e9))

pdf = pd.read_csv('HEPARTWO10k.csv')
pdf = pdf.apply(lambda x: pd.factorize(x)[0])
pdf = pdf.apply(pd.to_numeric)

cdf = cd.from_pandas(pdf)

res = cdf.groupby(['alcoholism']).count()
res1 = pdf.groupby(['alcoholism']).count()

start_time = time.time()
res = cdf.groupby(['alcoholism','vh_amn', 'hepatotoxic', 'THepatitis', 'ChHepatitis']).count()
print("--- %s seconds ---" % (time.time() - start_time))

It is throwing me the following error. Is it a version problem?

AttributeError: module 'cudf' has no attribute 'set_allocator'

I think the system I am using has an older version of cudf; conda list shows:

cudf 0.9.0 cuda10.1_py36_626.gddcad2d https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda

What is the latest version, so that I can use set_allocator?

Are you on a POWER machine or x86? The latest release is 0.14 and 0.13 in particular had major performance improvements for groupbys.

I am on POWER9

We unfortunately do not publish POWER conda packages ourselves. cc @pradghos @kriskend from IBM, who may be able to answer questions about planned releases for PowerAI conda packages.

Thanks for the prompt responses and for including the responsible people in the thread.

A quick question: what kind of shape do you think is ideal for cudf? Since this is a benchmarking test, I am interested to know if there is any estimate.

For example, I just extracted only 5 columns from the original dataset. So now the shape is:

>>> pdf1.shape
(10000, 5)

Now cudf's time improved almost 10 times, whereas pandas remains almost the same. Here are the times:
--- 0.032217979431152344 seconds for cudf---
--- 0.008929252624511719 seconds for pandas---

cuDF is definitely optimized for tall-and-skinny rather than short-and-wide data. Each column is a separate memory allocation following the Apache Arrow memory layout (https://arrow.apache.org/docs/format/Columnar.html#format-columnar), so as your dataframe gets wider you become bottlenecked on memory allocations rather than compute. We have an opt-in memory pool to help alleviate this, which can make a world of difference in a lot of situations. Also, because GPUs are throughput-optimized rather than latency-optimized, you need enough work to saturate the GPU and amortize the cost of kernel launches. We typically see this at ~1MM rows or so (problem dependent).

Pandas is slightly different under the hood since it has the BlockManager where it doesn't deal with as many memory allocations (at the cost of having to do block consolidations later). This is a great blog explaining how Pandas handles memory: https://uwekorn.com/2020/05/24/the-one-pandas-internal.html.
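A numpy-only toy illustration of the allocation difference described above (row and column counts are made up): an Arrow-style columnar table holds one independent buffer per column, while a consolidated BlockManager-style block is a single allocation for every same-dtype column.

```python
import numpy as np

n_rows, n_cols = 10_000, 70

# Arrow-style columnar layout: one independent allocation per column,
# so a 70-column frame means 70 separate buffers (and 70 alloc calls).
columnar = {f"col{i}": np.zeros(n_rows, dtype=np.int64) for i in range(n_cols)}

# BlockManager-style consolidated block: one contiguous 2-D allocation
# holding all same-dtype columns together.
block = np.zeros((n_rows, n_cols), dtype=np.int64)

print(len(columnar))  # 70 buffers to allocate and free
print(block.nbytes)   # a single buffer of n_rows * n_cols * 8 bytes
```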

If you're doing benchmarking I would ask that you reserve judgement / publishing results until you can use the latest 0.14 release of cuDF as there was essentially a full rewrite of the library across 0.12 and 0.13.

Thanks. Sure, I will wait for 0.14.
However, I just did a quick test with simulated data ([10000000 rows x 70 columns]) and got a 10x improvement on the GPU. Here is the result:
--- 0.5639636516571045 seconds on cudf---
--- 5.326399564743042 seconds on pandas---
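A scaled-down, pandas-only sketch of that simulated-data benchmark (sizes, cardinality, and column names are made up; on a GPU machine, building the same frame with cudf instead of pandas would give the cudf timing):

```python
import time

import numpy as np
import pandas as pd

# Simulated categorical data: many rows, few distinct values per column,
# mirroring the structure of the factorized HEPAR dataset.
n_rows, n_cols = 1_000_000, 10
rng = np.random.default_rng(0)
pdf = pd.DataFrame({f"c{i}": rng.integers(0, 4, n_rows) for i in range(n_cols)})

start_time = time.time()
res = pdf.groupby(["c0", "c1", "c2", "c3", "c4"]).count()
print("--- %s seconds ---" % (time.time() - start_time))
```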

@arghyakusumdas6163 I'm going to close this as your initial question has been answered. Feel free to open a new issue regarding getting updated conda packages for POWER9.

Sure. Thanks


Hi @arghyakusumdas6163,

There are currently conda packages available for cudf 0.13 for Power on the Early Access channel for Watson Machine Learning Community Edition. The link to the conda channel is here: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/.
