I am getting roughly 10 times worse performance with cuDF compared to pandas for the following code segment. What could the issue be?
import numpy as np
import pandas as pd
import cudf as cd
import cupy as cp
import time
pdf = pd.read_csv('HEPARTWO10k.csv')
pdf.apply(lambda x: pd.factorize(x)[0])
cdf = cd.from_pandas(pdf)
res = cdf.groupby(['alcoholism']).count()
res1 = pdf.groupby(['alcoholism']).count()
start_time = time.time()
res = cdf.groupby(['alcoholism','vh_amn', 'hepatotoxic', 'THepatitis', 'ChHepatitis']).count()
print("--- %s seconds ---" % (time.time() - start_time))
start_time = time.time()
res1 = pdf.groupby(['alcoholism','vh_amn', 'hepatotoxic', 'THepatitis', 'ChHepatitis']).count()
print("--- %s seconds ---" % (time.time() - start_time))
--- 0.3670494556427002 seconds ---
--- 0.0398714542388916 seconds ---
As can be seen, pandas shows almost 10 times better groupby performance. What could be going wrong? What am I missing?
Output of nvidia-smi:
$ nvidia-smi
Mon Jun 22 09:38:06 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 28C P0 52W / 300W | 18775MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 28C P0 38W / 300W | 10MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 26C P0 38W / 300W | 10MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 30C P0 51W / 300W | 31289MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
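As an aside for anyone reproducing this: a single time.time() pair also measures one-time warm-up costs. Below is a sketch of a fairer harness on synthetic data; the column names are taken from the snippet above, but the data itself is made up:

```python
import time
import numpy as np
import pandas as pd

# Synthetic stand-in for the factorized HEPARTWO10k frame: 10,000 rows,
# the five key columns from the snippet above, plus one value column.
rng = np.random.default_rng(0)
cols = ['alcoholism', 'vh_amn', 'hepatotoxic', 'THepatitis', 'ChHepatitis']
pdf = pd.DataFrame({c: rng.integers(0, 3, size=10_000) for c in cols})
pdf['val'] = rng.integers(0, 100, size=10_000)

def bench(fn, repeats=5):
    fn()  # warm-up: the first call pays one-time setup costs
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

t = bench(lambda: pdf.groupby(cols).count())
print(f"pandas groupby.count(): {t:.6f} s (best of {5})")
```

The same bench wrapper can time the cuDF groupby; for GPU work the warm-up run matters even more, since the first call pays kernel-compilation and memory-allocation costs.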
@arghyakusumdas6163 how many rows are in your pdf / cdf dataframe? What are the dtypes of the columns? What version of cudf are you using?
Thanks for your prompt response:
The CSV dataset can be downloaded from here (all of the columns are categorical): http://www.ccd.pitt.edu/wiki/images/HEPARTWO10k.csv
I converted it to numerical values using pandas' factorize method and am doing the groupby on that. Here are the details of the converted table:
>>> pdf.shape
(10000, 70)
>>> pdf.dtypes
alcoholism object
vh_amn object
hepatotoxic object
THepatitis object
hospital object
surgery object
gallstones object
choledocholithotomy object
injections object
transfusion object
ChHepatitis object
sex object
age object
PBC object
fibrosis object
diabetes object
obesity object
Steatosis object
Cirrhosis object
Hyperbilirubinemia object
triglycerides object
RHepatitis object
fatigue object
bilirubin object
itching object
upper_pain object
fat object
pain_ruq object
pressure_ruq object
phosphatase object
...
flatulence object
alcohol object
encephalopathy object
urea object
ascites object
hepatomegaly object
hepatalgia object
density object
ESR object
alt object
ast object
amylase object
ggtp object
cholesterol object
hbsag object
hbsag_anti object
anorexia object
nausea object
spleen object
consciousness object
spiders object
jaundice object
albumin object
edge object
irregular_liver object
hbc_anti object
hcv_anti object
palms object
hbeag object
carcinoma object
Length: 70, dtype: object
Do I need to convert everything to int/float? Is that the problem?
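For reference, pd.factorize returns integer codes plus the unique categories; applied column-wise it converts an all-object frame to integer codes (note the result has to be assigned back). A minimal sketch on made-up data:

```python
import pandas as pd

# Two object-dtype columns with a few categorical values (invented examples)
raw = pd.DataFrame({
    'alcoholism': ['absent', 'present', 'absent'],
    'sex': ['male', 'female', 'female'],
})

# factorize returns (codes, uniques); keep only the integer codes
codes = raw.apply(lambda col: pd.factorize(col)[0])

print(codes.dtypes)  # every column is now int64
print(codes)         # first unique value in each column maps to 0, next to 1, ...
```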
I changed the data frame to int64. Pandas performance improved by another order of magnitude, but cuDF still shows a similar time:
New Code (added to_numeric):
import numpy as np
import pandas as pd
import cudf as cd
import cupy as cp
import time
pdf = pd.read_csv('HEPARTWO10k.csv')
pdf = pdf.apply(lambda x: pd.factorize(x)[0])
pdf = pdf.apply(pd.to_numeric)
cdf = cd.from_pandas(pdf)
res = cdf.groupby(['alcoholism']).count()
res1 = pdf.groupby(['alcoholism']).count()
start_time = time.time()
res = cdf.groupby(['alcoholism','vh_amn', 'hepatotoxic', 'THepatitis', 'ChHepatitis']).count()
print("--- %s seconds ---" % (time.time() - start_time))
And here is the new dtypes:
>>> cdf.dtypes
alcoholism int64
vh_amn int64
hepatotoxic int64
THepatitis int64
hospital int64
surgery int64
gallstones int64
choledocholithotomy int64
injections int64
transfusion int64
ChHepatitis int64
sex int64
age int64
PBC int64
fibrosis int64
diabetes int64
obesity int64
Steatosis int64
Cirrhosis int64
Hyperbilirubinemia int64
triglycerides int64
RHepatitis int64
fatigue int64
bilirubin int64
itching int64
upper_pain int64
fat int64
pain_ruq int64
pressure_ruq int64
phosphatase int64
...
flatulence int64
alcohol int64
encephalopathy int64
urea int64
ascites int64
hepatomegaly int64
hepatalgia int64
density int64
ESR int64
alt int64
ast int64
amylase int64
ggtp int64
cholesterol int64
hbsag int64
hbsag_anti int64
anorexia int64
nausea int64
spleen int64
consciousness int64
spiders int64
jaundice int64
albumin int64
edge int64
irregular_liver int64
hbc_anti int64
hcv_anti int64
palms int64
hbeag int64
carcinoma int64
Length: 70, dtype: object
And here is the time:
--- 0.34274744987487793 seconds ---
--- 0.008851051330566406 seconds ---
I believe the problem is likely due to a combination of having a small number of rows (10,000 rows) and being a bit wide (70 columns). In general, if your computation is only taking ~9ms in Pandas it's likely not a good candidate for GPU acceleration as the small amount of overhead for GPUs is harder to amortize.
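To make the overhead argument concrete, here is a back-of-the-envelope model; the constants below are rough assumptions read off the timings in this thread, not measured properties of any particular GPU:

```python
# Rough model: total_time = fixed_overhead + rows / throughput
gpu_overhead_s = 0.30   # assumed per-call launch/allocation overhead (~0.34 s seen above)
gpu_rows_per_s = 50e6   # assumed GPU groupby throughput (illustrative)
cpu_rows_per_s = 1.2e6  # assumed CPU throughput (10k rows in ~9 ms above)

def model_time(rows, overhead, throughput):
    """Estimated wall time for a groupby over `rows` rows."""
    return overhead + rows / throughput

for rows in (10_000, 1_000_000, 10_000_000):
    gpu = model_time(rows, gpu_overhead_s, gpu_rows_per_s)
    cpu = model_time(rows, 0.0, cpu_rows_per_s)
    print(f"{rows:>10,} rows: GPU ~{gpu:.3f}s  CPU ~{cpu:.3f}s")
```

At 10,000 rows the fixed overhead dominates and the CPU wins; at millions of rows the per-row term dominates and the GPU pulls ahead, which matches the 10M-row measurement later in this thread.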
Could you try turning on the pool allocator to see if that would help? This allocates a block of GPU memory up front to avoid overheads from having to call into the CUDA driver to allocate memory. Try this code:
import numpy as np
import pandas as pd
import cudf as cd
import cupy as cp
import time
cd.set_allocator(pool=True, initial_pool_size=int(30e9))
pdf = pd.read_csv('HEPARTWO10k.csv')
pdf = pdf.apply(lambda x: pd.factorize(x)[0])
pdf = pdf.apply(pd.to_numeric)
cdf = cd.from_pandas(pdf)
res = cdf.groupby(['alcoholism']).count()
res1 = pdf.groupby(['alcoholism']).count()
start_time = time.time()
res = cdf.groupby(['alcoholism','vh_amn', 'hepatotoxic', 'THepatitis', 'ChHepatitis']).count()
print("--- %s seconds ---" % (time.time() - start_time))
It is throwing the following error. Is it a version problem?
AttributeError: module 'cudf' has no attribute 'set_allocator'
I think the system I am using has an older version of cudf.
conda list revealed the version:
cudf 0.9.0 cuda10.1_py36_626.gddcad2d https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
What is the latest version, so that I can use set_allocator?
Are you on a POWER machine or x86? The latest release is 0.14 and 0.13 in particular had major performance improvements for groupbys.
I am on POWER9
We unfortunately do not publish POWER conda packages ourselves. cc @pradghos @kriskend from IBM who maybe can answer questions about planned releases for PowerAI conda packages?
Thanks for the prompt responses and including the responsible persons in the thread.
A quick question: what shape do you think is ideal for cuDF? Since this is a benchmarking test, I am interested to know if there is any estimate.
For example, I just extracted only 5 columns from the original dataset. So now the shape is:
>>> pdf1.shape
(10000, 5)
Now the cuDF time improved almost 10 times, whereas pandas remains almost the same. Here are the times:
--- 0.032217979431152344 seconds for cudf---
--- 0.008929252624511719 seconds for pandas---
cuDF is definitely optimized for tall and skinny dataframes as opposed to short and wide ones. Each column is a separate memory allocation following the Apache Arrow columnar memory layout (https://arrow.apache.org/docs/format/Columnar.html#format-columnar), so when a dataframe goes wide you become bottlenecked on memory allocations rather than compute. We have an opt-in memory pool to help alleviate this, which can make a world of difference in a lot of situations. Also, because GPUs are throughput-optimized rather than latency-optimized, you need enough work to saturate the GPU and amortize the cost of kernel launches. We typically see this at ~1MM rows or so (problem dependent).
Pandas is slightly different under the hood since it has the BlockManager where it doesn't deal with as many memory allocations (at the cost of having to do block consolidations later). This is a great blog explaining how Pandas handles memory: https://uwekorn.com/2020/05/24/the-one-pandas-internal.html.
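The block consolidation mentioned above can be observed through pandas internals; _mgr is a private API and may change between versions, so this is purely illustrative:

```python
import numpy as np
import pandas as pd

# Columns of the same dtype are consolidated into a single 2-D block,
# so 70 float64 columns are one allocation under the BlockManager.
homogeneous = pd.DataFrame(np.zeros((4, 70)))

# Mixed dtypes get one block per dtype (int64, float64, object here).
mixed = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})

print(len(homogeneous._mgr.blocks))  # 1
print(len(mixed._mgr.blocks))        # 3
```

By contrast, an Arrow-layout library like cuDF keeps one buffer per column, which is why column count matters more there.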
If you're doing benchmarking I would ask that you reserve judgement / publishing results until you can use the latest 0.14 release of cuDF as there was essentially a full rewrite of the library across 0.12 and 0.13.
Thanks. Sure, I will wait until 0.14.
However, I just did a quick test with simulated data of [10000000 rows x 70 columns] and got a 10x improvement on the GPU. Here is the result:
--- 0.5639636516571045 seconds on cudf---
--- 5.326399564743042 seconds on pandas---
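For anyone wanting to reproduce the large-scale comparison, here is a sketch of how such a simulated frame might be generated (scaled down here; the exact generation method used above was not shared, so the details are assumptions). The pandas frame can be moved to the GPU with cudf.from_pandas for the cuDF run:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows, n_cols = 100_000, 70  # scaled down from the 10M x 70 test above

# Low-cardinality integer columns, mimicking factorized categorical data
df = pd.DataFrame(
    rng.integers(0, 4, size=(n_rows, n_cols)),
    columns=[f'c{i}' for i in range(n_cols)],
)

keys = ['c0', 'c1', 'c2', 'c3', 'c4']
res = df.groupby(keys).count()
print(res.shape)  # at most 4**5 = 1024 groups, 65 counted columns
```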
@arghyakusumdas6163 I'm going to close this as your initial question has been answered. Feel free to open a new issue regarding getting updated conda packages for POWER9.
Sure. Thanks
From: Keith Kraus notifications@github.com
Sent: Tuesday, June 23, 2020 5:03:56 PM
To: rapidsai/cudf cudf@noreply.github.com
Cc: Arghya K Das dasa@uwplatt.edu; Mention mention@noreply.github.com
Subject: Re: [rapidsai/cudf] [BUG] Slow performance with groupby.count() (#5533)
Hi @arghyakusumdas6163,
There are currently conda packages available for cudf 0.13 for Power on the Early Access channel for Watson Machine Learning Community Edition. The link to the conda channel is here: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/.