Describe the bug
cuDF Series str.split performance appears to be orders of magnitude slower than Pandas.
Steps/Code to reproduce bug
# Construct a single-row DataFrame: one string of 400 comma-separated values
import cudf
import pandas as pd
import numpy as np
data = {
"col_0": [','.join(map(str, np.linspace(-1, 1, 400))) for _ in range(1)]
}
df = pd.DataFrame(data=data)
cdf = cudf.from_pandas(df)
print("Pandas:", flush=True)
%time df['col_0'].str.split(",", expand = True).reset_index(drop=True)
Pandas:
CPU times: user 402 ms, sys: 35 ms, total: 437 ms
Wall time: 26.8 ms
print("cuDF:", flush=True)
%time cdf['col_0'].str.split(",", expand = True).reset_index(drop=True)
cuDF:
CPU times: user 42.2 s, sys: 16.7 s, total: 58.9 s
Wall time: 58.8 s
Expected behavior
Performance should be more closely aligned with Pandas.
Environment overview
Environment location: Bare-metal with Conda environment. Method of cuDF install: Conda.
Environment details
Relevant packages from the Conda environment:
cudatoolkit 10.1.243 h6bb024c_0 nvidia
cudf 0.13.0 py37_0 rapidsai
cudnn 7.6.0 cuda10.1_0 nvidia
cuml 0.13.0 cuda10.1_py37_0 rapidsai
cupy 7.3.0 py37h0632833_0 conda-forge
dask-cudf 0.13.0 py37_0 rapidsai
libcudf 0.13.0 cuda10.1_0 rapidsai
libcuml 0.13.0 cuda10.1_0 rapidsai
libcumlprims 0.13.0 cuda10.1_0 nvidia
libnvstrings 0.13.0 cuda10.1_0 rapidsai
librmm 0.13.0 cuda10.1_0 rapidsai
ucx 1.7.0+g9d06c3a cuda10.1_0 rapidsai
Additional context
It seems libcudf is taking most of the time
ncalls tottime percall cumtime percall filename:lineno(function)
1 47.683 47.683 47.714 47.714 {cudf._lib.strings.split.split.split}
400 0.167 0.000 0.221 0.001 {cudf._lib.copying.copy_column}
1 0.018 0.018 47.732 47.732 string.py:804(split)
5604 0.015 0.000 0.038 0.000 dtypes.py:79(is_categorical_dtype)
73666 0.015 0.000 0.015 0.000 {built-in method builtins.hasattr}
32026 0.013 0.000 0.027 0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
2400 0.008 0.000 0.046 0.000 column.py:1039(build_column)
72948 0.007 0.000 0.007 0.000 {built-in method builtins.isinstance}
1 0.007 0.007 47.965 47.965 <string>:1(<module>)
Can you try with the pool allocator? This looks like a situation where allocating 400 columns is likely the bottleneck, but I could be mistaken.
cc @davidwendt
Tried with the pool allocator, and the problem still persists:
In [1]: import cudf
...:
...: # Use the pool allocator with a 2GiB initial pool size
...: cudf.set_allocator(pool=True, initial_pool_size=2<<30)
In [3]: import pandas as pd
...: import numpy as np
In [4]: data = {
...: "col_0": [','.join(map(str, np.linspace(-1, 1, 400))) for _ in range(1)]
...: }
In [5]:
In [5]: df = pd.DataFrame(data=data)
...: cdf = cudf.from_pandas(df)
In [6]: print("cuDF:", flush=True)
...: %prun cdf['col_0'].str.split(",", expand = True).reset_index(drop=True)
ncalls tottime percall cumtime percall filename:lineno(function)
1 47.646 47.646 47.676 47.676 {cudf._lib.strings.split.split.split}
400 0.043 0.000 0.094 0.000 {cudf._lib.copying.copy_column}
5604 0.015 0.000 0.037 0.000 dtypes.py:79(is_categorical_dtype)
73666 0.015 0.000 0.015 0.000 {built-in method builtins.hasattr}
32026 0.013 0.000 0.026 0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
2400 0.008 0.000 0.044 0.000 column.py:1039(build_column)
72948 0.007 0.000 0.007 0.000 {built-in method builtins.isinstance}
1 0.007 0.007 47.782 47.782 <string>:1(<module>)
2400 0.005 0.000 0.008 0.000 column.py:41(__init__)
8006 0.003 0.000 0.005 0.000 common.py:2020(pandas_dtype)
2400 0.003 0.000 0.004 0.000 buffer.py:135(_buffer_data_from_array_interface)
2400 0.002 0.000 0.007 0.000 buffer.py:87(_init_from_array_like)
Also, this is a single string (one row) and therefore only a single GPU thread is used to split it. There is no parallelism used here and so this will run much slower than on the CPU. The GPU would likely have an advantage over the CPU here only when 1000s or 10000s strings are being processed in parallel.
Yeah, I understand that, but ~47s seems excessively slow, no?
My observations are that this is not the case. You can check by simply setting the range argument from 1 to however many N strings you want.
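A minimal sketch of the parameterized construction behind the timings below (the names N and steps are introduced here for illustration; everything else follows the original repro):
import cudf
import pandas as pd
import numpy as np
# N rows, each holding a comma-separated string of `steps` floats
N = 1000
steps = 200
data = {
    "col_0": [','.join(map(str, np.linspace(-1, 1, steps))) for _ in range(N)]
}
df = pd.DataFrame(data=data)
cdf = cudf.from_pandas(df)
print("Pandas:", flush=True)
%time df['col_0'].str.split(",", expand=True).reset_index(drop=True)
print("cuDF:", flush=True)
%time cdf['col_0'].str.split(",", expand=True).reset_index(drop=True)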
For example, with N=1000 and the number of linspace steps reduced to 200 (with 400 steps it runs much longer):
Pandas:
CPU times: user 48.1 ms, sys: 4 ms, total: 52.1 ms
Wall time: 51.8 ms
cuDF:
CPU times: user 34.9 s, sys: 12.1 s, total: 47.1 s
Wall time: 47.1 s
N=10,000, steps=200
Pandas:
CPU times: user 824 ms, sys: 74.2 ms, total: 898 ms
Wall time: 897 ms
cuDF:
CPU times: user 57 s, sys: 19.5 s, total: 1min 16s
Wall time: 1min 16s
I see that! Wow. Thanks for the code snippet here Devin. This will be very helpful. I should be able to profile this to see what is going on.