[BUG] cuDF str.split performance appears to be orders of magnitude slower than Pandas

Created on 13 Apr 2020 · 8 comments · Source: rapidsai/cudf

Describe the bug

cuDF Series str.split performance appears to be orders of magnitude slower than Pandas.

Steps/Code to reproduce bug

# Construct DataFrame illustration
import cudf
import pandas as pd
import numpy as np

data = {
    "col_0": [','.join(map(str, np.linspace(-1, 1, 400))) for _ in range(1)]
}

df = pd.DataFrame(data=data)
cdf = cudf.from_pandas(df)

print("Pandas:", flush=True)
%time df['col_0'].str.split(",", expand=True).reset_index(drop=True)
Pandas:
CPU times: user 402 ms, sys: 35 ms, total: 437 ms
Wall time: 26.8 ms

print("cuDF:", flush=True)
%time cdf['col_0'].str.split(",", expand=True).reset_index(drop=True)
cuDF:
CPU times: user 42.2 s, sys: 16.7 s, total: 58.9 s
Wall time: 58.8 s

Expected behavior

Performance should be more closely aligned with Pandas.

Environment overview

  • Environment location: Bare-metal (Conda environment)
  • Method of cuDF install: conda

Environment details

cudatoolkit               10.1.243             h6bb024c_0    nvidia
cudf                      0.13.0                   py37_0    rapidsai
cudnn                     7.6.0                cuda10.1_0    nvidia
cuml                      0.13.0          cuda10.1_py37_0    rapidsai
cupy                      7.3.0            py37h0632833_0    conda-forge
dask-cudf                 0.13.0                   py37_0    rapidsai
libcudf                   0.13.0               cuda10.1_0    rapidsai
libcuml                   0.13.0               cuda10.1_0    rapidsai
libcumlprims              0.13.0               cuda10.1_0    nvidia
libnvstrings              0.13.0               cuda10.1_0    rapidsai
librmm                    0.13.0               cuda10.1_0    rapidsai
ucx                       1.7.0+g9d06c3a       cuda10.1_0    rapidsai

Additional context
Add any other context about the problem here.

Labels: Performance, bug, cuDF (Python), libcudf, strings


All 8 comments

It seems libcudf is taking most of the time:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   47.683   47.683   47.714   47.714 {cudf._lib.strings.split.split.split}
      400    0.167    0.000    0.221    0.001 {cudf._lib.copying.copy_column}
        1    0.018    0.018   47.732   47.732 string.py:804(split)
     5604    0.015    0.000    0.038    0.000 dtypes.py:79(is_categorical_dtype)
    73666    0.015    0.000    0.015    0.000 {built-in method builtins.hasattr}
    32026    0.013    0.000    0.027    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
     2400    0.008    0.000    0.046    0.000 column.py:1039(build_column)
    72948    0.007    0.000    0.007    0.000 {built-in method builtins.isinstance}
        1    0.007    0.007   47.965   47.965 <string>:1(<module>)


Can you try with the pool allocator? This looks like a situation where allocating 400 columns is likely the bottleneck, but I could be mistaken.

cc @davidwendt

Tried with the pool allocator, and the problem still persists:

In [1]: import cudf
   ...:
   ...: # Use the pool allocator with a 2GiB initial pool size
   ...: cudf.set_allocator(pool=True, initial_pool_size=2<<30)

In [3]: import pandas as pd
   ...: import numpy as np

In [4]: data = {
   ...:     "col_0": [','.join(map(str, np.linspace(-1, 1, 400))) for _ in range(1)]
   ...: }

In [5]: df = pd.DataFrame(data=data)
   ...: cdf = cudf.from_pandas(df)

In [6]: print("cuDF:", flush=True)
   ...: %prun cdf['col_0'].str.split(",", expand=True).reset_index(drop=True)
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   47.646   47.646   47.676   47.676 {cudf._lib.strings.split.split.split}
      400    0.043    0.000    0.094    0.000 {cudf._lib.copying.copy_column}
     5604    0.015    0.000    0.037    0.000 dtypes.py:79(is_categorical_dtype)
    73666    0.015    0.000    0.015    0.000 {built-in method builtins.hasattr}
    32026    0.013    0.000    0.026    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
     2400    0.008    0.000    0.044    0.000 column.py:1039(build_column)
    72948    0.007    0.000    0.007    0.000 {built-in method builtins.isinstance}
        1    0.007    0.007   47.782   47.782 <string>:1(<module>)
     2400    0.005    0.000    0.008    0.000 column.py:41(__init__)
     8006    0.003    0.000    0.005    0.000 common.py:2020(pandas_dtype)
     2400    0.003    0.000    0.004    0.000 buffer.py:135(_buffer_data_from_array_interface)
     2400    0.002    0.000    0.007    0.000 buffer.py:87(_init_from_array_like)

Also, this is a single string (one row) and therefore only a single GPU thread is used to split it. There is no parallelism used here and so this will run much slower than on the CPU. The GPU would likely have an advantage over the CPU here only when 1000s or 10000s strings are being processed in parallel.
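If parallelism is the deciding factor, one way to check is to scale the row count and time the split at each size. Below is a minimal CPU-side sketch using pandas only; the `time_split` helper is hypothetical, and the cuDF comparison is left as a comment since it requires a GPU machine:

```python
import time

import numpy as np
import pandas as pd


def time_split(frame):
    """Time one expand-split of the 'col_0' column, in seconds (hypothetical helper)."""
    start = time.perf_counter()
    frame["col_0"].str.split(",", expand=True)
    return time.perf_counter() - start


# One comma-joined string of 400 floats, replicated n_rows times.
row = ",".join(map(str, np.linspace(-1, 1, 400)))
for n_rows in (1, 1000, 10000):
    df = pd.DataFrame({"col_0": [row] * n_rows})
    # On a GPU machine, compare against: time_split(cudf.from_pandas(df))
    print(n_rows, round(time_split(df), 4))
```

If the single-thread explanation holds, the cuDF timings should improve relative to pandas as `n_rows` grows into the thousands.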

Yeah, I understand that, but ~47s seems crazy excessive, no?

Also, this is a single string (one row) and therefore only a single GPU thread is used to split it. There is no parallelism used here and so this will run much slower than on the CPU. The GPU would likely have an advantage over the CPU here only when 1000s or 10000s strings are being processed in parallel.

My observations are that this is not the case. You can check by simply setting the range argument from 1 to however many (N) strings you want.

For example, with N=1000 and the number of linspace steps reduced to 200 (with 400 it runs much longer):
Pandas:
CPU times: user 48.1 ms, sys: 4 ms, total: 52.1 ms
Wall time: 51.8 ms
cuDF:
CPU times: user 34.9 s, sys: 12.1 s, total: 47.1 s
Wall time: 47.1 s

N=10,000, steps=200
Pandas:
CPU times: user 824 ms, sys: 74.2 ms, total: 898 ms
Wall time: 897 ms
cuDF:
CPU times: user 57 s, sys: 19.5 s, total: 1min 16s
Wall time: 1min 16s
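The scaled benchmark described above can be parameterized like this (a pandas-only sketch; `make_split_frame` is a hypothetical helper, and the cuDF step is commented out since it needs a GPU):

```python
import numpy as np
import pandas as pd


def make_split_frame(n_rows, n_steps):
    """Build an n_rows frame whose 'col_0' is a comma-joined string of n_steps floats."""
    row = ",".join(map(str, np.linspace(-1, 1, n_steps)))
    return pd.DataFrame({"col_0": [row] * n_rows})


# N=1000 rows, 200 steps, matching the first reported configuration.
df = make_split_frame(1000, 200)
out = df["col_0"].str.split(",", expand=True)
print(out.shape)  # (1000, 200)
# cdf = cudf.from_pandas(df)  # then %time cdf['col_0'].str.split(",", expand=True)
```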

I see that! Wow. Thanks for the code snippet here Devin. This will be very helpful. I should be able to profile this to see what is going on.
