Cudf: [BUG] Serializing StringColumns slows when an offset is encountered

Created on 22 Apr 2020 · 7 Comments · Source: rapidsai/cudf

Describe the bug

When serializing a StringColumn with an offset, I am noticing that a fair bit of time is spent adjusting for the offset. In particular, this line shows up during profiling, though the lines after it are likely also expensive.

https://github.com/rapidsai/cudf/blob/65bbee7920182f709a269b8f0aba4b2434f4beb3/python/cudf/cudf/core/column/string.py#L1906

Digging a bit deeper, it appears a fair bit of time is spent here, where a device-to-host transfer occurs. A host-to-device transfer is then likely needed for later operations. Avoiding these transfers would likely help, but there may be other ways to pursue this.

https://github.com/rapidsai/cudf/blob/65bbee7920182f709a269b8f0aba4b2434f4beb3/python/cudf/cudf/core/column/column.py#L417

Steps/Code to reproduce bug

import cudf

s = cudf.Series(["abc", "def", None, "ghi"])
s = s[1:]

s.serialize()
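
To see the cost, the sliced and unsliced cases could be timed side by side. The following is only a sketch, assuming a machine with a CUDA-capable GPU and cuDF installed; `time_serialize` and `compare_sliced_vs_full` are hypothetical helpers written for illustration, not part of cuDF, and the numbers will vary with hardware and cuDF version.

```python
import time


def time_serialize(obj, repeats=10):
    """Return the best wall-clock time in seconds of obj.serialize() over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        obj.serialize()
        best = min(best, time.perf_counter() - start)
    return best


def compare_sliced_vs_full(repeats=10):
    """Time serialization of a full Series and of a slice with a non-zero offset."""
    import cudf  # imported lazily: needs a CUDA-capable GPU

    full = cudf.Series(["abc", "def", None, "ghi"])
    sliced = full[1:]  # non-zero offset, the slow path reported here
    return time_serialize(full, repeats), time_serialize(sliced, repeats)
```

Calling `compare_sliced_vs_full()` returns a `(full, sliced)` pair of timings. On a column this small the difference may be hard to see, so a larger Series is probably needed for a meaningful comparison.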

Expected behavior
A clear and concise description of what you expected to happen.

Environment overview (please complete the following information)

  • Environment location: DGX-1 or DGX-2
  • Method of cuDF install: Conda

Environment details
Standard cuDF install with Conda.

Additional context
NA

cc @VibhuJawa @rgsl888prabhu @kkraus14

bug cuDF (Python) libcudf

All 7 comments

FYI just to give some context here, this is only expensive for sending sliced columns where we need to grab the scalar value of the offset based on the slice, and run a binaryop against a device scalar constructed from that scalar.

I believe there's ongoing work to allow building a device scalar without having to do a D --> H copy or synchronizing the stream here: #4900

Awesome! Thanks for sharing Keith 😄

@kkraus14 but you would still need to do a D --> H copy at the end while creating the char column, to get the size of the char column and to offset its start

I believe there's ongoing work to allow building a device scalar without having to do a D --> H copy or synchronizing the stream here: #4900

D --> H copy happens for string_view and dictionary columns in get_element #4900
For fixed_width scalars, there is no D --> H copy.
In all cases, there is a kernel launch (to copy both data and validity).

These are always integer columns, so that kernel launch should be a lot cheaper than the D --> H --> D that we currently have 😄.

Getting the size of the char column is still going to require a D --> H though.
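
The offset adjustment discussed in this thread can be modeled on the host in plain Python. This is an illustrative sketch only, assuming an Arrow-style layout (a chars buffer plus an offsets array with one more entry than there are rows); `build_string_column` and `rebase_slice` are hypothetical names and this is not cuDF's implementation. The comments mark where the GPU version would need a scalar read or a binaryop.

```python
def build_string_column(strings):
    """Encode a list of str/None into (chars, offsets, validity)."""
    chars = bytearray()
    offsets = [0]
    validity = []
    for s in strings:
        if s is not None:
            chars += s.encode("utf-8")
        validity.append(s is not None)
        offsets.append(len(chars))  # nulls contribute zero bytes
    return bytes(chars), offsets, validity


def rebase_slice(chars, offsets, validity, start, stop):
    """Return a standalone (chars, offsets, validity) for rows [start, stop)."""
    sliced_offsets = offsets[start : stop + 1]
    first = sliced_offsets[0]  # scalar read: a D --> H copy on the GPU
    last = sliced_offsets[-1]  # needed to size the chars buffer: also D --> H
    rebased = [o - first for o in sliced_offsets]  # binaryop against a scalar
    return chars[first:last], rebased, validity[start:stop]


chars, offsets, validity = build_string_column(["abc", "def", None, "ghi"])
new_chars, new_offsets, new_validity = rebase_slice(chars, offsets, validity, 1, 4)
print(new_chars, new_offsets, new_validity)
# b'defghi' [0, 3, 3, 6] [True, False, True]
```

In this model the subtraction can stay on the device if the scalar does, but slicing the chars buffer needs both `first` and `last` as host values, which matches the point that sizing the char column still requires a D --> H copy.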

Tested locally on an internal benchmark with the latest cudf build as of now plus PR #4900: the time consumed drops from 22.42 s to 11.03 s.

Closing this as we have merged #5072
