**Describe the bug**
The actual dataset is private, so instead of posting it I am posting a minimal reproducible sample that is a near replica of a subset of the data.
**Steps/Code to reproduce bug**
For a column with 10038326 records, GPU memory usage after a chain of string replaces is 3429MiB, while the equivalent `url_decode` call uses extra memory (3733MiB). This overhead is potentially causing dask to OOM on very large datasets (1075366479 records). I have verified that removing the `url_decode` call from the workflow avoids the OOM.
**url_decode:**

```python
import cudf

df = cudf.DataFrame({'path': ['esjfdlk jslkdfj l ldsjk %2Fslk sdkshk ajkhdk s hdksah aksjhdkjs dkjkhjhsad djflkdjlk7%3D%20%3A%2B%24dgff djlfk %3B%23%2C ksdjhfkjdshfkjdshf22%25%21%5B%5D%3F%3E%7C'] * 10038326})
y = df.path.str.url_decode()
```
GPU usage at this point:

```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    346429      C   /opt/conda/envs/rapids/bin/python           3733MiB |
+-----------------------------------------------------------------------------+
```
After restarting the kernel:

**String replace:**

```python
import cudf

df = cudf.DataFrame({'path': ['esjfdlk jslkdfj l ldsjk %2Fslk sdkshk ajkhdk s hdksah aksjhdkjs dkjkhjhsad djflkdjlk7%3D%20%3A%2B%24dgff djlfk %3B%23%2C ksdjhfkjdshfkjdshf22%25%21%5B%5D%3F%3E%7C'] * 10038326})
y = (df.path.str.replace('%2F', '/').str.replace('%27', "'")
     .str.replace('%3D', '=').str.replace('%20', ' ')
     .str.replace('%3A', ':').str.replace('%2B', '+')
     .str.replace('%24', '$').str.replace('%3B', ';')
     .str.replace('%23', '#').str.replace('%2C', ',')
     .str.replace('%2A', '*').str.replace('%40', '@')
     .str.replace('%28', '(').str.replace('%29', ')')
     .str.replace('%5E', '^').str.replace('%5C', '\\')
     .str.replace('%22', '"').str.replace('%25', '%')
     .str.replace('%21', '!').str.replace('%5B', '[')
     .str.replace('%5D', ']').str.replace('%3F', '?')
     .str.replace('%3E', '>').str.replace('%7C', '|'))
```
GPU usage at this point:

```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    347047      C   /opt/conda/envs/rapids/bin/python           3429MiB |
+-----------------------------------------------------------------------------+
```
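As a design note (my own sketch, not from the report): the 24 chained replaces amount to applying a fixed substitution table in order, which can be written as a loop. A plain-Python stand-in for cudf's `Series.str.replace` to illustrate the idea:

```python
# Plain-Python sketch of the substitution table behind the chain above.
# Order matters: escapes are applied in the same order as the original
# chain, so e.g. '%25' -> '%' happens before '%21' -> '!'.
ESCAPES = [
    ('%2F', '/'), ('%27', "'"), ('%3D', '='), ('%20', ' '),
    ('%3A', ':'), ('%2B', '+'), ('%24', '$'), ('%3B', ';'),
    ('%23', '#'), ('%2C', ','), ('%2A', '*'), ('%40', '@'),
    ('%28', '('), ('%29', ')'), ('%5E', '^'), ('%5C', '\\'),
    ('%22', '"'), ('%25', '%'), ('%21', '!'), ('%5B', '['),
    ('%5D', ']'), ('%3F', '?'), ('%3E', '>'), ('%7C', '|'),
]

def manual_url_decode(s):
    # Apply each escape pair in sequence, mirroring the chained calls.
    for esc, ch in ESCAPES:
        s = s.replace(esc, ch)
    return s

print(manual_url_decode('a%2Fb%20c%3Dd'))  # a/b c=d
```

On the GPU side the same table could be driven by a loop over `Series.str.replace` calls, but each call still materializes an intermediate column, so it does not change the memory profile measured above.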
Verified the parity between the two outputs as well:

```python
import cudf

df = cudf.DataFrame({'path': ['esjfdlk jslkdfj l ldsjk %2Fslk sdkshk ajkhdk s hdksah aksjhdkjs dkjkhjhsad djflkdjlk7%3D%20%3A%2B%24dgff djlfk %3B%23%2C ksdjhfkjdshfkjdshf22%25%21%5B%5D%3F%3E%7C'] * 10038326})
y1 = (df.path.str.replace('%2F', '/').str.replace('%27', "'")
      .str.replace('%3D', '=').str.replace('%20', ' ')
      .str.replace('%3A', ':').str.replace('%2B', '+')
      .str.replace('%24', '$').str.replace('%3B', ';')
      .str.replace('%23', '#').str.replace('%2C', ',')
      .str.replace('%2A', '*').str.replace('%40', '@')
      .str.replace('%28', '(').str.replace('%29', ')')
      .str.replace('%5E', '^').str.replace('%5C', '\\')
      .str.replace('%22', '"').str.replace('%25', '%')
      .str.replace('%21', '!').str.replace('%5B', '[')
      .str.replace('%5D', ']').str.replace('%3F', '?')
      .str.replace('%3E', '>').str.replace('%7C', '|'))
y2 = df.path.str.url_decode()
print((y1 == y2).value_counts())
```

```
True    10038326
dtype: int64
```
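As an additional host-side reference point (my own addition, not part of the original check), CPython's `urllib.parse.unquote` performs the same percent-decoding and can sanity-check a single sample string on the CPU:

```python
from urllib.parse import unquote

# The sample string from the repro above.
s = ('esjfdlk jslkdfj l ldsjk %2Fslk sdkshk ajkhdk s hdksah aksjhdkjs '
     'dkjkhjhsad djflkdjlk7%3D%20%3A%2B%24dgff djlfk %3B%23%2C '
     'ksdjhfkjdshfkjdshf22%25%21%5B%5D%3F%3E%7C')

# unquote collapses every '%XX' escape to its single character; the
# only '%' left in the result is the literal one decoded from '%25'.
decoded = unquote(s)
print(decoded)
```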
**Expected behavior**
The same memory footprint as the replace-based workflow, to avoid OOM when performing a scaled run with dask.
**Environment overview**

**Environment details**
cudf/print_env.sh: env_aug1.txt
**Additional context**
Provided above along with the steps.
Logging this issue in cudf since custrings is about to be merged into cudf.
cc: @davidwendt @beckernick @kkraus14
This is actually a good issue to showcase the new nvstrings `device_memory()` method added by Vibhu right before the merge blackout. Here is the example above with just one string:

```python
import nvstrings

s = nvstrings.to_device(['esjfdlk jslkdfj l ldsjk %2Fslk sdkshk ajkhdk s hdksah aksjhdkjs dkjkhjhsad djflkdjlk7%3D%20%3A%2B%24dgff djlfk %3B%23%2C ksdjhfkjdshfkjdshf22%25%21%5B%5D%3F%3E%7C'])
ds = s.url_decode()
rs = (s.replace('%2F', '/').replace('%27', "'")
      .replace('%3D', '=').replace('%20', ' ')
      .replace('%3A', ':').replace('%2B', '+')
      .replace('%24', '$').replace('%3B', ';')
      .replace('%23', '#').replace('%2C', ',')
      .replace('%2A', '*').replace('%40', '@')
      .replace('%28', '(').replace('%29', ')')
      .replace('%5E', '^').replace('%5C', '\\')
      .replace('%22', '"').replace('%25', '%')
      .replace('%21', '!').replace('%5B', '[')
      .replace('%5D', ']').replace('%3F', '?')
      .replace('%3E', '>').replace('%7C', '|'))
print('decode size:', ds.device_memory())
print('replace size:', rs.device_memory())
```

```
decode size: 184
replace size: 152
```
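Since both outputs are the same decoded string, the two sizes should match. The 32-byte gap works out to 2 bytes for each of the 16 escapes in the sample, which is consistent with `url_decode` sizing its output buffer as if no `%XX` sequence shrank (my inference from the numbers; the actual accounting is internal to custrings). A host-side sketch of that arithmetic:

```python
from urllib.parse import unquote

# Same sample string as above.
s = ('esjfdlk jslkdfj l ldsjk %2Fslk sdkshk ajkhdk s hdksah aksjhdkjs '
     'dkjkhjhsad djflkdjlk7%3D%20%3A%2B%24dgff djlfk %3B%23%2C '
     'ksdjhfkjdshfkjdshf22%25%21%5B%5D%3F%3E%7C')

n_escapes = s.count('%')          # every '%' here starts a %XX escape
saved = len(s) - len(unquote(s))  # bytes saved by decoding: 2 per escape

# Reported sizes: decode 184, replace 152 -> gap of 32 bytes,
# matching 2 * n_escapes for this sample.
print(n_escapes, saved)
```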
I found the logic error in `url_decode` that over-calculated the memory size for this and have it fixed locally on my machine.
@davidwendt @harrism Marking this as libcudf because it's custrings and we haven't finished the repo migration quite yet.