**Describe the bug**
The actual dataset is private, so instead of posting it I am posting a minimal reproducible sample that is a near replica of a subset of the data.
**Steps/Code to reproduce bug**
For a column with 10038326 records, GPU memory usage after a chain of string replaces is 3429MiB, while the equivalent `url_decode` call uses extra memory (3733MiB). This overhead is potentially causing dask to OOM on very large datasets (1075366479 records). I have verified that removing the `url_decode` call from the workflow avoids the OOM.
**url_decode:**

```python
import cudf

df = cudf.DataFrame({'path': ['esjfdlk jslkdfj l ldsjk %2Fslk sdkshk ajkhdk s hdksah aksjhdkjs dkjkhjhsad djflkdjlk7%3D%20%3A%2B%24dgff djlfk %3B%23%2C ksdjhfkjdshfkjdshf22%25%21%5B%5D%3F%3E%7C'] * 10038326})
y = df.path.str.url_decode()
```
GPU usage at this point:

```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    346429      C   /opt/conda/envs/rapids/bin/python           3733MiB |
+-----------------------------------------------------------------------------+
```
After restarting the kernel:

**String replace:**

```python
import cudf

df = cudf.DataFrame({'path': ['esjfdlk jslkdfj l ldsjk %2Fslk sdkshk ajkhdk s hdksah aksjhdkjs dkjkhjhsad djflkdjlk7%3D%20%3A%2B%24dgff djlfk %3B%23%2C ksdjhfkjdshfkjdshf22%25%21%5B%5D%3F%3E%7C'] * 10038326})
y = (df.path.str.replace('%2F', '/').str.replace('%27', "'")
     .str.replace('%3D', '=').str.replace('%20', ' ')
     .str.replace('%3A', ':').str.replace('%2B', '+')
     .str.replace('%24', '$').str.replace('%3B', ';')
     .str.replace('%23', '#').str.replace('%2C', ',')
     .str.replace('%2A', '*').str.replace('%40', '@')
     .str.replace('%28', '(').str.replace('%29', ')')
     .str.replace('%5E', '^').str.replace('%5C', '\\')
     .str.replace('%22', '"').str.replace('%25', '%')
     .str.replace('%21', '!').str.replace('%5B', '[')
     .str.replace('%5D', ']').str.replace('%3F', '?')
     .str.replace('%3E', '>').str.replace('%7C', '|'))
```
GPU usage at this point:

```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    347047      C   /opt/conda/envs/rapids/bin/python           3429MiB |
+-----------------------------------------------------------------------------+
```
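As a design note (my own sketch, not from the report): the 24 chained replaces amount to applying a fixed substitution table in order, which can be written as a loop. A plain-Python stand-in for cudf's `Series.str.replace` to illustrate the idea:

```python
# Plain-Python sketch of the substitution table behind the chain above.
# Order matters: escapes are applied in the same order as the original
# chain, so e.g. '%25' -> '%' happens before '%21' -> '!'.
ESCAPES = [
    ('%2F', '/'), ('%27', "'"), ('%3D', '='), ('%20', ' '),
    ('%3A', ':'), ('%2B', '+'), ('%24', '$'), ('%3B', ';'),
    ('%23', '#'), ('%2C', ','), ('%2A', '*'), ('%40', '@'),
    ('%28', '('), ('%29', ')'), ('%5E', '^'), ('%5C', '\\'),
    ('%22', '"'), ('%25', '%'), ('%21', '!'), ('%5B', '['),
    ('%5D', ']'), ('%3F', '?'), ('%3E', '>'), ('%7C', '|'),
]

def manual_url_decode(s):
    # Apply each escape pair in sequence, mirroring the chained calls.
    for esc, ch in ESCAPES:
        s = s.replace(esc, ch)
    return s

print(manual_url_decode('a%2Fb%20c%3Dd'))  # a/b c=d
```

On the GPU side the same table could be driven by a loop over `Series.str.replace` calls, but each call still materializes an intermediate column, so it does not change the memory profile measured above.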
Verified the parity between the two outputs as well:

```python
import cudf

df = cudf.DataFrame({'path': ['esjfdlk jslkdfj l ldsjk %2Fslk sdkshk ajkhdk s hdksah aksjhdkjs dkjkhjhsad djflkdjlk7%3D%20%3A%2B%24dgff djlfk %3B%23%2C ksdjhfkjdshfkjdshf22%25%21%5B%5D%3F%3E%7C'] * 10038326})
y1 = (df.path.str.replace('%2F', '/').str.replace('%27', "'")
      .str.replace('%3D', '=').str.replace('%20', ' ')
      .str.replace('%3A', ':').str.replace('%2B', '+')
      .str.replace('%24', '$').str.replace('%3B', ';')
      .str.replace('%23', '#').str.replace('%2C', ',')
      .str.replace('%2A', '*').str.replace('%40', '@')
      .str.replace('%28', '(').str.replace('%29', ')')
      .str.replace('%5E', '^').str.replace('%5C', '\\')
      .str.replace('%22', '"').str.replace('%25', '%')
      .str.replace('%21', '!').str.replace('%5B', '[')
      .str.replace('%5D', ']').str.replace('%3F', '?')
      .str.replace('%3E', '>').str.replace('%7C', '|'))
y2 = df.path.str.url_decode()
print((y1 == y2).value_counts())
```

```
True    10038326
dtype: int64
```
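As an additional host-side reference point (my own addition, not part of the original check), CPython's `urllib.parse.unquote` performs the same percent-decoding and can sanity-check a single sample string on the CPU:

```python
from urllib.parse import unquote

# The sample string from the repro above.
s = ('esjfdlk jslkdfj l ldsjk %2Fslk sdkshk ajkhdk s hdksah aksjhdkjs '
     'dkjkhjhsad djflkdjlk7%3D%20%3A%2B%24dgff djlfk %3B%23%2C '
     'ksdjhfkjdshfkjdshf22%25%21%5B%5D%3F%3E%7C')

# unquote collapses every '%XX' escape to its single character; the
# only '%' left in the result is the literal one decoded from '%25'.
decoded = unquote(s)
print(decoded)
```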
**Expected behavior**
The same memory footprint as the replace-based workflow, to avoid OOM when performing a scaled run with dask.
**Environment overview**

**Environment details**
cudf/print_env.sh: env_aug1.txt
**Additional context**
Provided above along with the steps.
Logging this issue in cudf since custrings is about to be merged into cudf.
cc: @davidwendt @beckernick @kkraus14
This is actually a good issue to showcase the new nvstrings `device_memory()` method added by Vibhu right before the merge blackout. Here is the example above with just one string:

```python
import nvstrings

s = nvstrings.to_device(['esjfdlk jslkdfj l ldsjk %2Fslk sdkshk ajkhdk s hdksah aksjhdkjs dkjkhjhsad djflkdjlk7%3D%20%3A%2B%24dgff djlfk %3B%23%2C ksdjhfkjdshfkjdshf22%25%21%5B%5D%3F%3E%7C'])
ds = s.url_decode()
rs = (s.replace('%2F', '/').replace('%27', "'")
      .replace('%3D', '=').replace('%20', ' ')
      .replace('%3A', ':').replace('%2B', '+')
      .replace('%24', '$').replace('%3B', ';')
      .replace('%23', '#').replace('%2C', ',')
      .replace('%2A', '*').replace('%40', '@')
      .replace('%28', '(').replace('%29', ')')
      .replace('%5E', '^').replace('%5C', '\\')
      .replace('%22', '"').replace('%25', '%')
      .replace('%21', '!').replace('%5B', '[')
      .replace('%5D', ']').replace('%3F', '?')
      .replace('%3E', '>').replace('%7C', '|'))
print('decode size:', ds.device_memory())
print('replace size:', rs.device_memory())
```

```
decode size: 184
replace size: 152
```
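Since both outputs are the same decoded string, the two sizes should match. The 32-byte gap works out to 2 bytes for each of the 16 escapes in the sample, which is consistent with `url_decode` sizing its output buffer as if no `%XX` sequence shrank (my inference from the numbers; the actual accounting is internal to custrings). A host-side sketch of that arithmetic:

```python
from urllib.parse import unquote

# Same sample string as above.
s = ('esjfdlk jslkdfj l ldsjk %2Fslk sdkshk ajkhdk s hdksah aksjhdkjs '
     'dkjkhjhsad djflkdjlk7%3D%20%3A%2B%24dgff djlfk %3B%23%2C '
     'ksdjhfkjdshfkjdshf22%25%21%5B%5D%3F%3E%7C')

n_escapes = s.count('%')          # every '%' here starts a %XX escape
saved = len(s) - len(unquote(s))  # bytes saved by decoding: 2 per escape

# Reported sizes: decode 184, replace 152 -> gap of 32 bytes,
# matching 2 * n_escapes for this sample.
print(n_escapes, saved)
```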
I found the logic error in `url_decode` that over-calculated the memory size for this and have it fixed locally on my machine.
@davidwendt @harrism Marking this as libcudf because it's custrings and we haven't finished the repo migration quite yet.