This is my use-case:
I'm filtering records out of a dask dataframe by matching one of its string columns against a regular expression. My code in dask using map_partitions is:

def custom_regex_filter(df):
    df['m'] = df['text'].data.match(regex_string)
    df = df.query('m == True')
    df = df.drop('m', axis=1)
    return df

gpu_present_df = master_df.map_partitions(custom_regex_filter)
Currently, if I try the code below I get an exception saying Series doesn't have a .data attribute. Instead of the workaround above, it would be nice to expose the .data attribute of string columns in a dask dataframe, so that one could write something like:
master_df['m'] = master_df.text.data.match(regex_string)
master_df = master_df.query("m==True")
The .data attribute is custom to cuDF and doesn't exist in the pandas API. In general, the user shouldn't have to access the underlying column object in order to accomplish their task. What would the equivalent API be in pandas / dask.dataframe for doing a string regex match?
The equivalent in Pandas is: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
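To make the equivalence concrete, here is a minimal pandas sketch of the original filtering use case using `str.contains` (the `regex_string` pattern and the data are made up for illustration, since the thread doesn't show them):

```python
import pandas as pd

# Hypothetical pattern; the original regex_string is not shown in the thread
regex_string = r"^gpu"

df = pd.DataFrame({"text": ["gpu0", "cpu1", "gpu2"]})

# pandas equivalent: a boolean mask via str.contains (regex=True by default),
# so no helper column or query() round-trip is needed
filtered = df[df["text"].str.contains(regex_string, regex=True)]
print(filtered["text"].tolist())  # -> ['gpu0', 'gpu2']
```

The same expression works unchanged on a dask.dataframe, since `Series.str.contains` is part of dask's pandas-compatible API.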
The above is sample code, but my actual requirement is that dask-cudf allow string concatenation of two string columns, like dask does:
dask:
import pandas as pd
import dask.dataframe

x = pd.DataFrame({"b": ["prefix"], "c": ["suffix"]})
ddf = dask.dataframe.from_pandas(x, npartitions=20)
ddf['d'] = ddf.b + ddf.c
print(ddf.compute())
output:
b c d
0 prefix suffix prefixsuffix
But in dask-cudf the following is currently not possible:

import cudf
import dask_cudf

x = cudf.DataFrame({"b": ["prefix"], "c": ["suffix"]})
ddf = dask_cudf.from_cudf(x, npartitions=20)
ddf['d'] = ddf.b + ddf.c
print(ddf.compute())
So I was looking at utilizing the cat function of nvstrings, as we currently do in cudf, to achieve the above.
We could add more binop support in cuDF for StringColumns and/or add functionality around str.contains, but I agree with @kkraus14 that we don't want the user to have to access the .data attribute to accomplish a particular task.
Actually, I just looked, and not only does cudf.col.str.contains exist but so does cat:
In [88]: import cudf
In [89]: cdf = cudf.DataFrame({'a': [1.2, 3.4, 5], 'b': range(3), 'c': ['a', 'b', 'c']})
In [90]: cdf.c.str.cat(cdf.c).to_pandas()
Out[90]:
0 aa
1 bb
2 cc
dtype: object
Still, I could see the binop cdf.c + cdf.c being a nice thing. @kkraus14 thoughts?
@quasiben I wonder if implementing the concatenation binary op for strings may resolve the issue with using str.cat in dask-cudf (just noticed this).
dask-cudf is able to successfully handle StringColumn functionality that isn't explicitly implemented but is accessed via getattr in cudf, which grabs the method from the underlying nvstrings object (data). str.slice is an example of this:
import pandas as pd
import cudf
import dask_cudf

x = pd.DataFrame({"b": ["prefix"] * 5, "c": ["suffix"] * 5})
gx = cudf.from_pandas(x)
gxddf = dask_cudf.from_cudf(gx, 2)
print(gxddf['b'].str.slice(3, 5).compute())
0 fi
1 fi
2 fi
3 fi
4 fi
dtype: object
What we can't do is concatenate with str.cat, regardless of whether it is explicitly implemented or not.
import pandas as pd
import dask
import dask.dataframe
import cudf
import dask_cudf

x = pd.DataFrame({"b": ["prefix"] * 5, "c": ["suffix"] * 5})
ddf = dask.dataframe.from_pandas(x, npartitions=20)
ddf['d'] = ddf.b + ddf.c
print(ddf.compute())
print(ddf['b'].str.cat(ddf['c']).compute())  # works in dask

print()

gx = cudf.from_pandas(x)
print(gx.b.str.cat(gx.c))  # works in cudf; could just set this as a new column in gx

gxddf = dask_cudf.from_cudf(gx, 2)
print(gxddf['b'].str.cat(gxddf['c']).compute())  # not in dask-cudf: fails silently and returns an empty series
b c d
0 prefix suffix prefixsuffix
1 prefix suffix prefixsuffix
2 prefix suffix prefixsuffix
3 prefix suffix prefixsuffix
4 prefix suffix prefixsuffix
0 prefixsuffix
1 prefixsuffix
2 prefixsuffix
3 prefixsuffix
4 prefixsuffix
Name: b, dtype: object
0 prefixsuffix
1 prefixsuffix
2 prefixsuffix
3 prefixsuffix
4 prefixsuffix
dtype: object
<empty Series of dtype=object>
@beckernick yeah, that would probably help! Though, I want to look at what is happening with str.cat in dask-cudf.
Also @galipremsagar , in general you should be able to access any method of the nvstrings data by doing df.text.str.match(regex_string) instead of df.text.data.match(regex_string).
When they are explicitly implemented, you'll see them as a named method. When they are accessed by getting the attribute from the underlying object, you would see the signature as:
gx.b.str.match
<function cudf.dataframe.string.StringMethods.__getattr__.<locals>.wrapper(*args, **kwargs)>
I believe this should resolve the specific issue of needing the .data accessor in dask, though not your actual problem of needing string concatenation unfortunately.
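The getattr-forwarding mechanism described above can be sketched in plain Python. This is a hypothetical `StringAccessor` / `FakeStrings` pair to illustrate the pattern, not the actual cudf `StringMethods` code:

```python
class FakeStrings:
    """Stand-in for the underlying nvstrings-like object."""
    def __init__(self, values):
        self.values = values

    def upper(self):
        return [v.upper() for v in self.values]


class StringAccessor:
    """Explicitly implemented methods would appear as named methods;
    anything else is forwarded to the underlying .data object."""
    def __init__(self, data):
        self.data = data

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so explicit
        # methods take precedence over forwarded ones.
        attr = getattr(self.data, name)  # raises AttributeError if missing

        def wrapper(*args, **kwargs):
            return attr(*args, **kwargs)

        return wrapper


s = StringAccessor(FakeStrings(["a", "b"]))
print(s.upper())  # forwarded to FakeStrings.upper -> ['A', 'B']
```

Inspecting `s.upper` on such an accessor shows a `...__getattr__.<locals>.wrapper` signature, which matches what you see for `gx.b.str.match` above.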
@beckernick hmm, something odd is happening. When dask-cudf hands str.cat a column, that column is wrapped in a tuple:
> /home/nfs/bzaitlen/TESTS/str-cat.py(16)<module>()
15 breakpoint()
---> 16 foo = gxddf['b'].str.cat(gxddf['c']).compute()
17 print(foo)
ipdb> c
> /home/nfs/bzaitlen/GitRepos/cudf/python/cudf/dataframe/string.py(112)cat()
111 breakpoint()
--> 112 if isinstance(others, (Series, Index)):
113 assert others.dtype == np.dtype('object')
ipdb> others
(<cudf.Series nrows=3 >,)
ipdb> self._parent.data.cat(others=others, sep=sep, na_rep=na_rep)
ipdb> self._parent.data.cat(others=others[0].data, sep=sep, na_rep=na_rep)
<nvstrings count=3>
others should be a cudf.Series
Hmm, interesting. Pandas can handle the tuple of objects, so that actually may make sense. From the docstring: "others : Series, Index, DataFrame, np.ndarray or list-like".
x = pd.DataFrame({"b":["prefix"]*5,"c":["suffix"]*5})
tup = (x.c, x.c)
x.b.str.cat(tup)
0 prefixsuffixsuffix
1 prefixsuffixsuffix
2 prefixsuffixsuffix
3 prefixsuffixsuffix
4 prefixsuffixsuffix
Name: b, dtype: object
If others can be list-like, the question may be whether they should be passed all at once within a list-like structure (a tuple in this case) or one at a time. Given that pandas allows for something list-like, we should probably allow for that as well. I think that would solve this issue. What do you think @quasiben?
EDIT: Edited for clarity. Keep adding the wrong snippet by accident.
Currently, the cat method of nvstrings cannot handle multiple nvstrings objects as input, so some potential resolutions could be one of the following:

1. Update the cat method of nvstrings to accept multiple nvstrings objects at once.
2. Loop through the elements of others in the str.cat method if it's list-like. This would require more kernel calls and overhead (though we would still only need to construct the series once, so the overhead probably isn't that large).

Any preferences @kkraus14, @quasiben, @davidwendt?
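The looping idea can be sketched in pure pandas: concatenate each element of a list-like `others` one series at a time. The helper name `cat_listlike` is made up for illustration; this is not the cudf implementation:

```python
import pandas as pd

def cat_listlike(base, others, sep=""):
    # Loop through each element of a list-like `others`,
    # concatenating one series at a time (one "kernel call" each).
    result = base
    for other in others:
        result = result.str.cat(other, sep=sep)
    return result

b = pd.Series(["prefix"] * 3)
c = pd.Series(["suffix"] * 3)
out = cat_listlike(b, (c, b))
print(out.tolist())  # -> ['prefixsuffixprefix', 'prefixsuffixprefix', 'prefixsuffixprefix']
```

This mirrors the `x.b.str.cat((x.c, x.b))` behavior shown in the confirmation snippet below, at the cost of one concatenation pass per element of `others`.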
EDIT: Can confirm that the second approach would resolve the str.cat issue in dask-cudf
import cudf
import dask_cudf

# altered string.py in cudf as suggested above

x = cudf.DataFrame({"b": ["prefix"] * 5, "c": ["suffix"] * 5})
tup = (x.c, x.b)
print(x.b.str.cat(tup))

gxddf = dask_cudf.from_cudf(x, 2)
tup = (gxddf['c'], gxddf['b'])
print(gxddf['b'].str.cat(tup).compute())
0 prefixsuffixprefix
1 prefixsuffixprefix
2 prefixsuffixprefix
3 prefixsuffixprefix
4 prefixsuffixprefix
dtype: object
0 prefixsuffixprefix
1 prefixsuffixprefix
2 prefixsuffixprefix
3 prefixsuffixprefix
4 prefixsuffixprefix
Name: b, dtype: object
Regardless of approach, cuDF str.cat should accept list-like inputs because pandas does. Will create an issue.
@beckernick that's rad you already solved the issue!
I would assume cuDF is the place to resolve this -- but I know very little about nvstrings. If the overhead is not significant, then cuDF seems reasonable.
@galipremsagar do you want to execute the approach discussed above for rapidsai/cudf#1939?
Thanks @beckernick, this works for me.
@galipremsagar do you want to execute the approach discussed above for rapidsai/cudf#1939?
Sure, will work on it
Closing this as it is resolved in #1939