cuDF: [Feature request] Dask should expose .data to access underlying nvstrings functionality

Created on 6 Jun 2019 · 14 comments · Source: rapidsai/cudf

This is my use-case:
I'm filtering out records from a dask dataframe in which one of the string columns is compared against a regular expression. My code in dask using map_partitions is:

def custom_regex_filter(df):
    df['m'] = df['text'].data.match(regex_string)
    df = df.query('m==True')
    df = df.drop('m',axis=1)
    return df
gpu_present_df = master_df.map_partitions(custom_regex_filter)

Currently, if I try the code below, I get an exception saying Series doesn't have a .data attribute. Instead of the workaround above, it would be nice for string columns in a dask dataframe to expose the .data attribute, so that I could write something like:

master_df['m'] = master_df.text.data.match(regex_string)
master_df = master_df.query("m==True")

dask-cudf

All 14 comments


The .data attribute is custom to cuDF and doesn't exist in the Pandas API. In general, the user shouldn't have to access the underlying column object in order to accomplish their task. What would the equivalent API be in Pandas / dask.dataframe for doing a string regex match?


The equivalent in Pandas is: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
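For reference, the original filter can be written at the pandas API level without touching any underlying column object. A minimal pandas sketch (the sample data and pattern here are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data and pattern, just to illustrate the API shape
df = pd.DataFrame({"text": ["gpu0", "cpu1", "gpu2"]})
regex_string = r"^gpu"

# Boolean mask from a regex match; no .data access needed
mask = df["text"].str.match(regex_string)
filtered = df[mask]
print(filtered["text"].tolist())  # ['gpu0', 'gpu2']
```

Since this stays within the Series API, the same function should work unchanged inside map_partitions.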

The above is sample code, but my actual requirement is for dask-cudf to allow string concatenation of two string columns, like dask does:

dask:

import pandas as pd
import dask.dataframe

x = pd.DataFrame({"b": ["prefix"], "c": ["suffix"]})
ddf = dask.dataframe.from_pandas(x, npartitions=20)
ddf['d'] = ddf.b + ddf.c
print(ddf.compute())

output:

        b       c             d
0  prefix  suffix  prefixsuffix

But in dask-cudf the following is currently not possible:

import cudf
import dask_cudf

x = cudf.DataFrame({"b": ["prefix"], "c": ["suffix"]})
ddf = dask_cudf.from_cudf(x, npartitions=20)
ddf['d'] = ddf.b + ddf.c
print(ddf.compute())

So I was looking at utilizing the cat function of nvstrings, like we currently do in cudf, to achieve the above.

We could have more binary op support in cuDF for StringColumns and/or add functionality around str.contains, but I agree with @kkraus14 that we don't want the user to have to access the .data attribute to achieve a particular task.

Actually, I just looked, and not only does str.contains exist on cudf columns but so does cat:

In [88]: import cudf

In [89]: cdf = cudf.DataFrame({'a': [1.2, 3.4, 5], 'b': range(3), 'c': ['a', 'b', 'c']})

In [90]: cdf.c.str.cat(cdf.c).to_pandas()
Out[90]:
0    aa
1    bb
2    cc
dtype: object

Still, I could see the binop of cdf.c + cdf.c being a nice thing. @kkraus14, thoughts?
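For comparison, pandas already implements + on object-dtype string Series as elementwise concatenation, which is the behavior the binop would mirror. A minimal pandas sketch:

```python
import pandas as pd

# On object-dtype Series, + concatenates elementwise
s = pd.Series(["a", "b", "c"])
print((s + s).tolist())  # ['aa', 'bb', 'cc']
```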

@quasiben I wonder if implementing the concatenation binary op for strings may resolve the issue with using str.cat in dask-cudf (just noticed this).

dask-cudf is able to successfully handle StringColumn functionality that isn't explicitly implemented but is accessed via __getattr__ in cudf, which grabs the method from the underlying nvstrings object (data). str.slice is an example of this:

import pandas as pd
import cudf
import dask_cudf

x = pd.DataFrame({"b": ["prefix"]*5, "c": ["suffix"]*5})
gx = cudf.from_pandas(x)
gxddf = dask_cudf.from_cudf(gx, 2)
print(gxddf['b'].str.slice(3, 5).compute())
0    fi
1    fi
2    fi
3    fi
4    fi
dtype: object

What we can't do is concatenate with str.cat, regardless of whether it is explicitly implemented or not.

import pandas as pd
import dask.dataframe
import cudf
import dask_cudf

x = pd.DataFrame({"b": ["prefix"]*5, "c": ["suffix"]*5})
ddf = dask.dataframe.from_pandas(x, npartitions=20)
ddf['d'] = ddf.b + ddf.c
print(ddf.compute())
print(ddf['b'].str.cat(ddf['c']).compute())  # works in dask

print()

gx = cudf.from_pandas(x)
print(gx.b.str.cat(gx.c))  # works in cudf; could just set this as a new column in gx

gxddf = dask_cudf.from_cudf(gx, 2)
print(gxddf['b'].str.cat(gxddf['c']).compute())  # not in dask-cudf: fails silently and returns an empty series
        b       c             d
0  prefix  suffix  prefixsuffix
1  prefix  suffix  prefixsuffix
2  prefix  suffix  prefixsuffix
3  prefix  suffix  prefixsuffix
4  prefix  suffix  prefixsuffix
0    prefixsuffix
1    prefixsuffix
2    prefixsuffix
3    prefixsuffix
4    prefixsuffix
Name: b, dtype: object

0    prefixsuffix
1    prefixsuffix
2    prefixsuffix
3    prefixsuffix
4    prefixsuffix
dtype: object
<empty Series of dtype=object>

@beckernick yeah, that would probably help! Though, I want to look at what is happening with str.cat in dask-cudf.

👍

Also @galipremsagar , in general you should be able to access any method of the nvstrings data by doing df.text.str.match(regex_string) instead of df.text.data.match(regex_string).

When they are explicitly implemented, you'll see them as a named method. When they are accessed by getting the attribute from the underlying object, you would see the signature as:

gx.b.str.match
<function cudf.dataframe.string.StringMethods.__getattr__.<locals>.wrapper(*args, **kwargs)>

I believe this should resolve the specific issue of needing the .data accessor in dask, though not your actual problem of needing string concatenation unfortunately.
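The __getattr__ fallback described above can be sketched in plain Python. This is a simplified illustration of the delegation pattern, not cuDF's actual implementation; here a plain str stands in for the underlying nvstrings object:

```python
class StringMethods:
    """Simplified stand-in for a .str accessor that delegates to an underlying object."""

    def __init__(self, data):
        # `data` plays the role of the underlying nvstrings object
        self._data = data

    def __getattr__(self, attr):
        # Only called when the attribute isn't found on StringMethods itself:
        # fall back to the underlying object's method, wrapped
        underlying = getattr(self._data, attr)

        def wrapper(*args, **kwargs):
            return underlying(*args, **kwargs)

        return wrapper


s = StringMethods("prefix")
print(s.upper())    # delegates to str.upper -> 'PREFIX'
print(s.find("f"))  # delegates to str.find -> 3
```

Explicitly implemented methods would resolve normally; only the missing ones fall through to the wrapper, which is why their signature shows up as the generic wrapper(*args, **kwargs).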

@beckernick hmm, something odd is happening. When dask-cudf hands str.cat a column, that column is wrapped in a tuple:

> /home/nfs/bzaitlen/TESTS/str-cat.py(16)<module>()
     15 breakpoint()
---> 16 foo = gxddf['b'].str.cat(gxddf['c']).compute()
     17 print(foo)

ipdb> c
> /home/nfs/bzaitlen/GitRepos/cudf/python/cudf/dataframe/string.py(112)cat()
    111         breakpoint()
--> 112         if isinstance(others, (Series, Index)):
    113             assert others.dtype == np.dtype('object')

ipdb> others
(<cudf.Series nrows=3 >,)
ipdb> self._parent.data.cat(others=others, sep=sep, na_rep=na_rep)
ipdb> self._parent.data.cat(others=others[0].data, sep=sep, na_rep=na_rep)
<nvstrings count=3>

others should be a cudf.Series

Hmm, interesting. Pandas can handle the tuple of objects, so that actually may make sense. From the docstring: "others : Series, Index, DataFrame, np.ndarray or list-like".

import pandas as pd

x = pd.DataFrame({"b": ["prefix"]*5, "c": ["suffix"]*5})
tup = (x.c, x.c)
print(x.b.str.cat(tup))
0    prefixsuffixsuffix
1    prefixsuffixsuffix
2    prefixsuffixsuffix
3    prefixsuffixsuffix
4    prefixsuffixsuffix
Name: b, dtype: object

If others can be list-like, the question may be whether they should be passed all at once within a list-like structure (a tuple in this case) or one at a time. Given that pandas allows for something list-like, we should probably allow for that as well. I think that would solve this issue. What do you think, @quasiben?

EDIT: Edited for clarity. Kept adding the wrong snippet by accident.

Currently, the cat method of nvstrings cannot handle multiple nvstrings objects as inputs, so some potential resolutions would be one of the following:

  • Change nvstrings upstream and then subsequently allow for list-like inputs in the cuDF str.cat method;
  • Keep nvstrings as is and add branching logic to iterate through others if it's list-like. This would require more kernel calls and overhead (though we would still only need to construct the series once, so the overhead probably isn't that large).

Any preferences, @kkraus14, @quasiben, @davidwendt?

EDIT: I can confirm that the second approach resolves the str.cat issue in dask-cudf:

import cudf
import dask_cudf

# altered string.py in cudf as suggested above

x = cudf.DataFrame({"b": ["prefix"]*5, "c": ["suffix"]*5})
tup = (x.c, x.b)
print(x.b.str.cat(tup))

gxddf = dask_cudf.from_cudf(x, 2)
tup = (gxddf['c'], gxddf['b'])
print(gxddf['b'].str.cat(tup).compute())
0    prefixsuffixprefix
1    prefixsuffixprefix
2    prefixsuffixprefix
3    prefixsuffixprefix
4    prefixsuffixprefix
dtype: object
0    prefixsuffixprefix
1    prefixsuffixprefix
2    prefixsuffixprefix
3    prefixsuffixprefix
4    prefixsuffixprefix
Name: b, dtype: object

Regardless of approach, cuDF str.cat should accept list-like inputs because pandas does. Will create an issue.

@beckernick that's rad you already solved the issue!

I would assume cuDF is the place to resolve it, but I know very little about nvstrings. If the overhead is not significant, then cuDF seems reasonable.

@galipremsagar do you want to execute the approach discussed above for rapidsai/cudf#1939?

👍


Thanks @beckernick, this works for me.


Sure, will work on it

Closing this as it is resolved in #1939
