cuDF: [FEA] multi range-set in libcudf

Created on 12 Aug 2019 · 9 comments · Source: rapidsai/cudf

As string support has matured, we've seen more text processing and streaming event-log processing workflows. For both, we need to assign different "labels" to different ranges of a DataFrame.

For example, a DataFrame df has columns author and line. We want to split each line into tokens, but preserve the author for each word.

Input:

'john', 'this is a sentence'
'randy', 'another sentence'

Desired output:

john, this
john, is
john, a
john, sentence
randy, another
randy, sentence

We can do this today inefficiently with groupbys and concats:

# Build an empty result DataFrame to accumulate into
words_df = cudf.DataFrame()
words_df['author'] = []
words_df['word'] = []

# One pass per author: tokenize that author's lines, then re-attach the
# author and concatenate onto the running result
for author in df['author'].unique():
  words = nvtext.tokenize(df[df['author'] == author]['line'].data)
  words = cudf.Series(words)

  author_df = cudf.DataFrame()
  author_df['word'] = words
  author_df['author'] = author
  words_df = cudf.concat([words_df, author_df])

The initial implementation of assignment by index & ranges is almost done. To avoid the overhead of iterating over groupby results and concatenating as above, we should support a "multi range-set" function, where the user can supply a list or Series of values along with the ranges that should be filled with each value.

Labels: cuDF (Python), feature request, libcudf


All 9 comments

There are a couple of questions I need to ask:

  1. Do you want only multi-fill or multi-range-set? The difference would be:

# multi-range-set
words_df['word'][a:b, x:y] = [words1, words2] # words1, words2 are Series
# multi-fill
words_df['author'][a:b, x:y] = [author1, author2] # author1, author2 are scalars

  2. What would be the pandas equivalent for this? What would the bindings look like? A multi-slice [a:b, x:y] is not directly supported in pandas.

@harrism, this would require us to revisit how to represent strings in gdf_scalar.

Options:

  1. _Use a pointer to a C string._
    The whole point of making the data member of gdf_scalar a union instead of a void pointer is so that we don't have to cudaMemcpy to pass it to a kernel or read it from host.
  2. _Add a gdf_dtype_extra_info member_ (see the sketch below).
    This is consistent with how we deal with strings in columns. The data union can be integral and always 0. The attached NVCategory would have a single value.
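A minimal sketch of what option 2 might look like; the existing members are written from memory of the legacy libcudf headers, and the dtype_info member is only the proposal, not something that exists today:

typedef struct {
  gdf_data             data;       // fixed-width value union; would stay 0 for strings
  gdf_dtype            dtype;      // e.g. GDF_STRING_CATEGORY for a string scalar
  bool                 is_valid;
  gdf_dtype_extra_info dtype_info; // proposed: carries an NVCategory holding the single value
} gdf_scalar;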

Isn't there another option? words_df['word'][a:b, x:y] = authors where authors is a series with two elements (equal to the number of ranges). The range words_df['word'][a:b] gets authors[0] and words_df['word'][x:y] gets authors[1].

This would not require solving the gdf_scalar problem since it would just be a gdf_column or cudf::column.

The C++ API could just take a target column and three other columns: a column of range starts, a column of range ends, and a column of values to fill in each range.
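A rough sketch of what that signature could look like with the legacy gdf_column API (the function and parameter names here are illustrative, not an existing libcudf call):

// Illustrative only: fill target[starts[i]:ends[i]) with values[i] for each range i
// starts, ends and values all have the same length (the number of ranges)
void fill_ranges(gdf_column*       target,
                 const gdf_column* starts,   // integral range starts (inclusive)
                 const gdf_column* ends,     // integral range ends (exclusive)
                 const gdf_column* values);  // one fill value per range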

where authors is a series with two elements (equal to the number of ranges).

That's the one I'm currently implementing. BTW, are we OK with range starts and range ends being STL vectors?

I would assume they might be computed in cuDF so might need to be columns. @randerzander should comment.

First, just want to be clear on what is happening here. The original df looks like the following:

>>> print(df)
  author                line
0   john  this is a sentence
1  randy    another sentence

And we want the resulting DataFrame to look like:

>>> print(words_df)
  author      word
0   john      this
1   john        is
2   john         a
3   john  sentence
0  randy   another
1  randy  sentence

If that is correct, it seems creating the individual ranges could be complicated and slow. You'd need to know how many tokens/words per author and then sequentially set start/end ranges for each one. So you would still need to loop through the number of authors. For example, john has 4 words, and therefore the randy range of length 2 starts on 4 and ends on 5. If you have the word counts for each string you could do a scan, but the output would be offsets and not ranges (maybe multiple scans?).
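To illustrate the offsets point, here is a minimal host-side Thrust snippet using the counts from this example (a standalone sketch, not cuDF code):

// Minimal illustration: an exclusive scan turns per-line word counts into start offsets
#include <thrust/scan.h>
#include <cstdio>
#include <vector>

int main() {
  std::vector<int> counts{4, 2};     // words per line: john=4, randy=2
  std::vector<int> offsets(counts.size());
  thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
  // offsets == {0, 4}: each author's first output row, not (start, end) range pairs
  std::printf("%d %d\n", offsets[0], offsets[1]);
  return 0;
}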

It seems it would be easier to have a fill/repeat method that just accepts length values corresponding to each column value instead of ranges. For this example, you can get the number of tokens per string using nvtext.token_count.

First, we can tokenize the entire line column in one call as follows:

>>> print(nvtext.tokenize(df['line'].data))
['this', 'is', 'a', 'sentence', 'another', 'sentence']

We could also get the number of tokens/words per string with nvtext.token_count:

>>> tcs = cudf.Series(np.arange(len(df),dtype=np.int32))
>>> nvtext.token_count(df['line'].data, devptr=tcs.data.mem.device_ctypes_pointer.value)
>>> print( tcs )
0    4
1    2
dtype: int32

Then I propose creating a new repeat_count method using the df['author'] column and the array result from token_count as follows:

>>> print( cudf.repeat_count(df['author'], tcs ) )
0    john
1    john
2    john
3    john
4    randy
5    randy

Perhaps a range fill method is still needed for other use cases but I think a simple repeat-count would be more useful here. No need to loop over the number of authors.

This could be coded for fixed-width types using something like the following
(ignoring syntax, nulls, and errors):

    // total number of output elements
    size_type total_count = thrust::reduce( lengths, lengths + size );
    // offsets[idx] = starting output position for input element idx
    thrust::exclusive_scan( lengths, lengths + size, offsets );
    // copy each input value into its run of output positions
    thrust::for_each_n( thrust::counting_iterator<size_type>(0), size,
        [column_data, lengths, offsets, results] (size_type idx) {
            auto range = results + offsets[idx];
            for( size_type ridx = 0; ridx < lengths[idx]; ++ridx )
                range[ridx] = column_data[idx];
        });

We could create a repeat_count on nvstrings to do the same thing for strings columns.

One other optimization is to have the nvtext.tokenize method return the number of tokens array along with the nvstrings instance. This would save calling nvtext.token_count which essentially has to perform the tokenization operation again.

I think the current blocker is getting @randerzander and others from the Python cuDF team (@kkraus14 @shwina ?) to tell us what kind of Python API is expected for this.

I did some searching and discovered numpy.repeat which looks like what we want to emulate (like @davidwendt 's suggestion).

>>> authors = ["john", "randy"]
>>> sentences = ['this is a sentence', 'another sentence']
>>> numpy.repeat(authors, [len(s.split(' ')) for s in sentences])
array(['john', 'john', 'john', 'john', 'randy', 'randy'], dtype='<U5')

(This is not Pandas or cuDF, but you get the idea).

So the C++ API would be something like:

gdf_column repeat(const gdf_column& in, const gdf_column& count); 

(It would enforce that count has integral type)

This is much simpler to use than a multi-range set -- you don't have to compute the ranges.

@randerzander is this what you really want?

Parallel Implementation (pseudocode, since it assumes columns are interchangeable with device_vector, and it ignores nulls):

gdf_column repeat(const gdf_column &in, const gdf_column& count) {
  // offset[i] = start position of in[i]'s run of repeats in the output
  thrust::exclusive_scan(count.begin(), count.end(), offset.begin());
  output_size = count[count.size()-1] + offset[count.size()-1];
  // Allocate `indices` with output_size elements.
  // For each output position, find the run it falls in: upper_bound gives
  // one past the run index, so the gather below subtracts 1.
  thrust::upper_bound(offset.begin(), offset.end(),
                      counting_iterator(0), counting_iterator(output_size),
                      indices.begin());
  // Allocate `output` with output_size elements and gather the input values
  thrust::transform(indices.begin(), indices.end(), output.begin(),
                    [](auto i) { return in[i - 1]; });
  return gdf_column{output};
}

Complexity is O(output.size * log(count.size))

Edit: I think you could actually replace the vectorized search by calling the sequential thrust::upper_bound() inside the transform lambda. Same overall complexity, but less bandwidth, and it eliminates a kernel launch.
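A minimal sketch of that fused variant, under the same pseudocode assumptions (in and offset as raw device pointers, num_runs standing in for count.size(), nulls ignored):

// Fused variant: one kernel that does the binary search and the gather per output element
thrust::transform(counting_iterator(0), counting_iterator(output_size),
                  output.begin(),
                  [in, offset, num_runs] (auto pos) {
                    // run containing `pos`: the last offset that is <= pos
                    auto it = thrust::upper_bound(thrust::seq, offset, offset + num_runs, pos);
                    return in[(it - offset) - 1];
                  });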

I did some searching and discovered numpy.repeat which looks like what we want to emulate (like @davidwendt 's suggestion).

The Python API I would like is exactly like numpy.repeat.

I would like it to be along these lines:

cudf.series.repeat(input_ser=input_series, repeat_ser=repeat_series)

FWIW, in my current workarounds for string cases, I use numpy.repeat on string ids and token counts, followed by a join, to handle string columns.

OK, so the original request is not what is actually wanted. What is wanted is numpy.repeat. @devavret sorry for the thrash. I think an implementation like I suggested could work well.

