As string support has matured, we've seen more text processing and streaming event-log processing workflows. For both, we need to assign different "labels" to different ranges of a DataFrame.
For example, a DataFrame df has columns author and line. We want to split each line into tokens while preserving the author for each word.
Input:
'john', 'this is a sentence'
'randy', 'another sentence'
Desired output:
john, this
john, is
john, a
john, sentence
randy, another
randy, sentence
We can do this today inefficiently with groupbys and concats:
words_df = cudf.DataFrame()
words_df['author'] = []
words_df['word'] = []
for author in df['author'].unique():
    words = nvtext.tokenize(df[df['author'] == author]['line'].data)
    words = cudf.Series(words)
    author_df = cudf.DataFrame()
    author_df['word'] = words
    author_df['author'] = author
    words_df = cudf.concat([words_df, author_df])
The initial implementation of assignment by index & ranges is almost done. To avoid the overhead of iterating groupby results + concat above, we should support a "multi set-range" function, where the user can supply a list or Series of values and the ranges which should be filled with each value.
There are a couple of questions I need to ask:
# multi-range-set
words_df['word'][a:b, x:y] = [words1, words2] # words1, words2 are Series
# multi-fill
words_df['author'][a:b, x:y] = [author1, author2] # author1, author2 are Scalars
[a:b, x:y] is not directly supported in pandas.

@harrism, this would require us to revisit how to represent strings in gdf_scalar.
Options:
- cudaMemcpy to pass it to a kernel, or read from host.
- gdf_dtype_extra_info.

Isn't there another option? words_df['word'][a:b, x:y] = authors, where authors is a series with two elements (equal to the number of ranges). The range words_df['word'][a:b] gets authors[0] and words_df['word'][x:y] gets authors[1].
This would not require solving the gdf_scalar problem since it would just be a gdf_column or cudf::column.
The C++ API could just take a target column and three other columns: a column of range starts, a column of range ends, and a column of values to fill in each range.
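For concreteness, a minimal sketch of that signature (hypothetical name and parameter names, ignoring nulls and validation):
// Hypothetical sketch of the multi-range fill API described above.
// starts, ends, and values must have equal lengths; for each i, the
// target range [starts[i], ends[i]) is filled with values[i].
gdf_column multi_range_fill(const gdf_column& target,
                            const gdf_column& starts,
                            const gdf_column& ends,
                            const gdf_column& values);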
> where authors is a series with two elements (equal to the number of ranges).
That's the one I'm currently implementing. BTW are we ok with range starts and range ends being stl vectors?
I would assume they might be computed in cuDF so might need to be columns. @randerzander should comment.
First, just want to be clear on what is happening here. The original df looks like the following:
>>> print(df)
author line
0 john this is a sentence
1 randy another sentence
And we want the resulting DataFrame to look like:
>>> print(words_df)
author word
0 john this
1 john is
2 john a
3 john sentence
0 randy another
1 randy sentence
If that is correct, it seems creating the individual ranges could be complicated and slow. You'd need to know how many tokens/words per author and then sequentially set start/end ranges for each one. So you would still need to loop through the number of authors. For example, john has 4 words and therefore the randy range of length 2 starts at 4 and ends at 5. If you have the word counts for each string you could do a scan, but the output would be offsets and not ranges (maybe multiple scans?).
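For illustration, deriving explicit ranges from per-author word counts might look like this (a sketch with Thrust; counts, starts, and ends are hypothetical device vectors):
// Hypothetical sketch: derive fill ranges from per-author word counts,
// e.g. counts = {4, 2} from nvtext.token_count.
thrust::exclusive_scan(counts.begin(), counts.end(), starts.begin()); // starts = {0, 4}
thrust::transform(starts.begin(), starts.end(), counts.begin(), ends.begin(),
                  thrust::plus<int32_t>());                           // ends = {4, 6}
Even then, the natural result is offsets and lengths, not the [a:b, x:y] range pairs the original API sketch asked for.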
Seems it would be easier to have a fill/repeat method just accept length values corresponding to each column value instead of ranges. For this example, you can get the number of tokens per string using nvtext.token_count.
First, we can tokenize the entire line column in one call as follows:
>>> print(nvtext.tokenize(df['line'].data))
['this', 'is', 'a', 'sentence', 'another', 'sentence']
We could also get the number of tokens/words per string with nvtext.token_count:
>>> tcs = cudf.Series(np.arange(len(df),dtype=np.int32))
>>> nvtext.token_count(df['line'].data, devptr=tcs.data.mem.device_ctypes_pointer.value)
>>> print( tcs )
0 4
1 2
dtype: int32
Then I propose creating a new repeat_count method using the df['author'] column and the array result from token_count as follows:
>>> print( cudf.repeat_count(df['author'], tcs ) )
0 john
1 john
2 john
3 john
4 randy
5 randy
Perhaps a range fill method is still needed for other use cases but I think a simple repeat-count would be more useful here. No need to loop over the number of authors.
This could be coded for fixed-width types using something like the following
(ignoring nulls and errors):
// Total output size (for allocating `results`) and per-row output offsets.
size_type total_count = thrust::reduce(lengths, lengths + size);
thrust::exclusive_scan(lengths, lengths + size, offsets);
// Each input row copies its value into every slot of its output range.
thrust::for_each_n(thrust::counting_iterator<size_type>(0), size,
    [column_data, lengths, offsets, results] __device__ (size_type idx) {
        auto range = results + offsets[idx];
        for (size_type ridx = 0; ridx < lengths[idx]; ++ridx)
            range[ridx] = column_data[idx];
    });
We could create a repeat_count on nvstrings to do the same thing for strings columns.
One other optimization is to have the nvtext.tokenize method return the number of tokens array along with the nvstrings instance. This would save calling nvtext.token_count which essentially has to perform the tokenization operation again.
I think the current blocker is getting @randerzander and others from the Python cuDF team (@kkraus14 @shwina ?) to tell us what kind of Python API is expected for this.
I did some searching and discovered numpy.repeat which looks like what we want to emulate (like @davidwendt 's suggestion).
>>> authors = ["john", "randy"]
>>> sentences = ['this is a sentence', 'another sentence']
>>> numpy.repeat(authors, [len(s.split(' ')) for s in sentences])
array(['john', 'john', 'john', 'john', 'randy', 'randy'], dtype='<U5')
(This is not Pandas or cuDF, but you get the idea).
So the C++ API would be something like:
gdf_column repeat(const gdf_column& in, const gdf_column& count);
(It would enforce that count has integral type)
This is much simpler to use than a multi-range set -- you don't have to compute the ranges.
@randerzander is this what you really want?
Parallel Implementation (pseudocode, since it assumes columns are interchangeable with device_vector, and it ignores nulls):
gdf_column repeat(const gdf_column& in, const gdf_column& count) {
  thrust::exclusive_scan(count.begin(), count.end(), offset.begin());
  output_size = count[count.size()-1] + offset[count.size()-1];
  // Allocate `indices` with output_size elements
  // For each output position i, the first offset greater than i marks the
  // start of the next row, so upper_bound yields (owning row + 1)
  thrust::upper_bound(offset.begin(), offset.end(),
                      counting_iterator(0), counting_iterator(output_size),
                      indices.begin());
  // Allocate `output` with output_size elements
  thrust::transform(indices.begin(), indices.end(), output.begin(),
                    [in](auto i) { return in[i - 1]; });
  return gdf_column{output};
}
Complexity is O(output.size * log(count.size))
Edit: I think you could actually replace the vectorized binary search by calling non-vectorized thrust::upper_bound() inside the transform lambda. Same overall complexity, but less bandwidth, and it eliminates a kernel launch.
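A minimal sketch of that fused variant (assuming in and offset are raw device pointers, with offset holding the exclusive-scan results for rows input rows; nulls ignored):
// Each output element binary-searches the offsets directly, so the
// intermediate `indices` buffer and its kernel launch disappear.
thrust::transform(thrust::counting_iterator<size_type>(0),
                  thrust::counting_iterator<size_type>(output_size),
                  output.begin(),
                  [in, offset, rows] __device__ (size_type i) {
                    // The first offset greater than i starts the next row,
                    // so the owning row is one position earlier.
                    auto it = thrust::upper_bound(thrust::seq, offset, offset + rows, i);
                    return in[(it - offset) - 1];
                  });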
> I did some searching and discovered numpy.repeat which looks like what we want to emulate (like @davidwendt's suggestion).
The Python API I would like is exactly like numpy.repeat. I would like it to be along these lines:
cudf.series.repeat(input_ser=input_series, repeat_ser=repeat_series)
FWIW, in my current workaround for the strings case, I use numpy.repeat on string IDs and token counts, followed by a join, to handle string columns.
OK, so the original request is not what is actually wanted. What is wanted is numpy.repeat. @devavret sorry for the thrash. I think an implementation like I suggested could work well.