Pandas: Suggestion: method to slice strings using index columns (start and end) in dataframe

Created on 6 Nov 2014  路  12Comments  路  Source: pandas-dev/pandas

What about implementing the following slice function? The df.start and df.end columns contain the start and end index required to slice the df.string.

df.sliced = df.string[df.start:df.end]

Currently we can slice columns with a fixed start and end index

df.sliced = df.string.str.slice(1, -1)

However it would be great if we could do this using variable start and stop indices from the dataframe itself, without the need for lambda functions.

Possible complications:
I can imagine that this would be complicated by the presence of NaN values in the column.

You could either force users to clean up their data first, so they can only apply the function if the column dtypes of the start and stop are integers (basically: take your dirty boots off before stepping into the house!).

Or you could be nice, and apply the slice function to anything in the target column that looks like a string, using anything in the start and end columns that looks like an integer. (not that I would have a clue how to do that!) Using this strategy, return NaN only when invalid strings, NaN or floats, or index-out-of-range are encountered?

This problem was raised along with #8747 in a StackOverflow question. Some code and examples are given.
http://stackoverflow.com/questions/26658213/how-can-i-find-the-start-and-end-of-a-regex-match-using-a-python-pandas-datafram

edit: here is some example code, including a current workaround. Sorry, I'll make sure the code is included immediately next time.

import pandas as pd
#some example strings for slicing, start indices, stop indices
h1,h2,h3 = 'MPTMGFWVYITVE','MP-NSSLVYIGLE','MPLETQDALYVAL' 
s1, s2, s3 = 1,1,2
e1, e2, e3 = 7,7,8
#create a pandas dataframe to hold the aligned sequences
df = pd.DataFrame({'hit':[h1,h2,h3],'start': [s1, s2, s3],'end': [e1, e2, e3]})
#reorder columns
df = df[['hit', 'start', 'end']]

#HERE is where the new syntax would be useful to slice the strings, example
#df.sliced = df.hit.str[df.start:df.end] 
#or the equivalent df['sliced'] = df['hit'].str[df['start']:df['end']]

#Current workaround using a lambda function that specifies the start & end columns.
fn_slice_hit = lambda x : x['hit'][x['start']:x['end']]
#apply the slice function to the dataframe
df['sliced'] = df.apply(fn_slice_hit, axis = 1)

In [2]: df
Out[2]: 
             hit  start  end  sliced
0  MPTMGFWVYITVE      1    7  PTMGFW
1  MP-NSSLVYIGLE      1    7  P-NSSL
2  MPLETQDALYVAL      2    8  LETQDA

[3 rows x 4 columns]
API Design Indexing Strings

Most helpful comment

There is a really fast way to do this.

df['new_column'] = [A[B:C] for A, B, C in zip(df.A, df.start_index df.end_index)]

This solution is the same as Psidom's solution in https://stackoverflow.com/a/45523050/6004997 however I added ability to use start and end index.

All 12 comments

Not sure I like it, but we could _consider_ df.string.str[df.start:df.end].

df.string[df.start:df.end] it out because it conflicts with standard series indexing.

@MarkInLabcoat can you give a code-example here, e.g. copy/pastable (and indicate where the syntax is wanted). Mainly need to boil down the example so its very clear what is needed/wanted.

There are really two enhancements here:

  1. A new notation for string slicing df.string.str[1:-1] instead of df.string.str.slice(1, -1)
  2. Allowing the slice parameters start and stop to be Series objects rather than ints

@jim22k the direct slicing on str already works!

In [1]: s = pd.Series(['abcde', 'fghij'])

In [2]: s
Out[2]:
0    abcde
1    fghij
dtype: object

In [3]: s.str[1:-1]
Out[3]:
0    bcd
1    ghi
dtype: object

But allowing it to be list-likes is indeed an enhancement request.

@jreback
Thanks for the tip. I added some example code.
@jorisvandenbossche
As you say, the enhancement request involves accepting a list-like (list or series) rather than an integer. Perhaps also a list of tuples containing the start and stop indices?
@jim22k
If the notation for slicing was changed to df.string.str[1:-1], this would be have an advantage in that it is compatible with stride df.string.str[1:-1: 2], using a similar syntax as a for single string.

As I said before, the df.string.str[1:-1] notation already works, nothing to change there!
Although slicing with a stride does not work there, but that seems more like a bug.
UPDATE: opened an issue for that: #8754

@jorisvandenbossche
sorry, you're right, df.string.str[1:-1] already works.
However:
stride: df.string.str[0:5:2] gives the same output as df.string.str[0:5]
reverse: df.string.str[::-1] simply gives the original series as output. Another bug?
Are there other string functions that we haven't tested yet?

I came across this issue when I was searching for solution of a similar problem. Just wondering if there are any updates on @MarkInLabcoat's enhancements request No.2?
2. Allowing the slice parameters start and stop to be Series objects rather than ints

no but pull requests are welcome

Closing this for now. PRs welcome

Here's an idea I came up with. Split the variable-offset slicing task into several slicing subtasks each with its own fixed offset (first creating value counts of the various offsets encountered in the data). For each of these fixed offsets taken in turn we can handle all rows in a single call to the current limited str.slice() method.

Here is a sample code snippet which takes around 1 sec to process a 1 million row Series of strings (with 8 different offset values):

# get the array of indexes of the searched substring
searched_substr_ind = cur_raw_data.str.find(searched_substr) 

# prepare the list of variable offsets
substr_offsets = pd.Series(searched_substr_ind).value_counts().index.astype("int64")
substr_offsets = substr_offsets[substr_offsets > 0]

# by default insert raw missings code:
cur_cleaned_data = pd.Series(np.repeat(raw_missings_code, len(cur_raw_data)))

# loop over all substring offsets 
# (caution: can get slow if having too many)
for cur_offset in substr_offsets:

    # identify rows with the current offset
    cur_offset_rows = np.where(searched_substr_ind == cur_offset)

    # for all the identified rows extract the slice using current offset
    cur_cleaned_data.iloc[cur_offset_rows] = \
        cur_raw_data.iloc[cur_offset_rows].str.slice(cur_offset+searched_substr_len, 
                                                     cur_offset+searched_substr_len+ret_substr_len)

print(pd.Series(cur_cleaned_data).value_counts())

There is a really fast way to do this.

df['new_column'] = [A[B:C] for A, B, C in zip(df.A, df.start_index df.end_index)]

This solution is the same as Psidom's solution in https://stackoverflow.com/a/45523050/6004997 however I added ability to use start and end index.

Was this page helpful?
0 / 5 - 0 ratings