Pandas: Feature request: Series.flatmap, DataFrame.flatmap

Created on 9 Oct 2014 · 21 comments · Source: pandas-dev/pandas

I'm working on some language analysis and using pandas to munge the data and grab some descriptive stats. This is just an illustrative example; I'm doing all kinds of slightly different things.

Suppose I have a series containing chunks of text, and I want to turn each line into multiple lines, preserving the index values. Here are the naive results:

In [53]: s=pd.Series(['This is text No 1.', 'and here is no. 2','and 3'],index=['Alice','Bob','Alice'])
    ...: s
Out[53]: 
Alice    This is text No 1.
Bob       and here is no. 2
Alice                 and 3
dtype: object

In [54]: s.map(lambda x: x.split(' '))
Out[54]: 
Alice    [This, is, text, No, 1.]
Bob       [and, here, is, no., 2]
Alice                    [and, 3]
dtype: object

In [55]: s.apply(lambda x: pd.Series(x.split(' ')))
Out[55]: 
          0     1     2    3    4
Alice  This    is  text   No   1.
Bob     and  here    is  no.    2
Alice   and     3   NaN  NaN  NaN

What I'd like to be able to do is this (made-up example):

In [67]: s.flatmap(lambda x: x.split(' '))
Out[67]: 
Alice    This
Alice      is
Alice    text
Alice      No
Alice      1.
Bob       and
Bob      here
Bob        is
Bob       no.
Bob         2
Alice     and
Alice       3
dtype: object

In general, I'd like to be able to explode a single row in a dataframe into multiple rows by transforming one column value into multiple values, each becoming a new row with the values of the other columns preserved. For example:

In [69]: df=pd.DataFrame([['2014-01-01','Alice',"A B"],['2014-01-02','Bob','C D']],columns=['dt','name','text'])
    ...: df
Out[69]: 
           dt   name text
0  2014-01-01  Alice  A B
1  2014-01-02    Bob  C D

In [70]: df.flatmap(lambda x: x.split(), on='text')
Out[70]: 
           dt   name text
0  2014-01-01  Alice  A
1  2014-01-01  Alice  B
2  2014-01-01    Bob  C
3  2014-01-01    Bob  D

Perhaps there's another way to do this, but that's how my natural instinct suggests it should be done; flatmap is a fairly universal concept.
Groupby already does similar things based on return type. It doesn't have to be limited to groupby, though.
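
Both made-up examples can now be written with explode, which pandas added in 0.25, long after this issue was filed; a minimal sketch:

import pandas as pd

s = pd.Series(['This is text No 1.', 'and here is no. 2', 'and 3'],
              index=['Alice', 'Bob', 'Alice'])

# Series case: split into lists, then emit one element per row,
# repeating the index labels -- the flatmap output shown above
s.str.split(' ').explode()

df = pd.DataFrame([['2014-01-01', 'Alice', 'A B'],
                   ['2014-01-02', 'Bob', 'C D']],
                  columns=['dt', 'name', 'text'])

# DataFrame case: replace the column with lists, explode it, and
# renumber; the other column values are preserved in each new row
df.assign(text=df['text'].str.split()).explode('text').reset_index(drop=True)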

Labels: Reshaping, Usage Question

Most helpful comment

This is how I'm doing it for now. It's not the most performant way, but it works:

import pandas as pd

def flatmap(df, func):
    # func maps one row (as a dict) to an iterable of new row dicts
    rows = []
    cols = df.columns
    for r in df.values:
        row = dict(zip(cols, r))
        rows.extend(func(row))
    return pd.DataFrame.from_records(rows)

flatfunc = lambda r: [{"userId": r["userId"], "rating": r["rating"], "genre": genre}
                      for genre in r["genreList"]]
flattened = flatmap(df, flatfunc)

All 21 comments

This is pretty straightforward (and pretty fast) to do directly. https://github.com/pydata/pandas/issues/8428 will make the first part (the .split()) slightly easier, as it will return a frame directly:

In [8]: s.str.split(r'\s+').apply(pd.Series).unstack().dropna()
Out[8]: 
0  Alice    This
   Bob       and
   Alice     and
1  Alice      is
   Bob      here
   Alice       3
2  Alice    text
   Bob        is
3  Alice      No
   Bob       no.
4  Alice      1.
   Bob         2
dtype: object
In [35]: s = pd.Series(['This is text No 1.', 'and here is no. 2', 'and 3'])

In [36]: s.str.split(r'\s+').apply(pd.Series).unstack().T.dropna().reset_index(1, drop=True)
Out[36]: 
0    This
0     and
0     and
1      is
1    here
1       3
2    text
2      is
3      No
3     no.
4      1.
4       2
dtype: object

That does work with slight tweaking, thanks. However, I have about 15 MB of text chunks, and it doesn't seem very fast compared with split()-ing a text string of that size; it's orders of magnitude slower.
Actually, it exhausts my box's memory when I try to run it for all of them, consuming several GB.
If one string has an unusually large word count, this generates tons of useless NaNs for every shorter string. If it can't handle 15 MB of data... is there a less hacky way?

This seems like a natural operation to have supported. In terms of map-reduce, pandas is great at processing rows, but converting one row into multiple rows (the map stage) just isn't there as a stock operation.

Not really sure what you are doing. If you don't need the indices, then just iterate, split, and concatenate.

As much as I like the flatMap interface, I'm not sure how you would apply it to a multi-column DataFrame, except for probably replicating whole rows (is that operation useful?).

For a Series of strings there's nothing to vectorize, and unless you (ab)use a character-wise representation of strings (which you can't do with pandas), there's not much you can do to beat the performance of something along the lines of:

In [38]: s = pd.Series(['This is text No 1.', 'and here is no. 2','and 3']); s
Out[38]: 
0    This is text No 1.
1     and here is no. 2
2                 and 3
dtype: object

In [39]: mapper = lambda x: pd.lib.list_to_object_array(x.split())  # internal helper: list -> object ndarray

In [40]: vals = [(k, mapper(v)) for k, v in s.iteritems()]                    

In [41]: pd.Series(                                                           
    np.concatenate([v[1] for v in vals]),
    np.concatenate([np.repeat(v[0], len(v[1])) for v in vals]),
    name=s.name)
Out[41]: 
0    This
0      is
0    text
0      No
0      1.
1     and
1    here
1      is
1     no.
1       2
2     and
2       3
dtype: object
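
(pd.lib.list_to_object_array was an internal helper and has long since been removed; with public APIs only, an equivalent mapper would be:)

import numpy as np
mapper = lambda x: np.array(x.split(), dtype=object)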

"is that operation useful?" it is for what I'm trying to do, I'd argure for text data it would be very useful. Perhaps with the index gaining another level to numerate the generated sub-results within the original index label.

Thanks for the example; lots of footwork, but it doesn't throw away memory like jreback's hack. You're both basically saying do it somewhere else, then throw it in a DataFrame. Not awesome, but of course always an option.

groupby.apply() converts returned series into multiple rows, doesn't it? Why can't pandas have similar functionality for a DataFrame apply-style operation?

@kay1793 why don't you show a copy/pastable end-to-end example of what you are actually trying to do. Are you trying to, say, get word counts? Or a dummies matrix? (That's what my solution does, but if you have a large word frequency it WILL blow up memory; that also assumes you want to track the index, and apparently you don't.)

@immerrr's solution is just fine for what you are showing. But then you are not harnessing the power of pandas at all. It's just list manipulation.

An example is worth 1000s of words.

Fair enough.

How do I get from this:

In [8]: df=pd.DataFrame([['2014-01-01T00:01:00','Alice',"It was a dark and stormy night. Alice and bob were at their computer"],
   ...:                  ['2014-01-02T00:02:23','Bob',"Detective Morgan was 9 feet tall. Then the mirror broke. He cut his left foot on the shards"]],columns=['ts','name','text'])
   ...: df

to this:

In [10]: df=pd.DataFrame([['2014-01-01T00:01:00','Alice',"It was a dark and stormy night"],
    ...:                  ['2014-01-01T00:01:00','Alice',"Alice and bob were at their computer"],
    ...:                  ['2014-01-02T00:02:23','Bob',"Detective Morgan was 9 feet tall"],
    ...:                  ['2014-01-02T00:02:23','Bob',"Then the mirror broke"],
    ...:                  ['2014-01-02T00:02:23','Bob',"He cut his left foot on the shards"]],
    ...:                  columns=['ts','name','text'],index=pd.MultiIndex.from_arrays([[0,0,1,1,1],[0,1,0,1,2]]))
    ...: df
    ...: 
Out[10]: 
                      ts   name                                  text
0 0  2014-01-01T00:01:00  Alice        It was a dark and stormy night
  1  2014-01-01T00:01:00  Alice  Alice and bob were at their computer
1 0  2014-01-02T00:02:23    Bob      Detective Morgan was 9 feet tall
  1  2014-01-02T00:02:23    Bob                 Then the mirror broke
  2  2014-01-02T00:02:23    Bob    He cut his left foot on the shards

The index is unique; nothing else is, in general. One chunk may have 50 (or 1000) paragraphs, others just one. Once I get it in this form, I can easily do summary stats and selection queries, looking for patterns. Basically, I'm using a pandas dataframe as my corpus.

But then you are not harnessing the power of pandas at all. It's just list manipulation.

Which was exactly my point: if we're doing native object manipulation, there's not much pandas can help with performance-wise.

As for the example, it's helpful indeed. So the desired output is equivalent to taking out that one column, flatmapping it as a series while adding an integer index level, and joining it back with the rest of the dataframe on the original index.
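
A minimal sketch of that recipe against the example above, using explode (added in pandas 0.25) for the flatmap step; the level names 'orig' and 'n' are just illustrative:

import pandas as pd

df = pd.DataFrame(
    [['2014-01-01T00:01:00', 'Alice',
      'It was a dark and stormy night. Alice and bob were at their computer'],
     ['2014-01-02T00:02:23', 'Bob',
      'Detective Morgan was 9 feet tall. Then the mirror broke. '
      'He cut his left foot on the shards']],
    columns=['ts', 'name', 'text'])

# take out the one column and flat-map it, one sentence per row
pieces = df['text'].str.split(r'\.\s+').explode()
# add an integer level enumerating the pieces within each original row
pieces.index = pd.MultiIndex.from_arrays(
    [pieces.index, pieces.groupby(level=0).cumcount()],
    names=['orig', 'n'])
# join back with the rest of the frame on the original index
rest = df.drop(columns='text')
rest.index.name = 'orig'
result = pieces.to_frame().join(rest, on='orig')[['ts', 'name', 'text']]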

Which was exactly my point: if we're doing native object manipulation, there's not much pandas can help with performance-wise.

I think the poor performance of solution #1 was due to bad complexity; otherwise, non-vectorized is good enough for my data sizes.

the desired output is equivalent to <...>

Yes, but with 2 sugars please.

I've come to think of pandas as THE way to work with data, and now that I'm working with a corpus it lacks some operations that seem generally useful and common. The groupby point I made earlier seems solid to me; I hope you'll agree.

Closing, but please reopen if the issue persists.

We would take a cookbook entry with these recipes (the one from https://stackoverflow.com/questions/31080258/pysparks-flatmap-in-pandas is most idiomatic).

This is how I'm doing it for now. It's not the most performant way, but it works:

import pandas as pd

def flatmap(df, func):
    # func maps one row (as a dict) to an iterable of new row dicts
    rows = []
    cols = df.columns
    for r in df.values:
        row = dict(zip(cols, r))
        rows.extend(func(row))
    return pd.DataFrame.from_records(rows)

flatfunc = lambda r: [{"userId": r["userId"], "rating": r["rating"], "genre": genre}
                      for genre in r["genreList"]]
flattened = flatmap(df, flatfunc)

I'm supportive of adding a flatmap function to pandas. Even more so, I would like to have a first-class list/array type.

Is this being considered? I think having a flatMap or bind operation would be very interesting and useful (besides being theoretically satisfying for people with a functional programming background, though that isn't really an argument).

I had a use-case today. I have a dataframe of events, with start time and end time:

| event | start_time | end_time |
| --- | --- | --- |
| A | 2014-01-01 12:00:00 | 2014-01-01 13:45:00 |
| B | 2014-01-01 12:45:00 | 2014-01-01 12:55:00 |
| ... | ... | ... |

And we'd like to transform this into a dataframe that has a row for each minute during which an event was actually happening:

| event | time |
| --- | --- |
| A | 2014-01-01 12:00:00 |
| A | 2014-01-01 12:01:00 |
| A | 2014-01-01 12:02:00 |
| A | 2014-01-01 12:03:00 |
| ... | ... |
| A | 2014-01-01 13:44:00 |
| A | 2014-01-01 13:45:00 |
| B | ... |

There are lots of ways to do this, but if we had a flatMap method this would be almost trivial.

def make_time_table(row):
    return pd.DataFrame(
        {'event': row['event'], 'time': time}
        for time in pd.date_range(...)
    )

df.flatMap(make_time_table)

Or something like it.
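
Filling in the elided call, a sketch that assumes minute frequency with inclusive endpoints (to match the table above), and concatenates the per-row frames manually since flatMap doesn't exist:

import pandas as pd

df = pd.DataFrame({
    'event': ['A', 'B'],
    'start_time': pd.to_datetime(['2014-01-01 12:00:00', '2014-01-01 12:45:00']),
    'end_time': pd.to_datetime(['2014-01-01 13:45:00', '2014-01-01 12:55:00']),
})

def make_time_table(row):
    # one row per minute the event was running, endpoints inclusive
    return pd.DataFrame({
        'event': row['event'],
        'time': pd.date_range(row['start_time'], row['end_time'], freq='min'),
    })

# stand-in for the proposed df.flatMap(make_time_table)
expanded = pd.concat([make_time_table(row) for _, row in df.iterrows()],
                     ignore_index=True)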

I've been using @timotta's function above, which has been surprisingly fast.

The topic I referenced in the link above (https://stackoverflow.com/questions/12680754/split-pandas-dataframe-string-entry-to-separate-rows) also leads to https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows.

I have a long text in one of the dataframe columns, and I need to be able to split the text into 2 or more columns depending on the length of that cell.

Was this closed because of opposition to storing lists in dataframes/series?

I don't think that flatmap requires this, and the lack of flatmap is a fairly big problem with non-trivial and non-performant workarounds. (Scala, Spark, and many other functional systems have it as a first-class operation.)

See https://github.com/pandas-dev/pandas/issues/16538, which has a PR associated with it.

Works great for me. Thanks @timotta!

Slightly modified to simplify, and more similar to the original poster's desired syntax:

def flatmap(self, func):
    # func maps one row (a Series) to an iterable of record dicts
    rows = []
    for idx, row in self.iterrows():
        rows.extend(func(row))
    return pd.DataFrame.from_records(rows)

pd.DataFrame.flatmap = flatmap

df.flatmap(lambda x: ....)
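
For instance, with a hypothetical df carrying the userId/rating/genreList columns from the earlier example:

import pandas as pd

df = pd.DataFrame({'userId': [1, 2],
                   'rating': [5, 3],
                   'genreList': [['drama', 'comedy'], ['action']]})

# uses the patched-in flatmap above; one output row per (user, genre)
flattened = df.flatmap(
    lambda r: [{'userId': r['userId'], 'rating': r['rating'], 'genre': g}
               for g in r['genreList']])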