I'm working on some language analysis, using pandas to munge the data and grab some descriptive stats. This is just an illustrative example; I'm doing all kinds of slightly different things.
Suppose I have a series containing chunks of text, and I want to turn each row into multiple rows, preserving the index values. Here are the naive results:
In [53]: s=pd.Series(['This is text No 1.', 'and here is no. 2','and 3'],index=['Alice','Bob','Alice'])
...: s
Out[53]:
Alice This is text No 1.
Bob and here is no. 2
Alice and 3
dtype: object
In [54]: s.map(lambda x: x.split(' '))
Out[54]:
Alice [This, is, text, No, 1.]
Bob [and, here, is, no., 2]
Alice [and, 3]
dtype: object
In [55]: s.apply(lambda x: pd.Series(x.split(' ')))
Out[55]:
           0     1     2    3    4
Alice   This    is  text   No   1.
Bob      and  here    is  no.    2
Alice    and     3   NaN  NaN  NaN
What I'd like to be able to do is something like this (made-up example):
In [67]: s.flatmap(lambda x: x.split(' '))
Out[67]:
Alice This
Alice is
Alice text
Alice No
Alice 1.
Bob and
Bob here
Bob is
Bob no.
Bob 2
Alice and
Alice 3
dtype: object
In general, I'd like to be able to explode a single row in a dataframe into multiple rows by transforming one column value into multiple values, each becoming a new row with the values of the other columns preserved. For example:
In [69]: df=pd.DataFrame([['2014-01-01','Alice',"A B"],['2014-01-02','Bob','C D']],columns=['dt','name','text'])
...: df
Out[69]:
dt name text
0 2014-01-01 Alice A B
1 2014-01-02 Bob C D
In [70]: df.flatmap(lambda x: x.split(),on='text')
dt name text
0 2014-01-01 Alice A
1 2014-01-01 Alice B
2 2014-01-01 Bob C
3 2014-01-01 Bob D
Perhaps there's another way to do this, but that's how my natural instinct suggests it should be done; flatmap is a fairly universal concept.
Groupby already does similar things based on return type, though it doesn't have to be limited to groupby.
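With hindsight, pandas 0.25 later added DataFrame.explode, which covers this made-up example directly. A minimal sketch, assuming the df from the example above:

```python
import pandas as pd

df = pd.DataFrame([['2014-01-01', 'Alice', 'A B'],
                   ['2014-01-02', 'Bob', 'C D']],
                  columns=['dt', 'name', 'text'])

# Split the text column into lists, then emit one row per element;
# explode() repeats the values of the other columns (pandas >= 0.25).
out = (df.assign(text=df['text'].str.split())
         .explode('text')
         .reset_index(drop=True))
```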
This is pretty straightforward (and pretty fast) to simply do. https://github.com/pydata/pandas/issues/8428 will make the first part slightly easier (the .split()), as it will return a frame directly:
In [8]: s.str.split(r'\s+').apply(pd.Series).unstack().dropna()
Out[8]:
0 Alice This
Bob and
Alice and
1 Alice is
Bob here
Alice 3
2 Alice text
Bob is
3 Alice No
Bob no.
4 Alice 1.
Bob 2
dtype: object
s = pd.Series(['This is text No 1.', 'and here is no. 2', 'and 3'])
s.str.split(r'\s+').apply(pd.Series).unstack().T.dropna().reset_index(1, drop=True)
Out[36]:
0 This
0 and
0 and
1 is
1 here
1 3
2 text
2 is
3 No
3 no.
4 1.
4 2
dtype: object
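For what it's worth, the feature from https://github.com/pydata/pandas/issues/8428 did eventually land as str.split(expand=True). A sketch of the same pipeline using it:

```python
import pandas as pd

s = pd.Series(['This is text No 1.', 'and here is no. 2', 'and 3'])

# expand=True returns a frame directly, so the apply(pd.Series) step
# is no longer needed; stack() drops the NaN padding by default.
flat = s.str.split(expand=True).stack().reset_index(1, drop=True)
```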
That does work with slight tweaking, thanks. However, I have about 15 MB of text chunks, and it doesn't seem very fast compared with split()ing a text string of that size; it's orders of magnitude slower.
Actually, it exhausts my box's memory when I try to run it on all of them, consuming several GB.
If one string has an unusually large word count, this generates tons of useless NaNs for every shorter string. If it can't handle 15 MB of data... Is there a less hacky way?
This seems like a natural operation to have supported. In terms of map-reduce, pandas is great at processing rows, but converting one row into multiple rows (the map stage) just isn't there as a stock operation.
Not really sure what you are doing. If you don't need the indices, then just iterate, split, and concatenate.
As much as I like the flatMap interface, I'm not sure how you would apply it to a multi-column DataFrame, except for probably replicating whole rows (is that operation useful?).
For a Series of strings there's nothing to vectorize, and unless you (ab)use a character-wise representation of strings (which you can't do with pandas) there's not much you can do to beat the performance of something along the lines of
In [38]: s = pd.Series(['This is text No 1.', 'and here is no. 2','and 3']); s
Out[38]:
0 This is text No 1.
1 and here is no. 2
2 and 3
dtype: object
In [39]: mapper = lambda x: pd.lib.list_to_object_array(x.split())
In [40]: vals = [(k, mapper(v)) for k, v in s.iteritems()]
In [41]: pd.Series(
np.concatenate([v[1] for v in vals]),
np.concatenate([np.repeat(v[0], len(v[1])) for v in vals]),
name=s.name)
Out[41]:
0 This
0 is
0 text
0 No
0 1.
1 and
1 here
1 is
1 no.
1 2
2 and
2 3
dtype: object
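On current pandas versions pd.lib.list_to_object_array and Series.iteritems no longer exist; a roughly equivalent sketch using only public APIs (the variable names here are my own):

```python
import numpy as np
import pandas as pd

s = pd.Series(['This is text No 1.', 'and here is no. 2', 'and 3'])

# Split each string, then rebuild the flattened values and the
# repeated index labels with one concatenate each.
pairs = [(k, np.array(v.split(), dtype=object)) for k, v in s.items()]
flat = pd.Series(
    np.concatenate([words for _, words in pairs]),
    index=np.concatenate([np.repeat(k, len(words)) for k, words in pairs]),
    name=s.name,
)
```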
"is that operation useful?" it is for what I'm trying to do, I'd argure for text data it would be very useful. Perhaps with the index gaining another level to numerate the generated sub-results within the original index label.
Thanks for the example, lots of footwork, but it doesn't throw away memory like jreback's hack. You're both basically saying do it somewhere else then throw it in a DataFrame. Not awesome , but of course always an option.
groupby.apply() converts returned series into multiple rows doesn't it? why can't pandas have similar functionality for an DataFrame apply-style operation?
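For reference, a minimal sketch of the groupby behavior being alluded to: when apply returns a Series per group, the result gains one row per returned element (plus an index level):

```python
import pandas as pd

s = pd.Series(['a b', 'c d e'], index=['x', 'y'])

# Each group is a single-element Series here; returning a Series of
# words from apply() expands each group into multiple rows.
out = s.groupby(level=0).apply(lambda g: pd.Series(g.iloc[0].split()))
```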
@kay1793 why don't you show a copy/pastable end-to-end example of what you are actually trying to do. Are you trying to, say, get word counts? Or a dummies matrix? (That's what my soln does, but if you have a large word frequency it WILL blow up memory; that assumes you want to track the index, which apparently you don't.)
@immerrr's soln is just fine for what you are showing. But then you are not harnessing the power of pandas at all. It's just list manipulation.
An example is worth thousands of words.
Fair enough.
How do I get from this:
In [8]: df=pd.DataFrame([['2014-01-01T00:01:00','Alice',"It was a dark and stormy night. Alice and bob were at their computer"],
...: ['2014-01-02T00:02:23','Bob',"Detective Morgan was 9 feet tall. Then the mirror broke. He cut his left foot on the shards"]],columns=['ts','name','text'])
...: df
to this:
In [10]: df=pd.DataFrame([['2014-01-01T00:01:00','Alice',"It was a dark and stormy night"],
...: ['2014-01-01T00:01:00','Alice',"Alice and bob were at their computer"],
...: ['2014-01-02T00:02:23','Bob',"Detective Morgan was 9 feet tall"],
...: ['2014-01-02T00:02:23','Bob',"Then the mirror broke"],
...: ['2014-01-02T00:02:23','Bob',"He cut his left foot on the shards"]],
...: columns=['ts','name','text'],index=pd.MultiIndex.from_arrays([[0,0,1,1,1],[0,1,0,1,2]]))
...: df
...:
Out[10]:
ts name text
0 0 2014-01-01T00:01:00 Alice It was a dark and stormy night
1 2014-01-01T00:01:00 Alice Alice and bob were at their computer
1 0 2014-01-02T00:02:23 Bob Detective Morgan was 9 feet tall
1 2014-01-02T00:02:23 Bob Then the mirror broke
2 2014-01-02T00:02:23 Bob He cut his left foot on the shards
The index is unique; nothing else is, in general. One chunk may have 50 (or 1000) paragraphs, others just one. Once I get it into this form, I can easily do summary stats and selection queries, looking for patterns. Basically, I'm using a pandas DataFrame as my corpus.
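With hindsight, str.split plus DataFrame.explode (pandas >= 0.25) gets there; a sketch of one way, rebuilding the sub-row counter with cumcount:

```python
import pandas as pd

df = pd.DataFrame(
    [['2014-01-01T00:01:00', 'Alice',
      "It was a dark and stormy night. Alice and bob were at their computer"],
     ['2014-01-02T00:02:23', 'Bob',
      "Detective Morgan was 9 feet tall. Then the mirror broke. "
      "He cut his left foot on the shards"]],
    columns=['ts', 'name', 'text'])

# One row per sentence; explode() repeats ts/name and the index labels.
out = df.assign(text=df['text'].str.split(r'\.\s*')).explode('text')

# Add a second index level enumerating the sub-rows per original row.
out.index = pd.MultiIndex.from_arrays(
    [out.index, out.groupby(level=0).cumcount()])
```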
> But then you are not harnessing the power of pandas at all. It's just list manipulation.
Which was exactly my point: if we're doing native object manipulation, there's not much pandas can help with performance-wise.
As for the example, it's helpful indeed. So the desired output is equivalent to taking out that one column, flatmap it as a series adding an integer index level and join it back with the rest of the dataframe on the original index column.
> Which was exactly my point: if we're doing native object manipulation, there's not much pandas can help with performance-wise.
I think the poor performance of soln #1 was due to bad complexity; otherwise, non-vectorized is good enough for my data sizes.
> the desired output is equivalent to <...>
Yes, but with 2 sugars please.
I've come to think of pandas as THE way to work with data, and now that I'm working with a corpus it lacks some operations that also seem generally useful and common. The groupby point I made earlier seems solid to me; I hope you'll agree.
closing, but pls reopen if the issue persists.
Just came across this and I think it's worth reopening.
Some sugar for flatmap/explode would be nice. There are several workarounds circulating and none of them are quite the same.
We would take a cookbook of the recipes (the https://stackoverflow.com/questions/31080258/pysparks-flatmap-in-pandas one is the most idiomatic).
This is how I'm doing it for now. It's not the most performant way, but it works:
def flatmap(df, func):
    rows = []
    cols = df.columns
    for r in df.values:
        # rebuild a dict for the row from the raw values
        index = 0
        row = {}
        for col in cols:
            row[col] = r[index]
            index += 1
        # func maps one row-dict to a list of row-dicts
        multrows = func(row)
        for rr in multrows:
            rows.append(rr)
    return pd.DataFrame.from_records(rows)
flatfunc = lambda r: [{"userId": r["userId"], "rating": r["rating"], "genre": genre} for genre in r["genreList"]]
flattened = flatmap(df, flatfunc)
I'm supportive of adding a flatmap function to pandas. Even more so, I would like to have a first-class list/array type.
Is this being considered? I think having a flatMap or bind operation would be very interesting and useful (besides being theoretically satisfying for people with a functional programming background, though this isn't really an argument).
I had a use-case today. I have a dataframe of events, with start time and end time:
| event | start_time | end_time |
| --- | --- | --- |
| A | 2014-01-01 12:00:00 | 2014-01-01 13:45:00 |
| B | 2014-01-01 12:45:00 | 2014-01-01 12:55:00 |
| ... | ... | ... |
And we'd like to transform this into a dataframe that has a row for each minute during which an event was actually happening:
| event | time |
| --- | --- |
| A | 2014-01-01 12:00:00 |
| A | 2014-01-01 12:01:00 |
| A | 2014-01-01 12:02:00 |
| A | 2014-01-01 12:03:00 |
| ... | ... |
| A | 2014-01-01 13:44:00 |
| A | 2014-01-01 13:45:00 |
| B | ... |
There are lots of ways to do this, but if we had a flatMap method this would be almost trivial.
def make_time_table(row):
    return DataFrame(
        {'event': row['event'], 'time': time}
        for time in pandas.date_range(...)
    )
df.flatMap(make_time_table)
Or something like it.
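A sketch of the same transform with stock pandas instead of the hypothetical flatMap, assuming start_time/end_time are datetimes and explode is available (pandas >= 0.25):

```python
import pandas as pd

df = pd.DataFrame({
    'event': ['A', 'B'],
    'start_time': pd.to_datetime(['2014-01-01 12:00:00',
                                  '2014-01-01 12:45:00']),
    'end_time': pd.to_datetime(['2014-01-01 13:45:00',
                                '2014-01-01 12:55:00']),
})

# Build the per-row minute range, then explode to one row per minute.
minutes = df.apply(
    lambda r: pd.date_range(r['start_time'], r['end_time'],
                            freq='min').tolist(),
    axis=1)
out = df[['event']].assign(time=minutes).explode('time')
```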
I've been using @timotta's function above, which has been surprisingly fast.
The topic I reference in the above link (https://stackoverflow.com/questions/12680754/split-pandas-dataframe-string-entry-to-separate-rows) also leads to https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows.
I have a long text in one of the dataframe columns; I need to be able to split the text into 2 or more columns depending on the length of that cell.
Was this closed because of opposition to storing lists in dataframes/series?
I don't think that flatmap requires this, and the lack of flatmap is a fairly big problem with non-trivial and non-performant workarounds. (Scala, Spark, and many other functional systems have it as a first-class operation.)
see https://github.com/pandas-dev/pandas/issues/16538 which has a PR associated with it
> This is how I'm doing it for now. It's not the most performant way, but it works: (@timotta's flatmap snippet, quoted in full above)
Works great for me. Thanks @timotta!
Slightly modified to simplify it and to be more similar to the original poster's desired syntax:
def flatmap(self, func):
    rows = []
    for idx, row in self.iterrows():
        multrows = func(row)
        rows.extend(multrows)
    return pd.DataFrame.from_records(rows)

pd.DataFrame.flatmap = flatmap

df.flatmap(lambda x: ....)
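As a hypothetical usage example (assuming a df with a string 'text' column, as in the original post):

```python
# One output record per word in 'text', other fields copied across.
out = df.flatmap(lambda row: [{**row.to_dict(), 'text': word}
                              for word in row['text'].split()])
```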