Currently, DataFrame.explode repeats the original index label for every element of the exploded list-like. To keep it consistent with methods like DataFrame.sort_values, DataFrame.append and pd.concat, we could add an ignore_index argument that resets the index of the result to 0, 1, ..., n - 1.
import pandas as pd

df = pd.DataFrame({'id': range(0, 30, 10),
                   'values': [list('abc'), list('def'), list('ghi')]})
print(df)
   id     values
0   0  [a, b, c]
1  10  [d, e, f]
2  20  [g, h, i]
print(df.explode('values'))
   id values
0   0      a
0   0      b
0   0      c
1  10      d
1  10      e
1  10      f
2  20      g
2  20      h
2  20      i
Expected behaviour with addition of the argument:
df.explode('values', ignore_index=True)
   id values
0   0      a
1   0      b
2   0      c
3  10      d
4  10      e
5  10      f
6  20      g
7  20      h
8  20      i
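For comparison, here is a minimal sketch of how the same result can be obtained today by chaining reset_index(drop=True) after explode. The variable names are only illustrative, and this shows the existing workaround rather than the proposed implementation.

```python
import pandas as pd

df = pd.DataFrame({'id': range(0, 30, 10),
                   'values': [list('abc'), list('def'), list('ghi')]})

# Current behaviour: the original index label is repeated for every element.
exploded = df.explode('values')

# Workaround: discard the repeated labels and renumber the rows from 0.
renumbered = exploded.reset_index(drop=True)
print(renumbered)
```

The proposed ignore_index=True would fold the second step into the explode call itself.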
If this change looks okay to one of the devs, I can submit a PR for it.
take
I think something like this was discussed when explode was originally implemented. @erfannariman can you go through the original pull request implementing explode and summarize the discussion on this point?
I looked at the following discussions and couldn't find anything about resetting the index:
- #16538 https://github.com/pandas-dev/pandas/issues/16538
- #10511 https://github.com/pandas-dev/pandas/issues/10511
- #27267 https://github.com/pandas-dev/pandas/pull/27267
Not sure if I missed anything. @TomAugspurger
Thanks for checking.
It's OK to add this argument (it was added elsewhere after explode already existed).
What's the upside of adding this as an argument instead of just calling reset_index?
Not sure if I'm in a position to comment on your question, but in terms of API design, isn't that in line with other methods like DataFrame.append, DataFrame.sort_values and pd.concat? Or do you mean internally? @WillAyd
Ah OK, makes sense since we do this elsewhere.
For the ignore_index, another example where this was added recently is drop_duplicates (https://github.com/pandas-dev/pandas/pull/30405) and sort_values (https://github.com/pandas-dev/pandas/pull/30402).
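As a small illustration of how the keyword behaves in those two methods (a sketch assuming a pandas version that already includes the linked changes; the frame and column name are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'a', 'b', 'c']})

# Without ignore_index, the surviving rows keep their original labels.
kept_labels = df.drop_duplicates().index.tolist()

# With ignore_index=True, the result is relabelled 0..n-1.
deduped = df.drop_duplicates(ignore_index=True)
sorted_df = df.sort_values('key', ignore_index=True)

print(kept_labels)
print(deduped.index.tolist())
print(sorted_df.index.tolist())
```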
And another reason to add it is that it can be a bit more performant (avoid an additional copy as you would have with reset_index(drop=True)).
One aspect related to the index of the result that was briefly discussed in the original PR (https://github.com/pandas-dev/pandas/pull/27267#pullrequestreview-259233262) is whether to add a level to the index with a "count", thus resulting in a MultiIndex (which could, e.g., be useful if you want to do an unstack in the next step).
I personally think that could still be useful, and we could potentially think about combining that in a single keyword. However, since ignore_index is already used in other places, probably better to consider this separately.
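To make the "count" idea concrete, here is a rough sketch of how such a level can be built with existing tools. The level name 'count' and the variable names are assumptions for illustration only. This is not an existing explode option.

```python
import pandas as pd

df = pd.DataFrame({'id': range(0, 30, 10),
                   'values': [list('abc'), list('def'), list('ghi')]})

exploded = df.explode('values')

# Position of each element within its original row: 0, 1, 2, 0, 1, 2, ...
# 'count' is a hypothetical name for the extra index level.
count = exploded.groupby(level=0).cumcount().rename('count')

# Append the counter as a second index level, giving a MultiIndex.
with_count = exploded.set_index(count, append=True)

# The extra level lets the elements be pivoted back into columns.
wide = with_count['values'].unstack('count')
print(wide)
```

With the counter in the index, each (row, count) pair is unique, which is what unstack needs. The plain exploded frame with repeated labels cannot be unstacked this way.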