Currently, to append to a DataFrame, the approach is as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,3), columns=list('abc'))
df = df.append(pd.DataFrame(np.random.rand(5,3), columns=list('abc')))
append is a DataFrame or Series method, and as such it should be able to modify the DataFrame or Series in place. If in-place modification is not required, one may use concat or set the inplace kwarg to False. This would avoid an explicit assignment operation, which is quite slow in Python, as we all know. Further, it would make the behavior similar to that of Python lists, and avoid questions such as these: 1, 2...
Additionally, at present append is a full subset of concat, and as such it need not exist at all. Given the vast number of functions for appending a DataFrame or Series to another in pandas, it makes sense that each should have its own merits and demerits. Gaining an inplace kwarg would clearly distinguish append from concat and simplify code.
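To make the overlap with concat concrete, here is a minimal sketch of the same row-wise append written with pd.concat. The variable names (first, second, combined) are illustrative only, and ignore_index=True is an optional choice made here to get a fresh 0..9 index.

```python
import numpy as np
import pandas as pd

first = pd.DataFrame(np.random.rand(5, 3), columns=list("abc"))
second = pd.DataFrame(np.random.rand(5, 3), columns=list("abc"))

# pd.concat returns a new DataFrame; neither input is modified.
combined = pd.concat([first, second], ignore_index=True)

print(combined.shape)  # (10, 3)
print(len(first))      # 5, the original frame is untouched
```

Because this uses only pd.concat, it does not depend on an append method being available on the DataFrame.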
I understand that this issue was raised in #2801 a long time ago. However, the conversation there drifted from the simplification offered by the inplace kwarg to performance enhancement. I (and many like me) am looking for ease of use, and not so much for performance. Also, we expect the data to fit in memory (which is a limitation even with the current version of append).
df = pd.DataFrame(np.random.rand(5,3), columns=list('abc'))
df.append(pd.DataFrame(np.random.rand(5,3), columns=list('abc')), inplace=True)
I am opposed to this for the exact reasons discussed in #2801: it would mislead users who might expect a performance benefit.
Virtually all pandas methods return a new object, the exception being the indexing operations. Using inplace is not idiomatic, is quite unreadable, and is not (more) performant at all.
Closing, though if someone thinks that we should add a signature like (...., inplace=False), and then raise a TypeError if inplace=True to give a nice error message, then we can reopen for that purpose.
In [2]: df = pd.DataFrame(np.random.rand(5,3), columns=list('abc'))
...: df.append(pd.DataFrame(np.random.rand(5,3), columns=list('abc')), inplace=True)
TypeError: append() got an unexpected keyword argument 'inplace'
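The signature idea floated above (accept inplace=False, raise a TypeError with a friendlier message for inplace=True) could look roughly like the sketch below. append_rows is a hypothetical stand-alone helper written for illustration, not a pandas function, and the error text is an assumption.

```python
import pandas as pd


def append_rows(frame, other, inplace=False):
    """Hypothetical helper (not part of pandas): return frame plus other's rows."""
    if inplace:
        # Explain the situation instead of failing on an unexpected keyword.
        raise TypeError(
            "inplace=True is not supported; appending always returns a new "
            "object, so assign the result instead."
        )
    return pd.concat([frame, other])


df = pd.DataFrame({"a": [1, 2]})
extra = pd.DataFrame({"a": [3]})

df = append_rows(df, extra)
print(len(df))  # 3

try:
    append_rows(df, extra, inplace=True)
except TypeError as exc:
    message = str(exc)
    print(message)
```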
In the case of a namedtuple which contains a Series object, the inplace approach would be nice to have as a feature.
This would not be related in any way to performance, but would be a way to expose data to users.
Indeed, namedtuple objects by design provide a way to write a library and expose its data to users while allowing them to modify that data only in place.
Trying to overwrite an attribute of a namedtuple intentionally raises AttributeError: can't set attribute, so that the user does not try to affect your library. But mutable attributes are allowed.
Consider the following dummy code:
from collections import namedtuple
from pandas import Series
# ----- Library part ------
sample_schema = {
    "name": str,
    "some_info": str,
    "content": Series
}
my_data_type = namedtuple("MyDataType", sample_schema.keys())
exposed_data = my_data_type(
    name="Library data",
    some_info="Modify the content as you want",
    content=Series({"a": 0})
)
# ----- User code part ------
series_to_be_appended = Series({"b": 0})
# This is forbidden
exposed_data.content = exposed_data.content.append(series_to_be_appended)
# This would be allowed but is not implemented in Series
exposed_data.content.append(series_to_be_appended, inplace=True)
The name and some_info attributes are strings and therefore immutable. A user would not (easily) be able to affect them. But here the content can be modified as long as it is not set to a new object altogether.
I would think inplace methods are nice to have on any mutable object in general.
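For what it is worth, one workaround that keeps the same Series object is label-based assignment, which enlarges a Series when the label does not exist yet. The sketch below is a trimmed-down version of the dummy code above; the names original_object and rebinding_failed are illustrative, and this is a possible workaround under those assumptions, not a replacement for the requested inplace kwarg.

```python
from collections import namedtuple

import pandas as pd

MyDataType = namedtuple("MyDataType", ["name", "content"])
exposed_data = MyDataType(name="Library data", content=pd.Series({"a": 0}))
original_object = exposed_data.content

series_to_be_appended = pd.Series({"b": 0})

# Rebinding the field is rejected by the namedtuple.
try:
    exposed_data.content = pd.concat([exposed_data.content, series_to_be_appended])
    rebinding_failed = False
except AttributeError:
    rebinding_failed = True

# Assigning to a label that does not exist yet enlarges the existing Series.
for label, value in series_to_be_appended.items():
    exposed_data.content.loc[label] = value

print(rebinding_failed)                         # True
print(exposed_data.content is original_object)  # True
print(list(exposed_data.content.index))         # ['a', 'b']
```

This adds one label at a time, so it suits small additions like the one in the example rather than bulk appends.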
So the consensus among the maintainers is that it would be too confusing to have an append() method which actually appends?
I'd suggest removing the method from DataFrame entirely, or potentially renaming it. Someone familiar with pandas might find it confusing, but the opposite is currently true for those of us without your level of experience.
Agreeing here.
I never understood why pandas affords an API with its own logic rather than sharing that of Python itself. One can get used to the fact that most pandas methods return new objects rather than modifying the objects they are called on, although it's counter-intuitive. (The standard pandas behavior is imho counter-intuitive for everyone who uses more Python than pandas, which should be most of the user base.) And one can get used to the fact that most pandas methods behave as a user would expect when inplace=True is passed as an argument.
I can still live with that. But not adding the possibility to specify inplace for append(), and just defaulting it to False, which would effectively keep the method as it is for all who want that while greatly helping those who need in-place behavior, is something I cannot follow. Sorry.
Adding a usecase:
csv files, with few entries in each, many of which have additional columns. From the pandas.DataFrame.append() docs: Columns in other that are not in the caller are added as new columns.
combined_dataframe = pd.DataFrame()
for dataframe in list_of_dataframes_read_from_csvs:
    combined_dataframe.append(dataframe, inplace=True)
Looking for an inplace option for append() led me to this issue.
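For this use case, a single pd.concat call over the whole list is one way to get the combined frame without an inplace kwarg. The sketch below uses two small in-memory frames as stand-ins for the csv contents (the column names id, value and note are made up for illustration); columns missing from a frame are filled with NaN.

```python
import pandas as pd

# Stand-ins for frames read from csv files; the second has an extra column.
list_of_dataframes = [
    pd.DataFrame({"id": [1, 2], "value": [0.1, 0.2]}),
    pd.DataFrame({"id": [3], "value": [0.3], "note": ["extra"]}),
]

# One call combines every frame and takes the union of their columns.
combined_dataframe = pd.concat(list_of_dataframes, ignore_index=True, sort=False)

print(list(combined_dataframe.columns))  # ['id', 'value', 'note']
print(len(combined_dataframe))           # 3
```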