Pandas: DataFrame.copy(deep=True) is not a deep copy of the index

Created on 23 Feb 2018 · 14 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

df1 = pd.DataFrame(index=['a', 'b'], columns=['foo', 'muu'])
df1.index.name = "foo"
print(df1)

# create deep copy of df1 and change a value in the index
df2 = df1.copy(deep=True)
df2.index.name = "bar"
df2.index.values[0] = 'c'  # changes both df1 and df2

print(df1)
print(df2)

Problem description

DataFrame.copy(deep=True) is not a deep copy of the index.

In

https://github.com/pandas-dev/pandas/blob/a00154dcfe5057cb3fd86653172e74b6893e337d/pandas/core/indexes/base.py#L787

maybe deep should be set to True?
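In the meantime, one workaround sketch (on newer pandas that has `Index.to_numpy`) is to rebuild the copy's index from a freshly copied ndarray so it shares no buffer with the original — this is not official API guidance, just one way to sever the link:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(index=['a', 'b'], columns=['foo', 'muu'])
df1.index.name = "foo"

# Workaround sketch: after the deep copy, replace the index with one
# built from a freshly copied ndarray, so no buffer is shared with df1.
df2 = df1.copy(deep=True)
df2.index = pd.Index(df1.index.to_numpy().copy(), name=df1.index.name)

# The two indexes no longer share memory, so mutating one cannot
# affect the other.
print(np.shares_memory(df1.index.values, df2.index.values))  # False
```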

Expected Output

     foo  muu
foo          
a    NaN  NaN
b    NaN  NaN
     foo  muu
foo          
c    NaN  NaN
b    NaN  NaN
     foo  muu
bar          
c    NaN  NaN
b    NaN  NaN

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.1
scipy: 0.19.1
pyarrow: 0.8.0
xarray: 0.9.6
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.5.0

Bug

All 14 comments

df2.index.values[0] = 'c' # changes both df1 and df2

Indexes are immutable. Changing its underlying data is going to cause all sorts of problems.

ok. I think the documentation of copy is unclear then: "Make a deep copy, including a copy of the data and the indices."

hhhmmm I would expect a copy of a dataframe to be truly deep when deep=True

Which bit is unclear? The indices are copied, they are different objects:

In [3]: df1.index is df2.index
Out[3]: False

But the underlying data are shared between indexes since they're immutable.
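That sharing can be checked directly with numpy — a quick sketch; whether `np.shares_memory` reports True depends on the pandas version (it did on the versions this issue was filed against):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(index=['a', 'b'], columns=['foo', 'muu'])
df2 = df1.copy(deep=True)

# Distinct Index objects...
print(df1.index is df2.index)  # False
# ...but whether the underlying buffers are shared depends on the
# pandas version (they were shared at the time of this issue).
print(np.shares_memory(df1.index.values, df2.index.values))
```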

I am noticing that http://pandas-docs.github.io/pandas-docs-travis/dsintro.html doesn't have a section for Index. It'd be good to add a short one stating that

  1. They're containers for labels, used in indexing and alignment
  2. They're immutable
  3. There are many specializations of the Index for various dtypes.

kind of the same issue as: https://github.com/pandas-dev/pandas/issues/19505

meaning docs need a bit more

But the underlying data are shared between indexes since they're immutable.

Is there any reason the underlying data in the index is not copied? It seems that df.values is actually copied; just the indices are not?

Is there any reason the underlying data in the index is not copied?

Performance. Since indices are immutable, the underlying data can safely be shared. There's no reason to copy it. DataFrames / series are mutable, so the data need to be copied.

It seems that df.values is actually copied; just the indices are not?

And just to be clear, the index is a copy, since they are different objects. It's the underlying values (which users should not be mutating) that are not copied.

ok. I'll close this.

Apparently there is a deep='all' that deals with exactly this (copying the underlying index data as well, or not). To illustrate with the original example:

In [21]: df1 = pd.DataFrame(index=['a', 'b'], columns=['foo', 'muu'])

In [22]: df2 = df1.copy(deep=True)

In [23]: df2.index.values[0] = 'c'

In [24]: df1
Out[24]: 
     foo  muu
c    NaN  NaN    <--- updated
b    NaN  NaN

In [25]: df3 = df1.copy(deep='all')

In [26]: df3.index.values[1] = 'd'

In [27]: df1
Out[27]: 
     foo  muu
c    NaN  NaN
b    NaN  NaN    <--- not updated

But, deep='all' is completely undocumented, and as far as I can find from a quick search, also only used once in our own code base (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/reduction.pyx#L537)

I am not sure we actually want to document this?
But then we should maybe just remove that ability?

yes, this was never really implemented (or meant to be); it should be removed.

IMO, copy(deep=True) should completely sever all connections between the original and the copied object - compare the official python docs (https://docs.python.org/3/library/copy.html):

A _deep copy_ constructs a new compound object and then, recursively, inserts copies into it of the objects found in the original.

So, IMO, deep=True should come to mean what deep='all' does currently (and the latter can then be removed).

Re:

Indexes are immutable. Changing its underlying data is going to cause all sorts of problems.

This is not a valid argument IMO - it's up to me as a user (consenting adults and all...) what I do with my objects, including the indexes, and if I make a deep copy, it's a justified expectation (I would even argue: a built-in expectation of the word "deep") that this will not mess with the original.

Plus, if I'm already deep-copying the much larger values of a DF, not copying the index only saves a comparatively irrelevant amount of memory.

Indexes are immutable. Changing its underlying data is going to cause all sorts of problems.

This is not a valid argument IMO - it's up to me as a user (consenting adults and all...) what I do with my objects, including the indexes

There are other problems as well, not related to copying, that make directly changing underlying values a bad idea. For example, the internal hashtable that is used for indexing will no longer be valid if you change the underlying values of an index (so indexing will give wrong results).
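The stale-hashtable hazard can be sketched as follows — this deliberately performs the unsupported mutation, and the exact behaviour may differ across pandas versions:

```python
import pandas as pd

idx = pd.Index(['a', 'b'])
print(idx.get_loc('b'))  # 1 -- this lookup builds the internal hashtable

# Unsupported: mutate the underlying buffer behind pandas' back.
idx.values[1] = 'z'

# The cached hashtable is now stale: the index shows 'z' at
# position 1, but lookups still answer for the old label 'b'.
print(idx[1])            # 'z'
print(idx.get_loc('b'))  # still 1, even though 'b' is gone
```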

not copying the index only saves a comparatively irrelevant amount of memory.

For DataFrame that might be true (depending on its size), but not for Series.


To be clear, I am personally not necessarily against changing this (IMO it would make the behaviour more straightforward, at the cost of some performance - a trade-off, and I am not fully sure which side I come down on), only answering some of your arguments.

One additional thing. You mention the comparison to the stdlib deep copy behaviour, but note that even the deep='all' is not comparable to that (it does copy the index, but it still does not copy python objects inside the values recursively).
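That point can be illustrated with a short sketch: even a deep copy only copies the arrays themselves, not python objects stored inside them, so a mutable object in a cell stays shared:

```python
import pandas as pd

# A python list stored inside an object-dtype column.
df = pd.DataFrame({'a': [[1, 2]]})
df2 = df.copy(deep=True)

# The ndarray was copied, but it holds references to the *same* list,
# so mutating the nested object through the copy shows up in the original.
df2.loc[0, 'a'].append(3)
print(df.loc[0, 'a'])  # [1, 2, 3]
```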

One additional thing. You mention the comparison to the stdlib deep copy behaviour, but note that even the deep='all' is not comparable to that (it does copy the index, but it still does not copy python objects inside the values recursively).

Isn't that moving the goal posts? It _is_ within the power of pandas to influence how its own indexes are handled, whereas arbitrary python objects can obviously be quite complicated.

But even then, the meaning of deep in vanilla python follows the "complete separation" interpretation:

from copy import deepcopy
x = [0, 1]
x.append(x)
x
# [0, 1, [...]]
y = deepcopy(x)
y[2][0] = 10  # same for arbitrarily many times "[2]"
y
# [10, 1, [...]]
x
# [0, 1, [...]]

The example looks to work on master. Could use a test

In [38]: df1 = pd.DataFrame(index=['a', 'b'], columns=['foo', 'muu'])
    ...: df1.index.name = "foo"
    ...: print(df1)
    ...:
    ...: # create deep copy of df1 and change a value in the index
    ...: df2 = df1.copy(deep=True)
    ...: df2.index.name = "bar"
    ...: df2.index.values[0] = 'c'  # changes both df1 and df2
    ...:
    ...: print(df1)
    ...: print(df2)
     foo  muu
foo
a    NaN  NaN
b    NaN  NaN
     foo  muu
foo
c    NaN  NaN
b    NaN  NaN
     foo  muu
bar
c    NaN  NaN
b    NaN  NaN

In [39]: pd.__version__
Out[39]: '1.1.0.dev0+1216.gd4d58f960'