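import hashlib
import numpy as np
import pandas as pd
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,3))
df = pd.DataFrame(arr)
print(arr)
print(df)
# Hash the raw bytes of the original array and of the frame's values
print(hashlib.sha256(arr.tobytes()).hexdigest())
print(hashlib.sha256(df.values.tobytes()).hexdigest())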
The two hashes differ, which suggests that df.values != arr.
Further investigation shows that the dtypes are indeed different.
Moreover, each run of this code yields a new hash for the data frame.
The expected behavior is pd.DataFrame(arr).values == arr.
pd.show_versions()
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.4
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
.values constructs a consolidated np.array with a single dtype. Since you have object dtypes (strings), the result is object. It is newly constructed on each access, so hashing its raw bytes doesn't work.
In [3]: id(df.values)
Out[3]: 4545762848
In [4]: id(df.values)
Out[4]: 4546427680
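The dtype difference described above can be checked directly. The following is a minimal sketch, not taken from the issue: it uses a small fixed array instead of the random one and passes dtype=object explicitly so that the frame holds object columns regardless of pandas version (both are assumptions made for illustration). Casting the values back to the original dtype before hashing gives a digest that matches the array's.

```python
import hashlib

import numpy as np
import pandas as pd

# Small fixed string array standing in for the issue's random one (assumption).
arr = np.array([["foo", "bar"], ["bar", "42"]])
# Force object columns explicitly so the sketch behaves the same across versions.
df = pd.DataFrame(arr, dtype=object)

values = df.values
print(arr.dtype, values.dtype)  # fixed-width unicode dtype vs. object

# Cast back to the original dtype before comparing or hashing the bytes.
restored = values.astype(arr.dtype)
same_content = np.array_equal(restored, arr)
same_digest = (
    hashlib.sha256(restored.tobytes()).hexdigest()
    == hashlib.sha256(arr.tobytes()).hexdigest()
)
print(same_content, same_digest)
```

The cast only makes sense when the original dtype is known, so it is a workaround for this particular comparison rather than a general hashing scheme.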
However, as of 0.20.1 you can use the included hashing functions, which are public (though minimally documented beyond the docstring), to hash data efficiently. They produce a pure data hash and are based on siphash with a common scheme.
In [5]: from pandas.util import hash_pandas_object
In [6]: hash_pandas_object(df)
Out[6]:
0 9162640643739096251
1 10885429402166970872
2 13102355359759172147
dtype: uint64
In [7]: hash_pandas_object(df)
Out[7]:
0 9162640643739096251
1 10885429402166970872
2 13102355359759172147
dtype: uint64
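If a single digest is wanted, as in the original sha256 attempt, the per-row hashes can be fed to hashlib. This is a sketch under assumptions: frame_digest is a hypothetical helper name, not part of pandas, and the example frame is a small fixed one rather than the frame from the issue.

```python
import hashlib

import numpy as np
import pandas as pd
from pandas.util import hash_pandas_object


def frame_digest(frame):
    """Hypothetical helper: collapse per-row hashes into one hex digest."""
    # hash_pandas_object returns a uint64 Series with one hash per row;
    # index=True folds the index into each row's hash.
    row_hashes = hash_pandas_object(frame, index=True)
    # The uint64 array is plain numeric data, so its bytes are stable.
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()


df = pd.DataFrame(np.array([["foo", "bar"], ["bar", "42"]]), dtype=object)
first = frame_digest(df)
second = frame_digest(df.copy())
print(first == second)
```

Because the row hashes are concatenated in order, reordering the rows changes the digest. Whether that is desirable depends on the use case.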