# data.csv contains 10000 x 10 records, laid out as follows:
# 1,1,1,1,1,1,1,1,1,1
# 2,2,2,2,2,2,2,2,2,2
# ...
# 10000,10000,10000,10000,10000,10000,10000,10000,10000,10000
import pandas as pd
import numpy as np
import time
data = pd.read_csv('data.csv')
data_matrix = data.as_matrix()  # the underlying NumPy array (same as data.values)

# Time 10000 positional row lookups on the DataFrame
s = time.time()
for i in range(10000):
    v = data.iloc[i]
e = time.time()
print('Performance with DataFrame: ' + str(e - s))

# Time the same 10000 row lookups on the raw NumPy array
s = time.time()
for i in range(10000):
    v = data_matrix[i]
e = time.time()
print('Performance with Array: ' + str(e - s))
# Result:
# Performance with DataFrame: 3.964857816696167
# Performance with Array: 0.015623092651367188
As shown in the code above, locating an element by a given index takes far longer in a DataFrame than in a raw array.
Common sense says it should take almost no time to locate an element in a raw array or list when its index is given, since there is no reason to do a sequential search.
Yet, compared to the array, retrieving a specific row from a DataFrame is much slower. The row index is already given, so the DataFrame should be able to locate the row directly. Does the DataFrame actually perform a sequential search?
Are there any alternative methods that work like locating an element in an array?
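One alternative I have been wondering about is itertuples, which iterates over rows without building a Series for each one; this is only a sketch of the idea, assuming the same data.csv as above:

# Possible alternative (my assumption, not a confirmed fix): iterate over rows
# without constructing a Series per row.
import pandas as pd
import time

data = pd.read_csv('data.csv')

s = time.time()
for row in data.itertuples(index=False):  # yields lightweight namedtuples
    v = row
e = time.time()
print('Performance with itertuples: ' + str(e - s))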
pd.show_versions()

INSTALLED VERSIONS
commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 37 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 32.0.0
Cython: None
numpy: 1.10.2
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
Once you account for the construction of a Series for each row, and a small amount of overhead for the additional checking that .iloc does, they are about the same:
In [12]: %timeit Series(df.values[10])
10000 loops, best of 3: 62.8 µs per loop
In [13]: %timeit df.iloc[10]
10000 loops, best of 3: 74.4 µs per loop
In general, iterative looping is not recommended (in numpy or pandas) unless absolutely necessary.
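For example, a per-row computation such as a row sum can usually be written as a single vectorized call instead of 10000 individual lookups; a minimal sketch, assuming the same data.csv and an arbitrary row-sum workload:

# Vectorized sketch: one NumPy call over the whole array instead of a Python loop.
import pandas as pd

data = pd.read_csv('data.csv')
arr = data.values

# Loop version (slow): row_sums = [arr[i].sum() for i in range(len(arr))]
# Vectorized version (fast): numpy performs the loop in C.
row_sums = arr.sum(axis=1)
print(row_sums[:5])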