# data.csv contains 10000 x 10 records, laid out as follows:
# 1,1,1,1,1,1,1,1,1,1
# 2,2,2,2,2,2,2,2,2,2
# ...
# 10000,10000,10000,10000,10000,10000,10000,10000,10000,10000
import pandas as pd
import numpy as np
import time
data = pd.read_csv('data.csv')
data_matrix = data.as_matrix()  # the underlying NumPy array (same as data.values)

# Time 10000 positional row lookups on the DataFrame
s = time.time()
for i in range(10000):
    v = data.iloc[i]
e = time.time()
print('Performance with DataFrame: ' + str(e - s))

# Time the same 10000 row lookups on the raw NumPy array
s = time.time()
for i in range(10000):
    v = data_matrix[i]
e = time.time()
print('Performance with Array: ' + str(e - s))
# Result:
# Performance with DataFrame: 3.964857816696167
# Performance with Array: 0.015623092651367188
As shown in the code above, locating an element by a given index takes far longer in a DataFrame than in a raw array.
Common sense says it should take almost no time to locate an element in a raw array or list when its index is given, since there is no reason to do a sequential search.
Yet, compared to the array, retrieving a specific row from a DataFrame is much slower. The row index is already given, so the DataFrame should be able to locate the row directly. Does the DataFrame actually perform a sequential search?
Are there any alternative methods that work like locating an element in an array?
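One alternative I have been wondering about is itertuples, which iterates over rows without building a Series for each one; this is only a sketch of the idea, assuming the same data.csv as above:

# Possible alternative (my assumption, not a confirmed fix): iterate over rows
# without constructing a Series per row.
import pandas as pd
import time

data = pd.read_csv('data.csv')

s = time.time()
for row in data.itertuples(index=False):  # yields lightweight namedtuples
    v = row
e = time.time()
print('Performance with itertuples: ' + str(e - s))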
pd.show_versions()

INSTALLED VERSIONS
commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 37 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 32.0.0
Cython: None
numpy: 1.10.2
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
Once you account for the construction of a Series for each row, and a small amount of overhead for the additional checking that .iloc does, they are about the same:
In [12]: %timeit Series(df.values[10])
10000 loops, best of 3: 62.8 µs per loop
In [13]: %timeit df.iloc[10]
10000 loops, best of 3: 74.4 µs per loop
In general, iterative looping is not recommended (in numpy or pandas) unless absolutely necessary.
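For example, a per-row computation such as a row sum can usually be written as a single vectorized call instead of 10000 individual lookups; a minimal sketch, assuming the same data.csv and an arbitrary row-sum workload:

# Vectorized sketch: one NumPy call over the whole array instead of a Python loop.
import pandas as pd

data = pd.read_csv('data.csv')
arr = data.values

# Loop version (slow): row_sums = [arr[i].sum() for i in range(len(arr))]
# Vectorized version (fast): numpy performs the loop in C.
row_sums = arr.sum(axis=1)
print(row_sums[:5])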