The following test demonstrates the problem. The contents of testme.py are literally "import pandas"; however, it takes almost 6 seconds to import pandas on my Lenovo T60.
[mpenning@Mudslide panex]$ time python testme.py
real 0m5.759s
user 0m5.612s
sys 0m0.120s
[mpenning@Mudslide panex]$
[mpenning@Mudslide panex]$ uname -a
Linux Mudslide 3.2.0-4-686-pae #1 SMP Debian 3.2.57-3+deb7u1 i686 GNU/Linux
[mpenning@Mudslide panex]$ python -V
Python 2.7.3
[mpenning@Mudslide panex]$
[mpenning@Mudslide panex]$ pip freeze
Babel==1.3
Cython==0.20.1
Flask==0.10.1
Flask-Babel==0.8
Flask-Login==0.2.7
Flask-Mail==0.7.6
Flask-OpenID==1.1.1
Flask-SQLAlchemy==0.16
Flask-WTF==0.8.4
Flask-WhooshAlchemy==0.54a
Jinja2==2.7.1
MarkupSafe==0.18
Pygments==1.6
SQLAlchemy==0.7.9
Sphinx==1.2.2
Tempita==0.5.1
WTForms==1.0.5
Werkzeug==0.9.4
Whoosh==2.5.4
argparse==1.2.1
backports.ssl-match-hostname==3.4.0.2
blinker==1.3
ciscoconfparse==1.1.1
decorator==3.4.0
docutils==0.11
dulwich==0.9.6
## FIXME: could not find svn URL in dependency_links for this package:
flup==1.0.3.dev-20110405
hg-git==0.5.0
ipaddr==2.1.11
itsdangerous==0.23
matplotlib==1.3.1
mercurial==3.0
mock==1.0.1
nose==1.3.3
numexpr==2.4
numpy==1.8.1
numpydoc==0.4
pandas==0.13.1
pyparsing==2.0.2
python-dateutil==2.2
python-openid==2.2.5
pytz==2013b
six==1.6.1
speaklater==1.3
sqlalchemy-migrate==0.7.2
tables==3.1.1
tornado==3.2.1
wsgiref==0.1.2
Sounds a bit odd; you might have a path issue. Do you have multiple Pythons/environments installed? Does importing numpy take the same amount of time?
import pandas
pandas.show_versions()
time python testme.py
0.252u 0.076s 0:00.33 96.9% 0+0k 0+8io 1pf+0w
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-5-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.14.0rc1-43-g0dec048
nose: 1.3.0
Cython: 0.20
numpy: 1.8.1
scipy: 0.12.0
statsmodels: 0.5.0
IPython: 2.0.0
sphinx: 1.1.3
patsy: 0.1.0
scikits.timeseries: None
dateutil: 1.5
pytz: 2013b
bottleneck: 0.6.0
tables: 3.0.0
numexpr: 2.4
matplotlib: None
openpyxl: 1.5.7
xlrd: 0.9.0
xlwt: 0.7.4
xlsxwriter: None
lxml: 2.3.4
bs4: 4.1.3
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: 0.7.7
pymysql: None
psycopg2: 2.4.5 (dt dec pq3 ext)
numpy doesn't seem to have this issue...
[mpenning@Mudslide pymtr]$ time python -c 'import numpy'
real 0m0.184s
user 0m0.136s
sys 0m0.048s
[mpenning@Mudslide pymtr]$ time python -c 'import pandas'
real 0m5.724s
user 0m5.516s
sys 0m0.188s
[mpenning@Mudslide pymtr]$
No idea; why don't you try in a virtualenv with only the pandas deps installed?
Are you loading this over a network? Try installing locally, and print out pd.__file__ to be sure.
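A quick way to check both things at once (my sketch, not the original reporter's script) is to time the import and print where pandas was actually loaded from, which rules out a network-mounted or shadowed installation:
import time
start = time.time()
import pandas as pd
print(pd.__file__)                      # which installation actually got imported
print("%.2f s to import" % (time.time() - start))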
closing as not a bug.
I have the same problem. Was this closed because you found a solution? I'd be grateful if you could share it. Thanks.
@steve3141 - have you tried creating a pristine virtualenv and seeing if that helps?
Afraid I can't; work, lockdown, etc. So I realize this is very likely not the fault of pandas, except insofar as "import pandas" executes an enormous number -- over 500 by my count -- of secondary import statements. Filesystem overhead.
Thanks,
Steve
I know this has been closed for a while but I'm seeing the same thing, and it is not pandas-specific. We have our pandas environment in a virtualenv on a drive on a server. That drive is then mounted by each client. This allows us to maintain a sane package environment among all users. However, this is clearly sacrificing startup time to an unreasonable extent. The import times in seconds are as follows:
| Package | Server | Client |
| --- | --- | --- |
| pandas | 1.22 | 6.23 |
| numpy | 0.2 | 1.2 |
So clearly this is a setup issue, but how do other companies deal with this problem? I find it hard to believe that packages are installed locally on every user's box and if that isn't the case, that they experience these long startup times.
The network itself is working fine...transfer speeds are ~120MB/s.
@rockg - dunno about every corporation, but certainly all of the installations I've worked with have had everything locally. Conda and tox can make it much easier to have local installs.
I have the same problem -> 6s import time, local install (anaconda, pandas 0.14.1). This is impossibly slow, especially when trying to import in multiple processes.
Same problem (pandas 0.18), although mine is not as awful: 400ms just to import pandas on a local SSD. I can't imagine how bad this would be for someone using, say, a networked filesystem.
+1. I see anywhere between 400 - 700ms.
Try removing the mpl font caches. Or, if you are in such a locked-down environment that you cannot write the caches, this might be mpl searching your system for fonts every time it is imported.
(in python3/ pandas 1.6.2 via anaconda)
In ipython clearing matplotlib cache:
import shutil; import matplotlib
shutil.rmtree(matplotlib.get_cachedir())
---- restart ipython ----
%timeit -n1 -r1 import pandas
381 ms on linux
748 ms on windows
(it didn't do anything)
Importing pandas from ipython (300ms) is faster than running it from python (500ms).
Importing some sub-dependencies speeds up importing pandas
%timeit -n1 -r1 import pandas
375ms
--- restart ipython -----
In [1]: %timeit -n1 -r1 import numpy
1 loops, best of 1: 87.8 ms per loop
In [2]: %timeit -n1 -r1 import pytz
1 loops, best of 1: 157 ms per loop
In [3]: %timeit -n1 -r1 import dateutil
1 loops, best of 1: 1.51 ms per loop
In [4]: %timeit -n1 -r1 import matplotlib
1 loops, best of 1: 54 ms per loop
In [5]: %timeit -n1 -r1 import xlsxwriter
1 loops, best of 1: 47.8 ms per loop
In [6]: %timeit -n1 -r1 import pandas
1 loops, best of 1: 177 ms per loop
It looks like pytz is particularly slow
Getting all the modules from pandas: I uninstalled matplotlib, xlsxwriter, and cython, and imported pandas' sub-imports before pandas (as seen via sys.modules.keys()). The import time of pandas (running this script via the interpreter) was 100ms after all the dependent imports, instead of 500ms:
import __future__
import __main__
import _ast
import _bisect
import _bootlocale
import _bz2
import _codecs
import _collections
import _collections_abc
import _compat_pickle
import _csv
import _ctypes
import _datetime
import _decimal
import _frozen_importlib
import _functools
import _hashlib
import _heapq
import _imp
import _io
import _json
import _locale
import _lzma
import _opcode
import _operator
import _pickle
import _posixsubprocess
import _random
import _sitebuiltins
import _socket
import _sre
import _ssl
import _stat
import _string
import _struct
import _sysconfigdata
import _thread
import _warnings
import _weakref
import _weakrefset
import abc
import argparse
import ast
import atexit
import base64
import binascii
import bisect
import builtins
import bz2
import calendar
import codecs
import collections
import contextlib
import copy
import copyreg
import csv
import ctypes
import datetime
import dateutil
import decimal
import difflib
import dis
import distutils
import email
import encodings
import enum
import errno
import fnmatch
import functools
import gc
import genericpath
import gettext
import grp
import hashlib
import heapq
import http
import importlib
import inspect
import io
import itertools
import json
import keyword
import linecache
import locale
import logging
import lzma
import marshal
import math
import numbers
import numexpr
import numpy
import opcode
import operator
import os
import parser
import pickle
import pkg_resources
import pkgutil
import platform
import plistlib
import posix
import posixpath
import pprint
import pwd
import pyexpat
import pytz
import quopri
import random
import re
import reprlib
import select
import selectors
import shutil
import signal
import site
import six
import socket
import sre_compile
import sre_constants
import sre_parse
import ssl
import stat
import string
import struct
import subprocess
import symbol
import sys
import sysconfig
import tarfile
import tempfile
import textwrap
import threading
import time
import timeit
import token
import tokenize
import traceback
import types
import unittest
import urllib
import uu
import uuid
import warnings
import weakref
import xml
import zipfile
import zipimport
import zlib
print(timeit.timeit('import pandas', number=1))
A workaround may be to stratify these imports before you need pandas; a sketch follows below.
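For what it's worth, a minimal sketch of that idea (the module list is taken from the timings above and is only an assumption about what dominates on a given machine): pull in the heavy dependencies at a point where a pause is acceptable, so the later import of pandas only pays for pandas' own modules.
import time
for name in ("numpy", "pytz", "dateutil", "matplotlib", "xlsxwriter"):
    start = time.time()
    try:
        __import__(name)                       # caches the module in sys.modules
    except ImportError:
        continue                               # optional dependency not installed
    print("%-12s %.0f ms" % (name, (time.time() - start) * 1000))
start = time.time()
import pandas                                  # now considerably cheaper
print("pandas       %.0f ms" % ((time.time() - start) * 1000))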
I'm getting similar results with no anaconda / python2 / pandas 1.8
Similar issue for me. It makes development in Flask unbearable, since it takes 10s after every file change to reload. I debugged it, and an import time of 3-10 seconds for pandas is the main culprit (2015 MBA, running anaconda on 3.5).
There is some caching happening, but not sure what...
python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 3.71 sec per loop
(abg) jacob@Jacobs-Air:~/stuff/abg% python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 652 msec per loop
One workaround is to isolate all the code that interacts with pandas and lazily import that code only when you need it so that the wait period is only during program execution. (that's what I do)
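A minimal sketch of that workaround (the function name is illustrative, not from the comment): keep the pandas-dependent code behind a function, so the import cost is only paid when that code path actually runs.
def summarize_csv(path):
    import pandas as pd                        # deferred: only imported on first call
    frame = pd.read_csv(path)
    return frame.describe()

if __name__ == "__main__":
    print("startup is fast; pandas is not imported yet")
    # print(summarize_csv("data.csv"))         # pandas would only be imported here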
I don't think that will help - then I'll have the delay every reload
(since all my code works with pandas).
I'd done this in a terminal window:
while true; do date && python -m timeit -n1 -r1 "import pandas"; sleep 2; done;
Doing this keeps pandas in the OS cache. Stupid hack, but keeps loading down to 300-500ms.
-J
I'm having a similar issue. Running on OSX, and it does the same inside and outside of a virtualenv. Tried reinstalling everything and that didn't help. It doesn't seem to be matplotlib, as that is relatively fast on its own. Very tricky to troubleshoot; nothing shows up in the logs.
Can somebody please profile a simple "import pandas" and we can see if the problem is easily identified?
So I did a quick profile and found the following:
93778 function calls (91484 primitive calls) in 4.278 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
6 0.426 0.071 0.905 0.151 api.py:3(<module>)
1 0.306 0.306 4.276 4.276 __init__.py:5(<module>)
1 0.189 0.189 0.211 0.211 base.py:1(<module>)
1 0.163 0.163 0.426 0.426 api.py:1(<module>)
1 0.129 0.129 1.170 1.170 format.py:5(<module>)
2 0.121 0.061 0.197 0.099 base.py:3(<module>)
20 0.120 0.006 0.390 0.019 __init__.py:1(<module>)
3 0.119 0.040 0.178 0.059 common.py:1(<module>)
1 0.115 0.115 0.572 0.572 __init__.py:26(<module>)
1 0.112 0.112 0.569 0.569 frame.py:10(<module>)
1 0.111 0.111 0.214 0.214 httplib.py:67(<module>)
2 0.103 0.051 0.630 0.315 index.py:2(<module>)
1 0.089 0.089 0.144 0.144 parser.py:29(<module>)
1 0.078 0.078 0.084 0.084 excel.py:3(<module>)
1 0.074 0.074 0.840 0.840 api.py:5(<module>)
1 0.072 0.072 0.091 0.091 sparse.py:4(<module>)
1 0.070 0.070 0.149 0.149 gbq.py:1(<module>)
1 0.070 0.070 0.650 0.650 groupby.py:1(<module>)
1 0.068 0.068 0.138 0.138 generic.py:2(<module>)
1 0.063 0.063 1.265 1.265 config_init.py:11(<module>)
1 0.060 0.060 0.060 0.060 socket.py:45(<module>)
1 0.055 0.055 0.145 0.145 eval.py:4(<module>)
1 0.054 0.054 0.075 0.075 expr.py:2(<module>)
2 0.052 0.026 0.069 0.035 __init__.py:9(<module>)
1 0.052 0.052 0.054 0.054 pytables.py:4(<module>)
1 0.052 0.052 0.165 0.165 series.py:3(<module>)
Seems like the init at line 5 is taking most of the time; is this the main init of pandas?
just for comparison on osx.
# 2.7
bash-3.2$ ~/miniconda3/envs/py2.7/bin/python -m timeit -n1 -r1 "import numpy"
1 loops, best of 1: 287 msec per loop
bash-3.2$ ~/miniconda3/envs/py2.7/bin/python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 671 msec per loop
# 3.5
bash-3.2$ ~/miniconda3/envs/pandas/bin/python -m timeit -n1 -r1 "import numpy"
1 loops, best of 1: 168 msec per loop
bash-3.2$ ~/miniconda3/envs/pandas/bin/python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 494 msec per loop
Probably cached?
not sure what you think is cached
@RexFuzzle I'm surprised you don't have any long file names. Did you strip the directories? You should be seeing something like the below; that will make it easier to see what is taking the majority of the time. I think it comes down to pandas importing a lot of dependencies, each of which has its own hit.
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.296 0.296 4.990 4.990 /mnt/environment/software/python/lib/python2.7/site-packages/pandas/__init__.py:3(<module>)
1 0.198 0.198 0.331 0.331 /mnt/environment/software/python/lib/python2.7/site-packages/numpy/core/__init__.py:1(<module>)
1 0.165 0.165 0.248 0.248 /mnt/environment/software/python/lib/python2.7/site-packages/bottleneck/__init__.py:3(<module>)
1 0.154 0.154 0.164 0.164 /mnt/environment/software/python/lib/python2.7/site-packages/bs4/dammit.py:8(<module>)
1 0.134 0.134 0.164 0.164 /mnt/environment/software/python/lib/python2.7/site-packages/pandas/core/common.py:3(<module>)
Hmmm, that is strange; I didn't strip anything. I was using cProfile, so I don't know if that could have caused it. Will investigate it a bit further tomorrow. From my results, though, it certainly seems like it is just the one init that is taking all the time. I will try to get mine in the same format as yours and then we can compare, to see if it is the same init file and line number.
Save out the cProfile results to a file and then load with pstats and print. If it is a specific module, run the line profiler to see if it is anything specific or just a lot of small things.
import cProfile
import pstats
cProfile.run("import pandas", "pandasImport")   # profile the import, write stats to a file
p = pstats.Stats("pandasImport")                # load the saved stats
p.sort_stats("tottime").print_stats()           # sort by time spent inside each function
For me, the first load is 4s. Then the OS caches the library in memory, so
it's around 300-500ms. Wait a little while, and try again.
Best,
Jacob
Ok, so running as @rockg suggested:
Wed Jan 11 08:56:08 2017 pandasImport
103330 function calls (100844 primitive calls) in 14.431 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.701 0.701 14.432 14.432 /usr/local/lib/python2.7/site-packages/pandas/__init__.py:5(<module>)
1 0.611 0.611 2.023 2.023 /usr/local/lib/python2.7/site-packages/numpy/core/__init__.py:1(<module>)
1 0.575 0.575 1.420 1.420 /usr/local/lib/python2.7/site-packages/pandas/io/api.py:3(<module>)
1 0.565 0.565 0.621 0.621 /usr/local/lib/python2.7/site-packages/pandas/indexes/base.py:1(<module>)
1 0.563 0.563 3.323 3.323 /usr/local/lib/python2.7/site-packages/numpy/lib/__init__.py:1(<module>)
1 0.393 0.393 0.394 0.394 /usr/local/lib/python2.7/site-packages/pandas/computation/engines.py:2(<module>)
1 0.378 0.378 1.080 1.080 /usr/local/lib/python2.7/site-packages/pandas/indexes/api.py:1(<module>)
1 0.313 0.313 1.991 1.991 /usr/local/lib/python2.7/site-packages/pandas/core/groupby.py:1(<module>)
1 0.313 0.313 0.401 0.401 /usr/local/lib/python2.7/site-packages/numpy/polynomial/__init__.py:15(<module>)
1 0.271 0.271 1.338 1.338 /usr/local/lib/python2.7/site-packages/pandas/compat/__init__.py:26(<module>)
1 0.262 0.262 0.311 0.311 /usr/local/lib/python2.7/site-packages/pandas/core/sparse.py:4(<module>)
1 0.246 0.246 0.246 0.246 /usr/local/lib/python2.7/site-packages/numpy/lib/npyio.py:1(<module>)
1 0.240 0.240 0.493 0.493 /usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py:67(<module>)
1 0.238 0.238 0.447 0.447 /usr/local/lib/python2.7/site-packages/numpy/testing/utils.py:4(<module>)
1 0.238 0.238 2.684 2.684 /usr/local/lib/python2.7/site-packages/pandas/formats/format.py:5(<module>)
1 0.221 0.221 0.271 0.271 /usr/local/lib/python2.7/site-packages/numpy/ma/__init__.py:41(<module>)
1 0.220 0.220 2.976 2.976 /usr/local/lib/python2.7/site-packages/pandas/core/config_init.py:11(<module>)
1 0.217 0.217 0.365 0.365 /usr/local/lib/python2.7/site-packages/pandas/core/base.py:3(<module>)
1 0.215 0.215 4.526 4.526 /usr/local/lib/python2.7/site-packages/numpy/__init__.py:106(<module>)
1 0.208 0.208 0.425 0.425 /usr/local/lib/python2.7/site-packages/pandas/core/generic.py:2(<module>)
1 0.207 0.207 1.667 1.667 /usr/local/lib/python2.7/site-packages/pandas/core/frame.py:10(<module>)
1 0.194 0.194 2.565 2.565 /usr/local/lib/python2.7/site-packages/pandas/core/api.py:5(<module>)
1 0.192 0.192 0.194 0.194 /usr/local/lib/python2.7/site-packages/pandas/io/pytables.py:4(<module>)
1 0.182 0.182 0.307 0.307 /usr/local/lib/python2.7/site-packages/pytz/__init__.py:9(<module>)
1 0.173 0.173 0.374 0.374 /usr/local/lib/python2.7/site-packages/pandas/io/common.py:1(<module>)
1 0.167 0.167 0.337 0.337 /usr/local/lib/python2.7/site-packages/pandas/stats/api.py:3(<module>)
1 0.161 0.161 0.167 0.167 /usr/local/lib/python2.7/site-packages/pandas/io/excel.py:3(<module>)
1 0.160 0.160 0.330 0.330 /usr/local/lib/python2.7/site-packages/numpy/core/numeric.py:1(<module>)
1 0.159 0.159 0.160 0.160 /usr/local/lib/python2.7/site-packages/numpy/random/__init__.py:88(<module>)
1 0.158 0.158 0.204 0.204 /usr/local/lib/python2.7/site-packages/pandas/computation/expr.py:2(<module>)
1 0.150 0.150 0.232 0.232 /usr/local/lib/python2.7/site-packages/pandas/tseries/frequencies.py:1(<module>)
All right, let's go one step further and do a line profile of pandas.__init__. You can do this by using line_profiler.
Maybe you could also give https://github.com/cournape/import-profiler a try
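For anyone trying that, here is a sketch based on the import-profiler README (treat profile_import and print_info as assumptions if the API has changed since):
from import_profiler import profile_import

with profile_import() as context:
    import pandas                              # anything expensive can go in here

context.print_info()                           # per-module cumulative and inline times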
But looking at the above values: although the import time is much larger, numpy also takes much longer. The ratio of numpy import to full pandas import seems about the same as for the much smaller numbers @jreback posted (or that I also see). So if numpy is already taking more than 4 seconds to import, we are of course not going to get the pandas import time below that.
Thanks for all the input. I ran a dtruss in the meantime and found that nothing happens for a few seconds before anything shows up there, so I'm thinking that there is a lag on disk reads rather than it being a Python problem. To me, this is reinforced by the fact that the time seems to be grouped with the first line of the init file (an artifact from cProfile?). Will do a bit more digging. I also agree that it seems to be more a numpy problem, and I will have a look through their issues to see if anybody else has something similar.
Thanks again for the input.
Also agree that it seems to be more a numpy problem
Sorry, that is not what I wanted to say. I just meant that both numpy and pandas seem to take longer (compared to my laptop, both 10x to 15x longer), so it is not necessarily possible to pinpoint a certain import as the culprit. It just seems generally slower. Which of course does not mean we couldn't do some more lazy imports in pandas to improve things, if there are bottlenecks.
Please, do not ignore this issue. It's closed, but I also found problems with a long import duration. Maybe it should be picked up again. Create awareness about this issue and raise the priority? Otherwise it is not good for the popularity of pandas.
I'm willing and able to do more testing, but I don't know of any other profiling-type tests I can run to try to find the source, so I am open to suggestions.
Greetings,
When using pandas with not-so-big datasets, it would take at least 5 to 10 seconds to parse through all the data and plot, which is quite a long time.
So, the steps that led me to the slow execution of pandas in PyCharm were:
Since it was an abnormal amount of time for so little code execution, I decided to uninstall both Anaconda and Python 3.6.1 and take a few extra steps:
Now code execution is faster (much faster than before).
I hope it helps someone.
I just ran the same as rockg suggested but sorted by cumtime, not tottime, which immediately points out that the pytz module takes half of the total import time (on my PC). Is there any way to make this optional or lazy? I rarely use datetimes, and when I do, they are almost always UTC, so I have very little interest in timezones.
Same with the pandas.plotting module -- I have an application which doesn't do any plotting, so it stinks that it adds significant time to my import with no benefit. It seems like it would make sense to make this lazy, since matplotlib takes a long time anyway and 0.15s extra isn't noticeable.
import cProfile
import pstats
cProfile.run("import pandas", "pandasImport")
p = pstats.Stats("pandasImport")
p.sort_stats("cumtime").print_stats()
which prints (stuff below 0.1 second elided)
Mon Oct 23 14:01:19 2017 pandasImport
204659 function calls (202288 primitive calls) in 1.875 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.042 0.042 1.876 1.876 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\__init__.py:5(<module>)
321/44 0.041 0.000 1.156 0.026 {__import__}
1 0.008 0.008 0.925 0.925 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\__init__.py:9(<module>)
1 0.002 0.002 0.914 0.914 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:14(<module>)
1 0.000 0.000 0.651 0.651 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:704(subscribe)
217 0.000 0.000 0.650 0.003 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:2870(<lambda>)
217 0.001 0.000 0.650 0.003 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:2299(activate)
427 0.002 0.000 0.602 0.001 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:1845(_handle_ns)
217 0.001 0.000 0.586 0.003 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:1898(fixup_namespace_packages)
411 0.003 0.000 0.581 0.001 c:\app\python\anaconda\1.6.0\lib\pkgutil.py:176(find_module)
411 0.571 0.001 0.571 0.001 {imp.find_module}
1 0.011 0.011 0.423 0.423 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\api.py:5(<module>)
1 0.007 0.007 0.352 0.352 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\groupby.py:1(<module>)
40 0.001 0.000 0.248 0.006 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:444(add_entry)
472 0.005 0.000 0.236 0.001 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:1779(find_on_path)
1 0.005 0.005 0.231 0.231 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\frame.py:10(<module>)
472 0.188 0.000 0.188 0.000 {nt._isdir}
476 0.002 0.000 0.187 0.000 {map}
1 0.023 0.023 0.173 0.173 c:\app\python\anaconda\1.6.0\lib\site-packages\numpy\__init__.py:106(<module>)
1 0.003 0.003 0.157 0.157 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\series.py:3(<module>)
1 0.005 0.005 0.142 0.142 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\plotting\__init__.py:3(<module>)
1 0.008 0.008 0.132 0.132 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\plotting\_converter.py:1(<module>)
1 0.000 0.000 0.127 0.127 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:430(__init__)
1 0.003 0.003 0.119 0.119 c:\app\python\anaconda\1.6.0\lib\site-packages\numpy\add_newdocs.py:10(<module>)
1 0.019 0.019 0.115 0.115 c:\app\python\anaconda\1.6.0\lib\site-packages\numpy\lib\__init__.py:1(<module>)
1 0.002 0.002 0.109 0.109 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\util\_tester.py:3(<module>)
1 0.015 0.015 0.107 0.107 c:\app\python\anaconda\1.6.0\lib\site-packages\pytest.py:4(<module>)
1 0.005 0.005 0.102 0.102 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\index.py:2(<module>)
FYI -- I have an SSD on my PC so if there is a disk seek issue that some people have, I don't see it. numpy 1.12 takes 0.17 seconds to import.
@jason-s pytz imports in <5 microseconds on my machine, so something is strange there.
FYI #17710 did some work on this, so things should be quicker in the upcoming release (nothing touching pytz though).
I'm using pandas 0.20.2 with pytz 2016.4 on a Windows 7 machine running Anaconda Python 2.7
I just ran conda uninstall pytz and reinstalled it; it now takes 0.01 second with pytz-2017.2.
Reinstalled pytz 2016.4 (conda install pytz=2016.4) and it slowed back down to 0.92 seconds again
Installed pytz 2016.7 -- it also is very fast (13 milliseconds to import). There is an item in the profile data called "lazy.py" which sounds like they converted to a "lazy" loading in 2016.7.
import cProfile
import pstats
cProfile.run("import pytz", "profiling_data")
p = pstats.Stats("profiling_data")
p.sort_stats("cumtime").print_stats()
which prints this for pytz 2016.7:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.005 0.005 0.018 0.018 <string>:1(<module>)
1 0.008 0.008 0.013 0.013 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\__init__.py:9(<module>)
2 0.002 0.001 0.002 0.001 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\lazy.py:135(__new__)
1 0.002 0.002 0.002 0.002 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\tzinfo.py:1(<module>)
1 0.000 0.000 0.001 0.001 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\lazy.py:1(<module>)
1 0.000 0.000 0.000 0.000 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\tzfile.py:4(<module>)
2 0.000 0.000 0.000 0.000 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\lazy.py:80(__new__)
Hmm. Unfortunately, switching to pytz 2017.2 (or 2016.7) doesn't seem to speed up the pandas import; it looks like either there are a lot of shared dependencies between the two, or the pandas __init__ process uses pytz in a way that negates the speed advantage pytz gains from lazy initialization.
Oh, here we go, both are using pkg_resources.py, which takes about 0.9s on my PC to execute whatever it is doing, whether it's from pytz or pandas.
I had setuptools 27.2 (which includes pkg_resources); this seems to be related to this issue https://github.com/pypa/setuptools/issues/926
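A quick way to confirm that on a given machine (my suggestion, not from the comment above) is to time pkg_resources on its own:
python -m timeit -n1 -r1 "import pkg_resources"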
OK, I used ripgrep in my site-packages to look for pkg_resources, and the culprits are pytz (which now uses it lazily) and numexpr.
I filed an issue with numexpr.
Is numexpr imported lazily in pandas in the upcoming release? That's another area where a feature I never use (at least, I think I never use it) slows down the pandas import significantly.
edit: never mind, you already know about this:
https://github.com/pandas-dev/pandas/pull/17710#issuecomment-332952362
For reference, here's an import profile using Python 3.7's importtime and tuna:
python3.7 -X importtime -c "import pandas" 2> pandas.log
tuna pandas.log
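The same log can also be summarized without tuna; the small helper below is illustrative (not part of either tool) and assumes the standard importtime line format of "import time: <self us> | <cumulative us> | <module>":
import sys

rows = []
with open(sys.argv[1], encoding="utf-8") as log:
    for line in log:
        # data lines look like: "import time:   1504 |   1504 |   zipimport"
        if line.startswith("import time:") and "cumulative" not in line:
            _, cumulative, name = line.split("|")
            rows.append((int(cumulative.strip()), name.strip()))

for cumulative_us, name in sorted(rows, reverse=True)[:20]:
    print("%8.1f ms  %s" % (cumulative_us / 1000.0, name))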
Our solution is to set up a web server and send POST requests to the algorithm part, so the time to import the pandas package only has to be paid once.
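A minimal sketch of that setup (assumed names and endpoint, not the actual service): pandas is imported once when the worker starts, and each POST request reuses the already-loaded module, so the import cost is paid a single time instead of on every script run.
from flask import Flask, request
import pandas as pd                            # one-time cost at server startup

app = Flask(__name__)

@app.route("/describe", methods=["POST"])
def describe():
    records = request.get_json()               # e.g. [{"a": 1, "b": 2}, ...]
    frame = pd.DataFrame(records)
    return frame.describe().to_json()          # pandas handles JSON serialization

if __name__ == "__main__":
    app.run(port=5000)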
having the same issue here
python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 8.56 sec per loop
Feel free to make a PR if you identify easy fixes.
So I think I may have found the issue. Over 50% of my time is in a single function call: mkl._py_mkl_service.get_version
pandasImport
187472 function calls (181157 primitive calls) in 4.406 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.295 2.295 2.295 2.295 {built-in method mkl._py_mkl_service.get_version}
Code
import cProfile
import pstats
cProfile.run("import pandas", "pandasImport")
p = pstats.Stats("pandasImport")
p.sort_stats("tottime").print_stats()
pandas.show_versions()
INSTALLED VERSIONS
------------------
commit : f2ca0a2665b2d169c97de87b8e778dbed86aea07
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : 0.29.21
pytest : 6.0.2
hypothesis : 5.35.3
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.3.3
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.8.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2
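If anyone wants to double-check the MKL finding above in isolation, a minimal sketch (assumes Anaconda's mkl-service package, which exposes mkl.get_version()) would be:
import time
start = time.time()
import mkl                                     # mkl-service, shipped with Anaconda's numpy
mkl.get_version()                              # the call that dominated the profile above
print("%.2f s" % (time.time() - start))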