The following test demonstrates the problem. The contents of testme.py are literally "import pandas"; however, it takes almost 6 seconds to import pandas on my Lenovo T60.
[mpenning@Mudslide panex]$ time python testme.py
real 0m5.759s
user 0m5.612s
sys 0m0.120s
[mpenning@Mudslide panex]$
[mpenning@Mudslide panex]$ uname -a
Linux Mudslide 3.2.0-4-686-pae #1 SMP Debian 3.2.57-3+deb7u1 i686 GNU/Linux
[mpenning@Mudslide panex]$ python -V
Python 2.7.3
[mpenning@Mudslide panex]$
[mpenning@Mudslide panex]$ pip freeze
Babel==1.3
Cython==0.20.1
Flask==0.10.1
Flask-Babel==0.8
Flask-Login==0.2.7
Flask-Mail==0.7.6
Flask-OpenID==1.1.1
Flask-SQLAlchemy==0.16
Flask-WTF==0.8.4
Flask-WhooshAlchemy==0.54a
Jinja2==2.7.1
MarkupSafe==0.18
Pygments==1.6
SQLAlchemy==0.7.9
Sphinx==1.2.2
Tempita==0.5.1
WTForms==1.0.5
Werkzeug==0.9.4
Whoosh==2.5.4
argparse==1.2.1
backports.ssl-match-hostname==3.4.0.2
blinker==1.3
ciscoconfparse==1.1.1
decorator==3.4.0
docutils==0.11
dulwich==0.9.6
## FIXME: could not find svn URL in dependency_links for this package:
flup==1.0.3.dev-20110405
hg-git==0.5.0
ipaddr==2.1.11
itsdangerous==0.23
matplotlib==1.3.1
mercurial==3.0
mock==1.0.1
nose==1.3.3
numexpr==2.4
numpy==1.8.1
numpydoc==0.4
pandas==0.13.1
pyparsing==2.0.2
python-dateutil==2.2
python-openid==2.2.5
pytz==2013b
six==1.6.1
speaklater==1.3
sqlalchemy-migrate==0.7.2
tables==3.1.1
tornado==3.2.1
wsgiref==0.1.2
Sounds a bit odd; you might have a path issue. Do you have multiple Pythons/environments installed? Does importing numpy take the same amount of time?
import pandas
pandas.show_versions()
time python testme.py
0.252u 0.076s 0:00.33 96.9% 0+0k 0+8io 1pf+0w
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-5-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.14.0rc1-43-g0dec048
nose: 1.3.0
Cython: 0.20
numpy: 1.8.1
scipy: 0.12.0
statsmodels: 0.5.0
IPython: 2.0.0
sphinx: 1.1.3
patsy: 0.1.0
scikits.timeseries: None
dateutil: 1.5
pytz: 2013b
bottleneck: 0.6.0
tables: 3.0.0
numexpr: 2.4
matplotlib: None
openpyxl: 1.5.7
xlrd: 0.9.0
xlwt: 0.7.4
xlsxwriter: None
lxml: 2.3.4
bs4: 4.1.3
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: 0.7.7
pymysql: None
psycopg2: 2.4.5 (dt dec pq3 ext)
numpy doesn't seem to have this issue...
[mpenning@Mudslide pymtr]$ time python -c 'import numpy'
real 0m0.184s
user 0m0.136s
sys 0m0.048s
[mpenning@Mudslide pymtr]$ time python -c 'import pandas'
real 0m5.724s
user 0m5.516s
sys 0m0.188s
[mpenning@Mudslide pymtr]$
No idea; why don't you try in a virtualenv with only the pandas deps installed?
Are you loading this over a network? Try installing locally, and print out pd.__file__ to be sure.
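A quick way to check both things at once (my sketch, not the original reporter's script) is to time the import and print where pandas was actually loaded from, which rules out a network-mounted or shadowed installation:
import time
start = time.time()
import pandas as pd
print(pd.__file__)                      # which installation actually got imported
print("%.2f s to import" % (time.time() - start))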
closing as not a bug.
I have the same problem. Was this closed because you found a solution? I'd be grateful if you could share it. Thanks.
@steve3141 - have you tried creating a pristine virtualenv and seeing if that helps?
Afraid I can't; work, lockdown, etc. So I realize this is very likely not the fault of pandas, except insofar as "import pandas" executes an enormous number -- over 500 by my count -- of secondary import statements. Filesystem overhead.
Thanks,
Steve
I know this has been closed for a while but I'm seeing the same thing, and it is not pandas-specific. We have our pandas environment in a virtualenv on a drive on a server. That drive is then mounted by each client. This allows us to maintain a sane package environment among all users. However, this is clearly sacrificing startup time to an unreasonable extent. The import times in seconds are as follows:
| Package | Server | Client |
| --- | --- | --- |
| pandas | 1.22 | 6.23 |
| numpy | 0.2 | 1.2 |
So clearly this is a setup issue, but how do other companies deal with this problem? I find it hard to believe that packages are installed locally on every user's box and if that isn't the case, that they experience these long startup times.
The network itself is working fine...transfer speeds are ~120MB/s.
@rockg - dunno about every corporation, but certainly all of the installations I've worked with have had everything locally. Conda and tox can make it much easier to have local installs.
I have the same problem -> 6s import time, local install (anaconda, pandas 0.14.1). This is impossibly slow, especially when trying to import in multiple processes.
Same problem (pandas 0.18), although mine is not as awful: 400ms just to import pandas on a local SSD. I can't imagine how bad this would be for someone using, say, a networked filesystem.
+1. I see anywhere between 400 - 700ms.
Try removing the mpl font caches. Or, if you are in such a locked-down environment that you cannot write the caches, this might be mpl searching your system for fonts every time it is imported.
(in python3/ pandas 1.6.2 via anaconda)
In ipython clearing matplotlib cache:
import shutil; import matplotlib
shutil.rmtree(matplotlib.get_cachedir())
---- restart ipython ----
%timeit -n1 -r1 import pandas
381 ms on linux
748 ms on windows
(it didn't do anything)
Importing pandas from ipython (300ms) is faster than running it from python (500ms).
Importing some sub-dependencies speeds up importing pandas
%timeit -n1 -r1 import pandas
375ms
--- restart ipython -----
In [1]: %timeit -n1 -r1 import numpy
1 loops, best of 1: 87.8 ms per loop
In [2]: %timeit -n1 -r1 import pytz
1 loops, best of 1: 157 ms per loop
In [3]: %timeit -n1 -r1 import dateutil
1 loops, best of 1: 1.51 ms per loop
In [4]: %timeit -n1 -r1 import matplotlib
1 loops, best of 1: 54 ms per loop
In [5]: %timeit -n1 -r1 import xlsxwriter
1 loops, best of 1: 47.8 ms per loop
In [6]: %timeit -n1 -r1 import pandas
1 loops, best of 1: 177 ms per loop
It looks like pytz is particularly slow
Getting all the modules from pandas: I uninstalled matplotlib, xlsxwriter, and cython, and imported pandas' sub-imports before pandas (as seen via sys.modules.keys()). The import time of pandas (running this script via the interpreter) was 100ms after all the dependent imports, instead of 500ms:
import __future__
import __main__
import _ast
import _bisect
import _bootlocale
import _bz2
import _codecs
import _collections
import _collections_abc
import _compat_pickle
import _csv
import _ctypes
import _datetime
import _decimal
import _frozen_importlib
import _functools
import _hashlib
import _heapq
import _imp
import _io
import _json
import _locale
import _lzma
import _opcode
import _operator
import _pickle
import _posixsubprocess
import _random
import _sitebuiltins
import _socket
import _sre
import _ssl
import _stat
import _string
import _struct
import _sysconfigdata
import _thread
import _warnings
import _weakref
import _weakrefset
import abc
import argparse
import ast
import atexit
import base64
import binascii
import bisect
import builtins
import bz2
import calendar
import codecs
import collections
import contextlib
import copy
import copyreg
import csv
import ctypes
import datetime
import dateutil
import decimal
import difflib
import dis
import distutils
import email
import encodings
import enum
import errno
import fnmatch
import functools
import gc
import genericpath
import gettext
import grp
import hashlib
import heapq
import http
import importlib
import inspect
import io
import itertools
import json
import keyword
import linecache
import locale
import logging
import lzma
import marshal
import math
import numbers
import numexpr
import numpy
import opcode
import operator
import os
import parser
import pickle
import pkg_resources
import pkgutil
import platform
import plistlib
import posix
import posixpath
import pprint
import pwd
import pyexpat
import pytz
import quopri
import random
import re
import reprlib
import select
import selectors
import shutil
import signal
import site
import six
import socket
import sre_compile
import sre_constants
import sre_parse
import ssl
import stat
import string
import struct
import subprocess
import symbol
import sys
import sysconfig
import tarfile
import tempfile
import textwrap
import threading
import time
import timeit
import token
import tokenize
import traceback
import types
import unittest
import urllib
import uu
import uuid
import warnings
import weakref
import xml
import zipfile
import zipimport
import zlib
print(timeit.timeit('import pandas', number=1))
A workaround may be to stratify these imports before you need pandas; a sketch follows below.
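For what it's worth, a minimal sketch of that idea (the module list is taken from the timings above and is only an assumption about what dominates on a given machine): pull in the heavy dependencies at a point where a pause is acceptable, so the later import of pandas only pays for pandas' own modules.
import time
for name in ("numpy", "pytz", "dateutil", "matplotlib", "xlsxwriter"):
    start = time.time()
    try:
        __import__(name)                       # caches the module in sys.modules
    except ImportError:
        continue                               # optional dependency not installed
    print("%-12s %.0f ms" % (name, (time.time() - start) * 1000))
start = time.time()
import pandas                                  # now considerably cheaper
print("pandas       %.0f ms" % ((time.time() - start) * 1000))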
I'm getting similar results with no anaconda / python2 / pandas 1.8
Similar issue for me. It makes development in Flask unbearable, since it takes 10s after every file change to reload. I debugged it, and an import time of 3-10 seconds for pandas is the main culprit (2015 MBA, running anaconda on 3.5).
There is some caching happening, but not sure what...
python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 3.71 sec per loop
(abg) jacob@Jacobs-Air:~/stuff/abg% python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 652 msec per loop
One workaround is to isolate all the code that interacts with pandas and lazily import that code only when you need it so that the wait period is only during program execution. (that's what I do)
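A minimal sketch of that workaround (the function name is illustrative, not from the comment): keep the pandas-dependent code behind a function, so the import cost is only paid when that code path actually runs.
def summarize_csv(path):
    import pandas as pd                        # deferred: only imported on first call
    frame = pd.read_csv(path)
    return frame.describe()

if __name__ == "__main__":
    print("startup is fast; pandas is not imported yet")
    # print(summarize_csv("data.csv"))         # pandas would only be imported here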
I don't think that will help - then I'll have the delay every reload
(since all my code works with pandas).
I'd done this in a terminal window:
while true; do date && python -m timeit -n1 -r1 "import pandas"; sleep 2; done;
Doing this keeps pandas in the OS cache. Stupid hack, but keeps loading down to 300-500ms.
-J
I'm having a similar issue. Running on OSX, and it does the same inside and outside of a virtualenv. Tried reinstalling everything and that didn't help. It doesn't seem to be matplotlib, as that is relatively fast on its own. Very tricky to troubleshoot; nothing shows up in the logs.
Can somebody please profile a simple "import pandas" and we can see if the problem is easily identified?
So I did a quick profile and found the following:
93778 function calls (91484 primitive calls) in 4.278 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
6 0.426 0.071 0.905 0.151 api.py:3(<module>)
1 0.306 0.306 4.276 4.276 __init__.py:5(<module>)
1 0.189 0.189 0.211 0.211 base.py:1(<module>)
1 0.163 0.163 0.426 0.426 api.py:1(<module>)
1 0.129 0.129 1.170 1.170 format.py:5(<module>)
2 0.121 0.061 0.197 0.099 base.py:3(<module>)
20 0.120 0.006 0.390 0.019 __init__.py:1(<module>)
3 0.119 0.040 0.178 0.059 common.py:1(<module>)
1 0.115 0.115 0.572 0.572 __init__.py:26(<module>)
1 0.112 0.112 0.569 0.569 frame.py:10(<module>)
1 0.111 0.111 0.214 0.214 httplib.py:67(<module>)
2 0.103 0.051 0.630 0.315 index.py:2(<module>)
1 0.089 0.089 0.144 0.144 parser.py:29(<module>)
1 0.078 0.078 0.084 0.084 excel.py:3(<module>)
1 0.074 0.074 0.840 0.840 api.py:5(<module>)
1 0.072 0.072 0.091 0.091 sparse.py:4(<module>)
1 0.070 0.070 0.149 0.149 gbq.py:1(<module>)
1 0.070 0.070 0.650 0.650 groupby.py:1(<module>)
1 0.068 0.068 0.138 0.138 generic.py:2(<module>)
1 0.063 0.063 1.265 1.265 config_init.py:11(<module>)
1 0.060 0.060 0.060 0.060 socket.py:45(<module>)
1 0.055 0.055 0.145 0.145 eval.py:4(<module>)
1 0.054 0.054 0.075 0.075 expr.py:2(<module>)
2 0.052 0.026 0.069 0.035 __init__.py:9(<module>)
1 0.052 0.052 0.054 0.054 pytables.py:4(<module>)
1 0.052 0.052 0.165 0.165 series.py:3(<module>)
Seems like the init at line 5 is taking most of the time; is this the main init of pandas?
just for comparison on osx.
# 2.7
bash-3.2$ ~/miniconda3/envs/py2.7/bin/python -m timeit -n1 -r1 "import numpy"
1 loops, best of 1: 287 msec per loop
bash-3.2$ ~/miniconda3/envs/py2.7/bin/python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 671 msec per loop
# 3.5
bash-3.2$ ~/miniconda3/envs/pandas/bin/python -m timeit -n1 -r1 "import numpy"
1 loops, best of 1: 168 msec per loop
bash-3.2$ ~/miniconda3/envs/pandas/bin/python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 494 msec per loop
Probably cached?
not sure what you think is cached
@RexFuzzle I'm surprised you don't have any long file names. Did you strip the directories? You should be seeing something like the below; that will make it easier to see what is taking the majority of the time. I think it comes down to pandas importing a lot of dependencies, each of which has its own hit.
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.296 0.296 4.990 4.990 /mnt/environment/software/python/lib/python2.7/site-packages/pandas/__init__.py:3(<module>)
1 0.198 0.198 0.331 0.331 /mnt/environment/software/python/lib/python2.7/site-packages/numpy/core/__init__.py:1(<module>)
1 0.165 0.165 0.248 0.248 /mnt/environment/software/python/lib/python2.7/site-packages/bottleneck/__init__.py:3(<module>)
1 0.154 0.154 0.164 0.164 /mnt/environment/software/python/lib/python2.7/site-packages/bs4/dammit.py:8(<module>)
1 0.134 0.134 0.164 0.164 /mnt/environment/software/python/lib/python2.7/site-packages/pandas/core/common.py:3(<module>)
Hmmm, that is strange; I didn't strip anything. I was using cProfile, so I don't know if that could have caused it. Will investigate it a bit further tomorrow. From my results, though, it certainly seems like it is just the one init that is taking all the time. I will try to get mine in the same format as yours and then we can compare, to see if it is the same init file and line number.
Save out the cProfile results to a file and then load with pstats and print. If it is a specific module, run the line profiler to see if it is anything specific or just a lot of small things.
import cProfile
import pstats
cProfile.run("import pandas", "pandasImport")   # profile the import, write stats to a file
p = pstats.Stats("pandasImport")                # load the saved stats
p.sort_stats("tottime").print_stats()           # sort by time spent inside each function
For me, the first load is 4s. Then the OS caches the library in memory, so
it's around 300-500ms. Wait a little while, and try again.
Best,
Jacob
Ok, so running as @rockg suggested:
Wed Jan 11 08:56:08 2017 pandasImport
103330 function calls (100844 primitive calls) in 14.431 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.701 0.701 14.432 14.432 /usr/local/lib/python2.7/site-packages/pandas/__init__.py:5(<module>)
1 0.611 0.611 2.023 2.023 /usr/local/lib/python2.7/site-packages/numpy/core/__init__.py:1(<module>)
1 0.575 0.575 1.420 1.420 /usr/local/lib/python2.7/site-packages/pandas/io/api.py:3(<module>)
1 0.565 0.565 0.621 0.621 /usr/local/lib/python2.7/site-packages/pandas/indexes/base.py:1(<module>)
1 0.563 0.563 3.323 3.323 /usr/local/lib/python2.7/site-packages/numpy/lib/__init__.py:1(<module>)
1 0.393 0.393 0.394 0.394 /usr/local/lib/python2.7/site-packages/pandas/computation/engines.py:2(<module>)
1 0.378 0.378 1.080 1.080 /usr/local/lib/python2.7/site-packages/pandas/indexes/api.py:1(<module>)
1 0.313 0.313 1.991 1.991 /usr/local/lib/python2.7/site-packages/pandas/core/groupby.py:1(<module>)
1 0.313 0.313 0.401 0.401 /usr/local/lib/python2.7/site-packages/numpy/polynomial/__init__.py:15(<module>)
1 0.271 0.271 1.338 1.338 /usr/local/lib/python2.7/site-packages/pandas/compat/__init__.py:26(<module>)
1 0.262 0.262 0.311 0.311 /usr/local/lib/python2.7/site-packages/pandas/core/sparse.py:4(<module>)
1 0.246 0.246 0.246 0.246 /usr/local/lib/python2.7/site-packages/numpy/lib/npyio.py:1(<module>)
1 0.240 0.240 0.493 0.493 /usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py:67(<module>)
1 0.238 0.238 0.447 0.447 /usr/local/lib/python2.7/site-packages/numpy/testing/utils.py:4(<module>)
1 0.238 0.238 2.684 2.684 /usr/local/lib/python2.7/site-packages/pandas/formats/format.py:5(<module>)
1 0.221 0.221 0.271 0.271 /usr/local/lib/python2.7/site-packages/numpy/ma/__init__.py:41(<module>)
1 0.220 0.220 2.976 2.976 /usr/local/lib/python2.7/site-packages/pandas/core/config_init.py:11(<module>)
1 0.217 0.217 0.365 0.365 /usr/local/lib/python2.7/site-packages/pandas/core/base.py:3(<module>)
1 0.215 0.215 4.526 4.526 /usr/local/lib/python2.7/site-packages/numpy/__init__.py:106(<module>)
1 0.208 0.208 0.425 0.425 /usr/local/lib/python2.7/site-packages/pandas/core/generic.py:2(<module>)
1 0.207 0.207 1.667 1.667 /usr/local/lib/python2.7/site-packages/pandas/core/frame.py:10(<module>)
1 0.194 0.194 2.565 2.565 /usr/local/lib/python2.7/site-packages/pandas/core/api.py:5(<module>)
1 0.192 0.192 0.194 0.194 /usr/local/lib/python2.7/site-packages/pandas/io/pytables.py:4(<module>)
1 0.182 0.182 0.307 0.307 /usr/local/lib/python2.7/site-packages/pytz/__init__.py:9(<module>)
1 0.173 0.173 0.374 0.374 /usr/local/lib/python2.7/site-packages/pandas/io/common.py:1(<module>)
1 0.167 0.167 0.337 0.337 /usr/local/lib/python2.7/site-packages/pandas/stats/api.py:3(<module>)
1 0.161 0.161 0.167 0.167 /usr/local/lib/python2.7/site-packages/pandas/io/excel.py:3(<module>)
1 0.160 0.160 0.330 0.330 /usr/local/lib/python2.7/site-packages/numpy/core/numeric.py:1(<module>)
1 0.159 0.159 0.160 0.160 /usr/local/lib/python2.7/site-packages/numpy/random/__init__.py:88(<module>)
1 0.158 0.158 0.204 0.204 /usr/local/lib/python2.7/site-packages/pandas/computation/expr.py:2(<module>)
1 0.150 0.150 0.232 0.232 /usr/local/lib/python2.7/site-packages/pandas/tseries/frequencies.py:1(<module>)
All right, let's go one step further and do a line profile of pandas.__init__. You can do this by using line_profiler.
Maybe you could also give https://github.com/cournape/import-profiler a try
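For anyone trying that, here is a sketch based on the import-profiler README (treat profile_import and print_info as assumptions if the API has changed since):
from import_profiler import profile_import

with profile_import() as context:
    import pandas                              # anything expensive can go in here

context.print_info()                           # per-module cumulative and inline times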
But looking at the above values: although the import time is much larger, numpy also takes much longer. The ratio of numpy import to full pandas import seems about the same as for the much smaller numbers @jreback posted (or that I also see). So if numpy is already taking more than 4 seconds to import, we are of course not going to get the pandas import time below that.
Thanks for all the input. I ran a dtruss in the meantime and found that nothing happens for a few seconds before anything shows up there, so I'm thinking that there is a lag on disk reads rather than it being a Python problem. To me, this is reinforced by the fact that the time seems to be grouped with the first line of the init file (an artifact from cProfile?). Will do a bit more digging. I also agree that it seems to be more a numpy problem, and I will have a look through their issues to see if anybody else has something similar.
Thanks again for the input.
Also agree that it seems to be more a numpy problem
Sorry, that is not what I wanted to say. I just meant that both numpy and pandas seem to take longer (compared to my laptop, both 10x to 15x longer), so it is not necessarily possible to pinpoint a certain import as the culprit. It just seems generally slower. Which of course does not mean we couldn't do some more lazy imports in pandas to improve things, if there are bottlenecks.
Please, do not ignore this issue. It's closed, but I also found problems with a long import duration. Maybe it should be picked up again. Create awareness about this issue and raise the priority? Otherwise it is not good for the popularity of pandas.
I'm willing and able to do more testing, but I don't know of any other profiling-type tests I can run to try to find the source, so I am open to suggestions.
Greetings,
When using pandas with not-so-big datasets, it would take at least 5 to 10 seconds to parse through all the data and plot, which is quite a long time.
So, the steps that led me to the slow execution of pandas in PyCharm were:
Since it was an abnormal amount of time for so little code execution, I decided to uninstall both Anaconda and Python 3.6.1 and take a few extra steps:
Now code execution is faster (much faster than before).
I hope it helps someone.
I just ran the same as rockg suggested but sorted by cumtime, not tottime, which immediately points out that the pytz module takes half of the total import time (on my PC). Is there any way to make this optional or lazy? I rarely use datetimes, and when I do, they are almost always UTC, so I have very little interest in timezones.
Same with the pandas.plotting module -- I have an application which doesn't do any plotting, so it stinks that it adds significant time to my import with no benefit. It seems like it would make sense to make this lazy, since matplotlib takes a long time anyway and 0.15s extra isn't noticeable.
import cProfile
import pstats
cProfile.run("import pandas", "pandasImport")
p = pstats.Stats("pandasImport")
p.sort_stats("cumtime").print_stats()
which prints (stuff below 0.1 second elided)
Mon Oct 23 14:01:19 2017 pandasImport
204659 function calls (202288 primitive calls) in 1.875 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.042 0.042 1.876 1.876 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\__init__.py:5(<module>)
321/44 0.041 0.000 1.156 0.026 {__import__}
1 0.008 0.008 0.925 0.925 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\__init__.py:9(<module>)
1 0.002 0.002 0.914 0.914 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:14(<module>)
1 0.000 0.000 0.651 0.651 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:704(subscribe)
217 0.000 0.000 0.650 0.003 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:2870(<lambda>)
217 0.001 0.000 0.650 0.003 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:2299(activate)
427 0.002 0.000 0.602 0.001 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:1845(_handle_ns)
217 0.001 0.000 0.586 0.003 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:1898(fixup_namespace_packages)
411 0.003 0.000 0.581 0.001 c:\app\python\anaconda\1.6.0\lib\pkgutil.py:176(find_module)
411 0.571 0.001 0.571 0.001 {imp.find_module}
1 0.011 0.011 0.423 0.423 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\api.py:5(<module>)
1 0.007 0.007 0.352 0.352 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\groupby.py:1(<module>)
40 0.001 0.000 0.248 0.006 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:444(add_entry)
472 0.005 0.000 0.236 0.001 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:1779(find_on_path)
1 0.005 0.005 0.231 0.231 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\frame.py:10(<module>)
472 0.188 0.000 0.188 0.000 {nt._isdir}
476 0.002 0.000 0.187 0.000 {map}
1 0.023 0.023 0.173 0.173 c:\app\python\anaconda\1.6.0\lib\site-packages\numpy\__init__.py:106(<module>)
1 0.003 0.003 0.157 0.157 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\series.py:3(<module>)
1 0.005 0.005 0.142 0.142 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\plotting\__init__.py:3(<module>)
1 0.008 0.008 0.132 0.132 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\plotting\_converter.py:1(<module>)
1 0.000 0.000 0.127 0.127 c:\app\python\anaconda\1.6.0\lib\site-packages\pkg_resources.py:430(__init__)
1 0.003 0.003 0.119 0.119 c:\app\python\anaconda\1.6.0\lib\site-packages\numpy\add_newdocs.py:10(<module>)
1 0.019 0.019 0.115 0.115 c:\app\python\anaconda\1.6.0\lib\site-packages\numpy\lib\__init__.py:1(<module>)
1 0.002 0.002 0.109 0.109 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\util\_tester.py:3(<module>)
1 0.015 0.015 0.107 0.107 c:\app\python\anaconda\1.6.0\lib\site-packages\pytest.py:4(<module>)
1 0.005 0.005 0.102 0.102 c:\app\python\anaconda\1.6.0\lib\site-packages\pandas\core\index.py:2(<module>)
FYI -- I have an SSD on my PC so if there is a disk seek issue that some people have, I don't see it. numpy 1.12 takes 0.17 seconds to import.
@jason-s pytz imports in <5 microseconds on my machine, so something is strange there.
FYI #17710 did some work on this, so things should be quicker in the upcoming release (nothing touching pytz though).
I'm using pandas 0.20.2 with pytz 2016.4 on a Windows 7 machine running Anaconda Python 2.7
I just ran conda uninstall pytz and reinstalled it; it now takes 0.01 second with pytz-2017.2.
Reinstalled pytz 2016.4 (conda install pytz=2016.4) and it slowed back down to 0.92 seconds again
Installed pytz 2016.7 -- it also is very fast (13 milliseconds to import). There is an item in the profile data called "lazy.py" which sounds like they converted to a "lazy" loading in 2016.7.
import cProfile
import pstats
cProfile.run("import pytz", "profiling_data")
p = pstats.Stats("profiling_data")
p.sort_stats("cumtime").print_stats()
which prints this for pytz 2016.7:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.005 0.005 0.018 0.018 <string>:1(<module>)
1 0.008 0.008 0.013 0.013 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\__init__.py:9(<module>)
2 0.002 0.001 0.002 0.001 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\lazy.py:135(__new__)
1 0.002 0.002 0.002 0.002 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\tzinfo.py:1(<module>)
1 0.000 0.000 0.001 0.001 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\lazy.py:1(<module>)
1 0.000 0.000 0.000 0.000 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\tzfile.py:4(<module>)
2 0.000 0.000 0.000 0.000 c:\app\python\anaconda\1.6.0\lib\site-packages\pytz\lazy.py:80(__new__)
Hmm. Unfortunately, switching to pytz 2017.2 (or 2016.7) doesn't seem to speed up the pandas import; it looks like either there are a lot of shared dependencies between the two, or the pandas __init__ process uses pytz in a way that negates the speed advantage pytz gains from lazy initialization.
Oh, here we go, both are using pkg_resources.py, which takes about 0.9s on my PC to execute whatever it is doing, whether it's from pytz or pandas.
I had setuptools 27.2 (which includes pkg_resources); this seems to be related to this issue https://github.com/pypa/setuptools/issues/926
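A quick way to confirm that on a given machine (my suggestion, not from the comment above) is to time pkg_resources on its own:
python -m timeit -n1 -r1 "import pkg_resources"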
OK, I used ripgrep in my site-packages to look for pkg_resources, and the culprits are pytz (which now uses it lazily) and numexpr.
I filed an issue with numexpr.
Is numexpr imported lazily in pandas in the upcoming release? That's another area where a feature I never use (at least, I think I never use it) slows down the pandas import significantly.
edit: never mind, you already know about this:
https://github.com/pandas-dev/pandas/pull/17710#issuecomment-332952362
For reference, here's an import profile using Python 3.7's importtime and tuna:
python3.7 -X importtime -c "import pandas" 2> pandas.log
tuna pandas.log
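The same log can also be summarized without tuna; the small helper below is illustrative (not part of either tool) and assumes the standard importtime line format of "import time: <self us> | <cumulative us> | <module>":
import sys

rows = []
with open(sys.argv[1], encoding="utf-8") as log:
    for line in log:
        # data lines look like: "import time:   1504 |   1504 |   zipimport"
        if line.startswith("import time:") and "cumulative" not in line:
            _, cumulative, name = line.split("|")
            rows.append((int(cumulative.strip()), name.strip()))

for cumulative_us, name in sorted(rows, reverse=True)[:20]:
    print("%8.1f ms  %s" % (cumulative_us / 1000.0, name))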
Our solution is to set up a web server and send POST requests to the algorithm part, so the time to import the pandas package only has to be paid once.
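A minimal sketch of that setup (assumed names and endpoint, not the actual service): pandas is imported once when the worker starts, and each POST request reuses the already-loaded module, so the import cost is paid a single time instead of on every script run.
from flask import Flask, request
import pandas as pd                            # one-time cost at server startup

app = Flask(__name__)

@app.route("/describe", methods=["POST"])
def describe():
    records = request.get_json()               # e.g. [{"a": 1, "b": 2}, ...]
    frame = pd.DataFrame(records)
    return frame.describe().to_json()          # pandas handles JSON serialization

if __name__ == "__main__":
    app.run(port=5000)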
having the same issue here
python -m timeit -n1 -r1 "import pandas"
1 loops, best of 1: 8.56 sec per loop
Feel free to make a PR if you identify easy fixes.
So I think I may have found the issue. Over 50% of my time is in a single function call: mkl._py_mkl_service.get_version
pandasImport
187472 function calls (181157 primitive calls) in 4.406 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.295 2.295 2.295 2.295 {built-in method mkl._py_mkl_service.get_version}
Code
import cProfile
import pstats
cProfile.run("import pandas", "pandasImport")
p = pstats.Stats("pandasImport")
p.sort_stats("tottime").print_stats()
pandas.show_versions()
INSTALLED VERSIONS
------------------
commit : f2ca0a2665b2d169c97de87b8e778dbed86aea07
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : 0.29.21
pytest : 6.0.2
hypothesis : 5.35.3
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.3.3
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.8.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2
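If anyone wants to double-check the MKL finding above in isolation, a minimal sketch (assumes Anaconda's mkl-service package, which exposes mkl.get_version()) would be:
import time
start = time.time()
import mkl                                     # mkl-service, shipped with Anaconda's numpy
mkl.get_version()                              # the call that dominated the profile above
print("%.2f s" % (time.time() - start))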