Describe the bug
It takes a few seconds to import cudf today
Steps/Code to reproduce bug
In [1]: import numpy, pandas
In [2]: %time import cudf
CPU times: user 432 ms, sys: 132 ms, total: 564 ms
Wall time: 2.44 s
**git***
commit ab3f45857f641548f6d64d977908075d63c193bf (HEAD -> repr-html, mrocklin/repr-html)
Author: Matthew Rocklin <[email protected]>
Date: Thu Jan 3 17:35:10 2019 -0800
Test repr for both large and small dataframes
***OS Information***
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2018-03-20"
DGX_SWBUILD_VERSION="3.1.6"
DGX_COMMIT_ID="1b0f58ecbf989820ce745a9e4836e1de5eea6cfd"
DGX_SERIAL_NUMBER=QTFCOU8310024
DGX_OTA_VERSION="3.1.7"
DGX_OTA_DATE="Thu Sep 27 20:07:53 PDT 2018"
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Linux dgx16 4.4.0-135-generic #161-Ubuntu SMP Mon Aug 27 10:45:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
***GPU Information***
Thu Jan 3 18:42:10 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |
| N/A 42C P0 59W / 300W | 1085MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:07:00.0 Off | 0 |
| N/A 43C P0 46W / 300W | 11MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:0A:00.0 Off | 0 |
| N/A 42C P0 46W / 300W | 11MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:0B:00.0 Off | 0 |
| N/A 41C P0 47W / 300W | 11MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000000:85:00.0 Off | 0 |
| N/A 42C P0 46W / 300W | 11MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 |
| N/A 42C P0 44W / 300W | 11MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |
| N/A 44C P0 45W / 300W | 11MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 |
| N/A 41C P0 44W / 300W | 11MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 64394 C ...klin/miniconda/envs/cudf_dev/bin/python 644MiB |
| 0 65545 C ...klin/miniconda/envs/cudf_dev/bin/python 430MiB |
+-----------------------------------------------------------------------------+
***CPU***
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2030.789
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
BogoMIPS: 4391.76
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 51200K
NUMA node0 CPU(s): 0-19,40-59
NUMA node1 CPU(s): 20-39,60-79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d
***CMake***
/home/nfs/mrocklin/miniconda/envs/cudf_dev/bin/cmake
cmake version 3.13.2
CMake suite maintained and supported by Kitware (kitware.com/cmake).
***g++***
/usr/bin/g++
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
***nvcc***
***Python***
/home/nfs/mrocklin/miniconda/envs/cudf_dev/bin/python
Python 3.5.5
***Environment Variables***
PATH : /home/nfs/mrocklin/miniconda/envs/cudf_dev/bin:/home/nfs/mrocklin/miniconda/bin:/home/nfs/mrocklin/miniconda/bin:/home/nfs/mrocklin/miniconda/bin:/home/nfs/mrocklin/bin:/home/nfs/mrocklin/.local/bin:/home/nfs/mrocklin/miniconda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
LD_LIBRARY_PATH :
NUMBAPRO_NVVM :
NUMBAPRO_LIBDEVICE :
CONDA_PREFIX : /home/nfs/mrocklin/miniconda/envs/cudf_dev
PYTHON_PATH :
***conda packages***
/home/nfs/mrocklin/miniconda/bin/conda
# packages in environment at /home/nfs/mrocklin/miniconda/envs/cudf_dev:
#
# Name Version Build Channel
alabaster 0.7.12 py_0 conda-forge
arrow-cpp 0.10.0 py35h70250a7_0 conda-forge
asn1crypto 0.24.0 py35_3 conda-forge
atomicwrites 1.2.1 py_0 conda-forge
attrs 18.2.0 py_0 conda-forge
babel 2.6.0 py_1 conda-forge
backcall 0.1.0 py_0 conda-forge
blas 1.0 mkl
bleach 3.0.2 py_1 conda-forge
bokeh 0.13.0 py35_0
boost 1.67.0 py35h3e44d54_0 conda-forge
boost-cpp 1.67.0 h3a22d5f_0 conda-forge
bzip2 1.0.6 h470a237_2 conda-forge
ca-certificates 2018.03.07 0
certifi 2018.8.24 py35_1
cffi 1.11.5 py35h5e8e0c9_1 conda-forge
chardet 3.0.4 py35_3 conda-forge
click 7.0 py_0 conda-forge
cloudpickle 0.6.1 py_0 conda-forge
cmake 3.13.2 h011004d_0 conda-forge
commonmark 0.5.4 py_2 conda-forge
cryptography 2.3.1 py35hdffb7b8_0 conda-forge
cryptography-vectors 2.3.1 py35_0 conda-forge
cudatoolkit 9.2 0
cudf 0.4.0+385.g81a187d <pip>
curl 7.63.0 h74213dd_0 conda-forge
cython 0.28.5 py35hfc679d8_0 conda-forge
cytoolz 0.9.0.1 py35h470a237_0 conda-forge
dask 1.0.0+51.g2ca205b <pip>
dask-core 1.0.0 py_0 conda-forge
dask-cuda 0.0.0 <pip>
dask-cudf 0.0.1+222.g10a9f90.dirty <pip>
decorator 4.3.0 py_0 conda-forge
distributed 1.23.2 py35_1 conda-forge
distributed 1.25.1+10.ga0d0ed2 <pip>
docutils 0.14 py35_1 conda-forge
entrypoints 0.2.3 py35_2 conda-forge
expat 2.2.5 hfc679d8_2 conda-forge
future 0.16.0 py35_2 conda-forge
gmp 6.1.2 hfc679d8_0 conda-forge
heapdict 1.0.0 py35_1000 conda-forge
icu 58.2 hfc679d8_0 conda-forge
idna 2.7 py35_2 conda-forge
imagesize 1.1.0 py_0 conda-forge
intel-openmp 2019.1 144
ipykernel 5.1.0 pyh24bf2e0_0 conda-forge
ipython 7.0.1 py35h24bf2e0_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jedi 0.12.1 py35_0 conda-forge
jinja2 2.10 py_1 conda-forge
jsonschema 2.6.0 py35_2 conda-forge
jupyter_client 5.2.4 py_0 conda-forge
jupyter_core 4.4.0 py_0 conda-forge
krb5 1.16.2 hbb41f41_0 conda-forge
libcurl 7.63.0 hbdb9355_0 conda-forge
libedit 3.1.20170329 haf1bffa_1 conda-forge
libffi 3.2.1 hfc679d8_5 conda-forge
libgcc-ng 8.2.0 hdf63c60_1
libgdf-cffi 0.4.0 <pip>
libgfortran-ng 7.2.0 hdf63c60_3 conda-forge
librmm-cffi 0.4.0 <pip>
libsodium 1.0.16 h470a237_1 conda-forge
libssh2 1.8.0 h5b517e9_3 conda-forge
libstdcxx-ng 8.2.0 hdf63c60_1
libuv 1.24.1 h470a237_0 conda-forge
llvmlite 0.27.0 py35hf484d3e_0 numba
Markdown 2.6.11 <pip>
markupsafe 1.0 py35h470a237_1 conda-forge
mistune 0.8.3 py35h470a237_2 conda-forge
mkl 2018.0.3 1
mkl_fft 1.0.9 py35_0 conda-forge
mkl_random 1.0.1 py35_0 conda-forge
more-itertools 4.3.0 py35_0 conda-forge
msgpack-python 0.5.6 py35h2d50403_3 conda-forge
nbconvert 5.3.1 py_1 conda-forge
nbformat 4.4.0 py_1 conda-forge
ncurses 6.1 hfc679d8_2 conda-forge
notebook 5.7.0 py35_0 conda-forge
numba 0.42.0 np115py35hf484d3e_0 numba
numpy 1.15.2 py35h1d66e8a_0
numpy-base 1.15.2 py35h81de0dd_0
numpydoc 0.8.0 py_1 conda-forge
nvstrings 0.1.0 cuda9.2_py35_0 nvidia
openssl 1.0.2p h14c3975_0
packaging 18.0 py_0 conda-forge
pandas 0.20.3 py35_1 conda-forge
pandoc 2.5 0 conda-forge
pandocfilters 1.4.2 py_1 conda-forge
parquet-cpp 1.5.0.pre h83d4a3d_0 conda-forge
parso 0.3.1 py_0 conda-forge
pathlib2 2.3.2 py35_0 conda-forge
pexpect 4.6.0 py35_0 conda-forge
pickleshare 0.7.5 py35_0 conda-forge
pip 18.0 py35_1001 conda-forge
pluggy 0.8.0 py_0 conda-forge
prometheus_client 0.5.0 py_0 conda-forge
prompt_toolkit 2.0.7 py_0 conda-forge
psutil 5.4.7 py35h470a237_1 conda-forge
ptyprocess 0.6.0 py35_1000 conda-forge
py 1.7.0 py_0 conda-forge
pyarrow 0.10.0 py35hfc679d8_0 conda-forge
pycparser 2.19 py_0 conda-forge
pygments 2.3.1 py_0 conda-forge
pyopenssl 18.0.0 py35_0 conda-forge
pyparsing 2.3.0 py_0 conda-forge
pysocks 1.6.8 py35_2 conda-forge
pytest 4.0.2 <pip>
pytest 3.8.1 py35_0
python 3.5.5 h5001a0f_2 conda-forge
python-dateutil 2.7.5 py_0 conda-forge
pytz 2018.7 py_0 conda-forge
pyyaml 3.13 py35h470a237_1 conda-forge
pyzmq 17.1.2 py35hae99301_0 conda-forge
readline 7.0 haf1bffa_1 conda-forge
recommonmark 0.4.0 py_2 conda-forge
requests 2.19.1 py35_1 conda-forge
rhash 1.3.6 h470a237_1 conda-forge
send2trash 1.5.0 py_0 conda-forge
setuptools 40.4.3 py35_0 conda-forge
simplegeneric 0.8.1 py_1 conda-forge
six 1.11.0 py35_1 conda-forge
snowballstemmer 1.2.1 py_1 conda-forge
sortedcontainers 2.1.0 py_0 conda-forge
sphinx 1.8.1 py35_0 conda-forge
sphinx-markdown-tables 0.0.9 <pip>
sphinx_rtd_theme 0.4.2 py_0 conda-forge
sphinxcontrib-websupport 1.1.0 py_1 conda-forge
sqlite 3.26.0 hb1c47c0_0 conda-forge
tblib 1.3.2 py_1 conda-forge
terminado 0.8.1 py35_1 conda-forge
testpath 0.3.1 py35_1 conda-forge
tk 8.6.9 ha92aebf_0 conda-forge
toolz 0.9.0 py_1 conda-forge
tornado 5.1.1 py35h470a237_0 conda-forge
traitlets 4.3.2 py35_0 conda-forge
urllib3 1.23 py35_1 conda-forge
wcwidth 0.1.7 py_1 conda-forge
webencodings 0.5.1 py_1 conda-forge
wheel 0.32.0 py35_1000 conda-forge
xz 5.2.4 h470a237_1 conda-forge
yaml 0.1.7 h470a237_1 conda-forge
zeromq 4.2.5 hfc679d8_6 conda-forge
zict 0.1.3 py_0 conda-forge
zlib 1.2.11 h470a237_3 conda-forge
Increasing
In [1]: %time import cudf
CPU times: user 556 ms, sys: 204 ms, total: 760 ms
Wall time: 3.39 s
Looking at the profile it looks like we're doing a lot of odd things at import time
367424 function calls (356126 primitive calls) in 3.046 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
572 0.610 0.001 0.621 0.001 <frozen importlib._bootstrap_external>:914(get_data)
3250 0.605 0.000 0.605 0.000 {built-in method posix.stat}
3 0.573 0.191 0.573 0.191 {method 'read' of '_io.BufferedReader' objects}
599 0.296 0.000 0.296 0.000 {built-in method io.open}
597 0.256 0.000 0.256 0.000 {method 'close' of '_io.BufferedReader' objects}
74/73 0.079 0.001 0.082 0.001 {built-in method _imp.create_dynamic}
572 0.076 0.000 0.076 0.000 {built-in method marshal.loads}
1 0.040 0.040 0.041 0.041 nvvm.py:106(__new__)
82 0.039 0.000 0.039 0.000 {built-in method posix.listdir}
2000/1992 0.028 0.000 0.075 0.000 {built-in method builtins.__build_class__}
547 0.019 0.000 0.019 0.000 ffi.py:106(__call__)
2 0.017 0.008 0.017 0.008 {method 'dlopen' of 'CompiledFFI' objects}
74/59 0.015 0.000 0.901 0.015 {built-in method _imp.exec_dynamic}
921 0.015 0.000 0.015 0.000 {method 'sub' of 're.Pattern' objects}
613 0.013 0.000 0.013 0.000 {method 'findall' of 're.Pattern' objects}
572 0.011 0.000 0.011 0.000 {method 'read' of '_io.FileIO' objects}
1093 0.010 0.000 0.400 0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
7 0.009 0.001 0.009 0.001 {built-in method posix.read}
4 0.009 0.002 0.009 0.002 {built-in method _posixsubprocess.fork_exec}
572 0.009 0.000 0.732 0.001 <frozen importlib._bootstrap_external>:793(get_code)
1 0.008 0.008 0.008 0.008 {method 'dot' of 'numpy.ndarray' objects}
211 0.008 0.000 0.008 0.000 sre_compile.py:276(_optimize_charset)
820 0.007 0.000 0.421 0.001 <frozen importlib._bootstrap>:882(_find_spec)
592 0.006 0.000 0.586 0.001 __init__.py:78(open_resource)
4 0.006 0.001 0.006 0.001 {built-in method _ctypes.dlopen}
8994 0.006 0.000 0.006 0.000 {built-in method builtins.hasattr}
Profiled this a bit today and on my end it seems like the issue is inside cudf._cuda.gpu.getDeviceCount().
Profiled this a bit today and on my end it seems like the issue is inside
cudf._cuda.gpu.getDeviceCount().
I doubt that problem is with that function. My guess is that's just the first function that invokes a CUDA runtime API which initializes the context and other first time setup stuff.
Profiled this a bit today and on my end it seems like the issue is inside
cudf._cuda.gpu.getDeviceCount().
Could you dump the profile here by any chance?
Here's the top piece of the profile.
785450 function calls (763821 primitive calls) in 5.156 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 3.047 3.047 3.047 3.047 {cudf._cuda.gpu.getDeviceCount}
120 0.358 0.003 0.358 0.003 {method 'read' of '_io.BufferedReader' objects}
945 0.293 0.000 0.307 0.000 <frozen importlib._bootstrap_external>:914(get_data)
5895 0.280 0.000 0.280 0.000 {built-in method posix.stat}
204/202 0.143 0.001 0.146 0.001 {built-in method _imp.create_dynamic}
945 0.109 0.000 0.109 0.000 {built-in method marshal.loads}
506 0.089 0.000 0.089 0.000 {built-in method posix.listdir}
204/130 0.045 0.000 0.161 0.001 {built-in method _imp.exec_dynamic}
2851/2818 0.038 0.000 0.133 0.000 {built-in method builtins.__build_class__}
4 0.025 0.006 0.025 0.006 {built-in method _ctypes.dlopen}
518 0.024 0.000 0.024 0.000 {built-in method builtins.compile}
1826 0.016 0.000 0.312 0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
561 0.015 0.000 0.015 0.000 ffi.py:112(__call__)
132 0.015 0.000 0.015 0.000 {built-in method io.open}
1145 0.014 0.000 0.014 0.000 {method 'sub' of 're.Pattern' objects}
945 0.014 0.000 0.014 0.000 {method 'read' of '_io.FileIO' objects}
8 0.013 0.002 0.013 0.002 {built-in method posix.read}
805/193 0.013 0.000 0.035 0.000 sre_parse.py:469(_parse)
5 0.012 0.002 0.012 0.002 {built-in method _posixsubprocess.fork_exec}
945 0.011 0.000 0.460 0.000 <frozen importlib._bootstrap_external>:793(get_code)
614 0.010 0.000 0.012 0.000 sre_compile.py:276(_optimize_charset)
501 0.010 0.000 0.010 0.000 {method 'findall' of 're.Pattern' objects}
1215 0.009 0.000 0.333 0.000 <frozen importlib._bootstrap>:882(_find_spec)
16074/16069 0.008 0.000 0.008 0.000 {built-in method builtins.hasattr}
10147 0.008 0.000 0.020 0.000 <frozen importlib._bootstrap_external>:56(_path_join)
18447 0.008 0.000 0.009 0.000 {built-in method builtins.getattr}
1343/1 0.007 0.000 5.158 5.158 {built-in method builtins.exec}
1578 0.007 0.000 0.021 0.000 version.py:198(__init__)
114 0.007 0.000 0.021 0.000 __init__.py:1617(_get)
3039 0.007 0.000 0.010 0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
5 0.007 0.001 0.007 0.001 {built-in method posix.waitpid}
3 0.007 0.002 0.008 0.003 six.py:1(<module>)
51125 0.007 0.000 0.008 0.000 {built-in method builtins.isinstance}
1890 0.007 0.000 0.015 0.000 <frozen importlib._bootstrap_external>:271(cache_from_source)
15643/15615 0.007 0.000 0.011 0.000 abstract.py:121(__hash__)
10147 0.007 0.000 0.009 0.000 <frozen importlib._bootstrap_external>:58(<listcomp>)
2850 0.006 0.000 0.006 0.000 {built-in method __new__ of type object at 0x55ed34c9f240}
349 0.006 0.000 0.007 0.000 templates.py:614(make_overload_template)
1239/1 0.006 0.000 5.158 5.158 <frozen importlib._bootstrap>:978(_find_and_load)
3039 0.006 0.000 0.006 0.000 <frozen importlib._bootstrap>:78(acquire)
1173/1 0.006 0.000 5.157 5.157 <frozen importlib._bootstrap>:663(_load_unlocked)
18919 0.006 0.000 0.006 0.000 sre_parse.py:233(__next)
1161 0.005 0.000 0.026 0.000 <frozen importlib._bootstrap>:504(_init_module_attrs)
1337/180 0.005 0.000 0.018 0.000 sre_compile.py:71(_compile)
1239/1 0.005 0.000 5.158 5.158 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
14699/14694 0.005 0.000 0.006 0.000 {method 'join' of 'str' objects}
40 0.005 0.000 0.023 0.001 castgraph.py:94(propagate)
3039 0.005 0.000 0.005 0.000 <frozen importlib._bootstrap>:103(release)
1149 0.005 0.000 0.006 0.000 <frozen importlib._bootstrap_external>:574(spec_from_file_location)
6132/741 0.004 0.000 1.570 0.002 <frozen importlib._bootstrap>:1009(_handle_fromlist)
31763 0.004 0.000 0.004 0.000 {method 'startswith' of 'str' objects}
10345 0.004 0.000 0.004 0.000 <frozen importlib._bootstrap>:222(_verbose_message)
1208 0.004 0.000 0.317 0.000 <frozen importlib._bootstrap_external>:1240(_get_spec)
16689 0.004 0.000 0.009 0.000 sre_parse.py:254(get)
42 0.004 0.000 0.008 0.000 enum.py:135(__new__)
77 0.004 0.000 0.009 0.000 __init__.py:316(namedtuple)
790 0.003 0.000 0.004 0.000 version.py:343(_cmpkey)
1607 0.003 0.000 0.003 0.000 {method 'search' of 're.Pattern' objects}
945 0.003 0.000 0.005 0.000 <frozen importlib._bootstrap_external>:438(_classify_pyc)
10024 0.003 0.000 0.003 0.000 {method 'rpartition' of 'str' objects}
945/1 0.003 0.000 5.157 5.157 <frozen importlib._bootstrap_external>:722(exec_module)
945 0.003 0.000 0.115 0.000 <frozen importlib._bootstrap_external>:523(_compile_bytecode)
2487 0.003 0.000 0.227 0.000 <frozen importlib._bootstrap_external>:84(_path_is_mode_type)
13657 0.003 0.000 0.006 0.000 {method 'get' of 'dict' objects}
23704/22175 0.003 0.000 0.003 0.000 {built-in method builtins.len}
2986/2855 0.003 0.000 0.004 0.000 {method 'format' of 'str' objects}
680 0.003 0.000 0.150 0.000 __init__.py:2126(distributions_from_metadata)
22762 0.003 0.000 0.003 0.000 {method 'rstrip' of 'str' objects}
9152 0.003 0.000 0.004 0.000 sre_parse.py:164(__getitem__)
5258 0.003 0.000 0.250 0.000 <frozen importlib._bootstrap_external>:74(_path_stat)
689 0.003 0.000 0.003 0.000 functions.py:87(__init__)
1149 0.003 0.000 0.010 0.000 <frozen importlib._bootstrap_external>:1351(_get_spec)
1161 0.003 0.000 0.005 0.000 <frozen importlib._bootstrap>:318(__exit__)
29445 0.003 0.000 0.003 0.000 {method 'append' of 'list' objects}
17714 0.003 0.000 0.003 0.000 abstract.py:99(key)
1161/1158 0.003 0.000 0.177 0.000 <frozen importlib._bootstrap>:576(module_from_spec)
346 0.003 0.000 0.063 0.000 __init__.py:2585(from_location)
1568/435 0.003 0.000 0.003 0.000 sre_parse.py:174(getwidth)
501 0.003 0.000 0.029 0.000 textwrap.py:414(dedent)
1154 0.003 0.000 0.069 0.000 re.py:271(_compile)
2 0.002 0.001 0.002 0.001 {method 'poll' of 'select.poll' objects}
3626 0.002 0.000 0.004 0.000 typing.py:704(__setattr__)
776 0.002 0.000 0.004 0.000 functools.py:37(update_wrapper)
7138 0.002 0.000 0.003 0.000 {built-in method builtins.setattr}
788 0.002 0.000 0.006 0.000 version.py:131(_legacy_cmpkey)
1800 0.002 0.000 0.010 0.000 <frozen importlib._bootstrap>:194(_lock_unlock_module)
131 0.002 0.000 0.003 0.000 templates.py:816(make_overload_attribute_template)
923 0.002 0.000 0.003 0.000 posixpath.py:75(join)
18763/18019 0.002 0.000 0.002 0.000 {built-in method builtins.hash}
1149 0.002 0.000 0.012 0.000 <frozen importlib._bootstrap_external>:369(_get_cached)
150 0.002 0.000 0.005 0.000 <frozen importlib._bootstrap_external>:1190(_path_hooks)
945 0.002 0.000 0.002 0.000 {built-in method _imp._fix_co_filename}
3092 0.002 0.000 0.003 0.000 {built-in method builtins.any}
2519 0.002 0.000 0.003 0.000 <frozen importlib._bootstrap>:416(parent)
4454 0.002 0.000 0.002 0.000 {method 'match' of 're.Pattern' objects}
1890 0.002 0.000 0.003 0.000 <frozen importlib._bootstrap_external>:62(_path_split)
2032 0.002 0.000 0.007 0.000 castgraph.py:41(insert)
580/180 0.002 0.000 0.036 0.000 sre_parse.py:411(_parse_sub)
481 0.002 0.000 0.007 0.000 typing.py:603(__init__)
3434 0.002 0.000 0.003 0.000 version.py:114(_parse_version_parts)
8269 0.002 0.000 0.002 0.000 {method 'group' of 're.Match' objects}
2835 0.002 0.000 0.003 0.000 <frozen importlib._bootstrap_external>:51(_r_long)
2094 0.002 0.000 0.014 0.000 <frozen importlib._bootstrap>:403(cached)
454 0.002 0.000 0.019 0.000 __init__.py:1327(safe_version)
1226 0.002 0.000 0.002 0.000 <frozen importlib._bootstrap>:176(cb)
2306/2299 0.002 0.000 0.002 0.000 abstract.py:124(__eq__)
3816 0.002 0.000 0.002 0.000 <frozen importlib._bootstrap>:859(__exit__)
960 0.001 0.000 0.004 0.000 typing.py:113(_type_check)
612 0.001 0.000 0.002 0.000 _inspect.py:67(getargs)
4435 0.001 0.000 0.001 0.000 {method 'endswith' of 'str' objects}
2215 0.001 0.000 0.227 0.000 <frozen importlib._bootstrap_external>:93(_path_isfile)
706/50 0.001 0.000 1.575 0.031 {built-in method builtins.__import__}
1226 0.001 0.000 0.002 0.000 <frozen importlib._bootstrap>:58(__init__)
680/342 0.001 0.000 0.002 0.000 {built-in method _abc._abc_subclasscheck}
1025 0.001 0.000 0.001 0.000 {method 'split' of 're.Pattern' objects}
75 0.001 0.000 1.800 0.024 __init__.py:1(<module>)
1239 0.001 0.000 0.012 0.000 <frozen importlib._bootstrap>:147(__enter__)
152/132 0.001 0.000 0.023 0.000 {built-in method builtins.sorted}
1410/1302 0.001 0.000 0.018 0.000 typing.py:248(inner)
4300 0.001 0.000 0.002 0.000 version.py:65(_compare)
818/815 0.001 0.000 0.009 0.000 abstract.py:63(__call__)
945 0.001 0.000 0.002 0.000 <frozen importlib._bootstrap_external>:471(_validate_timestamp_pyc)
6786 0.001 0.000 0.001 0.000 {built-in method posix.fspath}
5184 0.001 0.000 0.001 0.000 {built-in method builtins.min}
8223 0.001 0.000 0.001 0.000 {built-in method _imp.acquire_lock}
1910 0.001 0.000 0.007 0.000 <frozen importlib._bootstrap_external>:1203(_path_importer_cache)
1388 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap>:369(__init__)
346 0.001 0.000 0.003 0.000 __init__.py:686(add)
279 0.001 0.000 0.001 0.000 {built-in method _abc._abc_init}
1208 0.001 0.000 0.318 0.000 <frozen importlib._bootstrap_external>:1272(find_spec)
950 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap>:35(_new_module)
3118 0.001 0.000 0.001 0.000 version.py:207(<genexpr>)
454 0.001 0.000 0.003 0.000 version.py:236(__str__)
886 0.001 0.000 0.002 0.000 __init__.py:2644(key)
2921 0.001 0.000 0.001 0.000 {built-in method from_bytes}
286 0.001 0.000 0.005 0.000 overrides.py:72(verify_matching_signatures)
1768/1 0.001 0.000 5.156 5.156 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
8223 0.001 0.000 0.001 0.000 {built-in method _imp.release_lock}
818 0.001 0.000 0.004 0.000 abstract.py:51(_intern)
945 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap_external>:401(_check_name_wrapper)
590 0.001 0.000 0.041 0.000 __init__.py:2772(_get_metadata)
496 0.001 0.000 0.002 0.000 posixpath.py:154(dirname)
349 0.001 0.000 0.002 0.000 extending.py:57(overload)
584 0.001 0.000 0.002 0.000 enum.py:70(__setitem__)
344/238 0.001 0.000 0.009 0.000 typing.py:340(__getitem__)
3626 0.001 0.000 0.002 0.000 typing.py:590(_is_dunder)
66 0.001 0.000 0.001 0.000 templates.py:667(make_intrinsic_template)
3816 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap>:855(__enter__)
349 0.001 0.000 0.012 0.000 extending.py:111(decorate)
1124 0.001 0.000 0.015 0.000 version.py:24(parse)
62 0.001 0.000 0.001 0.000 {built-in method cupy.core._kernel.create_ufunc}
360 0.001 0.000 0.180 0.000 __init__.py:2039(find_on_path)
1708 0.001 0.000 0.002 0.000 __init__.py:2071(dist_factory)
180 0.001 0.000 0.064 0.000 sre_compile.py:759(compile)
945 0.001 0.000 0.011 0.000 <frozen importlib._bootstrap_external>:951(path_stats)
1533/1102 0.001 0.000 0.001 0.000 typing.py:664(__hash__)
6095 0.001 0.000 0.001 0.000 {built-in method _thread.get_ident}
1161 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap>:311(__enter__)
471 0.001 0.000 0.002 0.000 copy.py:268(_reconstruct)
2268 0.001 0.000 0.001 0.000 {method 'split' of 'str' objects}
150 0.001 0.000 0.025 0.000 <frozen importlib._bootstrap_external>:1404(_fill_cache)
8 0.001 0.000 0.001 0.000 {method 'read' of '_io.TextIOWrapper' objects}
2032 0.001 0.000 0.003 0.000 castgraph.py:50(get)
686 0.001 0.000 0.001 0.000 genericpath.py:121(_splitext)
2044 0.001 0.000 0.002 0.000 castgraph.py:67(__getitem__)
4297 0.001 0.000 0.001 0.000 sre_parse.py:249(match)
1239 0.001 0.000 0.004 0.000 <frozen importlib._bootstrap>:151(__exit__)
472 0.001 0.000 0.003 0.000 copy.py:66(copy)
316 0.001 0.000 0.026 0.000 overrides.py:166(decorator)
180 0.001 0.000 0.006 0.000 sre_compile.py:536(_compile_info)
1766 0.001 0.000 0.001 0.000 {method 'update' of 'dict' objects}
2247 0.001 0.000 0.001 0.000 {method 'rfind' of 'str' objects}
3058 0.001 0.000 0.002 0.000 {method 'add' of 'set' objects}
1208 0.001 0.000 0.001 0.000 {built-in method _imp.is_frozen}
807 0.001 0.000 0.002 0.000 ffi.py:56(__getattr__)
2229 0.001 0.000 0.003 0.000 abstract.py:127(__ne__)
355 0.001 0.000 0.001 0.000 ntpath.py:122(splitdrive)
686 0.001 0.000 0.002 0.000 posixpath.py:121(splitext)
213 0.001 0.000 0.001 0.000 {method 'splitlines' of 'str' objects}
335 0.001 0.000 0.001 0.000 pyparsing.py:1144(__init__)
4642 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap>:321(<genexpr>)
4 0.001 0.000 0.001 0.000 {method 'readlines' of '_io._IOBase' objects}
1239 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap>:143(__init__)
630 0.001 0.000 0.001 0.000 enum.py:376(__setattr__)
1080 0.001 0.000 0.001 0.000 {method 'extend' of 'list' objects}
355 0.001 0.000 0.003 0.000 __init__.py:1494(_validate_resource_path)
150 0.001 0.000 0.002 0.000 <frozen importlib._bootstrap_external>:1319(__init__)
614 0.001 0.000 0.001 0.000 sre_compile.py:249(_compile_charset)
2172 0.001 0.000 0.001 0.000 posixpath.py:41(_get_sep)
353 0.001 0.000 0.001 0.000 posixpath.py:144(basename)
945 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap_external>:884(__init__)
2594 0.001 0.000 0.001 0.000 sre_parse.py:172(append)
616 0.001 0.000 0.003 0.000 _inspect.py:98(getargspec)
Was this on a DGX-1 as previously reported?
The first CUDA call will initialise the context, as Jake pointed out. The time will be higher on multiGPU machines. There is also a cost that is proportional to the sysmem size, I believe.
Loading all the libcudf device code (currently near 300MB) to each GPU is also a substantial startup cost. Eliminating legacy should help this. Also we could either move to device-side dispatch where it makes sense (binops, reductions), and/or do more JIT.
@harrism running import cudf doesn't actually initialize a CUDA context or copy the device code to device memory. It does initialize the driver in validating the GPU architecture / driver version / toolkit version and it looks like that's taking ~3s per above. There's an additional ~2s just in Python imports but I'm not seeing anything stick out in the profile above that's immediately actionable.
What does getDeviceCount call?
What does
getDeviceCountcall?
cudaGetDeviceCount: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g18808e54893cfcaafefeab31a73cc55f
This apparently doesn't actually create a context though.
This was on a DGX-1. I have noticed that subsequent imports don't take nearly as long. Excluding the three seconds up front to call that cuda function, my profile looks fairly similar to the original one in this thread. So I'm actually not sure if 3 seconds spent inside getDeviceCount is what the issue was originally raised about, it might have been the rest of the time spent modulo the cuda work.
This apparently doesn't actually create a context though.
Where do the docs say that?
This apparently doesn't actually create a context though.
Where do the docs say that?
They don't, an internal slack thread @jrhemstad had with CUDA developers confirmed it though. I can try to dig up a link if you're interested.
OK, then next question: can we find out if all of the 3s in the profile for that call are in the CUDA runtime/driver? (i.e. is there any Numba overhead?)
OK, then next question: can we find out if all of the 3s in the profile for that call are in the CUDA runtime/driver? (i.e. is there any Numba overhead?)
These aren't called via Numba, we wrote our own Cython bindings here where there should be basically zero overhead.
Sorry, misunderstood. Well, I'm guessing the runtime is doing some sort of initialization at that call...
Sorry, misunderstood. Well, I'm guessing the runtime is doing some sort of initialization at that call...
From my understanding it does the driver API initialization that happens implicitly when a context is created. Basically a cuInit(0) call. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__INITIALIZE.html
Can you try compiling this? Compile this code:
#include <stdio.h>
#include <chrono>
int main()
{
int numgpus=0;
printf("starting\n");
for (int i = 0; i < 2; i++) {
auto start = std::chrono::high_resolution_clock::now();
cudaGetDeviceCount(&numgpus);
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
printf("iter: %d: GPUs: %d, elapsed time: %lf\n", i, numgpus, elapsed_seconds.count());
}
return 0;
}
Using nvcc test.cu
When I run on a DGX I get the following. Second result is limiting visible devices to a single GPU (it's faster).
(base) mharris@dgx02:~$ ./a.out
starting
iter: 0: GPUs: 8, elapsed time: 0.394263
iter: 1: GPUs: 8, elapsed time: 0.000000
(base) mharris@dgx02:~$ CUDA_VISIBLE_DEVICES=3 ./a.out
starting
iter: 0: GPUs: 1, elapsed time: 0.079943
iter: 1: GPUs: 1, elapsed time: 0.000000
So with 8 GPUs it takes .39s -- not 3s, but still significant. And the second time it takes less than a microsecond. It must be doing something. I think that something is initializing the driver context. And if libcudf is linked into the module, then its 300MB fatbin is probably being loaded.
You can try on your DGX to see if its different.
And if libcudf is linked into the module, then its 300MB fatbin is probably being loaded.
libcudf is absolutely linked into the module.
There were some untyped variables in those Cython bindings. Doubt that matters much, but it may affect this a little bit and is easy to fix. Sent PR ( https://github.com/rapidsai/cudf/pull/4925 ) to type them.
Reworked the original proposal as there were some device functions that needed some changes to handle their errors respectively. Otherwise largely the same idea. Submitted as PR ( https://github.com/rapidsai/cudf/pull/4943 ).
Retested this and it appears to be ~2s now. Longer or shorter depending on the run. Occasionally see an outlier.
Retested this and it appears to be ~2s now. Longer or shorter depending on the run. Occasionally see an outlier.
What kind of machine did you test this on? I suspect this scales somewhat linearly with the number of visible GPUs due to the driver initialization.
DGX-1. Were the previous measurements all on DGX-1s or did some of them use other machines?
FWIW the import time appears to be the same if I restrict to a single device. Tested the following on a DGX-1.
All GPUs:
$ ipython
Python 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.14.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: %time import cudf
CPU times: user 4.28 s, sys: 4.02 s, total: 8.29 s
Wall time: 2.12 s
In [2]: cudf._cuda.gpu.getDeviceCount()
Out[2]: 8
One GPU:
$ CUDA_VISIBLE_DEVICES=0 ipython
Python 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.14.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: %time import cudf
CPU times: user 4.28 s, sys: 3.57 s, total: 7.85 s
Wall time: 2.14 s
In [2]: cudf._cuda.gpu.getDeviceCount()
Out[2]: 1
Again there is some variability above and below 2s. However this is representative of what I'm seeing.
WIth 0.14 and legacy removed, I thought this might have improved... I tried on my linux desktop with a single V100 GPU and importing cudf took 776 ms. But I also tried running in a container on a DGX-1 with only one of the V100s visible and it took 15.5s! Could being in a container make it slow?
WIth 0.14 and legacy removed, I thought this might have improved... I tried on my linux desktop with a single V100 GPU and importing cudf took 776 ms. But I also tried running in a container on a DGX-1 with only one of the V100s visible and it took 15.5s! Could being in a container make it slow?
That may be that the DGX didn't have persistence mode enabled and it took a long time for the driver to initialize because the GPUs had to "wake up".
Persistence mode is on.
More data points:
%time from numba import cuda: 1.04s%time import rmm: 1.29sI tried importing cudf on the same DGX outside of my rapids-compose container, but presumably my installation isn't quite right because I'm building from source, and it couldn't find a numba submodule. HOWEVER, before it hit that error, I counted for about 12 seconds. That numba loading was inside RMM, which is imported at the beginning of cuDF's __init__.py.
Not sure if that tells us anything.
Most helpful comment
I doubt that problem is with that function. My guess is that's just the first function that invokes a CUDA runtime API which initializes the context and other first time setup stuff.