Describe the bug
This issue is meant to supersede dask#5322 as the official forum for discussing problems with cupy + cudf + dask compatibility.
The problem is best illustrated by this "hang-reproducer" gist (mostly copied below to make the discussion in this issue self-contained). In many cases, when using the standard multi-threaded scheduler in dask, the execution of the task graph will "hang" when a cupy function is used (even if the "use" is trivial).
Steps/Code to reproduce bug
The following code snippet can be used to reproduce the "hang":
import dask
from dask.threaded import get
from dask.base import tokenize
from toolz import merge
import cudf
import cupy
import numpy as np

ddf = dask.datasets.timeseries(seed=42)
gddf = ddf.map_partitions(cudf.from_pandas)

def _percentiles_summary(df):
    x = cupy.array([])  # <-- Problematic Line
    vals_and_weights = (np.array([1004.0, 1004.0]), np.array([50.0, 50.0]))
    return vals_and_weights

def _partition_quantiles(df, npartitions):
    def _dtype_info(df):
        return df.dtype, None

    def _combine(sequence_of_data):
        return sequence_of_data

    token = tokenize(df)
    df_keys = df.__dask_keys__()
    name0 = "re-quantiles-0-" + token
    dtype_dsk = {(name0, 0): (_dtype_info, df_keys[0])}
    name1 = "re-quantiles-1-" + token
    val_dsk = {
        (name1, i): (_percentiles_summary, key)
        for i, key in enumerate(df_keys)
    }
    val_dsk["combine"] = (
        _combine, [(name1, i) for i, key in enumerate(df_keys)]
    )
    return merge(df.dask, dtype_dsk, val_dsk)

dsk = _partition_quantiles(gddf["id"], gddf.npartitions)
%timeit get(dsk, "combine")
Note that the problem is caused by x = cupy.array([]) (there is no problem after removing this line). Also, the code will sometimes run once without any problems, but the hang will always occur when the graph is executed within a loop (hence the use of %timeit).
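For anyone who wants to check for the hang outside of IPython (where %timeit is unavailable), here is a minimal sketch of a watchdog. It runs a script in a child process rather than a thread, on the assumption that a deadlock involving the GIL could also stall a watchdog thread in the same process. The helper name `script_hangs` and the timeout values are hypothetical, not part of the original reproducer.

```python
import subprocess
import sys


def script_hangs(source, timeout):
    """Run `source` in a fresh interpreter.

    Returns True if the child is still running after `timeout` seconds,
    False if it finishes in time. A crash raises CalledProcessError, so a
    failure is not mistaken for a hang.
    """
    try:
        subprocess.run([sys.executable, "-c", source], timeout=timeout, check=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return True
    return False


if __name__ == "__main__":
    # Stand-in scripts; the real reproducer would replace these strings.
    print(script_hangs("import time; time.sleep(30)", timeout=1))
    print(script_hangs("pass", timeout=30))
```

To use this with the snippet above, the `%timeit get(dsk, "combine")` line would need to be replaced with a plain loop (for example, `for _ in range(10): get(dsk, "combine")`; the count of 10 is arbitrary) and the whole script passed in as `source`.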
Expected behavior
In the snippet above, the behavior should be the same with and without the x = cupy.array([]) line.
Environment details
***git***
commit 2d6e14d8b00362093bddf815d26662c5a05f8500 (HEAD -> scatter-api, origin/scatter-api)
Merge: 3a50322 65268e7
Author: Richard (Rick) Zamora <[email protected]>
Date: Tue Sep 17 12:39:29 2019 -0500
Merge branch 'branch-0.10' into scatter-api
***git submodules***
b165e1fb11eeea64ccf95053e40f2424312599cc thirdparty/cub (v1.7.1)
63f644be44201467e3938d59ed9d89cc8725c35d thirdparty/jitify (remotes/origin/feature/api_v2)
***OS Information***
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2018-03-20"
DGX_SWBUILD_VERSION="3.1.6"
DGX_COMMIT_ID="1b0f58ecbf989820ce745a9e4836e1de5eea6cfd"
DGX_SERIAL_NUMBER=QTFCOU822000C
DGX_OTA_VERSION="3.1.7"
DGX_OTA_DATE="Mon Jul 2 18:36:07 PDT 2018"
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Linux dgx15 4.4.0-135-generic #161-Ubuntu SMP Mon Aug 27 10:45:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
***GPU Information***
Tue Sep 17 11:09:07 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |
| N/A 36C P0 56W / 300W | 629MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:07:00.0 Off | 0 |
| N/A 34C P0 44W / 300W | 11MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:0A:00.0 Off | 0 |
| N/A 35C P0 43W / 300W | 11MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:0B:00.0 Off | 0 |
| N/A 32C P0 43W / 300W | 11MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000000:85:00.0 Off | 0 |
| N/A 36C P0 57W / 300W | 517MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 |
| N/A 35C P0 45W / 300W | 11MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |
| N/A 36C P0 43W / 300W | 11MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 |
| N/A 32C P0 42W / 300W | 11MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 28951 C ...iniconda3/envs/cudf_bugfixes/bin/python 618MiB |
| 4 66094 C ...iniconda3/envs/cudf_bugfixes/bin/python 506MiB |
+-----------------------------------------------------------------------------+
***CPU***
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 3480.640
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
BogoMIPS: 4392.10
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 51200K
NUMA node0 CPU(s): 0-19,40-59
NUMA node1 CPU(s): 20-39,60-79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d
***CMake***
/home/nfs/rzamora/miniconda3/envs/cudf_bugfixes/bin/cmake
cmake version 3.15.2
CMake suite maintained and supported by Kitware (kitware.com/cmake).
***g++***
/usr/bin/g++
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
***nvcc***
***Python***
/home/nfs/rzamora/miniconda3/envs/cudf_bugfixes/bin/python
Python 3.7.3
***Environment Variables***
PATH : /home/nfs/rzamora/.vscode-server-insiders/bin/a39d2de39dd5038a1a696800ac9af6dc32a31eab/bin:/home/nfs/rzamora/bin:/home/nfs/rzamora/.local/bin:/home/nfs/rzamora/miniconda3/envs/cudf_bugfixes/bin:/home/nfs/rzamora/miniconda3/condabin:/home/nfs/rzamora/.vscode-server-insiders/bin/a39d2de39dd5038a1a696800ac9af6dc32a31eab/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
LD_LIBRARY_PATH :
NUMBAPRO_NVVM : /usr/local/cuda-9.2/nvvm/lib64/libnvvm.so
NUMBAPRO_LIBDEVICE : /usr/local/cuda-9.2/nvvm/libdevice
CONDA_PREFIX : /home/nfs/rzamora/miniconda3/envs/cudf_bugfixes
PYTHON_PATH :
***conda packages***
/home/nfs/rzamora/miniconda3/condabin/conda
# packages in environment at /home/nfs/rzamora/miniconda3/envs/cudf_bugfixes:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
aiohttp 3.6.0 pypi_0 pypi
alabaster 0.7.12 py_0 conda-forge
appdirs 1.4.3 py_1 conda-forge
arrow-cpp 0.14.1 py37h6b969ab_1 conda-forge
asn1crypto 0.24.0 py37_1003 conda-forge
aspy.yaml 1.3.0 py_0 conda-forge
async-timeout 3.0.1 pypi_0 pypi
atomicwrites 1.3.0 py_0 conda-forge
attrs 19.1.0 py_0 conda-forge
aws-sam-translator 1.14.0 py37_0 conda-forge
aws-xray-sdk 0.95 py_0 conda-forge
babel 2.7.0 py_0 conda-forge
backcall 0.1.0 py_0 conda-forge
backports 1.0 py_2 conda-forge
backports.tempfile 1.0 py_0 conda-forge
backports.weakref 1.0.post1 py37_1000 conda-forge
black 19.3b0 py_0
bleach 3.1.0 py_0 conda-forge
bokeh 1.3.4 py37_0 conda-forge
boost-cpp 1.70.0 h8e57a91_2 conda-forge
boto 2.49.0 py_0 conda-forge
boto3 1.9.220 py_0 conda-forge
botocore 1.12.220 py_0 conda-forge
brotli 1.0.7 he1b5a44_1000 conda-forge
bzip2 1.0.8 h516909a_0 conda-forge
c-ares 1.15.0 h516909a_1001 conda-forge
ca-certificates 2019.5.15 1
cached-property 1.5.1 py_0 conda-forge
certifi 2019.6.16 py37_1
cffi 1.12.3 py37h8022711_0 conda-forge
cfgv 2.0.1 py_0 conda-forge
cfn-lint 0.23.5 py37_0 conda-forge
chardet 3.0.4 py37_1003 conda-forge
click 7.0 py_0 conda-forge
cloudpickle 1.2.1 py_0 conda-forge
cmake 3.15.2 hf94ab9c_0 conda-forge
commonmark 0.9.0 py_0 conda-forge
cookies 2.2.1 py_0 conda-forge
cryptography 2.7 py37h72c5cf5_0 conda-forge
cudatoolkit 9.2 0
cudf 0.10.0a0+1424.g24f354d.dirty dev_0 <develop>
cudnn 7.6.0 cuda9.2_0
cupy 6.0.0 py37hc15394e_0
curl 7.65.3 hf8cf82a_0 conda-forge
cython 0.29.13 py37he1b5a44_0 conda-forge
cytoolz 0.10.0 py37h516909a_0 conda-forge
dask 1.2.1+254.g558e11c dev_0 <develop>
dask-core 2.3.0 py_0
dask-cudf 0.10.0a0+1424.g24f354d.dirty dev_0 <develop>
decorator 4.4.0 py_0 conda-forge
defusedxml 0.5.0 py_1 conda-forge
distributed 2.3.2+14.g7a1a369 pypi_0 pypi
dlpack 0.2 he1b5a44_0 conda-forge
docker-py 4.0.2 py37_0 conda-forge
docker-pycreds 0.4.0 py_0 conda-forge
docutils 0.15.2 py37_0 conda-forge
double-conversion 3.1.5 he1b5a44_1 conda-forge
ecdsa 0.13 py_0 conda-forge
editdistance 0.5.3 py37hf484d3e_0 conda-forge
entrypoints 0.3 py37_1000 conda-forge
expat 2.2.5 he1b5a44_1003 conda-forge
fastavro 0.22.4 py37h516909a_0 conda-forge
fastrlock 0.4 py37he6710b0_0
flake8 3.7.7 py37_0
flask 1.1.1 py_1 conda-forge
flatbuffers 1.11.0 he1b5a44_0 conda-forge
freetype 2.10.0 he983fc9_1 conda-forge
fsspec 0.4.4 py_0 conda-forge
future 0.17.1 py37_1000 conda-forge
gflags 2.2.2 he1b5a44_1001 conda-forge
glog 0.4.0 he1b5a44_1 conda-forge
gmp 6.1.2 hf484d3e_1000 conda-forge
grpc-cpp 1.23.0 h18db393_0 conda-forge
heapdict 1.0.0 py37_1000 conda-forge
httpretty 0.9.6 py_0 conda-forge
hypothesis 4.34.0 py37_0 conda-forge
icu 64.2 he1b5a44_1 conda-forge
identify 1.4.7 py_0 conda-forge
idna 2.8 py37_1000 conda-forge
imagesize 1.1.0 py_0 conda-forge
importlib_metadata 0.20 py37_0 conda-forge
ipykernel 5.1.2 py37h5ca1d4c_0 conda-forge
ipython 7.8.0 py37h5ca1d4c_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
isort 4.3.21 py37_0
itsdangerous 1.1.0 py_0 conda-forge
jedi 0.15.1 py37_0 conda-forge
jinja2 2.10.1 py_0 conda-forge
jmespath 0.9.4 py_0 conda-forge
jpeg 9c h14c3975_1001 conda-forge
json5 0.8.5 py_0
jsondiff 1.1.2 py_0 conda-forge
jsonpatch 1.24 py_0 conda-forge
jsonpickle 1.2 py_0 conda-forge
jsonpointer 2.0 py_0 conda-forge
jsonschema 3.0.2 py37_0 conda-forge
jupyter-server-proxy 1.1.0 pypi_0 pypi
jupyter_client 5.3.1 py_0 conda-forge
jupyter_core 4.4.0 py_0 conda-forge
jupyterlab 1.0.2 py37hf63ae98_0
jupyterlab-nvdashboard 0.1.9 pypi_0 pypi
jupyterlab_server 1.0.0 py_1
krb5 1.16.3 h05b26f9_1001 conda-forge
libblas 3.8.0 12_openblas conda-forge
libcblas 3.8.0 12_openblas conda-forge
libcurl 7.65.3 hda55be3_0 conda-forge
libedit 3.1.20170329 hf8c457e_1001 conda-forge
libevent 2.1.10 h72c5cf5_0 conda-forge
libffi 3.2.1 he1b5a44_1006 conda-forge
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
liblapack 3.8.0 12_openblas conda-forge
libnvstrings 0.9.0 cuda9.2_0 rapidsai
libopenblas 0.3.7 h6e990d7_1 conda-forge
libpng 1.6.37 hed695b0_0 conda-forge
libprotobuf 3.8.0 h8b12597_0 conda-forge
librmm 0.9.0 cuda9.2_0 rapidsai
libsodium 1.0.17 h516909a_0 conda-forge
libssh2 1.8.2 h22169c7_2 conda-forge
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.0.10 h57b8799_1003 conda-forge
libuv 1.31.0 h516909a_0 conda-forge
llvmlite 0.29.0 py37hfd453ef_1 conda-forge
locket 0.2.0 py_2 conda-forge
lz4-c 1.8.3 he1b5a44_1001 conda-forge
markdown 2.6.11 pypi_0 pypi
markupsafe 1.1.1 py37h14c3975_0 conda-forge
mccabe 0.6.1 py_1 conda-forge
mistune 0.8.4 py37h14c3975_1000 conda-forge
mock 3.0.5 py37_0 conda-forge
more-itertools 7.2.0 py_0 conda-forge
moto 1.3.8 py_1 conda-forge
msgpack-python 0.6.1 py37h6bb024c_0 conda-forge
multidict 4.5.2 pypi_0 pypi
nbconvert 5.6.0 py37_1 conda-forge
nbformat 4.4.0 py_1 conda-forge
nbsphinx 0.4.2 py_0 conda-forge
nccl 1.3.5 cuda9.2_0
ncurses 6.1 hf484d3e_1002 conda-forge
nodeenv 1.3.3 py_0 conda-forge
nodejs 10.13.0 he6710b0_0
notebook 6.0.1 py37_0 conda-forge
numba 0.45.1 py37hb3f55d8_0 conda-forge
numpy 1.17.1 py37h95a1406_0 conda-forge
numpydoc 0.9.1 py_0 conda-forge
nvstrings 0.9.0 py37_0 rapidsai
olefile 0.46 py_0 conda-forge
openssl 1.1.1d h7b6447c_1
packaging 19.0 py_0 conda-forge
pandas 0.24.2 py37hb3f55d8_0 conda-forge
pandoc 1.19.2 0 conda-forge
pandocfilters 1.4.2 py_1 conda-forge
parquet-cpp 1.5.1 2 conda-forge
parso 0.5.1 py_0 conda-forge
partd 1.0.0 py_0 conda-forge
pexpect 4.7.0 py37_0 conda-forge
pickleshare 0.7.5 py37_1000 conda-forge
pillow 6.1.0 py37h6b7be26_1 conda-forge
pip 19.2.3 py37_0 conda-forge
pluggy 0.12.0 py_0 conda-forge
pre_commit 1.18.1 py37_0 conda-forge
prometheus_client 0.7.1 py_0 conda-forge
prompt_toolkit 2.0.9 py_0 conda-forge
psutil 5.6.3 py37h516909a_0 conda-forge
ptyprocess 0.6.0 py_1001 conda-forge
py 1.8.0 py_0 conda-forge
pyarrow 0.14.1 py37h8b68381_0 conda-forge
pycodestyle 2.5.0 py_0 conda-forge
pycparser 2.19 py37_1 conda-forge
pycryptodome 3.8.2 py37he80fd80_0 conda-forge
pyflakes 2.1.1 py_0 conda-forge
pygments 2.4.2 py_0 conda-forge
pynvml 8.0.3 pypi_0 pypi
pyopenssl 19.0.0 py37_0 conda-forge
pyparsing 2.4.2 py_0 conda-forge
pyrsistent 0.15.4 py37h516909a_0 conda-forge
pysocks 1.7.0 py37_0 conda-forge
pytest 5.1.2 py37_0 conda-forge
python 3.7.3 h33d41f4_1 conda-forge
python-dateutil 2.8.0 py_0 conda-forge
python-jose 2.0.2 py_0 conda-forge
pytz 2019.2 py_0 conda-forge
pyyaml 5.1.2 py37h516909a_0 conda-forge
pyzmq 18.0.2 py37h1768529_2 conda-forge
rapidjson 1.1.0 he1b5a44_1002 conda-forge
re2 2019.09.01 he1b5a44_0 conda-forge
readline 8.0 hf8c457e_0 conda-forge
recommonmark 0.6.0 py_0 conda-forge
requests 2.22.0 py37_1 conda-forge
responses 0.9.0 py_0 conda-forge
rhash 1.3.6 h14c3975_1001 conda-forge
rmm 0.9.0 py37_0 rapidsai
s3fs 0.3.4 py_0 conda-forge
s3transfer 0.2.1 py37_0 conda-forge
send2trash 1.5.0 py_0 conda-forge
setuptools 41.2.0 py37_0 conda-forge
simpervisor 0.3 pypi_0 pypi
six 1.12.0 py37_1000 conda-forge
snappy 1.1.7 he1b5a44_1002 conda-forge
snowballstemmer 1.9.0 py_0 conda-forge
sortedcontainers 2.1.0 py_0 conda-forge
sphinx 2.2.0 py_0 conda-forge
sphinx-markdown-tables 0.0.9 pypi_0 pypi
sphinx_rtd_theme 0.4.3 py_0 conda-forge
sphinxcontrib-applehelp 1.0.1 py_0 conda-forge
sphinxcontrib-devhelp 1.0.1 py_0 conda-forge
sphinxcontrib-htmlhelp 1.0.2 py_0 conda-forge
sphinxcontrib-jsmath 1.0.1 py_0 conda-forge
sphinxcontrib-qthelp 1.0.2 py_0 conda-forge
sphinxcontrib-serializinghtml 1.1.1 py_0 conda-forge
sphinxcontrib-websupport 1.1.2 py_0 conda-forge
sqlite 3.29.0 hcee41ef_1 conda-forge
tblib 1.4.0 py_0 conda-forge
terminado 0.8.2 py37_0 conda-forge
testpath 0.4.2 py_1001 conda-forge
thrift-cpp 0.12.0 hf3afdfd_1004 conda-forge
tk 8.6.9 hed695b0_1002 conda-forge
toml 0.10.0 py_0 conda-forge
toolz 0.10.0 py_0 conda-forge
tornado 6.0.3 py37h516909a_0 conda-forge
traitlets 4.3.2 py37_1000 conda-forge
uriparser 0.9.3 he1b5a44_1 conda-forge
urllib3 1.25.3 py37_0 conda-forge
virtualenv 16.7.5 py_0 conda-forge
wcwidth 0.1.7 py_1 conda-forge
webencodings 0.5.1 py_1 conda-forge
websocket-client 0.56.0 py37_0 conda-forge
werkzeug 0.15.5 py_0 conda-forge
wheel 0.33.6 py37_0 conda-forge
wrapt 1.11.2 py37h516909a_0 conda-forge
xmltodict 0.12.0 py_0 conda-forge
xz 5.2.4 h14c3975_1001 conda-forge
yaml 0.1.7 h14c3975_1001 conda-forge
yarl 1.3.0 pypi_0 pypi
zeromq 4.3.2 he1b5a44_2 conda-forge
zict 1.0.0 py_0 conda-forge
zipp 0.6.0 py_0 conda-forge
zlib 1.2.11 h516909a_1005 conda-forge
zstd 1.4.0 h3b9ef0a_0 conda-forge
Additional context
Although the use of cupy is trivial/unnecessary in the code snippet above, we want to use cupy in practice to avoid device-host transfers.
cc @pentschev @brandon-b-miller @quasiben
Well, I have good news for this thread. I went to do some more debugging, and I've found this with gdb:
(gdb) t 20
[Switching to thread 20 (Thread 0x7f1070ff9700 (LWP 8248))]
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
225 ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No such file or directory.
(gdb) bt
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1 0x000055e9a0ea82d8 in PyCOND_TIMEDWAIT (cond=0x55e9a10daa38 <_PyRuntime+1208>,
mut=0x55e9a10daa68 <_PyRuntime+1256>, us=5000)
at /home/conda/feedstock_root/build_artifacts/python_1562015400360/work/Python/condvar.h:90
#2 take_gil (tstate=0x55e9bbc4fb00)
at /home/conda/feedstock_root/build_artifacts/python_1562015400360/work/Python/ceval_gil.h:208
#3 PyEval_RestoreThread () at /home/conda/feedstock_root/build_artifacts/python_1562015400360/work/Python/ceval.c:271
#4 0x000055e9a0f77970 in PyGILState_Ensure ()
at /home/conda/feedstock_root/build_artifacts/python_1562015400360/work/Python/pystate.c:1067
#5 0x00007f113ccae9d7 in _CallPythonObject (pArgs=0x7f1070ff4e10, flags=4353, converters=0x7f11359c4400,
callable=0x7f10cd3d09d8, setfunc=0x7f113cca99b0 <L_set>, restype=0x7f113ccf95a0, mem=0x7f1070ff4fa0)
at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callbacks.c:140
#6 closure_fcn () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callbacks.c:292
#7 0x00007f113cc983d0 in ffi_closure_unix64_inner ()
from /home/pentschev/miniconda3/envs/rn-0.10/lib/python3.7/lib-dynload/../../libffi.so.6
#8 0x00007f113cc98798 in ffi_closure_unix64 ()
from /home/pentschev/miniconda3/envs/rn-0.10/lib/python3.7/lib-dynload/../../libffi.so.6
#9 0x00007f1134a2c16b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#10 0x00007f113493d197 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#11 0x00007f113493d1c0 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#12 0x00007f1134a8ca36 in cuOccupancyMaxPotentialBlockSize () from /usr/lib/x86_64-linux-gnu/libcuda.so
#13 0x00007f113cc98630 in ffi_call_unix64 ()
from /home/pentschev/miniconda3/envs/rn-0.10/lib/python3.7/lib-dynload/../../libffi.so.6
#14 0x00007f113cc97fed in ffi_call ()
from /home/pentschev/miniconda3/envs/rn-0.10/lib/python3.7/lib-dynload/../../libffi.so.6
#15 0x00007f113ccaefce in _call_function_pointer (argcount=6, resmem=0x7f1070ff5510, restype=<optimized out>,
atypes=0x7f1070ff5490, avalues=0x7f1070ff54d0, pProc=0x7f1134a8c9b0 <cuOccupancyMaxPotentialBlockSize>, flags=4353)
at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:827
#16 _ctypes_callproc () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:1184
#17 0x00007f113ccafa04 in PyCFuncPtr_call () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/_ctypes.c:3969
...
Again cuOccupancyMaxPotentialBlockSize, which was the exact same issue from https://github.com/rapidsai/ucx-py/issues/187.
TL;DR: https://github.com/numba/numba/pull/4581 fixes this too.
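For readers unfamiliar with the `ffi_call` / `ffi_closure` / `_CallPythonObject` frames in the backtrace: they are what ctypes produces whenever a C function is handed a Python callable and calls it back. The sketch below reproduces that call path with libc's `qsort` instead of libcuda, so it needs no GPU. It is only an illustration of the mechanism, assuming a POSIX system; the function name `sort_with_python_callback` is hypothetical.

```python
import ctypes
import ctypes.util

# find_library may return None in minimal environments; CDLL(None) then falls
# back to the symbols already loaded into the process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# C signature of the comparator: int (*)(const int *, const int *)
CMPFUNC = ctypes.CFUNCTYPE(
    ctypes.c_int, ctypes.POINTER(ctypes.c_int), ctypes.POINTER(ctypes.c_int)
)

libc.qsort.restype = None
libc.qsort.argtypes = [
    ctypes.POINTER(ctypes.c_int),
    ctypes.c_size_t,
    ctypes.c_size_t,
    CMPFUNC,
]


def sort_with_python_callback(values):
    """Sort ints with C qsort, using a Python function as the comparator."""
    calls = 0

    def compare(a, b):
        # Runs as Python code, so ctypes must acquire the GIL before each call.
        nonlocal calls
        calls += 1
        return (a[0] > b[0]) - (a[0] < b[0])

    arr = (ctypes.c_int * len(values))(*values)
    callback = CMPFUNC(compare)  # keep a reference alive during the C call
    libc.qsort(arr, len(arr), ctypes.sizeof(ctypes.c_int), callback)
    return list(arr), calls


if __name__ == "__main__":
    print(sort_with_python_callback([5, 1, 7, 33, 99]))
```

Every comparison re-enters Python through the ctypes closure, which is the same route that leads to `PyGILState_Ensure` in frame #4 above.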
@pentschev thanks so much for tracking this down.
Thanks @pentschev! You are my hero :)
I'd just like to quote @pentschev's key comment from the discussion in ucx-py#187:
To describe briefly, the problem is the numba.forall call, which internally calls cuOccupancyMaxPotentialBlockSize. This last function requires two function pointers, one being the CUDA kernel itself, and the other being a function to calculate how much shared memory the call requires. The problem lies in the latter, which is defined in https://github.com/numba/numba/blob/master/numba/cuda/compiler.py#L288. Since that is a Python lambda function, when cuOccupancyMaxPotentialBlockSize calls that function back, it tries to acquire the GIL, which causes a deadlock (as both the thread executing cuOccupancyMaxPotentialBlockSize and the thread executing cudaMemcpyAsync lock the same CUDA mutex). The GIL can then never be acquired since neither thread can ever complete.
What we need to ensure is that CUDA calls (e.g., function callbacks passed to libcuda) never try to acquire the GIL. To fix that in the present case, we can simply pass a C function pointer instead of passing a Python function to it.
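As I read the quoted explanation, this is a lock-order inversion: one thread holds the GIL and waits for the CUDA mutex, while the other holds the CUDA mutex and its callback waits for the GIL. The sketch below is only an analogy using two ordinary `threading.Lock` objects as stand-ins (it does not touch the real GIL or libcuda), with timeouts so the demo terminates instead of hanging. All names are hypothetical.

```python
import threading


def simulate_lock_inversion(timeout=0.5):
    """Return the (sorted) names of threads that could not get their second lock."""
    gil_like = threading.Lock()         # stand-in for the Python GIL
    cuda_mutex_like = threading.Lock()  # stand-in for libcuda's internal mutex
    both_hold_first = threading.Barrier(2)
    attempts_done = threading.Barrier(2)
    stuck = []

    def worker(name, first, second):
        with first:
            # Make sure the other thread also holds its first lock.
            both_hold_first.wait()
            # A real deadlock blocks forever here; the timeout ends the demo.
            if second.acquire(timeout=timeout):
                second.release()
            else:
                stuck.append(name)
            # Keep holding `first` until both attempts have finished.
            attempts_done.wait()

    threads = [
        # Holds the "GIL", then needs the "CUDA mutex" (like the memcpy caller).
        threading.Thread(target=worker, args=("memcpy", gil_like, cuda_mutex_like)),
        # Holds the "CUDA mutex", then its Python callback needs the "GIL".
        threading.Thread(target=worker, args=("occupancy", cuda_mutex_like, gil_like)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(stuck)


if __name__ == "__main__":
    print(simulate_lock_inversion())
```

In this analogy, passing a C function pointer corresponds to the "occupancy" thread never needing `gil_like` at all, which removes the cycle.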
I will close this issue since the discussion already has a "home", and there is now a numba PR/fix.