I installed RAPIDS 0.10 with conda, but the kernel dies when I try to read a Parquet file with either _cudf.read_parquet_ or _dask_cudf.read_parquet_:
conda install -c rapidsai -c nvidia -c conda-forge rapids=0.10 rapids-xgboost dask python=3.7 cudatoolkit=10.0 ipykernel boto3 boto s3fs idna=2.7 PyYAML=3.13 urllib3=1.24.3
Jupyter log error:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): rmm_allocator::deallocate(): RMM_FREE: __global__ function call is not configured
Thanks for reporting @ivenzor - are you able to provide a minimal example that reproduces the above error? That would make debugging the issue much easier.
Otherwise, are you able to share your Jupyter Notebook?
@shwina thanks for your reply. It seems _cudf.read_parquet_ / _dask_cudf.read_parquet_ work as expected on another Parquet dataset in S3, but the kernel dies with the error I mentioned whenever I go back to the first Parquet dataset in S3 that I was testing.
I'm not able to share the data or the notebook, but I will try to isolate the issue as much as I can and provide more details.
We are running into the same issue when using the cuDF Java bindings on Spark to load large CSV data (more than 10 GB). Part of the log:
19/10/29 23:04:09 INFO MemoryStore: Block broadcast_13 stored as values in memory (estimated size 44.5 KiB, free 31.8 GiB)
19/10/29 23:04:09 INFO CodeGenerator: Code generated in 30.311032 ms
19/10/29 23:04:09 INFO CodeGenerator: Code generated in 25.242875 ms
19/10/29 23:04:09 INFO CodeGenerator: Code generated in 12.017672 ms
19/10/29 23:04:10 INFO TorrentBroadcast: Started reading broadcast variable 9 with 1 pieces (estimated total size 4.0 MiB)
19/10/29 23:04:10 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 46.7 KiB, free 31.8 GiB)
19/10/29 23:04:10 INFO TorrentBroadcast: Reading broadcast variable 9 took 5 ms
19/10/29 23:04:10 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 960.3 KiB, free 31.8 GiB)
19/10/29 23:04:10 INFO FilePartitionReader: Reading file path: hdfs://spark-egx-02:8020/data/mortgage/orig/perf/Performance_2000Q2.txt, range: 0-751462133, partition values: [empty row]
terminate called after throwing an instance of 'thrust::system::system_error'
what(): rmm_allocator::deallocate(): RMM_FREE: __global__ function call is not configured
[2019-10-29 23:04:16.135]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 306079 Aborted (core dumped) LD_LIBRARY_PATH="/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:" /usr/lib/jvm/java-8-openjdk-amd64//bin/java -server -Xmx61440m '-XX:+UseNUMA' -Djava.io.tmpdir=/hadoop/yarn/local/usercache/gashen/appcache/application_1572038031099_0036/container_e41_1572038031099_0036_01_000014/tmp '-Dspark.history.ui.port=18081' '-Dspark.driver.port=39954' '-Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/hadoop/yarn/log/application_1572038031099_0036/container_e41_1572038031099_0036_01_000014 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://[email protected]:39954 --executor-id 13 --hostname spark-egx-09 --cores 12 --app-id application_1572038031099_0036 --user-class-path file:/hadoop/yarn/local/usercache/gashen/appcache/application_1572038031099_0036/container_e41_1572038031099_0036_01_000014/__app__.jar --user-class-path file:/hadoop/yarn/local/usercache/gashen/appcache/application_1572038031099_0036/container_e41_1572038031099_0036_01_000014/cudf-0.10-SNAPSHOT-cuda10.jar --user-class-path file:/hadoop/yarn/local/usercache/gashen/appcache/application_1572038031099_0036/container_e41_1572038031099_0036_01_000014/xgboost4j_2.12-1.0.0-SNAPSHOT.jar --user-class-path file:/hadoop/yarn/local/usercache/gashen/appcache/application_1572038031099_0036/container_e41_1572038031099_0036_01_000014/xgboost4j-spark_2.12-1.0.0-SNAPSHOT.jar --user-class-path file:/hadoop/yarn/local/usercache/gashen/appcache/application_1572038031099_0036/container_e41_1572038031099_0036_01_000014/rapids-4-spark-0.1-SNAPSHOT.jar > /hadoop/yarn/log/application_1572038031099_0036/container_e41_1572038031099_0036_01_000014/stdout 2> /hadoop/yarn/log/application_1572038031099_0036/container_e41_1572038031099_0036_01_000014/stderr
Last 4096 bytes of stderr :
9 23:03:34 INFO TorrentBroadcast: Started reading broadcast variable 6 with 1 pieces (estimated total size 4.0 MiB)
19/10/29 23:03:34 INFO TransportClientFactory: Successfully created connection to spark-egx-03.nvidia.com/10.136.6.5:44587 after 1 ms (0 ms spent in bootstraps)
19/10/29 23:03:34 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 2.4 KiB, free 31.8 GiB)
19/10/29 23:03:34 INFO TorrentBroadcast: Reading broadcast variable 6 took 58 ms
19/10/29 23:03:34 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 4.4 KiB, free 31.8 GiB)
19/10/29 23:03:34 INFO Executor: Finished task 3.0 in stage 2.0 (TID 118). 1331 bytes result sent to driver
I think this is a somewhat misleading error message for what is really an out-of-memory error (allocation failure).
@OlivierNV I also thought that, but the file was not very big: I could read it in pandas, and its in-memory size was only a couple hundred MB.
@ivenzor any luck on finding a minimal reproducer?
@shwina I am also facing the same issue while reading several ORC files from HDFS with dask-cuda workers in a multi-node Dask cluster. I am using the cudf.read_orc() API to read the files.
Unfortunately, I cannot share the data either.
Here is the worker log:
distributed.core - INFO - Starting established connection
distributed.core - INFO - Starting established connection
2019-11-08 06:57:04,734 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-11-08 06:57:04,743 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-11-08 06:57:04,754 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-11-08 06:57:04,788 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
distributed.utils_perf - INFO - full garbage collection released 404.14 MB from 6603 reference cycles (threshold: 10.00 MB)
distributed.utils_perf - INFO - full garbage collection released 3.09 GB from 4715 reference cycles (threshold: 10.00 MB)
nvs-idx: RMM_ALLOC((nil),58818160)=4
terminate called after throwing an instance of 'thrust::system::system_error'
what(): rmm_allocator::deallocate(): RMM_FREE: __global__ function call is not configured
terminate called after throwing an instance of 'thrust::system::system_error'
what(): rmm_allocator::allocate(): RMM_ALLOC: unspecified launch failure
distributed.utils_perf - INFO - full garbage collection released 2.34 GB from 1700 reference cycles (threshold: 10.00 MB)
distributed.utils_perf - INFO - full garbage collection released 9.10 GB from 3326 reference cycles (threshold: 10.00 MB)
distributed.worker - ERROR - Worker stream died during communication: tcp://10.0.1.137:41268
Traceback (most recent call last):
File "/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/tcp.py", line 184, in read
n_frames = await stream.read_bytes(8)
tornado.iostream.StreamClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/conda/envs/rapids/lib/python3.7/site-packages/distributed/worker.py", line 1881, in gather_dep
self.rpc, deps, worker, who=self.address
File "/conda/envs/rapids/lib/python3.7/site-packages/distributed/worker.py", line 3109, in get_data_from_worker
max_connections=max_connections,
File "/conda/envs/rapids/lib/python3.7/site-packages/distributed/core.py", line 531, in send_recv
response = await comm.read(deserializers=deserializers)
File "/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/tcp.py", line 204, in read
convert_stream_closed_error(self, e)
File "/conda/envs/rapids/lib/python3.7/site-packages/distributed/comm/tcp.py", line 130, in convert_stream_closed_error
raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
I believe this error is triggered by an exception that gets thrown in a destructor. If you're willing to build cudf from source, one way to narrow down where the error occurs is to instrument the RMM_TRY and CUDA_TRY macros to log __FILE__ and __LINE__ to stderr just before the throw. That should reveal where the original error occurs before it gets obscured by the Thrust system error, and may shed light on the real problem.
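For example, here is a minimal sketch of such instrumentation, assuming a CUDA_TRY-style macro that wraps CUDA runtime calls (the names and exact shape of the real cudf macros may differ):

```cpp
#include <cstdio>
#include <stdexcept>
#include <cuda_runtime_api.h>

#define CUDA_TRY(call)                                                   \
  do {                                                                   \
    cudaError_t const status = (call);                                   \
    if (status != cudaSuccess) {                                         \
      /* Log the failing site before throwing, so the original error */  \
      /* isn't obscured by a later throw during stack unwinding. */      \
      std::fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                   cudaGetErrorString(status), __FILE__, __LINE__);      \
      throw std::runtime_error("CUDA error");                            \
    }                                                                    \
  } while (0)
```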
Speaking of errors being thrown while cleaning up from an error, there are many places in the code that throw when a CUDA error occurs without clearing the error. As the stack gets unwound and destructors are invoked, any destructor that also checks and throws on a CUDA error is going to trigger this type of issue. Is there a reason to leave the CUDA error pending if the exception being thrown contains the detail of the CUDA error? cc: @harrism @jrhemstad
Possibly not. That said, it's bad to throw in a destructor, and we avoid this in the latest RMM. Are there places in libcudf that throw in destructors?
Maybe I'm misreading the code, but it looks like RMM device vectors can still throw in their destructor. rmm_allocator::deallocate will throw if RMM_FREE returns an error, and rmm::free will return an error if there is a CUDA error pending.
> Maybe I'm misreading the code, but it looks like RMM device vectors can still throw in their destructor. rmm_allocator::deallocate will throw
You are correct. This will cause a throw in the destructor of rmm::device_vector, which is bad.
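To illustrate why, here is a minimal standalone sketch (hypothetical types, not RMM code) of a destructor that throws while another exception is already in flight; this is exactly the pattern behind the "terminate called after throwing an instance of ..." messages in the logs above:

```cpp
#include <stdexcept>

struct throwing_dtor {
  // Destructors are noexcept by default; opting out makes the throw legal,
  // but it is still fatal if the stack is already unwinding.
  ~throwing_dtor() noexcept(false) {
    throw std::runtime_error("deallocate failed");  // stands in for RMM_FREE failing
  }
};

int main() {
  try {
    throwing_dtor v;                           // stands in for rmm::device_vector
    throw std::runtime_error("kernel error");  // first exception in flight
  }  // ~throwing_dtor() throws during unwinding -> std::terminate()
  catch (std::exception const&) { /* never reached */ }
}
```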
> Speaking of errors being thrown while cleaning up from an error, there are many places in the code that throw when a CUDA error occurs without clearing the error
I agree, we can/should clear the error in CUDA_TRY. Can you open an issue?
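A minimal sketch of that idea (not the actual cudf macro): cudaGetLastError() returns the last runtime error for the calling thread and resets the thread's error state to cudaSuccess, so destructors that check for a pending CUDA error during unwinding won't re-observe a failure that was already reported. (Some failures, such as ones that corrupt the CUDA context, are "sticky" and cannot be cleared this way.)

```cpp
#include <stdexcept>
#include <string>
#include <cuda_runtime_api.h>

#define CUDA_TRY(call)                                               \
  do {                                                               \
    cudaError_t const status = (call);                               \
    if (status != cudaSuccess) {                                     \
      cudaGetLastError(); /* clear the pending per-thread error */   \
      throw std::runtime_error(std::string("CUDA error: ") +         \
                               cudaGetErrorString(status));          \
    }                                                                \
  } while (0)
```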
So I spent some time looking into this, and the root cause is nuanced. Luckily the fix is pretty easy. I'll have a PR to RMM to fix it shortly.
So, foot in mouth on my part. We removed throwing from RMM free, but we forgot about the Thrust allocator. BTW, if rmm::device_vector throws in its destructor, isn't that because thrust::device_vector throws in its destructor? Is Thrust a bad C++ citizen in this regard?
> if rmm::device_vector throws in its destructor, isn't that because thrust::device_vector throws in its destructor? Is Thrust a bad C++ citizen in this regard?
thrust::device_vector may or may not throw, but in this case it was our fault because rmm_allocator::deallocate throws (which is called in ~device_vector()).
If there's another Thrust allocator that throws in its deallocate method, then it would have the same throws-within-a-destructor issue.
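For reference, a minimal sketch of the non-throwing direction discussed here (not the actual RMM fix; rmm_free() is a hypothetical stand-in for RMM_FREE, and the allocator is a simplified host-side analogue rather than a real Thrust device allocator):

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for RMM_FREE: returns 0 on success, nonzero on error.
inline int rmm_free(void* p) { std::free(p); return 0; }

template <typename T>
struct non_throwing_allocator {
  using value_type = T;

  T* allocate(std::size_t n) {
    return static_cast<T*>(std::malloc(n * sizeof(T)));  // may fail as usual
  }

  // Never throw here: deallocate runs inside container destructors, and an
  // exception escaping a destructor during unwinding calls std::terminate().
  void deallocate(T* p, std::size_t) noexcept {
    if (rmm_free(p) != 0) {
      std::fprintf(stderr, "rmm_free failed; error suppressed in deallocate\n");
    }
  }

  bool operator==(non_throwing_allocator const&) const { return true; }
  bool operator!=(non_throwing_allocator const&) const { return false; }
};

int main() {
  std::vector<int, non_throwing_allocator<int>> v(100);  // destructor cannot throw
  return 0;
}
```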
Resolved.