cuDF 0.14 raises an error when used in a Python multiprocessing Pool; the same code works in version 0.12. Here is a minimal reproduction:
```python
import cudf
import pandas as pd
from multiprocessing import Pool

def get_df(idx):
    pdf = pd.DataFrame({
        'a': [1, 2],
        'b': [3, 4]
    })
    return cudf.from_pandas(pdf)

# Parallelize the method calls
with Pool(2) as pool:
    pool.map(get_df, [1, 2])
```
The error is:
MemoryError Traceback (most recent call last)
<ipython-input-1-757e5676c563> in <module>
11
12 with Pool(2) as pool:
---> 13 pool.map(get_df, [1,2])
~/miniconda3/envs/gpu/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
264 in a list that is returned.
265 '''
--> 266 return self._map_async(func, iterable, mapstar, chunksize).get()
267
268 def starmap(self, func, iterable, chunksize=None):
~/miniconda3/envs/gpu/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
642 return self._value
643 else:
--> 644 raise self._value
645
646 def _set(self, i, obj):
MemoryError: std::bad_alloc: CUDA error at: /conda/conda-bld/librmm_1591196551527/work/include/rmm/mr/device/cuda_memory_resource.hpp:66: cudaErrorInitializationError initialization error
cuDF was installed using Anaconda on bare metal. I am attaching the output of cudf/print_env.sh for both versions:
print_env_12.txt
print_env_14.txt
Looks related to the new memory resource bindings; will investigate.
Thanks for reporting. This is likely due to the use of fork(), which attempts to share the CUDA context created in the parent process with the child processes. One fix is to use the "spawn" start method instead:
```python
import cudf
import pandas as pd
from multiprocessing import get_context

def get_df(idx):
    pdf = pd.DataFrame({
        'a': [1, 2],
        'b': [3, 4]
    })
    return cudf.from_pandas(pdf)

if __name__ == "__main__":
    ctx = get_context("spawn")
    # Parallelize the method calls
    with ctx.Pool(2) as pool:
        print(pool.map(get_df, [1, 2]))
```
Does that help with your problem?
No. The error message is
AttributeError: Can't get attribute 'get_df' on <module '__main__' (built-in)>
Hmm, how are you running this test? Interactively with IPython/Jupyter or invoking it as a script?
MemoryError Traceback (most recent call last) <ipython-input-1-757e5676c563> in <module>
Looks like it's run in IPython.
No. The error message is
AttributeError: Can't get attribute 'get_df' on <module '__main__' (built-in)>
Looks like the known limitation (the 2nd gray box in the link) of Python's multiprocessing when used interactively with IPython.
When launched as a script, @shwina's suggestion shouldn't hit the AttributeError, should it?
But I'm not sure whether the original error message below is related to the use of fork vs. spawn, or something else:
MemoryError: std::bad_alloc: CUDA error at: /conda/conda-bld/librmm_1591196551527/work/include/rmm/mr/device/cuda_memory_resource.hpp:66: cudaErrorInitializationError initialization error
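For what it's worth, the AttributeError in IPython is expected: a spawn-based Pool ships the target function to workers by pickling a reference to it (module name plus qualified name), and the workers then re-import that module. A function defined in an interactive session lives in a `__main__` the workers cannot import. A minimal sketch of the mechanism, runnable without a GPU (the dict return is a stand-in for the real cudf.from_pandas call):

```python
import pickle

def get_df(idx):
    # Stand-in for cudf.from_pandas(...) so this runs without a GPU
    return {'a': [1, 2], 'b': [3, 4]}

# A spawn-based Pool pickles only a *reference* to the function --
# its module and qualified name -- not the function's source code:
payload = pickle.dumps(get_df)
print(b'get_df' in payload)          # True: the name travels...
print(b'def get_df' in payload)      # False: ...the code does not
```

Because only the name travels, spawned workers must be able to import `get_df` from a real module, which is why the suggestion works as a script but not inside an IPython kernel.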
Verified that this works when run as a script. With the default fork() start method, it hits the initialization error.
@shwina your suggestion works fine when run as a script. Thanks.
Thanks for letting us know!