cuDF: [DOC] Replace OutOfMemory exception with UnsupportedGPU exception

Created on 20 Mar 2020 · 10 comments · Source: rapidsai/cudf

Describe the bug

We end up deployed in scenarios where cuDF, on initialization, throws an RMM out-of-memory exception during load, when in reality it is rejecting the available hardware.

This typically impacts our new-to-GPU users, and even advanced ones after a config mistake. E.g., on Azure, the default-available GPUs are K80s (until users jump through quota hoops), so the typical Azure first-use experience is to spin up and hit this misleading error. It's quite a tough and confusing experience for most people until they've been burnt enough times.

We end up doing all sorts of things to try to get users to pick the right env etc. beforehand, but invariably, mistakes will happen, even by advanced users (misconfig, ...).

Not sure if this is better in cudf or rmm.

Steps/Code to reproduce bug

Run import cudf; cudf.DataFrame({'x': [1]}) on an old but popular GPU like the K80

And/or on other common init steps like set_alloc

Expected behavior

Fail with UnsupportedDeviceError or something similarly indicative
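For illustration, a minimal sketch of what such a failure could look like. The class name UnsupportedDeviceError and the Pascal-era minimum of compute capability 6.0 are assumptions for the sketch, not cuDF's actual API:

```python
class UnsupportedDeviceError(RuntimeError):
    """Hypothetical error for GPUs below the supported architecture."""

# Assumption: Pascal (6.0) or newer is required; the K80 is 3.7.
MIN_CC = (6, 0)

def validate_devices(capabilities):
    """Check each device's (major, minor) compute capability tuple."""
    for idx, cc in enumerate(capabilities):
        if cc < MIN_CC:
            raise UnsupportedDeviceError(
                f"GPU {idx} has compute capability {cc[0]}.{cc[1]}; "
                f"{MIN_CC[0]}.{MIN_CC[1]} or newer is required"
            )
```

On a K80, validate_devices([(3, 7)]) would then fail immediately with a hardware-specific message instead of a later out-of-memory error.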

Environment overview (please complete the following information)

Everywhere. We happen to hit it in Docker.

Labels: bug, cuDF (Python)

All 10 comments

We have similarly poor error messages for old CUDA versions, old driver versions, etc. that we should handle in one swoop.

We should also strive to keep cudf importable on a machine with no GPU, for things like API enumeration and whatnot.

@lmeyerov what version are you on?

We were getting reports from 0.7 through 0.11. We're shipping 0.12 next week and then switching internally to 0.13

(0.13/0.14 should be faster b/c the 0.11 & 0.12 upgrades involved 100-200 unit tests & further automation around how we use it)

Also, we always ship as Docker, and in cloud cases (but not on-prem) we get to control the host: Ubuntu 18 plus whatever the AWS/Azure NVIDIA drivers are at the time

If overhead on the check is a concern, another option that is fine for us as software devs is an explicit opt-in call.

E.g., something like a healthcheck() or validity(). In OpenCL, you get back a set of valid devices and their specs, and can even pick which one you're using (=> cooperatively schedulable).

That wouldn't help direct cudf users like data scientists, though.
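An opt-in healthcheck along those lines could be sketched like this (the function name, return shape, and 6.0 minimum are made up for illustration; cuDF has no such API):

```python
MIN_CC = (6, 0)  # assumed minimum supported compute capability

def healthcheck(devices):
    """Partition (name, (major, minor)) device specs into supported and
    unsupported lists, roughly like OpenCL's device enumeration lets a
    caller see valid devices and pick one."""
    report = {"supported": [], "unsupported": []}
    for name, cc in devices:
        key = "supported" if cc >= MIN_CC else "unsupported"
        report[key].append((name, cc))
    return report
```

A caller could then schedule work only onto report["supported"] devices, or raise early with a clear message when that list is empty.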

This is pretty straightforward.

At import cudf: query each visible device's compute capability and fail fast if it is below the supported minimum.

Does numba or cupy already wrap the appropriate APIs? Or would we need to do so in cuDF cython?

To support @kkraus14's comment of being able to do import cudf on machines without GPUs, you can first do cudaGetDeviceCount() and only run the above if the number of devices is greater than zero.
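The gating described above could look roughly like this ctypes sketch (the soname and error handling are assumptions, and cuDF would go through its own bindings rather than ctypes):

```python
import ctypes

def cuda_device_count():
    """Return the number of visible CUDA devices, or 0 when no runtime is present."""
    try:
        # Assumption: the bare soname; real CUDA installs may only ship
        # versioned names like libcudart.so.11.0.
        libcudart = ctypes.CDLL("libcudart.so")
    except OSError:
        return 0  # no CUDA runtime: treat as a GPU-less machine
    count = ctypes.c_int(0)
    # cudaGetDeviceCount returns cudaSuccess (0) on success
    if libcudart.cudaGetDeviceCount(ctypes.byref(count)) != 0:
        return 0
    return count.value

# At import time: only run the hardware check when devices actually exist,
# so `import cudf` keeps working on GPU-less machines.
if cuda_device_count() > 0:
    pass  # run the compute-capability validation here
```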

If overhead on the check is a concern, another option that is fine for us as software devs is an explicit opt-in call.

Overhead on the check should be pretty low, so I'm not too concerned.

We were getting reports from 0.7 through 0.11. We're shipping 0.12 next week and then switching internally to 0.13

Note 0.12 has a bunch of non-trivial memory overhead for strings; you may want to go straight from 0.11 to 0.13, which removes that overhead completely and adds further memory-usage improvements.

We're stuck near-term on 0.12 b/c blazing isn't blessed for 0.13 afaict: https://anaconda.org/blazingsql/blazingsql

But yeah, the 0.12/0.13/0.14 upgrades seem to be battling overhead & memory issues we're seeing, so def excited!

@jrhemstad Sanity check re: not importing: will doing submodule imports (from cudf.io.parquet ...) still trigger running cudf/__init__.py? I'm not up on Python module semantics, so I'm not sure if putting the check at module import time will still allow GPU-less module reflection.


Yes it still will: https://docs.python.org/3/reference/import.html#regular-packages
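To see this concretely, here's a small self-contained demonstration with a throwaway package (fakepkg is a made-up name standing in for cudf): importing a submodule always executes the parent package's __init__.py first, so an import-time check there still guards from cudf.io.parquet import ... style imports.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package: fakepkg/__init__.py sets a flag when it runs.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "fakepkg", "io"))
with open(os.path.join(root, "fakepkg", "__init__.py"), "w") as f:
    f.write("import os; os.environ['FAKEPKG_INIT_RAN'] = '1'\n")
with open(os.path.join(root, "fakepkg", "io", "__init__.py"), "w") as f:
    f.write("")
with open(os.path.join(root, "fakepkg", "io", "parquet.py"), "w") as f:
    f.write("def read():\n    return 'ok'\n")

sys.path.insert(0, root)
parquet = importlib.import_module("fakepkg.io.parquet")

# The top-level __init__.py ran even though we only imported the submodule.
print(os.environ.get("FAKEPKG_INIT_RAN"))  # prints 1
```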
