Cudf: [BUG] thrust::system::system_error what(): for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Created on 11 Aug 2020  ·  36 comments  ·  Source: rapidsai/cudf

This script just reads randomly created JSON files using Dask with no heavy processing.

Dask worker logs show errors like the ones below, which eventually cause the workers to restart repeatedly and lead to connection issues between the scheduler and the workers.

NOTE: If I do not use Dask, the processing seems to go through without failures.

Worker logs:

terminate called after throwing an instance of 'thrust::system::system_error' what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called recursively
distributed.nanny - INFO - Worker process 13050 was killed by signal 6

I used the following commands —

  1. Start Scheduler: nohup dask-scheduler --host localhost &> scheduler.out &
  2. Start Workers: CUDA_VISIBLE_DEVICES=0 nohup dask-worker localhost:8786 --nprocs 2 --nthreads 2 --memory-limit="16GB" --resources "process=1" >& worker.out &

Logs can be seen in scheduler.out and worker.out.

Random JSON files producer script:

# Creates 25 JSON files, 2*120MB each 

from random import randrange
import json
import math

num_columns = 40

def column_names(size):
    base_cols = ["AppId{}", "LoggedTime{}", "timestamp{}"]
    cols = []
    mult = math.ceil(size / len(base_cols))
    for i in range(mult):
        for c in base_cols:
            cols.append(c.format(i))
            if len(cols) == size:
                break
    return cols

def generate_json(num_columns):
    dict_out = {}
    cols = column_names(num_columns)
    for col in cols:
        if col.startswith("AppId"): dict_out[col] = randrange(1,50000)
        elif col.startswith("LoggedTime"): dict_out[col] = randrange(1,50000)
        else: dict_out[col] = randrange(1,50000)
    return json.dumps(dict_out)

for i in range(25):
    with open("json_files/json-%i.txt" % i, "w+") as f:
        for _ in range(2 * 150000):
            f.write(generate_json(num_columns) + "\n")

Processing script:

from distributed import Client
import cudf

client = Client("localhost:8786")
client.get_versions(check=True)

def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)

batch_arr = list(range(1, 25))
res = client.map(func_json, batch_arr)
print(client.gather(res))

Can someone please help? I've only started seeing this kind of failure in the last week.

I am using a fresh conda environment with this being the only installation command:
conda install -y -c rapidsai-nightly -c nvidia -c conda-forge -c defaults custreamz python=3.7 cudatoolkit=10.2

I am using a T4 GPU with CUDA 10.2.

P.S. This seems similar to #5897.

bug cuIO dask libcudf

All 36 comments

This might be an OOM caused by the additional processing for JSON input with object rows.

I was able to process dataframes much, much larger than this until about a week ago. Also, each JSON file is pretty small (~250 MB). I am running it on a T4 (16 GB GPU memory), so I think there's enough GPU memory. I am also seeing this issue when processing fewer than 25 files (around 10-15).

I also see a lot of Dask-CUDA issues popping up, maybe this is stemming from one of those bugs? I am by no means an expert, but just saying.

Could be a processing bug, but I wonder why it would only happen with Dask. I agree that there should be enough memory, but it might still be an OOM issue because of unreasonably large overhead during reads.
Can you please share the log with the CUDA issues?

So if you run the scripts above as is, you should see the same CUDA errors in the worker logs, given that your environment matches the one I described above. Are there any other logs I should be looking at? I'd be happy to help, but I thought a minimal reproducer would be best for you guys to debug.

P.S. I am using CUDA 10.2 and Python 3.7/3.8 both show the same errors.

My system doesn't have the same device memory capacity. So I would appreciate the logs if it's easy for you to get.

No problem, I just sent them over. Thanks for digging into this!

If it is an OOM issue it's possible this is related to an RMM/Dask-CUDA/Dask issue where device 0 is the only device being used even though multiple GPUs are requested

I see this problem both when using multiple GPUs as well as a single GPU. But I do believe it has something to do with the issues you mentioned.

If it is an OOM issue it's possible this is related to an RMM/Dask-CUDA/Dask issue where device 0 is the only device being used even though multiple GPUs are requested

Just to add to this, IOW this is an issue related to PR ( https://github.com/rapidsai/rmm/pull/466 ). We are discussing this in other contexts as well.

Thanks for pointing this out, @jakirkham!

Filed an MRE here ( https://github.com/rapidsai/dask-cuda/issues/364 ).

@chinmaychandak, could you please try PR ( https://github.com/rapidsai/dask-cuda/pull/363 )? Requires the very latest (like from minutes ago) rmm installed as well.

We went ahead and merged that dask-cuda PR and nightlies have been produced. Please let us know if you still see issues with them.

Sorry I missed the earlier message. Great, thanks a lot! Will give it a shot soon.

@jakirkham I'm still seeing the same issues with the latest nightlies (0.16.0a200812). Can you try and reproduce them locally so that I can make sure I'm not doing anything differently?

@jakirkham I'm still seeing the same issues with the latest nightlies (0.16.0a200812). Can you try and reproduce them locally so that I can make sure I'm not doing anything differently?

I'm running the repro locally, will update once the script is done.

Update:

If I start workers with --nprocs 2 --nthreads 1 or --nprocs 1 --nthreads 1, everything gets processed smoothly. I only see the issue when each process has multiple threads, so --nprocs 2 --nthreads 2 fails. This is interesting, and should give us some more insight into where this issue stems from.

When using GPUs with dask the current working assumption is that there should be 1 worker and 1 thread per GPU. This is generally for proper CUDA context creation but also useful resource management. We built dask-cuda to make this setup trivial for users.

When using GPUs with dask the current working assumption is that there should be 1 worker and 1 thread per GPU. This is generally for proper CUDA context creation but also useful resource management. We built dask-cuda to make this setup trivial for users.

I agree, but we have been using multiple Dask worker processes per GPU for high throughput for custreamz streaming pipelines. And it has been working flawlessly until recently.

@quasiben in this case they're using multiple processes and CUDA MPS in order to handle workloads that don't nicely saturate the entire GPU on their own.

@chinmaychandak it seems like everything is working as long as you have a single thread per process, yes?

That may be true, but Ben is right. That's not expected to work currently. Not to say we are against changing this (and it is part of the reason for pushing for PTDS 😉)

it seems like everything is working as long as you have a single thread per process, yes?

Yes, @kkraus14, that's correct. I even tested the accelerated Kafka reader in one of the more complex cuStreamz pipelines, and it works fine as long as there's one thread per process. But for most pipelines we do use multiple threads per process, especially for benchmarking purposes.

Again, we've already been doing this for over a year, and it's never been a problem. Not sure why I'm seeing this issue. I think @vuule mentioned that he couldn't reproduce the issue locally. Maybe I'm doing something wrong here then.

cc @harrism as we're seeing a threading-related issue and there were substantial changes with regard to RMM and threading.

@kkraus14, an update: I'm now seeing the same issues with multiple processes too, but only after about 10 minutes of running a stream with a high-speed input rate. That's probably why I couldn't see it with the minimal reproducer. I will try reading a larger number of JSON files to see if multiple processes fail as well.

Maybe retry with newer RMM packages ( https://github.com/rapidsai/rmm/pull/493 )?

Just to reiterate, I wouldn't expect Dask-CUDA to work with multiple threads per worker today ( https://github.com/rapidsai/dask-cuda/issues/109 ).

Maybe retry with newer RMM packages ( rapidsai/rmm#493 )?

Will give it a shot when the nightlies are out. Actually, they're out. Let me try.

I wouldn't expect Dask-CUDA to work with multiple threads per worker today

When I do conda list dask-cuda, nothing shows up, so my reproducer relies only on RMM and not dask-cuda, I think. Nevertheless, as I said above, we've been running these multi-process, multi-threaded Dask workers per GPU for the last year and they've never been a problem. We do need CUDA MPS, and there are multiple CUDA contexts created, but functionality-wise they've proven to work fine.

I would really appreciate it if someone can try to run the repro locally to see if they're seeing the same error as me.

Maybe retry with newer RMM packages ( rapidsai/rmm#493 )?

Just did, still doesn't seem to work.

What does Dask do when you schedule more than one thread per worker? Does it give each thread its own pool? When you have multiple processes per GPU, is it setting pool sizes appropriately?

What does Dask do when you schedule more than one thread per worker? Does it give each thread its own pool? When you have multiple processes per GPU, is it setting pool sizes appropriately?

Let's move that discussion over here ( https://github.com/rapidsai/dask-cuda/issues/109 ) (if that's ok 🙂).

Edit: Answered in comment ( https://github.com/rapidsai/dask-cuda/issues/109#issuecomment-673677150 ).

Does this reproduce with a ThreadPoolExecutor? Maybe something like this:

from concurrent.futures import ThreadPoolExecutor
import cudf


def func_json(batch):
    file = f"json_files/json-{batch}.txt"
    df = cudf.read_json(file, lines=True, engine="cudf")
    return len(df)


with ThreadPoolExecutor(max_workers=1) as executor:
    batch_arr = list(range(1, 25))
    res = executor.map(func_json, batch_arr)
    for e in res:
        print(e)

Edit: May be worth playing with max_workers here.

Okay, so I thought of using CSV files instead of JSON. I used

import cudf
for i in range(20):
    file = f"json_files/json-{i}.txt"
    cudf.read_json(file, lines=True, engine="cudf").to_csv(f"csv_files/csv-{i}.csv")

to convert the existing JSON files to CSV, and then updated the repro script to call read_csv:

def func_csv(batch):
    file = f"csv_files/csv-{batch}.csv"
    df = cudf.read_csv(file)
    return len(df)

It seems to run fine with 2 processes and 2 threads. So this is specifically happening with the JSON reader?
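If the crash really is specific to concurrent JSON reads, one possible stopgap (a sketch, not a confirmed fix) is to serialize calls to the reader with a lock so only one thread is ever inside it, while the rest of the pipeline stays multithreaded. cudf isn't assumed to be available here, so a pure-Python stand-in takes the place of cudf.read_json; on a GPU box you would call cudf inside the locked section instead.

```python
import json
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

# One global lock: only a single thread may be inside the (apparently
# thread-unsafe) reader at a time.
_read_lock = threading.Lock()

def read_json_lines(path):
    # Stand-in for cudf.read_json(path, lines=True, engine="cudf").
    with open(path) as f:
        return [json.loads(line) for line in f]

def func_json(path):
    with _read_lock:
        rows = read_json_lines(path)
    return len(rows)

# Demo on two small line-delimited JSON files.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(2):
    p = os.path.join(tmpdir, f"json-{i}.txt")
    with open(p, "w") as f:
        for n in range(3):
            f.write(json.dumps({"AppId0": n}) + "\n")
    paths.append(p)

with ThreadPoolExecutor(max_workers=2) as executor:
    counts = list(executor.map(func_json, paths))
print(counts)  # 3 rows per file
```

This trades away read concurrency, so throughput will drop, but it may keep multi-threaded workers alive until the reader itself is fixed.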

Got local repro with multithreaded JSON reads:

TEST_F(JsonReaderTest, Repro)
{
  auto read_all = [&]() {
    cudf_io::read_json_args in_args{cudf_io::source_info{""}};
    in_args.lines = true;
    for (int i = 0; i < 25; ++i) {
      in_args.source =
        cudf_io::source_info{"/home/vukasin/cudf/json-" + std::to_string(i) + ".txt"};
      auto df = cudf_io::read_json(in_args);
    }
  };

  auto th1 = std::async(std::launch::async, read_all);
  auto th2 = std::async(std::launch::async, read_all);
}

Repros fairly consistently.

I suspect synchronization issue(s) that got exposed by GPU saturation from the concurrent reads.
Digging into the repro, I found a few places where the synchronization is iffy.
I need to look into it some more to root-cause it.

Were there any significant changes recently that could have caused this? Because I wonder why we weren't seeing these issues before.

I made a significant change to the JSON reader two weeks ago that could affect this.
