I started a training run on Red Hat Enterprise Linux Server 7.4. After a few hundred iterations, the same error always occurs: "Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)." What should I do to solve this problem? I need help.
Hello, thank you for your interest in our work! This is an automated response. Please note that most technical problems are due to:
git clone version of this repository we can not debug it. Before going further run this code and ensure your issue persists:
sudo rm -rf yolov3 # remove existing repo
git clone https://github.com/ultralytics/yolov3 && cd yolov3 # git clone latest
python3 detect.py # verify detection
python3 train.py # verify training (a few batches only)
# CODE TO REPRODUCE YOUR ISSUE HERE
train_batch0.jpg and test_batch0.jpg for a sanity check of training and testing data.
If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!
Hello @glenn-jocher, I solved this error by setting "--num-workers 0", following the PyTorch torch.utils.data DataLoader documentation. Thanks!
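For anyone who wants to confirm that shared memory is the cause before disabling workers, here is a minimal sketch using only the Python standard library. It reads the free space on /dev/shm and falls back to 0 workers only when that space looks too small. The helper names and the 1 GiB threshold are assumptions made for illustration; they are not part of this repository or of PyTorch.

```python
import os
import shutil


def shm_free_bytes(path="/dev/shm"):
    """Return the free bytes on the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


def pick_num_workers(requested, free_bytes, min_free_bytes=1 << 30):
    """Hypothetical helper: use 0 workers when free shm is below the threshold.

    The 1 GiB default threshold is an arbitrary assumption, not a measured value.
    """
    if free_bytes is not None and free_bytes < min_free_bytes:
        return 0
    return requested


if __name__ == "__main__":
    free = shm_free_bytes()
    print("free shm bytes:", free)
    print("num_workers to use:", pick_num_workers(8, free))
```

The value it prints could then be passed to the "--num-workers" option mentioned above.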
"--num-workers 0" will slow down training.
I fixed it by adding "--ipc=host" in my docker container configuration.
@mozpp yes, this is already the default usage in the dockerfile examples:
https://github.com/ultralytics/yolov3/blob/master/Dockerfile
I fixed it by following this comment
https://stackoverflow.com/a/59029085
Hope it helps!
@mozpp
"--num-workers 0" will slow down training.
I fixed it by adding "--ipc=host" in my docker container configuration.
How do you add this to a Dockerfile?
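Note that "--ipc=host" is an option of the docker run command, not a Dockerfile instruction, so it is supplied when the container is started rather than when the image is built. Below is a minimal sketch that assembles such a command in Python. The image name "my-yolov3-image" and the helper function are hypothetical; "--shm-size" is shown as an alternative docker run option for when sharing the host IPC namespace is not wanted.

```python
import shlex


def docker_run_command(image, ipc_host=True, shm_size=None):
    """Hypothetical helper: build a docker run argument list with more shared memory."""
    cmd = ["docker", "run", "--rm", "-it"]
    if ipc_host:
        # Share the host IPC namespace, including its shared memory.
        cmd.append("--ipc=host")
    elif shm_size:
        # Alternative: enlarge the container's own /dev/shm, e.g. "8g".
        cmd.append(f"--shm-size={shm_size}")
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # "my-yolov3-image" is a placeholder image name.
    print(shlex.join(docker_run_command("my-yolov3-image")))
    print(shlex.join(docker_run_command("my-yolov3-image", ipc_host=False, shm_size="8g")))
```

The resulting list can be passed to subprocess.run, or the printed line can be pasted into a shell.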