Darknet: signal 11 while training network

Created on 19 Apr 2020 · 4 comments · Source: AlexeyAB/darknet

I've never had this problem with Darknet before: Command terminated by signal 11.

I'm using yolov3-tiny. While training my network, it only trains for a few seconds and then segfaults. These networks previously trained fine, not sure what changed. I did reboot the device, but same behaviour. This is the tail of the output:

````
(next mAP calculation at 1000 iterations)
23: 363.468109, 470.556122 avg loss, 0.000000 rate, 0.317069 seconds, 1472 images, 2.632841 hours left
Loaded: 0.374803 seconds - performance bottleneck on CPU or Disk HDD/SSD

(next mAP calculation at 1000 iterations)
24: 362.992065, 459.799713 avg loss, 0.000000 rate, 0.319242 seconds, 1536 images, 2.629300 hours left
Loaded: 0.265978 seconds - performance bottleneck on CPU or Disk HDD/SSD

(next mAP calculation at 1000 iterations)
25: 360.171051, 449.836853 avg loss, 0.000000 rate, 0.293526 seconds, 1600 images, 2.622240 hours left
Loaded: 0.370749 seconds - performance bottleneck on CPU or Disk HDD/SSD
OpenCV exception: load_image_mat_cv
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 16 Avg (IOU: 0.416460, GIOU: 0.294782), Class: 0.533438, Obj: 0.529112, No Obj: 0.544821, .5R: 0.263158, .75R: 0.052632, count: 19, class_loss = 135.004944, iou_loss = 2.308304, total_loss = 137.313248
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 23 Avg (IOU: 0.312199, GIOU: 0.077779), Class: 0.537887, Obj: 0.579798, No Obj: 0.565750, .5R: 0.142857, .75R: 0.000000, count: 14, class_loss = 577.370056, iou_loss = 3.108215, total_loss = 580.478271
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 16 Avg (IOU: 0.414418, GIOU: 0.329078), Class: 0.520000, Obj: 0.577239, No Obj: 0.544411, .5R: 0.307692, .75R: 0.000000, count: 13, class_loss = 134.717758, iou_loss = 1.639465, total_loss = 136.357224
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 23 Avg (IOU: 0.224744, GIOU: -0.013977), Class: 0.515081, Obj: 0.591532, No Obj: 0.566179, .5R: 0.100000, .75R: 0.000000, count: 20, class_loss = 578.300171, iou_loss = 8.046387, total_loss = 586.346558

Tensor Cores are disabled until the first 3000 iterations are reached.
Command terminated by signal 11

````

My darknet files are generated by a script, so I'm relatively certain everything is set up correctly. The command I'm running (also from a script) is this:

~/darknet/darknet detector -map -dont_show train ~/nn/handwashing/handwashing.data ~/nn/handwashing/handwashing_yolov3-tiny.cfg
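
For reference, the .data file follows the standard Darknet layout and looks something like this (the class count matches my setup, but the exact paths shown here are illustrative rather than copied verbatim):

````
classes = 3
train   = /home/user/nn/handwashing/train.txt
valid   = /home/user/nn/handwashing/valid.txt
names   = /home/user/nn/handwashing/handwashing.names
backup  = /home/user/nn/handwashing/backup/
````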

The version of darknet I'm using is from just a few days ago:

> git log -1
commit 342a8d1561c19317f2d5fda0f099449b79b51716 (HEAD -> master, origin/master, origin/HEAD)
Author: AlexeyAB <[email protected]>
Date:   Mon Apr 13 23:03:50 2020 +0300

    Fixed std thread

The only local changes I have are to the first few lines of the Makefile, where I've set the following:

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
OPENMP=1
LIBSO=1

Training is on Ubuntu 18.04.4 with a GeForce RTX 2070. Running nvidia-smi reports:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    On   | 00000000:01:00.0  On |                  N/A |
| 37%   29C    P8    21W / 175W |      0MiB /  7981MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Labels: Likely bug, Solved


I did a git pull to get the latest version; same problem. So I turned on debugging and ran it under gdb. This is the backtrace where the segfault happens:

Thread 12 "darknet" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff12b7e700 (LWP 5395)]
__memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:403
(gdb) bt
#0  0x00007fffd286c696 in __memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:403
#1  0x00005555556062c5 in load_data_detection (n=11, paths=0x5555869a2fa0, m=1018, w=608, h=608, c=3, boxes=90, classes=3, use_flip=1, use_gaussian_noise=0, use_blur=0, use_mixup=3, jitter=0.300000012, hue=0.100000001, saturation=1.5, exposure=1.5, mini_batch=0, track=0, augment_speed=0, letter_box=0, show_imgs=0) at ./src/data.c:1144
#2  0x000055555560714e in load_thread (ptr=0x7ffeac005c30) at ./src/data.c:1402
#3  0x000055555560790f in run_thread_loop (ptr=0x7ffec4001e10) at ./src/data.c:1456
#4  0x00007fffd2ba96db in start_thread (arg=0x7fff12b7e700) at pthread_create.c:463
#5  0x00007fffd28d288f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) frame 1
#1  0x00005555556062c5 in load_data_detection (n=11, paths=0x5555869a2fa0, m=1018, w=608, h=608, c=3, boxes=90, classes=3, use_flip=1, use_gaussian_noise=0, use_blur=0, use_mixup=3, jitter=0.300000012,
    hue=0.100000001, saturation=1.5, exposure=1.5, mini_batch=0, track=0, augment_speed=0, letter_box=0, show_imgs=0) at ./src/data.c:1144
(gdb)


Something seems to be wrong with the Mosaic implementation.

  • Can you attach your cfg-file?

  • Show the contents of the bad.list and bad_label.list files.

  • Do you get this issue on the same line every time?

  • Do you get this issue without mosaic=1 in the cfg-file?

I figured it out. One of the training/validation images wasn't copied over correctly and ended up with a size of zero bytes. As soon as I removed that image from the list, everything worked correctly.
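
A quick way to catch this kind of problem before training is to scan the image lists for bad files. A minimal sketch in C++ with OpenCV, assuming the list lives at a path like the one below (adjust it to the train/valid lists named in your .data file):

````
// Flags files that are missing, zero bytes, or undecodable by OpenCV.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <opencv2/imgcodecs.hpp>

int main() {
    namespace fs = std::filesystem;
    const std::string list_file = "/home/user/nn/handwashing/train.txt"; // hypothetical path

    std::ifstream list(list_file);
    std::string path;
    while (std::getline(list, path)) {
        if (path.empty()) continue;
        if (!fs::exists(path))             std::cout << "MISSING    " << path << "\n";
        else if (fs::file_size(path) == 0) std::cout << "EMPTY      " << path << "\n"; // the zero-byte case above
        else if (cv::imread(path).empty()) std::cout << "UNREADABLE " << path << "\n"; // exists but cannot be decoded
    }
    return 0;
}
````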

I should fix the source code to add an error message explaining the reason for the crash.
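
Roughly the kind of guard meant here, sketched as a standalone snippet rather than the actual darknet source: check the result of cv::imread and report the offending filename instead of handing an empty image on to the data-loading threads.

````
// Sketch only, not the actual darknet code.
#include <cstdio>
#include <cstdlib>
#include <opencv2/imgcodecs.hpp>

cv::Mat load_image_checked(const char *filename) {
    cv::Mat img = cv::imread(filename, cv::IMREAD_COLOR);
    if (img.empty()) {
        fprintf(stderr, "Cannot load image \"%s\" (missing, empty, or corrupt file)\n", filename);
        exit(EXIT_FAILURE); // or skip the sample / substitute a blank image
    }
    return img;
}
````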
