Darknet: Training process killed after resizing

Created on 25 Mar 2019 · 10 Comments · Source: AlexeyAB/darknet

I followed all the steps to train with 1 class, but when I start training I get the error shown in the screenshot.
I set batch = 64 and subdivisions = 8 in the .cfg file, and no bad.list file is created after running the command
./darknet detector train data/obj.data yolo-obj.cfg darknet53.conv.74
from the /darknet-master/build/darknet/x64 path.
Can anyone help me figure out why this is happening?
Screenshot from 2019-03-25 11-23-48

lfares

Most helpful comment

Probably @AlexeyAB can give you some causes for that.

In my experience, the training process sometimes stops suddenly, which is a problem especially at night while you are asleep. To avoid losing hours of training to a sudden error or stop, I created a simple bash file:

#!/bin/bash
cd /home/...the location of your darknet executable
while :
do
    ./darknet detector train data/obj.data yolov3.cfg backup/yolov3_last.weights
done

This works very well for me because the weights are saved as yolov3_last.weights every 100 iterations. Every time the training process stops, it is restarted from the last saved weights.
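
One caveat with an unconditional loop is that it also restarts after training has finished, and it expects backup/yolov3_last.weights to exist on the very first run. Below is a minimal sketch of a bounded variant. It assumes (not confirmed in this thread) that darknet exits with status 0 when training completes normally and with a non-zero status when it is killed. The helper names `pick_weights`, `retry` and `train_once`, and the `RETRY_DELAY` variable, are made up for illustration; darknet53.conv.74 is the pre-trained file from the original command.

```shell
#!/bin/bash
# Sketch only: helper names and RETRY_DELAY are illustrative assumptions.

# Print the weights file to start from: the last checkpoint if it exists and
# is non-empty, otherwise the pre-trained weights given as second argument.
pick_weights() {
    local checkpoint="$1" pretrained="$2"
    if [ -s "$checkpoint" ]; then
        echo "$checkpoint"
    else
        echo "$pretrained"
    fi
}

# Run a command until it exits with status 0 or max_tries attempts are used.
# Returns 0 on success, 1 if every attempt failed.
retry() {
    local max_tries="$1"; shift
    local n=1
    while [ "$n" -le "$max_tries" ]; do
        "$@" && return 0
        echo "attempt $n failed (exit $?), restarting" >&2
        n=$((n + 1))
        sleep "${RETRY_DELAY:-5}"
    done
    return 1
}

# One training run; the weights file is re-selected on every attempt.
# Assumption: run from the directory that contains the darknet executable.
train_once() {
    ./darknet detector train data/obj.data yolov3.cfg \
        "$(pick_weights backup/yolov3_last.weights darknet53.conv.74)"
}
```

Calling `retry 20 train_once` from the darknet directory would then take the place of the `while :` loop; the limit of 20 attempts is arbitrary.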

drapado on 25 Mar 2019

👍2 ❤1 😄1

All 10 comments

@lfares

How much CPU RAM do you have?
Do you use the latest version of Darknet?
What parameters do you set in the Makefile?

> I put batch = 64 and subdivision = 8 on the .cfg file

I see from your screenshot that you use batch=32. Does the error occur with both batch=64 and batch=32?
Do you get this issue with batch=16 subdivisions=2?
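
Since the value reported in the post (64) and the one visible in the screenshot (32) differ, it can help to print what the .cfg file actually contains. The sketch below reads the first uncommented `batch` and `subdivisions` entries and computes batch / subdivisions, which is the number of images darknet processes in one forward pass. `cfg_value` and `images_per_step` are hypothetical helper names, not part of darknet.

```shell
#!/bin/bash
# Sketch only: cfg_value and images_per_step are illustrative helpers.

# Print the first uncommented value of KEY in a darknet .cfg file.
cfg_value() {
    local key="$1" file="$2"
    awk -F= -v key="$key" '
        /^[ \t]*#/ { next }                  # skip commented-out lines
        {
            k = $1; gsub(/[ \t\r]/, "", k)   # tolerate "key = value" spacing
            if (k == key) {
                v = $2; gsub(/[ \t\r]/, "", v)
                print v
                exit
            }
        }' "$file"
}

# Images processed per forward pass: batch divided by subdivisions.
images_per_step() {
    echo $(( $1 / $2 ))
}
```

For example, `images_per_step "$(cfg_value batch yolo-obj.cfg)" "$(cfg_value subdivisions yolo-obj.cfg)"`. By this arithmetic, batch=64 subdivisions=8 and batch=16 subdivisions=2 both give 8 images per step.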

AlexeyAB on 25 Mar 2019

On an Nvidia Jetson Xavier, changing batch to 8 and subdivisions to 1 allowed the training to run.

kevinrev26 on 8 May 2019

@kevinrev26 It looks like batch=8 is required for training due to low CPU RAM capacity on Jetson.

AlexeyAB on 8 May 2019

I was able to train, but the detection is not working properly. Is there a minimum number of dataset images or something?

kevinrev26 on 8 May 2019

@kevinrev26 You must train for at least 4000 iterations with batch=64, or 32000 iterations with batch=8. You should also use a pre-trained weights file for training.
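
The two figures are consistent with keeping the total number of training images constant: 4000 × 64 = 32000 × 8 = 256000. That reading is an interpretation of the numbers rather than a rule stated in this thread. A small sketch for other batch sizes, with `equivalent_iterations` as a hypothetical helper name:

```shell
#!/bin/bash
# Sketch only: scales an iteration count so that iterations * batch stays constant.
equivalent_iterations() {
    local ref_iters="$1" ref_batch="$2" new_batch="$3"
    # Round up so the new run never sees fewer images than the reference run.
    echo $(( (ref_iters * ref_batch + new_batch - 1) / new_batch ))
}

equivalent_iterations 4000 64 8     # prints 32000
equivalent_iterations 4000 64 16    # prints 16000
```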

AlexeyAB on 8 May 2019

👍1

@drapado Funny, I also did it via a bash script in the crontab some weeks ago :-D

The crontab looks like this:

crontab -l

```
...
...
# Start the script every minute
*/1 * * * * /opt/start_training.sh
...
...
```

/opt/start_training.sh looks like this:

```
#!/bin/bash

# Check if the final weights have already been reached. If so, exit the script,
# as we don't need to train the same weights file again and again.
if [ -e "/srv/storage/training/608_weights/608_weights_final.weights" ];
then
    echo "[ $(date) ] Final Weights already reached - Exit :-)"
    exit 0
fi

# Check if the detector is already running. If not, start the detector after
# checking whether this is the first run or a restart of the detector.
if [ "$(ps auxfw | grep -v grep | grep "detector train" -q; echo $?)" -ne 0 ];
then
    echo -e "[ $(date) --- Start training ]";
    telinit 2
    cd /home/user/computer-vision/darknet2/;
    # Check if we have a "..._last.weights" file. If yes, use it as the checkpoint
    # to restart the detector from that position.
    # If no such file exists, we know we can start from zero.
    if [ -e "/srv/storage/training/608_weights/608_weights_last.weights" ];
    then
        ./darknet detector train data/.data cfg/.cfg /srv/storage/training/608_weights/608_weights_last.weights darknet53.conv.74 -dont_show -map -gpus 0,1,2,3 1>>/srv/storage/training/608_training_log1.log 2>>/srv/storage/training/608_training_log2.log
    else
        ./darknet detector train data/.data cfg/.cfg darknet53.conv.74 -dont_show -map -gpus 0,1,2,3 1>>/srv/storage/training/608_training_log1.log 2>>/srv/storage/training/608_training_log2.log
    fi
else
    echo -e "[ $(date) --- already training ]";
fi
```

flowzen1337 on 12 Jun 2019

@drapado Hi, can you tell me where I should create this bash file? In the darknet directory or in some system directory?

MuhammadAsadJaved on 6 Dec 2019

Hi, this is the code to write in a bash file, which should be saved as TheNameYouWant.sh
You can save it anywhere on your disk; it doesn't matter as long as the path to the darknet executable is written correctly in the code.

Once it's saved, just open a Linux terminal, go to the directory where you saved the file and type:
sh TheNameYouWant.sh
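
For an overnight run it can be convenient to detach the script from the terminal, so that closing the window does not stop it. Below is a minimal sketch using the standard `chmod` and `nohup` utilities. `start_in_background` and the training.log file name are illustrative assumptions, not something used in this thread.

```shell
#!/bin/bash
# Sketch only: start_in_background is an illustrative helper.

# Make SCRIPT executable and start it in the background, immune to terminal
# hangup, appending its stdout and stderr to LOG. Prints the background PID.
# SCRIPT should contain a slash (e.g. ./name.sh) so it is not looked up in PATH.
start_in_background() {
    local script="$1" log="$2"
    chmod +x "$script" || return 1
    nohup "$script" >> "$log" 2>&1 &
    echo "$!"
}
```

For example, `start_in_background ./TheNameYouWant.sh training.log`, followed by `tail -f training.log` to watch the output. Running the file directly this way also honors its `#!/bin/bash` line.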

drapado on 6 Dec 2019

Thank you so much.
