Darknet: Training process killed after resizing

Created on 25 Mar 2019 · 10 comments · Source: AlexeyAB/darknet

I followed all the steps to train with 1 class, but when I start training I get the error shown in the screenshots.
I set batch=64 and subdivisions=8 in the .cfg file, and no bad.list file is created after running the command
./darknet detector train data/obj.data yolo-obj.cfg darknet53.conv.74
in the /darknet-master/build/darknet/x64 directory.
Can anyone help me figure out why this is happening?
[Screenshot from 2019-03-25 11-23-48]
[Screenshot from 2019-03-25 11-39-47]

All 10 comments

Probably @AlexeyAB can give you some causes for that.

From my experience, the training process sometimes stops suddenly, which can be a problem, especially at night while you are sleeping. To avoid losing hours of training to a sudden error or stop, I created a simple bash file:

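```
#!/bin/bash
# Go to the directory that contains the darknet executable
cd /home/...the location of your darknet executable || exit 1
# Restart training forever; each run resumes from the last saved weights
while :
do
    ./darknet detector train data/obj.data yolov3.cfg backup/yolov3_last.weights
done
```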

This works very well for me because the weights are saved as yolov3_last.weights every 100 iterations. Basically, every time the training process stops, it is restarted from the last saved weights.
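One caveat with the loop above: it never exits by itself, so once training reaches max_batches it will presumably keep relaunching darknet. Below is a minimal sketch that stops as soon as a final weights file appears. It assumes darknet writes something like `backup/yolov3_final.weights` when training completes (check your backup folder for the actual name); the helper `restart_until_final`, the `RESTART_DELAY` variable and the pause between runs are hypothetical additions, not part of darknet.

```shell
#!/bin/bash
# Hypothetical helper: rerun a command until the "final" file exists.
# $1 = path of the final weights file; remaining arguments = command to run.
restart_until_final() {
    local final_weights="$1"
    shift
    local runs=0
    while [ ! -e "$final_weights" ]; do
        runs=$((runs + 1))
        echo "[ $(date) ] starting run $runs: $*"
        "$@"
        local status=$?   # exit status of the command we just ran
        echo "[ $(date) ] run $runs exited with status $status"
        # Assumption: a short pause avoids a tight loop if the command fails instantly
        sleep "${RESTART_DELAY:-10}"
    done
    echo "[ $(date) ] $final_weights found, stopping"
}

# Example usage (paths are assumptions, adjust to your setup):
# cd /home/...the location of your darknet executable || exit 1
# restart_until_final backup/yolov3_final.weights \
#     ./darknet detector train data/obj.data yolov3.cfg backup/yolov3_last.weights
```

The check is done before each run rather than after, so starting the script when the final weights already exist does nothing, similar to the first test in the crontab script further down.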

@lfares

  • How much CPU RAM do you have?

  • Do you use the latest version of Darknet?

  • What parameters do you set in the Makefile?

I put batch = 64 and subdivision = 8 on the .cfg file

  • I can see from your screenshot that you use batch=32. Does the error occur with both batch=64 and batch=32?

  • Do you get this issue with batch=16 subdivisions=2?

On an Nvidia Jetson Xavier, changing batch to 8 and subdivisions to 1 got the training running.

@kevinrev26 It looks like batch=8 is required for training due to low CPU RAM capacity on Jetson.
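To compare the settings tried above, here is a minimal sketch that reads `batch` and `subdivisions` from the `[net]` section of a `.cfg` file and prints the images per iteration (`batch`) and per GPU pass (`batch/subdivisions`). It assumes, as the Jetson report suggests, that CPU RAM use grows with `batch` while GPU memory use grows with `batch/subdivisions`; treat that as a rough guide, not a guarantee. `cfg_batch_info` is a hypothetical helper, not a darknet command.

```shell
#!/bin/bash
# Hypothetical helper: print batch-size figures from a darknet .cfg file.
# Uses the first uncommented "batch=" and "subdivisions=" lines (spaces allowed).
cfg_batch_info() {
    local cfg="$1" batch subdivisions
    batch=$(awk -F'=' '/^[[:space:]]*batch[[:space:]]*=/ { gsub(/[[:space:]]/, "", $2); print $2; exit }' "$cfg")
    subdivisions=$(awk -F'=' '/^[[:space:]]*subdivisions[[:space:]]*=/ { gsub(/[[:space:]]/, "", $2); print $2; exit }' "$cfg")
    # Both values must be non-negative integers
    case "$batch" in ''|*[!0-9]*) echo "invalid or missing batch in $cfg" >&2; return 1 ;; esac
    case "$subdivisions" in ''|*[!0-9]*) echo "invalid or missing subdivisions in $cfg" >&2; return 1 ;; esac
    # subdivisions must be > 0 and divide batch evenly (checked in that order)
    if [ "$subdivisions" -eq 0 ] || [ $((batch % subdivisions)) -ne 0 ]; then
        echo "subdivisions=$subdivisions must be > 0 and divide batch=$batch" >&2
        return 1
    fi
    echo "images per iteration: $batch"
    echo "images per GPU pass: $((batch / subdivisions))"
}

# Example usage:
# cfg_batch_info yolo-obj.cfg
```

Note that batch=64/subdivisions=8, batch=16/subdivisions=2 and batch=8/subdivisions=1 all give 8 images per GPU pass; only the images per iteration differ, which fits the CPU RAM explanation above.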

I was able to train, but detection is not working properly. Is there a minimum number of dataset images, or something like that?

@kevinrev26 You must train for at least 4000 iterations with batch=64, or 32000 iterations with batch=8. And you should use a pre-trained weights file for training.
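The two figures match because both cover the same number of training images: 4000 × 64 = 32000 × 8 = 256000. A small sketch of that arithmetic, for picking an iteration count when you change batch (the helper name is hypothetical):

```shell
#!/bin/bash
# Hypothetical helper: keep the total number of training images constant
# when changing batch size, i.e. new_iterations = iterations * batch / new_batch.
equivalent_iterations() {
    local iterations="$1" batch="$2" new_batch="$3"
    echo $(( iterations * batch / new_batch ))
}

equivalent_iterations 4000 64 8    # prints 32000
equivalent_iterations 4000 64 16   # prints 16000
```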

Funny, I also did this with a bash script run from crontab a few weeks ago :-D

My crontab looks like this:

```
$ crontab -l
...
# Start the script every minute
*/1 * * * * /opt/start_training.sh
...
```

/opt/start_training.sh looks like this:

```
#!/bin/bash

# Check if the final weights have already been reached; if so, exit the script,
# as we don't need to train the same weights file again and again
if [ -e "/srv/storage/training/608_weights/608_weights_final.weights" ]; then
    echo "[ $(date) ] Final Weights already reached - Exit :-)"
    exit 0
fi

# Check if the detector is already running; if not, start it,
# after checking whether this is the first run or a restart of the detector
if [ "$(ps auxfw | grep -v grep | grep "detector train" -q; echo $?)" -ne 0 ]; then
    echo -e "[ $(date) --- Start training ]"
    telinit 2
    cd /home/user/computer-vision/darknet2/ || exit 1
    # If a "..._last.weights" file exists, use it as a checkpoint and start the
    # detector from that position; if not, start from zero
    if [ -e "/srv/storage/training/608_weights/608_weights_last.weights" ]; then
        ./darknet detector train data/.data cfg/.cfg /srv/storage/training/608_weights/608_weights_last.weights darknet53.conv.74 -dont_show -map -gpus 0,1,2,3 1>>/srv/storage/training/608_training_log1.log 2>>/srv/storage/training/608_training_log2.log
    else
        ./darknet detector train data/.data cfg/.cfg darknet53.conv.74 -dont_show -map -gpus 0,1,2,3 1>>/srv/storage/training/608_training_log1.log 2>>/srv/storage/training/608_training_log2.log
    fi
else
    echo -e "[ $(date) --- already training ]"
fi
```

Hi, can you tell me where I should create this bash file? In the darknet directory or in some system directory?

Hi, this is the code to put in a bash file, which you can save as TheNameYouWant.sh.
You can save it anywhere on your disk; it doesn't matter, as long as the path to the darknet executable in the script is correct.

Once it's saved, open a Linux terminal, go to the directory where you saved the file and type:
sh TheNameYouWant.sh

Thank you so much.

