I have trained a model on the 960-hour LibriSpeech data, setting the learning rate and lrcrit to 0.002 (as you mentioned, on the order of 1e-3 or 1e-4). After 500 epochs the loss still seems high, around 7.06, and the dev-TER is still at 15%. It took about 2 weeks to finish training.
After that, I tried to continue training using the saved last model and the continue option, and it gives me the issue below.

How should I set the flags file for my 000_model_last.bin? Should it be written in train.cfg? It would also be good to mention this in the docs (train.md).
I had the same issue. I think it's due to a bug in the code. In the train source code, line 69:
} else if (runStatus == "continue") {
  runPath = argv[2];
  while (fileExists(getRunFile("model_last.bin", runIdx, runPath))) {
    ++runIdx;
  }
When you continue training, you have to specify the runpath as a command-line argument; it does not seem to work when you just add it to the train.cfg file.
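For anyone trying to follow what that loop does, here is a minimal sketch of the run-index scan. It is not the actual wav2letter code: `runFileSketch` and `nextRunIdx` are hypothetical names, the `<runPath>/<NNN>_<name>` layout is an assumption inferred from the `000_model_last.bin` file name mentioned above, and starting the index at 0 is also an assumption. The file-existence check is passed in as a callback so the sketch does not touch the disk.

```cpp
#include <cstdio>
#include <functional>
#include <string>

// Hypothetical stand-in for getRunFile: builds "<runPath>/<NNN>_<name>".
// The zero-padded three-digit prefix is assumed from "000_model_last.bin".
std::string runFileSketch(const std::string& name, int runIdx,
                          const std::string& runPath) {
  char prefix[16];
  std::snprintf(prefix, sizeof(prefix), "%03d_", runIdx);
  return runPath + "/" + prefix + name;
}

// Returns the first run index for which no model_last.bin exists yet,
// mirroring the while loop quoted above. `exists` plays the role of
// fileExists; the starting index of 0 is an assumption.
int nextRunIdx(const std::string& runPath,
               const std::function<bool(const std::string&)>& exists) {
  int runIdx = 0;
  while (exists(runFileSketch("model_last.bin", runIdx, runPath))) {
    ++runIdx;
  }
  return runIdx;
}
```

If this reading is right, the scan only finds the earlier runs when runPath points at the directory that already holds them, which would explain why the path has to reach the program through argv[2].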
@drkingman @akhiari: yep, you need to pass your runpath for continue or fork mode as the argument immediately after continue or fork. It's a little nonstandard; we'll consider adding it as a flag.
Hi, I tried mpirun --allow-run-as-root -n 2 /root/.../Train continue runPath=/.../ -enable_distributed true --flagsfile /.../continue.cfg, but it does not seem to work.
@drkingman — the runpath is set as argv[2] alone, and isn't parsed as a flag:
mpirun --allow-run-as-root -n 2 /root/.../Train continue ~/my/path -enable_distributed true --flagsfile /.../continue.cfg
will work (remove runpath=)
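To illustrate why the `runpath=` prefix breaks things, here is a small sketch of positional argument handling. It is not the real Train binary: `runPathFromArgs` is a hypothetical helper that only mimics the `runPath = argv[2]` assignment quoted earlier in the thread, and it covers only the continue case.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: in continue mode the run path is whatever string
// sits at position 2, taken verbatim. No "key=value" parsing happens here.
std::string runPathFromArgs(const std::vector<std::string>& args) {
  if (args.size() > 2 && args[1] == "continue") {
    return args[2];  // "runPath=/x" would be kept as the literal path
  }
  return "";  // not in continue mode, or the path argument is missing
}
```

With this reading, `Train continue /my/path` yields `/my/path`, while `Train continue runPath=/my/path` yields the literal string `runPath=/my/path`, which presumably is not a directory that exists.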
could you try with --iter=10000000?
I faced the same issue and used the continue command the way it was described above. It worked. Thanks.