Wav2letter: Problems with the continue option

Created on 24 Feb 2019 · 6 Comments · Source: flashlight/wav2letter

I trained a model on the 960 hours of LibriSpeech data with the learning rate and lrcrit both set to 0.002 (as you suggested, on the order of 1e-3 or 1e-4). After 500 epochs the loss is still high, around 7.06, and the dev-TER is still at 15%. Training took about 2 weeks.

I am now trying to continue training from the saved last model using the continue option, but it gives me the issue below.
How should I set the flagsfile for my 000_model_last.bin? Should it be written in train.cfg? It would also be good to mention this in the docs (train.md).

enhancement

All 6 comments

I had the same issue. I think it's due to a bug in the code. In the train source code, line 69:

    } else if (runStatus == "continue") {
      runPath = argv[2];
      while (fileExists(getRunFile("model_last.bin", runIdx, runPath))) {
        ++runIdx;
      }

When you continue training, you have to specify the runpath as a command-line argument; it does not seem to work when you just add it to the train.cfg file.
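
To make the loop in the quoted excerpt more concrete, here is a minimal, self-contained sketch of how such a search for the next run index could work. It is not the wav2letter implementation: `runFile`, `canOpen`, `nextRunIdx`, and the `firstIdx` parameter are hypothetical names, and the `<runPath>/<NNN>_<name>` file layout is an assumption inferred from the `000_model_last.bin` file name mentioned in the question.

```cpp
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>

// Hypothetical stand-in for getRunFile. The "<runPath>/<NNN>_<name>" layout
// is an assumption inferred from the file name 000_model_last.bin.
std::string runFile(const std::string& name, int runIdx,
                    const std::string& runPath) {
  char prefix[16];
  std::snprintf(prefix, sizeof(prefix), "%03d_", runIdx);
  return runPath + "/" + prefix + name;
}

// Default existence check: a file "exists" if it can be opened for reading.
bool canOpen(const std::string& path) {
  return std::ifstream(path).good();
}

// Returns the first run index for which no model_last.bin exists yet,
// walking upward from firstIdx the way the quoted while-loop does.
int nextRunIdx(const std::string& runPath,
               const std::function<bool(const std::string&)>& exists = canOpen,
               int firstIdx = 0) {
  int runIdx = firstIdx;
  while (exists(runFile("model_last.bin", runIdx, runPath))) {
    ++runIdx;
  }
  return runIdx;
}
```

Under these assumptions, a run path that already holds 000_model_last.bin and 001_model_last.bin yields index 2. The point of the sketch is that the search only looks inside whatever string was stored in runPath, so a wrong runPath silently finds no earlier runs.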

@drkingman @akhiari: yep, you need to pass your runpath for continue or fork mode as the argument immediately after continue or fork. It's a little nonstandard; we'll consider adding it as a flag.

Hi, I tried mpirun --allow-run-as-root -n 2 /root/.../Train continue runPath=/.../ -enable_distributed true --flagsfile /.../continue.cfg, but it does not seem to work.

@drkingman: the runpath is taken from argv[2] on its own and isn't parsed as a flag:

mpirun --allow-run-as-root -n 2 /root/.../Train continue ~/my/path -enable_distributed true --flagsfile /.../continue.cfg

will work (remove the runPath= prefix).
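
As an illustration of why the runPath= prefix gets in the way, here is a minimal sketch of a positional run-path argument, assuming only what the thread states: in continue mode the value at argv[2] is used verbatim. `resolveRunPath` is a hypothetical helper written for this example, not a wav2letter function, and the other modes are deliberately not modeled.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of wav2letter. args mirrors argv:
// args[0] is the binary, args[1] the mode, args[2] the run path in
// "continue" mode. Everything after that would be ordinary flags.
std::string resolveRunPath(const std::vector<std::string>& args) {
  if (args.size() < 2) {
    throw std::invalid_argument("expected a mode as the first argument");
  }
  if (args[1] == "continue") {
    if (args.size() < 3) {
      throw std::invalid_argument("continue needs a run path right after it");
    }
    // Taken verbatim: no flag parsing happens on this argument, so a
    // "runPath=" prefix would simply become part of the directory name.
    return args[2];
  }
  // Other modes are out of scope for this sketch.
  return "";
}
```

With this sketch, passing `continue ~/my/path` gives the intended directory, while `continue runPath=/my/path` gives the literal string `runPath=/my/path`, which is not a directory that holds any saved model.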

Could you try with --iter=10000000?

I faced the same issue and used the continue command the way it is described above. It worked. Thanks.
