Wav2letter: Transfer learning options

Created on 29 Jan 2020  ·  9 Comments  ·  Source: flashlight/wav2letter

Hi there, I have been going through the docs trying to find a way to apply transfer learning. My question: is there any option to freeze some parts of the AM model? I saw that there is a "fork" mode for training (which I imagine is used for full fine-tuning), but I can't find any flag to indicate the trained model binary.

transfer learning

Most helpful comment

Hi,
We don't support it off the shelf, but it is possible to do fine-tuning in wav2letter++ by making some code changes. Just add the following function:

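// nLayersForFinetuning: number of layers (with parameters), counted from the
// last layer backwards, that should be fine-tuned. A negative value trains
// the whole network.
void setTrainForFinetuning(
    std::shared_ptr<fl::Module> ntwrk,
    int nLayersForFinetuning) {
  if (nLayersForFinetuning < 0) {
    ntwrk->train();
    return;
  }
  // Per-layer control requires the network to be an fl::Sequential.
  auto seq = std::dynamic_pointer_cast<fl::Sequential>(ntwrk);
  if (!seq) {
    throw std::runtime_error(
        "setTrainForFinetuning: network is not an fl::Sequential");
  }
  int processedLastLayers = 0;
  // Walk backwards from the output layer.
  for (int i = static_cast<int>(seq->modules().size()) - 1; i >= 0; --i) {
    auto module = seq->module(i);
    if (processedLastLayers < nLayersForFinetuning &&
        module->params().size() > 0) {
      // One of the last N layers with parameters: keep it trainable.
      processedLastLayers++;
      module->train();
    } else {
      // Earlier layers and parameter-free layers are switched to eval mode.
      module->eval();
    }
  }
}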

and replace ntwrk->train() in the Train.cpp file with setTrainForFinetuning(ntwrk, nLayersForFinetuning), then run the training in fork mode.

Hope it helps!
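To see which layers the function above would leave trainable, the selection rule can be reproduced on a toy model. This is a minimal sketch that does not use flashlight at all: `MockLayer`, `selectLayersForFinetuning`, and `trainableIndices` are hypothetical names introduced only for illustration, and the layer layout in the usage note is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a network layer; NOT a flashlight type.
struct MockLayer {
  std::size_t numParams; // 0 for parameter-free layers (e.g. activations)
  bool trainable;
};

// Marks the last `nLayersForFinetuning` layers that own parameters as
// trainable and freezes everything else. A negative count makes every
// layer trainable, following the same convention as setTrainForFinetuning.
void selectLayersForFinetuning(
    std::vector<MockLayer>& layers,
    int nLayersForFinetuning) {
  if (nLayersForFinetuning < 0) {
    for (auto& layer : layers) {
      layer.trainable = true;
    }
    return;
  }
  int processed = 0;
  // Iterate from the output layer towards the input.
  for (auto it = layers.rbegin(); it != layers.rend(); ++it) {
    const bool pick =
        processed < nLayersForFinetuning && it->numParams > 0;
    it->trainable = pick;
    if (pick) {
      ++processed;
    }
  }
}

// Returns the indices of the trainable layers, in network order.
std::vector<std::size_t> trainableIndices(
    const std::vector<MockLayer>& layers) {
  std::vector<std::size_t> indices;
  for (std::size_t i = 0; i < layers.size(); ++i) {
    if (layers[i].trainable) {
      indices.push_back(i);
    }
  }
  return indices;
}
```

For a toy stack laid out as conv, activation, conv, activation, linear (parameter counts 10, 0, 20, 0, 30), asking for 2 layers selects indices 2 and 4: parameter-free layers are skipped when counting, so the count refers to layers that actually hold weights.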

Great, thanks for the help! I imagine I will also need to recompile the code right? Also, is it possible to perform fine-tuning using the fork mode (I mean using a pre-trained model and changing the output dim to the token size in my problem)?

I imagine I will also need to recompile the code right?

Yes, make sure you are using the latest code, make the code changes as mentioned above and recompile.

changing the output dim to the token size in my problem

Yes, it is possible! But it would need some more code changes for your specific use case. If you can mention the architecture, model being fine-tuned and details on the tokens that you want to change, I can give some code pointers.

That would be very nice, thanks. I'm using the original conv_glu because of its relatively small number of parameters for my AM, and I'm using a wordpiece tokenizer with 9996 different tokens. I was planning to use that architecture and change the output dim of the final linear layer, but I'm not sure whether fork mode will let me use the pretrained weights as a starting point.

To run in fork mode:

[Screenshot not preserved: Screen Shot 2020-01-31 at 8 36 45 AM]

  • Run the model with
    > Train fork --tokensdir '' --tokensfile --lexicon --archdir '' --arch -- other necessary gflags

Note that this is a rough way to do this. You might have to adapt it depending on your specific use case.
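The general idea being discussed here (reusing pretrained weights while resizing the final layer for a new token set) can be illustrated independently of wav2letter. The sketch below is not wav2letter or flashlight code and is not a reconstruction of the missing screenshot: `MockLinear`, `makeLinear`, and `adaptOutputLayer` are hypothetical names, and the zero initialization of the new layer is a simplification.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical dense layer; weights are stored row-major as [outDim x inDim].
// NOT a flashlight type.
struct MockLinear {
  std::size_t inDim;
  std::size_t outDim;
  std::vector<float> weights;
};

// Builds a layer whose weights are all set to `init`.
MockLinear makeLinear(std::size_t inDim, std::size_t outDim, float init) {
  return MockLinear{inDim, outDim, std::vector<float>(inDim * outDim, init)};
}

// Copies every pretrained layer except the last one, then appends a fresh
// output layer sized for the new token set. The new layer is zero-initialized
// here for simplicity; a real setup would use the framework's initializer.
std::vector<MockLinear> adaptOutputLayer(
    const std::vector<MockLinear>& pretrained,
    std::size_t newNumTokens) {
  if (pretrained.empty()) {
    throw std::invalid_argument("pretrained network has no layers");
  }
  std::vector<MockLinear> adapted(pretrained.begin(), pretrained.end() - 1);
  adapted.push_back(makeLinear(pretrained.back().inDim, newNumTokens, 0.0f));
  return adapted;
}
```

Only the final layer's shape depends on the token set, which is why the earlier layers can be carried over unchanged. The new output layer starts untrained, so it always needs training on the new data.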

FWIW, I wouldn't recommend using the conv_glu model with a token set of size 9996, for two reasons:

  • It is trained with the ASG criterion. ASG learns a transition score for every pair of tokens, so such a large token set will make training very slow.
  • The total stride of the architecture is 2. For token sets this large, we use architectures with a total stride of 8 to make training faster.
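The two points above can be made concrete with some rough arithmetic. The sketch below assumes that ASG cost scales roughly with (output frames) x (token pairs), which is a simplified proxy rather than a measured cost model. The function names and the 1000-frame utterance used in the usage note are hypothetical.

```cpp
#include <cstdint>

// Output frames after downsampling by the architecture's total stride
// (rounded up). Ignores padding details of real convolution layers.
std::int64_t outputFrames(std::int64_t inputFrames, std::int64_t totalStride) {
  return (inputFrames + totalStride - 1) / totalStride;
}

// Number of ASG transition scores: one per ordered pair of tokens.
std::int64_t asgTransitions(std::int64_t numTokens) {
  return numTokens * numTokens;
}

// Rough cost proxy (an assumption, not a benchmark):
// frames after striding times the number of token pairs.
std::int64_t roughAsgCost(
    std::int64_t inputFrames,
    std::int64_t totalStride,
    std::int64_t numTokens) {
  return outputFrames(inputFrames, totalStride) * asgTransitions(numTokens);
}
```

With 9996 tokens there are 9996 x 9996 = 99,920,016 token pairs, compared with 900 for a hypothetical 30-token character set. For a 1000-frame input, a total stride of 2 leaves 500 output frames while a stride of 8 leaves 125, so under this proxy the stride-8 architecture is about 4 times cheaper per utterance.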

Thanks for the recommendations ;)

So you recommend that I use character-level tokens if I want to use GLU with ASG, right?

yes.

Hi, we are trying to do transfer learning from the librivox SOTA recipe using TDS Seq2Seq (as mentioned here: https://github.com/facebookresearch/wav2letter/tree/master/recipes/models/sota/2019), with 4 warmup epochs and a step size of 60 epochs, on our private dataset with the seq2seq criterion. We created the lexicon, tokens, and word pieces in the usual way given by the recipe.

However, even after 7-8 epochs, the train WER still stays high (say, 100 to 105) and doesn't change. What could be causing the issue that the WER doesn't come down even a bit?

Also, are there some recommended config settings for better transfer learning?
