Fairseq: I want to finetune this model with my own audio files, how can that be done? I have nearly 10 hours of data.

Created on 18 Aug 2020 · 9 comments · Source: pytorch/fairseq

question

Most helpful comment

Even for pretraining or finetuning, what I'm seeing is that a transcript file for all the audio files is needed, like the one available for the LibriSpeech dataset. But I don't have transcripts for my audio files. Is it possible to train or finetune the model with audio files whose transcripts are not available?

The wav2vec pre-training step does not need transcriptions.
Just make sure the wav files are in 16 kHz, 1-channel format and already split into chunks.
The wav2vec 2.0 README.md suggests wav lengths of 15 s to 30 s, but in my experience a maximum of 30 s needs a super-computer spec, because it really consumes GPU memory.

For me, wav files of 2 s to 15 s can train in Google Colab, and 2 s to 6 s on my PC with 4 GB of GPU memory.
I downloaded the audio data from YouTube with ytmp3, chunked it on silence with a threshold setting that keeps the maximum audio length within GPU memory, and converted the mp3 to wav at 16 kHz and 1 channel.
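As a rough sketch of the 16 kHz / 1-channel conversion (assuming the samples arrive as a numpy array; in practice ffmpeg or sox does the resampling better, and the linear interpolation here is only an approximation):

```python
import numpy as np

def to_mono_16k(samples: np.ndarray, src_rate: int, dst_rate: int = 16000) -> np.ndarray:
    """Down-mix to one channel and resample by linear interpolation (illustrative only)."""
    if samples.ndim == 2:                      # (num_samples, channels) -> average to mono
        samples = samples.mean(axis=1)
    duration = samples.shape[0] / src_rate
    n_out = int(round(duration * dst_rate))    # number of output samples at 16 kHz
    src_t = np.linspace(0.0, duration, num=samples.shape[0], endpoint=False)
    dst_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(dst_t, src_t, samples)

# one second of stereo audio at 44.1 kHz becomes 16000 mono samples
print(to_mono_16k(np.zeros((44100, 2)), 44100).shape)  # (16000,)
```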

CREATE PRE-TRAINED MODEL

Steps:

  1. Put the wav files with the spec mentioned above in a specific directory, e.g. wav_file.
  2. Create a new directory to save the manifest, e.g. wav_manifest.
  3. Create a new directory to save the result, e.g. w2v2_pre_train_model.
  4. Run the wav2vec_manifest.py that is inside the fairseq/examples/wav2vec directory, with this command (based on the wav2vec 2.0 README.md):
    python3 'examples/wav2vec/wav2vec_manifest.py' '/path/to/wav_file' --dest 'path/to/wav_manifest' --ext wav
    It will create train.tsv and valid.tsv in your wav_manifest directory.
  5. Then start training to create the pre-trained model, using the command in the wav2vec README.md. I chose the base model on 1 GPU, with this command:
python3 fairseq/train.py path/to/wav_manifest \
--save-dir path/to/w2v2_pre_train_model --fp16 --num-workers 128 --task audio_pretraining --criterion wav2vec --arch wav2vec2 \
--log-keys '["prob_perplexity","code_perplexity","temp"]' --quantize-targets --extractor-mode default \
--conv-feature-layers '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2' --final-dim 256 --latent-vars 320 \
--latent-groups 2 --latent-temp '(2,0.5,0.999995)' --infonce --optimizer adam \
--adam-betas '(0.9,0.98)' --adam-eps 1e-06 --lr-scheduler polynomial_decay --total-num-update 400000 \
--lr 0.0005 --warmup-updates 32000 --mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 \
--encoder-layerdrop 0.05 --dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.1 \
--loss-weights '[0.1, 10]' --conv-pos 128 --conv-pos-groups 16 --num-negatives 100 --cross-sample-negatives 0 \
--max-sample-size 250000 --min-sample-size 32000 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-tokens 1400000 --max-update 400000 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d
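For reference, the train.tsv and valid.tsv files that wav2vec_manifest.py produced in step 4 are plain text: the first line is the root audio directory, and each following line is a wav path relative to that root, a tab, and the number of samples in that file. The file names and counts below are only illustrative:

```
/path/to/wav_file
clip_0001.wav	45120
clip_0002.wav	98304
```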

I chose 128 workers because I train in Google Colab. If you use a home PC, reduce it to about your processor count minus 2 or 3; e.g. with 8 cores, use 5 or 6 workers. A little warning, though: it makes your PC really slow and laggy. If you want to use the machine for something else while training, use Google Colab or something similar like Kaggle or AWS, or reduce the workers to 2 or 3.

In my experience, pre-training stops automatically when it reaches its best result. I trained for 1 week in Google Colab; to work around the GPU usage limit (12 hrs/day) I used 2 accounts, alternating every 12 hours.
That way I could still use my PC for other things, without lag.

  1. After the pre-trained model is created, the next step is to train it again (fine-tuning) with labelled wav files.

FINE TUNING

I use a dataset from Mozilla Common Voice that has 8 hrs of labelled wav. From here on, you must prepare labelled wav audio; how much depends on your capacity to produce it. If you have 10k hrs of unlabelled audio, yes, it's hard work to label it all. But the benefit of wav2vec 2.0 is that you don't need that much labelled audio: in the research paper they used 58k hrs of unlabelled audio to create the pre-trained model and as little as 10 minutes of labelled audio for fine-tuning.
My assumption is that with 58k hrs of unlabelled data the model already learns everything it needs to understand speech features, and 10 minutes of labelled data is enough to map those features to labels. If you don't have that much unlabelled data, e.g. for another language, I think the percentage of labelled data should be raised. If you work with English, just use the model already shared by FB.

So, the steps to prepare for fine-tuning are:

  1. Make sure you have labelled audio data in wav format, 16 kHz, 1 channel.
  2. Put the label or transcription file in the same folder as the wav files. The transcription file format is:
file_name1.wav HI HOW ARE YOU
file_name2.wav THIS IS JUST A SAMPLE OF TRANSCRIPTION FORMAT
file_name3.wav THAT YOU SHOULD BUILD

Save the transcription as folder_name.trans.txt. Illustration: I save the wav files in a folder / directory named labelled_wav_file, so the transcription file is named labelled_wav_file.trans.txt, with content like the example above.

You can use uppercase or lowercase letters, but don't use both; pick one.
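Writing that transcription file is easy to script; a small sketch (the mapping and the folder name are just placeholders for your own data):

```python
import os

# hypothetical mapping from wav file name to its uppercase transcript
transcripts = {
    "file_name1.wav": "HI HOW ARE YOU",
    "file_name2.wav": "THIS IS JUST A SAMPLE OF TRANSCRIPTION FORMAT",
}

wav_dir = "labelled_wav_file"                  # folder that holds the wav files
os.makedirs(wav_dir, exist_ok=True)

# the file is named after the folder and sits inside it: labelled_wav_file.trans.txt
trans_path = os.path.join(wav_dir, os.path.basename(wav_dir) + ".trans.txt")
with open(trans_path, "w") as f:
    for name, text in transcripts.items():
        f.write(f"{name} {text}\n")            # one "file_name TRANSCRIPT" line per wav
```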

  1. Run wav2vec_manifest.py again to produce train.tsv and valid.tsv from the labelled audio data:
    python3 examples/wav2vec/wav2vec_manifest.py /path/to/labeled_wav_file --dest /labelled_manifest/path --ext wav

  2. After that, run libri_labels.py (in the fairseq/examples/wav2vec/ directory) twice, with these commands:

python3 libri_labels.py /path/to/file/labelled_wav_file/train.tsv --output-dir /path/to/file/labelled_wav_file/ --output-name train
python3 libri_labels.py /path/to/file/labelled_wav_file/valid.tsv --output-dir /path/to/file/labelled_wav_file/ --output-name valid

Those two commands produce the files train.ltr, train.wrd, valid.ltr and valid.wrd. There was an error when running libri_labels.py because a dictionary key did not match the key used to look up the dictionary, but I already fixed the code a few hours ago, so you had better download libri_labels.py again.
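For intuition about those outputs: the .wrd file holds each raw transcript line, while the matching .ltr line spells it out symbol by symbol, with | as the word-boundary token. The transformation is just:

```python
def to_letter_format(transcript: str) -> str:
    # spaces become the word-boundary token '|', then every symbol is space-separated
    return " ".join(transcript.replace(" ", "|")) + " |"

print(to_letter_format("HI HOW ARE YOU"))
# H I | H O W | A R E | Y O U |
```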

If you still get an error when running libri_labels.py, replace its contents with this script:

import argparse
import os


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("tsv")
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--output-name", required=True)
    args = parser.parse_args()

    os.makedirs(args.output_dir, exist_ok=True)

    transcriptions = {}

    with open(args.tsv, "r") as tsv, open(
        os.path.join(args.output_dir, args.output_name + ".ltr"), "w"
    ) as ltr_out, open(
        os.path.join(args.output_dir, args.output_name + ".wrd"), "w"
    ) as wrd_out:
        root = next(tsv).strip()
        print('root',root)
        for line in tsv:
            line = line.strip()

            dir = os.path.dirname(line)

            if dir not in transcriptions:
                parts = dir.split(os.path.sep)

                trans_path = f"{parts[0]}.trans.txt"

                path = os.path.join(root, dir, trans_path)

                assert os.path.exists(path)
                texts = {}
                with open(path, "r") as trans_f:
                    for tline in trans_f:
                        items = tline.strip().split()
                        texts[items[0]] = " ".join(items[1:])

                transcriptions[dir] = texts
            part = os.path.basename(line).split(".")[0]+'.wav'

            assert part in transcriptions[dir]
            print(transcriptions[dir][part], file=wrd_out)
            print(
                " ".join(list(transcriptions[dir][part].replace(" ", "|"))) + " |",
                file=ltr_out,
            )


if __name__ == "__main__":
    main()

  1. Edit the files train.ltr, train.wrd, valid.ltr and valid.wrd to make sure they only contain the characters A to Z (if you use UPPERCASE letters) or a to z (if you use lowercase letters), plus space and the '|' character.

  2. Create a new file named dict.ltr.txt and open it in a text editor. Then:

  3. Use the find feature of the text editor (I use Sublime Text) and search for the character 'A' (because I use uppercase). The editor will show the number of occurrences of 'A'; record or write it down.
  4. Do the same until every character is counted, i.e. A to Z and the '|' character. Then write them down ordered by count, descending. Example:
A 90280
| 78809
N 38268
I 33692
E 30160
K 24305
U 23955
M 20958
T 20551
S 18850
R 18418
D 15972
G 15711
L 13876
B 11768
P 11503
H 10845
Y 9570
O 6758
J 3993
C 2171
W 1118
F 463
V 187
Z 46
X 16
Q 6

Then save it. Remember, the file must be named dict.ltr.txt.
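The manual find-and-count procedure above can also be automated; a sketch that counts every symbol in the .ltr files and writes dict.ltr.txt ordered by descending count (file names here are assumptions, sitting in the current directory):

```python
from collections import Counter

def build_ltr_dict(ltr_paths, out_path="dict.ltr.txt"):
    """Count every symbol in the given .ltr files and write 'SYMBOL count' lines,
    most frequent first."""
    counts = Counter()
    for path in ltr_paths:
        with open(path) as f:
            for line in f:
                counts.update(line.split())     # each token is one letter or '|'
    with open(out_path, "w") as f:
        for symbol, n in counts.most_common():
            f.write(f"{symbol} {n}\n")
    return counts

# usage: build_ltr_dict(["train.ltr", "valid.ltr"])
```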

  1. Create the lexicon.txt file. I use train.wrd and valid.wrd to make lexicon.txt, with a script I wrote myself:
import codecs

# collect the unique words from train.wrd and valid.wrd, keeping first-seen order
words = []
seen = set()
for path in ('train.wrd', 'valid.wrd'):
    with codecs.open(path, 'r', 'utf8') as f:
        for line in f:
            for w in line.strip().split(' '):
                if w and w not in seen:
                    seen.add(w)
                    words.append(w)

# each lexicon line is: WORD<tab>W O R D |
# open with 'w' (not append) so re-running the script does not duplicate entries
with codecs.open('lexicon.txt', 'w', 'utf8') as f:
    for w in words:
        f.write(w + '\t' + ' '.join(w) + ' |\n')

It will create a lexicon.txt file with a format something like this:

EVERY E V E R Y |
WORD W O R D |
THAT T H A T |
EXISTS E X I S T S |
IN I N |
YOUR Y O U R |
LABEL L A B E L |
OR O R |
TRANSCRIPTION T R A N S C R I P T I O N |
FILE F I L E |
WILL W I L L |
WRITE W R I T E |
DOWN D O W N |
LIKE L I K E |
THIS T H I S |

If you use lowercase letters in the transcriptions, train.wrd and valid.wrd will contain lowercase letters, so it becomes:

every e v e r y |
word w o r d |
that t h a t |
etc ...

  1. I use KenLM, so I need a .bin file. Create it by following the instructions here, from 19:50 until 20:49.

  2. Now you are ready to fine-tune. The command is based on the fine-tuning command in the README.md:

valid_subset=train
python train.py --distributed-world-size 24 --distributed-port $PORT /path/to/training_data --save-dir /model/path --fp16 \
--wer-args '("/path/to/lm/4-gram.bin","/path/to/lexicon",2,-1)' \
--post-process letter --valid-subset $valid_subset --no-epoch-checkpoints --best-checkpoint-metric wer --num-workers 4 \
--max-update 80000 --sentence-avg --task audio_pretraining --arch wav2vec_ctc --w2v-path /path/to/pretrained/model \
--labels ltr --apply-mask --mask-selection static --mask-other 0 --mask-length 10 --mask-prob 0.5 --layerdrop 0.1 \
--mask-channel-selection static --mask-channel-other 0 --mask-channel-length 64 --mask-channel-prob 0.5 --zero-infinity \
--feature-grad-mult 0.0 --freeze-finetune-updates 10000 --validate-after-updates 10000 --optimizer adam \
--adam-betas '(0.9, 0.98)' --adam-eps 1e-08 --lr 2e-05 --lr-scheduler tri_stage --warmup-steps 8000 --hold-steps 32000 \
--decay-steps 40000 --final-lr-scale 0.05 --final-dropout 0.0 --dropout 0.0 --activation-dropout 0.1 --criterion ctc \
--attention-dropout 0.0 --max-tokens 1280000 --seed 2337 --log-format json --log-interval 500 --ddp-backend no_c10d

Now, I want to ask here, because running step 8 results in an error. The error is:

File "/home/bram/Documents/coding/Speech/fairseq/train.py", line 14, in <module>
    cli_main()
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq_cli/train.py", line 345, in cli_main
    distributed_utils.call_main(args, main)
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq/distributed_utils.py", line 268, in call_main
    main(args, **kwargs)
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq_cli/train.py", line 61, in main
    model = task.build_model(args)
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq/tasks/fairseq_task.py", line 546, in build_model
    model = models.build_model(args, self)
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq/models/__init__.py", line 57, in build_model
    return ARCH_MODEL_REGISTRY[model_cfg.arch].build_model(model_cfg, task)
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq/models/wav2vec/wav2vec2_asr.py", line 168, in build_model
    w2v_encoder = Wav2VecEncoder(args, task.target_dictionary)
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq/models/wav2vec/wav2vec2_asr.py", line 331, in __init__
    args.w2v_path, arg_overrides
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq/checkpoint_utils.py", line 211, in load_checkpoint_to_cpu
    setattr(args, arg_name, arg_val)
AttributeError: 'NoneType' object has no attribute 'dropout'

The command I use is:

python3 train.py '/home/bram/Documents/coding/speech/traindata/text_label' \
--save-dir '/home/bram/Documents/coding/speech/traindata/model_finetuning_wav2vec' --fp16 \
--wer-args '("/home/bram/Documents/coding/speech/traindata/text_label/lm.bin","/home/bram/Documents/coding/speech/traindata/text_label/lexicon.txt",2,-1)' \
--post-process letter --valid-subset valid --no-epoch-checkpoints --best-checkpoint-metric wer --num-workers 128 \
--max-update 400000 --sentence-avg --task audio_pretraining --arch wav2vec_ctc \
--w2v-path '/home/bram/Documents/coding/speech/traindata/w2v2_pre_traned_model/checkpoint_best.pt' \
--labels ltr --apply-mask --mask-selection static --mask-other 0 --mask-length 10 --mask-prob 0.5 --layerdrop 0.1 \
--mask-channel-selection static --mask-channel-other 0 --mask-channel-length 64 --mask-channel-prob 0.5 --zero-infinity \
--feature-grad-mult 0.0 --freeze-finetune-updates 10000 --validate-after-updates 10000 --optimizer adam \
--adam-betas '(0.9, 0.98)' --adam-eps 1e-08 --lr 2e-05 --lr-scheduler tri_stage --warmup-steps 8000 --hold-steps 32000 \
--decay-steps 40000 --final-lr-scale 0.05 --final-dropout 0.0 --dropout 0.0 --activation-dropout 0.1 --criterion ctc \
--attention-dropout 0.0 --max-tokens 1280000 --seed 2337 --log-format json --log-interval 500 --ddp-backend no_c10d

the folder /home/bram/Documents/coding/speech/traindata/text_label contains:

1. dict.ltr.txt
2. lexicon.txt
3. lm.bin
4. train.tsv
5. train.wrd
6. train.ltr
7. valid.tsv
8. valid.wrd
9. valid.ltr

The folder /home/bram/Documents/coding/speech/traindata/model_finetuning_wav2vec is empty; it is meant to save the fine-tuned model produced by the fine-tuning process.

The folder /home/bram/Documents/coding/speech/traindata/w2v2_pre_traned_model/ contains:

1. checkpoint_best.pt
2. checkpoint_last.pt

These files result from the pre-training process.

I tried to debug by following the process step by step, and found where the error happens:

fairseq/train.py -> calls fairseq/fairseq_cli/train.py

EDIT:
I found the error and why, but cannot solve the problem.
The error happens in fairseq/fairseq/checkpoint_utils.py, line 211, in the load_checkpoint_to_cpu function.

I tried to reproduce the steps. Here is the report:

Line 201: def load_checkpoint_to_cpu(path, arg_overrides=None):
This function is called from fairseq/fairseq/models/wav2vec/wav2vec2_asr.py line 330. Before that call, there is code that builds the arg_overrides variable, which at that point is:

arg_overrides = {'dropout': 0.0,
                 'activation_dropout': 0.1,
                 'dropout_input': 0,
                 'attention_dropout': 0.0,
                 'mask_length': 10,
                 'mask_prob': 0.5,
                 'mask_selection': 'static',
                 'mask_other': 0.0,
                 'no_mask_overlap': False,
                 'mask_channel_length': 64,
                 'mask_channel_prob': 0.5,
                 'mask_channel_selection': 'static',
                 'mask_channel_other': 0.0,
                 'no_mask_channel_overlap': False,
                 'encoder_layerdrop': 0.1,
                 'feature_grad_mult': 0.0}

and the path is args.w2v_path, i.e. '/home/bram/Documents/coding/speech/traindata/w2v2_pre_traned_model/checkpoint_best.pt', which comes from the --w2v-path option we set.

So the function def load_checkpoint_to_cpu(path, arg_overrides=None): receives path = '/home/bram/Documents/coding/speech/traindata/w2v2_pre_traned_model/checkpoint_best.pt'.

OK, then...
with open(PathManager.get_local_path(path), "rb") as f: on line 203 of fairseq/fairseq/checkpoint_utils.py opens the file checkpoint_best.pt and binds it to the variable f.

The error happens when executing:
state = torch.load(f, map_location=lambda s, l: default_restore_location(s, "cpu")) on line 204 of fairseq/fairseq/checkpoint_utils.py.

The error report says 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte, meaning that torch cannot load checkpoint_best.pt.

Any suggestions? Can somebody help? If I can fix this, I will continue the tutorial.

EDIT:
The problem is that the model .pt file doesn't have ['args']; that is why the error happens.

The root cause is that the training script from the wav2vec README.md does not save 'args' in checkpoint_model.pt when you re-create the pre-training process with a custom dataset.

So the script we use to build a custom pre-trained model from a custom dataset does not save the 'args' we used during pre-training, and we must add them ourselves.

To give the pre-trained model its 'args', so it can be read during fine-tuning, run this script:

import torch, argparse, logging, math, os, random, sys, numpy as np
from fairseq import options

# the arguments are based on the command you used to create the pre-trained model. this is just an example.
# the last command I used to run train.py was:
'''
 python3 '/content/repo/fairseq/train.py' --distributed-world-size 1 --distributed-port 0 '/content/drive/My Drive/wav_manifest' \
--save-dir '/content/drive/My Drive/wav2vec_v2_pre_train_model' --fp16 --num-workers 128 --task audio_pretraining --criterion wav2vec --arch wav2vec2 \
--log-keys '["prob_perplexity","code_perplexity","temp"]' --quantize-targets --extractor-mode default \
--conv-feature-layers '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2' --final-dim 768 --latent-vars 320 \
--latent-groups 2 --latent-temp '(2,0.25,0.999995)' --infonce --optimizer adam \
--adam-betas '(0.9,0.98)' --adam-eps 1e-06 --lr-scheduler polynomial_decay --max-update 600000 \
--lr 0.0004 --warmup-updates 32000 --mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 \
--encoder-layerdrop 0.03 --dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.05 \
--loss-weights '[0.1, 10]' --conv-pos 128 --conv-pos-groups 16 --num-negatives 100 --cross-sample-negatives 0 \
--max-sample-size 1500000 --min-sample-size 5000 --dropout 0.05 --attention-dropout 0.1 --weight-decay 0.01 \
--max-tokens 1400000 --max-update 600000 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d --encoder-ffn-embed-dim 4096 --encoder-attention-heads 16 --no-epoch-checkpoints
'''
# copy the arguments above and convert them to the list below:

argument = [
'/content/repo/fairseq/train.py', '--distributed-world-size', '1', '--distributed-port', '0', '/content/drive/My Drive/wav_manifest',
'--save-dir', '/content/drive/My Drive/wav2vec_v2_pre_train_model', 
'--fp16', '--no-epoch-checkpoints', '--skip-invalid-size-inputs-valid-test', '--infonce','--quantize-targets',
'--num-workers', '128', '--task', 'audio_pretraining', '--criterion', 'wav2vec', '--arch', 'wav2vec2',
'--log-keys', '["prob_perplexity","code_perplexity","temp"]',  '--extractor-mode', 'default',
'--conv-feature-layers', '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2',
'--final-dim', '768', '--latent-vars', '320', '--latent-groups', '2', '--latent-temp', '(2,0.25,0.999995)',  '--optimizer', 'adam',
'--adam-betas', '(0.9,0.98)', '--adam-eps', '1e-06', '--lr-scheduler', 'polynomial_decay', '--max-update', '600000',
'--lr', '0.0004', '--warmup-updates', '32000', '--mask-length', '10', '--mask-prob', '0.65', '--mask-selection', 'static', '--mask-other', '0',
'--encoder-layerdrop', '0.03', '--dropout-input', '0.1', '--dropout-features', '0.1', '--feature-grad-mult', '0.05',
'--loss-weights', '[0.1, 10]', '--conv-pos', '128', '--conv-pos-groups', '16', '--num-negatives', '100', '--cross-sample-negatives', '0',
'--max-sample-size', '1500000', '--min-sample-size', '5000', '--dropout', '0.05', '--attention-dropout', '0.1', '--weight-decay', '0.01',
'--max-tokens', '1400000', '--max-update', '600000',  '--ddp-backend', 'no_c10d', '--encoder-ffn-embed-dim', '4096', '--encoder-attention-heads', '16'
] 

logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=os.environ.get("LOGLEVEL", "INFO").upper(),
    stream=sys.stdout,
)
logger = logging.getLogger("fairseq_cli.train")

parser = options.get_training_parser()

sys.argv = argument

args = options.parse_args_and_arch(parser, modify_parser=None)

model_file = '/content/drive/MyDrive/wav2vec_v2_pre_train_model/checkpoint_best.pt'

file = torch.load(model_file,map_location=None)

file['args'] = args

torch.save(file,'/content/drive/MyDrive/model_pretrain/fixed_model.pt')

After this, my error is in w2l_decoder.py line 156, inside an iteration:
self.trie.insert(spelling_idxs, word_idx, score)
So I print the values being iterated just before inserting into the trie, and print 'success' after each successful insert.
This is what I added to the script:

for i, (word, spellings) in enumerate(self.lexicon.items()):
    word_idx = self.word_dict.get_index(word)
    _, score = self.lm.score(start_state, word_idx)
    for spelling in spellings:
        spelling_idxs = [tgt_dict.index(token) for token in spelling]
        assert (
                    tgt_dict.unk() not in spelling_idxs
                ), f"{spelling} {spelling_idxs}"
        print(word,spelling_idxs, word_idx, score)
        self.trie.insert(spelling_idxs, word_idx, score)
        print('success')

the result is:

TERBAWA [12, 8, 14, 18, 4, 25, 4, 5] 2727 -5.560528755187988
success
SUATUPUN [13, 10, 4, 12, 10, 19, 10, 6, 5] 2782 -5.560528755187988
success
NYAMBAR [6, 21, 4, 11, 18, 4, 14, 5] 2626 -6.238666534423828
success
DIBERIKANNYA [15, 7, 18, 8, 14, 7, 9, 4, 6, 6, 21, 4, 5] 2765 -5.809808731079102
success
CENTANG [24, 8, 6, 12, 4, 6, 16, 5] 2667 -6.238666534423828
success
IRI [7, 14, 7, 5] 2676 -5.1968841552734375
success
PERMUKAAN [19, 8, 14, 11, 10, 9, 4, 4, 6, 5] 1895 -3.8156046867370605
success
TEMENAN [12, 8, 11, 8, 6, 4, 6, 5] 2843 -6.238666534423828
success
PENGEN [19, 8, 6, 16, 8, 6, 5] 2713 -6.09476375579834
success
CHANNEL [24, 20, 4, 6, 6, 8, 17, 5] 2609 -5.809808731079102
success
DIK [15, 7, 9, 5] 2785 -6.238666534423828
success
APAA [4, 19, 4, 4, 5] 3068 -6.238666534423828
success
HAAH [20, 4, 4, 20, 5] 2350 -6.238666534423828
success
BAGAI [18, 4, 16, 4, 7, 5] 2317 -6.09476375579834
success
RIWAYAT [14, 7, 25, 4, 21, 4, 12, 5] 2743 -5.809808731079102
success
KONTEKSTUAL [9, 22, 6, 12, 8, 9, 13, 12, 10, 4, 17, 5] 2689 -6.09476375579834
success
NGELUARIN [6, 16, 8, 17, 10, 4, 14, 7, 6, 5] 2724 -6.238666534423828
success
MENEMPATKANNYA [11, 8, 6, 8, 11, 19, 4, 12, 9, 4, 6, 6, 21, 4, 5] 2662 -5.6675238609313965
success
SETRUM [13, 8, 12, 14, 10, 11, 5] 1859 -6.238666534423828
success
TINGGALIN [12, 7, 6, 16, 16, 4, 17, 7, 6, 5] 3127 -6.238666534423828
success
WUK [25, 10, 9, 5] 2694 -6.238666534423828
success
GUYYSS [16, 10, 21, 21, 13, 13, 5] 1866 -6.238666534423828
success
RABA [14, 4, 18, 4, 5] 2601 -5.928036689758301
success
KESELNYA [9, 8, 13, 8, 17, 6, 21, 4, 5] 2253 -6.238666534423828
success
TAROH [12, 4, 14, 22, 20, 5] 2432 -6.238666534423828
success
LAGIPULA [17, 4, 16, 7, 19, 10, 17, 4, 5] 2356 -4.919636249542236
success
SENENG [13, 8, 6, 8, 6, 16, 5] 2767 -6.238666534423828
success
SITULAH [13, 7, 12, 10, 17, 4, 20, 5] 2706 -6.09476375579834
success
PACAR [19, 4, 24, 4, 14, 5] 2752 -6.09476375579834
success
PREMIER [19, 14, 8, 11, 7, 8, 14, 5] 2580 -5.1968841552734375
success
NEH [6, 8, 20, 5] 2659 -6.238666534423828
success
TERNAK [12, 8, 14, 6, 4, 9, 5] 2653 -5.287886619567871
success
MENAMAINYA [11, 8, 6, 4, 11, 4, 7, 6, 21, 4, 5] 2286 -6.238666534423828
success
KUMAKAN [9, 10, 11, 4, 9, 4, 6, 5] 1958 -6.238666534423828
success
MEMPERDAYAKAN [11, 8, 11, 19, 8, 14, 15, 4, 21, 4, 9, 4, 6, 5] 2760 -6.09476375579834
success
RISET [14, 7, 13, 8, 12, 5] 2905 -3.847456932067871
success
NGERASA [6, 16, 8, 14, 4, 13, 4, 5] 3242 -6.238666534423828
success
DARAT [15, 4, 14, 4, 12, 5] 3044 -4.73591947555542
success
CERDIK [24, 8, 14, 15, 7, 9, 5] 1926 -5.474752426147461
success
ADAPUN [4, 15, 4, 19, 10, 6, 5] 2085 -3.4240756034851074
success
TERUTAMA [12, 8, 14, 10, 12, 4, 11, 4, 5] 2612 -3.8312268257141113
success
NGEGOJEK [6, 16, 8, 16, 22, 23, 8, 9, 5] 2566 -6.238666534423828
success
JIJIK [23, 7, 23, 7, 9, 5] 2012 -6.09476375579834
success
IH [7, 20, 5] 2719 -6.238666534423828
success
BEDOLAH [18, 8, 15, 22, 17, 4, 20, 5] 3039 -6.238666534423828
success
DAMAR [15, 4, 11, 4, 14, 5] 2700 -6.238666534423828
success
KONTEN [9, 22, 6, 12, 8, 6, 5] 2645 -5.6675238609313965
success
HAWILA [20, 4, 25, 7, 17, 4, 5] 2701 -6.238666534423828
success
BENTAR [18, 8, 6, 12, 4, 14, 5] 2177 -6.238666534423828
success
GUYS [16, 10, 21, 13, 5] 1861 -6.238666534423828
success
HEY [20, 8, 21, 5] 2419 -6.238666534423828
success
NYARI [6, 21, 4, 14, 7, 5] 2844 -6.238666534423828
success
UPLOAD [10, 19, 17, 22, 4, 15, 5] 2572 -6.09476375579834
success
KAWINANNYA [9, 4, 25, 7, 6, 4, 6, 6, 21, 4, 5] 2282 -6.238666534423828
success
TULANGKU [12, 10, 17, 4, 6, 16, 9, 10, 5] 2583 -6.238666534423828
success
TULANG [12, 10, 17, 4, 6, 16, 5] 2433 -3.659620523452759
success
JOGET [23, 22, 16, 8, 12, 5] 2056 -6.238666534423828
success
CEWEK [24, 8, 25, 8, 9, 5] 2753 -6.238666534423828
success
BERPENDIDIKAN [18, 8, 14, 19, 8, 6, 15, 7, 15, 7, 9, 4, 6, 5] 2542 -5.928036689758301
success
HEINEKEN [20, 8, 7, 6, 8, 9, 8, 6, 5] 3007 -6.238666534423828
success
SOLUSINYA [13, 22, 17, 10, 13, 7, 6, 21, 4, 5] 2650 -4.248564720153809
success
TUMITNYA [12, 10, 11, 7, 12, 6, 21, 4, 5] 2777 -6.09476375579834
success
TENTULAH [12, 8, 6, 12, 10, 17, 4, 20, 5] 2343 -5.809808731079102
corrupted double-linked list

With this clue, the error seems to be caused by my lexicon.txt, but what makes it fail?
I already checked whether there is any unrecognized character or anything else; I found nothing.
I checked whether any word is duplicated; not found.

But the funny thing is...
if I run it again...
the result I get is:

TERBAWA [12, 8, 14, 18, 4, 25, 4, 5] 2727 -5.560528755187988
success
SUATUPUN [13, 10, 4, 12, 10, 19, 10, 6, 5] 2782 -5.560528755187988
success
NYAMBAR [6, 21, 4, 11, 18, 4, 14, 5] 2626 -6.238666534423828
success
DIBERIKANNYA [15, 7, 18, 8, 14, 7, 9, 4, 6, 6, 21, 4, 5] 2765 -5.809808731079102
success
CENTANG [24, 8, 6, 12, 4, 6, 16, 5] 2667 -6.238666534423828
success
IRI [7, 14, 7, 5] 2676 -5.1968841552734375
success
PERMUKAAN [19, 8, 14, 11, 10, 9, 4, 4, 6, 5] 1895 -3.8156046867370605
success
TEMENAN [12, 8, 11, 8, 6, 4, 6, 5] 2843 -6.238666534423828
success
PENGEN [19, 8, 6, 16, 8, 6, 5] 2713 -6.09476375579834
success
CHANNEL [24, 20, 4, 6, 6, 8, 17, 5] 2609 -5.809808731079102
success
DIK [15, 7, 9, 5] 2785 -6.238666534423828
success
APAA [4, 19, 4, 4, 5] 3068 -6.238666534423828
success
HAAH [20, 4, 4, 20, 5] 2350 -6.238666534423828
success
BAGAI [18, 4, 16, 4, 7, 5] 2317 -6.09476375579834
success
RIWAYAT [14, 7, 25, 4, 21, 4, 12, 5] 2743 -5.809808731079102
success
KONTEKSTUAL [9, 22, 6, 12, 8, 9, 13, 12, 10, 4, 17, 5] 2689 -6.09476375579834
success
NGELUARIN [6, 16, 8, 17, 10, 4, 14, 7, 6, 5] 2724 -6.238666534423828
success
MENEMPATKANNYA [11, 8, 6, 8, 11, 19, 4, 12, 9, 4, 6, 6, 21, 4, 5] 2662 -5.6675238609313965
success
SETRUM [13, 8, 12, 14, 10, 11, 5] 1859 -6.238666534423828
success
TINGGALIN [12, 7, 6, 16, 16, 4, 17, 7, 6, 5] 3127 -6.238666534423828
success
WUK [25, 10, 9, 5] 2694 -6.238666534423828
success
GUYYSS [16, 10, 21, 21, 13, 13, 5] 1866 -6.238666534423828
success
RABA [14, 4, 18, 4, 5] 2601 -5.928036689758301
success
KESELNYA [9, 8, 13, 8, 17, 6, 21, 4, 5] 2253 -6.238666534423828
success
TAROH [12, 4, 14, 22, 20, 5] 2432 -6.238666534423828
success
LAGIPULA [17, 4, 16, 7, 19, 10, 17, 4, 5] 2356 -4.919636249542236
success
SENENG [13, 8, 6, 8, 6, 16, 5] 2767 -6.238666534423828
success
SITULAH [13, 7, 12, 10, 17, 4, 20, 5] 2706 -6.09476375579834
success
PACAR [19, 4, 24, 4, 14, 5] 2752 -6.09476375579834
success
PREMIER [19, 14, 8, 11, 7, 8, 14, 5] 2580 -5.1968841552734375
success
NEH [6, 8, 20, 5] 2659 -6.238666534423828
success
TERNAK [12, 8, 14, 6, 4, 9, 5] 2653 -5.287886619567871
success
MENAMAINYA [11, 8, 6, 4, 11, 4, 7, 6, 21, 4, 5] 2286 -6.238666534423828
success
KUMAKAN [9, 10, 11, 4, 9, 4, 6, 5] 1958 -6.238666534423828
success
MEMPERDAYAKAN [11, 8, 11, 19, 8, 14, 15, 4, 21, 4, 9, 4, 6, 5] 2760 -6.09476375579834
success
RISET [14, 7, 13, 8, 12, 5] 2905 -3.847456932067871
success
NGERASA [6, 16, 8, 14, 4, 13, 4, 5] 3242 -6.238666534423828
success
DARAT [15, 4, 14, 4, 12, 5] 3044 -4.73591947555542
success
CERDIK [24, 8, 14, 15, 7, 9, 5] 1926 -5.474752426147461
success
ADAPUN [4, 15, 4, 19, 10, 6, 5] 2085 -3.4240756034851074
success
TERUTAMA [12, 8, 14, 10, 12, 4, 11, 4, 5] 2612 -3.8312268257141113
success
NGEGOJEK [6, 16, 8, 16, 22, 23, 8, 9, 5] 2566 -6.238666534423828
success
JIJIK [23, 7, 23, 7, 9, 5] 2012 -6.09476375579834
success
IH [7, 20, 5] 2719 -6.238666534423828
success
BEDOLAH [18, 8, 15, 22, 17, 4, 20, 5] 3039 -6.238666534423828
success
DAMAR [15, 4, 11, 4, 14, 5] 2700 -6.238666534423828
success
KONTEN [9, 22, 6, 12, 8, 6, 5] 2645 -5.6675238609313965
success
HAWILA [20, 4, 25, 7, 17, 4, 5] 2701 -6.238666534423828
success
BENTAR [18, 8, 6, 12, 4, 14, 5] 2177 -6.238666534423828
success
GUYS [16, 10, 21, 13, 5] 1861 -6.238666534423828
success
HEY [20, 8, 21, 5] 2419 -6.238666534423828
success
NYARI [6, 21, 4, 14, 7, 5] 2844 -6.238666534423828
success
UPLOAD [10, 19, 17, 22, 4, 15, 5] 2572 -6.09476375579834
success
KAWINANNYA [9, 4, 25, 7, 6, 4, 6, 6, 21, 4, 5] 2282 -6.238666534423828
success
TULANGKU [12, 10, 17, 4, 6, 16, 9, 10, 5] 2583 -6.238666534423828
success
TULANG [12, 10, 17, 4, 6, 16, 5] 2433 -3.659620523452759
success
JOGET [23, 22, 16, 8, 12, 5] 2056 -6.238666534423828
success
CEWEK [24, 8, 25, 8, 9, 5] 2753 -6.238666534423828
success
BERPENDIDIKAN [18, 8, 14, 19, 8, 6, 15, 7, 15, 7, 9, 4, 6, 5] 2542 -5.928036689758301
success
HEINEKEN [20, 8, 7, 6, 8, 9, 8, 6, 5] 3007 -6.238666534423828
success
SOLUSINYA [13, 22, 17, 10, 13, 7, 6, 21, 4, 5] 2650 -4.248564720153809
success
TUMITNYA [12, 10, 11, 7, 12, 6, 21, 4, 5] 2777 -6.09476375579834
success
TENTULAH [12, 8, 6, 12, 10, 17, 4, 20, 5] 2343 -5.809808731079102
success
KEEMPAT [9, 8, 8, 11, 19, 4, 12, 5] 2738 -3.6319289207458496
success
KULARANG [9, 10, 17, 4, 14, 4, 6, 16, 5] 2587 -6.238666534423828
success
TIMBUL [12, 7, 11, 18, 10, 17, 5] 2555 -4.70242166519165
success
APAAN [4, 19, 4, 4, 6, 5] 2124 -6.238666534423828
success
NYALA [6, 21, 4, 17, 4, 5] 2435 -5.403155326843262
success
ISTERIMU [7, 13, 12, 8, 14, 7, 11, 10, 5] 2911 -5.474752426147461
success
MAKANYA [11, 4, 9, 4, 6, 21, 4, 5] 2801 -4.241161823272705
Segmentation fault (core dumped)

Why is it funny, and why does it give me a headache?

Look at the word TENTULAH.
In the first attempt, the report shows it was successfully inserted into the trie:

TENTULAH [12, 8, 6, 12, 10, 17, 4, 20, 5] 2343 -5.809808731079102
success

But the second time I ran the command,
that word caused the error: it failed before print('success').

The report prints:

TENTULAH [12, 8, 6, 12, 10, 17, 4, 20, 5] 2343 -5.809808731079102
Segmentation fault (core dumped)

wow...

in this fase, I cannot do a debuuging anymore because of the kenlm use c++ language and cthon language.

any clue guys?

maybe Mr @alexeib can help me...? I already ask in kenlm issues, not been answering right now

All 9 comments

I want to finetune the wav2vec 2.0 model with custom audio data. As there are no detailed steps on how to implement that, I can't proceed. I then need the model for audio streaming, using it for speech-to-text recognition.

1) use wav2vec_manifest to build a manifest of your audio data
2) create a parallel file containing labels (see libri_labels.py for the example format; in particular, if using letters, you need a word-ending token (we use |) appended to the end of every word, with each symbol separated by a space)
3) modify the example finetuning command to be more suitable for 10h. you can use the example command for 100h in the readme file and change the parameters to those used for 10h in the paper
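The letter format in step 2 can be sketched as a tiny helper (to_letter_labels is a hypothetical name; in practice libri_labels.py performs this conversion for Librispeech-style transcripts):

```python
def to_letter_labels(transcript):
    """Convert a transcript to wav2vec letter format: symbols separated
    by spaces, with a '|' word-ending token appended to every word."""
    words = transcript.strip().upper().split()
    return " ".join(" ".join(word) + " |" for word in words)

# "hello world" -> "H E L L O | W O R L D |"
```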

Thanks for your insight @alexeib, will try this.
One more thing I need to ask: is it possible to train my own unsupervised model with this wav2vec 2.0 code? If I have around 10k hours of audio files without any labels, can you suggest how to use your model to train from scratch with my own data?

you can use the example pre-training command from the wav2vec readme to do your own pre-training

Even for pretraining or finetuning, what I'm seeing is that a transcript file for all the audio files is needed, like the one available for the Librispeech dataset. But for the audio files I have, there are no transcripts. Is it possible to train or finetune the model with audio files whose transcript file is not present?

I have a similar question. Using the pretrained models, can I use those embeddings to further train some downstream task?

Specifically, I have raw audio that I want to convert to the embeddings. I want to then use those embeddings in a downstream task. How would I accomplish that?

Even for pretraining or finetuning, what I'm seeing is that a transcript file for all the audio files is needed, like the one available for the Librispeech dataset. But for the audio files I have, there are no transcripts. Is it possible to train or finetune the model with audio files whose transcript file is not present?

the step to create a wav2vec pre-trained model does not need transcriptions.
you just have to make sure the wav files are 16 kHz, 1 channel, and already split.
based on the wav2vec 2.0 README.md, they suggest wav file lengths of 15 s to 30 s, but in my experience a maximum of 30 s is only for supercomputer-class hardware, because it really consumes GPU memory.

for me, I make the wav files 2 s to 15 s, which can train in Google Colab, and 2 s to 6 s on my PC with 4 GB of GPU memory.
I downloaded the audio data from YouTube with ytmp3, chunked it on silence with a threshold setting that keeps the maximum audio length within what fits in GPU memory, then converted the mp3 to wav at 16 kHz, 1 channel.
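Those constraints (16 kHz, mono, 2 s to 15 s in my setup) can be checked before building the manifest; a minimal sketch using only the standard library (check_wav is a hypothetical helper):

```python
import wave

def check_wav(path, min_sec=2.0, max_sec=15.0):
    """Return (ok, rate, channels, seconds) for a wav file, where ok means
    16 kHz, 1 channel, and a duration inside [min_sec, max_sec]."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        seconds = w.getnframes() / float(rate)
    ok = rate == 16000 and channels == 1 and min_sec <= seconds <= max_sec
    return ok, rate, channels, seconds
```

note that the pre-training command below uses --min-sample-size 32000 and --max-sample-size 250000, which at 16 kHz is exactly 2.0 s and about 15.6 s.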

CREATE PRE-TRAINED MODEL

steps to do it:

  1. put the wav files that already meet the spec mentioned above in a specific directory, eg: wav_file.
  2. create a new directory to save the manifest, eg: wav_manifest.
  3. create a new directory to save the result, eg: w2v2_pre_train_model
  4. run the wav2vec_manifest.py that is inside the fairseq/examples/wav2vec directory, with this command (based on the wav2vec 2.0 README.md):
    python3 'examples/wav2vec/wav2vec_manifest.py' '/path/to/wav_file' --dest 'path/to/wav_manifest' --ext wav
    it will create train.tsv and valid.tsv in your wav_manifest directory.
  5. then start training to make the pre-trained model, using the command from the wav2vec README.md. I chose the base model on 1 GPU, with this command:
python3 fairseq/train.py path/to/wav_manifest \
--save-dir path/to/w2v2_pre_train_model --fp16 --num-workers 128 --task audio_pretraining --criterion wav2vec --arch wav2vec2 \
--log-keys '["prob_perplexity","code_perplexity","temp"]' --quantize-targets --extractor-mode default \
--conv-feature-layers '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2' --final-dim 256 --latent-vars 320 \
--latent-groups 2 --latent-temp '(2,0.5,0.999995)' --infonce --optimizer adam \
--adam-betas '(0.9,0.98)' --adam-eps 1e-06 --lr-scheduler polynomial_decay --total-num-update 400000 \
--lr 0.0005 --warmup-updates 32000 --mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 \
--encoder-layerdrop 0.05 --dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.1 \
--loss-weights '[0.1, 10]' --conv-pos 128 --conv-pos-groups 16 --num-negatives 100 --cross-sample-negatives 0 \
--max-sample-size 250000 --min-sample-size 32000 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-tokens 1400000 --max-update 400000 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d
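For reference, the train.tsv/valid.tsv files written by wav2vec_manifest.py have a simple shape: the first line is the root directory, and every following line is a tab-separated relative path and sample count. A sketch of the same format (build_manifest is a hypothetical helper, not part of fairseq):

```python
import os
import wave

def build_manifest(root, wav_paths):
    """Build wav2vec-manifest-style tsv lines: the root directory first,
    then one 'relative/path.wav<TAB>num_samples' line per file."""
    lines = [root]
    for rel_path in wav_paths:
        with wave.open(os.path.join(root, rel_path), "rb") as w:
            lines.append("%s\t%d" % (rel_path, w.getnframes()))
    return lines
```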

i chose 128 workers because I train in google colab. if you use a home pc, set it based on your processor count minus 2 or 3, eg: with 8 cores, use 5 or 6 workers. but a little warning here: it makes your pc really slow and laggy. if you want to use the machine for something else at the same time, use google colab or something similar, like kaggle or AWS, or reduce the workers to 2 or 3.

in training the pre-trained model, in my experience, it stops automatically when it reaches the best result. i trained it for 1 week in google colab; to do that i used 2 accounts, alternating every 12 hrs, to get around the maximum GPU usage (12 hrs/day).
that way i could still use my pc for other things, without lag.

  1. after the pre-trained model is created, the next step is to train it again (the fine-tuning method) with labelled wav files.

FINE TUNING

i use a dataset from mozilla commonvoice that has 8 hrs of labelled wav. so from here you must prepare labelled wav audio in whatever quantity you can manage. if you have 10k hrs of unlabelled audio, yes, it is hard work to label it all, but the benefit of using wav2vec 2.0 is that you don't need that much labelled audio: in the research paper they used roughly 53k hrs of unlabelled audio to create the pre-trained model and as little as 10 minutes of labelled audio for fine-tuning.
my assumption is that 53k hrs of unlabelled data already covers everything needed to learn speech features, and with 10 minutes of labelled data for fine-tuning, it's enough to map those features to labels. if we don't have that much unlabelled data for another language, i think the labelled-data percentage should be raised. if you use english, just use the model already shared by FB.

so, the steps to prepare fine-tuning are:

  1. make sure you have labelled audio data in wav format, 16 kHz, 1 channel.
  2. put the label or transcription file in the same folder as the wav files. the transcription file format is:
file_name1.wav HI HOW ARE YOU
file_name2.wav THIS IS JUST A SAMPLE OF TRANSCRIPTION FORMAT
file_name3.wav THAT YOU SHOULD BUILD

save the transcription with the format folder_name.trans.txt. illustration: i save the wav files in a folder / directory named labelled_wav_file, so the name of the transcription file is labelled_wav_file.trans.txt, and the content of the file is like the example above.

you can use uppercase or lowercase letters, but don't use both of them. just pick one.
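Before going further, it is worth verifying that every wav actually has a transcript line and that no line mixes upper and lower case (check_transcripts is a hypothetical helper matching the .trans.txt format above):

```python
def check_transcripts(wav_names, trans_path):
    """Return (wavs with no transcript line, transcript lines mixing cases)."""
    labelled = {}
    mixed_case = []
    with open(trans_path, "r", encoding="utf8") as f:
        for line in f:
            parts = line.strip().split(None, 1)
            if len(parts) == 2:
                name, text = parts
                labelled[name] = text
                if text != text.upper() and text != text.lower():
                    mixed_case.append(name)
    missing = [w for w in wav_names if w not in labelled]
    return missing, mixed_case
```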

  1. run wav2vec_manifest.py again to produce the train.tsv and valid.tsv files from the labelled audio data:
    python3 examples/wav2vec/wav2vec_manifest.py /path/to/labeled_wav_file --dest /labelled_manifest/path --ext wav

  2. after that, run the file libri_labels.py in the fairseq/examples/wav2vec/ directory twice, with these commands:

python3 libri_labels.py /path/to/file/labelled_wav_file/train.tsv --output-dir /path/to/file/labelled_wav_file/ --output-name train
python3 libri_labels.py /path/to/file/labelled_wav_file/valid.tsv --output-dir /path/to/file/labelled_wav_file/ --output-name valid

both commands produce the files train.ltr, train.wrd, valid.ltr and valid.wrd. there was an error when running libri_labels.py because a key in the dictionary did not match the key used when opening the dictionary, but I already fixed the code a few hours ago, so you had better download the libri_labels.py file again.

if you still get an error when running libri_labels.py, replace its contents with this script:

import argparse
import os


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("tsv")
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--output-name", required=True)
    args = parser.parse_args()

    os.makedirs(args.output_dir, exist_ok=True)

    transcriptions = {}

    with open(args.tsv, "r") as tsv, open(
        os.path.join(args.output_dir, args.output_name + ".ltr"), "w"
    ) as ltr_out, open(
        os.path.join(args.output_dir, args.output_name + ".wrd"), "w"
    ) as wrd_out:
        root = next(tsv).strip()
        print('root',root)
        for line in tsv:
            line = line.strip()

            dir = os.path.dirname(line)

            if dir not in transcriptions:
                parts = dir.split(os.path.sep)

                trans_path = f"{parts[0]}.trans.txt"

                path = os.path.join(root, dir, trans_path)

                assert os.path.exists(path)
                texts = {}
                with open(path, "r") as trans_f:
                    for tline in trans_f:
                        items = tline.strip().split()
                        texts[items[0]] = " ".join(items[1:])

                transcriptions[dir] = texts
            part = os.path.basename(line).split(".")[0]+'.wav'

            assert part in transcriptions[dir]
            print(transcriptions[dir][part], file=wrd_out)
            print(
                " ".join(list(transcriptions[dir][part].replace(" ", "|"))) + " |",
                file=ltr_out,
            )


if __name__ == "__main__":
    main()
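To make the output of libri_labels.py concrete, each transcript line is split into a .wrd line (the raw words) and a .ltr line (every symbol space-separated, spaces turned into '|', plus a trailing ' |'). A sketch of that per-line transform (to_wrd_and_ltr is a hypothetical name, mirroring the print calls in the script above):

```python
def to_wrd_and_ltr(text):
    """Return the .wrd line and the matching .ltr line for one transcript."""
    wrd = text.strip()
    ltr = " ".join(list(wrd.replace(" ", "|"))) + " |"
    return wrd, ltr

# "MAKANYA DONG" -> ("MAKANYA DONG", "M A K A N Y A | D O N G |")
```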
  1. edit the files train.ltr, train.wrd, valid.ltr and valid.wrd to make sure they only contain the characters A to Z (if you use UPPERCASE letters) or a to z (if you use lowercase letters), spaces, and the '|' character.

  2. create a new file named dict.ltr.txt and open it in a text editor. then:

  3. use the find feature of that text editor (I use Sublime Text) and search for the 'A' character, since I use uppercase letters. the text editor will show the count of 'A' characters; record or write it down.
  4. do the same thing until all characters are counted, i.e. A to Z and the '|' character. then write them ordered by count, descending. example:
A 90280
| 78809
N 38268
I 33692
E 30160
K 24305
U 23955
M 20958
T 20551
S 18850
R 18418
D 15972
G 15711
L 13876
B 11768
P 11503
H 10845
Y 9570
O 6758
J 3993
C 2171
W 1118
F 463
V 187
Z 46
X 16
Q 6

then save it. remember, the file must be named dict.ltr.txt.
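Counting characters by hand in a text editor works, but the same dict.ltr.txt content can be produced from the .ltr files directly; a sketch (build_dict_lines is a hypothetical helper, and the counts come out sorted descending as required):

```python
from collections import Counter

def build_dict_lines(ltr_paths):
    """Count every symbol (letters and '|') in the .ltr files and return
    'SYMBOL count' lines ordered by count, descending."""
    counts = Counter()
    for path in ltr_paths:
        with open(path, "r", encoding="utf8") as f:
            for line in f:
                counts.update(line.split())  # symbols are space-separated
    return ["%s %d" % (sym, n) for sym, n in counts.most_common()]
```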

  1. create the lexicon.txt file. I use train.wrd and valid.wrd to make lexicon.txt, with a script I wrote myself:

import codecs

files = ['train.wrd', 'valid.wrd']

# collect every unique word that appears in the transcripts
words = []
seen = set()
for fname in files:
    with codecs.open(fname, 'r', 'utf8') as f:
        for line in f:
            for w in line.strip().split():
                if w not in seen:
                    seen.add(w)
                    words.append(w)

# each lexicon line: WORD <tab> W O R D |
with codecs.open('lexicon.txt', 'w', 'utf8') as f:
    for w in words:
        f.write(w + '\t' + ' '.join(list(w)) + ' |\n')

it will create the file lexicon.txt, which has a format something like this:

EVERY E V E R Y |
WORD W O R D |
THAT T H A T |
EXISTS E X I S T S |
IN I N |
YOUR Y O U R |
LABEL L A B E L |
OR O R |
TRANSCRIPTION T R A N S C R I P T I O N |
FILE F I L E |
WILL W I L L |
WRITE W R I T E |
DOWN D O W N |
LIKE L I K E |
THIS T H I S |

if you use lowercase letters in the transcription, train.wrd and valid.wrd will contain lowercase letters, so it becomes:

every e v e r y |
word w o r d |
that t h a t |
etc ...
  1. I use kenlm, so I need a .bin file; create it by following the instructions here, from 19:50 until 20:49.

  2. now you are ready to fine-tune... the command is based on the fine-tuning command in the README.md:

valid_subset=train
python train.py --distributed-world-size 24 --distributed-port $PORT /path/to/training_data --save-dir /model/path --fp16 \
--wer-args '("/path/to/lm/4-gram.bin","/path/to/lexicon",2,-1)' \
--post-process letter --valid-subset $valid_subset --no-epoch-checkpoints --best-checkpoint-metric wer --num-workers 4 \
--max-update 80000 --sentence-avg --task audio_pretraining --arch wav2vec_ctc --w2v-path /path/to/pretrained/model \
--labels ltr --apply-mask --mask-selection static --mask-other 0 --mask-length 10 --mask-prob 0.5 --layerdrop 0.1 \
--mask-channel-selection static --mask-channel-other 0 --mask-channel-length 64 --mask-channel-prob 0.5 --zero-infinity \
--feature-grad-mult 0.0 --freeze-finetune-updates 10000 --validate-after-updates 10000 --optimizer adam \
--adam-betas '(0.9, 0.98)' --adam-eps 1e-08 --lr 2e-05 --lr-scheduler tri_stage --warmup-steps 8000 --hold-steps 32000 \
--decay-steps 40000 --final-lr-scale 0.05 --final-dropout 0.0 --dropout 0.0 --activation-dropout 0.1 --criterion ctc \
--attention-dropout 0.0 --max-tokens 1280000 --seed 2337 --log-format json --log-interval 500 --ddp-backend no_c10d
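As for the kenlm .bin mentioned in the preparation step above, kenlm's standard lmplz and build_binary tools produce it from plain transcript text; a sketch, where corpus.txt is a hypothetical file holding your transcript sentences, one per line:

```shell
# train a 4-gram ARPA language model from the transcript text
lmplz -o 4 < corpus.txt > 4gram.arpa

# convert the ARPA model to kenlm's binary format for faster loading
build_binary 4gram.arpa lm.bin
```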

now, I want to ask something here, because running the fine-tuning command results in an error. the error is:

File "/home/bram/Documents/coding/Speech/fairseq/train.py", line 14, in <module>
    cli_main()
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq_cli/train.py", line 345, in cli_main
    distributed_utils.call_main(args, main)
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq/distributed_utils.py", line 268, in call_main
    main(args, **kwargs)
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq_cli/train.py", line 61, in main
    model = task.build_model(args)
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq/tasks/fairseq_task.py", line 546, in build_model
    model = models.build_model(args, self)
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq/models/__init__.py", line 57, in build_model
    return ARCH_MODEL_REGISTRY[model_cfg.arch].build_model(model_cfg, task)
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq/models/wav2vec/wav2vec2_asr.py", line 168, in build_model
    w2v_encoder = Wav2VecEncoder(args, task.target_dictionary)
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq/models/wav2vec/wav2vec2_asr.py", line 331, in __init__
    args.w2v_path, arg_overrides
  File "/home/bram/Documents/coding/Speech/fairseq/fairseq/checkpoint_utils.py", line 211, in load_checkpoint_to_cpu
    setattr(args, arg_name, arg_val)
AttributeError: 'NoneType' object has no attribute 'dropout'

the command I use is:

python3 train.py '/home/bram/Documents/coding/speech/traindata/text_label' \
--save-dir '/home/bram/Documents/coding/speech/traindata/model_finetuning_wav2vec' --fp16 \
--wer-args '("/home/bram/Documents/coding/speech/traindata/text_label/lm.bin","/home/bram/Documents/coding/speech/traindata/text_label/lexicon.txt",2,-1)' \
--post-process letter --valid-subset valid --no-epoch-checkpoints --best-checkpoint-metric wer --num-workers 128 \
--max-update 400000 --sentence-avg --task audio_pretraining --arch wav2vec_ctc \
--w2v-path '/home/bram/Documents/coding/speech/traindata/w2v2_pre_traned_model/checkpoint_best.pt' \
--labels ltr --apply-mask --mask-selection static --mask-other 0 --mask-length 10 --mask-prob 0.5 --layerdrop 0.1 \
--mask-channel-selection static --mask-channel-other 0 --mask-channel-length 64 --mask-channel-prob 0.5 --zero-infinity \
--feature-grad-mult 0.0 --freeze-finetune-updates 10000 --validate-after-updates 10000 --optimizer adam \
--adam-betas '(0.9, 0.98)' --adam-eps 1e-08 --lr 2e-05 --lr-scheduler tri_stage --warmup-steps 8000 --hold-steps 32000 \
--decay-steps 40000 --final-lr-scale 0.05 --final-dropout 0.0 --dropout 0.0 --activation-dropout 0.1 --criterion ctc \
--attention-dropout 0.0 --max-tokens 1280000 --seed 2337 --log-format json --log-interval 500 --ddp-backend no_c10d

the folder /home/bram/Documents/coding/speech/traindata/text_label contains:

1. dict.ltr.txt
2. lexicon.txt
3. lm.bin
4. train.tsv
5. train.wrd
6. train.ltr
7. valid.tsv
8. valid.wrd
9. valid.ltr
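A quick sanity check that the data directory really contains everything the fine-tuning command expects (missing_files is a hypothetical helper; the list mirrors the folder contents above):

```python
import os

REQUIRED = [
    "dict.ltr.txt", "lexicon.txt", "lm.bin",
    "train.tsv", "train.wrd", "train.ltr",
    "valid.tsv", "valid.wrd", "valid.ltr",
]

def missing_files(data_dir):
    """Return the required files that are absent from data_dir."""
    return [f for f in REQUIRED if not os.path.exists(os.path.join(data_dir, f))]
```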

the folder /home/bram/Documents/coding/speech/traindata/model_finetuning_wav2vec is empty; it is meant to store the fine-tuned model produced by the fine-tuning process

the folder /home/bram/Documents/coding/speech/traindata/w2v2_pre_traned_model/ contains:

1. checkpoint_best.pt
2. checkpoint_last.pt

these files result from the pre-training process..

I tried to debug by following the process step by step, and found where the error happens:

fairseq/train.py -> calls the file fairseq/fairseq_cli/train.py

EDIT:
I found the error and why, but cannot solve the problem.
the error happens in the file fairseq/fairseq/checkpoint_utils.py, line 211, in the load_checkpoint_to_cpu function.

I tried to reproduce the steps. here is the report:

line 201: def load_checkpoint_to_cpu(path, arg_overrides=None):
this function is called from /fairseq/fairseq/models/wav2vec/wav2vec2_asr.py line 330. before that, there is code that builds the arg_overrides variable,
which at this point is:

arg_overrides = {'dropout': 0.0,
                 'activation_dropout': 0.1,
                 'dropout_input': 0,
                 'attention_dropout': 0.0,
                 'mask_length': 10,
                 'mask_prob': 0.5,
                 'mask_selection': 'static',
                 'mask_other': 0.0,
                 'no_mask_overlap': False,
                 'mask_channel_length': 64,
                 'mask_channel_prob': 0.5,
                 'mask_channel_selection': 'static',
                 'mask_channel_other': 0.0,
                 'no_mask_channel_overlap': False,
                 'encoder_layerdrop': 0.1,
                 'feature_grad_mult': 0.0}

and the path is args.w2v_path, i.e. '/home/bram/Documents/coding/speech/traindata/w2v2_pre_traned_model/checkpoint_best.pt', which comes in because we set the --w2v-path option.

so the function def load_checkpoint_to_cpu(path, arg_overrides=None): receives path = '/home/bram/Documents/coding/speech/traindata/w2v2_pre_traned_model/checkpoint_best.pt'

ok, then...
with open(PathManager.get_local_path(path), "rb") as f: on line 203 of the file fairseq/fairseq/checkpoint_utils.py opens the file checkpoint_best.pt and binds it to the variable f.

the error happens when executing:
state = torch.load(f, map_location=lambda s, l: default_restore_location(s, "cpu")) on line 204 of the file fairseq/fairseq/checkpoint_utils.py.

the error report says 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte, meaning that torch cannot load checkpoint_best.pt.

any suggestions? can somebody help? if I can fix this I will continue the tutorial.

EDIT:
the problem is that the model's .pt file doesn't have ['args']; that is why the error happens.

the root of the error is that the script in the wav2vec README.md does not save 'args' into checkpoint_model.pt when you re-create the pre-training process with a custom dataset.

yes, the script we use to build a custom pre-trained model from a custom dataset does not save the 'args' we used during pre-training, so we must add it ourselves.

to make the pre-trained model have 'args', so it can be read during fine-tuning, run this script:

import torch, argparse, logging, math, os, random, sys, numpy as np
from fairseq import options

# the argument list is based on the command you used to create the pre-trained model. this is just an example.
# the last command I used to run train.py was:
'''
 python3 '/content/repo/fairseq/train.py' --distributed-world-size 1 --distributed-port 0 '/content/drive/My Drive/wav_manifest' \
--save-dir '/content/drive/My Drive/wav2vec_v2_pre_train_model' --fp16 --num-workers 128 --task audio_pretraining --criterion wav2vec --arch wav2vec2 \
--log-keys '["prob_perplexity","code_perplexity","temp"]' --quantize-targets --extractor-mode default \
--conv-feature-layers '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2' --final-dim 768 --latent-vars 320 \
--latent-groups 2 --latent-temp '(2,0.25,0.999995)' --infonce --optimizer adam \
--adam-betas '(0.9,0.98)' --adam-eps 1e-06 --lr-scheduler polynomial_decay --max-update 600000 \
--lr 0.0004 --warmup-updates 32000 --mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 \
--encoder-layerdrop 0.03 --dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.05 \
--loss-weights '[0.1, 10]' --conv-pos 128 --conv-pos-groups 16 --num-negatives 100 --cross-sample-negatives 0 \
--max-sample-size 1500000 --min-sample-size 5000 --dropout 0.05 --attention-dropout 0.1 --weight-decay 0.01 \
--max-tokens 1400000 --max-update 600000 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d --encoder-ffn-embed-dim 4096 --encoder-attention-heads 16 --no-epoch-checkpoints
'''
# copy the argument above and convert it to the list below:

argument = [
'/content/repo/fairseq/train.py', '--distributed-world-size', '1', '--distributed-port', '0', '/content/drive/My Drive/wav_manifest',
'--save-dir', '/content/drive/My Drive/wav2vec_v2_pre_train_model', 
'--fp16', '--no-epoch-checkpoints', '--skip-invalid-size-inputs-valid-test', '--infonce','--quantize-targets',
'--num-workers', '128', '--task', 'audio_pretraining', '--criterion', 'wav2vec', '--arch', 'wav2vec2',
'--log-keys', '["prob_perplexity","code_perplexity","temp"]',  '--extractor-mode', 'default',
'--conv-feature-layers', '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2',
'--final-dim', '768', '--latent-vars', '320', '--latent-groups', '2', '--latent-temp', '(2,0.25,0.999995)',  '--optimizer', 'adam',
'--adam-betas', '(0.9,0.98)', '--adam-eps', '1e-06', '--lr-scheduler', 'polynomial_decay', '--max-update', '600000',
'--lr', '0.0004', '--warmup-updates', '32000', '--mask-length', '10', '--mask-prob', '0.65', '--mask-selection', 'static', '--mask-other', '0',
'--encoder-layerdrop', '0.03', '--dropout-input', '0.1', '--dropout-features', '0.1', '--feature-grad-mult', '0.05',
'--loss-weights', '[0.1, 10]', '--conv-pos', '128', '--conv-pos-groups', '16', '--num-negatives', '100', '--cross-sample-negatives', '0',
'--max-sample-size', '1500000', '--min-sample-size', '5000', '--dropout', '0.05', '--attention-dropout', '0.1', '--weight-decay', '0.01',
'--max-tokens', '1400000', '--max-update', '600000',  '--ddp-backend', 'no_c10d', '--encoder-ffn-embed-dim', '4096', '--encoder-attention-heads', '16'
] 

logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=os.environ.get("LOGLEVEL", "INFO").upper(),
    stream=sys.stdout,
)
logger = logging.getLogger("fairseq_cli.train")

parser = options.get_training_parser()

sys.argv = argument

args = options.parse_args_and_arch(parser, modify_parser=None)

model_file = '/content/drive/MyDrive/wav2vec_v2_pre_train_model/checkpoint_best.pt'

file = torch.load(model_file,map_location=None)

file['args'] = args

torch.save(file,'/content/drive/MyDrive/model_pretrain/fixed_model.pt')

after this, my error is in the file w2l_decoder.py line 156, in an iteration over the command:
self.trie.insert(spelling_idxs, word_idx, score)
so I try to print the values being iterated before inserting into the trie, and print the word 'success' after each successful insert.
this is what I added to the script:

for i, (word, spellings) in enumerate(self.lexicon.items()):
    word_idx = self.word_dict.get_index(word)
    _, score = self.lm.score(start_state, word_idx)
    for spelling in spellings:
        spelling_idxs = [tgt_dict.index(token) for token in spelling]
        assert (
                    tgt_dict.unk() not in spelling_idxs
                ), f"{spelling} {spelling_idxs}"
        print(word,spelling_idxs, word_idx, score)
        self.trie.insert(spelling_idxs, word_idx, score)
        print('success')

the result is:

TERBAWA [12, 8, 14, 18, 4, 25, 4, 5] 2727 -5.560528755187988
success
SUATUPUN [13, 10, 4, 12, 10, 19, 10, 6, 5] 2782 -5.560528755187988
success
NYAMBAR [6, 21, 4, 11, 18, 4, 14, 5] 2626 -6.238666534423828
success
DIBERIKANNYA [15, 7, 18, 8, 14, 7, 9, 4, 6, 6, 21, 4, 5] 2765 -5.809808731079102
success
CENTANG [24, 8, 6, 12, 4, 6, 16, 5] 2667 -6.238666534423828
success
IRI [7, 14, 7, 5] 2676 -5.1968841552734375
success
PERMUKAAN [19, 8, 14, 11, 10, 9, 4, 4, 6, 5] 1895 -3.8156046867370605
success
TEMENAN [12, 8, 11, 8, 6, 4, 6, 5] 2843 -6.238666534423828
success
PENGEN [19, 8, 6, 16, 8, 6, 5] 2713 -6.09476375579834
success
CHANNEL [24, 20, 4, 6, 6, 8, 17, 5] 2609 -5.809808731079102
success
DIK [15, 7, 9, 5] 2785 -6.238666534423828
success
APAA [4, 19, 4, 4, 5] 3068 -6.238666534423828
success
HAAH [20, 4, 4, 20, 5] 2350 -6.238666534423828
success
BAGAI [18, 4, 16, 4, 7, 5] 2317 -6.09476375579834
success
RIWAYAT [14, 7, 25, 4, 21, 4, 12, 5] 2743 -5.809808731079102
success
KONTEKSTUAL [9, 22, 6, 12, 8, 9, 13, 12, 10, 4, 17, 5] 2689 -6.09476375579834
success
NGELUARIN [6, 16, 8, 17, 10, 4, 14, 7, 6, 5] 2724 -6.238666534423828
success
MENEMPATKANNYA [11, 8, 6, 8, 11, 19, 4, 12, 9, 4, 6, 6, 21, 4, 5] 2662 -5.6675238609313965
success
SETRUM [13, 8, 12, 14, 10, 11, 5] 1859 -6.238666534423828
success
TINGGALIN [12, 7, 6, 16, 16, 4, 17, 7, 6, 5] 3127 -6.238666534423828
success
WUK [25, 10, 9, 5] 2694 -6.238666534423828
success
GUYYSS [16, 10, 21, 21, 13, 13, 5] 1866 -6.238666534423828
success
RABA [14, 4, 18, 4, 5] 2601 -5.928036689758301
success
KESELNYA [9, 8, 13, 8, 17, 6, 21, 4, 5] 2253 -6.238666534423828
success
TAROH [12, 4, 14, 22, 20, 5] 2432 -6.238666534423828
success
LAGIPULA [17, 4, 16, 7, 19, 10, 17, 4, 5] 2356 -4.919636249542236
success
SENENG [13, 8, 6, 8, 6, 16, 5] 2767 -6.238666534423828
success
SITULAH [13, 7, 12, 10, 17, 4, 20, 5] 2706 -6.09476375579834
success
PACAR [19, 4, 24, 4, 14, 5] 2752 -6.09476375579834
success
PREMIER [19, 14, 8, 11, 7, 8, 14, 5] 2580 -5.1968841552734375
success
NEH [6, 8, 20, 5] 2659 -6.238666534423828
success
TERNAK [12, 8, 14, 6, 4, 9, 5] 2653 -5.287886619567871
success
MENAMAINYA [11, 8, 6, 4, 11, 4, 7, 6, 21, 4, 5] 2286 -6.238666534423828
success
KUMAKAN [9, 10, 11, 4, 9, 4, 6, 5] 1958 -6.238666534423828
success
MEMPERDAYAKAN [11, 8, 11, 19, 8, 14, 15, 4, 21, 4, 9, 4, 6, 5] 2760 -6.09476375579834
success
RISET [14, 7, 13, 8, 12, 5] 2905 -3.847456932067871
success
NGERASA [6, 16, 8, 14, 4, 13, 4, 5] 3242 -6.238666534423828
success
DARAT [15, 4, 14, 4, 12, 5] 3044 -4.73591947555542
success
CERDIK [24, 8, 14, 15, 7, 9, 5] 1926 -5.474752426147461
success
ADAPUN [4, 15, 4, 19, 10, 6, 5] 2085 -3.4240756034851074
success
TERUTAMA [12, 8, 14, 10, 12, 4, 11, 4, 5] 2612 -3.8312268257141113
success
NGEGOJEK [6, 16, 8, 16, 22, 23, 8, 9, 5] 2566 -6.238666534423828
success
JIJIK [23, 7, 23, 7, 9, 5] 2012 -6.09476375579834
success
IH [7, 20, 5] 2719 -6.238666534423828
success
BEDOLAH [18, 8, 15, 22, 17, 4, 20, 5] 3039 -6.238666534423828
success
DAMAR [15, 4, 11, 4, 14, 5] 2700 -6.238666534423828
success
KONTEN [9, 22, 6, 12, 8, 6, 5] 2645 -5.6675238609313965
success
HAWILA [20, 4, 25, 7, 17, 4, 5] 2701 -6.238666534423828
success
BENTAR [18, 8, 6, 12, 4, 14, 5] 2177 -6.238666534423828
success
GUYS [16, 10, 21, 13, 5] 1861 -6.238666534423828
success
HEY [20, 8, 21, 5] 2419 -6.238666534423828
success
NYARI [6, 21, 4, 14, 7, 5] 2844 -6.238666534423828
success
UPLOAD [10, 19, 17, 22, 4, 15, 5] 2572 -6.09476375579834
success
KAWINANNYA [9, 4, 25, 7, 6, 4, 6, 6, 21, 4, 5] 2282 -6.238666534423828
success
TULANGKU [12, 10, 17, 4, 6, 16, 9, 10, 5] 2583 -6.238666534423828
success
TULANG [12, 10, 17, 4, 6, 16, 5] 2433 -3.659620523452759
success
JOGET [23, 22, 16, 8, 12, 5] 2056 -6.238666534423828
success
CEWEK [24, 8, 25, 8, 9, 5] 2753 -6.238666534423828
success
BERPENDIDIKAN [18, 8, 14, 19, 8, 6, 15, 7, 15, 7, 9, 4, 6, 5] 2542 -5.928036689758301
success
HEINEKEN [20, 8, 7, 6, 8, 9, 8, 6, 5] 3007 -6.238666534423828
success
SOLUSINYA [13, 22, 17, 10, 13, 7, 6, 21, 4, 5] 2650 -4.248564720153809
success
TUMITNYA [12, 10, 11, 7, 12, 6, 21, 4, 5] 2777 -6.09476375579834
success
TENTULAH [12, 8, 6, 12, 10, 17, 4, 20, 5] 2343 -5.809808731079102
corrupted double-linked list

with this clue, the error is caused by my lexicon.txt, but... what makes it fail?
I already checked whether there is any 'unrecognized' character or something else; I did not find any.
I checked whether any word is duplicated; I did not find that either...
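Besides eyeballing, the lexicon can be linted mechanically for duplicated words or spellings that use a symbol missing from dict.ltr.txt, since either would be fed straight into the trie (lint_lexicon is a hypothetical helper matching the files built earlier):

```python
def lint_lexicon(lexicon_lines, dict_symbols):
    """Return (duplicated words, words whose spelling uses unknown symbols)."""
    seen = set()
    dupes = []
    bad = []
    for line in lexicon_lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue
        word, spelling = parts[0], parts[1].split()
        if word in seen:
            dupes.append(word)
        seen.add(word)
        if any(sym not in dict_symbols for sym in spelling):
            bad.append(word)
    return dupes, bad
```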

but the funny thing is...
if I run it again...
the result that I get is:

TERBAWA [12, 8, 14, 18, 4, 25, 4, 5] 2727 -5.560528755187988
success
SUATUPUN [13, 10, 4, 12, 10, 19, 10, 6, 5] 2782 -5.560528755187988
success
NYAMBAR [6, 21, 4, 11, 18, 4, 14, 5] 2626 -6.238666534423828
success
DIBERIKANNYA [15, 7, 18, 8, 14, 7, 9, 4, 6, 6, 21, 4, 5] 2765 -5.809808731079102
success
CENTANG [24, 8, 6, 12, 4, 6, 16, 5] 2667 -6.238666534423828
success
IRI [7, 14, 7, 5] 2676 -5.1968841552734375
success
PERMUKAAN [19, 8, 14, 11, 10, 9, 4, 4, 6, 5] 1895 -3.8156046867370605
success
TEMENAN [12, 8, 11, 8, 6, 4, 6, 5] 2843 -6.238666534423828
success
PENGEN [19, 8, 6, 16, 8, 6, 5] 2713 -6.09476375579834
success
CHANNEL [24, 20, 4, 6, 6, 8, 17, 5] 2609 -5.809808731079102
success
DIK [15, 7, 9, 5] 2785 -6.238666534423828
success
APAA [4, 19, 4, 4, 5] 3068 -6.238666534423828
success
HAAH [20, 4, 4, 20, 5] 2350 -6.238666534423828
success
BAGAI [18, 4, 16, 4, 7, 5] 2317 -6.09476375579834
success
RIWAYAT [14, 7, 25, 4, 21, 4, 12, 5] 2743 -5.809808731079102
success
KONTEKSTUAL [9, 22, 6, 12, 8, 9, 13, 12, 10, 4, 17, 5] 2689 -6.09476375579834
success
NGELUARIN [6, 16, 8, 17, 10, 4, 14, 7, 6, 5] 2724 -6.238666534423828
success
MENEMPATKANNYA [11, 8, 6, 8, 11, 19, 4, 12, 9, 4, 6, 6, 21, 4, 5] 2662 -5.6675238609313965
success
SETRUM [13, 8, 12, 14, 10, 11, 5] 1859 -6.238666534423828
success
TINGGALIN [12, 7, 6, 16, 16, 4, 17, 7, 6, 5] 3127 -6.238666534423828
success
WUK [25, 10, 9, 5] 2694 -6.238666534423828
success
GUYYSS [16, 10, 21, 21, 13, 13, 5] 1866 -6.238666534423828
success
RABA [14, 4, 18, 4, 5] 2601 -5.928036689758301
success
KESELNYA [9, 8, 13, 8, 17, 6, 21, 4, 5] 2253 -6.238666534423828
success
TAROH [12, 4, 14, 22, 20, 5] 2432 -6.238666534423828
success
LAGIPULA [17, 4, 16, 7, 19, 10, 17, 4, 5] 2356 -4.919636249542236
success
SENENG [13, 8, 6, 8, 6, 16, 5] 2767 -6.238666534423828
success
SITULAH [13, 7, 12, 10, 17, 4, 20, 5] 2706 -6.09476375579834
success
PACAR [19, 4, 24, 4, 14, 5] 2752 -6.09476375579834
success
PREMIER [19, 14, 8, 11, 7, 8, 14, 5] 2580 -5.1968841552734375
success
NEH [6, 8, 20, 5] 2659 -6.238666534423828
success
TERNAK [12, 8, 14, 6, 4, 9, 5] 2653 -5.287886619567871
success
MENAMAINYA [11, 8, 6, 4, 11, 4, 7, 6, 21, 4, 5] 2286 -6.238666534423828
success
KUMAKAN [9, 10, 11, 4, 9, 4, 6, 5] 1958 -6.238666534423828
success
MEMPERDAYAKAN [11, 8, 11, 19, 8, 14, 15, 4, 21, 4, 9, 4, 6, 5] 2760 -6.09476375579834
success
RISET [14, 7, 13, 8, 12, 5] 2905 -3.847456932067871
success
NGERASA [6, 16, 8, 14, 4, 13, 4, 5] 3242 -6.238666534423828
success
DARAT [15, 4, 14, 4, 12, 5] 3044 -4.73591947555542
success
CERDIK [24, 8, 14, 15, 7, 9, 5] 1926 -5.474752426147461
success
ADAPUN [4, 15, 4, 19, 10, 6, 5] 2085 -3.4240756034851074
success
TERUTAMA [12, 8, 14, 10, 12, 4, 11, 4, 5] 2612 -3.8312268257141113
success
NGEGOJEK [6, 16, 8, 16, 22, 23, 8, 9, 5] 2566 -6.238666534423828
success
JIJIK [23, 7, 23, 7, 9, 5] 2012 -6.09476375579834
success
IH [7, 20, 5] 2719 -6.238666534423828
success
BEDOLAH [18, 8, 15, 22, 17, 4, 20, 5] 3039 -6.238666534423828
success
DAMAR [15, 4, 11, 4, 14, 5] 2700 -6.238666534423828
success
KONTEN [9, 22, 6, 12, 8, 6, 5] 2645 -5.6675238609313965
success
HAWILA [20, 4, 25, 7, 17, 4, 5] 2701 -6.238666534423828
success
BENTAR [18, 8, 6, 12, 4, 14, 5] 2177 -6.238666534423828
success
GUYS [16, 10, 21, 13, 5] 1861 -6.238666534423828
success
HEY [20, 8, 21, 5] 2419 -6.238666534423828
success
NYARI [6, 21, 4, 14, 7, 5] 2844 -6.238666534423828
success
UPLOAD [10, 19, 17, 22, 4, 15, 5] 2572 -6.09476375579834
success
KAWINANNYA [9, 4, 25, 7, 6, 4, 6, 6, 21, 4, 5] 2282 -6.238666534423828
success
TULANGKU [12, 10, 17, 4, 6, 16, 9, 10, 5] 2583 -6.238666534423828
success
TULANG [12, 10, 17, 4, 6, 16, 5] 2433 -3.659620523452759
success
JOGET [23, 22, 16, 8, 12, 5] 2056 -6.238666534423828
success
CEWEK [24, 8, 25, 8, 9, 5] 2753 -6.238666534423828
success
BERPENDIDIKAN [18, 8, 14, 19, 8, 6, 15, 7, 15, 7, 9, 4, 6, 5] 2542 -5.928036689758301
success
HEINEKEN [20, 8, 7, 6, 8, 9, 8, 6, 5] 3007 -6.238666534423828
success
SOLUSINYA [13, 22, 17, 10, 13, 7, 6, 21, 4, 5] 2650 -4.248564720153809
success
TUMITNYA [12, 10, 11, 7, 12, 6, 21, 4, 5] 2777 -6.09476375579834
success
TENTULAH [12, 8, 6, 12, 10, 17, 4, 20, 5] 2343 -5.809808731079102
success
KEEMPAT [9, 8, 8, 11, 19, 4, 12, 5] 2738 -3.6319289207458496
success
KULARANG [9, 10, 17, 4, 14, 4, 6, 16, 5] 2587 -6.238666534423828
success
TIMBUL [12, 7, 11, 18, 10, 17, 5] 2555 -4.70242166519165
success
APAAN [4, 19, 4, 4, 6, 5] 2124 -6.238666534423828
success
NYALA [6, 21, 4, 17, 4, 5] 2435 -5.403155326843262
success
ISTERIMU [7, 13, 12, 8, 14, 7, 11, 10, 5] 2911 -5.474752426147461
success
MAKANYA [11, 4, 9, 4, 6, 21, 4, 5] 2801 -4.241161823272705
Segmentation fault (core dumped)

Why is this so strange? It's giving me a headache.

Look at the word TENTULAH.
On the first run, the report shows it was successfully inserted into the trie:

TENTULAH [12, 8, 6, 12, 10, 17, 4, 20, 5] 2343 -5.809808731079102
success

But on the second run of the exact same command,
that word triggers the crash before `print('success')` is ever reached.

The report prints:

TENTULAH [12, 8, 6, 12, 10, 17, 4, 20, 5] 2343 -5.809808731079102
Segmentation fault (core dumped)

wow...

At this point I can't debug any further, because KenLM is written in C++ with Cython bindings.
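One thing that might still help (a hedged suggestion, not a full fix): Python's built-in `faulthandler` module can dump the interpreter's stack when the process segfaults. It won't show the C++ frames inside KenLM, but it does pinpoint which Python call triggered the crash:

```python
import faulthandler

# Dump a Python traceback to stderr when the process receives a
# fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS) -- e.g. a crash
# inside a C++ extension such as KenLM's trie insertion.
faulthandler.enable()

# ...then run the vocabulary/trie-building loop as before; if it
# segfaults again, stderr will show the last Python frame reached.
```

The same effect can be had without touching the script by running it as `python -X faulthandler script.py`.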

Any clue, guys?

Maybe Mr @alexeib can help me? I already asked in the KenLM issues, but there is no answer yet.

Hey @wahyubram82
I followed your tutorial, it is very useful. Thank you!
I managed to start fine-tuning from the pretrained checkpoint. Actually, I didn't encounter any problems loading the checkpoint. Maybe you can try, per the PyTorch documentation, just `state = torch.load("/path/to/checkpoint.pt", map_location=lambda s, l: default_restore_location(s, "cpu"))`?
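For reference, here is a minimal, self-contained sketch of that loading pattern. It saves a dummy checkpoint to a temporary file first so the snippet runs on its own; `default_restore_location` comes from `torch.serialization`:

```python
import os
import tempfile

import torch
from torch.serialization import default_restore_location

# Save a dummy checkpoint just to make the example runnable.
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save({"model": {"w": torch.zeros(2)}}, ckpt_path)

# Map every storage to CPU, so a checkpoint saved on GPU loads
# fine on a machine without CUDA.
state = torch.load(
    ckpt_path,
    map_location=lambda s, l: default_restore_location(s, "cpu"),
)
print(state["model"]["w"].device)  # cpu
```

The lambda receives each storage and its original location tag; `default_restore_location(s, "cpu")` forces every storage onto CPU regardless of where it was saved.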


The error happens because fairseq's train.py does not write out the args we used when building a custom dataset.

So we have to add them ourselves.
I already wrote a script that does this in my answer above, to help other people who hit the same problem.

Now I'm stuck in the decoding process with KenLM, at inserting words into the trie, as explained above. I already opened an issue in the KenLM repo too, and am waiting for the KenLM devs to give a clue, or maybe for Mr @alexeib to respond to my CC.. :)
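A minimal sketch of the args-patching idea, purely illustrative: here the checkpoint's `args` entry is a plain dict, whereas real fairseq checkpoints store an `argparse.Namespace`, and the field names below are examples, not fairseq's exact schema:

```python
import os
import tempfile

import torch

# Illustrative only: create a dummy checkpoint whose "args" entry
# was never written, then patch in the fields the training run
# never recorded and save the checkpoint back.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save({"model": {}, "args": None}, path)

state = torch.load(path, map_location="cpu")
args = state["args"] or {}
# Example fields only -- a real fairseq checkpoint uses a Namespace
# with the full set of training arguments.
args.setdefault("task", "audio_pretraining")
args.setdefault("arch", "wav2vec2")
state["args"] = args
torch.save(state, path)
```

After patching, the checkpoint can be reloaded with its `args` present, which is what fairseq's loading code expects to find.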
