Fairseq: [QUESTION] vq-wav2vec speech recognition example

Created on 24 Apr 2020 · 7 comments · Source: pytorch/fairseq

The example for the new vq-wav2vec model shows how to extract features and generate the quantized indices for a random speech input:

import torch
from fairseq.models.wav2vec import Wav2VecModel

# load the pretrained vq-wav2vec checkpoint and rebuild the model
cp = torch.load('/path/to/vq-wav2vec.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()

# extract features from raw 16 kHz audio and quantize them into discrete indices
wav_input_16khz = torch.randn(1, 10000)
z = model.feature_extractor(wav_input_16khz)
_, idxs = model.vector_quantizer.forward_idx(z)
print(idxs.shape)  # output: torch.Size([1, 60, 2]), 60 timesteps with 2 indices corresponding to the 2 groups in the model

It would be worthwhile to provide an example of how to use it on a downstream speech recognition task.

question

All 7 comments

I also want to know how to do that thing.

CC @alexeib

For the results in the vq-wav2vec paper, once you have tokenized your target data you can just follow the RoBERTa example, e.g. preprocess the data:

python preprocess.py --dataset-impl mmap --trainpref train.txt --destdir . --workers 60 --only-source --validpref valid.txt --srcdict dict.txt
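
For the tokenization step itself, here is a minimal sketch, assuming wav_paths.txt lists one mono 16 kHz wav file per line and that joining the two group indices per timestep with a dash is an acceptable token format (both the file layout and the token format are assumptions, not necessarily the exact recipe used for the paper):

import torch
import soundfile as sf
from fairseq.models.wav2vec import Wav2VecModel

# load the pretrained vq-wav2vec model as in the example above
cp = torch.load('/path/to/vq-wav2vec.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()

with open('wav_paths.txt') as f, open('train.txt', 'w') as out:
    for line in f:
        wav, sr = sf.read(line.strip())
        assert sr == 16000, 'vq-wav2vec expects 16 kHz audio'
        x = torch.from_numpy(wav).float().unsqueeze(0)
        with torch.no_grad():
            z = model.feature_extractor(x)
            _, idxs = model.vector_quantizer.forward_idx(z)
        # idxs has shape [1, T, 2]; write one "g1-g2" token per timestep
        tokens = ['-'.join(map(str, t.tolist())) for t in idxs[0]]
        out.write(' '.join(tokens) + '\n')

The same loop (with a different output file) would produce valid.txt, and dict.txt is just the vocabulary over these tokens.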

then train BERT

python train.py --distributed-world-size 128 --distributed-port 55498 /path/to/data --save-dir /checkpoint/dir --train-subset train --fp16 --num-workers 4 --save-interval-updates 25000 --keep-interval-updates 1 --no-epoch-checkpoints --task masked_lm --criterion masked_lm --sample-break-mode eos --tokens-per-sample 3072 --max-positions 6144 --arch roberta_base --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 0.0005 --total-num-update 250000 --warmup-updates 10000 --mask-multiple-length 10 --mask-prob 0.5 --mask-stdev 10 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 --max-tokens 4096 --update-freq 1 --max-update 250000 --seed 5 --log-format json --log-interval 500 --skip-invalid-size-inputs-valid-test

You can then extract RoBERTa representations and feed them as input into your favorite ASR model.
You can also fine-tune this model directly with a CTC loss, as we did in the Effectiveness paper (https://arxiv.org/abs/1911.03912), but some of the code for this is currently not merged. It should be relatively straightforward to do if you want to try it yourself. I'll add the code in the coming weeks.
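
For the "extract RoBERTa representations" step, a minimal sketch using fairseq's RobertaModel hub interface, assuming the BERT checkpoint was saved to /checkpoint/dir and the binarized data (with dict.txt) lives in /path/to/data; the example token line is hypothetical and should match whatever format you used when tokenizing the audio:

import torch
from fairseq.models.roberta import RobertaModel

# load the BERT model trained on the vq-wav2vec tokens
roberta = RobertaModel.from_pretrained(
    '/checkpoint/dir',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='/path/to/data',
)
roberta.eval()

# one utterance as a line of discrete tokens (same format as train.txt)
line = '3-12 7-1 7-1 9-4'  # hypothetical example
tokens = roberta.task.source_dictionary.encode_line(
    line, add_if_not_exist=False, append_eos=True
).long().unsqueeze(0)

with torch.no_grad():
    features = roberta.extract_features(tokens)  # shape: [1, seq_len, hidden_dim]
print(features.shape)

These per-timestep features can then be fed into the acoustic model of your choice.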

@alexeib Thanks. We have tried to adapt our pipeline codebase, which currently uses DeepSpeech2 / KML, but we were not able to get any further. It would be helpful to have a step-by-step guide.
Thank you so much.

@alexeib thank you for the example. When I run the BERT training command it throws errors saying the flags --mask-stdev and --mask-multiple-length don't exist, and I cannot find a reference to them anywhere in the repo. Could they have been deprecated?

ah you are right, I probably forgot to add these options to roberta. I'll look into it soon; meanwhile you can try to add it yourself. Forget mask-stdev, and for multiple lengths it looks something like this in mask_tokens_dataset:

            # decide which elements to mask
            mask = np.full(sz, False)
            num_mask = int(
                # add a random number for probabilistic rounding
                self.mask_prob * sz / float(self.mask_multiple_length) + np.random.rand()
            )

            # pick the starting index of each mask span
            mask_idc = np.random.choice(sz, num_mask, replace=False)

            if self.mask_stdev > 0.:
                # sample a span length per start index from N(mask_multiple_length, mask_stdev)
                lengths = np.random.normal(self.mask_multiple_length, self.mask_stdev, size=num_mask)
                lengths = [max(0, int(round(x))) for x in lengths]
                mask_idc = np.asarray(
                    [
                        mask_idc[j] + offset
                        for j in range(len(mask_idc))
                        for offset in range(lengths[j])
                    ],
                    dtype=np.int64,
                )
            else:
                # fixed span length: expand every start into mask_multiple_length consecutive positions
                mask_idc = np.concatenate([mask_idc + i for i in range(self.mask_multiple_length)])

            # drop positions that run past the end of the sequence and apply the mask
            mask_idc = np.unique(mask_idc[mask_idc < sz])
            mask[mask_idc] = True
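
If you want to wire the missing flags in yourself, a rough sketch of where they could be registered, modelled on how the existing masking options are handled in the masked_lm task (the exact names and wiring here are my assumptions, not merged fairseq code):

# in fairseq/tasks/masked_lm.py, inside MaskedLMTask.add_args(parser)
parser.add_argument('--mask-multiple-length', default=1, type=int,
                    help='expand every masked position into this many consecutive positions')
parser.add_argument('--mask-stdev', default=0.0, type=float,
                    help='stdev of the mask span length; 0 keeps fixed-length spans')

# ... then forward them to the masked dataset alongside the existing options, e.g.
src_dataset, tgt_dataset = MaskTokensDataset.apply_mask(
    dataset,
    self.source_dictionary,
    pad_idx=self.source_dictionary.pad(),
    mask_idx=self.mask_idx,
    seed=self.args.seed,
    mask_prob=self.args.mask_prob,
    mask_multiple_length=self.args.mask_multiple_length,
    mask_stdev=self.args.mask_stdev,
)

MaskTokensDataset.__init__ would also need to accept mask_multiple_length and mask_stdev and store them as self.mask_multiple_length / self.mask_stdev so the masking snippet above can use them.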

Thanks, I will give that a go!
