Here is an example of how to extract features and generate embeddings for random speech data with the new vq-wav2vec model:
import torch
from fairseq.models.wav2vec import Wav2VecModel
cp = torch.load('/path/to/vq-wav2vec.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()
wav_input_16khz = torch.randn(1,10000)
z = model.feature_extractor(wav_input_16khz)
_, idxs = model.vector_quantizer.forward_idx(z)
print(idxs.shape) # output: torch.Size([1, 60, 2]), 60 timesteps with 2 indexes corresponding to 2 groups in the model
It would be worthwhile to provide an example of how to use it on a downstream speech task.
I would also like to know how to do this.
CC @alexeib
For the results in the vq-wav2vec paper, once you tokenize your target data you can just follow the RoBERTa example, e.g. preprocess the data:
python preprocess.py --dataset-impl mmap --trainpref train.txt --destdir . --workers 60 --only-source --validpref valid.txt --srcdict dict.txt
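For completeness, producing the train.txt file from the quantizer indices could look like the sketch below. The token format (joining the two group indices per timestep with a dash) and the helper name are my assumptions, not something specified in the thread; any reversible encoding whose tokens match the entries in dict.txt should work.

```python
import torch

def indices_to_tokens(idxs: torch.Tensor) -> str:
    """Convert vq-wav2vec quantizer indices for one utterance into a line of
    text tokens, one token per timestep, combining the two group indices into
    a single token such as "12-305". (Hypothetical scheme, for illustration.)"""
    # idxs: (timesteps, groups) for a single utterance
    return " ".join("-".join(str(int(g)) for g in step) for step in idxs)

# Example with a fake (timesteps=3, groups=2) index tensor, shaped like the
# per-utterance output of model.vector_quantizer.forward_idx
fake_idxs = torch.tensor([[12, 305], [12, 307], [48, 9]])
line = indices_to_tokens(fake_idxs)
print(line)  # 12-305 12-307 48-9
```

Writing one such line per utterance into train.txt / valid.txt (and collecting the unique tokens into dict.txt) gives preprocess.py the plain-text input it expects.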
then train BERT
python train.py --distributed-world-size 128 --distributed-port 55498 /path/to/data --save-dir /checkpoint/dir --train-subset train --fp16 --num-workers 4 --save-interval-updates 25000 --keep-interval-updates 1 --no-epoch-checkpoints --task masked_lm --criterion masked_lm --sample-break-mode eos --tokens-per-sample 3072 --max-positions 6144 --arch roberta_base --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 0.0005 --total-num-update 250000 --warmup-updates 10000 --mask-multiple-length 10 --mask-prob 0.5 --mask-stdev 10 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 --max-tokens 4096 --update-freq 1 --max-update 250000 --seed 5 --log-format json --log-interval 500 --skip-invalid-size-inputs-valid-test
You can then extract RoBERTa representations and feed them as input into your favorite ASR model.
You can also fine-tune this model directly with CTC loss, as we did in the Effectiveness paper (https://arxiv.org/abs/1911.03912), but some of the code for this is currently not merged. It should be relatively straightforward to do if you want to try it yourself. I'll add the code in the coming weeks.
@alexeib Thanks. We have tried to adapt our pipeline codebase, which currently uses DeepSpeech2 / KML, but we were not able to get any further. It would be worthwhile to have a step-by-step guide.
Thank you so much.
@alexeib Thank you for the example. When I run the BERT training command, it throws errors saying the flags --mask-stdev and --mask-multiple-length don't exist, and I cannot find a reference to them anywhere in the repo. Could they have been deprecated?
Ah, you are right, I probably forgot to add these options to RoBERTa. I'll look into it soon; meanwhile you can try to add them yourself. Forget mask-stdev, and for multiple lengths it looks something like this in mask_tokens_dataset:
# decide elements to mask
mask = np.full(sz, False)
num_mask = int(
    # add a random number for probabilistic rounding
    self.mask_prob * sz / float(self.mask_multiple_length)
    + np.random.rand()
)
mask_idc = np.random.choice(sz, num_mask, replace=False)
if self.mask_stdev > 0.0:
    lengths = np.random.normal(self.mask_multiple_length, self.mask_stdev, size=num_mask)
    lengths = [max(0, int(round(x))) for x in lengths]
    mask_idc = np.asarray(
        [
            mask_idc[j] + offset
            for j in range(len(mask_idc))
            for offset in range(lengths[j])
        ],
        dtype=np.int64,
    )
else:
    mask_idc = np.concatenate([mask_idc + i for i in range(self.mask_multiple_length)])
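As a self-contained sketch, the span-masking logic above behaves roughly as follows (plain NumPy, outside the dataset class). The boundary handling at the end, clipping indices past the sequence length and de-duplicating overlapping spans, is my assumption about what the surrounding fairseq code does; it is not shown in the snippet above.

```python
import numpy as np

def compute_span_mask(sz, mask_prob, mask_multiple_length, mask_stdev, seed=0):
    """Pick starting indices, then expand each into a span of
    mask_multiple_length tokens (or a normally distributed length when
    mask_stdev > 0). Returns a boolean mask of shape (sz,)."""
    rng = np.random.RandomState(seed)
    # probabilistic rounding: add a uniform random number before truncating
    num_mask = int(mask_prob * sz / float(mask_multiple_length) + rng.rand())
    mask_idc = rng.choice(sz, num_mask, replace=False)
    if mask_stdev > 0.0:
        lengths = rng.normal(mask_multiple_length, mask_stdev, size=num_mask)
        lengths = [max(0, int(round(x))) for x in lengths]
        mask_idc = np.asarray(
            [
                mask_idc[j] + offset
                for j in range(len(mask_idc))
                for offset in range(lengths[j])
            ],
            dtype=np.int64,
        )
    else:
        mask_idc = np.concatenate([mask_idc + i for i in range(mask_multiple_length)])
    # assumed boundary handling: drop indices past the end, remove duplicates
    mask = np.full(sz, False)
    mask[np.unique(mask_idc[mask_idc < sz])] = True
    return mask

mask = compute_span_mask(sz=100, mask_prob=0.5, mask_multiple_length=10, mask_stdev=0.0)
print(int(mask.sum()))  # number of masked positions, at most mask_prob * sz
```

With mask_multiple_length=10 and mask_prob=0.5, each chosen start index grows into a 10-token span, so roughly half the sequence ends up masked, matching the --mask-multiple-length 10 --mask-prob 0.5 flags in the training command above.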
Thanks, I will give that a go!