Here is an example of how to extract features and generate embeddings for random speech data with the new vq-wav2vec model:
import torch
from fairseq.models.wav2vec import Wav2VecModel
cp = torch.load('/path/to/vq-wav2vec.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()
wav_input_16khz = torch.randn(1,10000)
z = model.feature_extractor(wav_input_16khz)
_, idxs = model.vector_quantizer.forward_idx(z)
print(idxs.shape) # output: torch.Size([1, 60, 2]), 60 timesteps with 2 indexes corresponding to 2 groups in the model
It would be worthwhile to provide an example of how to use it on a downstream speech task.
I would also like to know how to do this.
CC @alexeib
For the results in the vq-wav2vec paper, once you tokenize your target data you can just follow the RoBERTa example, e.g. preprocess the data:
python preprocess.py --dataset-impl mmap --trainpref train.txt --destdir . --workers 60 --only-source --validpref valid.txt --srcdict dict.txt
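For completeness, producing the train.txt file from the quantizer indices could look like the sketch below. The token format (joining the two group indices per timestep with a dash) and the helper name are my assumptions, not something specified in the thread; any reversible encoding whose tokens match the entries in dict.txt should work.

```python
import torch

def indices_to_tokens(idxs: torch.Tensor) -> str:
    """Convert vq-wav2vec quantizer indices for one utterance into a line of
    text tokens, one token per timestep, combining the two group indices into
    a single token such as "12-305". (Hypothetical scheme, for illustration.)"""
    # idxs: (timesteps, groups) for a single utterance
    return " ".join("-".join(str(int(g)) for g in step) for step in idxs)

# Example with a fake (timesteps=3, groups=2) index tensor, shaped like the
# per-utterance output of model.vector_quantizer.forward_idx
fake_idxs = torch.tensor([[12, 305], [12, 307], [48, 9]])
line = indices_to_tokens(fake_idxs)
print(line)  # 12-305 12-307 48-9
```

Writing one such line per utterance into train.txt / valid.txt (and collecting the unique tokens into dict.txt) gives preprocess.py the plain-text input it expects.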
then train BERT
python train.py --distributed-world-size 128 --distributed-port 55498 /path/to/data --save-dir /checkpoint/dir --train-subset train --fp16 --num-workers 4 --save-interval-updates 25000 --keep-interval-updates 1 --no-epoch-checkpoints --task masked_lm --criterion masked_lm --sample-break-mode eos --tokens-per-sample 3072 --max-positions 6144 --arch roberta_base --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 0.0005 --total-num-update 250000 --warmup-updates 10000 --mask-multiple-length 10 --mask-prob 0.5 --mask-stdev 10 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 --max-tokens 4096 --update-freq 1 --max-update 250000 --seed 5 --log-format json --log-interval 500 --skip-invalid-size-inputs-valid-test
You can then extract RoBERTa representations and feed them as input into your favorite ASR model.
You can also fine-tune this model directly with CTC loss, as we did in the Effectiveness paper (https://arxiv.org/abs/1911.03912), but some of the code for this is currently not merged. It should be relatively straightforward to do if you want to try it yourself. I'll add the code in the coming weeks.
@alexeib Thanks. We have tried to adapt our pipeline codebase, which currently uses DeepSpeech2 / KML, but we were not able to get any further. It would be worthwhile to have a step-by-step guide.
Thank you so much.
@alexeib Thank you for the example. When I run the BERT training command, it throws errors saying the flags --mask-stdev and --mask-multiple-length don't exist, and I cannot find a reference to them anywhere in the repo. Could they have been deprecated?
Ah, you are right, I probably forgot to add these options to RoBERTa. I'll look into it soon; meanwhile you can try to add them yourself. Forget mask-stdev, and for multiple lengths it looks something like this in mask_tokens_dataset:
# decide elements to mask
mask = np.full(sz, False)
num_mask = int(
    # add a random number for probabilistic rounding
    self.mask_prob * sz / float(self.mask_multiple_length)
    + np.random.rand()
)
mask_idc = np.random.choice(sz, num_mask, replace=False)
if self.mask_stdev > 0.0:
    lengths = np.random.normal(self.mask_multiple_length, self.mask_stdev, size=num_mask)
    lengths = [max(0, int(round(x))) for x in lengths]
    mask_idc = np.asarray(
        [
            mask_idc[j] + offset
            for j in range(len(mask_idc))
            for offset in range(lengths[j])
        ],
        dtype=np.int64,
    )
else:
    mask_idc = np.concatenate([mask_idc + i for i in range(self.mask_multiple_length)])
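As a self-contained sketch, the span-masking logic above behaves roughly as follows (plain NumPy, outside the dataset class). The boundary handling at the end, clipping indices past the sequence length and de-duplicating overlapping spans, is my assumption about what the surrounding fairseq code does; it is not shown in the snippet above.

```python
import numpy as np

def compute_span_mask(sz, mask_prob, mask_multiple_length, mask_stdev, seed=0):
    """Pick starting indices, then expand each into a span of
    mask_multiple_length tokens (or a normally distributed length when
    mask_stdev > 0). Returns a boolean mask of shape (sz,)."""
    rng = np.random.RandomState(seed)
    # probabilistic rounding: add a uniform random number before truncating
    num_mask = int(mask_prob * sz / float(mask_multiple_length) + rng.rand())
    mask_idc = rng.choice(sz, num_mask, replace=False)
    if mask_stdev > 0.0:
        lengths = rng.normal(mask_multiple_length, mask_stdev, size=num_mask)
        lengths = [max(0, int(round(x))) for x in lengths]
        mask_idc = np.asarray(
            [
                mask_idc[j] + offset
                for j in range(len(mask_idc))
                for offset in range(lengths[j])
            ],
            dtype=np.int64,
        )
    else:
        mask_idc = np.concatenate([mask_idc + i for i in range(mask_multiple_length)])
    # assumed boundary handling: drop indices past the end, remove duplicates
    mask = np.full(sz, False)
    mask[np.unique(mask_idc[mask_idc < sz])] = True
    return mask

mask = compute_span_mask(sz=100, mask_prob=0.5, mask_multiple_length=10, mask_stdev=0.0)
print(int(mask.sum()))  # number of masked positions, at most mask_prob * sz
```

With mask_multiple_length=10 and mask_prob=0.5, each chosen start index grows into a 10-token span, so roughly half the sequence ends up masked, matching the --mask-multiple-length 10 --mask-prob 0.5 flags in the training command above.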
Thanks, I will give that a go!