I am trying to take the outputs from the vq-wav2vec model and pass them to the pre-trained RoBERTa model. After looking at the docs and source code, I believe what I have supplied below is the correct vq-wav2vec / RoBERTa pipeline for extracting RoBERTa features. Could this please be confirmed?
Also, is wrapping the sequence with <s>/</s> tokens necessary in the audio use case?
Many thanks
def indices_to_string(idxs):
    # based on fairseq/examples/wav2vec/vq-wav2vec_featurize.py
    # one token per timestep; the indices from the codebook groups are joined with "-"
    return " ".join("-".join(map(str, a.tolist())) for a in idxs.squeeze(0))

z = vqwav2vec.feature_extractor(x.unsqueeze(0))      # dense features
_, idxs = vqwav2vec.vector_quantizer.forward_idx(z)  # quantized codebook indices
idx_str = indices_to_string(idxs)
tokens = roberta.task.source_dictionary.encode_line(idx_str, append_eos=False, add_if_not_exist=False)
# encode_line returns an IntTensor; cast to long before feeding the model
last_layer_features = roberta.extract_features(tokens.long())
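For context, the snippet assumes vqwav2vec, roberta and the waveform x are already in scope; a minimal loading sketch (paths are placeholders, loosely following the fairseq wav2vec and RoBERTa examples) might look like:

import torch
from fairseq.models.wav2vec import Wav2VecModel
from fairseq.models.roberta import RobertaModel

# load the vq-wav2vec checkpoint (placeholder path)
cp = torch.load('/path/to/vq-wav2vec.pt')
vqwav2vec = Wav2VecModel.build_model(cp['args'], task=None)
vqwav2vec.load_state_dict(cp['model'])
vqwav2vec.eval()

# load the RoBERTa checkpoint trained on vq-wav2vec codes (placeholder paths)
roberta = RobertaModel.from_pretrained('/path/to/roberta_dir', checkpoint_file='model.pt')
roberta.eval()

x = torch.randn(16000)  # stand-in for one second of 16 kHz audio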
I am also following your pipeline. I first used the vq-wav2vec k-means pipeline to extract tokens and then sent the tokens through the given RoBERTa checkpoint. Just to clarify, did you also use the dictionary given inside the RoBERTa-wav2vec checkpoint folder? Also, what are the architectural details of the given RoBERTa checkpoint?
You extract the codes to a text file, then you run preprocess.py on it as if it were a regular text file. Specify --srcdict and point it to the dict file in the tar (otherwise it will construct a new dict which won't match what the model was trained with). E.g. if you extracted codes to train.src and your corresponding labels file is train.ltr; if you don't have labels then use --only-source:
python ~/fairseq-py/preprocess.py --dataset-impl mmap --trainpref train --destdir . --workers 40 --srcdict dict.txt --validpref valid -s src -t ltr
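As a hedged sketch of the "extract the codes to a text file" step (reusing indices_to_string from above; waveforms is an assumed iterable of 1-D audio tensors):

# write one line of dash-joined codebook indices per utterance
with open('train.src', 'w') as f:
    for x in waveforms:  # assumption: 1-D float tensors at 16 kHz
        z = vqwav2vec.feature_extractor(x.unsqueeze(0))
        _, idxs = vqwav2vec.vector_quantizer.forward_idx(z)
        f.write(indices_to_string(idxs) + '\n')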
Thanks a lot.
I have run the following command
python ~/fairseq/preprocess.py --dataset-impl mmap --trainpref train --destdir . --workers 40 --srcdict dict.txt -s src --only-source
and it creates two files, train.src-None.src.bin and train.src-None.src.idx. What do I do with these files in order to extract RoBERTa features?
@david-macleod I tried to train the RoBERTa base by feeding tokens generated by following @alexeib's pipeline. In those tokens, I saw beginning-of-sentence (BOS) and end-of-sentence (EOS) tokens, just like RoBERTa. The other tokens matched your pipeline perfectly, so this is the same as RoBERTa. In your pipeline, just add 0 (BOS) to the beginning and 2 (EOS) to the end.
thanks @shamanez, so I could just update my function like so:
def indices_to_string(idxs):
    # based on fairseq/examples/wav2vec/vq-wav2vec_featurize.py
    # note the spaces around the markers so encode_line treats them as separate tokens
    return "<s> " + " ".join("-".join(map(str, a.tolist())) for a in idxs.squeeze(0)) + " </s>"
Just out of interest, how did you go from the outputs of preprocess.py (the .bin and .idx files) to training Roberta?
Sorry, I thought the question was how to train using the vq-wav2vec codes.
If you use --only-source, then omit -s and -t and specify the full file name in --trainpref, e.g.
python ~/fairseq-py/preprocess.py --dataset-impl mmap --trainpref train.src --destdir . --workers 40 --only-source --validpref valid.src
You will then get a train.bin and train.idx (and valid/test equivalents if you specify --validpref and --testpref). You can then train a RoBERTa model by following the RoBERTa examples and providing the directory containing your train.bin/idx as the data path.
For extracting representations from RoBERTa, you can load the RoBERTa model, then iterate over the lines in your text file and do something like:
for line in lines:
    # 0 and 2 are the BOS and EOS indices in the fairseq dictionary
    vec = [0] + [self.dict.index(w) for w in line.split()] + [2]
    x = torch.LongTensor(vec).unsqueeze(0).cuda()
    z = self.roberta_model.extract_features(x)
@david-macleod
Just out of interest, how did you go from the outputs of preprocess.py (the .bin and .idx files) to training Roberta?
First, you need to run the following script to convert wav files into k-means tokenized representations:
$ PYTHONPATH=/path/to/fairseq python examples/wav2vec/vq-wav2vec_featurize.py --data-dir /manifest/path --output-dir /path/to/output \
    --checkpoint /model/path/checkpoint_best.pt --split train valid test --extension tsv
Then follow the first part of @alexeib's answer above.
@alexeib what is self.dict in your answer? Can I load the dict.txt given in the checkpoint directly?
What if I use the following method to extract tokens?
tokens = roberta.task.source_dictionary.encode_line(idx_str, append_eos=False, add_if_not_exist=False)
This way the dictionary is modified with special tokens and the indices are shifted by the number of special tokens, because in a usual RoBERTa task you have to add <s>, </s>, <pad> and <unk>.
Yes, it is dict.txt loaded using the Dictionary class.
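A minimal sketch of that loading step (Dictionary.load is part of fairseq; the path is a placeholder):

from fairseq.data import Dictionary

d = Dictionary.load('dict.txt')
print(len(d), d.bos(), d.pad(), d.eos(), d.unk())  # the special tokens occupy the first indices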
@alexeib @shamanez I am also interested to know whether using roberta.task.source_dictionary for the token mapping is a valid approach, compared to loading dict.txt. It also has a <mask> token at position 23672.
Both approaches should work (the <mask> token will be at the last position so it shouldn't matter).
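As a hedged sanity check that the two mappings agree (reusing idx_str from earlier; the extra <mask> entry only extends the end of the source dictionary):

import torch
from fairseq.data import Dictionary

d = Dictionary.load('dict.txt')
a = d.encode_line(idx_str, append_eos=False, add_if_not_exist=False)
b = roberta.task.source_dictionary.encode_line(idx_str, append_eos=False, add_if_not_exist=False)
assert torch.equal(a, b)  # identical ids for the same code string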