I am trying to take the outputs from the vq-wav2vec model and pass them to the pre-trained RoBERTa model. After looking at the docs and source code, I believe what I have supplied below is the correct vq-wav2vec / RoBERTa pipeline for extracting RoBERTa features. Could this please be confirmed?
Also, is wrapping the sequence with <s>/</s> tokens necessary in the audio use case?
Many thanks
def indices_to_string(idxs):
    # based on fairseq/examples/wav2vec/vq-wav2vec_featurize.py
    # one token per timestep; the indices from the codebook groups are joined with "-"
    return " ".join("-".join(map(str, a.tolist())) for a in idxs.squeeze(0))

z = vqwav2vec.feature_extractor(x.unsqueeze(0))      # dense features
_, idxs = vqwav2vec.vector_quantizer.forward_idx(z)  # quantized codebook indices
idx_str = indices_to_string(idxs)
tokens = roberta.task.source_dictionary.encode_line(idx_str, append_eos=False, add_if_not_exist=False)
# encode_line returns an IntTensor; cast to long before feeding the model
last_layer_features = roberta.extract_features(tokens.long())
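For context, the snippet assumes vqwav2vec, roberta and the waveform x are already in scope; a minimal loading sketch (paths are placeholders, loosely following the fairseq wav2vec and RoBERTa examples) might look like:

import torch
from fairseq.models.wav2vec import Wav2VecModel
from fairseq.models.roberta import RobertaModel

# load the vq-wav2vec checkpoint (placeholder path)
cp = torch.load('/path/to/vq-wav2vec.pt')
vqwav2vec = Wav2VecModel.build_model(cp['args'], task=None)
vqwav2vec.load_state_dict(cp['model'])
vqwav2vec.eval()

# load the RoBERTa checkpoint trained on vq-wav2vec codes (placeholder paths)
roberta = RobertaModel.from_pretrained('/path/to/roberta_dir', checkpoint_file='model.pt')
roberta.eval()

x = torch.randn(16000)  # stand-in for one second of 16 kHz audio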
I am also following your pipeline. I first used the vq-wav2vec k-means pipeline to extract tokens and then sent the tokens through the given RoBERTa checkpoint. Just to clarify, did you also use the dictionary given inside the RoBERTa-wav2vec checkpoint folder? Also, what are the architectural details of the given RoBERTa checkpoint?
You extract the codes to a text file, then you run preprocess.py on it as if it were a regular text file. Specify --srcdict and point it to the dict file in the tar (otherwise it will construct a new dict which won't match what the model was trained with). E.g. if you extracted codes to train.src and your corresponding labels file is train.ltr; if you don't have labels then use --only-source:
python ~/fairseq-py/preprocess.py --dataset-impl mmap --trainpref train --destdir . --workers 40 --srcdict dict.txt --validpref valid -s src -t ltr
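As a hedged sketch of the "extract the codes to a text file" step (reusing indices_to_string from above; waveforms is an assumed iterable of 1-D audio tensors):

# write one line of dash-joined codebook indices per utterance
with open('train.src', 'w') as f:
    for x in waveforms:  # assumption: 1-D float tensors at 16 kHz
        z = vqwav2vec.feature_extractor(x.unsqueeze(0))
        _, idxs = vqwav2vec.vector_quantizer.forward_idx(z)
        f.write(indices_to_string(idxs) + '\n')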
Thanks a lot.
I have run the following command
python ~/fairseq/preprocess.py --dataset-impl mmap --trainpref train --destdir . --workers 40 --srcdict dict.txt -s src --only-source
and it creates two files, train.src-None.src.bin and train.src-None.src.idx. What do I do with these files in order to extract RoBERTa features?
@david-macleod I tried to train the RoBERTa base by feeding tokens generated by following @alexeib's pipeline. In those tokens, I saw beginning-of-sentence (BOS) and end-of-sentence (EOS) tokens, just like RoBERTa. The other tokens matched your pipeline perfectly, so this is the same as RoBERTa. In your pipeline, just add 0 (BOS) to the beginning and 2 (EOS) to the end.
thanks @shamanez, so I could just update my function like so:
def indices_to_string(idxs):
    # based on fairseq/examples/wav2vec/vq-wav2vec_featurize.py
    # note the spaces around the markers so encode_line treats them as separate tokens
    return "<s> " + " ".join("-".join(map(str, a.tolist())) for a in idxs.squeeze(0)) + " </s>"
Just out of interest, how did you go from the outputs of preprocess.py (the .bin and .idx files) to training Roberta?
Sorry, I thought the question was how to train using the vq-wav2vec codes.
If you use --only-source, then omit -s and -t and specify the full file name in --trainpref, e.g.
python ~/fairseq-py/preprocess.py --dataset-impl mmap --trainpref train.src --destdir . --workers 40 --only-source --validpref valid.src
You will then get a train.bin and train.idx (and valid/test equivalents if you specify --validpref and --testpref). You can then train a RoBERTa model by following the RoBERTa examples and providing the directory containing your train.bin/idx as the data path.
For extracting representations from RoBERTa, you can load the RoBERTa model, then iterate over the lines in your text file and do something like:
for line in lines:
    # 0 and 2 are the BOS and EOS indices in the fairseq dictionary
    vec = [0] + [self.dict.index(w) for w in line.split()] + [2]
    x = torch.LongTensor(vec).unsqueeze(0).cuda()
    z = self.roberta_model.extract_features(x)
@david-macleod
Just out of interest, how did you go from the outputs of preprocess.py (the .bin and .idx files) to training Roberta?
First, you need to run the following script to convert wav files into k-means tokenized representations:
$ PYTHONPATH=/path/to/fairseq python examples/wav2vec/vq-wav2vec_featurize.py --data-dir /manifest/path --output-dir /path/to/output \
    --checkpoint /model/path/checkpoint_best.pt --split train valid test --extension tsv
Then follow the first part of @alexeib's answer above.
@alexeib what is self.dict in your answer? Can I load the dict.txt given in the checkpoint directly?
What if I use the following method to extract tokens?
tokens = roberta.task.source_dictionary.encode_line(idx_str, append_eos=False, add_if_not_exist=False)
This way the dictionary is modified with special tokens and the indices are shifted by the number of special tokens, because in a usual RoBERTa task you have to add <s>, </s>, <pad> and <unk>.
Yes, it is dict.txt loaded using the Dictionary class.
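A minimal sketch of that loading step (Dictionary.load is part of fairseq; the path is a placeholder):

from fairseq.data import Dictionary

d = Dictionary.load('dict.txt')
print(len(d), d.bos(), d.pad(), d.eos(), d.unk())  # the special tokens occupy the first indices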
@alexeib @shamanez I am also interested to know whether using roberta.task.source_dictionary for the token mapping is a valid approach, compared to loading dict.txt. It also has a <mask> token at position 23672.
Both approaches should work (the <mask> token will be at the last position so it shouldn't matter).
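As a hedged sanity check that the two mappings agree (reusing idx_str from earlier; the extra <mask> entry only extends the end of the source dictionary):

import torch
from fairseq.data import Dictionary

d = Dictionary.load('dict.txt')
a = d.encode_line(idx_str, append_eos=False, add_if_not_exist=False)
b = roberta.task.source_dictionary.encode_line(idx_str, append_eos=False, add_if_not_exist=False)
assert torch.equal(a, b)  # identical ids for the same code string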