Well, I have just started learning ASR (or is it STT?) and TTS. I found wav2vec and it looks like it can do what I have in mind, but I am currently quite lost because this is my first time doing something like this, so I have taken the example below and wonder what to do next :).
#base from https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md
import torch
import torchaudio
from fairseq.models.wav2vec import Wav2VecModel
#cp = torch.load('/path/to/wav2vec.pt')
# download from https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_large.pt
cp = torch.load('/home/tyoc213/Chromium/wav2vec_large.pt', map_location='cpu')  # map_location lets the checkpoint load on CPU-only machines
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()
#wav_input_16khz = torch.randn(1,10000)
#z = model.feature_extractor(wav_input_16khz)
filename = "/home/tyoc213/github/fairseq/hello.wav"
# https://pytorch.org/tutorials/beginner/audio_preprocessing_tutorial.html
waveform, sample_rate = torchaudio.load(filename)
print(f"waveform {waveform.shape}")
z = model.feature_extractor(waveform)   # latent features, shape (1, 512, T)
c = model.feature_aggregator(z)         # context representations, shape (1, 512, T)
# calling .numpy() on a tensor that requires grad raises
# "Use var.detach().numpy() instead", hence the detach() here;
# squeeze(0) drops the batch dimension
z = z.squeeze(0).detach().numpy()
c = c.squeeze(0).detach().numpy()
print(f"z {z.shape}")  # (512, T), roughly one frame per 10 ms at 16 kHz
print(f"c {c.shape}")  # (512, T)
# then what??? I can't find info here:
# https://github.com/pytorch/fairseq/issues/1228
# https://github.com/pytorch/fairseq/issues/2058
# https://github.com/pytorch/fairseq/issues/1811
##### extracted mostly from examples/wav2vec/wav2vec_featurize.py
import h5py
import os
import numpy as np
class H5Writer:
    """Write features as an hdf5 file in wav2letter++ compatible format."""

    def __init__(self, fname):
        self.fname = fname
        # create the parent directory, not the file path itself:
        # os.makedirs(self.fname) made a *directory* named like the file,
        # which is why the original call broke the subsequent write
        path = os.path.dirname(self.fname)
        if path:
            os.makedirs(path, exist_ok=True)

    def write(self, data):
        print(f"data shape {data.shape}")
        channel, T = data.shape
        with h5py.File(self.fname, "w") as out_ds:
            # wav2letter++ expects frame-major (T x channel) flattened data
            data = data.T.flatten()
            out_ds["features"] = data
            # info = [frame rate in Hz (16 kHz / 160-sample stride = 100), T, channel]
            out_ds["info"] = np.array([16e3 // 160, T, channel])
def dowrite(use_feat=False):
    """Write z (use_feat=True) or c (use_feat=False) to an hdf5 file."""
    feat = z if use_feat else c
    target_fname = f"h5_use_feat_{use_feat}"
    writer = H5Writer(target_fname)
    writer.write(feat)

dowrite()
dowrite(True)
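To double-check the output, here is a quick sanity check I would add (my own sketch, not from wav2vec_featurize.py): read one file back and confirm the features round-trip to the (T, channel) layout the writer flattened.

with h5py.File("h5_use_feat_False", "r") as ds:
    rate, T, channel = ds["info"][:]
    feats = ds["features"][:].reshape(int(T), int(channel))
    print(f"{int(rate)} Hz frame rate, {int(T)} frames x {int(channel)} channels")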
Just the above... I don't know exactly what to do after I have those two files...
Also, after this is solved, should I keep asking questions here, or should I ask instead at https://discuss.pytorch.org/c/audio/9? (I haven't found much there.)
Any update? How do I get the text from a wave file (audio)?
you can
a) use fairseq speech recognition models (check in examples/speech_recognition) with logmel filterbanks
b) adapt those models to accept wav2vec features as input instead
c) feed these representations into some other model (we used wav2letter++ in our paper); a rough sketch of this idea follows below
d) wait for wav2vec 2.0 that should be coming out in the next few weeks that is able to do both pre-training and speech recognition in one model
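For (b)/(c), a minimal hypothetical sketch of the idea: put a small CTC head on top of the aggregator output c. The vocabulary size, layer sizes, and the head itself are placeholders I made up for illustration, not the setup from the paper or from wav2letter++.

import torch
import torch.nn as nn

num_chars = 32   # hypothetical character vocabulary (incl. the CTC blank)
feat_dim = 512   # wav2vec-large aggregator output channels

head = nn.Sequential(  # placeholder acoustic-model head
    nn.Linear(feat_dim, 256),
    nn.ReLU(),
    nn.Linear(256, num_chars),
)

feats = torch.from_numpy(c).T.unsqueeze(1)    # (T, 1, feat_dim), as CTC expects
log_probs = head(feats).log_softmax(dim=-1)   # (T, 1, num_chars)
# from here: train with nn.CTCLoss against character targets, then decode
# greedily or with a beam-search decoder, which is what wav2letter++ provides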
We'll wait for wav2vec 2.0, thanks @alexeib!