Well, I have just started learning ASR (or is it STT?) and TTS. I found wav2vec and it looks like it can do what I have in mind, but I am currently quite lost because this is my first time doing something like this, so I have taken the example below and wonder what to do next :).
#base from https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md
import torch
import torchaudio
from fairseq.models.wav2vec import Wav2VecModel
#cp = torch.load('/path/to/wav2vec.pt')
# download from https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_large.pt
cp = torch.load('/home/tyoc213/Chromium/wav2vec_large.pt', map_location='cpu')  # map_location lets the checkpoint load on CPU-only machines
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()
#wav_input_16khz = torch.randn(1,10000)
#z = model.feature_extractor(wav_input_16khz)
filename = "/home/tyoc213/github/fairseq/hello.wav"
# https://pytorch.org/tutorials/beginner/audio_preprocessing_tutorial.html
waveform, sample_rate = torchaudio.load(filename)
print(f"waveform {waveform.shape}")
z = model.feature_extractor(waveform)   # latent features, shape (1, 512, T)
c = model.feature_aggregator(z)         # context representations, shape (1, 512, T)
# calling .numpy() on a tensor that requires grad raises
# "Use var.detach().numpy() instead", hence the detach() here;
# squeeze(0) drops the batch dimension
z = z.squeeze(0).detach().numpy()
c = c.squeeze(0).detach().numpy()
print(f"z {z.shape}")  # (512, T), roughly one frame per 10 ms at 16 kHz
print(f"c {c.shape}")  # (512, T)
# then what??? I can't find info here:
# https://github.com/pytorch/fairseq/issues/1228
# https://github.com/pytorch/fairseq/issues/2058
# https://github.com/pytorch/fairseq/issues/1811
##### extracted mostly from examples/wav2vec/wav2vec_featurize.py
import h5py
import os
import numpy as np
class H5Writer:
    """Write features as an hdf5 file in wav2letter++ compatible format."""

    def __init__(self, fname):
        self.fname = fname
        # create the parent directory, not the file path itself:
        # os.makedirs(self.fname) made a *directory* named like the file,
        # which is why the original call broke the subsequent write
        path = os.path.dirname(self.fname)
        if path:
            os.makedirs(path, exist_ok=True)

    def write(self, data):
        print(f"data shape {data.shape}")
        channel, T = data.shape
        with h5py.File(self.fname, "w") as out_ds:
            # wav2letter++ expects frame-major (T x channel) flattened data
            data = data.T.flatten()
            out_ds["features"] = data
            # info = [frame rate in Hz (16 kHz / 160-sample stride = 100), T, channel]
            out_ds["info"] = np.array([16e3 // 160, T, channel])
def dowrite(use_feat=False):
    """Write z (use_feat=True) or c (use_feat=False) to an hdf5 file."""
    feat = z if use_feat else c
    target_fname = f"h5_use_feat_{use_feat}"
    writer = H5Writer(target_fname)
    writer.write(feat)

dowrite()
dowrite(True)
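To double-check the output, here is a quick sanity check I would add (my own sketch, not from wav2vec_featurize.py): read one file back and confirm the features round-trip to the (T, channel) layout the writer flattened.

with h5py.File("h5_use_feat_False", "r") as ds:
    rate, T, channel = ds["info"][:]
    feats = ds["features"][:].reshape(int(T), int(channel))
    print(f"{int(rate)} Hz frame rate, {int(T)} frames x {int(channel)} channels")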
Just the above... I don't know exactly what to do after I have those two files...
Also, after this is solved, should I keep asking questions here, or should I ask instead at https://discuss.pytorch.org/c/audio/9? (I haven't found much there.)
Any update? How do I get the text from a wave file (audio)?
you can
a) use fairseq speech recognition models (check in examples/speech_recognition) with logmel filterbanks
b) adapt those models to accept wav2vec features as input instead
c) feed these representations into some other model (we used wav2letter++ in our paper); a rough sketch of this idea follows below
d) wait for wav2vec 2.0 that should be coming out in the next few weeks that is able to do both pre-training and speech recognition in one model
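For (b)/(c), a minimal hypothetical sketch of the idea: put a small CTC head on top of the aggregator output c. The vocabulary size, layer sizes, and the head itself are placeholders I made up for illustration, not the setup from the paper or from wav2letter++.

import torch
import torch.nn as nn

num_chars = 32   # hypothetical character vocabulary (incl. the CTC blank)
feat_dim = 512   # wav2vec-large aggregator output channels

head = nn.Sequential(  # placeholder acoustic-model head
    nn.Linear(feat_dim, 256),
    nn.ReLU(),
    nn.Linear(256, num_chars),
)

feats = torch.from_numpy(c).T.unsqueeze(1)    # (T, 1, feat_dim), as CTC expects
log_probs = head(feats).log_softmax(dim=-1)   # (T, 1, num_chars)
# from here: train with nn.CTCLoss against character targets, then decode
# greedily or with a beam-search decoder, which is what wav2letter++ provides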
We'll wait for wav2vec 2.0, thanks @alexeib!