I am not very familiar with the mathematics behind these signal representations, but I would like to convert a given audio time series (audio buffer) into MIDI.
The problem is that the result contains negative values and values far outside the 0-127 range, so I cannot write a MIDI file to listen to and test. Or is my procedure simply wrong in the first place?
It would be great if somebody could explain this as well. Here is my code:
import librosa
import numpy as np
signal, sampleRate = librosa.load(path="sound.wav", sr=None, mono=True, dtype=np.float32)
melspec = librosa.feature.melspectrogram(y=signal, sr=sampleRate, n_fft=1024, hop_length=1024, n_mels=128)
freq = librosa.mel_to_hz(melspec)
midi = librosa.hz_to_midi(freq)
print(midi)
Expected output: [[int]] with element values between 0 and 127
Actual output:
[[-287 -130 -105 ... -118 -134 -117]
[-292 -144 -110 ... -117 -131 -110]
[-301 -144 -127 ... -111 -122 -98]
...
[-329 -327 -332 ... -334 -322 -272]
[-328 -330 -333 ... -325 -348 -271]
[-332 -340 -323 ... -325 -332 -273]]
I think this stems from a misunderstanding of what these functions do.
mel_to_hz converts mel bin indices to their corresponding frequencies. It does not operate on the values contained in a mel spectrogram array.
@bmcfee Do you have any suggestions on how I should approach this problem?
Essentially, I am trying to measure how accurately a vocal recording follows a musical score (MIDI) in terms of pitch, tone, etc.
So the general problem that you're describing (audio -> midi or symbolic score) is incredibly difficult and an active research area.
The more specific problem you describe later (vocal recording to pitch) is easier, but still takes considerable modeling and parameter tuning to achieve accurate results. This kind of thing isn't currently implemented in librosa, though we provide the building blocks to do it. There's an open issue #527 to implement a simple pitch tracking algorithm, which would provide fundamental frequency estimates over time for a given recording. From that, you could convert pitch (Hz) to MIDI, and then round that to integer values to get quantized notes, if that's what you're ultimately after.
If you just need something that works out of the box already, you might look into melodia or deep salience (with the singlef0 option).
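As a rough sketch of the last step described above (pitch in Hz to rounded MIDI notes), assuming you already have per-frame fundamental frequency estimates from some pitch tracker: the helper name `f0_to_midi_notes`, the `-1` marker for unvoiced frames, and the example values are all hypothetical and only for illustration. The conversion uses the standard relation midi = 69 + 12 * log2(f / 440).

```python
import numpy as np

def f0_to_midi_notes(f0_hz):
    """Convert per-frame f0 estimates (Hz) to integer MIDI note numbers.

    Frames with no usable pitch (NaN, zero, or negative) are marked as -1.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = np.isfinite(f0_hz) & (f0_hz > 0)

    notes = np.full(f0_hz.shape, -1, dtype=int)
    # Standard Hz -> MIDI mapping, with A4 = 440 Hz = note 69.
    midi = 69.0 + 12.0 * np.log2(f0_hz[voiced] / 440.0)
    # Round to the nearest semitone and keep within the valid MIDI range.
    notes[voiced] = np.clip(np.rint(midi), 0, 127).astype(int)
    return notes

# Made-up f0 track: NaN marks frames where no pitch was detected.
f0_example = np.array([np.nan, 220.0, 440.0, 445.0, 880.0, np.nan])
notes = f0_to_midi_notes(f0_example)
print(notes)
```

Rounding discards how sharp or flat the singer is, so for comparing a vocal take against a score it may be worth keeping the unrounded MIDI values as well.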