Librosa: Data loaded differs between librosa and pydub

Created on 12 Aug 2019 · 3 Comments · Source: librosa/librosa

Description


Data loaded in from wav file differs between two popular audio libs.

Steps/Code to Reproduce

from pydub import AudioSegment
import soundfile
import librosa
import numpy as np

audio = AudioSegment.from_file('matt_00007.wav', format='WAV')
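# soundfile.read returns (data, samplerate); keep both so samplerate2 is defined
audio2, samplerate2 = soundfile.read('matt_00007.wav')
# librosa.load returns (data, samplerate), resampled to sr if needed
audio3, samplerate3 = librosa.load('matt_00007.wav', sr=16000)

# sample rate of data loaded in by pydub is 16000 Hz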

print(np.array(audio.get_array_of_samples())[:5])
print((audio2 * samplerate2)[:5])
print((audio3 * samplerate3)[:5])

# Output:
[ 259  264  359 -244  317]
[ 126.46484375  128.90625     175.29296875 -119.140625    154.78515625]
[ 126.46484   128.90625   175.29297  -119.140625  154.78516 ]

Expected Results


Outputs should be the same. I need to guarantee consistency when I generate MFCCs on other platforms and possibly in C#, so I need to know which is correct.

Actual Results


Output differs. See above.

Versions

Darwin-18.6.0-x86_64-i386-64bit
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.16.4
SciPy 1.2.1
librosa 0.7.0


matt_00007.wav.zip

IO

All 3 comments

The main difference here comes from the choice of dtype. Librosa defaults to float32, soundfile defaults to float64 (hence the slightly higher precision), and it looks like pydub is returning integer-valued samples. I'm not sure I follow why you're multiplying the sample values by the sampling rate, though. If this is 16-bit audio, then you should be multiplying by 32768. On my machine, for your file, this produces:

In [18]: (y * 32768)[:5]                                                                      
Out[18]: array([ 259.,  264.,  359., -244.,  317.], dtype=float32)

which matches your reported values for pydub.
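The scaling can be checked without the audio file. The following is a minimal sketch, assuming the five pydub values reported above are signed 16-bit PCM samples and that the float loaders normalize by 32768 (consistent with the output shown, but an assumption as far as this snippet goes). The names `pcm`, `FULL_SCALE`, `normalized`, `restored`, and `scaled_by_rate` are illustrative only.

```python
import numpy as np

# The first five 16-bit PCM samples reported by pydub in this issue.
pcm = np.array([259, 264, 359, -244, 317], dtype=np.int16)

FULL_SCALE = 32768  # 2**15, the magnitude of the most negative int16 value

# Integer PCM -> normalized float, the form that float-based loaders return.
normalized = pcm.astype(np.float32) / FULL_SCALE

# Normalized float -> integer PCM values again.
restored = np.rint(normalized * FULL_SCALE).astype(np.int16)

# What the original snippet computed: normalized samples times the 16000 Hz rate.
scaled_by_rate = normalized * 16000

print(normalized)
print(restored)
print(scaled_by_rate)
```

Multiplying by the sampling rate reproduces the 126.46..., 128.90... values from the report, while multiplying by 32768 recovers the integers that pydub returned.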

Librosa and soundfile appear to be in agreement (up to numerical precision), which is all we can guarantee from our side. Librosa does not support integer-valued samples because many of the downstream analyses (STFT etc) would implicitly cast to floating point anyway, so we opted to put that requirement up front in the audio buffer validation check.
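For integer samples coming from another library, one option is to convert them to floating point before handing them to float-only code. The sketch below uses a hypothetical helper, `to_float32`, which is not part of librosa, soundfile, or pydub, and assumes signed-integer PCM scaled by its full-scale magnitude. It also shows that float32 and float64 copies of the same samples agree within numerical tolerance.

```python
import numpy as np

def to_float32(samples):
    """Hypothetical helper: return a float32 copy of an audio buffer.

    Signed-integer PCM is divided by its full-scale magnitude
    (32768 for int16); floating-point input is only cast.
    """
    samples = np.asarray(samples)
    if np.issubdtype(samples.dtype, np.floating):
        return samples.astype(np.float32)
    if np.issubdtype(samples.dtype, np.signedinteger):
        full_scale = -float(np.iinfo(samples.dtype).min)
        return (samples / full_scale).astype(np.float32)
    raise TypeError(f"unsupported sample dtype: {samples.dtype}")

pcm = np.array([259, 264, 359, -244, 317], dtype=np.int16)

y32 = to_float32(pcm)   # float32 buffer
y64 = pcm / 32768.0     # float64 buffer from the same samples

# Compare with a tolerance rather than exact equality across dtypes.
same = np.allclose(y32, y64)
print(y32.dtype, y64.dtype, same)
```

When comparing results across libraries or platforms, a tolerance-based check such as `np.allclose` is usually a safer test than exact equality, since the dtypes differ.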

To summarize: I think everything here is behaving to spec.

You are totally right. My mistake. Please forgive my naivety as I begin working with audio in Python. Thank you!

no problem, glad we could sort it out quickly!
