Hello Team,
Thank you for your effort in building such a fantastic package. This is not a bug; it is more a question about what this package can do.
Could spleeter be used to clean noise from audio files? My analogy is this: if I treat the noise as the music and the conversation as the singer's voice, can I use spleeter to separate the noise from a recorded conversation?
Your input will be helpful.
I am not affiliated with the project, but FWIW :
I grabbed 1:00 to 3:00 of the audio of this BMXer recording himself on a GoPro camera, riding a bike around Los Angeles and speaking to the camera.
https://www.youtube.com/watch?v=a0vgJ3TeJzA
I then ran it through the 2stem spleeter :
spleeter separate -o ./audio_output/ -p spleeter:2stems -i input/00-YOUTUBE/a0vgJ3TeJzA.flac
I then merged the two files into one, with the vocals panned hard left and the accompaniment panned hard right :
ffmpeg -i vocals.wav -i accompaniment.wav -filter_complex "[0:a][1:a]amerge=inputs=2,pan=stereo|c0<c0+c1|c1<c2+c3[a]" -map "[a]" stereo.split.accomp.and.vocals.wav
The result was the bicyclist talking in my left ear, and a bunch of background noise in my right ear. When I listened to the left channel alone, I heard the speaking quite clearly. When I listened to the right channel alone, I heard almost no speaking. The only time I heard his voice in the right ear was when it could be mistaken for a percussive drum sound. I sometimes heard people other than him speaking in the left ear, when he biked past people who were talking.
I believe this is sufficient proof of the concept of using spleeter's 2stem model to isolate spoken human voices from recordings with background noise. Try it yourself and see!
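For anyone curious what the ffmpeg filter above does to the samples, here is a minimal sketch in plain Python. It is an illustration only, not part of the experiment: the function name and the list-of-pairs input format are made up for the example. It relies on two documented ffmpeg behaviours: `amerge` stops at the shortest input, and the `<` form of `pan` renormalizes the gains so they sum to 1.

```python
def pan_vocals_left_noise_right(vocals, accompaniment):
    """Mimic pan=stereo|c0<c0+c1|c1<c2+c3 applied after amerge=inputs=2.

    Hypothetical helper: each argument is a sequence of (left, right)
    sample pairs taken from one stereo stem.
    """
    # amerge ends with the shortest input, so truncate to the shorter stem.
    n = min(len(vocals), len(accompaniment))
    out = []
    for (vl, vr), (al, ar) in zip(vocals[:n], accompaniment[:n]):
        # '<' renormalizes gains to sum to 1, so each output channel
        # is the average of its two source channels.
        left = (vl + vr) / 2   # vocals stem, downmixed to mono
        right = (al + ar) / 2  # accompaniment stem, downmixed to mono
        out.append((left, right))
    return out


if __name__ == "__main__":
    speech = [(0.2, 0.4), (1.0, 0.0)]
    noise = [(0.0, 0.0), (-0.5, 0.5), (0.1, 0.1)]
    print(pan_vocals_left_noise_right(speech, noise))
```

Muting either output channel then lets you judge one stem on its own, which is how the listening test above was done.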
Hi @AIGyan
We haven't done any sort of evaluation on this task, nor was our model trained on such examples. Speech enhancement and denoising being an active research field, I assume there are more specialized tools to do that out there.
That being said, I can only agree with @awesomer: feel free to try it out and let us know what you find!