Spleeter: [Discussion] Using Spleeter pretrained tensorflow models directly

Created on 1 Dec 2019 · 19 Comments · Source: deezer/spleeter

I'm mostly interested in using Spleeter's pretrained models directly within my C++ application, skipping the provided Python scripts as much as possible.

My understanding of how to use those pretrained models is as follows (based on the 4stems base_config.json):

  1. convert the audio data to 44100 Hz/stereo/float32
  2. transform it into 2 complex spectrograms (one per channel, FFT size 4096 samples, Hann windowing, FFT step 1024 samples)
  3. convert the 2 complex spectrograms to 2 magnitude spectrograms
  4. take the first 512 FFT frames/lowest 1024 bins of those 2 magnitude spectrograms, resulting in a 512x1024x2 float32 data block
  5. feed that 512x1024x2 float32 data block to the pretrained model (using TensorFlow), and get 4 512x1024x2 float32 prediction data blocks back
  6. use the resulting 4 predictions, the original magnitude spectrogram and the separation_exponent to compute 4 instrument masks: instrument_masks=(predictions^separation_exponent)/(original_magnitude_spectrogram^separation_exponent)
  7. apply the instrument masks to the original complex spectrograms and compute the inverse transforms to get 4 audio stems
  8. move forward by 512 FFT frames, and repeat from step 4 until the end of the file.
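
As an illustration of step 6, here is a minimal sketch in plain Python that follows the formula exactly as written in the list. The function name `ratio_mask`, the default exponent of 2 and the small `eps` guard against division by zero are assumptions made for the example; they are not taken from Spleeter's code.

```python
def ratio_mask(prediction, mixture, exponent=2.0, eps=1e-10):
    """Per-bin mask: prediction**exponent / mixture**exponent (step 6).

    `eps` is a hypothetical guard so that silent bins do not divide by zero.
    """
    return [
        (p ** exponent) / (m ** exponent + eps)
        for p, m in zip(prediction, mixture)
    ]


# Toy magnitudes for three frequency bins of a single frame.
mixture = [1.0, 2.0, 4.0]
vocals_prediction = [0.5, 2.0, 1.0]
mask = ratio_mask(vocals_prediction, mixture)
```

In step 7 each mask value would multiply the complex STFT bin at the same position before the inverse transform.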

Am I correct, or did I misunderstand or miss something? Does the TensorFlow model take float32 numbers as input and return float32 numbers as output?

Thanks!

question

All 19 comments

Hi @divideconcept. The .pb files we export, which you can integrate into your app using the C++ TensorFlow API, already include all the operations you list. When you load the graph in your app, you only need to feed it the raw waveform as a float32 array of shape (-1, 2). You will get the stems as output. You can inspect the content of the graph with TensorBoard.

That being said, if for some reason you want to do the DSP part yourself and only use the U-Net part of the graph, then yes, you are definitely on the right track. Just make sure in step 7 to also apply Hann windowing when computing the inverse Fourier transforms. When reconstructing the raw audio, mind the overlap between windows: it is actually an STFT you are inverting. You will need to divide the result of summing the overlapping frames by 1.5 to account for the two Hann windows (analysis and synthesis) and the frame step of 1/4 of the window length. Depending on the implementation of the inverse transform, you might also need to divide the raw audio you get after inversion by the window length.

Hope that helps.
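
The factor of 1.5 mentioned above can be checked numerically. The sketch below assumes a periodic Hann window applied twice (once for analysis, once for synthesis) and a frame step of one quarter of the window length; the helper names and the small window size are illustrative choices, not part of Spleeter.

```python
import math


def hann(n_fft):
    """Periodic Hann window of length n_fft."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]


def squared_window_overlap_sum(n_fft, hop):
    """Sum of w[n]**2 over all frames that cover each position within one hop."""
    w = hann(n_fft)
    return [
        sum(w[n + k * hop] ** 2 for k in range(n_fft // hop))
        for n in range(hop)
    ]


# Small sizes keep the check fast; hop = n_fft / 4 has the same ratio as 4096 / 1024.
gains = squared_window_overlap_sum(n_fft=16, hop=4)
```

Every entry of `gains` is the constant by which the overlap-added signal is scaled away from the edges of the file, which is why a single division restores the original level.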

Hi @alreadytaikeune, thanks for confirming!
Yep, I understand the inverse transform part and what's needed to properly stitch the audio segments.

Actually, the most mysterious part to me now is step 5: is there a way to convert your set of meta/index/data files into a .pb file that would take a 512x1024x2 block as input and output several 512x1024x2 blocks? When I look at the structure of the provided models, even in their meta/index/data form, it seems they are already structured to take a waveform as input instead of a spectrogram.
Any tips on how to input a spectrogram instead (and get spectrograms in return) would be appreciated!

@alreadytaikeune , first of all, thanks to Deezer for releasing Spleeter with such a permissive license! I'm also very interested in this. A .pb file that takes the STFT as input and outputs the instrument masking weights for each stem would be great. Unfortunately, I don't have any experience with TensorFlow, but as far as I understand, this involves "freezing" the graph and specifying the right input and output nodes, right? I think it could be interesting for all parties to agree on a format so that these models could be embedded into the commercial tools available. This would make it easier for end-users to use and we could add some additional features for comfortable use, such as background separation to make it seem like a real-time operation. I see that you're the developer behind SpectraLayers, @divideconcept. I would be very interested in adding this into the audio editor Acoustica that I'm involved with as well.

Hi @divideconcept , you can convert the pb file into an event file that you can then analyze using tensorboard. This will allow you to see exactly which tensors correspond to what you want to define as inputs and outputs. You can then use these names to export a subgraph that only deals with spectrograms.

You can use the following snippet to create the event file. model_file is something like 2stems/model.data-00000-of-00001, and log_dir is the directory in which to write the event file.

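from tensorflow.core.framework import graph_pb2
from tensorflow.python.client import session
from tensorflow.python.framework import importer, ops
from tensorflow.python.platform import gfile
from tensorflow.python.summary import summary

with session.Session(graph=ops.Graph()) as sess:
    # Parse the file as a serialized GraphDef and import it into the session graph.
    with gfile.FastGFile(model_file, "rb") as f:
        graph_def = graph_pb2.GraphDef()
        graph_def.ParseFromString(f.read())
        importer.import_graph_def(graph_def)

    # Write the graph to an event file that TensorBoard can display.
    pb_visual_writer = summary.FileWriter(log_dir)
    pb_visual_writer.add_graph(sess.graph)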

As to the last part of the question: although you can run the separation model for any value of freq_bin_max that is a multiple of 128, it is quite unclear how well it will perform, since the same bin index in the input spectrogram will no longer correspond to the same frequencies, and increasing the resolution of the spectrogram modifies some characteristics such as the sharpness of the peaks. It is therefore hard to predict how the model will behave in this setting. You can try it though!
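
To make the bin-to-frequency point concrete, here is a small worked example. It assumes the 44100 Hz sample rate and 4096-sample FFT from the original post; the function name `bin_to_hz` and the alternative settings are chosen only for illustration.

```python
def bin_to_hz(bin_index, sample_rate=44100, frame_length=4096):
    """Centre frequency in Hz of an STFT bin."""
    return bin_index * sample_rate / frame_length


# Upper edge of the 1024 bins kept in the configuration discussed in the thread.
default_cutoff = bin_to_hz(1024)
# The same bin index after doubling the FFT size maps to half the frequency.
doubled_fft_cutoff = bin_to_hz(1024, frame_length=8192)
# Keeping more bins (1536 is a multiple of 128) at the original FFT size.
wider_cutoff = bin_to_hz(1536)
```

With these numbers the default setting covers frequencies up to 11025 Hz, bin 1024 of an 8192-sample FFT sits at 5512.5 Hz, and keeping 1536 bins extends the range to 16537.5 Hz, so a model trained on the default layout would see content it was not trained on.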

Hello @saagedal, thanks for your interest. You are right about freezing the graph. That's probably the easiest way to use the model outside the code we have provided and embed it in other software. That being said, I'm not sure what you mean when you say we need to agree on a format. Do you mean an exchange format for the graph? Why isn't the default protobuf file sufficient? Or do you mean standardizing the names of the input and output tensors?

Thanks for answering, @alreadytaikeune. It's good to hear that the direction is correct. I'm sorry for not being concise regarding the standardized format. I was thinking about an XML file or similar that describes the names of the input and output tensors to use for STFT input and instrument masks output. That way it would be very easy to exchange the model.

I did manage to load the models in TensorBoard and examine them; however, the graphs are a bit intimidating. I’ve also tried to “freeze” the pretrained model using the “freeze_graph” command line tool. First I had to create a .pbtxt that describes the graph (using the write_graph command). freeze_graph lets us specify one or more output nodes, so that in theory, it should be possible to get the masks. I thought I could use the STFT from the graph and simply resample audio on demand and feed it to the graph. Here’s the command line I used:

freeze_graph --input_graph="2stems.pbtxt" --input_checkpoint="2stems/model" --output_graph="frozen.pb" --output_node_names="conv2d_13/Sigmoid, conv2d_6/Sigmoid"

The above is for the 2stems model and as far as I can understand the “conv2d_13/Sigmoid” and “conv2d_6/Sigmoid” nodes contain the masking for voice accompaniment and voice respectively. Unfortunately, I get the error “IndexError: list index out of range” when I try this and this is where I’m currently stuck…

How do I create a pb file from the existing models?

Okay, I managed to get the model from the checkpoint into TensorBoard. I found the input "waveform" as a (-1, 2) shaped float32 array. However, I can't find the output nodes, which I need to define for freezing the graph. What are the names of the output nodes that give the stem waveforms?

Hello, Sorry if I am late to the discussion. I was wondering where I could access the pretrained models spleeter uses. I checked the Github Releases link and it would not load....

Hello, I'm working on a project that aims to provide a simple C++ interface to Spleeter inference.
I am using TensorFlow graph loading as suggested above. The project is still at a _very_ early stage, but you can take a look at that piece of code to understand how I do it.

@flocked, you can check the input and output names using the TensorFlow saved_model_cli. For example, when I check the exported 2stems model I get:

> saved_model_cli show --dir path/to/exported/model --all
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['audio_id'] tensor_info:
        dtype: DT_STRING
        shape: unknown_rank
        name: Placeholder_1:0
    inputs['mix_spectrogram'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 512, 1024, 2)
        name: strided_slice_3:0
    inputs['mix_stft'] tensor_info:
        dtype: DT_COMPLEX64
        shape: (-1, 2049, 2)
        name: transpose_1:0
    inputs['waveform'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 2)
        name: Placeholder:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['accompaniment'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 2)
        name: strided_slice_23:0
    outputs['audio_id'] tensor_info:
        dtype: DT_STRING
        shape: unknown_rank
        name: Placeholder_1:0
    outputs['vocals'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 2)
        name: strided_slice_13:0
  Method name is: tensorflow/serving/predict
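
The shapes in this signature can be related to the numbers in the original post with a little arithmetic. The sketch below is an assumption-laden illustration: it takes the FFT size (4096) and step (1024) from the first post, assumes one STFT frame per step with the last frame padded, and assumes the final segment is padded up to 512 frames. The function name is hypothetical.

```python
import math

FRAME_LENGTH = 4096  # FFT size from the original post
FRAME_STEP = 1024    # FFT step from the original post
T = 512              # frames per segment, from the mix_spectrogram shape
F = 1024             # bins kept per frame, from the mix_spectrogram shape


def spectrogram_shapes(num_samples):
    """Return (stft_frames, stft_bins, segments) for a stereo waveform.

    Assumes one frame per step (last frame padded) and a padded last segment.
    """
    stft_frames = math.ceil(num_samples / FRAME_STEP)
    stft_bins = FRAME_LENGTH // 2 + 1  # matches the 2049 in mix_stft
    segments = math.ceil(stft_frames / T)
    return stft_frames, stft_bins, segments


# One minute of audio at 44100 Hz.
shapes = spectrogram_shapes(60 * 44100)
```

Under these assumptions one minute of audio gives a mix_stft of shape (2584, 2049, 2) and a mix_spectrogram of shape (6, 512, 1024, 2) once the lowest F bins are kept.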

I have also been working on this. I have separately got about as far as @gvne, and took a very similar approach - though was using the lower level libtensorflow C API, having exported the estimator in the SavedModel format. Nice to know we are on similar paths.

Passing the entire audio file for inference uses far too much memory for this to be usable in production. I feel we need to export a subgraph so that we can feed it in a streaming fashion, similar to the training network.

Though, I have been wondering: would it be possible to just pass smaller chunks of audio to the network, let it do the STFT internally, and then stitch the output together (probably in a streaming fashion, out to a file)? Or would this yield different results? I've not thought it through, and might experiment. @alreadytaikeune, do you think this would work?

Hello @LucasThompson, glad to hear I'm not the only one using this method.

For large input, I was thinking about concatenating chunks by processing extra samples on the edges and using a cross fade on overlapping sections to avoid discontinuities. But that would probably give a different result than processing the whole file. Not sure if that would be significant though. I'd be happy to read it if you get a chance to test it.
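
The cross-fade idea described above could look something like the following sketch. It works on a single channel stored as a Python list and uses a linear ramp; the function name, the ramp shape and the toy data are assumptions for illustration, and this is not the code of the project mentioned in the thread.

```python
def crossfade_concat(left, right, overlap):
    """Join two chunks whose last/first `overlap` samples cover the same audio."""
    if overlap == 0:
        return left + right
    head = left[:-overlap]
    tail = right[overlap:]
    faded = []
    for i in range(overlap):
        # Linear ramp: the weight moves from the left chunk to the right chunk.
        w = (i + 1) / (overlap + 1)
        faded.append((1.0 - w) * left[len(left) - overlap + i] + w * right[i])
    return head + faded + tail


# Two separately processed chunks of a constant signal that share 4 samples.
chunk_a = [1.0] * 10
chunk_b = [1.0] * 10
joined = crossfade_concat(chunk_a, chunk_b, overlap=4)
```

For stereo output the same function would be applied to each channel. Whether the result is audibly different from processing the whole file is exactly the open question raised above.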

That being said, I agree it would be great to be able to work on STFTs rather than waveforms. It would avoid computing them twice when you want to chain the extraction with a frequency domain filter.
I see there is a mix_stft input when checking the saved model. Maybe adding the vocals_stft and accompaniment_stft won't be too much of a problem.

I'm interested in this too. Will any of you make the code available or will it be proprietary?

Hi @aidv, sorry for the late reply. My code is available on GitHub under the MIT license (just like Spleeter).
At the moment I support Unix systems, and I'm working on adding Windows support soon.

@gvne: thanks for posting. I'm curious: is the performance of TensorFlow in C++ similar to TensorFlow in Python (CPU), or much slower?

@flocked, it doesn't seem worse to me. In my tests I'm processing 6 seconds of stereo audio into two stems in around 2.5 seconds. I didn't test the Python version extensively though.

Closing this, but I added an entry in the FAQ pointing to this discussion.

@gvne Great. Thanks. Is there some way to contact you in private?

@aidv Sure, you should find a contact on my profile. If you want to discuss the project, you can also raise an issue on my project.
