Hi! I decided to use this project during our company hack days to build a voice-activated game, but struggled to find out how to stream microphone input into it. After digging through the docs, I found some examples using "MicrophoneInput.read", but no link explaining what this is supposed to be.
After searching around a bit, I found easy_audio https://github.com/lsegal/easy_audio, but it was not easy to work with: the library itself is old, and portaudio (the underlying library and C ffi calling code) is difficult to get to grips with. It would be great if there was either a recommended library or more functionality in this library to provide a microphone sampling API. Some examples of functionality could be: listen for x seconds, listen for a certain word/phrase.
In any case, some clarity around this would be great, as Ruby really does not seem to have much in the way of sound libraries, and those that do exist are old and difficult to use.
Cheers!
@Jarob22 Thanks for opening this issue! We definitely want to help people get up and running quickly with these awesome (and fun) ML APIs like Cloud Speech API.
@blowmage Any ideas on how we might provide a bit more onboarding help in our Speech docs with microphone input?
@Jarob22 What platform are you targeting? Most of the libraries used to access a microphone are platform specific, which is why we used a made-up library in the code examples. Unfortunately, your observations match ours: collecting audio from an input in Ruby is either not well supported or difficult to do.
Here is a streaming speech example running on OS X using the coreaudio gem. It collects the microphone input and formats it for the LINEAR16 format.
require "google/cloud/speech"
require "coreaudio"

input_device = CoreAudio.default_input_device
input_buffer = input_device.input_buffer 1024

speech = Google::Cloud::Speech.new

# Stream audio until the first utterance is found.
stream = speech.stream encoding: :linear16,
                       language: "en-US",
                       sample_rate: input_device.actual_rate

stream.start
input_buffer.start

25.times do
  break if stream.stopped?
  # Collect data from only the left channel
  bits = input_buffer.read(4096).to_a.map(&:first)
  # Convert the bits to 16-bit signed little-endian samples
  sample = bits.pack("s<*")
  # Send the sample to the Speech stream
  stream.send sample
end

input_buffer.stop
stream.stop
stream.wait_until_complete!

results = stream.results
puts results.first.transcript
If you want to watch me stumble through live coding this example you can see it here.
Hi @blowmage,
I'm on a Mac, and even this would have been preferable to easy_audio. I spent an entire day stuck because of some arcane issues between the C ffi and the GIL!
What data does CoreAudio return? One of the other confusions with easy_audio (well really portaudio) was that the data it returned was not a series of floats -1 to +1, but a series of floats from -? to +? (I never managed to figure out what the max/min values were).
CoreAudio returns an NArray of pairs of Integers, one value per channel. Most of CoreAudio is C and Objective-C, so I remember looking at the source a lot to figure out how it works.
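For future readers, here is a rough sketch of the per-frame handling described above, using plain Ruby arrays as a stand-in for the NArray so it runs without the coreaudio gem. The helper name `left_channel_linear16` is hypothetical, and the frame layout of `[left, right]` pairs is an assumption based on the `.to_a.map(&:first)` call in the example.

```ruby
# Hypothetical helper, not part of the coreaudio gem.
# Assumes each frame is a [left, right] pair of 16-bit integers.
def left_channel_linear16(frames)
  left = frames.map(&:first) # keep only the first (left) value of each pair
  left.pack("s<*")           # 16-bit signed, little-endian bytes
end

frames = [[0, 0], [1, -1], [-32768, 32767]]
bytes = left_channel_linear16(frames)

puts bytes.bytesize              # two bytes per sample
puts bytes.unpack("s<*").inspect # round-trips back to the left-channel values
```

Dropping the right channel keeps the stream mono; averaging the two channels would be another option if the left channel alone is too quiet.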
Ah, are these integers between -32768 and 32767? I was converting portaudio's floats and capping them at those values to get a stream of ints, then packing them as little-endian samples. I guess this would work. What I was trying to do was set the stream going and append the audio input to a temporary buffer. Then, when the amplitude went above a threshold, I would start actually listening for a command word (like Google Home/Echo has 'hey google', except mine was going to be something like 'play'), followed by the command itself, like 'move left' or 'throw fireball'.
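A minimal sketch of the conversion and threshold check described in this comment, for anyone attempting the same thing. It assumes the floats are nominally in the range -1.0 to 1.0, which may not hold for portaudio as noted earlier in the thread, so the scale factor may need adjusting. The names `floats_to_int16` and `loud?` are made up for illustration.

```ruby
INT16_MIN = -32_768
INT16_MAX = 32_767

# Scale floats (assumed nominal range -1.0..1.0) to 16-bit integers,
# capping anything that falls outside the 16-bit range.
def floats_to_int16(floats)
  floats.map { |f| (f * INT16_MAX).round.clamp(INT16_MIN, INT16_MAX) }
end

# True if any sample's absolute amplitude reaches the threshold.
def loud?(samples, threshold)
  samples.any? { |s| s.abs >= threshold }
end

floats = [0.0, 0.25, -1.0, 1.7] # 1.7 is out of range and gets capped
ints = floats_to_int16(floats)
pcm = ints.pack("s<*")          # 16-bit signed little-endian, as LINEAR16 expects

puts ints.inspect
puts loud?(ints, 10_000)
```

A peak check like `loud?` is the simplest trigger; an average over a short window would be less sensitive to single-sample spikes.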
I believe OS X CoreAudio uses 16-bit samples for audio, so it should be fine.
Alright. The code I initially used to record a simple command is in https://github.com/Jarob22/uocum_ludum/blob/master/sokoban.rb at the bottom if you wanted to use it for examples. Yours would be a great one to have, @blowmage.
The current Speech example for streaming looks like this:
require "google/cloud/speech"

speech = Google::Cloud::Speech.new

stream = speech.stream encoding: :linear16,
                       language: "en-US",
                       sample_rate: 16000

# Stream 5 seconds of audio from the microphone
# Actual implementation of microphone input varies by platform
5.times do
  stream.send MicrophoneInput.read(32000)
end
The fictional MicrophoneInput.read is explained as: "Actual implementation of microphone input varies by platform". Personally I'm hesitant about getting deeper into platform-specific details, or showing easy_audio examples that may be problematic for some users.
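As a side note on the numbers in that example: assuming the argument to the fictional `MicrophoneInput.read` is a byte count and the audio is mono, 32000 bytes of LINEAR16 at 16000 Hz is one second of audio, so five reads cover the five seconds mentioned in the comment. A small sketch of the arithmetic, with a hypothetical helper name:

```ruby
BYTES_PER_SAMPLE = 2 # LINEAR16 uses 16-bit (2-byte) samples

# Hypothetical helper: bytes needed for a given duration of LINEAR16 audio.
def linear16_bytes(seconds, sample_rate, channels = 1)
  (seconds * sample_rate * channels * BYTES_PER_SAMPLE).to_i
end

puts linear16_bytes(1, 16_000) # bytes in one second of 16 kHz mono audio
puts linear16_bytes(5, 16_000) # bytes in the full five seconds
```

The same calculation applies to any real microphone library: match the chunk size to the sample rate actually reported by the device.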
It's great that we're documenting OS X solutions here in this issue.
@frankyn Any thoughts regarding a Ruby tutorial for real-world Speech input streaming?
Rewrote the issue title for future searchers.
Closing, since @blowmage is in agreement that replacing MicrophoneInput.read with real-world code in the API doc example is currently problematic due to the platform-specific nature of the solutions.
Thanks again for bringing this up @Jarob22, I'm sure the code that @blowmage posted and the code that you've linked will be very helpful for other users.
@quartzmo No worries, hope this helps future people. In lieu of replacing the current documentation with platform-specific code, maybe at least a link to this ticket for "a couple of real-world examples" or similar wording would be useful?
Let's see what @frankyn has to say.
Ok!
Does the coreaudio gem support Linux?
No. The coreaudio gem is specific to OS X.