Discord.py: RFC: Voice Receive API Design/Usage

Created on 20 Feb 2018 · 58 comments · Source: Rapptz/discord.py

Note: DO NOT use this in production. The code is messy (and possibly broken) and probably filled with debug prints. Use only with the intent to experiment or give feedback, although almost everything in the code is subject to change.

Behold the voice receive RFC. This is where I ask for design suggestions and feedback. Unfortunately not many people seem to have any idea of what their ideal voice receive api would look like so it falls to me to come up with everything. Should anyone have any questions/comments/concerns/complaints/demands please post them here. I will be posting the tentative design components here for feedback and will update them occasionally. For more detailed information on my progress see the project on my fork. I will also be adding an example soonish.

Overview

The main concept behind my voice receive design is to mirror the voice send api as much as possible. However, due to receive being more complex than send, I've had to take some liberties in creating some new concepts and functionality for the more complex parts. The basic usage should be relatively familiar:

```py
vc = await channel.connect()
vc.listen(MySink())
```

The voice send api calls an object that produces PCM packets a Source, whereas the receive api refers to them as a Sink. Sources have a read() function that produces PCM packets, so Sinks have a write(data) function that does something with PCM packets. Sinks can also optionally accept opus data to bypass the decoding stage if you so desire. The signature of the write(data) function is currently just a payload blob with the opus data, pcm data, and rtp packet, mostly for my own convenience during development. This is subject to change later on.
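For illustration, a minimal custom sink might look something like the sketch below. It assumes the fork's `AudioSink` base class (referenced later in this thread as `discord.reader.AudioSink`); the method and attribute names here are assumptions based on this thread, since the payload format is still in flux.

```py
import discord

class MySink(discord.AudioSink):
    def wants_opus(self):
        # Hypothetical hook: return True to receive raw opus frames and skip decoding.
        return False

    def write(self, data):
        # 'data' is the payload blob described above (pcm bytes, opus bytes, rtp packet).
        # The attribute name 'data.data' is an assumption for this sketch.
        handle_pcm(data.data)  # hypothetical user-defined handler
```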

The new VoiceClient functions are basically the same as the send variants, with listen() being the new counterpart to play().

Note: The stop() function has been changed to stop both playing and listening. I have added stop_playing() and stop_listening() for individual control.

Built in Sinks

For simply saving voice data to a file, you can use the built-in WaveSink to write it to a wav file. The way I currently have this implemented, however, is completely broken for more than one user.

Note: Here lies my biggest problem. I currently do not have any way to combine multiple voice "streams" into one stream. The way this works is Discord sends packets for all users on the same socket, differentiated by an id (aka ssrc, from RTP spec). These packets have timestamps, but with a random start offset, per ssrc. RTP has a mechanism where the reference time is sent in a control packet, but as far as I can tell, Discord doesn't send these control packets. As such, I have no way of properly synchronizing streams without excessive guesswork based on arrival time in the socket (unreliable at best). Until I can solve this there will be a few holes in the design, for example, how to record the whole conversation in a voice channel instead of individual users.

Sinks can be composed much like Sources can (PCMVolumeTransformer+FFmpegPCMAudio, etc). I will have some built in sinks for handling various control actions, such as filtering by user or predicate.

```py
# only listen to message.author
vc.listen(UserFilter(MySink(), message.author))

# listen for 10 seconds
vc.listen(TimedFilter(MySink(), 10))

# arbitrary predicate, could check flags, permissions, etc.
vc.listen(ConditionalFilter(MySink(), lambda data: ...))
```

and so forth. As usual, these are subject to change when I go over this part of the design again.

As mentioned before, mixing is still my largest unsolved problem. Combining all voice data in a channel into one stream is surely a common use case, and I'll do my best to figure out a solution, but I can't promise anything yet. If it turns out that my solution is too hacky, I might have to put it in some ext package on pypi (see: ext.colors).

For volume control, I recently found that libopus has a gain setting in the decoder. This is probably faster and more accurate than altering pcm packets after they've been decoded. Unfortunately, I haven't quite figured out how to expose this setting yet, so I don't have any public api to show for it.
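For the curious, libopus exposes this through a decoder ctl. Below is a rough ctypes sketch of how one might poke at it outside the library; the request constant and the Q8 dB scaling come from opus_defines.h, but nothing here is discord.py api and the decoder handle is whatever your own opus bindings give you.

```py
import ctypes
import ctypes.util

OPUS_SET_GAIN_REQUEST = 4034  # from opus_defines.h; gain is expressed in Q8 dB units

_opus = ctypes.CDLL(ctypes.util.find_library('opus'))

def set_decoder_gain(decoder_state, db):
    # Scale the decoder output by 'db' decibels inside libopus itself,
    # instead of multiplying the PCM samples after decoding.
    gain_q8 = int(round(db * 256))
    ret = _opus.opus_decoder_ctl(decoder_state, OPUS_SET_GAIN_REQUEST,
                                 ctypes.c_int32(gain_q8))
    if ret < 0:
        raise RuntimeError(f'opus_decoder_ctl failed with error {ret}')
```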

That should account for most of the public api part that I've designed so far. I still have a lot of miscellaneous things to do, so no ETA. Again, if you have any feedback whatsoever, please make yourself known either here or in the discord server.



Old issue content
Voice receive has been an occasionally requested feature for a very long time now, but its actual usage seems to be limited to some form of voice recognition (for commands or some other form of automation) and to recording. This is all fine and dandy, but the issue that arises is how to expose this in the library. With the v1.0 rewrite, voice send was redesigned to be more composable and give more control to the user. However, that also means the voice systems were designed with only send in mind. Voice send is far less complex than receive, so a considerable amount of effort has to go into designing an api that both fits well with the library and, more importantly, is useful and user friendly.

Essentially, voice receive is reading RTP packets from a socket, then decrypting them and decoding them from Opus to PCM. At this point it is fairly trivial to simply expose an event that produces PCM data and a member object, but that is far from useful or user friendly. This leads us to the main question:

Those who want to use voice receive, what kind of api do you want? What kind of api would you expect from the library?

Ideally, simple or common use cases should be trivial to accomplish. More complex requirements should not be difficult or unwieldy. The library should have enough batteries included to handle most simple situations, such as saving audio to a file. What form this takes is what needs to be decided.

My part in all this

I've been working on voice receive for a few months now, with most of my time spent trying to wrangle technical or design related problems. Those who have been listening have heard of my struggles with RTCP packets (or rather the lack thereof), the ultimate goal being functional stream mixing. This means combining the audio streams from multiple users into one stream of data (each user has their own stream). In the RTP spec, discord is supposed to send RTCP packets with the timing information necessary to synchronize these streams together. Actually mixing them afterwards is trivial. The problem is that Discord does not reliably send RTCP packets. You may or may not get them depending on which voice server you end up on. Trying to bodge some optimistic yolo handling is too sketchy to be the official implementation, and is subject to the kind of problems that creep in from having questionable code. Danny would never approve of that kind of code anyway, seeing as he then has to support it.

Possible design

In the following section I will explain the design I came up with when working on this. If you want to reply to this RFC without being biased by reading my design, skip this section and come back after you've had your say.


Click to be biased

My idea was to essentially mirror the voice send api. This seemed like the logical conclusion to me. The voice send api has source objects which produce voice data. These can be composed in any way and eventually fed to the lib where it handles it from there. So why not do the reverse for voice receive? Have objects that receive voice data and do some arbitrary operations on it. These can be composed the same way source objects can be: usually nested, but advanced usages can have branching or some other form of dynamic usage. I decided to call these objects Sinks. You would use them the same way you would sources.

```py
# using a source
voice_client.play(discord.FFmpegPCMAudio("song.mp3"))

# using a sink
voice_client.read(discord.WaveSink("voice.wav"))
```

The WaveSink is a simple built-in sink for writing the PCM data to a wav file. This may seem simple but it doesn't actually work. Since there's no way to mix audio, this can only work for one user, otherwise you end up mashing audio from multiple users together. It worked great with just one person though!

Technical issues aside, this demonstrates the main concept for simple usages. However, the more I worked on this, the more I realized that trying to mirror the voice send api as much as possible just wouldn't work, because the send side is much simpler. There's also the need for controls deciding whose data to handle (and whose to ignore), how long to record for (besides explicitly stopping the whole reader), etc.

Basically, without mixing, my design is incomplete. I can't finish it until I have a reliable way to mix audio, since I can't design an api around something that doesn't exist. Supposedly Discord is going to push out an update to all voice servers in the future, which hopefully means either RTCP will reliably be sent or there will be some other synchronization mechanism. For now, I've decided that if I do go through with this design, I'll have to complete it without a way to mix audio, since there's just no good way to do it without RTCP packets for timing data.


Labels: RFC, feature request

Most helpful comment

I have updated the OP. Anyone vaguely interested in this feature should read the new content.

All 58 comments

My thoughts. As a starting point, yes, the API should provide separate PCM chunks for each member being listened to. If no filters are set, all members are listened to, including those joining the call after listening began.

To decode PCM properly, the library needs to put packets into order and identify lost packets. The packets have an incrementing sequence number that will be used. This all implies buffering and some error handling, such as filling in lost data (with opus) and simply discarding packets if they are received too late.
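To make that concrete, here is a tiny sketch (not library code) of such a reordering buffer: hold a few packets keyed by RTP sequence number, emit them in order, and drop anything that shows up after its slot has already passed. Sequence wraparound and loss concealment are ignored for brevity.

```py
import heapq

class ReorderBuffer:
    def __init__(self, depth=4):
        self.depth = depth        # packets of latency we are willing to add
        self.heap = []            # (sequence_number, packet) pairs
        self.last_seq = None      # sequence number of the last packet handed out

    def push(self, seq, packet):
        if self.last_seq is not None and seq <= self.last_seq:
            return []             # arrived too late; discard it
        heapq.heappush(self.heap, (seq, packet))
        ready = []
        while len(self.heap) > self.depth:
            self.last_seq, pkt = heapq.heappop(self.heap)
            ready.append(pkt)
        return ready              # packets now safe to hand downstream, in order
```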

Each audio chunk gets a number specifying its position (in milliseconds) in relation to when the listening session began. These chunks can then be fed into a mixer to produce a single stream if desired. For simplicity, all chunks should be assigned into perfect 20ms slots (e.g. 40 and 60, not 36 and 51).

No special timing information should be necessary. Record the time whenever someone starts speaking; every following chunk can be placed exactly 20ms after the previous one. There will be five silent packets to signify they've stopped speaking. In case those are not received there should also be a timeout.
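A sketch of that placement rule under the assumptions above (all names here are hypothetical): the first chunk after silence is anchored to the local receive time, snapped into a perfect 20ms slot, and every following chunk lands exactly 20ms later.

```py
import time

FRAME_MS = 20

class SpeakerTimeline:
    """Assigns each incoming 20ms chunk a position relative to session start."""

    def __init__(self):
        self.session_start = time.monotonic()  # when the listening session began
        self.next_pos = None                   # position (ms) of the next chunk

    def place(self, arrival_time):
        if self.next_pos is None:
            # First chunk after silence: anchor on the local receive time,
            # snapped into a perfect 20ms slot.
            elapsed_ms = (arrival_time - self.session_start) * 1000
            self.next_pos = round(elapsed_ms / FRAME_MS) * FRAME_MS
        pos = self.next_pos
        self.next_pos += FRAME_MS              # each following chunk is exactly 20ms later
        return pos

    def reset(self):
        # Call this after the five silent packets (or the timeout) mentioned above.
        self.next_pos = None
```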

If enough latency is added, the library should be able to give a reliable output under most conditions. With minimal latency, there will be more issues. There's no free lunch :)

I've already had a go at writing all of the decoding and processing for this. I spent a lot of time writing classes for packet types discord doesn't send. The timestamp data in RTP packets starts from a random offset, and that offset is different per ssrc. RTCP packets have the reference time for calculating when these packets were created. Relying on gateway websocket events to sync up voice socket UDP packets is racy and not reliable, especially considering the gateway code can be blocked by user code and speaking events have no id (timestamp).

Seems safe to bet that Discord makes no effort to synchronize speakers. Each stream is basically played "as soon as it's received" with the shortest buffer they think they can get away with.

If there's someone in the call from New Zealand and we're always making fun of him for laughing at jokes three seconds too late, I would expect the data coming from the Voice Receive API to reflect that. I would not expect it to "fix" the latency.

So I think you can track the local packet receive times and use some of those without shame. Again, most packets can actually be placed right after the previous one, ignoring the receive time. Never the first packet after silence tho, of course.

Going to assume that this attempt has stalled?

Any idea on if anyone will try and implement it?

Also, a good api would just pass in the raw PCM that comes out of the opus library, much like the mumble python api (another chat server that uses opus).

This should always be the same length unless discord is changing the opus config options on the fly.

I'm still working on it. The fact that this is an "unsupported" feature, my desire to make a sufficiently high level api for ease of use, and my sporadic motivation and inspiration all make for slow progress. It also seems that few people have useful input on this issue, meaning that I am basically on my own for the most part. If you want to see what I have so far, I keep my fork updated. There's still a lot to do and I haven't written docs or added an example yet, so make of it what you will. https://github.com/imayhaveborkedit/discord.py

Welp guess I am stuck on using my own server with mumble for my HamRadio Remote station.

I see this fork is still getting regular updates.

You've mentioned a use-case, where it's possible to use a WaveSink to get raw PCM/Wav data. As it happens this is exactly the kind of thing I need.

Basically, my use-case is simply getting raw PCM data and passing it along to do some basic speech recognition (fwiw, a very basic speech-to-text bot for a deaf friend). I assume no audio mixing means there is no way to actually tell which packet comes from which user? Regardless, such a situation doesn't impact me greatly, since, assuming the (in my case, 2 or 3) users speak in a somewhat orderly fashion, my output would simply be the transcript of what has been said, no matter who said it. The only other thing needed in this case, for me, would be a way to detect when a user has stopped speaking so I can pass along the file/buffer without words being cut out. Ideally, this could be done by storing the audio data in a buffer, but writing to a file and using that as input would work fairly well too.

Would it be possible to share a minimal piece of code that exemplifies the use-case you described, or provide any hints for the direction I should take in my implementation of this use-case?

Don't worry, that use case is probably one of the two major cases that I expect people to have. WaveSink is specifically for writing data to a wav file, the point being that the built in wave module takes care of writing all the headers and such. The data you get in the first place is already PCM, so unless whatever flow you have requires a file on the filesystem, you don't need that one.

When I mention "mixing" I'm referring to combining the various streams of user voice data into a single combined stream. This is a problem I haven't quite figured out how to do properly yet since discord doesn't seem to provide the required RTCP packets necessary to synchronize the streams. If I do come up with something and it ends up being too jank to be in the main lib I'm considering making some sort of ext.voice package on pypi. Anyways, these "streams" are per user (actually ssrc, which is just an id in the RTP spec, but are mapped to user ids), so the data you get will include a member object. The exact format of this I haven't decided on yet, so right now it's just sort of a payload blob object with the pcm, opus, and rtppacket object (mostly for my own convenience during development).

Delimiting speech segments is still something I don't quite know how to handle yet. I think this might be a problem I have to put onto the user since I don't see a good way to do it lib-side. In the example I'm writing for this I'm thinking about setting the pattern for doing so to use the priority speaking feature. Relying on speaking events and or arbitrary waiting periods does not sound reliable enough to use by default. Using priority speaking to indicate the segment of speech to be recognized/processed would be very convenient for both me since I don't need to do anything in the lib for it and for the user since being a PTT feature means if they mess it up it's their fault.

Unfortunately, having speech recognition in an example is a bit out of scope for the lib examples, but I plan on having an additional example in a gist that demonstrates this, most likely using the SpeechRecognition module. In your case, until I design out how it would work with the various services in the aforementioned library (which might end up being the same anyway), it would probably be: wait for priority speaking from some member, collect their voice data in your container of choice (in memory or on the filesystem), and process it once priority speaking ends.
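As a rough illustration of that flow, outside the library: collect a member's PCM into a buffer while they speak, then hand the finished segment to the SpeechRecognition package. The 48kHz/16-bit/stereo format assumed here matches what discord voice normally uses; everything else is just one possible way to wire it up.

```py
import audioop
import speech_recognition as sr

recognizer = sr.Recognizer()

def transcribe_segment(pcm_bytes):
    # Downmix 48kHz 16-bit stereo PCM to mono before recognition.
    mono = audioop.tomono(pcm_bytes, 2, 1, 1)
    audio = sr.AudioData(mono, 48000, 2)  # (raw data, sample rate, sample width)
    try:
        return recognizer.recognize_google(audio)  # or any backend the package supports
    except sr.UnknownValueError:
        return None  # nothing intelligible in this segment
```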

I have updated the OP. Anyone vaguely interested in this feature should read the new content.

Got around to writing a short example this weekend. It's written in a rush, it's probably not how discord.py is meant to be used, but it works.

As mentioned in the OP, I don't know if the mixing works for this example. I've used a UserFilter, but I haven't yet tested with more users, to see if it actually filters. I managed to do speech-to-text by just coarsely segmenting data into 5 second chunks, since I don't think anyone speaks in long-winded sentences. Or at least, I don't. The accuracy is determined by the speech recognition service, but in general, it's pretty decent. A downside to this segmenting approach is that it only processes/posts the resulting text every 5 seconds. For speech to text, it's not ideal, but this could work reasonably well for some sort of voice commands. Another issue is that, sometimes, the 5 second segment might get only part of your sentence, but this is somewhat mitigated by stripping leading zeros in the buffer.

I faintly remember someone mentioning silence is marked by 5 chunks of 0x00, so I've been trying to implement a way to delimit speech (or rather, words) by looking for these chunks, but I haven't found a reliable way to do this yet. I've been looking over raw bytes output, seeing if this theory holds up, and it seems like it might, but I'd probably have to apply some sort of regex to make sure there really aren't any chunks when speaking occurs.
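A crude sketch of that delimiting idea: treat a run of consecutive all-zero 20ms frames as the end of an utterance. The frame size and the threshold of five silent frames come from the comments above; this is untested guesswork, not library behaviour.

```py
FRAME_BYTES = 3840      # expected size of 20ms of 48kHz 16-bit stereo PCM
SILENCE_FRAMES = 5      # per the comments above

def split_on_silence(frames):
    segments, current, quiet = [], [], 0
    for frame in frames:
        if frame == b'\x00' * len(frame):      # an all-zero frame
            quiet += 1
            if quiet >= SILENCE_FRAMES and current:
                segments.append(b''.join(current))
                current = []
        else:
            quiet = 0
            current.append(frame)
    if current:
        segments.append(b''.join(current))
    return segments
```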

FWIW, here's the gist: https://gist.github.com/Apfelin/c9cbb7988a9d8e55d77b06473b72dd57

Looks great, but I keep getting an error on line 12

This is not yet implemented in the main library. @Apfelin was using imayhaveborkedit's fork, which does have discord.reader.AudioSink

@imayhaveborkedit Further to your proposed API, I have implemented a listener bot (that saves stuff to disk) in discord.js (I'm hoping to move it to Python ultimately). This leaves things such as mixing and figuring out when to join streams up to the user - which I think is the right thing to do as everyone will want something different. (Providing a separate lib for the common use cases makes sense, such as mixing users - I think it should be out of scope of this.)

The implementation in JS involves listening for the "speaking" event, then binding a receiver/sink (stream) to it to accept the data. When binding the receiver you can select the mode (eg: PCM, or Opus. Wav also makes sense since Python has that built in). Then every chunk is written to the receiver as it comes in (the actual implementation for out of order packets etc is unclear to me).

When the user stops talking/they release PTT, the end event is triggered, and the stream is closed.

I would argue that, for the moment, the filters per user etc. are not required; rather, in the on_speaking event, the user can decide if they want to save this stream or not (the event is given the member details), and they could return the stream to write to (or call a method on a passed object). If no stream is returned, no action is taken (the overhead of this would be minimal compared to everything else that has to happen to stream data). Again, some common classes could be provided to simplify the process (eg: a stream-to-file class).
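To make the proposal above concrete, it might look something like the sketch below. Every name here is hypothetical; neither the library nor the fork exposes this event.

```py
@bot.event
async def on_speaking_start(member):
    # Decide per member whether to capture; returning None ignores the stream.
    if member.bot:
        return None
    # Hypothetical: the returned sink would be closed when the member stops speaking.
    return discord.WaveSink(f'{member.id}.wav')
```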

I know I'm coming late to the party here and may have missed some stuff (I have not yet used this, but trying to figure it out), but hopefully that makes some sense.

I'm sure you have see this, but just in case, this is the VoiceReceiver API for discord.js: https://discord.js.org/#/docs/main/master/class/VoiceReceiver (not much there), and an example of the Voice API in use (slightly out of date, but gives you a feel for it as the discord.js docs are not great): https://gist.github.com/eslachance/fb70fc036183b7974d3b9191601846ba

I have updated the OP again with new info about stopping sinks. (I guess I didn't...?)

@sillyfrog Sorry for not responding until now. The whole mixing thing, if I do figure it out, will still be entirely optional. It would exist as a batteries-included utility. If a user wants to handle the data differently, of course they can go about it their own way. The problem is that this is not an easy thing to do, even less so doing it correctly. I honestly don't expect many people to be able to come up with a decent solution to this that involves mixing the data live. I still believe that getting the combined audio of everyone speaking in a channel is a common and valid use case, and as such a utility for doing so should be included in the lib.

The d.js example vaguely follows the concept I had in mind for this, but I would design a somewhat higher level interface for it. Perhaps with a context manager. Or maybe I won't, and will just do it "manually" in the example to set the precedent. Or maybe just leave it as an exercise to the user.

I've used d.js in the past and initially found it annoying to have to deal with individual user audio streams, so +1 to a simple vc.listen(AudioSink()) function.

What's currently blocking your progress on this? I'm trying to bring my mental model of the current problems up to speed so that I can hopefully contribute.

re: Sinks and filters

Source and Sink are basically IO streams, with a filter being an in-memory sink (i.e. one that doesn't write out to a file). Composing filters could be done with, essentially, a list or a linked list, where a call to write() at the head propagates the data down the chain recursively. Example usage:

```py
wavSink = WavSink(filename='output_file.wav')
volumeFilter = VolumeFilter()
someOtherFilter1 = SomeFilter1()
someOtherFilter2 = SomeFilter2()

composedFilter = volumeFilter.compose([someOtherFilter1, someOtherFilter2])
# or maybe
# composedFilter = BlankFilter([volumeFilter, someOtherFilter1, someOtherFilter2])
wavSink.filter = composedFilter

vc.listen(wavSink)
```

Writes would propagate like so:

```
wavSink.write(data)
  volumeFilter.write(data)
    someOtherFilter1.write(data)
      someOtherFilter2.write(data)
-----------------------------------------------
> someOtherFilter2 modifies data and returns it
> someOtherFilter1 modifies data and returns it
> volumeFilter modifies data and returns it
> wavSink writes data to file
```

To be honest, this is horrifying. Voice is already threaded. You don't need a thread for state. You don't need two different instances of the sink. This is not how this class is used. It's clear that I need extensive examples to try to prevent people from writing code like this.

For reference, this is the typical usage pattern:

```py
vc.listen(discord.UserFilter(discord.WaveSink('file.wav'), some_member))
...
vc.stop_listening()
```

That's it. Note that the Filter objects are probably going to be changed at some point since I don't like the design very much in its current state.

Is it possible to send audio from system microphone with this?

And can i send audio from vc.listen() to system speakers in real time?

@brownbananas95 Sending audio is already implemented. I think you can, but I'm unsure.

@brownbananas95 sending (mike->discord) works without issue. Receiving (Discord->speakers) is a bit more complex, as you currently receive each user as a separate stream. You can receive these, but must mix them prior to sending to the computer's speakers.
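As a rough illustration of that last step, two equal-length 16-bit PCM frames can be summed with the standard library before playback; real mixing also needs frame alignment and limiting, which this sketch ignores.

```py
import audioop

def mix_frames(frame_a, frame_b):
    # Sum two 16-bit PCM frames of equal length; audioop clamps on overflow.
    return audioop.add(frame_a, frame_b, 2)
```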

@gerth2 do you have a code example for sending mike->discord?

Sure thing - I yanked the meaningful guts from a project I have in-flight at the moment using this Voice Receive fork of discord.py:

https://gist.github.com/gerth2/8ee0c918606b4c501759a9c333393398

Let me know if you run into issues, I can zip up a latest copy of what we're working on to send to you.

Exactly what I needed! Thank you!

FWIW, yesterday, I got a crude but (apparently?) functional PCM audio mixing strategy sorted out, outside of this library, but using its APIs. Working on open-sourcing the code some time this week (still has hardcoded private API keys in it, need to fix).

Since this is an RFC, my comment: I like the API as provided - the alignment with the existing read side feels nice and fuzzy. Any issues with internals aside, it looks nice and seems to work fine from the outside.

> Let me know if you run into issues, I can zip up a latest copy of what we're working on to send to you.

What is your contact information?

> FWIW, yesterday, I got a crude but (apparently?) functional PCM audio mixing strategy sorted out, outside of this library, but using its APIs. Working on open-sourcing the code some time this week (still has hardcoded private API keys in it, need to fix).
>
> Since this is an RFC, my comment: I like the API as provided - the alignment with the existing read side feels nice and fuzzy. Any issues with internals aside, it looks nice and seems to work fine from the outside.

This is interesting. Perhaps this would help complete the Voice Receive fork by imayhaveborkedit, and potentially get the PR approved in discord.py/master :-)

@brownbananas95 code is here

Is this still being considered as a feature or will it only exist in forks?

What is the status of this RFC? Any timeframe as to when this will be ready to be integrated into master, considering the fork, which seems to have this working?

Maybe writing a function that mutes all other users except the one you specifically want to listen to, from the perspective of the bot, would work? Since discord takes it all as one stream? There's no point listening to multiple people at the same time unless you want to record it, or create a gateway to a ps party for example, surely?
