Hoping you can give a bit of guidance on how you expected us to use the recordings. With a bit of digging, it seems I can use ffmpeg to piece together the audio files with timing offsets, but in the console the timestamps appear to have only 1-second resolution. This could leave an offset of up to almost a full second if one person joined early in their one-second interval and the other joined late in theirs. With multiple participants in a fast-paced conversation, couldn't this leave annoying overlaps, or am I reading too much into this?
The end goal is a single audio file with all of the participants' audio in sync, so the end user can listen to the entire conversation regardless of when the participants joined.
@acoolprogrammer123 - Take a look at this gist which shows an example of how to mix recorded tracks and place the tracks in order. https://gist.github.com/ktoraskartwilio/8745fb5066977808543a894c7d1e6667
Beautifully put together!
Just to clarify my timing question: while the creation time of the room is only in seconds resolution, ffprobe gives milliseconds resolution on the actual start time, so I take it the two audio files will not be in danger of being misaligned by up to a second when one starts slightly earlier within the second.
I guess a second example might be in order to illustrate:
Alice starts a room and joins it.
20 seconds later Bob joins. (Your offset code accounts for this)
Alice: "I will ask you colors and you respond with Yes or No if you like them"
Alice: "Red"
Bob: "Yes"
Alice: "Blue"
Bob: "No"
Alice: "Green"
Bob: "Yes"
If each of those Q&As is fired off in rapid succession, with each voice taking half a second or less, could Alice's "Blue" overlap Bob's "No" (or worse yet, could Bob's "No" come in after the "Blue")? For example, suppose Alice's true join time was 1.0001 seconds and Bob's true join time was 20.9999 seconds, and the full-second resolution rounds those down to 1 and 20, losing the fact that in reality Bob was almost a full second later than Alice.
If ffprobe's millisecond resolution accounts for this, great, and if not, any suggestions?
Sorry for being so picky, but I am looking at a potential use case of two highly trained users going through rapid fire questions multiple times per day with a third highly trained user validating results from the recording at a later point, and sync delays of even 500 milliseconds might drive them crazy with the overlap.
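To make the Alice and Bob numbers concrete, here is a minimal shell sketch of how per-track delays could be derived with millisecond resolution. It assumes, as suggested above but not confirmed in this thread, that the `start_time` ffprobe reports for each recording reflects the true offset. `track_start` and `delay_ms` are hypothetical helper names, not part of the gist.

```shell
# Hypothetical helper: print a recording's container start time in seconds
# (with a fractional part) as reported by ffprobe.
track_start() {
  ffprobe -v error -show_entries format=start_time \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Hypothetical helper: given the earliest start time and a track's start time
# (both in seconds), print the track's delay in whole milliseconds, rounded
# to the nearest millisecond.
delay_ms() {
  awk -v first="$1" -v start="$2" \
    'BEGIN { printf "%d\n", (start - first) * 1000 + 0.5 }'
}

# Example with the numbers from the scenario above (no ffprobe needed):
#   delay_ms 1.0001 20.9999   # true join times
#   delay_ms 1 20             # the same times truncated to whole seconds
```

With the true join times the delay comes out just under a millisecond short of 20 seconds, whereas the truncated times give 19 seconds, which is the near-one-second error the question is worried about.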
@ktoraskartwilio We have spent a good deal of time working on stitching videos back together. High level point: this is really something Twilio should be doing for customers, we are finding it a royal pain to get a usable video out of pieces. We started with the gist, but ran into a number of things. Some of this may help you, or someone else, or you can help us with some of our difficulties.
We ran into problems processing some video recordings. They change resolution partway through, and this seemed to cause problems in later processing. For instance, combined videos would freeze at the point where the resolution changed. So we re-encoded them first.
We have more than 2 video files to combine. Only two participants, but each time there was a minor disconnection for the user, a new video file got created. So we might have three or four video files. We don't know which video belongs to each person so we wrote a script that looked at the length of the videos and their start times and ping ponged the video between each half of the screen. Ugh, but it seems to work.
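For anyone attempting something similar, the following is one plausible way to do that assignment, not the script described above. It is a greedy sketch that assumes each recording is described by a line of `name start duration` (seconds), and that a segment can reuse a half of the screen once the previous segment on that half has ended. `assign_sides` is a hypothetical name.

```shell
# Hypothetical helper: read "name start duration" lines on stdin and print
# "name side" for each, where side is left, right, or overlap.
# Segments are processed in order of start time; each one takes the first
# half of the screen that is already free when it starts.
assign_sides() {
  sort -k2,2n | awk '
    BEGIN { left_end = 0; right_end = 0 }
    {
      start = $2 + 0
      end = start + $3
      if (start >= left_end)       { side = "left";  left_end = end }
      else if (start >= right_end) { side = "right"; right_end = end }
      else                         { side = "overlap" }
      print $1, side
    }'
}

# Example: "a" runs the whole call, "b" drops after 50 s and returns as "c".
#   printf 'a 0 100\nb 2 50\nc 60 30\n' | assign_sides
```

A segment marked `overlap` means more than two recordings were live at once, which this two-pane layout cannot place and which would need to be handled separately.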
Getting the audio to sync with the video is easier said than done. Just the audio delay and video offset isn't quite enough. But adding aresample=async=1 to the audio filter chain seemed to fix it. This took a long time to sort out.
There were various other params we added to get decent encoding speed and quality, but those are optional or can be tweaked.
A sample script we are generating is below.
ffmpeg -y -i /tmp/RM359a097bf459c327e34954525862d2b7/RT4d34aa2bae28a73f8079a6e1b65b3aa4 -c:v libvpx -crf 23 -b:v 1M -cpu-used 3 -threads 8 /tmp/RM359a097bf459c327e34954525862d2b7/RT4d34aa2bae28a73f8079a6e1b65b3aa4-reprocessed.mkv && \
ffmpeg -y -i /tmp/RM359a097bf459c327e34954525862d2b7/RT21a59c19d91a7644b8c76d3fd12c3b32 -c:v libvpx -crf 23 -b:v 1M -cpu-used 3 -threads 8 /tmp/RM359a097bf459c327e34954525862d2b7/RT21a59c19d91a7644b8c76d3fd12c3b32-reprocessed.mkv && \
ffmpeg -y -i /tmp/RM359a097bf459c327e34954525862d2b7/RTd9075ac090a32a21e71269e6b08f4053 -c:v libvpx -crf 23 -b:v 1M -cpu-used 3 -threads 8 /tmp/RM359a097bf459c327e34954525862d2b7/RTd9075ac090a32a21e71269e6b08f4053-reprocessed.mkv && \
ffmpeg -y -c:a libopus -i /tmp/RM359a097bf459c327e34954525862d2b7/RT2568dcb62e48c44c4ac61094c4377553 -c:a libopus -i /tmp/RM359a097bf459c327e34954525862d2b7/RTb5d9c3c0a59e98061159ddb310587b71 -c:a libopus -i /tmp/RM359a097bf459c327e34954525862d2b7/RT955f8ed764b5fda55a4dcd23cfed673d -filter_complex "[0]adelay=25|25[t0];[t0]aresample=async=1[t0r];[1]adelay=2363|2363[t1];[t1]aresample=async=1[t1r];[2]adelay=983525|983525[t2];[t2]aresample=async=1[t2r];[t0r][t1r][t2r]amix=inputs=3" -c:a libopus /tmp/RM359a097bf459c327e34954525862d2b7/output_audio.mka && \
ffmpeg -y -itsoffset 0 -i /tmp/RM359a097bf459c327e34954525862d2b7/RTd9075ac090a32a21e71269e6b08f4053-reprocessed.mkv -itsoffset 2.364 -i /tmp/RM359a097bf459c327e34954525862d2b7/RT21a59c19d91a7644b8c76d3fd12c3b32-reprocessed.mkv -itsoffset 983.503 -i /tmp/RM359a097bf459c327e34954525862d2b7/RT4d34aa2bae28a73f8079a6e1b65b3aa4-reprocessed.mkv -filter_complex "[0]pad=iw*2:ih[t0];[t0][1]overlay=W/2:0[t1];[t1][2]overlay=0:0[t2]" -map [t2] -an -c:v libvpx -crf 23 -b:v 1M -cpu-used 3 -threads 8 /tmp/RM359a097bf459c327e34954525862d2b7/output_video.webm && \
ffmpeg -y -i /tmp/RM359a097bf459c327e34954525862d2b7/output_audio.mka -i /tmp/RM359a097bf459c327e34954525862d2b7/output_video.webm -c:v copy -c:a copy /tmp/RM359a097bf459c327e34954525862d2b7/full.webm
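The audio filter graph in the script above follows a regular pattern (delay each input, resample with `async=1`, then mix), so it can be generated for any number of tracks. The sketch below assumes the delays are already known in milliseconds; `audio_filter` is a hypothetical helper name.

```shell
# Hypothetical helper: build an ffmpeg -filter_complex string that delays
# each audio input, applies aresample=async=1, and mixes all inputs.
# Arguments: one delay in milliseconds per input, in input order.
# The body runs in a subshell so its variables do not leak to the caller.
audio_filter() (
  graph=""
  mix=""
  i=0
  for d in "$@"; do
    graph="${graph}[${i}]adelay=${d}|${d}[t${i}];[t${i}]aresample=async=1[t${i}r];"
    mix="${mix}[t${i}r]"
    i=$((i + 1))
  done
  printf '%s%samix=inputs=%d\n' "$graph" "$mix" "$i"
)

# Example: reproduces the filter used for the three audio tracks above.
#   audio_filter 25 2363 983525
```

The result can be passed straight to ffmpeg, for example `-filter_complex "$(audio_filter 25 2363 983525)"`.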
To add to the wish list, when paying $0.30 per GB recorded, the size difference between audio and video is quite large when all we want is the audio. Why am I paying to record the video when I am going to immediately delete it without reviewing it?
Would be really wonderful if the RecordParticipantsOnConnect had an option to do audio only.
@ktoraskartwilio Is there a way to get a single mixed recorded file for the group room? I need to give access to the recordings of the group chat room to my clients right after the session is completed.
Hi @mcorner, @hverma, @acoolprogrammer123,
Please take a look at the Recordings Compositions API and let me know if this helps your use case.
Thanks,
Manjesh Malavalli
JSDK Team