It seems like PeerTube handles video transcoding quite inefficiently right now. For example, if someone uploads a 1080p video and all transcodings are enabled, PeerTube will perform the following transcodings, in order:
1080p -> 240p
1080p -> 360p
1080p -> 480p
1080p -> 720p
That means ffmpeg has to decode the full 1080p source for every single transcoding. I feel like it would be faster if PeerTube performed the transcodings in this order:
1080p -> 720p
720p -> 480p
480p -> 360p
360p -> 240p
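The cascaded order above could be sketched roughly like this (a sketch only; the filenames, scale filters and codec flags are my assumptions, not PeerTube's actual command lines):

```shell
#!/bin/sh
# Cascaded transcoding sketch: each step decodes the previous, smaller
# output instead of the full 1080p source. Illustrative flags only.
ffmpeg -y -i source-1080p.mp4 -vf scale=-2:720 -c:v libx264 -c:a copy out-720p.mp4
ffmpeg -y -i out-720p.mp4     -vf scale=-2:480 -c:v libx264 -c:a copy out-480p.mp4
ffmpeg -y -i out-480p.mp4     -vf scale=-2:360 -c:v libx264 -c:a copy out-360p.mp4
ffmpeg -y -i out-360p.mp4     -vf scale=-2:240 -c:v libx264 -c:a copy out-240p.mp4
```

`scale=-2:720` keeps the aspect ratio while forcing an even width, which libx264 requires.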
I haven't tested this, but it seems to make sense, since ffmpeg has to handle less data for each conversion (and I doubt there would be a noticeable quality difference). Even better, ffmpeg has an option for creating multiple outputs from a single input in the same process:
https://trac.ffmpeg.org/wiki/Creating%20multiple%20outputs
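For illustration, a single-process variant along the lines of that wiki page might look like this (a sketch; the split, filenames and encoder flags are assumptions, not PeerTube's actual setup):

```shell
#!/bin/sh
# One decode of the 1080p source feeding four parallel encodes in a
# single ffmpeg process. Filenames and encoder flags are illustrative.
ffmpeg -y -i source-1080p.mp4 \
  -filter_complex \
    "[0:v]split=4[a][b][c][d];[a]scale=-2:720[v720];[b]scale=-2:480[v480];[c]scale=-2:360[v360];[d]scale=-2:240[v240]" \
  -map "[v720]" -map 0:a? -c:v libx264 -c:a copy out-720p.mp4 \
  -map "[v480]" -map 0:a? -c:v libx264 -c:a copy out-480p.mp4 \
  -map "[v360]" -map 0:a? -c:v libx264 -c:a copy out-360p.mp4 \
  -map "[v240]" -map 0:a? -c:v libx264 -c:a copy out-240p.mp4
```

Each `-map` selects a filtergraph output (`0:a?` maps the audio stream if one exists), and the options following it apply to the next output file.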
I can make a benchmark for this later if you wish.
I just ran a small benchmark on this, using the scripts here, with ffmpeg 4.0:
$ time ./test-ffmpeg-1.sh
292.43user 1.29system 1:58.55elapsed 247%CPU (0avgtext+0avgdata 267664maxresident)k
0inputs+83968outputs (0major+118149minor)pagefaults 0swaps
$ time ./test-ffmpeg-2.sh
236.03user 1.01system 1:45.28elapsed 225%CPU (0avgtext+0avgdata 267752maxresident)k
0inputs+90712outputs (0major+86538minor)pagefaults 0swaps
$ time ./test-ffmpeg-3.sh
264.97user 0.76system 1:25.68elapsed 310%CPU (0avgtext+0avgdata 431644maxresident)k
0inputs+107600outputs (0major+69703minor)pagefaults 0swaps
It seems that in case 3 the -threads parameter is applied per output, so the CPU usage is much higher than in cases 1 and 2. The comparison might therefore not be completely fair, but I'm not sure how to fix that.
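As far as I understand, -threads is an output option in ffmpeg, so in a multiple-output invocation it applies to whichever output file follows it. A sketch (filenames and flags are illustrative, not the benchmark scripts themselves):

```shell
#!/bin/sh
# -threads placed before an output file applies to that output's encoder,
# so two outputs at -threads 4 can use up to 8 encoder threads in total,
# unlike a single-output run with -threads 4. Illustrative flags only.
ffmpeg -y -i source.mp4 \
  -vf scale=-2:480 -threads 4 -c:v libx264 out-480p.mp4 \
  -vf scale=-2:360 -threads 4 -c:v libx264 out-360p.mp4
```

That per-output thread pool would explain the higher CPU usage observed in case 3.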
Edit: I just noticed that under some circumstances, PeerTube runs another transcoding job when a video is imported (to convert it to the right video format?). That job could probably also be folded into option 3. Then we'd only have a single transcoding job, whereas now we have up to 6 in the worst case, so this could really save a lot of CPU. And CPU usage is the main bottleneck on my server right now.
Just for reference, implementing multiple-output transcoding would involve using fluent-ffmpeg's multiple-outputs support.
@Nutomic I don't know much about multi-threaded decoding or encoding, but as for the CPU usage being high in case 3, I think it's rather that cases 1 and 2 are hitting other limits: CPU usage could be low if there is some other bottleneck, e.g. disk IO. With independent ffmpeg instances there's four times more IO. Maybe that's the bottleneck, which multiple outputs alleviate by using only one instance of ffmpeg (but not of the encoders/decoders).
@rigelk For tests 1 and 2, I didn't run the transcodings in parallel but one after the other, so disk usage should be about the same. And I was testing on an SSD with a rather slow CPU, so it's probably not limited by IO. That said, I didn't check disk IO during the test, so I can't be sure.
Re: the number of threads, the current settings (maxing out at 8) are a bit low for running this on a modern i7 CPU. I'm experimenting with larger numbers on my instance.
I'm no x264 expert and I should find a better reference for this than Super User, but this claims the default (-threads 0) is currently equivalent to 1.5x the number of cores with x264:
https://superuser.com/questions/155305/how-many-threads-does-ffmpeg-use-by-default
Even with 12 threads I'm seeing idle CPU time on this box.
My machine specs:
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
4 physical cores, 8 logical cores.
The maximum number of threads selectable in the UI (8) leaves the server about half idle. By editing the form client-side I can set the value to anything (aside from zero, which would actually be convenient). I'm trying 12 now, but there's still about 10% idle time; actual CPU usage seems to vary with the specifics of the codec's thread usage. I'll try 16 threads next, but from what others have said it may be more effective to run multiple transcodes simultaneously, each with fewer threads.
With 12 threads I'm seeing 20-50% idle time on the 480p transcodes. I'm fairly sure it's not an I/O issue, as the storage is fast and there's enough RAM to hold the whole source file. I think it's just a limit on how many cores ffmpeg can use in any particular transcode pipeline. I'll open a separate issue about changing the number of concurrent ffmpeg processes.
It might be due to the underlying encoders, which can only run a limited number of parallel threads.
Seems pretty weird to me. time(decode+scale) << time(encode), so you'd be going through four lossy stages to save very minimal CPU overhead here.
In case we are really talking about some very obscure input format that is hard to decode, maybe do the Source -> 1080p version first, and then base all subsequent encodes on the 1080p version.
Wait, doesn't this lead to lower quality videos? If you transcode sequentially, you're losing quality from the intermediates. You want to source your transcode from the best version each time.
@rlaager the quality loss is real but marginal, while the overall computational cost is lowered. The latter is our biggest constraint in a self-hosted environment.