Javacpp-presets: ffmpeg returns wrong timestamp on aac decoding

Created on 8 Nov 2020 · 43 Comments · Source: bytedeco/javacpp-presets

I want to decode audio from an .mp4 file recorded with GeForce Experience.
I would like to write a program that reads a specified number of samples starting at a specified sample position.
However, files recorded with GeForce Experience seem to be a bit unusual: their timestamps are shifted.
When I read the timestamp with AVFrame#best_effort_timestamp, the value looks wrong.
For files recorded with OBS (H264, aac, .mp4), the timestamp increases by 1024 for each frame decoded.
But for GeForce Experience files, the timestamp increases by about 1000-1050.
In this case, the value of AVFrame#nb_samples is always 1024.
The code I used is almost identical to https://github.com/bytedeco/javacpp-presets/issues/942.
I used JavaCV and got the same result.
Is there a way to get an accurate timestamp?

ffmpeg's log

  libavutil      56. 42.102 / 56. 42.102
  libavcodec     58. 80.100 / 58. 80.100
  libavformat    58. 42.100 / 58. 42.100
  libavdevice    58.  9.103 / 58.  9.103
  libavfilter     7. 77.101 /  7. 77.101
  libswscale      5.  6.101 /  5.  6.101
  libswresample   3.  6.100 /  3.  6.100
  libpostproc    55.  6.100 / 55.  6.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'D:\動画\Runtime\Runtime 2019.11.10 - 18.47.48.08.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    date            : 2019
    encoder         : Lavf58.42.100
  Duration: 02:09:47.66, start: 0.000000, bitrate: 5212 kb/s
    Stream #0:0(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 1280x720 [SAR 1:1 DAR 16:9], 5069 kb/s, 60 fps, 60 tbr, 15360 tbn, 120 tbc (default)
    Metadata:
      handler_name    : VideoHandle
    Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 128 kb/s (default)
    Metadata:
      handler_name    : SoundHandle

The code I tested on JavaCV

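import java.io.IOException;
import java.io.OutputStream;
import java.nio.ShortBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.bytedeco.javacv.FFmpegFrameGrabber;
import org.bytedeco.javacv.Frame;

public class Main {
    public static void main(String[] args) throws IOException {
        FFmpegFrameGrabber grabber = new FFmpegFrameGrabber("D:\\動画\\Runtime\\Runtime 2019.11.10 - 18.47.48.08.mp4");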
        grabber.start();
        System.out.println("set");
        grabber.setTimestamp(0);
        System.out.println(grabber.getTimestamp());
        OutputStream stream = Files.newOutputStream(Paths.get("out.pcm"));
        for(int l = 0;l < 100;l++) {
            Frame frame = grabber.grabSamples();
            ShortBuffer buffer = (ShortBuffer) frame.samples[0];
            System.out.println(buffer.remaining());
            System.out.println(grabber.getTimestamp());
            for(int i = 0;i < buffer.remaining();i++) {
                byte[] data = new byte[2];
                short sh = buffer.get(i);
                data[0] = (byte) (sh & 0xFF);
                data[1] = (byte) (sh >> 8 & 0xFF);
                stream.write(data);
            }
        }
        stream.close();
        System.out.println(grabber.getTimestamp());
        grabber.close();
    }
}

output

set
start time0
0
2048
39104
2048
60187
2048
82062
2048
103020
2048
124145
2048
145916
2048
166791
2048
188479
2048
209687
...

help wanted question

kusaanko

All 43 comments

I converted the video file with the following command, which resets the PTS of the audio frames.

ffmpeg -i "D:\動画\Runtime\Runtime 2019.11.10 - 18.47.48.08.mp4" -acodec aac -ab 120k -af asetpts=N/SR/TB "Runtime 2019.11.10 - 18.47.48.08custom.mp4"

Then AVFrame#best_effort_timestamp returned a normal value!
However, ffplay can play and seek the original file successfully.
I'm hoping to get the correct PTS without conversion.

kusaanko on 8 Nov 2020

@anotherche Any ideas?

saudet on 9 Nov 2020

Could it be that GeForce Experience encodes audio with a variable frame duration? I don't know whether this is possible. But if it is permissible, why would players fail to play or seek correctly?
However, I don't actually understand the question. For example, what does "the timestamp increases by about 1000-1050" mean? These values are too small for the duration of audio frames, which are usually about 20 times longer. Maybe this is not about the increase in the timestamp but about the size of the buffers? But then it seems quite acceptable for frames of the same duration to be encoded with different amounts of data. That is variable bit rate, isn't it?

anotherche on 9 Nov 2020

Sorry. I'm not good at English.
I use AVFrame#best_effort_timestamp to get the start position of the samples in an AVFrame; AVFrame#pts gives the same result.
For example, with a file recorded in OBS I get 1024 samples per frame, and for every frame I advance, the timestamp returned by AVFrame#best_effort_timestamp increases by 1024.
However, with a file recorded with GeForce Experience I still get 1024 samples per frame, but the timestamp returned by AVFrame#best_effort_timestamp increases by 1000-1050.
In fact, I don't have to use AVFrame#best_effort_timestamp at all, as long as I can get the start position of the frame's samples.

kusaanko on 9 Nov 2020

How can the timestamp increase by ~1000 from frame to frame? An increase of 1000 in the timestamp corresponds to 1 ms of sound, which is too short for a typical audio frame.
In your example output:
2048
39104
2048
60187
2048
82062
2048
103020
2048
124145
2048
145916
2048
166791
2048
188479
2048
209687
...
the timestamp increase is about 21000 (21 ms), which is quite normal. So what do you mean by the 1000-1050 increase in timestamp?

anotherche on 9 Nov 2020

Ahhh, I'm sorry.
That output is in microseconds, as reported by JavaCV.
Here are the raw values.

OBS file

nb_samples1024
best_effort_timestamp0
nb_samples1024
best_effort_timestamp1024
nb_samples1024
best_effort_timestamp2048
nb_samples1024
best_effort_timestamp3072
nb_samples1024
best_effort_timestamp4096
nb_samples1024
best_effort_timestamp5120
nb_samples1024
best_effort_timestamp6144
nb_samples1024
best_effort_timestamp7168
nb_samples1024
best_effort_timestamp8192
nb_samples1024
best_effort_timestamp9216
nb_samples1024
best_effort_timestamp10240
...

GeForce Experience file

nb_samples1024
best_effort_timestamp0
nb_samples1024
best_effort_timestamp1877
nb_samples1024
best_effort_timestamp2889
nb_samples1024
best_effort_timestamp3939
nb_samples1024
best_effort_timestamp4945
nb_samples1024
best_effort_timestamp5959
nb_samples1024
best_effort_timestamp7004
nb_samples1024
best_effort_timestamp8006
nb_samples1024
best_effort_timestamp9047
nb_samples1024
best_effort_timestamp10065
nb_samples1024
best_effort_timestamp11093
nb_samples1024
best_effort_timestamp12121
nb_samples1024
best_effort_timestamp13144
...
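
The two listings of the GeForce Experience file use different units: JavaCV reports microseconds, while best_effort_timestamp is counted in the stream time base. A minimal sketch of the conversion in plain Java (no FFmpeg dependency), assuming the 1/48000 time base reported later in this thread; the class and method names are made up for illustration.

```java
// Illustrative helper (hypothetical name): converts a PTS expressed in
// stream time-base units (num/den seconds per tick) to microseconds.
final class PtsToMicros {
    static long ptsToMicros(long pts, int num, int den) {
        // Integer arithmetic truncates the fractional microsecond.
        return 1_000_000L * pts * num / den;
    }

    public static void main(String[] args) {
        long[] rawPts = {0, 1877, 2889, 3939, 4945};
        for (long pts : rawPts) {
            System.out.println(pts + " -> " + ptsToMicros(pts, 1, 48000) + " us");
        }
    }
}
```

With these inputs the raw values 1877, 2889 and 3939 map to 39104, 60187 and 82062 microseconds, the same numbers as in the JavaCV output above.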

kusaanko on 9 Nov 2020

Then what is the code for this output?

anotherche on 9 Nov 2020

Here's the minimum code

// Assumes static imports from org.bytedeco.ffmpeg.global.avformat, avcodec and avutil.
Path file = Paths.get("input.mp4"); // placeholder path; point this at the recording
AVFormatContext format_context = avformat_alloc_context();
AVFrame frame = av_frame_alloc();     // reused for every decoded frame
AVPacket packet = av_packet_alloc();  // reused for every demuxed packet
avformat_open_input(format_context, file.toAbsolutePath().toString(), null, null);
avformat_find_stream_info(format_context, (AVDictionary) null);
// Pick the first audio stream in the container.
AVStream audio_stream = null;
for (int i = 0; i < format_context.nb_streams(); ++i) {
    if (format_context.streams(i).codecpar().codec_type() == AVMEDIA_TYPE_AUDIO) {
        audio_stream = format_context.streams(i);
        break;
    }
}
// Open a decoder configured from the stream parameters.
AVCodec codec = avcodec_find_decoder(audio_stream.codecpar().codec_id());
AVCodecContext codec_context = avcodec_alloc_context3(codec);
avcodec_parameters_to_context(codec_context, audio_stream.codecpar());
avcodec_open2(codec_context, codec, (AVDictionary) null);

while (av_read_frame(format_context, packet) == 0) {
    if (packet.stream_index() == audio_stream.index()) {
        if (avcodec_send_packet(codec_context, packet) != 0) {
            throw new IllegalArgumentException("avcodec_send_packet failed\n");
        }
        av_frame_unref(frame);
        if (avcodec_receive_frame(codec_context, frame) == 0) {
            System.out.println("nb_samples" + frame.nb_samples());
            System.out.println("best_effort_timestamp" + frame.best_effort_timestamp());
        }
    }
    av_packet_unref(packet);
}

kusaanko on 9 Nov 2020

Is there a difference in output with
av_frame_get_best_effort_timestamp(frame) instead of frame.best_effort_timestamp()?

anotherche on 9 Nov 2020

No. The result is the same.

kusaanko on 9 Nov 2020

OK, then I guess that this is because of the variable bitrate of the encoded audio. The encoded audio stream is cut into equal frames (1024 samples each), but because of VBR the frames are decoded to pieces of audio with variable duration. That is why the presentation time advances so unevenly.

anotherche on 9 Nov 2020

Is there another way to get the start position of samples of a frame?
I initially thought about decoding everything and recording the correct sample start position, but that method is very inefficient and consumes a lot of memory.

kusaanko on 9 Nov 2020

It's not clear. When you ask "Is there a way to get an accurate timestamp?", do you mean the time in ms at which some audio frame starts? If so, grabber.getTimestamp() in your code just returns the timestamp you need.

anotherche on 9 Nov 2020

Let me guess. It's probably that you want to read audio data starting from a time moment t0. But that time t0 is not exactly the starting position of an audio frame, in which that position is located. Right? Then, as far as I know, grabber.setTimestamp(t0) will set the correct position for you (the real starting position of that frame). As you rather need audio, you should use grabber.setAudioTimestamp(t0) instead.

anotherche on 9 Nov 2020

Well, careful code reading shows that currently all variants of grabber.setTimestamp() return the frame just after the desired time position. This should probably be changed in future releases, at least in the setTimestamp(timestamp, checkFrame) method, as well as in setAudioTimestamp and in setVideoTimestamp.
Right now you can use the methods from avformat, avutil, etc., as you already do:
first, seek to a position before the desired one with avformat_seek_file if necessary
then read packets with av_read_frame in a while loop and decode them
then get the pts of the decoded frame with frame.best_effort_timestamp()
then get the timestamp of that frame in microseconds with

AVRational time_base = audio_st.time_base();
timestamp = 1000000L * pts * time_base.num() / time_base.den();

then estimate the duration of that frame

duration= 1000000.0*frame.nb_samples()/codec_context.sample_rate() ;

then check if the desired position is located within that frame

timestamp+duration> the_desired_time_in_microseconds

if so, then the last read frame is the frame you need.
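
The steps above can be sketched without any FFmpeg dependency. This is only an illustration of the arithmetic: the class and method names are hypothetical, and the PTS, time base, sample count and sample rate are passed in as plain numbers instead of being read from AVFrame, AVStream and AVCodecContext.

```java
// Sketch of the "does this frame reach the desired position" check described above.
final class FrameContainment {
    /** Frame start in microseconds from its PTS and the stream time base num/den. */
    static long startMicros(long pts, int num, int den) {
        return 1_000_000L * pts * num / den;
    }

    /** Estimated frame duration in microseconds from its sample count. */
    static double durationMicros(int nbSamples, int sampleRate) {
        return 1_000_000.0 * nbSamples / sampleRate;
    }

    /** True when the frame ends after the desired position (timestamp + duration > desired). */
    static boolean reaches(long pts, int num, int den,
                           int nbSamples, int sampleRate, long desiredMicros) {
        return startMicros(pts, num, den) + durationMicros(nbSamples, sampleRate) > desiredMicros;
    }

    public static void main(String[] args) {
        long desired = 50_000; // microseconds
        System.out.println(reaches(0, 1, 48000, 1024, 48000, desired));    // first frame
        System.out.println(reaches(1877, 1, 48000, 1024, 48000, desired)); // second frame
    }
}
```

When frames are read in order after seeking to an earlier position, the first frame for which the check succeeds is the one to keep.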

anotherche on 10 Nov 2020

That code doesn't appear to work properly.
From what I've found, AVFrame#pkt_duration seems to be able to predict the timestamp of the next frame.
AVFrame#best_effort_timestamp returns a value equivalent to AVFrame#pts and this value does not appear to match the sample position.
The GeForce Experience file returns a different value than the actual microseconds in that program, because the audio_st.time_base() returns a num of 1 and a den of 48,000.
For example, where the original timestamp of the frame is 0.042666666666666666666665 seconds, the code would show 0.0601875.

kusaanko on 10 Nov 2020

So, the idea works with the AVFrame#pkt_duration. Perfect!
Although I still don't understand what the original problem was :)
By the way, why do you think that the time_base=1/48000 is wrong? As I can see in your ffmpeg log, the time_base corresponds to the sample rate here.
Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 128 kb/s (default)

anotherche on 10 Nov 2020

For example, where the original timestamp of the frame is 0.042666666666666666666665 seconds, the code would show 0.0601875.

Let me guess once more. 2048/48000=0.042(66)
What is that "0.0601875" then?

anotherche on 10 Nov 2020

Ahh, I see that 0.0601875=2889/48000 is the timestamp in seconds of the 3rd frame in the GeForce encoded video.

nb_samples1024
best_effort_timestamp0
nb_samples1024
best_effort_timestamp1877
nb_samples1024
best_effort_timestamp2889

Can you show the output of
System.out.println("pts " + frame.best_effort_timestamp()+" pkt_duration "+ frame.pkt_duration());
for the GeForce encoded file?

anotherche on 11 Nov 2020

Here's the output.

pts 0 pkt_duration 1877
pts 1877 pkt_duration 1012
pts 2889 pkt_duration 1050
pts 3939 pkt_duration 1006
pts 4945 pkt_duration 1014
pts 5959 pkt_duration 1045
pts 7004 pkt_duration 1002
pts 8006 pkt_duration 1041
pts 9047 pkt_duration 1018
pts 10065 pkt_duration 1028

kusaanko on 11 Nov 2020

Reading the source code of ffplay gave me a hint.
It seems that ffplay allows for a few sample deviations.
In fact, a deviation of 0.0175209 seconds is not discernible to humans.
So I decided to use the following method.

    private AVFormatContext format_context;
    private AVStream audio_stream;
    private AVCodec codec;
    private AVCodecContext codec_context;
    private AVFrame frame;
    private AVPacket packet;
    private long now_timestamp;
    private long next_timestamp;
    private boolean audio_seek;
    public static AVRational av_time_base;
    static {
        av_time_base = new AVRational();
        av_time_base.num(1);
        av_time_base.den(AV_TIME_BASE);
    }

First, grab it.
Set the current timestamp to 0 at first, and then use the value calculated from AVFrame#nb_samples for the next position.
Here's the code.

    public boolean grab() {
        //Decoding if left over as multiple frames may be included
        if (avcodec_receive_frame(codec_context, frame) == 0) {
            now_timestamp = next_timestamp;
            next_timestamp = now_timestamp + frame.nb_samples();
            av_packet_unref(packet);
            return true;
        }
        while (av_read_frame(format_context, packet) == 0) {
            if (packet.stream_index() == audio_stream.index()) {
                if (avcodec_send_packet(codec_context, packet) != 0) {
                    throw new IllegalArgumentException("avcodec_send_packet failed\n");
                }
                av_frame_unref(frame);
                if (avcodec_receive_frame(codec_context, frame) == 0) {
                    now_timestamp = next_timestamp;
                    next_timestamp = now_timestamp + frame.nb_samples();
                    av_packet_unref(packet);
                    return true;
                }
            }
            av_packet_unref(packet);
        }
        return false;
    }

The next step is to seek.
For the seek, we use -1 for the stream index and specify a microsecond timestamp. (seconds * AV_TIME_BASE)
In this case, we need to consider AVFormatContext#start_time.

    public void seek(long samplePos)
    {
        long timestamp = (long) (samplePos * 1000000d / codec_context.sample_rate()) + format_context.start_time();
        avformat_seek_file(format_context, -1, Long.MIN_VALUE, timestamp, Long.MAX_VALUE, AVSEEK_FLAG_BACKWARD);
        avcodec_flush_buffers(codec_context);
    }

After the seek, we can get the current sample location in the following way:
In this case, we need to consider AVFormatContext#start_time.

now_timestamp = (long)(((frame.pts()) * av_q2d(audio_stream.time_base()) - (format_context.start_time() * av_q2d(av_time_base))) * codec_context.sample_rate());

Calculate the deviation from the acquired timestamp and grab the necessary amount.
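
The two conversions used here (sample position to seek target, and frame PTS back to sample position) can be sketched in plain Java as follows. The names are hypothetical and all FFmpeg values are passed in as numbers. One deliberate difference from the snippet above: the sketch rounds instead of truncating with a (long) cast, on the assumption that truncation could occasionally land one sample low because of floating-point error.

```java
// Sketch of the unit conversions around a seek, with start_time taken into account.
final class SamplePositionMath {
    /** Seek target in AV_TIME_BASE units (microseconds) for a sample position. */
    static long seekTargetMicros(long samplePos, int sampleRate, long startTimeMicros) {
        return (long) (samplePos * 1_000_000d / sampleRate) + startTimeMicros;
    }

    /** Sample position of a frame from its PTS, the stream time base num/den and the container start time. */
    static long samplePosFromPts(long pts, int num, int den,
                                 long startTimeMicros, int sampleRate) {
        double seconds = pts * ((double) num / den) - startTimeMicros / 1_000_000.0;
        return Math.round(seconds * sampleRate); // rounding is an assumption of this sketch
    }

    public static void main(String[] args) {
        System.out.println(seekTargetMicros(48_000, 48_000, 0));
        System.out.println(samplePosFromPts(2889, 1, 48_000, 0, 48_000));
    }
}
```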

Finally, this is the whole code.

    public boolean grab() {
        //Decoding if left over as multiple frames may be included
        if (avcodec_receive_frame(codec_context, frame) == 0) {
            if (audio_seek) {
                now_timestamp = (long)(((frame.pts()) * av_q2d(audio_stream.time_base()) - (format_context.start_time() * av_q2d(av_time_base))) * codec_context.sample_rate());
                audio_seek = false;
            }
            else {
                now_timestamp = next_timestamp;
            }
            next_timestamp = now_timestamp + frame.nb_samples();
            av_packet_unref(packet);
            return true;
        }
        while (av_read_frame(format_context, packet) == 0) {
            if (packet.stream_index() == audio_stream.index()) {
                if (avcodec_send_packet(codec_context, packet) != 0) {
                    throw new IllegalArgumentException("avcodec_send_packet failed\n");
                }
                av_frame_unref(frame);
                if (avcodec_receive_frame(codec_context, frame) == 0) {
                    if (audio_seek) {
                        now_timestamp = (long)(((frame.pts()) * av_q2d(audio_stream.time_base()) - (format_context.start_time() * av_q2d(av_time_base))) * codec_context.sample_rate());
                        audio_seek = false;
                    }
                    else {
                        now_timestamp = next_timestamp;
                    }
                    next_timestamp = now_timestamp + frame.nb_samples();
                    av_packet_unref(packet);
                    return true;
                }
            }
            av_packet_unref(packet);
        }
        return false;
    }
    public void seek(long samplePos)
    {
        long timestamp = (long) (samplePos * 1000000d / codec_context.sample_rate()) + format_context.start_time();
        avformat_seek_file(format_context, -1, Long.MIN_VALUE, timestamp, Long.MAX_VALUE, AVSEEK_FLAG_BACKWARD);
        avcodec_flush_buffers(codec_context);
        audio_seek = true;
        while(grab() && next_timestamp < samplePos) {}
    }
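
The bookkeeping in grab() and seek() can be isolated from the decoding. The following is a sketch with hypothetical names, not the code from this post: after a seek the position is taken once from the frame's PTS (already converted to a sample position), and afterwards it only advances by nb_samples.

```java
// Tracks the current and next sample position the way grab()/seek() above do.
final class SamplePositionTracker {
    private long now;
    private long next;
    private boolean afterSeek;

    /** Call after a seek: the next frame re-synchronises the counter from its PTS. */
    void markSeek() {
        afterSeek = true;
    }

    /** Call for every decoded frame. ptsSamplePos is only used right after a seek. */
    void onFrame(long ptsSamplePos, int nbSamples) {
        if (afterSeek) {
            now = ptsSamplePos;
            afterSeek = false;
        } else {
            now = next;
        }
        next = now + nbSamples;
    }

    long now() { return now; }
    long next() { return next; }

    public static void main(String[] args) {
        SamplePositionTracker tracker = new SamplePositionTracker();
        long[] pts = {0, 1877, 2889};
        for (long p : pts) {
            tracker.onFrame(p, 1024);
            System.out.println("now=" + tracker.now() + " next=" + tracker.next());
        }
    }
}
```

Without a seek the counter ignores the irregular PTS values entirely, which is what keeps consecutive frames exactly 1024 samples apart.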

Follow the same procedure for the video.
In the case of a video, the code to get the location of the current frame is as follows:

now_timestamp = (long) (((frame.pts()) * av_q2d(video_stream.time_base()) - (format_context.start_time() * av_q2d(av_time_base))) * av_q2d(video_stream.avg_frame_rate()));

For a video, you don't need to calculate the next timestamp.
A seek may land past the desired location.
In that case, seeking to an earlier timestamp works well.
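
The frame-number formula for video can be sketched the same way as the audio one. The names below are hypothetical, the values are passed in as plain numbers, and the 1/15360 time base and 60 fps are taken from the ffmpeg log at the top of this issue. As an assumption of this sketch, the result is rounded rather than truncated.

```java
// Sketch: video frame number from PTS, and seek target from frame number.
final class VideoFrameIndex {
    /** Frame number from a PTS in time base tbNum/tbDen, the container start time and the frame rate. */
    static long frameIndex(long pts, int tbNum, int tbDen,
                           long startTimeMicros, int fpsNum, int fpsDen) {
        double seconds = pts * ((double) tbNum / tbDen) - startTimeMicros / 1_000_000.0;
        return Math.round(seconds * fpsNum / fpsDen);
    }

    /** Seek target in microseconds for a frame number. */
    static long seekTargetMicros(long frame, int fpsNum, int fpsDen, long startTimeMicros) {
        return (long) (frame * 1_000_000d * fpsDen / fpsNum) + startTimeMicros;
    }

    public static void main(String[] args) {
        // At 60 fps with a 1/15360 time base, one frame lasts 256 ticks.
        System.out.println(frameIndex(256L * 90, 1, 15360, 0, 60, 1));
        System.out.println(seekTargetMicros(90, 60, 1, 0));
    }
}
```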

In the case of best_effort_timestamp, the value may not be set. Use AVFrame#pts instead.

JavaCV does not calculate the timestamp of the audio from AVFrame#nb_samples and does not go back if the video is ahead of where you want it to be when you seek it. The code in JavaCV also needs to be fixed.

What do you think about my idea?

kusaanko on 23 Nov 2020

@kusaanko, the problem for me is that I still cannot understand your problem. It seems that you merely want to read some audio frames sequentially, starting at a certain timestamp. But why then seek for every next frame? You don't have to, because av_read_frame() already moves on to the next frame for you.

int av_read_frame(AVFormatContext *s, AVPacket *pkt)
Return the next frame of a stream.

You should simply call your grab() over and over.

anotherche on 23 Nov 2020

My objectives are as follows

Being able to accurately seek to the desired frame.
Knowing exactly the current position.

I also want to process each frame one by one, so I want to make grabbing a method to make it easier to handle.

kusaanko on 23 Nov 2020

All these tasks, as far as I understand them, are naturally solved by the methods of FFmpegFrameGrabber class (or by using their internal code). That's why I can't figure out why this doesn't meet your needs.

* Being able to accurately seek to the desired frame.
can be done with setAudioTimestamp(long timestamp), as you need audio frames. However, as I mentioned above, the current code seems to return the next frame if you ask for a timestamp located between the timestamps of two neighbouring frames. You can look at the code to get an idea of how to change it to return the correct frame.
There is a line in the setTimestamp:
if (this.timestamp >= timestamp - 1) break;
I believe that it can be changed to a
if (this.timestamp + this_frame_duration_in_microsec> timestamp -1) break;
to obtain the frame containing the desired timestamp.

* Knowing exactly the current position.
This depends on what you mean by "the current position".
If you mean the starting position of a current frame in microseconds, then it is exactly FFmpegFrameGrabber.timestamp
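
The difference between the two stop conditions for setTimestamp quoted above can be shown on a few made-up frame timestamps. This is a simulation with hypothetical names, not the FFmpegFrameGrabber code: it only returns the index of the frame on which each condition would break out of the grabbing loop.

```java
// Simulates the current and the proposed stop condition on a list of frame timestamps.
final class SeekStopCondition {
    /** Current condition: stop at the first frame whose start is at or after the target. */
    static int currentStop(long[] frameStartMicros, long targetMicros) {
        for (int i = 0; i < frameStartMicros.length; i++) {
            if (frameStartMicros[i] >= targetMicros - 1) return i;
        }
        return -1;
    }

    /** Proposed condition: stop at the first frame that ends after the target. */
    static int proposedStop(long[] frameStartMicros, long frameDurationMicros, long targetMicros) {
        for (int i = 0; i < frameStartMicros.length; i++) {
            if (frameStartMicros[i] + frameDurationMicros > targetMicros - 1) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] starts = {0, 21_333, 42_666, 64_000}; // 1024-sample frames at 48 kHz
        long target = 30_000;                        // falls inside the second frame
        System.out.println("current:  frame " + currentStop(starts, target));
        System.out.println("proposed: frame " + proposedStop(starts, 21_333, target));
    }
}
```

For a target of 30000 microseconds the current condition stops on the third frame (index 2), while the proposed one stops on the second frame (index 1), which is the frame that contains the target.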

I also want to process each frame one by one, so I want to make grabbing a method to make it easier to handle.

This, it seems to me, is what is causing you the problem. You seem to think that in order to read the next frame, you need to know the moment (position, timestamp, whatever) when it starts, to seek there and to grab it. The truth is, you absolutely don't need to know where the next frame starts and seek there. Just grab frames again and again, repeatedly. You do not need to seek for the every next frame.

anotherche on 23 Nov 2020

The problem with calculating the timestamp of each frame from its PTS is that the sample positions calculated this way are off by a few samples.
That makes it hard to know the current position when a new read request arrives after I have already taken as many samples as I needed.
I need to calculate the current position exactly to know whether I have to seek when a read request arrives.
In fact, the program I wrote before accomplishes my goal.

kusaanko on 23 Nov 2020

@kusaanko If you see a way to make JavaCV better, please do send a pull request for review! Thank you

saudet on 26 Nov 2020

@kusaanko I have carefully checked the proposed code and can draw the following conclusions:

1. Although you correctly calculate the change in the position of the starting sample from frame to frame (next_timestamp = now_timestamp + frame.nb_samples();), you still have to use the PTS of the first frame to calculate the starting position (now_timestamp = (long)(((frame.pts()) * av_q2d(audio_stream.time_base()) - (format_context.start_time() * av_q2d(av_time_base))) * codec_context.sample_rate());). In any case, this introduces the error that you described above (the allowed deviations). The only way to seek with absolute accuracy to the frame containing the required sample is to build a map like PTS_of_a_audioframe:starting_sample_position_of_this_audioframe. But this, of course, can be very expensive, since it requires pre-decoding the entire audio track.
2. The current version of FFmpegFrameGrabber allows you (almost, read on) to reach your goal with exactly the same error, since its timestamps are also calculated from PTS (just like your now_timestamp). However, to get the same result as your code, we need, as I said above, to change the condition that stops the frame grabbing loop (in the setTimestamp method) so that it stops on the current frame rather than on the next one (as it does now). In your code, this corresponds to the next_timestamp < samplePos condition. To achieve the same result in the current FFmpegFrameGrabber code, you just need to make the change I wrote above.

if (this.timestamp >= timestamp - 1) break;
I believe that it can be changed to a
if (this.timestamp + this_frame_duration_in_microsec> timestamp -1) break;

I'm just checking the code modified in this way and it works exactly as I understand you need it (again, if I understand you correctly at all). In addition, I have made some changes that prevent erroneous seeks (sometimes avformat_seek_file seeks to a frame just after the requested timestamp, not before as it should). Moreover, the seek speed is almost doubled for the setAudioTimestamp and setVideoTimestamp methods. I will open a PR after I debug this new code.

3. Optionally, we can even try to correct the above-mentioned error that comes from rounding the PTS values. This adjustment will work under two conditions: a. the sample rate is constant throughout the entire file (I don't know if video containers allow combining an audio track from several parts with different sample rates); b. nb_samples is the same for all audio frames (I don't know if video containers allow variable nb_samples in audio tracks or combining several parts with different nb_samples). If these conditions are met, then the timestamp of a frame calculated from its PTS can be adjusted so that it is a multiple of nb_samples/sampleRate. For example, the correction can always be applied if it does not exceed ~5% of the frame duration. Then, even if the conditions above are not met, this correction will not lead to a large error (my testing shows that the error in timestamps calculated from PTS does not exceed 5%). And if the conditions are met, the correction solves the problem of inaccurate seeking when the requested sample position is close to the beginning or end of a frame. By analogy, the timestamps of video frames can be corrected in a similar way, as they should advance in multiples of 1/frameRate.
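
The optional correction described above can be sketched as snapping a timestamp to the nearest multiple of the frame duration when the required shift is small. This is a sketch under the two stated conditions (constant sample rate and constant nb_samples); the names are hypothetical and the threshold is a parameter, with 0.05 standing for the ~5% mentioned above.

```java
// Snaps a PTS-derived timestamp to the frame grid nb_samples/sampleRate if the shift is small.
final class TimestampCorrection {
    static long correct(long timestampMicros, int nbSamples, int sampleRate, double maxFraction) {
        double frameMicros = 1_000_000.0 * nbSamples / sampleRate;
        long frameIndex = Math.round(timestampMicros / frameMicros);
        double snapped = frameIndex * frameMicros;
        if (Math.abs(snapped - timestampMicros) <= maxFraction * frameMicros) {
            return Math.round(snapped);  // close enough to the grid: apply the correction
        }
        return timestampMicros;          // too far off: leave the timestamp untouched
    }

    public static void main(String[] args) {
        System.out.println(correct(21_400, 1024, 48_000, 0.05)); // slightly late second frame
        System.out.println(correct(30_000, 1024, 48_000, 0.05)); // far from any frame boundary
    }
}
```

A timestamp of 21400 microseconds is pulled back to 21333 (one frame of 1024 samples at 48 kHz), while 30000 is left unchanged because the nearest grid point is more than 5% of a frame away.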

anotherche on 26 Nov 2020

I also considered the "1" method for accurate seeking, but didn't use it because it was too inefficient. However, you can get enough accuracy without using the "1" method. I realized this and decided to ignore some errors.
Method "2" is a better way. Why it is twice as fast is not clear to me.
The "3" method may not be very practical. Some of the videos I have rarely return AVFrame#nb_samples different from other frames. When using the "3" method, you can use AVCodecContext#frame_size.

I also wrote the following process for returning when the avformat_seek_file seeks after the request. (For video)

    public void seek(long frame)
    {
        seek_only(frame);
        long f = frame - (now_timestamp - frame) - 3;
        while(now_timestamp > frame) {
            if(f < 0) f = 0;
            seek_only(f);
            if(f == 0) break;
            f -= 20;
        }
        while( now_timestamp < frame && grab() ) {}
    }

    public void seek_only(long frame)
    {
        long timestamp = (long) (frame * 1000000L / ((double)video_stream.avg_frame_rate().num() / video_stream.avg_frame_rate().den())) + format_context.start_time();
        avformat_seek_file(format_context, -1, Long.MIN_VALUE, timestamp, Long.MAX_VALUE, 0);
        avcodec_flush_buffers(codec_context);
        grab();
    }

(now_timestamp is frame number in the video, not microseconds.)

kusaanko on 26 Nov 2020

Method "2" is a better way. Why it is twice as fast is not clear to me.

I mean that right now I'm testing the new code for the FFmpegFrameGrabber.setTimestamp method. The new code is twice as fast as the present code in the JavaCV 1.5.4 release when accurate seeking is used with setAudioTimestamp or setVideoTimestamp (special variants of the setTimestamp method where the seek uses only the specified frame type, audio or video). And, of course, the new code results in an accurate seek, as accurate as yours (it seeks to the frame containing the samples at the requested timestamp, in microseconds).

The "3" method may not be very practical.

I still think it is not that impractical. We have already seen that PTS of frames (from which their timestamps are calculated) increase non-uniformly. I have tested this on several files. (x264, aac, mkv container). Timestamps of video frames calculated from their PTS increase non-uniformly with deviations from the uniform increase within +- 0.5 milliseconds (~ 1% of the video frame duration). This is quite small, although it will lead to the fact that in 1% of cases the method will return the next or previous frame, instead of the desired one (after all, we are talking about the task of precision positioning, right?). But the situation appeared worse for audio frames: the PTSs of audio frames deviate from the exact positions within +- 2 milliseconds (~ 9.4% of the audio frame duration, +- 96 samples out of 1024 in the tested files). Therefore, when you need to get the exact position of a frame containing certain samples that are located close to the beginning or end of the frame (closer than these 96 samples), then using PTS to calculate the frame position may result in the next or previous frame instead of what you want. And this will happen already in ~ 10% of cases. Not so little if we strive for the precision?
That's why I'm implementing this method in the new FFmpegFrameGrabber.setTimestamp code. Based on the measured deviations of audio PTSs, I apply the timestamp correction if the estimated deviation does not exceed 10% of the frame duration. This is still a very small value in the usual sense, since standard frame seek methods generally return frames within 1 second of the required one. At the same time, we get the possibility of almost accurate positioning in the case of video files with a constant audio frame size (almost as accurate as if you decoded the entire audio). For all other cases, this correction will lead to almost the same positioning error as the method without this correction (in a few percent of cases, positioning will be off by 1 frame).

Some of the videos I have rarely return AVFrame#nb_samples different from other frames. When using the "3" method, you can use AVCodecContext#frame_size

Thanks. This can be additional method to get the audio frame size. If it is set by encoders or decoders, it indicates that the frame size is constant.

int AVCodecContext::frame_size
Number of samples per channel in an audio frame.

    encoding: set by libavcodec in avcodec_open2(). Each submitted frame except the last must contain 
       exactly frame_size samples per channel. 
       May be 0 when the codec has AV_CODEC_CAP_VARIABLE_FRAME_SIZE set, then the frame size 
       is not restricted.
    decoding: may be set by some decoders to indicate constant frame size

Although it often happens with audio and video files, that such optional metadata is not set correctly.

anotherche on 26 Nov 2020

I listened to your description and thought the "3" method was worth a try.
In fact, I once used the same method as the "3" method to fix the timestamp after a seek. But this time I gave up because for some reason I couldn't get it to play well.
Worth trying again.
But even if you can correct it, there will still be some discrepancies. Because the number of samples in every frame may not match up.
But if it gives a higher accuracy than the "2" method, it's a great way.

kusaanko on 26 Nov 2020

The code for setTimestamp with the precise timestamp correction is almost done. However, I came across a question to which I cannot find a clear answer. It turned out that when seeking with the correction, it is necessary to take into account not only the AVFormatContext start_time (the start time of the entire video) but also the AVStream start_time (the start time of an individual stream). And here I cannot understand whether they need to be added together (to find the real start time of that stream), or whether the start_time of the stream already takes the start_time of the entire video into account. In other words, relative to what are the timestamps of the frames of the selected stream counted: relative to the start_time of the entire video, or relative to absolute time?
In the ffplay code, I found such a code.

static int64_t start_time = AV_NOPTS_VALUE;
...
 /* if seeking requested, we execute it */
    if (start_time != AV_NOPTS_VALUE) {
        int64_t timestamp;

        timestamp = start_time;
        /* add the stream start time */
        if (ic->start_time != AV_NOPTS_VALUE)
            timestamp += ic->start_time;
        ret = avformat_seek_file(ic, -1, INT64_MIN, timestamp, INT64_MAX, 0);
        if (ret < 0) {
            av_log(NULL, AV_LOG_WARNING, "%s: could not seek to position %0.3f\n",
                    is->filename, (double)timestamp / AV_TIME_BASE);
        }
    }

It seems to follow from it that the starting times must be added. But is it really so? @saudet , @kusaanko

anotherche on 28 Nov 2020

It seems all start_time values are absolute, that is, they should not be added to one another.
I have a video with AVFormatContext start_time=0, but AVStream start_time is 9 for the audio stream (and it is zero for the video stream). 9 is the time in audio stream timebase, so that it equals to 9000 microseconds. I cut first 1 min of that video and asked to add a time offset to the output file with the command

ffmpeg -i input.mkv -t 00:01:00 -output_ts_offset 0.1 -c copy test.mkv

which adds an offset of 100000 microseconds to the file.

The output of ffprobe -show_format -show_streams test.mkv then contains the following:

[STREAM]
index=0
codec_name=h264
...
time_base=1/1000
start_pts=100
start_time=0.100000
[/STREAM]
[STREAM]
index=1
codec_name=aac
...
time_base=1/1000
start_pts=109
start_time=0.109000
...
[/STREAM]
[FORMAT]
filename=test.mkv
...
start_time=0.100000
...
[/FORMAT]

FFmpegFrameGrabber gives the following for that test.mkv:
```
oc.start_time = 100000 (start time of the file)
video_st.start_time = 100 (start time of the video stream in its timebase units)
audio_st.start_time = 109 (start time of the audio stream in its timebase units)
first audio frame timestamp = 109000
first video frame timestamp = 100000
```
That is, audio is still delayed by 9 ms relative to video, and both the audio and video streams have their start_time shifted forward by 0.1 sec.
So, finally, AVStream start_time is the absolute start time of that stream.
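
The numbers above can be checked with plain arithmetic. The following is only a minimal sketch, not JavaCV code: `toMicros` is a hypothetical helper that rescales a value from a stream time base to microseconds (the unit of AVFormatContext start_time), assuming the 1/1000 time base reported by ffprobe for test.mkv.

```java
public class StartTimeCheck {
    // Microseconds per second; same value as FFmpeg's AV_TIME_BASE.
    static final long AV_TIME_BASE = 1_000_000L;

    // Hypothetical helper: convert a timestamp expressed in a stream time base
    // (num/den seconds per tick) to microseconds, rounding to the nearest one.
    static long toMicros(long ts, long num, long den) {
        return Math.round((double) ts * num * AV_TIME_BASE / den);
    }

    public static void main(String[] args) {
        long formatStart = 100_000L;               // oc.start_time, already in microseconds
        long videoStart = toMicros(100, 1, 1000);  // video_st.start_time = 100 ticks
        long audioStart = toMicros(109, 1, 1000);  // audio_st.start_time = 109 ticks

        System.out.println("video stream start (us): " + videoStart);
        System.out.println("audio stream start (us): " + audioStart);
        // The stream values already include the 100000 us file offset;
        // the remaining difference is the 9 ms audio delay.
        System.out.println("audio start minus file start (us): " + (audioStart - formatStart));
    }
}
```

The converted audio start (109000) matches the first audio frame timestamp listed above, which is what one would expect if the stream start_time is absolute and nothing has to be added to it.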

anotherche on 28 Nov 2020

In ffplay, ic is an AVFormatContext. Also, the variable start_time seems to be set only when you add -ss to the ffplay argument.

static int opt_seek(void *optctx, const char *opt, const char *arg)
{
    start_time = parse_time_or_die(opt, arg, 1);
    return 0;
}
...
static const OptionDef options[] = {
...
    { "ss", HAS_ARG, { .func_arg = opt_seek }, "seek to a given position in seconds", "pos" },

The actual seek code is as follows.

                    int64_t ts;
                    int ns, hh, mm, ss;
                    int tns, thh, tmm, tss;
                    tns  = cur_stream->ic->duration / 1000000LL;
                    thh  = tns / 3600;
                    tmm  = (tns % 3600) / 60;
                    tss  = (tns % 60);
                    frac = x / cur_stream->width;
                    ns   = frac * tns;
                    hh   = ns / 3600;
                    mm   = (ns % 3600) / 60;
                    ss   = (ns % 60);
                    av_log(NULL, AV_LOG_INFO,
                           "Seek to %2.0f%% (%2d:%02d:%02d) of total duration (%2d:%02d:%02d)       \n", frac*100,
                            hh, mm, ss, thh, tmm, tss);
                    ts = frac * cur_stream->ic->duration;
                    if (cur_stream->ic->start_time != AV_NOPTS_VALUE)
                        ts += cur_stream->ic->start_time;
                    stream_seek(cur_stream, ts, 0, 0);

/* seek in the stream */
static void stream_seek(VideoState *is, int64_t pos, int64_t rel, int seek_by_bytes)
{
    if (!is->seek_req) {
        is->seek_pos = pos;
        is->seek_rel = rel;
        is->seek_flags &= ~AVSEEK_FLAG_BYTE;
        if (seek_by_bytes)
            is->seek_flags |= AVSEEK_FLAG_BYTE;
        is->seek_req = 1;
        SDL_CondSignal(is->continue_read_thread);
    }
}

        if (is->seek_req) {
            int64_t seek_target = is->seek_pos;
            int64_t seek_min    = is->seek_rel > 0 ? seek_target - is->seek_rel + 2: INT64_MIN;
            int64_t seek_max    = is->seek_rel < 0 ? seek_target - is->seek_rel - 2: INT64_MAX;
// FIXME the +-2 is due to rounding being not done in the correct direction in generation
//      of the seek_pos/seek_rel variables

            ret = avformat_seek_file(is->ic, -1, seek_min, seek_target, seek_max, is->seek_flags);

This code runs when you right-click in the ffplay window. The seek fraction is determined by the x-coordinate of the click within the window.
In this case, only AVFormatContext#start_time is used.
I don't know if I should actually use AVStream#start_time or not. I can't notice a difference of 9ms.

kusaanko on 28 Nov 2020

However, I think AVStream#start_time should be used instead of AVFormatContext#start_time to calculate the frame's timestamp if it affects the frame's timestamp.
But I don't know if AVStream#start_time should be used in avformat_seek_file.
If you use AVStream#index for the stream index in avformat_seek_file, then I think using AVStream#start_time is a must.
You might be using AVStream#start_time internally since you are using -1 for the stream index.

kusaanko on 28 Nov 2020

> I don't know if I should actually use AVStream#start_time or not. I can't notice a difference of 9ms.

I'm sure no one can notice! But let me recall the first message in this long discussion. Just a quote:

> But for GeForce Experience files, the timestamp increases by about 1000-1050.
> In this case, the value of AVFrame#nb_samples is always 1024.

Somewhat later you write:

> The problem with calculating the timestamp of each frame is that the positions of the samples calculated from the timestamp are off by a few samples.

From this I concluded that you needed an extra-precise seek to an audio frame, because deviations of a few samples were unacceptable for you. Even if we are talking about 50 audio samples at a sample rate of 48000 samples/sec, the deviation is only about 1 ms, which is far less than those 9 ms. So again I do not understand your initial problem: a 9 ms difference is now unimportant to you, although in the beginning even 1 ms deviations mattered.
But if a difference like 9 ms is not important, then deviations of up to about 400 samples are not important either. I would even suppose that a deviation of up to one frame duration is not very important for most people, because it still corresponds to only a few dozen milliseconds. In that case even the existing JavaCV release should be acceptable for you: the methods setAudioTimestamp and setVideoTimestamp, introduced around release 1.4, return a frame right next to the required samples. (To obtain a frame-precise result you should use setAudioTimestamp and setVideoTimestamp, not the standard setTimestamp, which may return a frame within 1 second of the required timestamp.) As I mentioned above, this can be improved; the new code I'm now working on, thanks to your issue, improves it so that the returned frame will almost always contain the required samples.
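
To make the orders of magnitude concrete, here is a small arithmetic sketch (plain Java, no JavaCV involved) that converts between samples and milliseconds at the 48000 Hz sample rate from the ffmpeg log. The helper names are made up for illustration.

```java
public class DeviationArithmetic {
    // Duration of a number of audio samples, in milliseconds.
    static double samplesToMillis(double samples, int sampleRate) {
        return 1000.0 * samples / sampleRate;
    }

    // Number of audio samples covered by a duration given in milliseconds.
    static double millisToSamples(double millis, int sampleRate) {
        return millis * sampleRate / 1000.0;
    }

    public static void main(String[] args) {
        int sampleRate = 48000; // from the ffmpeg log in the first message
        System.out.printf("50 samples   = %.3f ms%n", samplesToMillis(50, sampleRate));
        System.out.printf("9 ms         = %.0f samples%n", millisToSamples(9, sampleRate));
        System.out.printf("1024 samples = %.3f ms%n", samplesToMillis(1024, sampleRate));
    }
}
```

So 50 samples are about 1.04 ms, 9 ms are 432 samples, and one 1024-sample frame lasts about 21.3 ms.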

> JavaCV does not calculate the timestamp of the audio from AVFrame#nb_samples and does not go back if the video is ahead of where you want it to be when you seek it. The code in JavaCV also needs to be fixed.

From the preceding it follows that it does not matter at all whether we step to the desired frame using nb_samples or using timestamps: the search result will have the same error of +-1 frame. And such an error is, in most cases, completely invisible, as you write about those 9 ms. Whichever way we move to the desired frame after calling avformat_seek_file, we will still arrive at the same frame. Even if we imagine a corrupted file in which all the timestamps are completely wrong, you still have to call avformat_seek_file first, and it will send you to the wrong starting position too. After that it no longer matters how carefully you move from frame to frame, because the starting point was wrong.
As for backward searching, as far as I know you cannot implement it directly with ffmpeg in any way, so nothing can be changed in JavaCV to make it happen. It can be achieved only by repeatedly seeking to an earlier position and monitoring the timestamps of the video and audio frames read afterwards. In this case the setTimestamp method is what you really need, because it puts you one second before the requested position in all the streams. After that you can grab frame by frame to find the required video, audio and anything else (subtitles, for example). However, I will tell you a secret about avformat_seek_file. Sometimes, I guess when a keyframe lies right next to the requested timestamp, avformat_seek_file puts you not before the timestamp but just after it. The max_ts parameter of this function might be expected to help in this situation, but in reality it does not. The real way to prevent this behaviour, in my experience, is to seek 0.5 second earlier with avformat_seek_file. That increases the number of subsequent search steps by only a few tens of frames, which is not many.
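
The "seek a bit early, then step forward" idea can be sketched as follows. This is only an illustration under simplifying assumptions: `FakeStream` is a toy stand-in, not the JavaCV or FFmpeg API. Its frames all have the same duration, every tenth frame is a keyframe, and its `seek` always lands on a keyframe at or before the requested time (the "lands just after" case described above is not modelled).

```java
public class SeekThenStep {
    // Toy stand-in for a demuxer plus decoder; NOT the JavaCV or FFmpeg API.
    // Frame i starts at i * frameDuration microseconds; every keyInterval-th frame is a keyframe.
    static class FakeStream {
        final long frameDuration;
        final int keyInterval;
        final int frameCount;
        int next = 0; // index of the frame the next grab() returns

        FakeStream(long frameDuration, int keyInterval, int frameCount) {
            this.frameDuration = frameDuration;
            this.keyInterval = keyInterval;
            this.frameCount = frameCount;
        }

        // Coarse seek: jump to the last keyframe starting at or before the target.
        void seek(long targetMicros) {
            long idx = Math.max(0, targetMicros) / frameDuration;
            idx = Math.min(idx, frameCount - 1);
            next = (int) (idx - idx % keyInterval);
        }

        // Returns the start timestamp of the next frame, or -1 at end of stream.
        long grab() {
            if (next >= frameCount) return -1;
            return (long) next++ * frameDuration;
        }
    }

    // Seek marginMicros before the target, then read forward until the frame
    // whose interval [start, start + duration) contains the target.
    static long seekPrecise(FakeStream s, long targetMicros, long marginMicros) {
        s.seek(targetMicros - marginMicros);
        long ts;
        while ((ts = s.grab()) >= 0) {
            if (ts + s.frameDuration > targetMicros) return ts;
        }
        return -1; // target lies beyond the end of the stream
    }

    public static void main(String[] args) {
        FakeStream s = new FakeStream(21_333, 10, 1000); // roughly 1024 samples at 48 kHz
        long target = 5_000_000;                         // 5 s
        long found = seekPrecise(s, target, 500_000);    // seek 0.5 s early
        System.out.println("frame containing target starts at " + found + " us");
    }
}
```

With a 0.5 s margin and frames of about 21 ms, the forward loop reads a couple of dozen extra frames, which matches the "several tens of frames" estimate.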

> However, I think AVStream#start_time should be used instead of AVFormatContext#start_time to calculate the frame's timestamp if it affects the frame's timestamp.
> But I don't know if AVStream#start_time should be used in avformat_seek_file.
> If you use AVStream#index for the stream index in avformat_seek_file, then I think using AVStream#start_time is a must.
> You might be using AVStream#start_time internally since you are using -1 for the stream index.

For precise timestamp correction the use of AVStream#start_time is crucial, because all timestamps of frames belonging to a stream are shifted by that stream's AVStream#start_time. It does not matter which stream index is passed to avformat_seek_file, as you always have to grab more frames after the call to reach the frame you really need. (But, hmm, to be honest, I have never tried calling avformat_seek_file with an explicitly specified stream index. What if it does all the work for us?)

anotherche on 29 Nov 2020

No. I just said I don't know if I should use AVStream#start_time because I can't notice the 9ms difference.
I didn't say to ignore the 9ms difference.
If I could, I would seek to the exact position as much as possible. But if it's not technically possible, some compromise is necessary.

Also, in my experience, if you use AVStream#index for the stream index of avformat_seek_file, the timestamp unit is AVStream#time_base. Also, the seek works fine. The problem is that there is a discrepancy when calculating the timestamp from the time. It would be great if we could get rid of this misalignment.
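
One possible source of such a discrepancy is rounding when a sample position is converted to the stream time base and back. The sketch below is only an illustration with hypothetical helper names; it assumes 48000 Hz audio and compares a 1/1000 time base (as in the test.mkv discussed above) with a 1/48000 time base, where the conversion is exact.

```java
public class SampleToTimestamp {
    // Convert a sample position to a timestamp in time-base ticks
    // (num/den seconds per tick), rounding to the nearest tick.
    static long sampleToTicks(long sample, int sampleRate, long num, long den) {
        return Math.round((double) sample * den / ((double) sampleRate * num));
    }

    // Convert ticks back to a sample position, rounding to the nearest sample.
    static long ticksToSample(long ticks, int sampleRate, long num, long den) {
        return Math.round((double) ticks * num * sampleRate / den);
    }

    public static void main(String[] args) {
        int sampleRate = 48000;
        long sample = 1024L * 1000 + 5; // 5 samples past a 1024-sample frame boundary

        long msTicks = sampleToTicks(sample, sampleRate, 1, 1000);
        long exactTicks = sampleToTicks(sample, sampleRate, 1, 48000);

        System.out.println("1/1000 time base:  round trip error = "
                + (ticksToSample(msTicks, sampleRate, 1, 1000) - sample) + " samples");
        System.out.println("1/48000 time base: round trip error = "
                + (ticksToSample(exactTicks, sampleRate, 1, 48000) - sample) + " samples");
    }
}
```

With a millisecond time base one tick spans 48 samples, so the round trip can be off by up to 24 samples in either direction; with a time base equal to 1/sampleRate it is exact.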

kusaanko on 29 Nov 2020

> The problem is that there is a discrepancy when calculating the timestamp from the time. It would be great if we could get rid of this misalignment.

Let's check once more that we are talking about the same thing. If we mean the timestamp values returned by frame.timestamp or grabber.getTimestamp(), they are calculated from the PTSs and, of course, deviate somewhat from real time. But as we have already noted here many times, this deviation is most often within an interval of several tens of samples, that is, several milliseconds. This means that in 99.999% of cases the deviation is not a problem at all. If someone (and I suppose there are very few such people: me, for example, and perhaps you) is still interested in the exact time of a frame or even of a particular sample, then in most cases we can adjust these timestamp values to real time. For me, in my tasks, it was always enough just to get the frame containing the right moment in time. This, as I said, is already practically achieved in the current version of JavaCV (OK, one thing has to be fixed: sometimes it is not the desired frame that is returned but the next one; this is a simple fix). It works if the required time is not near the border between two frames. Otherwise there is a 50% chance of getting either the previous or the next frame.
So, I already have code that adjusts the timestamp values to real time. Right now I don't have much time to make a PR; possibly after mid-December. The correction works when all audio frames have the same nb_samples and sampleRate (I have already checked about 20 videos and all of them satisfy this condition). To make it clear what this is about, an approximate version of the correction code is given below.

// Real timestamp of a grabbed frame; it is corrected in place by the code below.
double ts = frame.timestamp;

// Threshold (as a fraction of the frame duration) for deciding whether the correction is reasonable.
// In some of my videos the deviation can be slightly bigger than 0.1, so 0.2 may be a better value.
double deltaThreshold = 0.1;

// Estimate of the frame duration in AV_TIME_BASE units (microseconds).
double frameDuration = 0.0;
if (frame.image != null && getFrameRate() > 0)
    frameDuration = AV_TIME_BASE / (double)getFrameRate();
else if (frame.samples != null && samples_frame != null && getSampleRate() > 0)
    frameDuration = AV_TIME_BASE * samples_frame.nb_samples() / (double)getSampleRate();

// Estimate of the deviation, as a fraction of the frame duration.
// ts0 is the starting timestamp of the needed stream (defined elsewhere).
// Timestamps of audio frames should increase by equal steps
// (if nb_samples and sampleRate are constant).
// If the increase is not uniform (because of deviations in the PTSs), then delta != 0.
double delta = 0.0;
if (frameDuration > 0.0)
    delta = (ts - ts0) / frameDuration - Math.round((ts - ts0) / frameDuration);

// If the deviation is too large, the correction may not be reasonable,
// for example because nb_samples or sampleRate are not constant.
// If they are not constant but delta is still below deltaThreshold,
// we can apply the correction anyway, because it will not spoil anything
// (adjustments within a few milliseconds are not important for such a stream).
if (Math.abs(delta) > deltaThreshold) delta = 0.0;

// Apply the correction.
ts -= delta * frameDuration;
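
For experimenting outside of FFmpegFrameGrabber, the same idea can be wrapped into a standalone function. This is a sketch under the same assumptions (constant frame duration, times in microseconds); the class and method names are made up for illustration and are not part of JavaCV.

```java
public class TimestampCorrection {
    // Snap a frame timestamp to the nearest multiple of frameDuration counted from ts0,
    // unless it deviates from that grid by more than deltaThreshold frame durations.
    // All times are in microseconds.
    static double correct(double ts, double ts0, double frameDuration, double deltaThreshold) {
        if (frameDuration <= 0.0) return ts;
        double frames = (ts - ts0) / frameDuration;
        double delta = frames - Math.round(frames);
        if (Math.abs(delta) > deltaThreshold) return ts; // too far off the grid: leave untouched
        return ts - delta * frameDuration;
    }

    public static void main(String[] args) {
        double frameDuration = 1_000_000.0 * 1024 / 48000; // about 21333.33 us
        double ts0 = 0.0;

        // Third frame after the start, reported 400 us late: snapped back to the grid.
        double late = 3 * frameDuration + 400;
        System.out.println(correct(late, ts0, frameDuration, 0.1));

        // Half a frame off the grid: exceeds the threshold, so it is returned unchanged.
        double far = 3.5 * frameDuration;
        System.out.println(correct(far, ts0, frameDuration, 0.1));
    }
}
```

In the first case the deviation is about 0.019 of a frame, so the timestamp is moved back to roughly 64000 us; in the second it is 0.5 of a frame and the value is left as it was.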

anotherche on 30 Nov 2020

The code can be used in setTimestamp method (I checked it on those 20 videos, it is fine). It can be used outside as well, in applications using JavaCV.

anotherche on 30 Nov 2020

There are interesting results for ac3-encoded audio streams. When such files are decoded from the beginning, there are no errors and the frame timestamps grow strictly evenly. But when a seek is used, the first 1, 2 or 3 frames read right after the avformat_seek_file call are most often read with errors and have shifted PTSs (this means, among other things, that these PTSs are set by the decoder / demuxer, since they are different, and correct, when the entire file is decoded from start to finish). This is not a problem for the correction, because after these first strange frames the subsequent frames are read without errors and have correct PTSs (after avformat_seek_file is called you always need to grab several tens, sometimes hundreds, of frames anyway until the desired timestamp is reached). This is probably because an ac3 stream cannot be decoded starting from an arbitrary frame, but after a few bad frames the decoder returns to normal operation.

anotherche on 30 Nov 2020

I'm waiting for a pull request.
My complete objective was not achieved, but I'm going to close this issue since I'm satisfied with the final result.
Thank you very much.

kusaanko on 10 Dec 2020

@kusaanko Did you also have a pull request you wanted to merge?

saudet on 6 Jan 2021

> @kusaanko Did you also have a pull request you wanted to merge?

No, I don't have one.

kusaanko on 8 Jan 2021

@kusaanko Please try the new code added from pull https://github.com/bytedeco/javacv/pull/1559! Thanks

saudet on 12 Jan 2021
