Linux: Crash when trying to encode JPEG using OMX

Created on 30 May 2020  路  30Comments  路  Source: raspberrypi/linux

Describe the bug
When trying to encode a JPEG image using OMX in ustreamer, the kernel crashes with stacktrace about DMA. The problem is observed on the 5.4.42 Arch Linux ARM kernel, but not observed on Raspbian with fresh kernel obtained using rpi-update. The likely reason is that the Alarm kernel is built from a commit https://github.com/raspberrypi/linux/commit/5e5024f643caa53ff59a6e00f40a9b55f7fc4e17 while the Raspbian kernel is build from https://github.com/raspberrypi/linux/commit/971a2bb14b459819db1bda8fcdf953e493242b42. The Arch kernel is slightly newer, and contains two commits that could have caused the problem. At least they are about DMA, and that's what the stacktrace is all about: https://github.com/raspberrypi/linux/commit/d8cbdaa729d5d3e9a1c18150bf4de69335a85a40 and https://github.com/raspberrypi/linux/commit/79495a5ecdfba69de51e88701a69c42d09806d84.

To reproduce

  • Install Arch Linux ARM with the latest kernel 5.4.42 or Raspbian (and compile kernel from commit https://github.com/raspberrypi/linux/commit/5e5024f643caa53ff59a6e00f40a9b55f7fc4e17).
  • Compile ustreamer with OMX support.
  • Run capture process from any v4l2 device via OMX: ustreamer --device /dev/video0 --encoder=OMX --workers=1.
  • Get a frozen process. After pressing Ctrl+C process turns into a zombie. At the same time, dmesg is experiencing the described crash.

Expected behaviour
OMX encoder should work correctly, as on older versions of the kernel.

Actual behaviour
Process frozen. Observes the kernel stacktrace.

System
Copy and paste the results of the raspinfo command in to this section. Alternatively, copy and paste a pastebin link, or add answers to the following questions:

  • Which model of Raspberry Pi? e.g. Pi3B+, PiZeroW
    RPi4 2Gb
  • Which OS and version (cat /etc/rpi-issue)?
    Arch Linux ARM
  • Which firmware version (vcgencmd version)?
    369ed4e44cb5a080a2dfa7f854ae4ff46b7c9ef9
  • Which kernel version (uname -a)?
    5.4.42

Logs

[  148.617613] 8<--- cut here ---
[  148.620716] Unhandled fault: unknown 3 (0x2a03) at 0xffeee000
[  148.626514] pgd = 21038de6
[  148.629239] [ffeee000] *pgd=80000000007003, *pmd=2ffbe003, *pte=c00fff54e3e71f
[  148.636518] Internal error: : 2a03 [#1] PREEMPT SMP ARM
[  148.641779] Modules linked in: usb_f_mass_storage usb_f_hid usb_f_acm u_serial btsdio bluetooth ecdh_generic ecc brcmfmac brcmutil cfg80211 raspberrypi_hwmon hwmon i2c_mux_pinctrl i2c_mux bcm2835_unicam i2c_bcm2835 bcm2835_v4l2(C) bcm2835_codec(C) bcm2835_isp(C) v4l2_mem2mem iproc_rng200 rng_core bcm2835_mmal_vchiq(C) videobuf2_vmalloc videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 videobuf2_common rpivid_mem uio_pdrv_genirq uio sch_fq_codel snd_bcm2835(C) snd_pcm snd_timer snd tc358743 v4l2_dv_timings v4l2_fwnode videodev mc cec libcomposite dwc2 udc_core drm drm_panel_orientation_quirks ip_tables x_tables ipv6 nf_defrag_ipv6
[  148.698262] CPU: 1 PID: 155 Comm: vchiq-slot/0 Tainted: G         C        5.4.42-1-ARCH #1
[  148.706633] Hardware name: BCM2711
[  148.710048] PC is at b15_dma_inv_range+0x20/0x50
[  148.718599] LR is at dma_cache_maint_page+0xa8/0x150
[  148.727498] pc : [<c0219d6c>]    lr : [<c0215bb0>]    psr: 000b0013
[  148.737726] sp : eeb67db8  ip : 00400fff  fp : c1404f98
[  148.746883] r10: 00000030  r9 : 00000002  r8 : c1555440
[  148.756068] r7 : c1407c84  r6 : fff54e3e  r5 : 00000000  r4 : 00000020
[  148.766656] r3 : 0000003f  r2 : 00000040  r1 : ffeef000  r0 : ffeee000
[  148.777100] Flags: nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  148.788171] Control: 30c5383d  Table: 204a5580  DAC: 55555555
[  148.797915] Process vchiq-slot/0 (pid: 155, stack limit = 0x41d87e62)
[  148.808346] Stack: (0xeeb67db8 to 0xeeb68000)
[  148.816732] 7da0:                                                       eeb67dfc c0219e14
[  148.832977] 7dc0: ffeee000 ffffc4dc ef085c00 ed6a3ba0 00000fff ef138810 00000fe0 00000020
[  148.849495] 7de0: 00000000 e6580194 e6545000 c0216114 c0219e14 00000000 54e3e020 00000fff
[  148.866438] 7e00: ef138810 00000000 00000002 c029d728 00000fe0 00000002 eeb67e64 c0cedc28
[  148.883883] 7e20: 54e3e020 00000fff ef138810 c029d8f8 00000fe0 00000002 ef0da000 ef0d2d00
[  148.901904] 7e40: 9a48a783 e6545020 00000001 00000003 ef138810 c029d940 00000fe0 00000002
[  148.920474] 7e60: 00000000 c0ce842c 00000002 e6545068 00002540 e6545014 c15b1d38 c14a78f8
[  148.939531] 7e80: 00000003 c0b66f60 00000000 c0b5b8dc 00000004 c1591848 400b0013 c1591848
[  148.959039] 7ea0: 200b0013 c159178c c15917f4 e18634cc 00000008 e1863400 e6585b28 c14a78f8
[  148.978971] 7ec0: e1863570 e6580194 c15917f4 c0b60ff4 eeb67f1c c0cedc28 ef668880 c0255080
[  148.999580] 7ee0: ffffffff c1400000 00000000 00000002 00000010 00000009 0000000e 00003b38
[  149.020752] 7f00: c10d7804 e6580020 c15b1b34 c1591b30 e6580194 c1591814 c1591884 c0e8bf80
[  149.042467] 7f20: c10d8b58 c10d8b2c c1591848 c10d85cc c10d8718 e18634e0 c15918ec c10650a8
[  149.064604] 7f40: c10d86e4 c10d86c0 00000000 ef399e00 c0273544 eeb67f54 eeb67f54 c1404f88
[  149.087150] 7f60: ef0b1d04 eeafb440 eeb61700 00000000 eeb66000 c15917f4 c0b60094 eeafb45c
[  149.110327] 7f80: ef0b1d04 c024ec5c 00000000 eeb61700 c024eb34 00000000 00000000 00000000
[  149.134033] 7fa0: 00000000 00000000 00000000 c02010d8 00000000 00000000 00000000 00000000
[  149.157814] 7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  149.181586] 7fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[  149.205350] [<c0219d6c>] (b15_dma_inv_range) from [<c0215bb0>] (dma_cache_maint_page+0xa8/0x150)
[  149.229387] [<c0215bb0>] (dma_cache_maint_page) from [<c0216114>] (__dma_page_dev_to_cpu+0x2c/0xc4)
[  149.253280] [<c0216114>] (__dma_page_dev_to_cpu) from [<c029d728>] (dma_direct_sync_single_for_cpu+0xbc/0xc0)
[  149.277603] [<c029d728>] (dma_direct_sync_single_for_cpu) from [<c029d8f8>] (dma_direct_unmap_page+0xb4/0xb8)
[  149.301254] [<c029d8f8>] (dma_direct_unmap_page) from [<c029d940>] (dma_direct_unmap_sg+0x44/0x60)
[  149.323356] [<c029d940>] (dma_direct_unmap_sg) from [<c0b66f60>] (vchiq_complete_bulk+0x1a4/0x260)
[  149.344992] [<c0b66f60>] (vchiq_complete_bulk) from [<c0b60ff4>] (slot_handler_func+0xf60/0x1560)
[  149.366167] [<c0b60ff4>] (slot_handler_func) from [<c024ec5c>] (kthread+0x128/0x154)
[  149.385806] [<c024ec5c>] (kthread) from [<c02010d8>] (ret_from_fork+0x14/0x3c)
[  149.404578] Exception stack(0xeeb67fb0 to 0xeeb67ff8)
[  149.415213] 7fa0:                                     00000000 00000000 00000000 00000000
[  149.434245] 7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  149.453142] 7fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  149.465086] Code: e1a02312 e2423001 e1100003 e1c00003 (1e070f3e) 
[  149.476410] ---[ end trace ee601eaafd00e3bb ]---
[  149.486095] note: vchiq-slot/0[155] exited with preempt_count 1

Most helpful comment

Okay. I'll describe the entire process step by step.

  1. Get Raspberry Pi 4 2Gb (I think the other one will do, too) with Raspbian.
  2. Run sudo rpi-update to get the latest 5.4 kernel. It will deliver kernel from this commit: https://github.com/raspberrypi/linux/commit/971a2bb14b459819db1bda8fcdf953e493242b42
  3. Add cma=128M to /boot/cmdline.txt.
  4. Connect any USB V4L2-compatible webcam.
  5. Install ustreamer with OMX support:
$ sudo apt install libevent-dev libjpeg8-dev uuid-dev libbsd-dev
$ git clone --depth=1 https://github.com/pikvm/ustreamer
$ cd ustreamer
$ make -j5 WITH_OMX=1
  1. Try to run capture and encoding with libjpeg on CPU: ./ustreamer --device=/dev/video0 --device-timeout=5 --encoder=cpu --workers=1
    The presence Capturing ... in the output will indicate that the process is running normally. After pressing Ctrl+C the process will stop.
  2. Try to run capture and encoding using OMX: ./ustreamer --device=/dev/video0 --device-timeout=5 --encoder=omx --workers=1. Make sure that everything works as in the previous case.
  3. Compile and install the new kernel from head of the rpi-5.4.y branch (KERNEL=kernel7l, used this manual). Reboot.
  4. Repeat the test from point 5 (CPU). Everything will work fine.
  5. Repeat the test from point 6 (OMX). Get the kernel crash:
    terminal.txt. Also
    dmesg.txt.

On all previous kernel releases, there were never any problems. I built a version from the head of the branch to make sure that fresh commits didn't fix this problem.

All 30 comments

Run sudo rpi-update on the Raspbian image to pull in a kernel with those commits. If it fails, give clear instructions on how to reproduce the crash.

The commits that caused the problem have not yet been included in the new kernel build, which is available via rpi-update. The latest kernel that is delivered by rpi-update is built from a commit https://github.com/raspberrypi/linux/commit/971a2bb14b459819db1bda8fcdf953e493242b42. Commits that may have caused the problem are made later. I encountered this problem because I'm using Arch Linux ARM, which built the kernel from a newer commit https://github.com/raspberrypi/linux/commit/5e5024f643caa53ff59a6e00f40a9b55f7fc4e17.

See: https://github.com/raspberrypi/linux/commits/rpi-5.4.y
袘械蟹 懈屑械薪懈

There were a few changes in rpi-update, but they didn't affect the kernel image, it is still from the same version as ten days ago.

If it fails, give clear instructions on how to reproduce the crash.

Okay. I'll describe the entire process step by step.

  1. Get Raspberry Pi 4 2Gb (I think the other one will do, too) with Raspbian.
  2. Run sudo rpi-update to get the latest 5.4 kernel. It will deliver kernel from this commit: https://github.com/raspberrypi/linux/commit/971a2bb14b459819db1bda8fcdf953e493242b42
  3. Add cma=128M to /boot/cmdline.txt.
  4. Connect any USB V4L2-compatible webcam.
  5. Install ustreamer with OMX support:
$ sudo apt install libevent-dev libjpeg8-dev uuid-dev libbsd-dev
$ git clone --depth=1 https://github.com/pikvm/ustreamer
$ cd ustreamer
$ make -j5 WITH_OMX=1
  1. Try to run capture and encoding with libjpeg on CPU: ./ustreamer --device=/dev/video0 --device-timeout=5 --encoder=cpu --workers=1
    The presence Capturing ... in the output will indicate that the process is running normally. After pressing Ctrl+C the process will stop.
  2. Try to run capture and encoding using OMX: ./ustreamer --device=/dev/video0 --device-timeout=5 --encoder=omx --workers=1. Make sure that everything works as in the previous case.
  3. Compile and install the new kernel from head of the rpi-5.4.y branch (KERNEL=kernel7l, used this manual). Reboot.
  4. Repeat the test from point 5 (CPU). Everything will work fine.
  5. Repeat the test from point 6 (OMX). Get the kernel crash:
    terminal.txt. Also
    dmesg.txt.

On all previous kernel releases, there were never any problems. I built a version from the head of the branch to make sure that fresh commits didn't fix this problem.

Thanks - that helps a lot. I've completed all the instructions that don't require a web cam...

I don't know if this problem will be reproduced, for example, on OMXPlayer. I have no way to check OMXPlayer since there is no HDMI adapter (I only use RPi remotely). So it may only be specific for JPEG compression.

It seems that it is not enough to just create a OMX context and configure ports, it's need to start the compression process itself. Ustreamer starts this when it receives data from the camera. A crash occurs when some of this code is executed: https://github.com/pikvm/ustreamer/blob/master/src/encoders/omx/encoder.c#L186

Using a borrowed webcam I can repeat this, although it usually takes a few attempts on a 4GB Pi4.

This is very strange, because on my 2GB it reproduces every time. Are there any additional differences between the 2 Gb and 4 Gb models that may affect this?

I think it's just due to buffer (of some kind) placement in memory. The VPU can only see the first 1GB, a fact that VCHIQ knows and may be making use of in a (now) inappropriate way. I have all the tools I need now to understand and fix this.

Thank you for the explanation. Can I ask one question about OMX? Perhaps this is also a bug and it may also be related to memory placement - I don't know.

To make encoding in ustreamer work quickly, each frame from the camera is encoded in a separate thread. In the case of a processor, libjpeg is simply called for compression in each thread. But in the case of OMX, I create three OMX contexts and three JPEG encoding chains. Up to three inclusive - everything is working. But if I create 4, then at some point OMX will freeze and stop working. I can't even properly release its resources to start the process again.

The number 3 was selected experimentally, because I could not find any information about restrictions on the number of simultaneously running encoders. Is there a description of this? Is it a bug that after 3x OMX freezes?

Ideally, I hope that after fixing the subj bug, I could still use three encoders. Only with this configuration I can get more than 24fps JPEG for FullHD. But if this limitation is a bug and should be higher (and OMX should not hang), I am ready to help you fix it in every possible way.

In the description of the current subj bug I intentionally limited the number of workers to one: --workers=1. Please try making three later using --workers=3.

The problem turned out to be simple - one of the use of g_dev by the vchiq driver that should have been changed to g_dma_dev was overlooked, leading to the crash. It's slightly surprising that it worked at all.

With a patched kernel, testing with --workers=1 and --workers=3 was successful. I tried with --workers=4, but ustreamer/OMX is wise to that:

-- INFO  [526.948    stream] -- OMX encoder sets limit for worker threads: 3

The trivial fix is in rpi-5.4.y.

Cool!

About ustreamer: this is a limit so that you don't accidentally exceed the number of workers. Please try to recompile it with another options and repeat the test. It's easy:

$ make clean
$ git pull --rebase
$ make -j5 WITH_OMX=1 CFLAGS="-O3 -DCFG_OMX_MAX_ENCODERS=100"
$ ./ustreamer --device=/dev/video0 --device-timeout=5 --encoder=omx --workers=5

5 workers seems to work OK for me.

Hmm, it looks like something has changed since my last test. I will find out the exact reproducing conditions, and if everything works out, I'll create a new issue.

Aaand are there any limitations, even in theory? Or is OMX supposed to handle any number of encoders?

In any case, thank you for fixing this error. One less problem in my life :)

OMX is more @6by9's area than mine.

There should be no limit on the number of encoders other than if there is a limit on the number of components - I haven't looked at that bit of the IL core for a long time to know if there is or not.

The JPEG block does not support "context switches" during an encode, so once one image_encode has opened the codec it will hold it until complete. All encodes will therefore effectively be serialised, therefore I'm not sure what gain you get from having multiple workers.

@6by9 This is very strange, because my experiments have shown significant performance gains when using three workers with their own OMX contexts instead of one. I have measured timings for each stage, from getting the raw image from the camera to preparing the compressed image for transmission over the network. This clearly shows that the encoding time is not vary slightly when I use one or three workers, but the FPS with three workers is significantly higher than with one. It seems that some part of the entire chain around the encoder can work in parallel. Perhaps this is exactly copying and reading from the buffer, while the encoder itself is blocked, I don't know. Okay, now I'm wondering why this even works.

For example, the working time of a single worker when compressing an image of the size of 1920x1080 is approximately 0.04-0.05 seconds. It corresponds to 20 FPS. In particular, this is confirmed by calculations: 1 / 0.04 = 20. In other words, you can see that the code is optimized and is close to producing the maximum possible output for a single worker.

Next, if I run 5 workers, the working time of one of them will increase quite a bit: 0.05-0.06 seconds. Taking into account the overhead and the operation of the image smoothness correction algorithm (blah-blah-blah, my local math), I get as much as 40 FPS.

To make it clearer how it works: while one worker compresses one frame, the next free worker takes a new frame. And so on. Some of the raw frames are skipped by the workers so that the image is more smoothed. this is provided dynamically by the algorithms. That is, due to a slight (tenths of a second) delay, I can provide more FPS.

That sounds like you're keeping the hardware fully utilised by another job ready immediately when the first one completes.

The same result should be possible with one component and multiple input and output buffers. As long as the next buffer is already queued when the first completes, then the hardware remains fully utilised. You will want larger than default buffers on both sides so that there is no blocking within the encode, and your code needs to look out for the OMX_BUFFERFLAG_ENDOFFRAME flag (which you also want to add to the input buffer).

The same result should be possible with one component and multiple input and output buffers.

Do you mean that the main overhead costs go to OMX_FillThisBuffer() and OMX_EmptyThisBuffer()? This is the only explanation that comes to mind, given that the other timings are negligible, and the total time of the encoding chain is 0.05 seconds. This includes putting data in the OMX buffer, waiting, and reading from the buffer. According to my measurements, the image capture time from the source (TC358743) is too short and the calculations confirm this. So my only hypothesis is that the OMX buffer. I think I'll check it out.

You will want larger than default buffers on both sides so that there is no blocking within the encode, and your code needs to look out for the OMX_BUFFERFLAG_ENDOFFRAME flag (which you also want to add to the input buffer).

That's exactly what I'm doing :) I wrote code for one worker, debugged it, and then just out of curiosity, I ran several (thanks to the architecture, it was easy) and was extremely surprised by the results. It was then that I came across the restriction of 3 simultaneous contexts, which I described (apparently, since then the firmware has changed and it was a bug) and began to use this feature as it is.

The main issue is that typically nBufferCountActual is set to 1 to match nBufferCountMin. That means that you have a single buffer that has to be shuttled back to the ARM for filling before it can be resubmitted to the hardware.
If nBufferCountActual is set to say 3, then the ARM can be filling one whilst the JPEG encoder is processing another, and a third is potentially in transit between the two. You obviously then need to be a bit smarter around the buffer handling than a single input_buffer and output_buffer pointer.

This is doubly the case with the output port where the default buffer size is only 80kB, so you're likely to end up with multiple buffer round trips to complete the image (unless they are very heavily compressed). Increase that buffer size to something more closely matching your output image size and you'll save time there. I don't see anything in _omx_setup_output that alters the buffer size or count.
The hardware has to pause whenever there is insufficient output buffer to write the encoded data to. If you want maximum throughput, then you really want to avoid that.

I'll add that none of the 4 input formats that you support are actually fully native encodings. They all require an internal conversion first to a 3 plane YUV format.

Thank you for the explanation, and for looking at my code. I really appreciate it. Now it makes sense to me. There is a field for experiments. The only strange thing is the size of the output buffer. I don't really change it anywhere, but I checked it after configuring portdef and its size is 2088960 bytes. This was the first optimization from your list that I wanted to check out, and I was very surprised that JPEG subtraction occurs in a single iteration. It seems that OMX configures the appropriate buffer size itself. Is this normal behavior?

I'm surprised too as I'd had one look through the code and confirmed that the original code path set it to 80kB.

What I'd missed was that someone had added a hack such that bBufferSize gets set to
max(input->sPortDef.format.image.nFrameWidth * input->sPortDef.format.image.nSliceHeight, 80<<10)
due to "The smaller memory buffer breaks xbmc use of jpeg encode. Use the old scheme for now".

So instead of fixing their client they hacked the firmware :-(

Anyway, that answers the question of why you get a bigger buffer size. Presumably your image is 1920x1080.

Exactly. Also probably for portability reasons I should increase the buffer manually.

Do you really think that using multiple buffers will work faster than multiple contexts in different threads? I mean, much faster to cover the overhead of switching threads. It seems like a pretty substantial reworking of the existing logic, but even if I use three buffers, can I get the encoding time of the one image less than 0.05 seconds?

I'm asking because you might have some benchmarks or something. The benchmark I found many years ago uses only one buffer.

A 1080p frame taking 50ms? That sounds a fair amount higher than I'd expect, but normally I don't have to factor in the conversion from YUYV. Converting from RGB will be a fair amount slower as it is a software conversion to YUV. The TC358743 will happily produce either format (BGR3 or UYVY).

The JPEG hardware block itself should be running at at least 200MPix/s, so sub 20ms for a 1080p frame.

The difference between having multiple buffers on a single component vs mulitple components is likely to be relatively small. Context switches within the firmware are pretty efficient.

Yes, 1080p...

Wait, I'm already using UYVY. I take it from TC358743 and send it straight to OMX. Is it somewhere being transcoded into something else by the processor?

I also measured how long it takes for each step in the execution of the function omx_encoder_compress_buffer(). 10ms tooks memcpy to the OMX input buffer and other 10ms more to OMX_EmptyThisBuffer() call. I only make one call to each function for the input image size of 4177920 bytes. Okay, that's four megabytes per frame, quite a lot of space. I think this is the reason why processing the entire frame including the encoding stage takes 50ms.

30ms is slightly different from your predicted 20ms, and another 20ms is spent on input buffer transfer. The question is, does it really have to work at this speed?

OK, I couldn't tell from the code what format you were going to be using. UYVY is the best you'll do in this case.

10ms for a memcpy sounds quite high for copying 4MB of data. Annoyingly with IL you also need to memcpy from ARM memory to VPU memory, although that is done with DMA.
As you only have one buffer flying around, if you rejig the buffer allocation and enabling of the input port to be in omx_encoder_compress_buffer and use OMX_UseBuffer with the existing buffer pointer, then you can avoid that userspace copy.
You will need to enable and disable the input port within omx_encoder_compress_buffer as IL doesn't complete the state transition until the buffers are allocated.

If you switched to using MMAL, then you can avoid both copies.

  • V4L2 supports VIDIOC_EXPBUF to export a buffer as a dmabuf.
  • VCSM can import a dmabuf in, and map it into the VPU relocatable heap.
  • MMAL supports MMAL_PARAMETER_ZERO_COPY being set on a port to pass a buffer through to the VPU without the copy, by passing in a relocatable heap handle.

I do a similar thing within yavta which hooks V4L2 into the MMAL ISP component for display and rendering.

This discussion has wandered a long way off topic, but the original issue is now fixed in the current rpi-update firmware.

@6by9 Thanks for the tips, I will try it.

@pelwell Sorry :) Thank you again for the fix.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

wudo94 picture wudo94  路  5Comments

Nuntis-Spayz picture Nuntis-Spayz  路  5Comments

awlx picture awlx  路  4Comments

mohmedelwany picture mohmedelwany  路  5Comments

Joulinar picture Joulinar  路  5Comments