PyMuPDF: page.getText for "dict" or "rawdict" with any flags returns only text blocks.

Created on 6 Jul 2020 · 3 Comments · Source: pymupdf/PyMuPDF

Describe the bug (mandatory)

Page 0 of pct_481288.pdf contains one embedded image, but page.getText("dict", flags=fitz.TEXT_INHIBIT_SPACES) returns only the text blocks.

To Reproduce (mandatory)

pct_481288.pdf

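    import fitz  # PyMuPDF

    path  = './pct_481288.pdf'
    doc   = fitz.open(path)
    pg_0  = doc[0]

    # the same happens with any explicit flags value
    pg_dc = pg_0.getText("dict", flags=fitz.TEXT_INHIBIT_SPACES)

    # print image blocks only (type 1 = image, type 0 = text)
    for blk in pg_dc["blocks"]:
        if blk["type"] == 1:
            print(blk)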

Expected behavior (optional)

1 image block should be printed.

Your configuration (mandatory)

[GCC 8.3.0] linux
PyMuPDF 1.16.10: Python bindings for the MuPDF 1.16.0 library.
Version date: 2019-12-21 07:31:32.
Built for Python 3.7 on linux (64-bit).

Additional context (optional)

Possible indentation issue at lines 465 and 466, which causes the TEXT_PRESERVE_IMAGES line to be skipped when flags is not None.
https://github.com/pymupdf/PyMuPDF/blob/4546862accd82f3b746578c2d8bab227229f6327/fitz/utils.py#L463-L466

bug resolved

Source

tanaskumar

All 3 comments

I understand. It's very early in the morning here and I am still busy with my first cup of coffee, so maybe I am overlooking something.
But the intention is to _allow suppressing images_ for outputs supporting them (because images are so damn large). So there is a default flag combination for each output type, which is used if flags=None. If flags is not None, then this means the developer knows what they are doing and no further logic is applied: the value is passed through as given, so images only appear if TEXT_PRESERVE_IMAGES is part of it.
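To make the flag logic described above concrete, here is a minimal sketch in plain Python. It is not the code from fitz/utils.py: `resolve_flags`, `keeps_images` and the `IMAGE_CAPABLE` set are hypothetical names made up for illustration, and the constants are local stand-ins assumed to carry the same bit values as `fitz.TEXT_PRESERVE_LIGATURES`, `fitz.TEXT_PRESERVE_WHITESPACE`, `fitz.TEXT_PRESERVE_IMAGES` and `fitz.TEXT_INHIBIT_SPACES`.

```python
# Local stand-ins for PyMuPDF's fitz.TEXT_* constants (bit values assumed
# to match MuPDF's); real code would use fitz.TEXT_* directly.
TEXT_PRESERVE_LIGATURES = 1
TEXT_PRESERVE_WHITESPACE = 2
TEXT_PRESERVE_IMAGES = 4
TEXT_INHIBIT_SPACES = 8

# Output types that can carry image blocks (assumed list, for illustration).
IMAGE_CAPABLE = {"dict", "rawdict", "json", "rawjson", "html", "xhtml"}


def resolve_flags(output, flags=None):
    """Hypothetical helper: apply a per-output default only when flags is None."""
    if flags is not None:
        return flags  # explicit value is used verbatim, nothing is added
    default = TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE
    if output in IMAGE_CAPABLE:
        default |= TEXT_PRESERVE_IMAGES
    return default


def keeps_images(flags):
    """True if the image bit is set in the given flag combination."""
    return bool(flags & TEXT_PRESERVE_IMAGES)


default_flags = resolve_flags("dict")                        # flags=None
custom_flags = resolve_flags("dict", TEXT_INHIBIT_SPACES)    # as in the report
combined_flags = resolve_flags("dict", TEXT_PRESERVE_IMAGES | TEXT_INHIBIT_SPACES)

print(keeps_images(default_flags), keeps_images(custom_flags), keeps_images(combined_flags))
# prints: True False True
```

Under these assumptions, the call from the report drops the image block because its flags value lacks the image bit. With PyMuPDF itself, OR-ing the bit in should bring it back: pg_0.getText("dict", flags=fitz.TEXT_PRESERVE_IMAGES | fitz.TEXT_INHIBIT_SPACES).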