PyMuPDF: Question / Comment: Error when getting bbox with page.getImageBbox()

Created on 22 Oct 2020 · 4 Comments · Source: pymupdf/PyMuPDF

Hi JorjMcKie,

I ran into a problem when extracting an image bbox with page.getImageBbox() from this PDF.

File "D:\89_Program_Files\Python368\lib\site-packages\fitz\fitz.py", line 4791, in getImageBbox
    raise ValueError("unsupported image item")
ValueError: unsupported image item

What I did:

import fitz

doc = fitz.open("test.pdf")
page = doc[0]

# I can get the image from page
items = page.getImageList(full=True)
print(items)
# [(8, 0, 1600, 939, 8, 'DeviceRGB', '', 'Im0', 'FlateDecode', 7)]

# but failed to get the bbox
bbox = page.getImageBbox(items[0])
print(bbox)
# ValueError: unsupported image item

Check fitz.py line 4791:

if item[-1] != 0:
    raise ValueError("unsupported image item")

In this case, it seems that the last element of the item, item[-1] == 7 (an xref), is non-zero, and that triggers the error.

So how can I get the bbox of an image whose item[-1] != 0? Thanks in advance.
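
The check quoted above can be mirrored in user code to see in advance which entries getImageBbox would reject. This is only a minimal sketch: it assumes, as the thread suggests, that with full=True the last element of each getImageList entry is the xref of the object that references the image, and 0 when the page references it directly. The helper name split_image_items is hypothetical.

```python
def split_image_items(items):
    """Split getImageList(full=True) entries into (direct, nested) lists."""
    direct, nested = [], []
    for item in items:
        # Assumption: item[-1] is the referencing xref; 0 means the page itself.
        if item[-1] == 0:
            direct.append(item)
        else:
            nested.append(item)
    return direct, nested


# The entry printed in the question above.
items = [(8, 0, 1600, 939, 8, 'DeviceRGB', '', 'Im0', 'FlateDecode', 7)]
direct, nested = split_image_items(items)
for item in nested:
    print("image xref", item[0], "is referenced by xref", item[-1])
```

For the file in question this reports the single image as nested, referenced by xref 7, which is exactly the case the library refuses.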

question resolved

All 4 comments

I tried to cross-check the bbox with page.getText(), but unfortunately I can't get any image blocks. It seems this is because the image is partly outside the page.

print(page.getText('rawdict'))
# {'width': 612.0, 'height': 792.0, 'blocks': []}
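
For reference, the cross-check can be written as a small filter over the dictionary returned by page.getText. This is a sketch that relies on PyMuPDF's convention that image blocks carry type 1 (text blocks carry type 0); the helper name image_blocks is hypothetical.

```python
def image_blocks(page_dict):
    """Return the image blocks of a page.getText('dict' / 'rawdict') result."""
    # Text blocks have "type" 0, image blocks have "type" 1.
    return [b for b in page_dict.get("blocks", []) if b.get("type") == 1]


# The dictionary printed above: no blocks at all, hence no image blocks.
page_dict = {'width': 612.0, 'height': 792.0, 'blocks': []}
print([b["bbox"] for b in image_blocks(page_dict)])
```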

As your analysis correctly showed:
Image bboxes can only be determined if the page displays the image directly. In your case, a so-called "Form XObject" is invoked (xref 7), which in turn displays the image.
You can only find out the bbox occupied by the XObject, via doc.getPageXObjectList(page.number), which contains a tuple for the bbox.
The main reason for this restriction is that the respective MuPDF algorithm is not bug-free. It is indeed a highly complex topic, because XObjects can be nested inside each other at arbitrary levels, each with its own transformation matrix, bbox and whatnot.
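
To illustrate why nesting makes this hard, here is a simplified sketch of the underlying arithmetic, not the MuPDF algorithm. In PDF, an image is painted into the unit square, which is then mapped by the current transformation matrix; with nested XObjects, the matrices of all levels have to be concatenated, innermost first. The sketch deliberately ignores clipping by each form's own bbox and the flip between PDF's bottom-left origin and PyMuPDF's top-left coordinates, and all numbers are made up.

```python
def concat(m1, m2):
    """Return the matrix that applies m1 first, then m2 (PDF order: a b c d e f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def transform(point, m):
    """Map a point (x, y) through the matrix m."""
    x, y = point
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def image_bbox(matrices):
    """Bbox of the unit square after applying the matrices, innermost first."""
    total = (1, 0, 0, 1, 0, 0)  # identity
    for m in matrices:
        total = concat(total, m)
    corners = [transform(p, total) for p in ((0, 0), (1, 0), (0, 1), (1, 1))]
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical values: the form scales the image to 100 x 50,
# the page then scales the form by 2 and shifts it by (10, 20).
inner = (100, 0, 0, 50, 0, 0)
outer = (2, 0, 0, 2, 10, 20)
print(image_bbox([inner, outer]))  # (10, 20, 210, 120)
```

Every additional nesting level adds another matrix to the list, and a mistake at any level silently shifts the final rectangle.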

Thanks for the explanation.

You can only find out the bbox occupied by the XObject, via doc.getPageXObjectList(page.number)

I still can't get the right bbox with getPageXObjectList, even after taking page.transformationMatrix into account as described in the docs. Indeed, "it is indeed a highly complex topic".

Then I tried bypassing the item[-1] != 0 check to see what MuPDF would return, and luckily I got a result. Maybe the "MuPDF algorithm is not bug-free", but at least it works for this case. So the workaround for my case is to bypass item[-1] != 0 and give MuPDF a chance, wrapped in a try/except clause for safety, of course.

import fitz

doc = fitz.open("test.pdf")
page = doc[0]
items = page.getImageList(full=True)

# bypass the `item[-1] != 0` check: copy the tuple into a list and zero its last entry
item = list(items[0])
item[-1] = 0

bbox = page.getImageBbox(item)
print(bbox)
# Rect(57.900001525878906, 129.5078125, 688.8927612304688, 582.1100463867188)
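
The try/except wrapping mentioned above could look roughly like this. It is a sketch of the workaround from this comment, not an endorsed API: the helper name image_bbox_with_fallback is hypothetical, and the bbox returned on the bypass path may be wrong without any exception being raised.

```python
def image_bbox_with_fallback(page, item):
    """Return (bbox, bypassed) for an entry of page.getImageList(full=True).

    bypassed is True when the item[-1] != 0 check had to be sidestepped,
    in which case the bbox should be treated as unverified.
    """
    try:
        return page.getImageBbox(item), False
    except ValueError:
        # Rejected item: retry with the last entry zeroed, as in the workaround.
        patched = list(item)
        patched[-1] = 0
    try:
        return page.getImageBbox(patched), True
    except ValueError:
        return None, True


# Usage with the objects from the snippet above:
#   bbox, bypassed = image_bbox_with_fallback(page, items[0])
#   if bypassed:
#       print("unverified bbox:", bbox)
```

Returning the bypassed flag alongside the bbox keeps the caller aware that the result comes from the code path the library normally refuses.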

Maybe I should follow your implicit recommendation and no longer refuse to do it like that.
The only thing I do not want to happen is having to deal with issues when it didn't work correctly ... 😎.
BTW, the risk of seeing an exception is minor, which is unfortunate in this case: you will get an incorrect bbox without necessarily noticing it.
Maybe I will let it go through and warn if I detect that the image is embedded in an XObject.
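
Since a wrong bbox would not raise, a cheap plausibility check on the caller's side may at least catch gross errors. The sketch below works on plain (x0, y0, x1, y1) tuples; the helper name check_bbox is hypothetical, and passing the check does not prove the bbox is correct.

```python
import math


def check_bbox(bbox, page_rect):
    """Return a list of findings for bbox (x0, y0, x1, y1) relative to page_rect."""
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page_rect
    problems = []
    if not all(math.isfinite(v) for v in bbox):
        problems.append("non-finite coordinate")
    elif x1 <= x0 or y1 <= y0:
        problems.append("empty or inverted rectangle")
    elif x1 <= px0 or x0 >= px1 or y1 <= py0 or y0 >= py1:
        problems.append("entirely outside the page")
    elif x0 < px0 or y0 < py0 or x1 > px1 or y1 > py1:
        problems.append("partly outside the page")
    return problems


# Values reported earlier in this thread: the bbox from the workaround
# and the 612 x 792 page size from the rawdict output.
bbox = (57.900001525878906, 129.5078125, 688.8927612304688, 582.1100463867188)
page_rect = (0, 0, 612.0, 792.0)
print(check_bbox(bbox, page_rect))  # ['partly outside the page']
```

For this file the finding agrees with the earlier observation that the image is partly outside the page, so it is a hint to inspect rather than a sign of failure.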
