Hi JorjMcKie,
I ran into a problem when extracting an image bbox with page.getImageBbox() from this PDF.
File "D:\89_Program_Files\Python368\lib\site-packages\fitz\fitz.py", line 4791, in getImageBbox
raise ValueError("unsupported image item")
ValueError: unsupported image item
What I did:
import fitz
doc = fitz.open("test.pdf")
page = doc[0]
# I can get the image from page
items = page.getImageList(full=True)
print(items)
# [(8, 0, 1600, 939, 8, 'DeviceRGB', '', 'Im0', 'FlateDecode', 7)]
# but failed to get the bbox
bbox = page.getImageBbox(items[0])
print(bbox)
# ValueError: unsupported image item
Check fitz.py line 4791:
if item[-1] != 0:
    raise ValueError("unsupported image item")
In this case, it seems the last entry, the xref item[-1] == 7, is nonzero, which caused the error.
So, how can I get the bbox of an image with item[-1] != 0? Thanks in advance.
I tried to cross-check the bbox with page.getText(), but unfortunately I can't get any image blocks. It seems this is because the image is partly outside the page.
print(page.getText('rawdict'))
# {'width': 612.0, 'height': 792.0, 'blocks': []}
As your analysis correctly showed:
Image bboxes can only be determined if the page directly displays the image. In your case, a so-called "Form XObject" is invoked (xref 7), which in turn displays an image.
You can only find out the bbox occupied by the XObject, via doc.getPageXObjectList(page.number), whose items contain a tuple for the bbox.
The main reason for this restriction is that the respective MuPDF algorithm is not bug-free. It is indeed a highly complex topic, because XObjects can be nested inside each other at arbitrary levels, each with its own transformation matrix, bbox and so on.
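To see which images on a page are affected, the image list can be split by the last entry of each item. This is only a minimal sketch: it assumes, as the check in fitz.py suggests, that the last entry is 0 when the page displays the image directly and otherwise holds the xref of the referencing XObject. The helper name split_by_referencer is hypothetical and not part of PyMuPDF.

```python
def split_by_referencer(items):
    """Split getImageList(full=True) items into (direct, via_xobject)."""
    direct, via_xobject = [], []
    for item in items:
        # Assumption: item[-1] == 0 means the page itself displays the image;
        # any other value is the xref of the XObject that displays it.
        if item[-1] == 0:
            direct.append(item)
        else:
            via_xobject.append(item)
    return direct, via_xobject


# The item reported in this thread, plus a made-up directly displayed one.
items = [
    (8, 0, 1600, 939, 8, "DeviceRGB", "", "Im0", "FlateDecode", 7),
    (9, 0, 100, 100, 8, "DeviceRGB", "", "Im1", "FlateDecode", 0),
]
direct, via_xobject = split_by_referencer(items)
print("direct:", [it[7] for it in direct])
print("via XObject:", [it[7] for it in via_xobject])
```

Only the items in the first list pass the item[-1] != 0 check of page.getImageBbox().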
Thanks for the explanation.
"You can only find out the bbox occupied by the XObject doc.getPageXObjectList(page.number)"
I still can't get the right bbox with getPageXObjectList, even when also considering page.transformationMatrix, as documented in the docs. Indeed, "It is indeed a highly complex topic".
Then I tried bypassing the item[-1] != 0 check to see what MuPDF would do. Luckily, I got a result. Maybe the "MuPDF algorithm is not bug-free", but at least it works for this case. So, as a workaround for my case, I bypass item[-1] != 0 to give MuPDF a chance. Of course, this should be wrapped in a try/except clause to be safe.
import fitz
doc = fitz.open("test.pdf")
page = doc[0]
items = page.getImageList(full=True)
# bypass `item[-1] != 0`
item = list(items[0])
item[-1]=0
bbox = page.getImageBbox(item)
print(bbox)
# Rect(57.900001525878906, 129.5078125, 688.8927612304688, 582.1100463867188)
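The try/except wrapping mentioned above could look roughly like this. It is a sketch, not a tested recipe: image_bbox_or_none is a hypothetical helper name, and which exceptions the call can raise besides the ValueError seen in this thread is an assumption.

```python
def image_bbox_or_none(page, item):
    """Try page.getImageBbox() with the referencer entry zeroed out.

    Returns the bbox, or None if the call fails.
    """
    patched = list(item)
    patched[-1] = 0  # same bypass as in the snippet above
    try:
        return page.getImageBbox(patched)
    except (ValueError, RuntimeError):
        # Assumption: failures surface as ValueError or RuntimeError.
        return None
```

With the page and items from the snippet above, the call would be image_bbox_or_none(page, items[0]). Note that a returned bbox is not guaranteed to be correct. The wrapper only guards against exceptions.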
Maybe I should follow your implicit recommendation, and no longer refuse to do it like that.
The only thing I do not want is having to deal with issues when it doesn't work correctly ... 😎.
BTW, the risk of seeing an exception is minor, which is unfortunate in this case: you will get an incorrect bbox without necessarily noticing it.
Maybe I will let it go through and warn if I detect that the image is embedded in an XObject.
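Such a warning could be sketched as follows. This is an illustration only, not what PyMuPDF does: warn_if_in_xobject is a hypothetical name, and the positions of the image name (item[7]) and the referencer xref (item[-1]) are taken from the item printed earlier in this thread.

```python
import warnings


def warn_if_in_xobject(item):
    """Warn when an image item is displayed via an XObject.

    Returns True if a warning was issued, else False.
    """
    referencer = item[-1]  # assumption: 0 means displayed by the page itself
    if referencer == 0:
        return False
    warnings.warn(
        f"image {item[7]!r} is displayed via XObject xref {referencer}; "
        "the computed bbox may be incorrect"
    )
    return True
```

A caller would invoke this before computing the bbox, so that a possibly incorrect result does not go unnoticed.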