Pymupdf: getPageImageList lists all images in the PDF document

Created on 19 Aug 2020 · 3 Comments · Source: pymupdf/PyMuPDF

Describe the bug (mandatory)

When using doc.getPageImageList(pno), all images in the PDF document are listed instead of only the images on that specific page. The number of pages reported after fitz.open(pdf_file) is correct.

This happens only with some of the PDF files I've tested. I have used this package without issues on many other files. Is there a way to check properties of the PDF file to understand why this happens?

Your configuration

MacOS
PyMuPDF 1.17.0: Python bindings for the MuPDF 1.17.0 library.
Version date: 2020-05-13 20:05:13.
Built for Python 3.7 on darwin (64-bit).

question resolved

Source

alono88

All 3 comments

This may happen and is a normal phenomenon.
Some PDF creators just decide to list all images ever used by any page in a central, "parent" /Resources object, even if any single page may not contain all or even any of these images.
Looking at the documentation you will find more explanation on this.
If you absolutely need the sublist of images actually in use, check whether page.getImageBbox() returns an infinite rectangle for an image: if it does, the page does not display that image.
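
The check described above could look roughly like this. It is only a sketch, assuming the camelCase API of the PyMuPDF 1.17 era (getPageImageList with full=True, getImageBbox, Rect.isInfinite); newer releases use snake_case names such as get_page_images and get_image_bbox. The helper names images_in_use and demo are made up for illustration.

```python
def images_in_use(doc, pno):
    """Return only those image list items that page `pno` actually displays.

    Assumes the PyMuPDF 1.17-style API: getImageBbox() is expected to return
    an infinite rectangle for an image that is listed in the page's
    (possibly inherited) /Resources but never shown on the page.
    """
    page = doc[pno]
    used = []
    # full=True yields the item format that getImageBbox() expects
    for item in doc.getPageImageList(pno, full=True):
        bbox = page.getImageBbox(item)
        if not bbox.isInfinite:
            used.append(item)
    return used


def demo(pdf_path):
    """Hypothetical usage: compare listed vs. displayed images per page."""
    import fitz  # PyMuPDF, imported here so the helper above has no hard dependency

    doc = fitz.open(pdf_path)
    for pno in range(len(doc)):
        listed = doc.getPageImageList(pno)
        used = images_in_use(doc, pno)
        print("page %d: %d listed, %d displayed" % (pno, len(listed), len(used)))
```

If a file shows the behavior from the question, the "listed" count should be the same on every page, while the "displayed" count varies.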

JorjMcKie on 19 Aug 2020


Page /Resources is an _inheritable_ object - see the glossary of the documentation. This feature lets PDF creators specify resources once in a parent object instead of repeating them for every page.
If I remember correctly, page.cleanContents() will localize the resources and more precisely reflect what this page is actually doing. So this is another option for making sure - although it does change the page.
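
A minimal sketch of this second option, again assuming the PyMuPDF 1.17-style names (cleanContents, getPageImageList). Whether the cleaned page then lists only its own images is the expectation stated above, not something this sketch guarantees. The helper name image_list_after_clean is made up.

```python
def image_list_after_clean(doc, pno):
    """Clean the page's contents, then return its image list.

    Note: cleanContents() modifies the page object in memory. The file on
    disk stays untouched unless the document is saved afterwards.
    """
    page = doc[pno]
    page.cleanContents()  # expected to localize the page's /Resources
    return doc.getPageImageList(pno)
```

To leave the original file untouched, open it, run the helper, and simply do not save the document.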

JorjMcKie on 19 Aug 2020


Thank you for the quick response. I'll try both methods.

alono88 on 19 Aug 2020
