PyMuPDF: getPageImageList lists all images in the PDF document

Created on 19 Aug 2020 · 3 Comments · Source: pymupdf/PyMuPDF

Describe the bug (mandatory)

When using doc.getPageImageList(pno), all images in the PDF document are listed, instead of only the images on that specific page. The page count itself is reported correctly after fitz.open(pdf_file).

This happens only with some of the PDF files I've tested. I have used this package without issues on many other files. Is there a way to check properties of the PDF file to understand why this happens?

Your configuration

MacOS
PyMuPDF 1.17.0: Python bindings for the MuPDF 1.17.0 library.
Version date: 2020-05-13 20:05:13.
Built for Python 3.7 on darwin (64-bit).

question resolved

All 3 comments

This may happen and is a normal phenomenon.
Some PDF creators simply list all images ever used by any page in a central, "parent" /Resources object, even if a single page does not contain all, or even any, of these images.
The documentation explains this in more detail.
If you absolutely need the sublist of images actually in use, check whether page.getImageBbox() returns an infinite rectangle for that image: an infinite rectangle means the page does not display it.
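The check described above could be sketched roughly as follows. This is a minimal illustration, not the library's recommended recipe: the helper name images_in_use is hypothetical, and it assumes the legacy camelCase API of PyMuPDF 1.17 (getPageImageList with full=True, getImageBbox, and the Rect property isInfinite).

```python
def images_in_use(doc, pno):
    """Return the image-list entries that page `pno` actually displays.

    `doc` is assumed to be an already opened PyMuPDF document.
    The helper name and structure are illustrative only.
    """
    page = doc[pno]
    used = []
    # Assumption: getImageBbox expects items produced with full=True.
    for item in doc.getPageImageList(pno, full=True):
        bbox = page.getImageBbox(item)
        # An infinite rectangle signals that the image is only listed in
        # the (possibly inherited) /Resources and is not shown on the page.
        if not bbox.isInfinite:
            used.append(item)
    return used
```

With a document opened via doc = fitz.open(pdf_file), calling images_in_use(doc, pno) should yield the filtered list for that page. Newer PyMuPDF releases use snake_case names (for example get_page_images, get_image_bbox, is_infinite), so the calls may need adjusting there.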

Page /Resources is an _inheritable_ object - see the glossary of the documentation. This feature reduces the specification effort for PDF creators.
If I remember correctly, page.cleanContents() will localize the resources so that they more precisely reflect what this page actually does. So this is another way to make sure, although it does change the page.
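A rough way to try the second suggestion is to compare the image list before and after cleaning the page. The helper name below is hypothetical, and whether the list actually shrinks depends on the behaviour recalled above, which this sketch does not verify. The change stays in memory unless the document is saved.

```python
def image_counts_around_clean(doc, pno):
    """Return (count_before, count_after) of image-list entries for page `pno`.

    `doc` is assumed to be an already opened PyMuPDF document
    (legacy 1.17 camelCase API). The helper name is illustrative only.
    """
    before = len(doc.getPageImageList(pno))
    # cleanContents() rewrites the page in memory; the file on disk is
    # untouched as long as the document is not saved afterwards.
    doc[pno].cleanContents()
    after = len(doc.getPageImageList(pno))
    return before, after
```

If the second number is smaller than the first, the extra entries presumably came from an inherited /Resources object. In newer PyMuPDF releases the corresponding method is named clean_contents.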

Thank you for the quick response. I'll try both methods.
