When using doc.getPageImageList(pno), all images in the PDF document are listed, instead of only the images on that specific page. The page count itself is fine: fitz.open(pdf_file) reports the correct number of pages in the PDF file.
This happens only with some of the PDF files I've tested. I have used this package without issues on many other files. Is there a way to check properties of the PDF file to understand why this happens?
MacOS
PyMuPDF 1.17.0: Python bindings for the MuPDF 1.17.0 library.
Version date: 2020-05-13 20:05:13.
Built for Python 3.7 on darwin (64-bit).
This may happen and is a normal phenomenon.
Some PDF creators just decide to list all images ever used by any page in a central, "parent" /Resources object, even if any single page may not contain all or even any of these images.
Looking at the documentation you will find more explanation on this.
If you absolutely need the sublist of images actually in use, check if page.getImageBbox() returns an infinite rectangle for that image.
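A minimal sketch of that check, assuming the 1.17.0 method names used in this thread (`getPageImageList`, `getImageBbox`, `Rect.isInfinite`) and that the list is requested with `full=True` so each item can be passed to `getImageBbox()`. The function names `images_shown_on_page` and `report` are made up for illustration:

```python
def images_shown_on_page(doc, pno):
    """Return only the image list entries that page `pno` actually displays."""
    page = doc[pno]
    shown = []
    # Assumption: full=True yields items in the form getImageBbox() accepts.
    for item in doc.getPageImageList(pno, full=True):
        bbox = page.getImageBbox(item)
        # An infinite rectangle signals "listed in /Resources, but not shown".
        if not bbox.isInfinite:
            shown.append(item)
    return shown


def report(pdf_file):
    """Print, per page, how many images are listed versus actually shown."""
    import fitz  # PyMuPDF; imported here so the helper above stays importable

    doc = fitz.open(pdf_file)
    for pno in range(len(doc)):
        listed = doc.getPageImageList(pno)
        shown = images_shown_on_page(doc, pno)
        print(pno, len(listed), len(shown))
```

Calling `report("some.pdf")` on one of the affected files should show whether the listed and shown counts differ per page.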
Page /Resources is an _inheritable_ object - see the glossary of the documentation. Inheritance lets PDF creators specify resources once on a parent node of the page tree instead of repeating them for every page.
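To see whether a given page relies on inheritance, one rough check is to look at the page's own object definition. This is only a sketch, assuming `doc.xrefObject()` and `page.xref` as available in this PyMuPDF version; the helper name `page_has_own_resources` is hypothetical:

```python
def page_has_own_resources(doc, pno):
    """Return True if the page object itself contains a /Resources key."""
    page = doc[pno]
    # Source text of the page's PDF object, e.g. "<< /Type /Page ... >>".
    obj_source = doc.xrefObject(page.xref)
    # If the key is missing, the page inherits /Resources from a parent node.
    return "/Resources" in obj_source
```

Note that a `True` result does not prove the list is page-specific: the page's /Resources entry may itself be a reference to an object shared by several pages.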
If I remember correctly, page.cleanContents() will localize the resources and more precisely reflect what this page is actually doing. So this is another option for making sure - although it does change the page.
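A sketch of this second option, under the same "if I remember correctly" caveat: it assumes that the image list requested after `cleanContents()` reflects the localized resources. The helper name `images_after_clean` is made up:

```python
def images_after_clean(doc, pno):
    """Clean the page's contents, then return its image list."""
    page = doc[pno]
    # This rewrites the page in memory; nothing is written to disk
    # unless the document is saved afterwards.
    page.cleanContents()
    return doc.getPageImageList(pno)
```

If the original file must stay untouched, simply do not save the document after calling this.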
Thank you for the quick response. I'll try both methods.