Pymupdf: Question / Comment: Open PDF files as streams with fitz.open()

Created on 24 Aug 2020 · 3 Comments · Source: pymupdf/PyMuPDF

I am working on a project that takes PDFs as streams downloaded from Azure Blob Storage. I need PyMuPDF to open the stream and read content just like a normal file. Is there a way to achieve this? For example, I have a test case here using PyPDF2:

from io import BytesIO

import PyPDF2

# blob_client is assumed to be an Azure BlobClient created elsewhere (not shown)
stream_buffer = BytesIO()
with stream_buffer as download_stream:
    # download the PDF from Azure into the in-memory stream
    blob_client.download_blob(max_concurrency=3).readinto(download_stream)
    fileReader = PyPDF2.PdfFileReader(stream_buffer)  # read directly from the stream
    print(fileReader.numPages)  # number of pages in the PDF

I also looked at the official documentation on opening a document, which states that the input is an actual file. Is there a parameter I could pass so that PyMuPDF treats the input as a stream? Or is this something I need to handle before feeding documents to PyMuPDF, e.g. by calling getbuffer() or getvalue() from the io library? I also found a similar issue from 2016 against an older version of PyMuPDF and wonder whether anything has changed since then.

Thanks for your valuable input and suggestions.

question resolved

liamsuma

All 3 comments

The documentation says this:
Overview of possible forms (using the open synonym of Document):

>>> # from a file
>>> doc = fitz.open("some.pdf")
>>> doc = fitz.open("some.file", None, "pdf") # copes with wrong extension
>>> doc = fitz.open("some.file", filetype="pdf") # copes with wrong extension
>>>
>>> # from memory
>>> doc = fitz.open("pdf", mem_area)
>>> doc = fitz.open(None, mem_area, "pdf")
>>> doc = fitz.open(stream=mem_area, filetype="pdf")
>>>
>>> # new empty PDF
>>> doc = fitz.open()
>>>

"From memory" seems to be what you are looking for. mem_area may be one of the Python types bytearray, bytes, or io.BytesIO.
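
To make the "from memory" forms concrete, here is a minimal sketch, assuming PyMuPDF is installed and importable as fitz. The helper names as_pdf_bytes and count_pages are hypothetical and not part of PyMuPDF. Converting to bytes first is only a precaution: io.BytesIO is accepted directly since version 1.14.13, while plain bytes should also work on older versions.

```python
from io import BytesIO


def as_pdf_bytes(mem_area):
    """Return the raw bytes held by a bytes, bytearray or BytesIO object.

    Hypothetical helper, not part of PyMuPDF.
    """
    if isinstance(mem_area, BytesIO):
        # getvalue() returns the whole buffer regardless of the read position
        return mem_area.getvalue()
    if isinstance(mem_area, (bytes, bytearray)):
        return bytes(mem_area)
    raise TypeError("expected bytes, bytearray or io.BytesIO")


def count_pages(mem_area):
    """Open an in-memory PDF with PyMuPDF and return its number of pages.

    Hypothetical helper; assumes the data really is a PDF.
    """
    import fitz  # PyMuPDF, imported here so the helper above works without it

    # filetype must be given because there is no file name to infer it from
    doc = fitz.open(stream=as_pdf_bytes(mem_area), filetype="pdf")
    try:
        return len(doc)  # len() of a document is its page count
    finally:
        doc.close()
```

With the Azure example from the question, this would be called as count_pages(stream_buffer) inside the with block, after readinto() has filled the buffer and before the buffer is closed.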

JorjMcKie on 24 Aug 2020

Thanks for such a quick reply. Please allow me to test it and get back to you ASAP.

liamsuma on 24 Aug 2020

It worked, and I should have read the docs more carefully.

stream (bytes,bytearray,BytesIO) –
A memory area containing a supported document. Its type must be specified by either filename or filetype.
(Changed in version 1.14.13) io.BytesIO is now also supported.

I am closing this issue now and thanks again for your quick help.

liamsuma on 24 Aug 2020
