I am working on a project that takes PDFs as streams downloaded from Azure Blob Storage. I need PyMuPDF to open the stream and read content just like a normal file. Is there a way to achieve this? For example, I have a test case here using PyPDF2:
import PyPDF2
from io import BytesIO
# blob_client is assumed to be an existing azure.storage.blob BlobClient for one PDF
stream_buffer = BytesIO()
with stream_buffer as download_stream:
    # download the PDF from Azure into the in-memory stream
    blob_client.download_blob(max_concurrency=3).readinto(download_stream)
    # read while the buffer is still open; leaving the with block closes it
    fileReader = PyPDF2.PdfFileReader(stream_buffer)  # working with the stream
    print(fileReader.numPages)  # number of pages in the PDF
I also took a look at the official doc here about opening a document, which states that the input is an actual file. Is there a parameter I could pass so that PyMuPDF recognizes the input as a stream? Or is this something I need to handle before feeding documents to PyMuPDF, e.g. by calling getbuffer() or getvalue() from the io library? I also found an issue against an old version of PyMuPDF from 2016 here that describes a similar problem, and I wonder whether anything has changed since then.
Thanks for your valuable input and suggestions.
The documentation says this:
Overview of possible forms (using the open synonym of Document):
>>> # from a file
>>> doc = fitz.open("some.pdf")
>>> doc = fitz.open("some.file", None, "pdf") # copes with wrong extension
>>> doc = fitz.open("some.file", filetype="pdf") # copes with wrong extension
>>>
>>> # from memory
>>> doc = fitz.open("pdf", mem_area)
>>> doc = fitz.open(None, mem_area, "pdf")
>>> doc = fitz.open(stream=mem_area, filetype="pdf")
>>>
>>> # new empty PDF
>>> doc = fitz.open()
>>>
"From memory" seems to be what you are looking for. mem_area may be one of the Python types bytearray, bytes, or io.BytesIO.
Thanks for such a quick reply. Please allow me to test it and get back to you ASAP.
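To make the "from memory" forms above concrete, here is a minimal sketch rather than an official recipe. The helper names as_pdf_bytes and count_pages are made up for illustration; only fitz.open(stream=..., filetype="pdf") comes from the documentation quoted above. Converting to bytes first is optional and is just one way to handle all three accepted types uniformly.

```python
from io import BytesIO


def as_pdf_bytes(mem_area):
    """Return the raw bytes held by a bytes, bytearray, or BytesIO object.

    Hypothetical helper, not part of PyMuPDF.
    """
    if isinstance(mem_area, BytesIO):
        # getvalue() returns the whole buffer regardless of the read position.
        return mem_area.getvalue()
    if isinstance(mem_area, (bytes, bytearray)):
        return bytes(mem_area)
    raise TypeError(f"unsupported memory area type: {type(mem_area).__name__}")


def count_pages(mem_area):
    """Open an in-memory PDF with PyMuPDF and return its number of pages."""
    import fitz  # PyMuPDF; imported here so as_pdf_bytes works without it

    doc = fitz.open(stream=as_pdf_bytes(mem_area), filetype="pdf")
    try:
        return len(doc)
    finally:
        doc.close()
```

With a buffer filled as in the question, count_pages(stream_buffer) would play the role of fileReader.numPages, provided it is called before the buffer is closed.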
It worked. I should have read the docs more carefully:
stream (bytes,bytearray,BytesIO) –
A memory area containing a supported document. Its type must be specified by either filename or filetype.
(Changed in version 1.14.13) io.BytesIO is now also supported.
I am closing this issue now and thanks again for your quick help.
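For anyone landing here later, this is roughly how the download code from the question could be combined with the stream parameter. Treat it as a sketch under assumptions: download_to_buffer and page_count_from_blob are hypothetical names, blob_client is assumed to be an azure.storage.blob BlobClient pointing at a PDF, and passing a BytesIO directly relies on PyMuPDF 1.14.13 or later, as noted in the quoted docs.

```python
from io import BytesIO


def download_to_buffer(blob_client, max_concurrency=3):
    """Download a blob into an in-memory buffer and rewind the buffer.

    Hypothetical helper; blob_client only needs a download_blob() method
    whose result offers readinto(), as in the question's code.
    """
    buffer = BytesIO()
    blob_client.download_blob(max_concurrency=max_concurrency).readinto(buffer)
    buffer.seek(0)  # rewind so later readers start at the beginning
    return buffer


def page_count_from_blob(blob_client):
    """Return the number of pages of the PDF stored in the given blob."""
    import fitz  # PyMuPDF; imported here so the download helper works alone

    buffer = download_to_buffer(blob_client)
    # filetype is required because a memory area carries no file extension
    doc = fitz.open(stream=buffer, filetype="pdf")
    try:
        return len(doc)
    finally:
        doc.close()
```

On older PyMuPDF versions, passing buffer.getvalue() instead of buffer should work as well, since plain bytes are accepted for stream.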