Pymupdf: Question / Comment: EPUB: extracting blocks without experiencing page breaks

Created on 25 Aug 2020 · 2 Comments · Source: pymupdf/PyMuPDF

Consider the following code.

import fitz


if __name__ == '__main__':
    filename = "Test.epub"
    doc = fitz.open(filename)
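    for page in doc:
        # Each block is a tuple: (x0, y0, x1, y1, text, block_no, block_type)
        for b in page.getText("blocks"):
            print("Block", b[-2])  # block number within the page
            print(b[4])  # block text
            print("-" * 50)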

This lets me extract the blocks page by page. I suppose the page breaks are virtual, since EPUB is a reflowable document format. I would like to extract all blocks from the Test.epub document without encountering page breaks.

Comment: The page breaks are a problem for me because they can split a block (actually a paragraph in the original text) into two pieces. I would then need a postprocessing step to merge such blocks again. It would be much simpler to get rid of the page breaks from the very beginning.

question resolved

andrei-volkau

Most helpful comment

@JorjMcKie, your approach is working! I tested it. Thank you very much! I am closing this question.

andrei-volkau on 25 Aug 2020

All 2 comments

actual page breaks are virtual since EPUB is a reflowable document

That's correct.

I would like to extract all blocks from Test.epub document without encountering page breaks.

This is not achievable, except with your own code. I don't know whether it is possible to re-layout the document with a really giant page size, so big that the document then has only one page ... virtually.
Basically something along these lines:

Get the first page's dimensions.
Get the number of pages.
Call doc.layout() with a rectangle whose height is doc[0].rect.height * doc.pageCount.
Then run your extraction.
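
The steps above could be sketched roughly as follows. This is only an illustration under some assumptions, not a verified recipe: the helper names giant_page_size and extract_blocks_without_page_breaks are made up for this example, len(doc) is used in place of doc.pageCount, and the text extraction call falls back to page.getText on older PyMuPDF versions where get_text does not exist. Whether a very tall page is accepted for every EPUB is not guaranteed, and the re-laid-out document may still contain more than one page, so the sketch loops over all pages anyway.

```python
def giant_page_size(width, height, page_count):
    """Return (width, height) of a single page tall enough for page_count pages."""
    # Guard against an empty document so the height never becomes zero.
    return width, height * max(page_count, 1)


def extract_blocks_without_page_breaks(filename):
    """Re-layout a reflowable document onto one very tall page, then collect blocks."""
    import fitz  # PyMuPDF; imported here so giant_page_size works without it

    doc = fitz.open(filename)
    first = doc[0].rect  # dimensions of the first page in the current layout
    width, height = giant_page_size(first.width, first.height, len(doc))

    # Only has an effect on reflowable formats such as EPUB.
    doc.layout(fitz.Rect(0, 0, width, height))

    blocks = []
    for page in doc:  # ideally a single page, but this is not assumed
        # Newer releases call this get_text; older ones only have getText.
        get_text = getattr(page, "get_text", None) or page.getText
        blocks.extend(get_text("blocks"))
    return blocks


if __name__ == "__main__":
    for b in extract_blocks_without_page_breaks("Test.epub"):
        print("Block", b[-2])
        print(b[4])
        print("-" * 50)
```

Keeping the size calculation in its own small function makes it easy to try a different multiplier if the layout still breaks pages.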