Consider the following code.
import fitz

if __name__ == '__main__':
    filename = "Test.epub"
    doc = fitz.open(filename)
    for page in doc:
        for b in page.getText("blocks"):
            print("Block", b[-2])
            print(b[4])
            print("-" * 50)
So I am able to extract the blocks for each page. I suppose that the actual page breaks are virtual since EPUB is a reflowable document format. I would like to extract all blocks from Test.epub document without encountering page breaks.
Comment: The page breaks are a problem for me because they can split a block (really a paragraph of the original text) into two pieces, so I would need postprocessing to merge such blocks again. It would be much simpler to get rid of the page breaks at the very beginning.
actual page breaks are virtual since EPUB is a reflowable document
That's correct.
I would like to extract all blocks from Test.epub document without encountering page breaks.
This is not achievable, except with your own code. I don't know whether something is possible by re-laying out the document with a really giant page size, so big that the document then has only one page ... virtually.
Basically something along this idea:
doc.layout() with a rectangle having a height of doc[0].rect.height * doc.pageCount.
@JorjMcKie, your approach is working! I tested it. Thank you very much! I am closing this question.
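The re-layout idea suggested above could be sketched roughly as follows. This is only a sketch under assumptions, not a confirmed recipe: the helper names single_page_size and extract_blocks_without_page_breaks are made up for illustration, and the snake_case names (page_count, get_text, is_reflowable) assume a recent PyMuPDF release, whereas older releases use the camelCase spellings from the question. The loop over pages is kept because the layout engine may still produce more than one page, for example at chapter boundaries.

```python
def single_page_size(page_width, page_height, page_count):
    """Size of one tall page: original width, height of all pages stacked.

    Hypothetical helper; it only does the arithmetic from the answer
    (doc[0].rect.height * doc.pageCount).
    """
    return page_width, page_height * max(page_count, 1)


def extract_blocks_without_page_breaks(filename):
    """Re-layout a reflowable document onto one very tall page, then
    return the text of every block in reading order."""
    import fitz  # PyMuPDF, imported here so the helper above has no dependency

    doc = fitz.open(filename)
    if not doc.is_reflowable:
        raise ValueError("layout() only applies to reflowable documents such as EPUB")

    first = doc[0].rect  # page rectangle under the default layout
    width, height = single_page_size(first.width, first.height, doc.page_count)
    doc.layout(width=width, height=height)  # re-paginate with the giant page size

    texts = []
    for page in doc:  # ideally one page, but do not rely on it
        for b in page.get_text("blocks"):
            texts.append(b[4])  # item 4 of a block tuple is its text
    return texts


if __name__ == "__main__":
    for text in extract_blocks_without_page_breaks("Test.epub"):
        print(text)
        print("-" * 50)
```

If the result still contains several pages, the blocks at those remaining page boundaries may need the same merge step the comment in the question describes.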