PyMuPDF: readjusting the size of rect returned by page.searchfor()

Created on 24 Nov 2020 · 3 Comments · Source: pymupdf/PyMuPDF

I am facing an issue while redacting some data in a PDF file.
I collect a list of the areas of the words to redact and then use those areas to redact. When I do, text in the line above or the line underneath gets hidden as well.

Is there any way to adjust the rectangle so that it exactly matches the area of the word to be redacted?

Thanks for your help in advance.

question

All 3 comments

A common problem, unfortunately. It goes back to MuPDF's character inclusion logic: a character is considered part of the redact rect if its bbox intersects it (as opposed to "is contained in" it).
Independently of this, many PDFs probably have heavily overlapping line bboxes ...
So to be sure, you can reduce the height of the returned hit rectangle around its middle line, as in this snippet, which creates a sub-rectangle with 20% of the original height but the same width:

h = rect.height  # height of hit rectangle
my = (rect.y0 + rect.y1) / 2  # y of middle line
y0 = my - h * 0.1  # new upper vertical coord
y1 = my + h * 0.1  # new lower vertical coord
rect.y0 = y0
rect.y1 = y1
# now define redact annot with modified rect
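
For context, here is a minimal sketch of how the snippet above might fit into a complete search-and-redact pass. The helper names `shrink_height` and `redact_hits` and the `keep` parameter are made up for illustration. The sketch assumes `page` is a PyMuPDF page and uses the snake_case method names `search_for`, `add_redact_annot` and `apply_redactions`; older releases spell the search method `searchFor`, so adjust to your installed version.

```python
def shrink_height(x0, y0, x1, y1, keep=0.2):
    """Return the same rectangle with its height scaled to `keep`,
    centered on the middle line. The width is left unchanged."""
    mid = (y0 + y1) / 2            # y of the middle line
    half = (y1 - y0) * keep / 2    # half of the new height
    return (x0, mid - half, x1, mid + half)


def redact_hits(page, needle, keep=0.2):
    """Redact every occurrence of `needle` on `page` using shrunken
    hit rectangles. `page` is assumed to be a PyMuPDF page object.
    Returns the number of hits that were redacted."""
    hits = page.search_for(needle)  # one rectangle per hit
    for r in hits:
        # A rectangle unpacks into its four coordinates.
        page.add_redact_annot(shrink_height(*r, keep=keep))
    if hits:
        page.apply_redactions()     # removes the text for real
    return len(hits)
```

With `keep=0.2` this reproduces the 20% sub-rectangle from the snippet. A larger value covers more of the line but raises the chance of touching characters in neighboring lines again, so the right value likely depends on the document.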

Thanks a lot Jorj 🙂.

It worked superbly.

The problem becomes trickier if you not only want to remove the redacted text but also replace it with some other text, using the _original_ hit rectangle ...
Inserting replacement text is normally part of applying redactions, but once the rectangle has been shrunk, the redaction annotation no longer knows the original one.
If you ever run into this type of issue, you must remember the original rectangle yourself, add and immediately apply the redaction (without letting it insert text itself), and then "manually" insert the text into the remembered hit rect, which is now empty ...
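
The steps just described could be sketched roughly as follows. The function name `replace_hits` and the parameters `keep` and `fontsize` are hypothetical, and the sketch assumes the snake_case PyMuPDF methods `search_for`, `add_redact_annot`, `apply_redactions` and `insert_textbox`; it is one possible arrangement, not a definitive implementation.

```python
def replace_hits(page, needle, replacement, keep=0.2, fontsize=8):
    """Remove every occurrence of `needle` and write `replacement`
    into the original hit rectangles.
    `page` is assumed to be a PyMuPDF page object."""
    # 1. Remember the original hit rectangles as plain tuples.
    originals = [tuple(r) for r in page.search_for(needle)]

    # 2. Redact with rectangles shrunk around the middle line,
    #    without asking the redaction to insert any text.
    for x0, y0, x1, y1 in originals:
        mid = (y0 + y1) / 2
        half = (y1 - y0) * keep / 2
        page.add_redact_annot((x0, mid - half, x1, mid + half))
    if originals:
        page.apply_redactions()

    # 3. Insert the new text into the remembered rectangles.
    #    insert_textbox returns a negative number if the text
    #    did not fit into the rectangle.
    return [
        page.insert_textbox(rect, replacement, fontsize=fontsize)
        for rect in originals
    ]
```

Checking the returned values is worthwhile here: a hit rectangle is only as large as the word it enclosed, so a longer replacement or a larger font size may not fit.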
