Pdf.js: getTextContent() text items have wrong height

Created on 12 Apr 2017 · 14 Comments · Source: mozilla/pdf.js

pdfJS Version: 1.7.290
nodeJS Version: v6.9.3
Test PDF file: test.pdf

TL;DR: textContent.height is way off compared to the rendered PDF. I'm not sure if this is a bug, an invalid PDF file, or intended behaviour.

On some PDF files (see the attached file for an example) the textContent items seem to have a wrong value for the 'height' property.

Consider the following example, which is the text 'Uw rekening' just below the top-right logo:

{
  "str": "Uw rekening",
  "dir": "ltr",
  "width": 98.928,
  "height": 0.54,
  "transform": [
    18,
    0,
    0,
    18,
    441.81,
    708.4499999999999
  ],
  "fontName": "g_d9_f21"
}

Here the 'height' property value is 0.54, whilst Math.sqrt(t[2]*t[2] + t[3]*t[3]) = 18 is expected.
When looking at the rendered PDF, we can also confirm that the text in question is actually rendered as 18px high.
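
A minimal sketch of that expectation, using the item above. `heightFromTransform` is a hypothetical helper name, not part of the PDF.js API; it only assumes that `transform` is the usual `[a, b, c, d, e, f]` array.

```javascript
// Estimate the rendered glyph height of a getTextContent() item from its
// transform [a, b, c, d, e, f] instead of trusting item.height.
// heightFromTransform is a hypothetical helper, not a PDF.js function.
function heightFromTransform(item) {
  const [, , c, d] = item.transform;
  // Length of the transformed "up" vector; this also holds for rotated text.
  return Math.hypot(c, d);
}

// Item copied from the report above.
const item = {
  str: "Uw rekening",
  height: 0.54,
  transform: [18, 0, 0, 18, 441.81, 708.4499999999999],
};

console.log(heightFromTransform(item)); // 18
```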

I traced the code backwards through the PDF.js source, and this is what I found:

In flushTextContentItem(), the height is 18, but it is then multiplied by textContentItem.textAdvanceScale, which has a value of 0.03 for the attached PDF.

If we look at ensureTextContentItem(), we see that textAdvanceScale is calculated as follows:

textAdvanceScale = Math.sqrt(ctm[0]*ctm[0] + ctm[1]*ctm[1]) * Math.sqrt(tlm[0]*tlm[0] + tlm[1]*tlm[1])

Where ctm is the current transformation matrix, and tlm the text line matrix.

The text line matrix looks just fine, but (in case of this PDF example), the ctm seems very unlikely:

[
  0.03,
  0,
  0,
  0.03,
  0,
  0
]

Eventually I found that a cm operator is encountered with args [0.03, 0, 0, 0.03, 0, 0], which is then handled in preprocessCommand() and triggers stateManager.transform(args), where the ctm is updated to [0.03, 0, 0, 0.03, 0, 0].
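
To make the arithmetic concrete, here is a small sketch of that calculation. The ctm is the one from this PDF; the tlm is an assumption (an unscaled matrix), since all I can say is that it looks fine. `matrixScaleX` is a hypothetical helper, not PDF.js code.

```javascript
// Scale applied along the x axis by a PDF matrix [a, b, c, d, e, f].
// matrixScaleX is a hypothetical helper, not a PDF.js function.
function matrixScaleX(m) {
  return Math.hypot(m[0], m[1]);
}

const ctm = [0.03, 0, 0, 0.03, 0, 0]; // set by the cm operator
const tlm = [1, 0, 0, 1, 0, 0];       // ASSUMPTION: unscaled text line matrix

const textAdvanceScale = matrixScaleX(ctm) * matrixScaleX(tlm);
const heightBeforeScaling = 18; // value seen in flushTextContentItem()
const reportedHeight = heightBeforeScaling * textAdvanceScale;

console.log(textAdvanceScale);          // 0.03
console.log(reportedHeight.toFixed(2)); // "0.54"
```

Under that assumption the 0.03 factor ends up applied to a height that is already 18, which would explain the reported 0.54.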

But this is where my debugger threw in the towel as it crashes when trying to navigate through the massive 57k LOC PDFJS library.

When inspecting the PDF, I find this part:

0.03 0 0 0.03 0 0 cm
BT
/F9 600.00 Tf
0.89 0.00 0.10 rg
14727 23615 TD
(Uw rekening) Tj
*snip*
ET Q

So yes, the graphics state is modified right before the text portion, but that's about where my knowledge of the PDF format ends. I don't know if the graphics state is supposed to influence text size?
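
As far as I understand the PDF text model, the answer is yes: the text rendering matrix is the font-size matrix multiplied by the text matrix and then by the CTM. The sketch below composes the operators from the snippet above. It assumes no horizontal scaling or text rise, and that the CTM was the identity before the cm operator; `multiply` is a hypothetical helper, not PDF.js code.

```javascript
// Multiply two PDF matrices given as [a, b, c, d, e, f] (row-vector convention).
// multiply is a hypothetical helper, not a PDF.js function.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[1] * m2[2],
    m1[0] * m2[1] + m1[1] * m2[3],
    m1[2] * m2[0] + m1[3] * m2[2],
    m1[2] * m2[1] + m1[3] * m2[3],
    m1[4] * m2[0] + m1[5] * m2[2] + m2[4],
    m1[4] * m2[1] + m1[5] * m2[3] + m2[5],
  ];
}

const fontSize = 600; // from "/F9 600.00 Tf"
// ASSUMPTION: no horizontal scaling and no text rise.
const fontMatrix = [fontSize, 0, 0, fontSize, 0, 0];
// BT resets the text matrix to identity, then "14727 23615 TD" translates it.
const textMatrix = [1, 0, 0, 1, 14727, 23615];
// From "0.03 0 0 0.03 0 0 cm"; ASSUMPTION: the CTM was the identity before it.
const cmMatrix = [0.03, 0, 0, 0.03, 0, 0];

const trm = multiply(multiply(fontMatrix, textMatrix), cmMatrix);
console.log(trm.map((v) => Number(v.toFixed(2))));
// [ 18, 0, 0, 18, 441.81, 708.45 ]
```

The result matches the `transform` of the 'Uw rekening' item, so the 600-unit font scaled by 0.03 is where the 18 comes from.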

So, in conclusion: I don't know if this is a bug, an invalid PDF document or intended behaviour. But I do know that a height of 0.54 is not how the document is actually rendered.

To get the actual rendered height of a text item, can I safely assume that the 'real' height is equal to Math.sqrt(t[2]*t[2] + t[3]*t[3]) ?

4-text-selection

LeonMelis

Most helpful comment

I found the problem in commit https://github.com/mozilla/pdf.js/commit/4537590033169915e68f6480e2463bc4b2175f78
Before this commit, the height was multiplied by textAdvanceScale only for vertical fonts.
After it, the height is multiplied in all cases.

aberkovsky on 15 Dec 2018

👍10

All 14 comments

Looks like this may be a regression from #7879

brendandahl on 12 Apr 2017

Same deal here using pdfjs-dist 1.8.412

Heights are greatly exaggerated:

{ str: 'Y',
  dir: 'ltr',
  width: 4.335500000000001,
  height: 42.25,
  transform: [ 0, 6.5, -6.5, 0, 488.4, 611.8 ],
  fontName: 'Helvetica' }

For a single "Y" at a 6.5pt font, I would expect this to be near 6.5pt in height, not multiplied by 6.5.

Saltallica on 3 Jun 2017

If it helps, this worked correctly in 1.6.210, which I have reverted back to.

Saltallica on 3 Jun 2017

👍3

Same issue using pdfjs 1.9.646

Heights are reported as the square of their actual values.

{
  "initialized": false,
  "str": [],
  "width": 19.91999320925712,
  "height": 423.18367629061225,
  "vertical": false,
  "lastAdvanceWidth": 0.91875,
  "lastAdvanceHeight": 0,
  "textAdvanceScale": 20.57142864,
  "spaceWidth": 0.6,
  "fakeSpaceMin": 0.18,
  "fakeMultiSpaceMin": 0.8999999999999999,
  "fakeMultiSpaceMax": 2.4,
  "textRunBreakAllowed": false,
  "transform": [
    6.300000021,
    0,
    0,
    20.57142864,
    59.76000019919999,
    751.1999999999999
  ],
  "fontName": "Courier"
}
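
A quick way to check the "squared" observation is to divide the reported height by the height implied by the transform. This is only a sketch using the two dumps above; `heightOverscale` is a hypothetical helper name, not PDF.js API.

```javascript
// Ratio between the reported height and the height implied by the transform.
// A ratio near 1 would mean item.height can be trusted.
// heightOverscale is a hypothetical helper, not a PDF.js function.
function heightOverscale(item) {
  const [, , c, d] = item.transform;
  return item.height / Math.hypot(c, d);
}

// Values copied from the two reports above.
const rotatedY = { height: 42.25, transform: [0, 6.5, -6.5, 0, 488.4, 611.8] };
const courier = {
  height: 423.18367629061225,
  transform: [6.300000021, 0, 0, 20.57142864, 59.76000019919999, 751.1999999999999],
};

console.log(heightOverscale(rotatedY).toFixed(2)); // "6.50"
console.log(heightOverscale(courier).toFixed(2));  // "20.57"
```

In the second dump the ratio is the same as the dumped textAdvanceScale (20.57142864), which fits the idea that the height gets scaled one time too many.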

chadkirby on 17 Nov 2017

@LeonMelis @Saltallica @chadkirby If you don't mind sharing, what are you using getTextContent for? I've recently been thinking about cleaning up a few things, but don't want to remove things people are using.

brendandahl on 17 Nov 2017

@brendandahl In a nutshell, I am trying to draw rectangles around certain paragraphs. I search through each page for target strings, then when I find the target, I need to compute the paragraph's bounding box so that I can annotate that portion of the page.
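
A rough sketch of that bounding-box step, assuming unrotated text and deriving each item's height from its transform rather than from item.height. `itemBox` and `unionBox` are hypothetical helpers, and the sample items are made up for illustration.

```javascript
// Approximate box of one unrotated text item in PDF user space
// (origin bottom-left, y grows upward). The baseline is used as the bottom
// edge, so descenders are ignored.
function itemBox(item) {
  const [, , c, d, x, y] = item.transform;
  const height = Math.hypot(c, d); // derived from the transform, not item.height
  return { left: x, bottom: y, right: x + item.width, top: y + height };
}

// Smallest box that contains all the given items (expects at least one item).
function unionBox(items) {
  return items.map(itemBox).reduce((acc, b) => ({
    left: Math.min(acc.left, b.left),
    bottom: Math.min(acc.bottom, b.bottom),
    right: Math.max(acc.right, b.right),
    top: Math.max(acc.top, b.top),
  }));
}

// Hypothetical two-line paragraph; these values are not from any real PDF.
const paragraph = [
  { str: "First line", width: 60, transform: [12, 0, 0, 12, 72, 700] },
  { str: "Second, longer line", width: 110, transform: [12, 0, 0, 12, 72, 686] },
];

console.log(unionBox(paragraph));
// { left: 72, bottom: 686, right: 182, top: 712 }
```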

chadkirby on 17 Nov 2017

I use it to extract text from a document, with coordinates and dimensions of the textboxes. This allows me to parse documents and extract relevant data. I'm also using it to draw rectangles around certain text elements (highlighting them), like chadkirby does.

LeonMelis on 18 Nov 2017

👍2

Viewed on Mac bla.pdf
I am working with version 1.8.418 and am also getting greatly exaggerated heights. In my document I get a height of 144 where I expected about 20. getTextContent returns the wrong height value. Is there any solution besides reverting to 1.6.210? @yurydelendik
When I search for the word "test" in the file, the wrong area is highlighted. I attached my sample file.

mv80 on 7 Dec 2017

This still seems to be a problem. I'm using [email protected]

I use getTextContent to extract the text and map text items to groups roughly representing paragraphs. The dimensions and position are key to this.

In my _own_ documents, I have found that item.transform[3] consistently provides a value close enough for my purposes.

{
  height: 72.89999999999999, // Way off
  transform: [
    9, // This value is close
    0,
    0,
    8.1, // So is this value
    54,
    756.0884
  ],
  width: 10.008000000000001
}

jacksteamdev on 15 Dec 2017

This is still a problem in 1.10.97. Like others, I'm using PDF.js not just as a viewer, but as a way to extract data by getting text boxes from certain areas of a document. Without accurate heights it doesn't work.

Saltallica on 22 May 2018

Also... is there a way we can get someone to look into this particular issue?

Saltallica on 23 May 2018

If it worked in 1.6.210, you could try using git bisect to find out the commit where it regressed if it's indeed a regression. That would help to speed up the resolution process.

timvandermeij on 23 May 2018

I have the same problem, but I get height: 195.4404 where it should be somewhere between 12 and 15 (I guess).

I use pdfjs to index a PDF, and then open the PDF with pdfjs at the correct page and y coordinate.

I'm using pdfjs-dist 1.9.638. Upgrading to 2.0.943 did not help, and reverting to 1.6.210 did not solve my issue either. (It might be a faulty PDF, but it renders correctly in all the PDF readers I have tried, including pdfjs.)

Is there some other way I can calculate the top position? I'm currently using item.transform[5] + item.height.
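
One possible workaround, sketched under assumptions: derive the height from the transform as suggested earlier in the thread, and add that to the baseline y. 195.4404 happens to be 13.98 squared, which would fit the pattern reported above, but that is only a guess without the transform. The item and page height below are hypothetical, and the helper names are not PDF.js API.

```javascript
// Top edge of a text item in PDF user space (y grows upward), using the
// height derived from the transform rather than item.height.
function itemTop(item) {
  const [, , c, d, , y] = item.transform;
  return y + Math.hypot(c, d);
}

// Same edge measured from the top of the page, ignoring zoom and rotation.
function itemTopFromPageTop(item, pageHeight) {
  return pageHeight - itemTop(item);
}

// Hypothetical item and page height; the post does not include the transform.
const hit = { height: 195.4404, transform: [13.98, 0, 0, 13.98, 72, 500] };
const pageHeight = 792;

console.log(itemTop(hit).toFixed(2));                        // "513.98"
console.log(itemTopFromPageTop(hit, pageHeight).toFixed(2)); // "278.02"
```

For a zoomed or rotated page, I believe the viewport's convertToViewportPoint(x, y) is the safer way to map the resulting point to screen coordinates.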

2008-005.pdf