PDF file:
https://drive.google.com/open?id=1ne84gRIMnss30UeXA475A84AY5pmYJyY
The text layer is rendered with extra spaces between each letter, for example:
Correct text: "VELKOMMEN TIL ALMINDINGEN"
Rendered text: "V E L K O M M E N T I L A L M I N D I N G E N"
I have got the same issue. Please try this file: https://drive.google.com/file/d/0BxnImIPU4vSKbDAzekNEMGlhcjQ/view?usp=sharing
PDF.js cannot find the phrase "IF E" when I copy the text from the viewer.
@Woodgnome Might want to retest with the latest version. There seem to be fewer extra spaces in my tests.
I just downloaded the original PDF and tried opening it in the public demo:
http://i.imgur.com/K4onsTC.png
Still broken from what I can see.
I think I understand what's going on here. The PDF contains TJ operators where each glyph is preceded by the number -70. For the title, for example, we have:
(V)-70(E)-70(L)-70(K)-70(O)-70(M)-70(M)-70(E)-70(N)
According to the specification at https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf#page=407&zoom=auto,-246,31, this value (expressed in thousandths of a unit of text space) is subtracted from the horizontal position, so a value of -70 actually moves the next glyph 70 units to the right. PDF.js interprets this as if it needs to insert a space, since the gap probably exceeds the SPACE_FACTOR threshold. Not sure how to fix this (there are other issues regarding the SPACE_FACTOR heuristic, so it might need to be revisited), but at least this seems like the cause of the issue. This looks like a rather odd thing for a PDF to do, but Okular seems to handle it well.
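To make the explanation above concrete, here is a minimal sketch of a threshold heuristic of this kind. It is a simplified model, not the actual pdf.js code: the function name `extractText`, the array representation of the TJ operands, and the space width of 200 glyph units are all assumptions chosen for illustration.

```javascript
// Simplified model of a space-insertion heuristic; NOT the actual pdf.js code.
// tjArray mimics the operands of a TJ operator: strings are glyphs, numbers
// are position adjustments in thousandths of a text space unit.
function extractText(tjArray, spaceWidth, spaceFactor) {
  // A gap wider than this is treated as a word space.
  const fakeSpaceMin = spaceWidth * spaceFactor;
  let out = "";
  for (const item of tjArray) {
    if (typeof item === "string") {
      out += item;
    } else {
      // A TJ number is subtracted from the current position, so a negative
      // number moves the next glyph to the right by -item units.
      const advance = -item;
      if (advance > fakeSpaceMin) {
        out += " ";
      }
    }
  }
  return out;
}

// The title fragment from the sample PDF.
const title = ["V", -70, "E", -70, "L", -70, "K", -70, "O", -70,
               "M", -70, "M", -70, "E", -70, "N"];

// Assumed space width of 200 units (hypothetical; the real font may differ).
console.log(extractText(title, 200, 0.3)); // "V E L K O M M E N"
console.log(extractText(title, 200, 0.5)); // "VELKOMMEN"
```

With these assumed numbers the threshold is 60 units at a factor of 0.3, so every 70-unit adjustment becomes a space, while at 0.5 the threshold is 100 units and none does.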
Sounds like the PDF is pretty fucked up, but regardless both Acrobat and Chrome PDF viewer seem to handle it as well.
Without knowing anything about SPACE_FACTOR, wouldn't you be able to compare [Left position] + [Width] of character to [Left position] of next character and determine if there should be a space or not depending on that? Or is that what SPACE_FACTOR is used to do already?
I think that's what it does, but the problem is that it's a heuristic, so it won't be correct all the time. I wonder how other PDF viewers solve this.
Comparing Left Positions unfortunately does not work. I tried it on: https://github.com/mozilla/pdf.js/issues/7327
We might be able to look at how Poppler does it: https://github.com/danigm/poppler/blob/0011805e22193b690b99a53dcb9986ce04eb3eb4/poppler/TextOutputDev.cc. It has some constants and logic to add spaces (https://github.com/danigm/poppler/blob/0011805e22193b690b99a53dcb9986ce04eb3eb4/poppler/TextOutputDev.cc#L818) that might be different from how we do it.
Having the same problem with spaces. Is there any solution or quick fix to this issue?
Changing the SPACE_FACTOR in pdf.worker from 0.3 to 0.5 fixed the problem for me, but I don't know how this change might affect other documents.
It seems this problem is not fixed in the pdf.js in Firefox 50.1.0.
This is affecting a lot of my documents. Is there any way to fix this on the latest 1.7 builds?
This still seems to be an issue with 2.0.
Any progress on text-layer rendering and the SPACE_FACTOR?
Same problem with this pdf:
text-spacing-error.pdf
Configuration:
Steps to reproduce the problem:
page.getTextContent({
  normalizeWhitespace: true,
  disableCombineTextItems: true
})
textContent.items[i].str
What is the expected behavior?
The starting text for the second page must be:
- 5 1 DE ORGANISATIE EN DE PROBLEEMSTELLING IN HAAR CONTEXT
What went wrong?
The starting text for the second page is:
- 5 1 D E ORGANISAT IE EN D E P ROBL EEM STEL L ING IN HAAR CONTEXT
Spaces are added within the word.
The code that adds the space WITHIN the word is:
https://github.com/mozilla/pdf.js/blob/master/src/core/evaluator.js#L1656-L1658
The given width (advance) is a tiny bit bigger than textContentItem.fakeSpaceMin.
@timvandermeij your statement about the advance being bigger than the spaceWidth * SPACE_FACTOR is correct.
Solution here is to set SPACE_FACTOR to 0.4.
This renders the words perfectly.
This is an issue within the default pdf.js viewer.
In all the presented cases the issue is with capitalized words.
Maybe a solution is to introduce a bigger fakeCapSpaceMin to compare with when the glyph is capitalized and leave the SPACE_FACTOR as is.
In general, given how common it is for PDF generators to provide incomplete/inconsistent/incorrect font data, attempting to do any sort of lower/upper-case detection is unfortunately quite likely to cause more issues than it solves.
EDIT: Not to mention that adding yet another heuristic, tuned for a particular set of PDF files, probably won't be a good solution in the general case.
Looking at the reference code from @timvandermeij I see a very different approach to calculating the min space:
https://github.com/danigm/poppler/blob/0011805e22193b690b99a53dcb9986ce04eb3eb4/poppler/TextOutputDev.cc#L749-L773
Why does pdf.js use font.spaceWidth, and what does font.spaceWidth mean exactly?
If this is the width of a space, why is it multiplied by 0.3 to check if a distance between characters is actually a space?
I assume that when font.spaceWidth is the (approximate) width of a space, the comparison must be with the actual value (or a value close to this width), not with 0.3 times it.
One more note on this:
The SPACE_FACTOR was changed in this commit https://github.com/mozilla/pdf.js/commit/109d67691c866b2c7001524e49c3e53ff9edd762.
The test pdf that was the origin for this change is unrecoverable.
Even the small change reverting this factor to 0.35 solves my problem.
Is anyone able to deduce for which cases this threshold needs to be smaller?
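One way to reason about that question is to ask which factors separate the gaps inside words from the gaps between words for a given font. The sketch below is a hypothetical helper, not part of pdf.js, and the advance values are invented for illustration. It assumes a space is emitted when the advance is greater than spaceWidth * factor, as discussed in this thread.

```javascript
// Hypothetical helper, not part of pdf.js. Assumes a space is emitted when
// advance > spaceWidth * factor.
// Returns the range of factors that keeps every intra-word gap unspaced and
// still turns every real word gap into a space, or null if no factor can.
function separatingFactorRange(intraWordAdvances, wordGapAdvances, spaceWidth) {
  // The factor must be at least this large to suppress intra-word spaces.
  const lower = Math.max(...intraWordAdvances) / spaceWidth;
  // The factor must be strictly below this to keep real word spaces.
  const upper = Math.min(...wordGapAdvances) / spaceWidth;
  return lower < upper ? { lower, upper } : null;
}

// Letter-spaced text: 70-unit gaps inside words, 180-unit gaps between words.
console.log(separatingFactorRange([70], [180], 200)); // { lower: 0.35, upper: 0.9 }

// Tightly set text: real word gaps of only 65 units need a factor below 0.325.
console.log(separatingFactorRange([20], [65], 200)); // { lower: 0.1, upper: 0.325 }
```

Any factor f with lower <= f < upper works for that document. Under these made-up numbers, a small threshold is needed when real word gaps are narrow relative to font.spaceWidth, and a single global value cannot satisfy both documents if their ranges do not overlap.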
Maybe https://github.com/euske/pdfminer, a Python-based text extraction tool, is a good reference too.
This tool does not insert any space character when extracting text from a Tj instruction:
https://github.com/euske/pdfminer/blob/44977b6726640933d86028d16ca06fab5ea26d1a/pdfminer/pdfinterp.py#L753-L766
The render_string code just renders the characters in the Tj sequence:
https://github.com/euske/pdfminer/blob/44977b6726640933d86028d16ca06fab5ea26d1a/pdfminer/pdfdevice.py#L89-L102
_Edit:_ This tool does insert spaces. Its code uses the width and height of the character to calculate a margin, not a font.spaceWidth:
https://github.com/euske/pdfminer/blob/master/pdfminer/layout.py#L369-L375
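As I read the linked pdfminer code, the margin is a fraction of the larger of the character's width and height, and a space is added when the previous character ends before the next one starts minus that margin. A rough JavaScript sketch of that idea follows. The glyph object shape, the function name `joinGlyphs`, and the default `wordMargin` of 0.1 are assumptions for illustration, not a port of pdfminer.

```javascript
// Sketch of a size-relative word-gap check, loosely modelled on the idea in
// pdfminer's layout code. The glyph shape {text, x0, x1, width, height} and
// the default wordMargin are assumptions made for this example.
function joinGlyphs(glyphs, wordMargin = 0.1) {
  let out = "";
  let prevX1 = null; // right edge of the previous glyph
  for (const g of glyphs) {
    // The allowed gap scales with the glyph size, not with a font-wide
    // space width.
    const margin = wordMargin * Math.max(g.width, g.height);
    if (prevX1 !== null && prevX1 < g.x0 - margin) {
      out += " ";
    }
    out += g.text;
    prevX1 = g.x1;
  }
  return out;
}

// Glyphs 60 wide and 100 high give a margin of 10 units.
const glyphs = [
  { text: "D", x0: 0,   x1: 60,  width: 60, height: 100 },
  { text: "E", x0: 67,  x1: 127, width: 60, height: 100 }, // 7-unit gap
  { text: "O", x0: 157, x1: 217, width: 60, height: 100 }, // 30-unit gap
];
console.log(joinGlyphs(glyphs)); // "DE O"
```

This is still a heuristic with a tunable constant, so it would not be correct for every document either. It only swaps font.spaceWidth for the rendered glyph size as the reference length.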
Still not sure why pdf.js has such a different approach using font.spaceWidth and a seemingly random value.
The SPACE_FACTOR was changed in this commit 109d676.
The test pdf that was the origin for this change is unrecoverable.
A reduced test-case should have been included in the original PR, but was added in #5806, and is now part of our test-suite; please see https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue5734.pdf.
Also, if you want to try and work on improving text-selection, I'd highly recommend careful reading of https://github.com/mozilla/pdf.js/wiki/Contributing and in particular section https://github.com/mozilla/pdf.js/wiki/Contributing#4-run-lint-and-testing. There it's described how to generate reference images and run the tests locally, which is necessary in order to validate your changes when working on code residing in the /src folder.
Original sample document is now available at https://drive.google.com/open?id=1ne84gRIMnss30UeXA475A84AY5pmYJyY (I also updated the link in the original post).
In all the presented cases the issue is with capitalized words.
Also an issue with the non-capitalized body text in this PDF.
I just created issue #9998, which I now see is probably a duplicate of this one (sorry about that).
In my example the spaces are inserted after all 'i', 'j' and 'l' characters. Copy and paste using Firefox (which uses pdf.js) adds the extra spaces, but Chrome and Acrobat Reader work fine.
My example has been edited with Inkscape, and all the text is in a single text box.
I'm having this issue as well. How on earth do I override the SPACE_FACTOR variable?
The only way of 'overriding' SPACE_FACTOR is to change the source code. There is no API/configuration option for it.
Assuming you're using the pdfjs-dist npm package, you'd have to vendor node_modules/pdfjs-dist/build/pdf.worker.js and make your change to that file (I don't recommend it).
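If someone does go down the vendoring route anyway, the edit could be scripted roughly like this. This is only a sketch under a strong assumption: that the vendored build contains a literal assignment of the form `SPACE_FACTOR = <number>`. I have not verified that for any particular release, and the helper name `patchSpaceFactor` and the vendor path in the comments are hypothetical.

```javascript
// Hypothetical helper: rewrites the first "SPACE_FACTOR = <number>" literal
// found in a copy of the worker source. It assumes the build contains such
// an assignment and throws if it does not.
function patchSpaceFactor(source, newValue) {
  const pattern = /(SPACE_FACTOR\s*=\s*)\d*\.?\d+/;
  if (!pattern.test(source)) {
    throw new Error("SPACE_FACTOR assignment not found; the build may differ");
  }
  // A replacer function avoids any ambiguity between "$1" and the digits
  // of the new value.
  return source.replace(pattern, (match, prefix) => prefix + String(newValue));
}

// Usage on a vendored copy (the path is hypothetical):
// const fs = require("fs");
// const file = "vendor/pdfjs/pdf.worker.js";
// fs.writeFileSync(file, patchSpaceFactor(fs.readFileSync(file, "utf8"), 0.4));

console.log(patchSpaceFactor("var SPACE_FACTOR = 0.3;", 0.4)); // "var SPACE_FACTOR = 0.4;"
```

The patch has to be reapplied after every upgrade of pdfjs-dist, and as noted elsewhere in this thread, nobody knows how a different value affects other documents.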
Also, I thought I had this problem, but it turned out I did not. I'm using https://github.com/mozilla/pdf.js/blob/master/web/pdf_viewer.js. Adding disableCombineTextItems: true in the two places where normalizeWhitespace: true is used fixed my problem with extra/unwanted spaces.
It'd be great if somebody with more context/experience with pdf.js could chime in on this issue. Perhaps disableCombineTextItems could be a setting when initializing pdf_viewer.
I don't mean to hijack this issue. I only comment on my experience because I suspect others here may think they have the SPACE_FACTOR problem when in reality they do not. Just sharing what I wish I had found when researching my (similar) problem. :smile:
I'm having the opposite of the issue described here with the PDF below (no white space at all). Are there any plans to expose the SPACE_FACTOR setting so that it can at least be adjusted on a per-PDF basis to produce better results?