Link to PDF file (or attach file here):
Configuration:
Steps to reproduce the problem:
What is the expected behavior? (add screenshot)
Spaces should appear between words in the extracted text.
jansen_public_queries.pdf
What went wrong? (add screenshot)
No spaces are produced between words.
Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):
I think there's a fix for this problem.
On the attached image, which renders the first page of this PDF, you can see that the bounding boxes of the text items are wider than the last printed character: for example, the "g" of "searching" is followed by extra empty space. This means the widths are computed correctly (textMatrix is properly updated in the code), but no space is added where one is needed. Whether a space is needed depends on the current value of charSpacing.
So in https://github.com/mozilla/pdf.js/blob/070f2d32ada10bd877bdfd6cae2a2179a20a9930/src/core/evaluator.js#L1334, after the line textChunk.str.push(glyphUnicode);, we should check whether the current charSpacing is at least textChunk.spaceWidth * SPACE_FACTOR.
If so, we should add a space to the str array of characters, as follows:
if (charSpacing >= textChunk.spaceWidth * SPACE_FACTOR) {
textChunk.str.push(' ');
}
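A minimal, self-contained sketch of that check, outside the evaluator. The helper name pushGlyph, the sample numbers, and the SPACE_FACTOR value are assumptions made for illustration, and the sketch assumes charSpacing and spaceWidth are expressed in comparable units:

```javascript
// Assumed value for illustration only; check evaluator.js for the real constant.
const SPACE_FACTOR = 0.3;

// Hypothetical helper: appends a glyph to the chunk and, when the character
// spacing is at least as wide as a scaled space, appends an explicit space.
function pushGlyph(textChunk, glyphUnicode, charSpacing) {
  textChunk.str.push(glyphUnicode);
  if (charSpacing >= textChunk.spaceWidth * SPACE_FACTOR) {
    textChunk.str.push(' ');
  }
}

// Made-up chunk: a space is 250 units wide, so the threshold is 75.
const chunk = { str: [], spaceWidth: 250 };
pushGlyph(chunk, 'o', 0);   // tight spacing inside a word
pushGlyph(chunk, 'f', 100); // wide spacing after the last letter of the word
pushGlyph(chunk, 'a', 0);
const text = chunk.str.join(''); // "of a"
```

Note that with `>=`, a chunk whose spaceWidth is 0 would get a space after every glyph even when charSpacing is 0, so a real patch would probably need a guard for that case.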
Laurent.
We have quite a lot of text layer spacing issues open now. Perhaps it's indeed time to revisit the SPACE_FACTOR constant handling.
The PDF is shown as an HTML page. Each sentence is a div, so when we copy text we are actually copying divs. There is no space between one div and the next, and that's why the pasted text is missing spaces.
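As a caller-side illustration of that point, here is a rough sketch that rebuilds a string from text items by looking at the gap between neighbours instead of relying on the divs. It is not what PDF.js does internally. The helper name joinTextItems and the gapRatio threshold are hypothetical, the items only mimic the str, transform, width and height fields of text content items, and the sketch assumes unrotated left-to-right text with widths in the same units as the transform offsets:

```javascript
// Hypothetical helper: joins text items into one string. It starts a new line
// when the baseline moves by more than half a line height, and inserts a space
// when the horizontal gap to the previous item exceeds gapRatio * line height.
function joinTextItems(items, gapRatio = 0.15) {
  let out = '';
  let prev = null;
  for (const item of items) {
    if (prev) {
      const lineHeight = prev.height || item.height || 1;
      const sameLine =
        Math.abs(item.transform[5] - prev.transform[5]) <= lineHeight / 2;
      if (!sameLine) {
        out += '\n';
      } else {
        const gap = item.transform[4] - (prev.transform[4] + prev.width);
        const alreadySpaced = out.endsWith(' ') || item.str.startsWith(' ');
        if (gap > gapRatio * lineHeight && !alreadySpaced) {
          out += ' ';
        }
      }
    }
    out += item.str;
    prev = item;
  }
  return out;
}

// Made-up items: "mac" and "hines" nearly touch, "that" follows a real gap,
// and "think." sits on the next line.
const sample = [
  { str: 'mac', transform: [10, 0, 0, 10, 0, 100], width: 18, height: 10 },
  { str: 'hines', transform: [10, 0, 0, 10, 18.2, 100], width: 25, height: 10 },
  { str: 'that', transform: [10, 0, 0, 10, 47, 100], width: 18, height: 10 },
  { str: 'think.', transform: [10, 0, 0, 10, 0, 88], width: 28, height: 10 },
];
const joined = joinTextItems(sample); // "machines that\nthink."
```

The threshold is a guess; documents with unusual fonts or OCR output would likely need a different value.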
@luxferoo It depends.
I attached this sample to show that some documents are rendered word by word, but not exactly: sometimes a word gets split.
This issue is really annoying; it makes the text layer completely useless.
I also tried @ldenoue's patch. It works for his sample, but not for my sample attached above.
I can confirm the problem with spaces.
This is my sample PDF.
There are no spaces in the pdf.js viewer / renderer.
Copied text pastes as one big word.
Testpdfsandwich.pdf
Hey @timvandermeij, any update on this? I am facing this issue too.
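Somewhere in the code, all whitespace characters in a chunk get replaced, so I changed the line at https://github.com/mozilla/pdf.js/blob/070f2d32ada10bd877bdfd6cae2a2179a20a9930/src/core/evaluator.js#L1247-L1248 to
str: (normalizeWhitespace ? replaceWhitespace(bidiResult.str) + ' ' : bidiResult.str),
With that change, it works for me.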
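To show what that change does in isolation, here is a small sketch. The replaceWhitespace below is only a stand-in with assumed behaviour (every whitespace character becomes a plain space), and chunkToStr is a hypothetical wrapper around the patched expression:

```javascript
// Stand-in for the pdf.js helper of the same name; assumed behaviour only.
function replaceWhitespace(str) {
  return str.replace(/\s/g, ' ');
}

// Hypothetical wrapper around the patched expression: a trailing space is
// appended to every chunk whenever normalizeWhitespace is set.
function chunkToStr(bidiStr, normalizeWhitespace) {
  return normalizeWhitespace ? replaceWhitespace(bidiStr) + ' ' : bidiStr;
}

// Chunks that coincide with whole words come out correctly separated.
const wholeWords = ['public', 'queries']
  .map((s) => chunkToStr(s, true))
  .join(''); // "public queries "

// Chunks that split a word mid-way get spaces inside the word.
const splitWord = ['In', 'v', 'en', 'tors']
  .map((s) => chunkToStr(s, true))
  .join(''); // "In v en tors "
```

The second case is the limitation of an unconditional space: it only helps when chunk boundaries line up with word boundaries.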
@Lyn203 it's working for me too. Did you open a PR to fix this?
Actually, I don't really understand all of this logic in depth, so the change may affect other places.
I remember now why I changed this line: when normalizeWhitespace is set, replaceWhitespace is applied, so str loses the space character between chunks. I would like to open a PR, but how?
After adding a space after every chunk, I see an issue. With the file https://github.com/mozilla/pdf.js/files/1663608/p16.pdf (the example from @brookhong), some chunks are not right:
In v en tors ha v e long dreamed of creating mac hines that think. This desire dates bac k to at least the time of ancien t Greece. The m ythical figures Pygmalion, Daedalus , and Hephaestus ma y all be interpreted as legendary in v en tors, and Galatea, T alos, and P andora may all b e regarded as artificial life ( , Ovid and Martin 2004 Spark es 1996 T andy 1997 ; , ; , ).
Some characters in "( , Ovid and Martin 2004 Spark es 1996 T andy 1997 ; , ; , )." are also positioned wrongly.
textContentItem may be scaled slightly wrong in some cases.
Search also works incorrectly because of this issue.
Any progress on this, @timvandermeij?
I have issues with Hebrew, especially with Hebrew letters with dots.
I am encountering this with https://www.ncsc.gov.uk/files/NCSC%20advisory%20-%20CNI%20Supply%20Chain.pdf, page 13, in Firefox 71.
The spaces appear to be there but aren't even selectable in the span section, so only the text gets selected.
Chrome's viewer has spaces, as expected.
I have the same problem: spaces are missing in text copied from pdf.js in Firefox 75.0. Copying works correctly with Evince on Ubuntu and the Edge viewer on Windows 10. The document is produced by ABBYY FineReader.
Regarding the suggested change to
str: (normalizeWhitespace ? replaceWhitespace(bidiResult.str) + ' ' : bidiResult.str),
the chunk division itself is not right. A complete word is often placed into multiple span tags, so adding a space after every chunk makes the text meaningless again.
Regarding https://www.ncsc.gov.uk/files/NCSC%20advisory%20-%20CNI%20Supply%20Chain.pdf page 13 and the remark that Chrome's viewer has spaces as expected: no, even the Chrome viewer has this issue.
We are running into this problem using ABBYY, as reported elsewhere in this issue.
I have noticed the following difference when using PDF.js on a PDF that was OCRd by both ABBYY and Acrobat Pro:
I'm not sure what "composite" means, but it seems related to font encoding: PDF.js ends up on different code paths when composite is true vs. false. Does this suggest a possible workaround for using PDF.js on ABBYY OCR output? (I'm also asking ABBYY the same question.)
any update on this issue? Thanks :)