Pdf.js: missing all spaces on page 1 (text selection)

Created on 10 May 2016 · 19 Comments · Source: mozilla/pdf.js

Link to PDF file (or attach file here):

Configuration:

Web browser and its version: Chrome 51.0.2704.36 beta (64-bit)
Operating system and its version: Mac OS X El Capitan 10.11.3 (15D21)
PDF.js version: online viewer (May 10, 2016)
Is an extension:

Steps to reproduce the problem:

Copy text from page 1.
Paste it somewhere else: all the words run together, with no spaces between them.

What is the expected behavior? (add screenshot)
Spaces should appear between words.
jansen_public_queries.pdf

What went wrong? (add screenshot)
No spaces appear between words.
Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):

4-text-selection

ldenoue

👍4

All 19 comments

I think there's a fix for this problem.
On the attached image, which renders the first page of this PDF, the bounding boxes of the text items are wider than the last printed character; for example, the "g " of "searching" is followed by extra empty space. This means the widths are computed correctly (textMatrix is properly updated in the code), but no space is added where one should be. Whether a space should be added depends on the current value of charSpacing.
So in https://github.com/mozilla/pdf.js/blob/070f2d32ada10bd877bdfd6cae2a2179a20a9930/src/core/evaluator.js#L1334, after the line textChunk.str.push(glyphUnicode); we should check whether the current charSpacing is at least textChunk.spaceWidth * SPACE_FACTOR.

If so, we should add a space to the str array of characters, as follows:
if (charSpacing >= textChunk.spaceWidth * SPACE_FACTOR) {
  textChunk.str.push(' ');
}

Laurent.
screen shot 2016-05-10 at 8 50 06 pm

ldenoue on 10 May 2016
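
The check proposed above can be tried outside the evaluator. The following is a minimal sketch, not the pdf.js implementation: buildChunkText, the glyph objects with a per-glyph charSpacing field, and the SPACE_FACTOR value of 0.3 are all assumptions made for illustration.

```javascript
// Assumed value; stands in for the evaluator's SPACE_FACTOR constant.
const SPACE_FACTOR = 0.3;

/**
 * Build the text of one chunk from its glyphs (hypothetical helper).
 * glyphs: array of { unicode, charSpacing }, where charSpacing is the
 *   extra advance in effect after that glyph.
 * spaceWidth: width of the font's space glyph, in the same units.
 */
function buildChunkText(glyphs, spaceWidth) {
  const str = [];
  for (const { unicode, charSpacing } of glyphs) {
    str.push(unicode);
    // The proposed rule: a large enough advance after a glyph counts as a space.
    if (charSpacing >= spaceWidth * SPACE_FACTOR) {
      str.push(" ");
    }
  }
  return str.join("");
}

// The trailing "g" carries a word-sized advance, as in the screenshot.
const glyphs = [
  { unicode: "i", charSpacing: 0 },
  { unicode: "n", charSpacing: 0 },
  { unicode: "g", charSpacing: 250 },
  { unicode: "f", charSpacing: 0 },
  { unicode: "o", charSpacing: 0 },
  { unicode: "r", charSpacing: 0 },
];
console.log(JSON.stringify(buildChunkText(glyphs, 250))); // "ing for"
```

Note that a spaceWidth of 0 would make the condition true for every glyph, so a real patch would probably need a guard for fonts that report no space width.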

We have quite a lot of text layer spacing issues open now. Perhaps it's indeed time to revisit the SPACE_FACTOR constant handling.

timvandermeij on 10 May 2016

The PDF is rendered as an HTML page. Each sentence is a div, so when we copy text we are actually copying divs, and there is no space between one div and the next. That is why the pasted text is missing spaces.

luxferoo on 17 May 2016
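
If the spaces really are absent between text elements, one possible client-side workaround is to rebuild the text from item geometry instead of relying on selection. A rough sketch, assuming simplified items of the form { str, x, y, width } (a hypothetical shape; joinTextItems and both thresholds are also assumptions, and right-to-left text is not handled):

```javascript
/**
 * Join positioned text items into a string (hypothetical helper).
 * A space is inserted when the horizontal gap to the previous item is
 * larger than a fraction of that item's average character width; a
 * newline is inserted when the baseline changes.
 */
function joinTextItems(items, { gapRatio = 0.25, lineTolerance = 2 } = {}) {
  let out = "";
  let prev = null;
  for (const item of items) {
    if (prev) {
      if (Math.abs(item.y - prev.y) > lineTolerance) {
        out += "\n"; // different baseline: start a new line
      } else {
        const gap = item.x - (prev.x + prev.width);
        const avgCharWidth = prev.width / Math.max(prev.str.length, 1);
        if (gap > avgCharWidth * gapRatio) {
          out += " "; // visible gap: treat it as a word break
        }
      }
    }
    out += item.str;
    prev = item;
  }
  return out;
}

const items = [
  { str: "mac", x: 0, y: 100, width: 30 },
  { str: "hines", x: 30.5, y: 100, width: 50 }, // nearly touching: same word
  { str: "that", x: 86, y: 100, width: 40 },    // clear gap: new word
  { str: "think.", x: 0, y: 88, width: 60 },    // lower baseline: new line
];
console.log(JSON.stringify(joinTextItems(items))); // "machines that\nthink."
```

The thresholds would need tuning per document; they are not derived from anything in pdf.js.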

@luxferoo It depends.

p16.pdf

I attached this sample to show that some documents are rendered word by word, though not exactly: sometimes a word gets split.
This issue is really annoying and makes the text layer completely useless.

I also tried @ldenoue's patch; it works for his sample, but not for the sample attached above.

brookhong on 25 Jan 2018

👍1

I can confirm the problem with spaces.
This is my sample PDF.
There are no spaces in the pdf.js viewer/renderer:
copied text pastes as one big word.
Testpdfsandwich.pdf

DJArty on 19 Feb 2018

👍1

Hey @timvandermeij, any update on this? I am facing this issue too.

Maheshme on 2 Aug 2019

https://github.com/mozilla/pdf.js/blob/070f2d32ada10bd877bdfd6cae2a2179a20a9930/src/core/evaluator.js#L1247-L1248

Somewhere in this code the space characters in each chunk get replaced, so I changed the line to:
str: (normalizeWhitespace ? replaceWhitespace(bidiResult.str) + ' ' : bidiResult.str),
With this change, it works for me.

Lyn203 on 12 Sep 2019

😕2 👍2 👎1

@Lyn203 it's working for me too. Did you open a PR to fix this?

zagoa on 12 Sep 2019

Actually, I don't really understand all of this logic in depth, so the change may affect other places.

Lyn203 on 13 Sep 2019

I remembered why I changed this line: when normalizeWhitespace is set, replaceWhitespace is applied, so str loses the space character between chunks. I would like to open a PR, but how?

Lyn203 on 13 Sep 2019

After adding a space after every chunk, I see a new issue. With the file https://github.com/mozilla/pdf.js/files/1663608/p16.pdf (the example from @brookhong), some chunks are not split correctly:
In v en tors ha v e long dreamed of creating mac hines that think. This desire dates bac k to at least the time of ancien t Greece. The m ythical figures Pygmalion, Daedalus , and Hephaestus ma y all be interpreted as legendary in v en tors, and Galatea, T alos, and P andora may all b e regarded as artificial life ( , Ovid and Martin 2004 Spark es 1996 T andy 1997 ; , ; , ).

Also, some characters in ( , Ovid and Martin 2004 Spark es 1996 T andy 1997 ; , ; , ). are positioned wrongly.
The textContentItem scaling may be slightly wrong in some cases.

Lyn203 on 13 Sep 2019
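
The pasted output above illustrates why a fixed rule cannot work when a word is split across chunks. A small illustration, with chunk boundaries guessed from that output (the chunks array is an assumption, not data read from the PDF):

```javascript
// Chunk boundaries inferred from the pasted text; hypothetical data.
const chunks = ["In", "v", "en", "tors", "ha", "v", "e", "long", "dreamed"];

// Appending a space to every chunk, as in the one-line change discussed above.
const spaceAfterEach = chunks.map((s) => s + " ").join("").trimEnd();

// Appending nothing, which is the behavior originally reported.
const noSpaces = chunks.join("");

console.log(spaceAfterEach); // In v en tors ha v e long dreamed
console.log(noSpaces);       // Inventorshavelongdreamed
```

Neither output is right. Deciding where a space belongs needs the position and width of each chunk, not just its order.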

Search also works incorrectly because of this issue.

Andrew3005 on 20 Dec 2019

Any progress on this, @timvandermeij?
I have issues with Hebrew, especially with Hebrew letters with dots.

Andrew3005 on 20 Dec 2019

Encountering this with https://www.ncsc.gov.uk/files/NCSC%20advisory%20-%20CNI%20Supply%20Chain.pdf page 13

Firefox 71
ff71_pdf_selectedtext_no_spaces

The spaces appear to be there, but they aren't selectable in the span section, so only the text itself gets selected.

Chrome's viewer has spaces, as expected

wesinator on 5 Jan 2020

I have the same problem, spaces are missing in copied text using pdf.js and Firefox 75.0. Copying works correctly with Evince on Ubuntu and the Edge viewer on Windows 10. The document is produced by ABBYY Fine Reader.

http://klevas.mif.vu.lt/~valentas/pdf_js_spaces.pdf

valentas-kurauskas on 10 Apr 2020

👍1

> https://github.com/mozilla/pdf.js/blob/070f2d32ada10bd877bdfd6cae2a2179a20a9930/src/core/evaluator.js#L1247-L1248
>
> Somewhere in this code the space characters in each chunk get replaced, so I changed the line to:
> str: (normalizeWhitespace ? replaceWhitespace(bidiResult.str) + ' ' : bidiResult.str),
> With this change, it works for me.

But the chunk division is not correct: a complete word can be split across multiple span tags. So if we add a space after every chunk, the text becomes meaningless again.

Manisha333 on 9 Jun 2020

> Encountering this with https://www.ncsc.gov.uk/files/NCSC%20advisory%20-%20CNI%20Supply%20Chain.pdf page 13
>
> Firefox 71
>
> the spaces appear to be there but aren't even selectable in the span section, hence it only selects the text
>
> Chrome's viewer has spaces, as expected

No, even the Chrome viewer has this issue.

Manisha333 on 9 Jun 2020

We are running into this problem using ABBYY, as reported elsewhere in this issue.

I have noticed the following difference when using PDF.js on a PDF that was OCR'd by both ABBYY and Acrobat Pro:

I'm not sure what "composite" means, but it seems related to font encoding: PDF.js ends up on different code paths when composite is true vs. false. Does this suggest a possible workaround for using PDF.js on ABBYY OCR output? (I'm also asking ABBYY the same question.)