PDF.js: missing all spaces on page 1 (text selection)

Created on 10 May 2016 · 19 Comments · Source: mozilla/pdf.js

Link to PDF file (or attach file here):

Configuration:

  • Web browser and its version: Chrome 51.0.2704.36 beta (64-bit)
  • Operating system and its version: Mac OS X El Capitan 10.11.3 (15D21)
  • PDF.js version: online viewer (May 10, 2016)
  • Is an extension:

Steps to reproduce the problem:

  1. Copy text from page 1.
  2. Paste it somewhere else: all the words run together, without any spaces.

What is the expected behavior? (add screenshot)
Spaces should appear between words.
jansen_public_queries.pdf

What went wrong? (add screenshot)
No spaces were found between words.
Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):

4-text-selection

All 19 comments

I think there's a fix for this problem.
On the attached image, which renders the first page of this PDF, you can see that the bounding boxes of the text items are wider than the last printed character, e.g. the "g " of "searching" has extra empty space after it. This means that the widths are computed correctly (textMatrix is properly updated in the code), but no space is added when one should be. Whether one should be added depends on the current value of charSpacing.
So in https://github.com/mozilla/pdf.js/blob/070f2d32ada10bd877bdfd6cae2a2179a20a9930/src/core/evaluator.js#L1334, after the line textChunk.str.push(glyphUnicode);, we should check whether the current charSpacing is greater than textChunk.spaceWidth * SPACE_FACTOR.

If so, we should add a space to the str array of characters, as follows:
if (charSpacing >= textChunk.spaceWidth * SPACE_FACTOR) {
  textChunk.str.push(' ');
}

Laurent.
screen shot 2016-05-10 at 8 50 06 pm
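
The charSpacing check proposed above can be tried in isolation. The following is only a minimal sketch, not the code in evaluator.js: the function buildChunkString, the glyph object shape, and the SPACE_FACTOR value used here are assumptions made for illustration.

```javascript
// Assumed value for illustration; the real constant lives in evaluator.js.
const SPACE_FACTOR = 0.3;

// Hypothetical helper: builds a chunk string from glyphs and appends a
// space after any glyph whose character spacing reaches the threshold
// spaceWidth * SPACE_FACTOR, mirroring the check proposed above.
// glyphs: array of { unicode: string, charSpacing: number }.
function buildChunkString(glyphs, spaceWidth) {
  const str = [];
  for (const glyph of glyphs) {
    str.push(glyph.unicode);
    if (glyph.charSpacing >= spaceWidth * SPACE_FACTOR) {
      str.push(' ');
    }
  }
  return str.join('');
}

// "searching" followed by "for": only the final "g" carries extra spacing.
const glyphs = [
  ...Array.from('searchin', (c) => ({ unicode: c, charSpacing: 0 })),
  { unicode: 'g', charSpacing: 5 },
  ...Array.from('for', (c) => ({ unicode: c, charSpacing: 0 })),
];

console.log(buildChunkString(glyphs, 10)); // "searching for"
```

With a much larger spaceWidth the same spacing falls under the threshold and no space is emitted, which is why the choice of SPACE_FACTOR matters.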

We have quite a lot of text layer spacing issues open now. Perhaps it's indeed time to revisit the SPACE_FACTOR constant handling.

The PDF is rendered as an HTML page. Each sentence is a div, so when we copy text we are actually copying divs, and there is no space between one div and the next. That is why the pasted text is missing spaces.
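
As a rough illustration of that explanation: if each text item carries no trailing whitespace of its own, concatenating the items directly runs the words together. The items array and the joinItems helper below are hypothetical stand-ins for illustration, not pdf.js internals.

```javascript
// Illustrative text items, shaped like { str } entries; none of them
// contains a trailing space.
const items = [{ str: 'missing' }, { str: 'all' }, { str: 'spaces' }];

// Hypothetical helper: joins the item strings with the given separator.
function joinItems(textItems, separator) {
  return textItems.map((item) => item.str).join(separator);
}

console.log(joinItems(items, ''));  // "missingallspaces"
console.log(joinItems(items, ' ')); // "missing all spaces"
```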

@luxferoo It depends.

p16.pdf

I attached this sample to show that some documents are rendered word by word, though not exactly: sometimes a word gets split.
This issue is really annoying and makes the text layer completely useless.

I also tried @ldenoue's patch. It works for his sample, but not for the sample I attached above.

image

I can confirm the problem with spaces.
This is my sample PDF.
There are no spaces in the pdf.js viewer/renderer.
Copied text pastes as one big word.
Testpdfsandwich.pdf

Hey @timvandermeij, is there any update on this? I am facing this issue too.

https://github.com/mozilla/pdf.js/blob/070f2d32ada10bd877bdfd6cae2a2179a20a9930/src/core/evaluator.js#L1247-L1248

Some code replaces all the space characters in the chunk, so I changed this line to:
str: (normalizeWhitespace ? replaceWhitespace(bidiResult.str) + ' ' : bidiResult.str),
This actually works for me.

@Lyn203 it's working for me too. Did you open a PR to fix this?

Actually, I don't really understand all of this logic in depth, so the change may affect other places.

I remember now why I changed this line: when normalizeWhitespace is set, replaceWhitespace is applied, so str loses the space character between chunks. I want to open a PR, but how?
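
To make the effect of that one-line change easier to see, here is a small sketch. The replaceWhitespace function below is a hypothetical stand-in that maps any whitespace character to a plain space; it is not the actual pdf.js implementation, and chunkString is an illustrative wrapper around the expression quoted above.

```javascript
// Hypothetical stand-in for the real helper: maps every whitespace
// character (tabs, non-breaking spaces, ...) to a plain space.
function replaceWhitespace(str) {
  return str.replace(/\s/g, ' ');
}

// Mirrors the modified expression: when normalizeWhitespace is set,
// a trailing space is appended to every chunk.
function chunkString(bidiStr, normalizeWhitespace) {
  return normalizeWhitespace ? replaceWhitespace(bidiStr) + ' ' : bidiStr;
}

const chunks = ['Inventors', 'have\tlong', 'dreamed'];

const normalized = chunks.map((c) => chunkString(c, true)).join('');
const untouched = chunks.map((c) => chunkString(c, false)).join('');

console.log(JSON.stringify(normalized)); // "Inventors have long dreamed "
console.log(JSON.stringify(untouched));  // "Inventorshave\tlongdreamed"
```

Note that the space is appended unconditionally, so it is also added after chunks that end in the middle of a word.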

After adding a space after every chunk, I see another issue. In a file like @brookhong's example, https://github.com/mozilla/pdf.js/files/1663608/p16.pdf, some chunks are not right:
In v en tors ha v e long dreamed of creating mac hines that think. This desire dates bac k to at least the time of ancien t Greece. The m ythical figures Pygmalion, Daedalus , and Hephaestus ma y all be interpreted as legendary in v en tors, and Galatea, T alos, and P andora may all b e regarded as artificial life ( , Ovid and Martin 2004 Spark es 1996 T andy 1997 ; , ; , ).

Also, some characters in ( , Ovid and Martin 2004 Spark es 1996 T andy 1997 ; , ; , ). are positioned incorrectly.
textContentItem may be scaled slightly wrong in some cases.
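
The broken output above comes from adding a space after every chunk even when a word is split across chunks. One possible alternative, sketched here purely as an assumption and not as anything pdf.js does, is to add a space only when the horizontal gap between neighbouring chunks is large enough. The item shape { str, x, width }, the joinByGap helper, and the minGap threshold are all hypothetical, and the approach relies on chunk positions and widths being accurate.

```javascript
// Hypothetical helper: joins chunks on one line, sorted left to right,
// inserting a space only where the gap to the previous chunk is at
// least minGap (same units as x and width).
function joinByGap(lineItems, minGap) {
  let text = '';
  let prevEnd = null;
  for (const item of lineItems) {
    if (prevEnd !== null && item.x - prevEnd >= minGap) {
      text += ' ';
    }
    text += item.str;
    prevEnd = item.x + item.width;
  }
  return text;
}

// Made-up positions: "Inventors" and "have" are each split into chunks
// that touch, with a visible gap only between the two words.
const lineItems = [
  { str: 'In', x: 0, width: 10 },
  { str: 'v', x: 10, width: 5 },
  { str: 'en', x: 15, width: 10 },
  { str: 'tors', x: 25, width: 20 },
  { str: 'ha', x: 48, width: 10 },
  { str: 'v', x: 58, width: 5 },
  { str: 'e', x: 63, width: 5 },
];

console.log(lineItems.map((i) => i.str).join(' ')); // "In v en tors ha v e"
console.log(joinByGap(lineItems, 2));               // "Inventors have"
```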

Search also works incorrectly because of this issue.

Any improvements on this, @timvandermeij?
I have issues with Hebrew, especially with Hebrew letters with dots.

Encountering this with https://www.ncsc.gov.uk/files/NCSC%20advisory%20-%20CNI%20Supply%20Chain.pdf page 13

Firefox 71
ff71_pdf_selectedtext_no_spaces

The spaces appear to be there but aren't even selectable in the span section, hence it only selects the text.

Chrome's viewer has spaces, as expected

I have the same problem: spaces are missing in text copied from pdf.js in Firefox 75.0. Copying works correctly with Evince on Ubuntu and the Edge viewer on Windows 10. The document was produced by ABBYY FineReader.

http://klevas.mif.vu.lt/~valentas/pdf_js_spaces.pdf

https://github.com/mozilla/pdf.js/blob/070f2d32ada10bd877bdfd6cae2a2179a20a9930/src/core/evaluator.js#L1247-L1248

Some code replaces all the space characters in the chunk, so I changed this line to:
str: (normalizeWhitespace ? replaceWhitespace(bidiResult.str) + ' ' : bidiResult.str),
This actually works for me.

But the chunk division is not correct: a complete word is split across multiple span tags. So if we add a space after each chunk, the text becomes meaningless again.

Encountering this with https://www.ncsc.gov.uk/files/NCSC%20advisory%20-%20CNI%20Supply%20Chain.pdf page 13

Firefox 71
ff71_pdf_selectedtext_no_spaces

The spaces appear to be there but aren't even selectable in the span section, hence it only selects the text.

Chrome's viewer has spaces, as expected

No, even the Chrome viewer has this issue.

We are running into this problem using ABBYY, as reported elsewhere in this issue.

I have noticed the following difference when using PDF.js on a PDF that was OCR'd by both ABBYY and Acrobat Pro:

image

I'm not sure what "composite" means, but it seems related to font encoding: PDF.js ends up on different code paths when composite is true vs. false. Does this suggest a possible workaround for using PDF.js on ABBYY OCR output? (I'm also asking ABBYY the same question.)

Any update on this issue? Thanks :)
