Pdf.js: Spaces missing when copying text from PDF

Created on 18 Nov 2015 · 26 Comments · Source: mozilla/pdf.js

I know some improvements have been made recently to the handling of spacing in text, but copying text from this particular PDF causes the spaces between words to go missing entirely (tested with the public PDF.js viewer, version "1.2.131").

Spaces in text are available when copying from Acrobat, Preview on OSX etc.

LOW_Article_5.pdf

^Erik

4-text-selection

All 26 comments

The spaces are missing due to an optimization that skips displaying whitespace-only divs (see https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L92). The PDF commands look like:

/F6 1 Tf
6.3761 0 0 6.3761 257.6175 558.2142 Tm
(Waco,)Tj
/F2 1 Tf
.0002 0 0 -.0002 275.6735 558.2141 Tm
( )Tj
/F6 1 Tf
6.3761 0 0 6.3761 277.3313 558.2142 Tm
(TX)Tj
/F2 1 Tf
.0002 0 0 -.0002 284.7781 558.2141 Tm
( )Tj
/F6 1 Tf
6.3761 0 0 6.3761 286.4359 558.2142 Tm
(76798-7353,)Tj
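
To illustrate why those `( )Tj` runs vanish from the copied text, here is a minimal sketch of a whitespace-skipping filter. The `str` field mirrors the text items discussed in this thread; the function name `dropWhitespaceItems` is hypothetical, and this is not the actual text_layer_builder.js code.

```javascript
// Hypothetical illustration: a text layer that skips whitespace-only runs.
// The items mimic the five text runs in the content stream above.
const items = [
  { str: 'Waco,' },
  { str: ' ' },
  { str: 'TX' },
  { str: ' ' },
  { str: '76798-7353,' },
];

// Keep only items that contain at least one non-whitespace character.
function dropWhitespaceItems(textItems) {
  return textItems.filter((item) => /\S/.test(item.str));
}

const withFilter = dropWhitespaceItems(items).map((i) => i.str).join('');
const withoutFilter = items.map((i) => i.str).join('');

console.log(withFilter);    // Waco,TX76798-7353,
console.log(withoutFilter); // Waco, TX 76798-7353,
```

Because each space in this PDF is its own text run, dropping whitespace-only runs removes every word separator on the line.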

Thanks for the speedy reply. The height and width of the spaces in this particular case seem to be off too. The height of a space is 0.0002 and the width is 0.00005, while the text element before a space has a height of 7.17.

Sorry, I just saw that the PDF actually specifies this as part of the PDF commands, so I guess it's in the PDF itself. It's actually using a different font for the spaces than for the text.
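
As a rough check of where the tiny space height comes from, the sketch below derives the scale factors from the `Tm` operands quoted earlier in the thread. It assumes the usual `[a, b, c, d, e, f]` text-matrix layout and a font size of 1 (the `Tf` operand); the helper name `textScale` is hypothetical.

```javascript
// Tm operands are [a, b, c, d, e, f]; the Tf operand supplies the font size.
// Horizontal scale is the length of (a, b); vertical scale is the length of (c, d).
function textScale(tm, fontSize) {
  const [a, b, c, d] = tm;
  return {
    horizontal: Math.hypot(a, b) * fontSize,
    vertical: Math.hypot(c, d) * fontSize,
  };
}

// Values copied from the content stream: font /F6 for words, /F2 for spaces.
const wordScale = textScale([6.3761, 0, 0, 6.3761, 257.6175, 558.2142], 1);
const spaceScale = textScale([0.0002, 0, 0, -0.0002, 275.6735, 558.2141], 1);

console.log(wordScale.vertical, spaceScale.vertical);
console.log(wordScale.vertical / spaceScale.vertical);
```

The space runs come out at a vertical scale of 0.0002, tens of thousands of times smaller than the surrounding words, which matches the observation that the odd sizes are written into the PDF itself.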

Can you experiment with disabling the optimization I mentioned above? See if it will resolve the issue.

I'm not using the text_layer_builder in my case, but disabling the optimisation above resolves it. I confirmed that the spaces are returned from getTextContent(). Their sizes (as I stated) are odd in this PDF, but they are returned properly.
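
A check along these lines can be sketched as follows. It assumes the object returned by `page.getTextContent()` has an `items` array whose entries carry `str`, `width` and `height`; the helper `summarizeWhitespace` is hypothetical, and the mock widths for the words are invented for illustration.

```javascript
// In a real viewer the input would come from:
//   const textContent = await page.getTextContent();
// A mock object with the same assumed shape keeps this sketch standalone.
const mockTextContent = {
  items: [
    { str: 'Waco,', width: 18.056, height: 6.3761 }, // width is a mock value
    { str: ' ', width: 0.00005, height: 0.0002 },    // sizes reported in the thread
    { str: 'TX', width: 7.4468, height: 6.3761 },    // width is a mock value
  ],
};

// Count whitespace-only items and report their dimensions.
function summarizeWhitespace(textContent) {
  const blanks = textContent.items.filter(
    (item) => item.str.length > 0 && item.str.trim() === ''
  );
  return {
    total: textContent.items.length,
    whitespaceOnly: blanks.length,
    sizes: blanks.map(({ width, height }) => ({ width, height })),
  };
}

const summary = summarizeWhitespace(mockTextContent);
console.log(summary);
```

If `whitespaceOnly` is greater than zero, the spaces are reaching the caller and any loss happens later, in whatever builds the selectable text.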

How did you disable the optimization exactly? Almost every document is missing spaces between words when using Find or copying/pasting the text.

@flexpaper and @slavajacobson

I made a simple workaround that appends a space character, changing textDiv.textContent = geom.str; to textDiv.textContent = geom.str + ' '; in the file https://github.com/mozilla/pdf.js/blob/master/src/display/text_layer.js#L109.

Now when I select the text to copy & paste, "line breaks" come out nicely as spaces ;)
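
The effect of that one-line change can be sketched outside the viewer. The helper `divText` and the sample fragments below are hypothetical; only the `geom.str + ' '` expression comes from the comment above.

```javascript
// Returns the text that would be assigned to textDiv.textContent,
// with or without the trailing-space workaround.
function divText(geom, appendSpace) {
  return appendSpace ? geom.str + ' ' : geom.str;
}

// Simulates selecting several text divs and copying their combined text.
function copiedText(fragments, appendSpace) {
  return fragments.map((geom) => divText(geom, appendSpace)).join('');
}

const lines = [{ str: 'copying text' }, { str: 'from this PDF' }];
console.log(copiedText(lines, false)); // copying textfrom this PDF
console.log(copiedText(lines, true));  // copying text from this PDF (plus a trailing space)

// Caveat: a fragment boundary inside a word also receives a space.
const midWord = [{ str: 'spa' }, { str: 'cing' }];
console.log(copiedText(midWord, true)); // spa cing (plus a trailing space)
```

The workaround helps when each div holds a whole line, but it would add spurious spaces in documents whose text runs are split inside words.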

Which lines of code need to be commented out or removed to remove spaces in the text?

@yurydelendik Could you please advise on this? I need it urgently.

You can try tweaking the following constants in pdf.js/src/core/evaluator.js (lines 1230 to 1232 in f9c5811):

var SPACE_FACTOR = 0.3;
var MULTI_SPACE_FACTOR = 1.5;
var MULTI_SPACE_FACTOR_MAX = 4;
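
For a feel of what tweaking these values might do, here is a simplified sketch of one plausible way such thresholds could classify the gap between two glyph runs. Only the three constants come from the snippet above; the function `classifyGap`, its return shape, and the exact rules are assumptions and not the code in evaluator.js.

```javascript
const SPACE_FACTOR = 0.3;
const MULTI_SPACE_FACTOR = 1.5;
const MULTI_SPACE_FACTOR_MAX = 4;

// Hypothetical rule set: compare the gap with multiples of the font's space width.
function classifyGap(gap, spaceWidth) {
  if (gap < spaceWidth * SPACE_FACTOR) {
    return { spaces: 0, newRun: false }; // too small: treat as kerning
  }
  if (gap < spaceWidth * MULTI_SPACE_FACTOR) {
    return { spaces: 1, newRun: false }; // about one space wide
  }
  if (gap < spaceWidth * MULTI_SPACE_FACTOR_MAX) {
    return { spaces: Math.round(gap / spaceWidth), newRun: false }; // several spaces
  }
  return { spaces: 0, newRun: true }; // very large gap: start a separate text run
}

const spaceWidth = 0.25; // assumed space width in text-space units
console.log(classifyGap(0.05, spaceWidth)); // { spaces: 0, newRun: false }
console.log(classifyGap(0.2, spaceWidth));  // { spaces: 1, newRun: false }
console.log(classifyGap(0.6, spaceWidth));  // { spaces: 2, newRun: false }
console.log(classifyGap(1.5, spaceWidth));  // { spaces: 0, newRun: true }
```

Under these assumptions, lowering SPACE_FACTOR makes smaller gaps count as a space, while raising it makes the extraction more conservative about inserting one.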

@timvandermeij Thanks. I made some changes in the evaluator.js file, but the changes are not reflected after ng serve. How do I build after making changes to the pdf.js files?

You'll need to rebuild PDF.js. Refer to the README and the wiki for how to do that.

@timvandermeij When I run gulp dist-install to build the project, I get the following error:
Cloning baseline distribution
Error: command "git" with parameters "clone,--depth,1,https://github.com/mozilla/pdfjs-dist,build/dist/" exited with code 1

Can you help with this?

@timvandermeij Can you help with the above error?

I have never seen that error before. Try setting up a new clean environment. If you're on Windows, some other steps may be required: https://github.com/mozilla/pdf.js/wiki/Setting-up-pdf.js-Development-Environment-for-Windows

@timvandermeij Thanks for your help.

I downloaded pdf.js-2.0.943 from git and ran gulp dist-install to build the project.

Afterwards I copied all the files and folders from the dist folder into the pdfjs-dist folder in my project's node_modules.

Earlier, pdf-viewer was able to show the PDF with the prebuilt pdfjs-dist. But now, after the custom build described above, it no longer shows the PDF.

I have been trying for a few days, but I'm not sure what is going wrong.

I would appreciate your help with this. A quick response would be very helpful.

Please refrain from repeatedly posting completely unrelated comments in issues, since that causes notification spam for people and makes it much more difficult to follow the actual discussion.

Basically, everything from https://github.com/mozilla/pdf.js/issues/6657#issuecomment-479874353 forwards is completely unrelated here (it possibly even started with https://github.com/mozilla/pdf.js/issues/6657#issuecomment-476180422), and should have been posted in a separate new issue (with all information from ISSUE_TEMPLATE.md provided). @timvandermeij Mind hiding/removing some of the off-topic comments?

Any updates on this? I am suffering a lot; I have to open PDF files in Chrome to copy text properly.

Our server-side tool for highlighting search keywords in PDFs works around this issue by getting the copied text from the server.
https://www.pdf-highlighter.com/docs/Text_Copy_Workaround.html
(it's a slightly customized version of PDF.js)
It could be overkill if all you need is text copy, but maybe you'll find it useful.

Any updates? A PDF viewer is a big deal for anyone who reads papers.

#10640

possibly related to #7310

I need to open some PDFs in Okular (Linux), which copies the text just fine, because the copied text is missing spaces in Firefox :(

or Chrome 😂
although Chrome's PDF viewer also has its own issues

Been there, done that, and I've found some PDF files that are missing spaces when copied in both FF and Chrome; that's why I fall back to Okular, which does the job reliably 😃

The spaces are removed when text is copied and pasted from a PDF. It works fine for PDFs where the text chunk (the string that we get from the byte array) is a complete line. It does not work when the string is split up word by word (often not even into proper words). I don't know why the string is split up so weirdly.

That is, when the span tag (HTML) contains a complete line of text, copy and paste works fine; when the text is split word by word, it does not. Can anyone please help me out? The PDF.js version is 2.2.228.
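
One possible direction for word-by-word chunks is to rebuild the text from item positions instead of relying on the spans. The sketch below is only an illustration under assumptions: each item is taken to have `str`, `width`, `height` and a `transform` whose fifth and sixth entries are the x and y position; the helper `joinItems`, the `gapRatio` threshold, and the mock widths are all hypothetical.

```javascript
// Joins text items, inserting a space when the horizontal gap between
// neighbouring items on the same line exceeds a fraction of the text height.
function joinItems(items, gapRatio = 0.15) {
  let text = '';
  let prev = null;
  for (const item of items) {
    const x = item.transform[4];
    const y = item.transform[5];
    if (prev) {
      const sameLine = Math.abs(y - prev.y) < prev.height * 0.5;
      const gap = x - prev.endX;
      if (!sameLine) {
        text += '\n';
      } else if (gap > prev.height * gapRatio && !text.endsWith(' ')) {
        text += ' ';
      }
    }
    text += item.str;
    prev = { y, endX: x + item.width, height: item.height };
  }
  return text;
}

// Positions taken from the Tm operators quoted in this thread; widths are mock values.
const words = [
  { str: 'Waco,', transform: [6.3761, 0, 0, 6.3761, 257.6175, 558.2142], width: 18.056, height: 6.3761 },
  { str: 'TX', transform: [6.3761, 0, 0, 6.3761, 277.3313, 558.2142], width: 7.4468, height: 6.3761 },
  { str: '76798-7353,', transform: [6.3761, 0, 0, 6.3761, 286.4359, 558.2142], width: 40, height: 6.3761 },
];

console.log(joinItems(words)); // Waco, TX 76798-7353,
```

Unlike appending a space to every chunk, a gap-based join leaves fragments that touch each other (for example a word split mid-way) unseparated, although the right threshold would likely vary between documents.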
