I know some improvements have been made recently to the handling of spacing in text, but copying text from this particular PDF causes the spaces between words to go missing completely (tested with the public PDF.js viewer, version "1.2.131").
Spaces in the text are preserved when copying from Acrobat, Preview on OS X, etc.
^Erik
The spaces are missing due to an optimization that avoids displaying whitespace-only divs (see https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L92). The PDF commands look like:
/F6 1 Tf
6.3761 0 0 6.3761 257.6175 558.2142 Tm
(Waco,)Tj
/F2 1 Tf
.0002 0 0 -.0002 275.6735 558.2141 Tm
( )Tj
/F6 1 Tf
6.3761 0 0 6.3761 277.3313 558.2142 Tm
(TX)Tj
/F2 1 Tf
.0002 0 0 -.0002 284.7781 558.2141 Tm
( )Tj
/F6 1 Tf
6.3761 0 0 6.3761 286.4359 558.2142 Tm
(76798-7353,)Tj
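To illustrate the effect described above, here is a minimal sketch, not the actual text_layer_builder.js code. The `items` array is a hypothetical stand-in that mirrors the Tj operands listed above, and `isAllWhitespace` and `copyText` are names made up for this example. It shows how skipping whitespace-only items glues the words together.

```javascript
// Hypothetical text items mirroring the Tj operands above (not real pdf.js output).
const items = [
  { str: "Waco," },
  { str: " " },
  { str: "TX" },
  { str: " " },
  { str: "76798-7353," },
];

// True when the string contains no non-whitespace character.
function isAllWhitespace(str) {
  return !/\S/.test(str);
}

// Concatenates the item strings, optionally skipping whitespace-only items
// the way a "do not render whitespace divs" optimization would.
function copyText(textItems, skipWhitespace) {
  return textItems
    .filter((item) => !(skipWhitespace && isAllWhitespace(item.str)))
    .map((item) => item.str)
    .join("");
}

console.log(copyText(items, true));  // Waco,TX76798-7353,
console.log(copyText(items, false)); // Waco, TX 76798-7353,
```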
Thanks for the speedy reply. The height and width of the spaces in this particular case seem to be off too. The height of a space is 0.0002 and the width is 0.00005, while the text element before a space has a height of 7.17.
Sorry, I just saw that the PDF actually specifies this as part of the PDF commands, so I guess it's in the PDF itself. It's actually using a different font for the spaces than for the text.
Can you experiment with disabling the optimization I mentioned above? See if it will resolve the issue.
I'm not using the text_layer_builder in my case, but disabling the optimization above resolves it. I confirmed that the spaces are returned from getTextContent(). Their sizes (as I stated) are odd in this PDF, but they are returned properly.
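A check like the one described could look roughly like this. It is only a sketch: `whitespaceItems` is a hypothetical helper, and `sample` is hand-written data shaped like the object I would expect from page.getTextContent() (items with `str`, `width` and `height`), using the sizes quoted earlier in this thread. The 18.0 and 7.4 widths are made up.

```javascript
// Returns the whitespace-only items of a text-content object with their sizes.
function whitespaceItems(textContent) {
  return textContent.items
    .filter((item) => item.str.length > 0 && item.str.trim() === "")
    .map((item) => ({ str: item.str, width: item.width, height: item.height }));
}

// Hypothetical sample; in real use this would come from page.getTextContent().
const sample = {
  items: [
    { str: "Waco,", width: 18.0, height: 7.17 },
    { str: " ", width: 0.00005, height: 0.0002 },
    { str: "TX", width: 7.4, height: 7.17 },
  ],
};

console.log(whitespaceItems(sample));
```

If the returned list is non-empty, the spaces are present in the extracted text even when their reported sizes are tiny.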
How did you disable the optimization exactly? Almost every document is missing spaces between words when using Find or copying/pasting the text.
@flexpaper and @slavajacobson
I made a simple workaround by adding a space character, changing
textDiv.textContent = geom.str;
to
textDiv.textContent = geom.str + ' ';
in the file https://github.com/mozilla/pdf.js/blob/master/src/display/text_layer.js#L109.
Now when I select the text to copy and paste, "line breaks" come out nicely as spaces ;)
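For anyone who wants to see the effect of this workaround outside the viewer, here is a small sketch. It assumes the text arrives as separate chunks without explicit spaces. `appendTrailingSpace`, `normalizeSpaces` and the `chunks` array are hypothetical names and data, not part of PDF.js.

```javascript
// Appends a space to every string, as the workaround does for each text div.
function appendTrailingSpace(strings) {
  return strings.map((str) => str + " ");
}

// Optional cleanup: collapses runs of whitespace, which helps when a document
// already contains explicit spaces and the workaround doubles them.
function normalizeSpaces(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Hypothetical chunks of one line, split without explicit spaces.
const chunks = ["Waco,", "TX", "76798-7353,"];
const copied = appendTrailingSpace(chunks).join("");

console.log(JSON.stringify(copied));                  // "Waco, TX 76798-7353, "
console.log(JSON.stringify(normalizeSpaces(copied))); // "Waco, TX 76798-7353,"
```

The trade-off is that every chunk gets a space, so a word that the PDF splits into several chunks would also be broken up.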
Which lines of code need to be commented out or removed to remove spaces in text?
@yurydelendik Can you please advise? I urgently need this.
You can try tweaking https://github.com/mozilla/pdf.js/blob/f9c58115fca3d5f3e0494c08adec99c8b1c43f19/src/core/evaluator.js#L1230-L1232
Lines 1230 to 1232 in f9c5811
var SPACE_FACTOR = 0.3;
var MULTI_SPACE_FACTOR = 1.5;
var MULTI_SPACE_FACTOR_MAX = 4;
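As a rough idea of what tweaking these values changes, here is a simplified sketch of how thresholds like these could be applied. It is an assumption-laden illustration, not the actual evaluator.js logic: `fakeSpacesForGap` is a hypothetical function, the gap and space width are plain numbers in the same unit, and MULTI_SPACE_FACTOR_MAX is left out.

```javascript
const SPACE_FACTOR = 0.3;
const MULTI_SPACE_FACTOR = 1.5;

// Decides how many spaces to insert for a horizontal gap between two glyph runs.
// gap and spaceWidth are assumed to be in the same unit.
function fakeSpacesForGap(gap, spaceWidth) {
  if (gap < spaceWidth * SPACE_FACTOR) {
    return ""; // gap too small to count as a space
  }
  if (gap < spaceWidth * MULTI_SPACE_FACTOR) {
    return " "; // roughly one space wide
  }
  // Wider gaps become several spaces, rounded to the nearest count.
  return " ".repeat(Math.round(gap / spaceWidth));
}

console.log(JSON.stringify(fakeSpacesForGap(2, 10)));  // ""
console.log(JSON.stringify(fakeSpacesForGap(5, 10)));  // " "
console.log(JSON.stringify(fakeSpacesForGap(30, 10))); // "   "
```

In this sketch, lowering SPACE_FACTOR makes smaller gaps count as spaces, and raising it makes the detection stricter.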
@timvandermeij Thanks. I made some changes in the evaluator.js file, but the changes are not reflected after running ng serve. How do I build after making changes in the pdf.js file?
You'll need to rebuild PDF.js. Refer to the README and the wiki for how to do that.
@timvandermeij When I run gulp dist-install to build the project, I get the following error:
Cloning baseline distribution
Error: command "git" with parameters "clone,--depth,1,https://github.com/mozilla/pdfjs-dist,build/dist/" exited with code 1
Can you help with it?
@timvandermeij Can you help with the above error?
I have never seen that error before. Try setting up a new clean environment. If you're on Windows, some other steps may be required: https://github.com/mozilla/pdf.js/wiki/Setting-up-pdf.js-Development-Environment-for-Windows
@timvandermeij Thanks for your help.
I downloaded pdf.js-2.0.943 from git and then ran gulp dist-install to build the project.
Afterwards I copied all the files and folders from the dist folder and pasted them into my project's node-modules pdfjs-dist folder.
Earlier, pdf-viewer was able to show the PDF with the prebuilt pdfjs-dist. But now, after the custom build explained above, it's not showing the PDF.
I have tried for a few days, but I'm not sure what is going wrong.
I am looking for your help with this. A quick response would be very helpful.
Please refrain from repeatedly posting completely unrelated comments in issues, since that causes notification spam for people and makes it much more difficult to follow the actual discussion.
Basically, everything from https://github.com/mozilla/pdf.js/issues/6657#issuecomment-479874353 forwards is completely unrelated here (it possibly even started with https://github.com/mozilla/pdf.js/issues/6657#issuecomment-476180422), and should have been posted in a separate new issue (with all information from ISSUE_TEMPLATE.md provided). @timvandermeij Mind hiding/removing some of the off-topic comments?
Any updates on this? I am suffering a lot; I have to open PDF files in Chrome to copy text properly.
Our server-side tool for highlighting search keywords in PDFs works around this issue by getting the copied text from the server.
https://www.pdf-highlighter.com/docs/Text_Copy_Workaround.html
(it's a slightly customized version of PDF.js)
It could be overkill if all you need is text copy, but maybe you'll find it useful.
Any updates? A PDF viewer is a big deal for anyone who reads papers.
possibly related to #7310
I need to open some PDFs in Okular (Linux), which copies the text just fine where Firefox is missing spaces :(
or Chrome 😂
although Chrome PDF viewer also has its own issues
Been there, done that, and I've found some PDF files that are missing spaces when copied in both FF and Chrome. That's why I use the Okular fallback, which does the job reliably 😃
The spaces are removed when text is copied and pasted from a PDF. It works fine for some PDFs, where the text chunk (the string that we get from the byte array) is a complete line. It does not work when the string is split up word by word (and not even into proper words). I don't know why the string is split up so weirdly.
That is, when the span tag (HTML) contains a complete line of text, copy and paste works fine; when it is word by word, it does not. Can anyone please help me out? The PDF.js version is 2.2.228.
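One possible approach for the word-by-word case, if you read the items from getTextContent() yourself, is to rebuild each line from the item positions. The sketch below assumes each item has `str`, `width`, `height` and a `transform` array whose fifth and sixth entries are the x and y position. `joinTextItems`, `gapRatio` and `lineTolerance` are hypothetical names and tuning values. The positions in `line` are taken from the Tm operands quoted earlier in this thread, and the widths are made up.

```javascript
// Joins text items into a string, inserting a space when the horizontal gap
// between consecutive items on the same baseline is large enough.
function joinTextItems(items, gapRatio = 0.15, lineTolerance = 1) {
  let text = "";
  let prev = null;
  for (const item of items) {
    if (prev) {
      const sameLine =
        Math.abs(item.transform[5] - prev.transform[5]) <= lineTolerance;
      const gap = item.transform[4] - (prev.transform[4] + prev.width);
      const hasSpace = /\s$/.test(text) || /^\s/.test(item.str);
      if (!sameLine) {
        text += "\n"; // different baseline: start a new line
      } else if (!hasSpace && gap > gapRatio * prev.height) {
        text += " "; // visible gap without an explicit space
      }
    }
    text += item.str;
    prev = item;
  }
  return text;
}

// Hypothetical items: positions from the thread, widths invented for the example.
const line = [
  { str: "Waco,", transform: [6.3761, 0, 0, 6.3761, 257.6175, 558.2142], width: 18.0, height: 6.3761 },
  { str: "TX", transform: [6.3761, 0, 0, 6.3761, 277.3313, 558.2142], width: 7.4, height: 6.3761 },
  { str: "76798-7353,", transform: [6.3761, 0, 0, 6.3761, 286.4359, 558.2142], width: 40.0, height: 6.3761 },
];

console.log(joinTextItems(line)); // Waco, TX 76798-7353,
```

The gap threshold is a guess and will likely need tuning per document, since too small a value splits words and too large a value merges them.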