Describe the feature:
Elasticsearch version (bin/elasticsearch --version):
ES 5.x
Plugins installed: []
ingest-attachment
JVM version (java -version):
1.8.x
OS version (uname -a if on a Unix-like system):
Description of the problem including expected versus actual behavior:
Steps to reproduce:
I have created an issue here.
https://issues.apache.org/jira/browse/PDFBOX-3985
Provide logs (if relevant):
2017/10/31 00:01:13.348 [WARN ] [elasticsearch[test][bulk][T#3]] [FontManager] Font not found: TimesNewRomanPS-BoldMT
2017/10/31 00:01:13.413 [ERROR] [elasticsearch[test][bulk][T#3]] [TrueTypeFont] An error occured when reading table cmap
java.io.IOException: CMap subtype 14 not yet implemented
at org.apache.fontbox.ttf.CMAPEncodingEntry.processSubtype14(CMAPEncodingEntry.java:304)
at org.apache.fontbox.ttf.CMAPEncodingEntry.initSubtable(CMAPEncodingEntry.java:114)
at org.apache.fontbox.ttf.CMAPTable.initData(CMAPTable.java:100)
at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80)
at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109)
at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84)
at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:533)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:458)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:537)
Thank you for creating the upstream issue against PDFBox!
This is not the first issue we've seen with parsing specific fonts. I think we can do better with the latest version of PDFBox, which, if I am not mistaken, logs these exceptions instead of throwing them. That way we can still extract what we can from the PDF.
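To make the "logs instead of throws" distinction concrete, here is a minimal, self-contained sketch of the two behaviours. It is not PDFBox or Tika code: the class `LenientFontTableDemo` and its methods are hypothetical stand-ins, and the only detail borrowed from the report is the "CMap subtype 14 not yet implemented" message.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical illustration only; none of these names exist in PDFBox or Tika. */
class LenientFontTableDemo {
    private static final Logger LOG =
            Logger.getLogger(LenientFontTableDemo.class.getName());

    /** Stand-in for a table reader that does not support one subtable format. */
    static String readSubtable(int subtype) throws IOException {
        if (subtype == 14) {
            throw new IOException("CMap subtype " + subtype + " not yet implemented");
        }
        return "subtable-" + subtype;
    }

    /** Strict variant: the first unsupported subtable aborts the whole read. */
    static List<String> readAllStrict(int[] subtypes) throws IOException {
        List<String> parsed = new ArrayList<>();
        for (int subtype : subtypes) {
            parsed.add(readSubtable(subtype));
        }
        return parsed;
    }

    /** Lenient variant: log the failure, skip that subtable, keep the rest. */
    static List<String> readAllLenient(int[] subtypes) {
        List<String> parsed = new ArrayList<>();
        for (int subtype : subtypes) {
            try {
                parsed.add(readSubtable(subtype));
            } catch (IOException e) {
                LOG.log(Level.WARNING, "Skipping unreadable subtable", e);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        int[] subtypes = {4, 14, 12};
        // The lenient read keeps the two supported subtables.
        System.out.println(readAllLenient(subtypes));
        try {
            readAllStrict(subtypes);
        } catch (IOException e) {
            // The strict read loses everything because of one bad subtable.
            System.out.println("strict read failed: " + e.getMessage());
        }
    }
}
```

With the lenient variant, one unsupported subtable costs only that subtable; with the strict variant, the caller gets nothing back. Whether a newer PDFBox actually behaves this way for this particular font is exactly what remains to be confirmed in this thread.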
I looked at this, and it seems that Apache Tika 1.17 depends on PDFBox 2.0.8:
[INFO] +- org.apache.tika:tika-parsers:jar:1.17:compile
[INFO] | +- org.apache.tika:tika-core:jar:1.17:compile
[INFO] | +- org.gagravarr:vorbis-java-tika:jar:0.8:compile
[INFO] | +- com.healthmarketscience.jackcess:jackcess:jar:2.1.8:compile
[INFO] | | \- commons-lang:commons-lang:jar:2.6:compile
[INFO] | +- com.healthmarketscience.jackcess:jackcess-encrypt:jar:2.1.2:compile
[INFO] | +- org.tallison:jmatio:jar:1.2:compile
[INFO] | +- org.apache.james:apache-mime4j-core:jar:0.8.1:compile
[INFO] | +- org.apache.james:apache-mime4j-dom:jar:0.8.1:compile
[INFO] | +- org.apache.commons:commons-compress:jar:1.14:compile
[INFO] | +- org.tukaani:xz:jar:1.6:compile
[INFO] | +- commons-codec:commons-codec:jar:1.10:compile
[INFO] | +- org.apache.pdfbox:pdfbox:jar:2.0.8:compile
[INFO] | | \- org.apache.pdfbox:fontbox:jar:2.0.8:compile
[INFO] | +- org.apache.pdfbox:pdfbox-tools:jar:2.0.8:compile
[INFO] | +- org.apache.pdfbox:jempbox:jar:1.8.13:compile
[INFO] | +- org.bouncycastle:bcmail-jdk15on:jar:1.54:compile
[INFO] | | \- org.bouncycastle:bcpkix-jdk15on:jar:1.54:compile
[INFO] | +- org.bouncycastle:bcprov-jdk15on:jar:1.54:compile
[INFO] | +- org.apache.poi:poi:jar:3.17:compile
[INFO] | | \- org.apache.commons:commons-collections4:jar:4.1:compile
[INFO] | +- org.apache.poi:poi-scratchpad:jar:3.17:compile
[INFO] | +- org.apache.poi:poi-ooxml:jar:3.17:compile
[INFO] | | +- org.apache.poi:poi-ooxml-schemas:jar:3.17:compile
[INFO] | | | \- org.apache.xmlbeans:xmlbeans:jar:2.6.0:compile
[INFO] | | \- com.github.virtuald:curvesapi:jar:1.04:compile
[INFO] | +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
[INFO] | +- org.ow2.asm:asm:jar:5.0.4:compile
[INFO] | +- com.googlecode.mp4parser:isoparser:jar:1.1.18:compile
[INFO] | +- com.drewnoakes:metadata-extractor:jar:2.10.1:compile
[INFO] | | \- com.adobe.xmp:xmpcore:jar:5.1.3:compile
[INFO] | +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
[INFO] | +- com.rometools:rome:jar:1.5.1:compile
[INFO] | | \- com.rometools:rome-utils:jar:1.5.1:compile
[INFO] | +- org.gagravarr:vorbis-java-core:jar:0.8:compile
[INFO] | +- com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
[INFO] | +- org.codelibs:jhighlight:jar:1.0.2:compile
[INFO] | +- com.pff:java-libpst:jar:0.8.1:compile
[INFO] | +- com.github.junrar:junrar:jar:0.7:compile
[INFO] | +- org.apache.commons:commons-exec:jar:1.3:compile
[INFO] | +- org.apache.opennlp:opennlp-tools:jar:1.8.3:compile
[INFO] | +- com.googlecode.json-simple:json-simple:jar:1.1.1:compile
[INFO] | +- com.tdunning:json:jar:1.8:compile
[INFO] | +- com.google.code.gson:gson:jar:2.8.1:compile
[INFO] | +- org.slf4j:slf4j-api:jar:1.7.24:compile
[INFO] | +- org.slf4j:jul-to-slf4j:jar:1.7.24:compile
[INFO] | +- org.slf4j:jcl-over-slf4j:jar:1.7.24:compile
[INFO] | +- org.apache.httpcomponents:httpclient:jar:4.5.4:compile
[INFO] | | \- org.apache.httpcomponents:httpcore:jar:4.4.7:compile
[INFO] | +- org.apache.httpcomponents:httpmime:jar:4.5.4:compile
[INFO] | +- org.apache.commons:commons-csv:jar:1.0:compile
[INFO] | +- org.apache.sis.core:sis-utility:jar:0.6:compile
[INFO] | +- org.apache.sis.storage:sis-netcdf:jar:0.6:compile
[INFO] | | +- org.apache.sis.storage:sis-storage:jar:0.6:compile
[INFO] | | \- org.apache.sis.core:sis-referencing:jar:0.6:compile
[INFO] | +- org.apache.sis.core:sis-metadata:jar:0.6:compile
[INFO] | +- org.opengis:geoapi:jar:3.0.0:compile
[INFO] | | \- javax.measure:jsr-275:jar:0.9.3:compile
[INFO] | \- edu.usc.ir:sentiment-analysis-parser:jar:0.1:compile
I can see that Tika will be updated to a newer PDFBox version with https://issues.apache.org/jira/browse/TIKA-2178 (for other reasons).
I opened https://issues.apache.org/jira/browse/TIKA-2579 to track this, BTW.
I'm not sure that will really fix the problem, though. As the PDFBox team asked, @TomonoriSoejima, could you share the failing PDF document so they can reproduce the problem, and so we can also add it to make sure that the next Tika version fixes it?
Thanks!
Ping @TomonoriSoejima. Could you please share a document?
Unfortunately, the user whose support case I was handling declined to share the reproducing file with us for privacy reasons, so I don't have the file.
https://issues.apache.org/jira/browse/TIKA-2579 has been fixed. \o/
Let's wait for a release now.
No further feedback, so closing. If this can be reproduced, we can reopen the issue.