Elasticsearch: ingest-attachment plugin Font not found: TimesNewRomanPS-BoldMT

Created on 1 Nov 2017 · 7 Comments · Source: elastic/elasticsearch

Describe the feature:

Elasticsearch version (bin/elasticsearch --version):
ES 5.x
Plugins installed: [ingest-attachment]
JVM version (java -version):
1.8.x
OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make for
us to reproduce it, the more likely that somebody will take the time to look at it.

Ingest a PDF document that contains the TimesNewRomanPS-BoldMT font.
The ingest pipeline then throws the error shown in the logs below.
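
For readers who want to try the steps above, here is a minimal sketch of the two requests involved: defining a pipeline with an `attachment` processor, then indexing a base64-encoded PDF through it. The pipeline id, index, type, document id, and field name are placeholders chosen for illustration and do not come from this report; the functions only build the request pieces and do not contact a cluster.

```python
import base64
import json


def build_pipeline_request(pipeline_id="attachment", field="data"):
    """Return (method, path, body) that defines a pipeline with one attachment processor."""
    body = {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    }
    return "PUT", "/_ingest/pipeline/" + pipeline_id, json.dumps(body)


def build_index_request(pdf_bytes, index="my_index", doc_type="my_type",
                        doc_id="1", pipeline_id="attachment", field="data"):
    """Return (method, path, body) that indexes a base64-encoded PDF via the pipeline."""
    # The attachment processor expects the binary document as a base64 string.
    body = {field: base64.b64encode(pdf_bytes).decode("ascii")}
    path = "/{}/{}/{}?pipeline={}".format(index, doc_type, doc_id, pipeline_id)
    return "PUT", path, json.dumps(body)
```

Sending both requests to a 5.x node with the ingest-attachment plugin installed, using the bytes of a PDF that references the font in question, is presumably how the failure would be triggered.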

I have also opened an upstream issue with PDFBox:
https://issues.apache.org/jira/browse/PDFBOX-3985

Provide logs (if relevant):

2017/10/31 00:01:13.348 [WARN ] [elasticsearch[test][bulk][T#3]] [FontManager] Font not found: TimesNewRomanPS-BoldMT
2017/10/31 00:01:13.413 [ERROR] [elasticsearch[test][bulk][T#3]] [TrueTypeFont] An error occured when reading table cmap
java.io.IOException: CMap subtype 14 not yet implemented
        at org.apache.fontbox.ttf.CMAPEncodingEntry.processSubtype14(CMAPEncodingEntry.java:304)
        at org.apache.fontbox.ttf.CMAPEncodingEntry.initSubtable(CMAPEncodingEntry.java:114)
        at org.apache.fontbox.ttf.CMAPTable.initData(CMAPTable.java:100)
        at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
        at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128)
        at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80)
        at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109)
        at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
        at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84)
        at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673)
        at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
        at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:533)
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
        at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:458)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.Tika.parseToString(Tika.java:537)

:CorFeatureIngest >bug feedback_needed

TomonoriSoejima

Most helpful comment

I looked at this, and it seems that Apache Tika 1.17 depends on PDFBox 2.0.8:

[INFO] +- org.apache.tika:tika-parsers:jar:1.17:compile
[INFO] |  +- org.apache.tika:tika-core:jar:1.17:compile
[INFO] |  +- org.gagravarr:vorbis-java-tika:jar:0.8:compile
[INFO] |  +- com.healthmarketscience.jackcess:jackcess:jar:2.1.8:compile
[INFO] |  |  \- commons-lang:commons-lang:jar:2.6:compile
[INFO] |  +- com.healthmarketscience.jackcess:jackcess-encrypt:jar:2.1.2:compile
[INFO] |  +- org.tallison:jmatio:jar:1.2:compile
[INFO] |  +- org.apache.james:apache-mime4j-core:jar:0.8.1:compile
[INFO] |  +- org.apache.james:apache-mime4j-dom:jar:0.8.1:compile
[INFO] |  +- org.apache.commons:commons-compress:jar:1.14:compile
[INFO] |  +- org.tukaani:xz:jar:1.6:compile
[INFO] |  +- commons-codec:commons-codec:jar:1.10:compile
[INFO] |  +- org.apache.pdfbox:pdfbox:jar:2.0.8:compile
[INFO] |  |  \- org.apache.pdfbox:fontbox:jar:2.0.8:compile
[INFO] |  +- org.apache.pdfbox:pdfbox-tools:jar:2.0.8:compile
[INFO] |  +- org.apache.pdfbox:jempbox:jar:1.8.13:compile
[INFO] |  +- org.bouncycastle:bcmail-jdk15on:jar:1.54:compile
[INFO] |  |  \- org.bouncycastle:bcpkix-jdk15on:jar:1.54:compile
[INFO] |  +- org.bouncycastle:bcprov-jdk15on:jar:1.54:compile
[INFO] |  +- org.apache.poi:poi:jar:3.17:compile
[INFO] |  |  \- org.apache.commons:commons-collections4:jar:4.1:compile
[INFO] |  +- org.apache.poi:poi-scratchpad:jar:3.17:compile
[INFO] |  +- org.apache.poi:poi-ooxml:jar:3.17:compile
[INFO] |  |  +- org.apache.poi:poi-ooxml-schemas:jar:3.17:compile
[INFO] |  |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.6.0:compile
[INFO] |  |  \- com.github.virtuald:curvesapi:jar:1.04:compile
[INFO] |  +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
[INFO] |  +- org.ow2.asm:asm:jar:5.0.4:compile
[INFO] |  +- com.googlecode.mp4parser:isoparser:jar:1.1.18:compile
[INFO] |  +- com.drewnoakes:metadata-extractor:jar:2.10.1:compile
[INFO] |  |  \- com.adobe.xmp:xmpcore:jar:5.1.3:compile
[INFO] |  +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
[INFO] |  +- com.rometools:rome:jar:1.5.1:compile
[INFO] |  |  \- com.rometools:rome-utils:jar:1.5.1:compile
[INFO] |  +- org.gagravarr:vorbis-java-core:jar:0.8:compile
[INFO] |  +- com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
[INFO] |  +- org.codelibs:jhighlight:jar:1.0.2:compile
[INFO] |  +- com.pff:java-libpst:jar:0.8.1:compile
[INFO] |  +- com.github.junrar:junrar:jar:0.7:compile
[INFO] |  +- org.apache.commons:commons-exec:jar:1.3:compile
[INFO] |  +- org.apache.opennlp:opennlp-tools:jar:1.8.3:compile
[INFO] |  +- com.googlecode.json-simple:json-simple:jar:1.1.1:compile
[INFO] |  +- com.tdunning:json:jar:1.8:compile
[INFO] |  +- com.google.code.gson:gson:jar:2.8.1:compile
[INFO] |  +- org.slf4j:slf4j-api:jar:1.7.24:compile
[INFO] |  +- org.slf4j:jul-to-slf4j:jar:1.7.24:compile
[INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.24:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.5.4:compile
[INFO] |  |  \- org.apache.httpcomponents:httpcore:jar:4.4.7:compile
[INFO] |  +- org.apache.httpcomponents:httpmime:jar:4.5.4:compile
[INFO] |  +- org.apache.commons:commons-csv:jar:1.0:compile
[INFO] |  +- org.apache.sis.core:sis-utility:jar:0.6:compile
[INFO] |  +- org.apache.sis.storage:sis-netcdf:jar:0.6:compile
[INFO] |  |  +- org.apache.sis.storage:sis-storage:jar:0.6:compile
[INFO] |  |  \- org.apache.sis.core:sis-referencing:jar:0.6:compile
[INFO] |  +- org.apache.sis.core:sis-metadata:jar:0.6:compile
[INFO] |  +- org.opengis:geoapi:jar:3.0.0:compile
[INFO] |  |  \- javax.measure:jsr-275:jar:0.9.3:compile
[INFO] |  \- edu.usc.ir:sentiment-analysis-parser:jar:0.1:compile
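
A listing like the one above is long, so a small script can help pick out the PDFBox artifacts. The following is a rough sketch, assuming each dependency line ends with `groupId:artifactId:jar:version:scope` as in the output above; the function name `find_versions` is made up for this example.

```python
import re

# Matches the trailing "groupId:artifactId:jar:version:scope" part of a tree line.
DEP_RE = re.compile(
    r"(?P<group>[\w.\-]+):(?P<artifact>[\w.\-]+):jar:"
    r"(?P<version>[\w.\-]+):(?P<scope>\w+)\s*$"
)


def find_versions(tree_text, group_id):
    """Map artifactId -> version for every dependency with the given groupId."""
    found = {}
    for line in tree_text.splitlines():
        match = DEP_RE.search(line)
        if match and match.group("group") == group_id:
            found[match.group("artifact")] = match.group("version")
    return found
```

Calling `find_versions(tree_text, "org.apache.pdfbox")` on the listing above should report `pdfbox`, `fontbox`, and `pdfbox-tools` at 2.0.8 and `jempbox` at 1.8.13.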

I can see that TIKA will be updated to a new pdfbox version with https://issues.apache.org/jira/browse/TIKA-2178 (for other reasons).
I opened https://issues.apache.org/jira/browse/TIKA-2579 to track this BTW.

I'm not sure that will really fix the problem, though. As the PDFBox team asked, @TomonoriSoejima, could you share the failing PDF document so that they can reproduce the problem and we can also add it to verify that the next Tika version fixes it?

Thanks!

dadoonet on 21 Feb 2018

👍2

All 7 comments

Thank you for creating the upstream issue against PDFBox!

talevy on 1 Nov 2017

👍1

This is not the first issue we've seen dealing with parsing specific fonts. I think we can do better with the latest version of PDFBox, which, if I am not mistaken, logs (instead of throws) these exceptions. That way we can still extract what we can from the PDF.
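
To illustrate the "log instead of throw" idea in general terms, here is a small sketch. It is not PDFBox or Tika code; `extract_text` and the `parse_page` callback are hypothetical names, and the point is only that a failure in one unit of work is logged and skipped so the remaining text is still returned.

```python
import logging

logger = logging.getLogger("extract_sketch")


def extract_text(pages, parse_page):
    """Parse every page, logging and skipping any page whose parser raises."""
    extracted = []
    for number, page in enumerate(pages, start=1):
        try:
            extracted.append(parse_page(page))
        except Exception as exc:
            # Log the failure and keep going instead of aborting the document.
            logger.warning("Skipping page %d: %s", number, exc)
    return "\n".join(extracted)
```

The trade-off is that the caller gets partial text without an error, so the log is the only place where the skipped content is visible.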

talevy on 1 Nov 2017

Ping @TomonoriSoejima. Could you please share a document?

dadoonet on 9 Mar 2018

Unfortunately, the user I was working with on the support case declined to share the file that reproduces the problem, citing privacy concerns, and I don't have the file myself.

TomonoriSoejima on 9 Mar 2018

😕1

https://issues.apache.org/jira/browse/TIKA-2579 has been fixed. \o/
Let's wait for a release now.

dadoonet on 28 Mar 2018

No further feedback, so closing. If this can be reproduced, we can reopen the issue.