I receive an error in eXide:
exerr:ERROR Problem with content extraction library: Unable to extract PDF content
and a huge stack trace in the logs:
Caused by: org.exist.contentextraction.AbortedAfterMetadataException
at org.exist.contentextraction.AbortAfterMetadataContentHandler.endElement(AbortAfterMetadataContentHandler.java:78) ~[exist-contentextraction-4.6.0.jar:4.6.0]
at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.XHTMLContentHandler.lazyEndHead(XHTMLContentHandler.java:198) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:248) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:292) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.startPage(AbstractPDF2XHTML.java:165) ~[tika-parsers-1.16.jar:1.16]
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:361) ~[pdfbox-2.0.6.jar:2.0.6]
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147) ~[tika-parsers-1.16.jar:1.16]
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319) ~[pdfbox-2.0.6.jar:2.0.6]
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) ~[pdfbox-2.0.6.jar:2.0.6]
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117) ~[tika-parsers-1.16.jar:1.16]
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:167) ~[tika-parsers-1.16.jar:1.16]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[tika-core-1.16.jar:1.16]
at org.exist.contentextraction.ContentExtraction.extractMetadata(ContentExtraction.java:75) ~[exist-contentextraction-4.6.0.jar:4.6.0]
... 76 more
I expected to receive metadata HTML like this:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.3"/>
... more meta fields ...
</head>
</html>
xquery version "3.1";
import module namespace contentextraction = "http://exist-db.org/xquery/contentextraction" at "java:org.exist.contentextraction.xquery.ContentExtractionModule";
contentextraction:get-metadata(file:read-binary('/Users/ilagunov/Downloads/sample.pdf'))
At the same time the following XQuery works well (and is a good workaround for now):
contentextraction:get-metadata-and-content(file:read-binary('/Users/ilagunov/Downloads/sample.pdf'))
Looking at the code, this is a Tika bug. Both calls just invoke the Tika library.
Hopefully an upgrade of Tika will help here.
Maybe I was too quick... In https://github.com/dizzzz/exist/blob/b163c1f969fa0bdca40d4f6ce27715ba9f034cee/extensions/contentextraction/src/main/java/org/exist/contentextraction/ContentExtraction.java#L80 there is a subtle difference in how exceptions are handled.
@dizzzz Hmm... sounds like you're onto something?
Even when I made the code more similar, the stack trace remained.
P.S. I've forgotten to mention that it used to work fine with eXist-db 2.2. Checking the code for 2.2, the exception handling is the same there:
https://github.com/eXist-db/exist/blob/eXist-2.2/extensions/contentextraction/src/org/exist/contentextraction/ContentExtraction.java
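For readers unfamiliar with the pattern under discussion: the stack trace above suggests that the metadata-only path stops parsing early by throwing an exception from a SAX handler once the head element closes, and the caller is then expected to catch it. Below is a minimal, self-contained sketch of that general technique using only the JDK's SAX parser. It is not eXist-db's or Tika's actual code; the names AbortAfterHeadSketch, AbortedAfterHead, and MetaCollector are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class AbortAfterHeadSketch {

    /** Hypothetical marker exception: signals "metadata complete", not a real failure. */
    static final class AbortedAfterHead extends SAXException {
        AbortedAfterHead() {
            super("head element complete");
        }
    }

    /** Collects <meta name="..." content="..."/> pairs and aborts when </head> is reached. */
    static final class MetaCollector extends DefaultHandler {
        final Map<String, String> meta = new LinkedHashMap<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("meta".equals(qName)) {
                meta.put(atts.getValue("name"), atts.getValue("content"));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if ("head".equals(qName)) {
                // Stop the parser here: everything after </head> is body content.
                throw new AbortedAfterHead();
            }
        }
    }

    static Map<String, String> extractMetadata(String xhtml) {
        MetaCollector collector = new MetaCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (AbortedAfterHead expected) {
            // Expected control flow: the handler asked to stop after the metadata.
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Anything else is a genuine parsing problem.
            throw new IllegalStateException("Unable to extract metadata", e);
        }
        return collector.meta;
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><meta name=\"pdf:PDFVersion\" content=\"1.3\"/></head>"
                + "<body><p>text</p><meta name=\"late\" content=\"x\"/></body></html>";
        // The meta element inside <body> is never seen because parsing stops at </head>.
        System.out.println(extractMetadata(xhtml));
    }
}
```

The point of the sketch is that the marker exception has to reach the caller as its own type. If any layer in between wraps or rethrows it differently, the specific catch no longer matches and the abort surfaces as an error instead.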
I've just checked with an older version of Apache Tika (the one from eXist-db 2.2) and it worked fine. I did the following:
I've just naively tried upgrading to the latest Apache Tika, 1.20, but it did not help; it shows the same exception. I used the following libs:
fontbox-2.0.13.jar
pdfbox-2.0.13.jar
pdfbox-debugger-2.0.13.jar
pdfbox-tools-2.0.13.jar
tika-core-1.20.jar
tika-parsers-1.20.jar
OK, that is a good test.
If I try and reproduce your issue on 4.6.1, I get a different exception:
Mar 09, 2019 7:40:36 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
Mar 09, 2019 7:40:36 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
Exception in thread "java-admin-client-0.query-0" java.lang.NoSuchMethodError: org.apache.fontbox.afm.AFMParser.parse(Z)Lorg/apache/fontbox/afm/FontMetrics;
at org.apache.pdfbox.pdmodel.font.Standard14Fonts.addAFM(Standard14Fonts.java:118)
at org.apache.pdfbox.pdmodel.font.Standard14Fonts.addAFM(Standard14Fonts.java:97)
at org.apache.pdfbox.pdmodel.font.Standard14Fonts.<clinit>(Standard14Fonts.java:50)
at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:91)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:64)
at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:122)
at org.apache.pdfbox.pdmodel.font.PDType1Font.<clinit>(PDType1Font.java:79)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:62)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:167)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
at org.exist.contentextraction.ContentExtraction.extractMetadata(ContentExtraction.java:75)
at org.exist.contentextraction.xquery.ContentFunctions.eval(ContentFunctions.java:137)
at org.exist.xquery.BasicFunction.eval(BasicFunction.java:74)
at org.exist.xquery.InternalFunctionCall.eval(InternalFunctionCall.java:41)
at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:71)
at org.exist.xquery.PathExpr.eval(PathExpr.java:276)
at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:71)
at org.exist.xquery.XQuery.execute(XQuery.java:261)
at org.exist.xquery.XQuery.execute(XQuery.java:185)
at org.exist.xmldb.LocalXPathQueryService.execute(LocalXPathQueryService.java:195)
at org.exist.xmldb.LocalXPathQueryService.lambda$execute$1(LocalXPathQueryService.java:162)
at org.exist.xmldb.function.LocalXmldbFunction.apply(LocalXmldbFunction.java:46)
at org.exist.xmldb.AbstractLocal.withDb(AbstractLocal.java:196)
at org.exist.xmldb.LocalXPathQueryService.execute(LocalXPathQueryService.java:161)
at org.exist.client.QueryDialog$QueryRunnable.run(QueryDialog.java:577)
at java.lang.Thread.run(Thread.java:748)
Never mind, I had some old Jars lying around on the classpath. I now get the same exception as @lagivan
@lagivan Would you be able to test the fix I prepared - https://github.com/eXist-db/exist/pull/2555
@adamretter I tested it and it does not fully work. The expected results are the following (as returned by the contentextraction:get-metadata-and-content function):
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.4"/>
<meta name="pdf:docinfo:title" content="My title"/>
The actual results are the following (as returned by the contentextraction:get-metadata function):
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="pdf:PDFVersion"/>
<meta name="pdf:docinfo:title" content="pdf:docinfo:title"/>
Note that @content now has the same value as @name. Replacing the last argument, "name", with "value" in the following line solves the issue:
attributes.addAttribute("", "content", "content", "string", name);
Also, the indentation has changed, but that is not critical.
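To make the suggested one-word change concrete, here is a minimal sketch of building the attributes for a single meta element with SAX's AttributesImpl. The class name MetaAttributesSketch and the helper metaAttributes are hypothetical; only the shape of the addAttribute call is taken from the line quoted above, and the surrounding code in the PR may look different.

```java
import org.xml.sax.helpers.AttributesImpl;

final class MetaAttributesSketch {

    /** Builds the attributes for one <meta name="..." content="..."/> element. */
    static AttributesImpl metaAttributes(final String name, final String value) {
        final AttributesImpl attributes = new AttributesImpl();
        attributes.addAttribute("", "name", "name", "string", name);
        // The content attribute must carry the metadata value, not the key again.
        attributes.addAttribute("", "content", "content", "string", value);
        return attributes;
    }

    public static void main(final String[] args) {
        final AttributesImpl attrs = metaAttributes("pdf:PDFVersion", "1.4");
        // prints "pdf:PDFVersion = 1.4"
        System.out.println(attrs.getValue("name") + " = " + attrs.getValue("content"));
    }
}
```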
@lagivan doh! A silly copy-paste mistake on my part. I have fixed that now and updated the PR. Hopefully all good now?