Exist: eXist-4.6.0 contentextraction:get-metadata fails with AbortedAfterMetadataException

Created on 22 Feb 2019 · 13 Comments · Source: eXist-db/exist

What is the problem

I receive an error in eXide:
exerr:ERROR Problem with content extraction library: Unable to extract PDF content
and a long stack trace in the logs:

Caused by: org.exist.contentextraction.AbortedAfterMetadataException
    at org.exist.contentextraction.AbortAfterMetadataContentHandler.endElement(AbortAfterMetadataContentHandler.java:78) ~[exist-contentextraction-4.6.0.jar:4.6.0]
    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.16.jar:1.16]
    at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256) ~[tika-core-1.16.jar:1.16]
    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.16.jar:1.16]
    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.16.jar:1.16]
    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.16.jar:1.16]
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273) ~[tika-core-1.16.jar:1.16]
    at org.apache.tika.sax.XHTMLContentHandler.lazyEndHead(XHTMLContentHandler.java:198) ~[tika-core-1.16.jar:1.16]
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:248) ~[tika-core-1.16.jar:1.16]
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:292) ~[tika-core-1.16.jar:1.16]
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.startPage(AbstractPDF2XHTML.java:165) ~[tika-parsers-1.16.jar:1.16]
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:361) ~[pdfbox-2.0.6.jar:2.0.6]
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147) ~[tika-parsers-1.16.jar:1.16]
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319) ~[pdfbox-2.0.6.jar:2.0.6]
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) ~[pdfbox-2.0.6.jar:2.0.6]
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117) ~[tika-parsers-1.16.jar:1.16]
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:167) ~[tika-parsers-1.16.jar:1.16]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.16.jar:1.16]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.16.jar:1.16]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[tika-core-1.16.jar:1.16]
    at org.exist.contentextraction.ContentExtraction.extractMetadata(ContentExtraction.java:75) ~[exist-contentextraction-4.6.0.jar:4.6.0]
    ... 76 more
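
The trace starts with "Caused by:", so the abort exception reached the caller wrapped inside another exception. The class names suggest a common SAX pattern: stop parsing once </head> closes by throwing a marker exception. The sketch below illustrates that pattern with JDK classes only. Every name in it (AbortAfterHeadSketch, AbortedAfterHead, HeadOnlyHandler, causedByAbort) is hypothetical; this is not the eXist-db or Tika source, and walking the cause chain is only one assumed way to cope with wrapping, not necessarily what the project did.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not the eXist-db or Tika implementation.
final class AbortAfterHeadSketch {

    /** Marker exception whose only job is to stop the parser early. */
    static final class AbortedAfterHead extends SAXException {
        AbortedAfterHead() {
            super("stopped after </head>");
        }
    }

    /** Collects meta name/content pairs, then aborts when </head> is seen. */
    static final class HeadOnlyHandler extends DefaultHandler {
        final Map<String, String> meta = new LinkedHashMap<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("meta".equals(qName)) {
                meta.put(atts.getValue("name"), atts.getValue("content"));
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if ("head".equals(qName)) {
                throw new AbortedAfterHead();
            }
        }
    }

    /** True if the marker exception appears anywhere in the cause chain. */
    static boolean causedByAbort(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof AbortedAfterHead) {
                return true;
            }
        }
        return false;
    }

    static Map<String, String> readMeta(String xhtml) {
        HeadOnlyHandler handler = new HeadOnlyHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            // Reaching </head> is the normal exit, even if a library wrapped
            // the marker; anything else is a genuine failure.
            if (!causedByAbort(e)) {
                throw new IllegalStateException("parse failed", e);
            }
        }
        return handler.meta;
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><meta name=\"pdf:PDFVersion\" content=\"1.3\"/></head>"
            + "<body><p>never read</p></body></html>";
        System.out.println(readMeta(xhtml));
    }
}
```

Catching only the marker type directly would miss the wrapped case, which is why the sketch checks the whole cause chain.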

What did you expect

I expected to receive the metadata html like this:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.3"/>
... more meta fields ...
</head>
</html>

Describe how to reproduce or add a test

  1. Download some sample PDF, for example:
    http://www.africau.edu/images/default/sample.pdf
  2. Run XQuery:
xquery version "3.1";
import module namespace contentextraction = "http://exist-db.org/xquery/contentextraction" at "java:org.exist.contentextraction.xquery.ContentExtractionModule";
contentextraction:get-metadata(file:read-binary('/Users/ilagunov/Downloads/sample.pdf'))

At the same time the following XQuery works well (and is a good workaround for now):
contentextraction:get-metadata-and-content(file:read-binary('/Users/ilagunov/Downloads/sample.pdf'))

Context information

  • eXist Version : 4.6.0 / Git commit : 5b5ba69e6
  • Java 1.8.0_191
  • Operating System : Mac OS X 10.14.3 x86_64
  • Installed using JAR installer
  • No customizations, fresh installation of eXist-db

All 13 comments

Looking at the code, this is a Tika bug. Both calls just invoke the Tika library.

Hopefully an upgrade of Tika will help here.

@dizzzz Hmm... sounds like you're onto something?

Even when I made the code of the two functions more similar, the stack trace remained.

P.S. I've forgotten to mention that it used to work fine with eXist-db 2.2. Checking the code for 2.2, the exception handling is the same there:
https://github.com/eXist-db/exist/blob/eXist-2.2/extensions/contentextraction/src/org/exist/contentextraction/ContentExtraction.java

I've just checked with an older version of Apache Tika (from eXist-db 2.2) and it worked fine. I did the following:

  • Replaced pdfbox* 2.0.6 libs with pdfbox 1.8.4
  • Replaced tika* 1.16 libs with tika 1.5

I've just naively tried to upgrade to the latest Apache Tika 1.20 but it did not help - it shows the same exception. I used the following libs:
fontbox-2.0.13.jar
pdfbox-2.0.13.jar
pdfbox-debugger-2.0.13.jar
pdfbox-tools-2.0.13.jar
tika-core-1.20.jar
tika-parsers-1.20.jar

OK, that is a good test.

If I try and reproduce your issue on 4.6.1, I get a different exception:

Mar 09, 2019 7:40:36 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Mar 09, 2019 7:40:36 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
Exception in thread "java-admin-client-0.query-0" java.lang.NoSuchMethodError: org.apache.fontbox.afm.AFMParser.parse(Z)Lorg/apache/fontbox/afm/FontMetrics;
    at org.apache.pdfbox.pdmodel.font.Standard14Fonts.addAFM(Standard14Fonts.java:118)
    at org.apache.pdfbox.pdmodel.font.Standard14Fonts.addAFM(Standard14Fonts.java:97)
    at org.apache.pdfbox.pdmodel.font.Standard14Fonts.<clinit>(Standard14Fonts.java:50)
    at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:91)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:64)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:122)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<clinit>(PDType1Font.java:79)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:62)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.exist.contentextraction.ContentExtraction.extractMetadata(ContentExtraction.java:75)
    at org.exist.contentextraction.xquery.ContentFunctions.eval(ContentFunctions.java:137)
    at org.exist.xquery.BasicFunction.eval(BasicFunction.java:74)
    at org.exist.xquery.InternalFunctionCall.eval(InternalFunctionCall.java:41)
    at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:71)
    at org.exist.xquery.PathExpr.eval(PathExpr.java:276)
    at org.exist.xquery.AbstractExpression.eval(AbstractExpression.java:71)
    at org.exist.xquery.XQuery.execute(XQuery.java:261)
    at org.exist.xquery.XQuery.execute(XQuery.java:185)
    at org.exist.xmldb.LocalXPathQueryService.execute(LocalXPathQueryService.java:195)
    at org.exist.xmldb.LocalXPathQueryService.lambda$execute$1(LocalXPathQueryService.java:162)
    at org.exist.xmldb.function.LocalXmldbFunction.apply(LocalXmldbFunction.java:46)
    at org.exist.xmldb.AbstractLocal.withDb(AbstractLocal.java:196)
    at org.exist.xmldb.LocalXPathQueryService.execute(LocalXPathQueryService.java:161)
    at org.exist.client.QueryDialog$QueryRunnable.run(QueryDialog.java:577)
    at java.lang.Thread.run(Thread.java:748)

Never mind, I had some old JARs lying around on the classpath. I now get the same exception as @lagivan.
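
A NoSuchMethodError like the one above usually means that two library versions are mixed on the classpath. One way to check is to ask the JVM where a class was loaded from. This is a minimal sketch using standard JDK calls; JarLocatorSketch and locationOf are hypothetical names, and the class passed in main is just a placeholder for whichever class you suspect (for example, one from fontbox).

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of eXist-db.
final class JarLocatorSketch {

    /** Returns the JAR or directory a class was loaded from, if known. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // Classes from the JDK itself typically have no code source.
        return source == null ? "(bootstrap or unknown)" : String.valueOf(source.getLocation());
    }

    public static void main(String[] args) {
        // Replace with the suspect class when it is on the classpath.
        System.out.println(locationOf(JarLocatorSketch.class));
    }
}
```

If the printed location points at an unexpected JAR, that JAR is the stale copy shadowing the intended one.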

@lagivan Would you be able to test the fix I prepared? https://github.com/eXist-db/exist/pull/2555

@adamretter I tested it, and it does not fully work. The expected results (as returned by the contentextraction:get-metadata-and-content function) are:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.4"/>
<meta name="pdf:docinfo:title" content="My title"/>

The actual results (as returned by the contentextraction:get-metadata function) are:

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta name="pdf:PDFVersion" content="pdf:PDFVersion"/>
        <meta name="pdf:docinfo:title" content="pdf:docinfo:title"/>

Note that @content now has the same value as @name. Replacing the last argument, name, with value in the following line solves the issue:
attributes.addAttribute("", "content", "content", "string", name);

Also the indentation has changed but it's not critical.
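
The fix described above can be sketched with the JDK's org.xml.sax.helpers.AttributesImpl. This is an illustration only: MetaAttributesSketch and metaAttributes are hypothetical names, and the surrounding code in the actual PR may look different. The "string" attribute type is copied from the line quoted above.

```java
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration of the name/value mix-up; not the PR's code.
final class MetaAttributesSketch {

    /** Builds the attributes for one <meta name="..." content="..."/> element. */
    static AttributesImpl metaAttributes(String name, String value) {
        AttributesImpl attributes = new AttributesImpl();
        attributes.addAttribute("", "name", "name", "string", name);
        // The reported bug passed `name` as the last argument here as well,
        // so @content repeated @name; `value` is the corrected argument.
        attributes.addAttribute("", "content", "content", "string", value);
        return attributes;
    }

    public static void main(String[] args) {
        AttributesImpl attrs = metaAttributes("pdf:PDFVersion", "1.4");
        System.out.println(attrs.getValue("name") + " -> " + attrs.getValue("content"));
    }
}
```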

@lagivan doh! A silly copy-paste mistake on my part. I have fixed that now and updated the PR. Hopefully all good now?
