A test project that reproduces the problem has been created here.
Steps to reproduce:
./mvnw package -Pnative
./target/otaibe-apache-tika-docx-native-1.0-SNAPSHOT-runner
curl -v -H "Content-Type: application/octet-stream" -X POST --data-binary @src/test/resources/test_bg.docx http://localhost:11025/parse
mvn package -D%test.service.http.port=11025
2020-01-14 14:43:40,589 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (executor-thread-1) HTTP Request to /parse failed, error id: 7eca2481-63eb-44e0-8c4c-4d57968f69ec-1: org.jboss.resteasy.spi.UnhandledException: org.apache.xerces.parsers.ObjectFactory$ConfigurationError: Provider org.apache.xerces.parsers.XIncludeAwareParserConfiguration not found
at org.jboss.resteasy.core.ExceptionHandler.handleApplicationException(ExceptionHandler.java:106)
at org.jboss.resteasy.core.ExceptionHandler.handleException(ExceptionHandler.java:372)
at org.jboss.resteasy.core.SynchronousDispatcher.writeException(SynchronousDispatcher.java:209)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:496)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:252)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:153)
at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:363)
at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:156)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:238)
at io.quarkus.resteasy.runtime.standalone.RequestDispatcher.service(RequestDispatcher.java:73)
at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler.dispatch(VertxRequestHandler.java:120)
at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler.access$000(VertxRequestHandler.java:36)
at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler$1.run(VertxRequestHandler.java:85)
at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:2011)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1535)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1426)
at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
at java.lang.Thread.run(Thread.java:748)
at org.jboss.threads.JBossThread.run(JBossThread.java:479)
at com.oracle.svm.core.thread.JavaThreads.threadStartRoutine(JavaThreads.java:460)
at com.oracle.svm.core.posix.thread.PosixJavaThreads.pthreadStartRoutine(PosixJavaThreads.java:193)
Caused by: org.apache.xerces.parsers.ObjectFactory$ConfigurationError: Provider org.apache.xerces.parsers.XIncludeAwareParserConfiguration not found
at org.apache.xerces.parsers.ObjectFactory.newInstance(Unknown Source)
at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
at org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)
at org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.<init>(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.newDocumentBuilder(Unknown Source)
at org.apache.poi.ooxml.util.DocumentHelper.newDocumentBuilder(DocumentHelper.java:91)
at org.apache.poi.ooxml.util.DocumentHelper.readDocument(DocumentHelper.java:165)
at org.apache.poi.openxml4j.opc.internal.ContentTypeManager.parseContentTypesFile(ContentTypeManager.java:392)
at org.apache.poi.openxml4j.opc.internal.ContentTypeManager.<init>(ContentTypeManager.java:104)
at org.apache.poi.openxml4j.opc.internal.ZipContentTypeManager.<init>(ZipContentTypeManager.java:54)
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:258)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:721)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:302)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:110)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at io.quarkus.tika.TikaParser.parseStream(TikaParser.java:85)
at io.quarkus.tika.TikaParser.getMetadata(TikaParser.java:68)
at io.quarkus.tika.TikaParser.getMetadata(TikaParser.java:64)
at org.otaibe.apache.tika.docx.nerror.TikaParserResource.getContentType(TikaParserResource.java:52)
at org.otaibe.apache.tika.docx.nerror.TikaParserResource.hello(TikaParserResource.java:38)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:151)
at org.jboss.resteasy.core.MethodInjectorImpl.lambda$invoke$3(MethodInjectorImpl.java:122)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:110)
at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:122)
at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:594)
at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:468)
at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(ResourceMethodInvoker.java:421)
at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:363)
at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:423)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:391)
at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invoke$1(ResourceMethodInvoker.java:365)
at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:110)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:365)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:477)
... 19 more
@tpenakov thanks, we've covered some cases, but since so many formats are supported, it is likely that not all code paths have been covered.
Adding a step which loads the org.apache.xerces.xni.parser.XMLParserConfiguration provider resource in TikaProcessor should fix it.
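For context, a provider resource of this kind is typically a small text file that lists implementation class names, one per line, with # starting a comment (the java.util.ServiceLoader convention). The following is a minimal sketch, assuming that file format, of how the names could be extracted before being registered. ProviderFileSketch and parseProviderNames are hypothetical names used only for illustration and are not part of TikaProcessor.

```java
import java.util.ArrayList;
import java.util.List;

final class ProviderFileSketch {

    /**
     * Parses provider class names from the text of a provider-configuration file.
     * Assumes one class name per line; '#' starts a comment; blank lines are ignored.
     */
    static List<String> parseProviderNames(String fileContent) {
        List<String> names = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop the comment part
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Illustrative file content; the class name is the one from the error above.
        String content = "# parser configuration provider\n"
                + "org.apache.xerces.parsers.XIncludeAwareParserConfiguration\n";
        System.out.println(parseProviderNames(content));
    }
}
```

In a native image, both the resource file and the listed classes have to be known at build time, which is why the lookup that works in JVM mode fails here.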
@sberyozkin thanks - Is it possible to do it on my project via configuration?
@tpenakov maybe with the SubstrateVM configuration, something like -H:ReflectionConfigurationFiles=reflect-config.json, but I can't find an example where one would set it to include some extra META-INF/services resource.
@dmlloyd, @gsmet do you know if this is possible?
Just for the record, it is org.apache.tika.parser.microsoft.ooxml.OOXMLParser which is not working in the native mode.
@sberyozkin - I've tried to dig into it myself and ended up with this configuration (below), but the error is still there :( Just a different configuration is missing.
Here is my reflection-config.json
[
{
"name" : "org.apache.xerces.parsers.XIncludeAwareParserConfiguration",
"allDeclaredConstructors" : true,
"allPublicConstructors" : true,
"allDeclaredMethods" : true,
"allPublicMethods" : true,
"allDeclaredFields" : true,
"allPublicFields" : true
},
{
"name" : "org.apache.xerces.impl.dv.ObjectFactory",
"allDeclaredConstructors" : true,
"allPublicConstructors" : true,
"allDeclaredMethods" : true,
"allPublicMethods" : true,
"allDeclaredFields" : true,
"allPublicFields" : true
},
{
"name" : "org.apache.poi.xwpf.usermodel.XWPFStyles",
"allDeclaredConstructors" : true,
"allPublicConstructors" : true,
"allDeclaredMethods" : true,
"allPublicMethods" : true,
"allDeclaredFields" : true,
"allPublicFields" : true
},
{
"name" : "org.apache.xerces.impl.dv.dtd.DTDDVFactoryImpl",
"allDeclaredConstructors" : true,
"allPublicConstructors" : true,
"allDeclaredMethods" : true,
"allPublicMethods" : true,
"allDeclaredFields" : true,
"allPublicFields" : true
}
]
@sberyozkin - if needed I can add the reflection-config.json to the test project?
@tpenakov thanks, I'll try to fix it at the processor level when I get to it
Maybe @tpenakov would be interested in contributing?
Thank you @gsmet ,
I do not know how to do this.
Could you please send me some links/guides?
I can try to read them and then to make a decision...
@tpenakov You can find some information at: https://quarkus.io/guides/writing-native-applications-tips
The real hard-core information however can be found here: https://quarkus.io/guides/writing-extensions.
The people on the Quarkus team would be glad to help you out should you decide to take this on
@geoand , @gsmet , @sberyozkin - I can try to do it.
@tpenakov let us know about your progress. If you don't get to it, maybe @irenakezic will be able to help.
@gsmet and @sberyozkin ,
I have some progress. But I am a little bit stuck here. Now the exception is:
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@33956e0
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at io.quarkus.tika.TikaParser.parseStream(TikaParser.java:85)
... 43 more
Caused by: org.apache.poi.ooxml.POIXMLException: org.apache.poi.xwpf.usermodel.XWPFSettings.<init>org.apache.poi.openxml4j.opc.PackagePart
at org.apache.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:66)
at org.apache.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:657)
at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:180)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:137)
at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:60)
at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:224)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:161)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 46 more
Caused by: java.lang.NoSuchMethodException: org.apache.poi.xwpf.usermodel.XWPFSettings.<init>org.apache.poi.openxml4j.opc.PackagePart
at java.lang.Class.getConstructor0(DynamicHub.java:3082)
at java.lang.Class.getDeclaredConstructor(DynamicHub.java:2178)
at org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:56)
at org.apache.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:63)
... 54 more
Adding this code to TikaProcessor does not fix it:
@BuildStep
ReflectiveClassBuildItem reflectionXWPFStyles() {
//https://github.com/quarkusio/quarkus/issues/6549
return new ReflectiveClassBuildItem(true, true, true, "org.apache.poi.xwpf.usermodel.XWPFStyles");
}
@BuildStep
ReflectiveClassBuildItem reflectionPackagePart() {
//https://github.com/quarkusio/quarkus/issues/6549
return new ReflectiveClassBuildItem(true, true, "org.apache.poi.openxml4j.opc.PackagePart");
}
@BuildStep
ReflectiveClassBuildItem reflectionZipPackagePart() {
//https://github.com/quarkusio/quarkus/issues/6549
return new ReflectiveClassBuildItem(true, true, "org.apache.poi.openxml4j.opc.ZipPackagePart");
}
I will need more time to take a closer look. Maybe on Friday and during the weekend...
@tpenakov Thanks for starting to look into it.
How did you register org.apache.xerces.xni.parser.XMLParserConfiguration?
Does updating registerTikaProviders with something like
serviceProvider.produce(
new ServiceProviderBuildItem("org.apache.xerces.xni.parser.XMLParserConfiguration",
getProviderNames("org.apache.xerces.xni.parser.XMLParserConfiguration")));
work?
Can you also please check that you don't have some competing dependencies, as the Caused by: java.lang.NoSuchMethodException: org.apache.poi.xwpf.usermodel.XWPFSettings.<init>org.apache.poi.openxml4j.opc.PackagePart error may imply?
Thanks
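The stack trace shows the exception coming from a reflective Class.getDeclaredConstructor call, which fails when no constructor with exactly the requested parameter types is visible. That can happen when a different library version with another constructor signature is on the classpath, or, in a native image, when the constructor was not registered for reflection. A minimal sketch of the lookup, using hypothetical stand-in classes (Part and Settings are not the POI classes):

```java
final class ConstructorLookupSketch {

    /** Hypothetical stand-in for a package part type. */
    static class Part {
    }

    /** Hypothetical stand-in for a document part created reflectively from a Part. */
    static class Settings {
        final Part part;

        Settings(Part part) {
            this.part = part;
        }
    }

    /** Returns true if the type declares a constructor with exactly these parameter types. */
    static boolean hasConstructor(Class<?> type, Class<?>... parameterTypes) {
        try {
            type.getDeclaredConstructor(parameterTypes);
            return true;
        } catch (NoSuchMethodException e) {
            // Same exception type as in the trace: no matching constructor is visible.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasConstructor(Settings.class, Part.class));   // matching signature
        System.out.println(hasConstructor(Settings.class, String.class)); // mismatched signature
    }
}
```

Running the same check in JVM mode against the real classes can help tell a signature mismatch apart from a missing reflection registration.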
@sberyozkin - Thanks for the quick feedback.
About the service registration:
I've registered it in a similar way:
serviceProvider.produce(
new ServiceProviderBuildItem(XMLParserConfiguration.class.getName(),
Arrays.asList("org.apache.xerces.parsers.XIncludeAwareParserConfiguration")));
This is because of the file content at:
org/apache/xerces/parsers/org.apache.xerces.xni.parser.XMLParserConfiguration
Should I update the registration to do it your way?
About the competing dependencies:
You were right - there are competing dependencies.
One comes from kogito:
[INFO] +- org.kie.kogito:drools-decisiontables:jar:0.6.1:compile
[INFO] | \- org.drools:drools-decisiontables:jar:7.29.0.Final:compile
[INFO] | +- org.apache.poi:poi-ooxml:jar:3.17:compile
[INFO] | | +- org.apache.poi:poi-ooxml-schemas:jar:3.17:compile
And one comes from tika parsers:
[INFO] +- org.apache.tika:tika-parsers:jar:1.22:compile
[INFO] | +- org.apache.poi:poi-ooxml:jar:4.0.1:compile
[INFO] | | +- org.apache.poi:poi-ooxml-schemas:jar:4.0.1:compile
I've tried to exclude it from the tika runtime pom.xml file (extensions/tika/runtime/pom.xml).
But now the org.apache.tika.parser.microsoft.ooxml.OOXMLParser is not working, because the class structure differs between poi-ooxml:3.17 and poi-ooxml:4.0.1.
I've also tried to exclude the poi-ooxml:3.17 dependency in the quarkus-bom pom.xml file:
<dependency>
  <groupId>org.kie.kogito</groupId>
  <artifactId>drools-decisiontables</artifactId>
  <version>${kogito.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml</artifactId>
    </exclusion>
  </exclusions>
</dependency>
Now the poi-ooxml:3.17 dependency is gone, however I am still getting the java.lang.NoSuchMethodException: org.apache.poi.xwpf.usermodel.XWPFSettings.
Is it possible to arrange a quick call with you or with someone else from the team in order to set up my environment properly? This would help me a lot, because I would save a lot of time and be much more productive in the end.
It would be great if we could arrange such a call :)
@sberyozkin - from my previous message, please ignore the part about dependencies and about the dedicated call in order to be more productive. @geoand helped me a lot with the 'productivity' setup.
@geoand - thank you for that!
About the java.lang.NoSuchMethodException: org.apache.poi.xwpf.usermodel.XWPFSettings.<init>org.apache.poi.openxml4j.opc.PackagePart - it is fixed now.
The next challenge is:
Caused by: java.util.MissingResourceException: Resource bundle not found org.apache.xerces.impl.msg.SAXMessages. Register the resource bundle using the option -H:IncludeResourceBundles=org.apache.xerces.impl.msg.SAXMessages.
at com.oracle.svm.core.jdk.LocalizationSupport.getCached(LocalizationSupport.java:66)
at java.util.ResourceBundle.getBundle(ResourceBundle.java:63)
at org.apache.xerces.util.SAXMessageFormatter.formatMessage(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.getProperty(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.setProperty(Unknown Source)
at org.apache.xmlbeans.impl.common.SAXHelper.trySetXercesSecurityManager(SAXHelper.java:119)
at org.apache.xmlbeans.impl.common.SAXHelper.newXMLReader(SAXHelper.java:49)
at org.apache.xmlbeans.impl.store.Locale.getSaxLoader(Locale.java:3055)
... 57 more
@tpenakov Nice, and indeed thanks to Georgios :-)
@tpenakov this can probably be added with SubstrateResourceBuildItem
@sberyozkin ,
I've found this way:
@BuildStep
public void registerResourceBundles(BuildProducer<NativeImageResourceBundleBuildItem> resource) throws Exception {
resource.produce(new NativeImageResourceBundleBuildItem("org.apache.xerces.impl.msg.SAXMessages"));
}
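The failure mode behind this registration can be reproduced with plain JDK code: ResourceBundle.getBundle throws MissingResourceException when the bundle cannot be located, and in a native image that includes bundles that were not registered at build time. A minimal sketch of such a check (BundleCheckSketch and bundleAvailable are hypothetical names):

```java
import java.util.MissingResourceException;
import java.util.ResourceBundle;

final class BundleCheckSketch {

    /** Returns true if a resource bundle with the given base name can be loaded. */
    static boolean bundleAvailable(String baseName) {
        try {
            ResourceBundle.getBundle(baseName);
            return true;
        } catch (MissingResourceException e) {
            // Same exception type as in the native-mode stack trace above.
            return false;
        }
    }

    public static void main(String[] args) {
        // The result depends on whether Xerces is on the classpath
        // (and, in native mode, on whether the bundle was registered).
        System.out.println(bundleAvailable("org.apache.xerces.impl.msg.SAXMessages"));
    }
}
```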
@tpenakov by the way, if you add an ooxml shortcut and use it in the configuration then it will save a ton of MBs in the native image size :-)
@tpenakov Super, I'm learning with you along the way :-)
@sberyozkin - I've added it as a docx shortcut, but you are right that ooxml is much more correct :)
@tpenakov yes, I just saw Tim (Tika lead) referring to the whole family as ooxml in one of the issues.
Re NativeImageResourceBundleBuildItem vs SubstrateResourceBuildItem, I was looking at the old version of TikaProcessor and forgot the latter was renamed :-)
@tpenakov one other thing, you may want to use https://github.com/sberyozkin/quarkus/blob/master/extensions/tika/deployment/src/main/java/io/quarkus/tika/deployment/TikaParsersConfigBuildItem.java in all the step functions dealing with OOXML and check that the list value returned from a map (the key is a parser name) is not null. If it is null, then the user has set some shortcuts that do not involve OOXML, and in this case whatever the OOXML step does can be skipped; the same goes for PDF-related resources - at the moment they are likely adding to the native image size even if you don't want to read PDF. It can be optimized later though.
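The null check described above can be sketched with plain collections. This is only an illustration under stated assumptions: the map stands in for whatever TikaParsersConfigBuildItem exposes, and ParserConfigSketch and prepareOoxmlResources are hypothetical names, not the extension's actual code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ParserConfigSketch {

    static final String OOXML_PARSER = "org.apache.tika.parser.microsoft.ooxml.OOXMLParser";

    /**
     * Returns the resource bundles to register for OOXML, or an empty list
     * when the configured parsers (keyed by parser name) do not include it.
     */
    static List<String> prepareOoxmlResources(Map<String, List<String>> parsersConfig) {
        if (parsersConfig.get(OOXML_PARSER) == null) {
            // The user selected parsers that do not involve OOXML: skip the step.
            return Collections.emptyList();
        }
        return Arrays.asList("org.apache.xerces.impl.msg.SAXMessages");
    }

    public static void main(String[] args) {
        Map<String, List<String>> config = new HashMap<>();
        config.put(OOXML_PARSER, Collections.<String>emptyList());
        System.out.println(prepareOoxmlResources(config));
    }
}
```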
@sberyozkin - ok about the OOXML.
About the PDF - do you want me to do it in the same task, or should I open a separate one for it?
@tpenakov please do it for PDF as well because it would make a diff for your case :-)
@sberyozkin - I will stop for today. Now I am fighting with this obstacle. Any ideas will be appreciated, because I have no more for today :)
Caused by: java.lang.ClassCastException: org.apache.xmlbeans.impl.values.XmlComplexContentImpl cannot be cast to org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBody
at org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTDocument1Impl.getBody(Unknown Source)
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:213)
at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:137)
at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:60)
at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:224)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:161)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 45 more
@tpenakov thanks. One thing I would try is to check, in debug mode, whether the same code path as in the above exception is taken when running in JVM (non-native) mode. It might be that some other resource has not been found in the native image (something similar happened when I was looking at PDF...).
Thanks for spending your time on this issue, enjoy the weekend
@tpenakov for example, set a breakpoint in org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor and check what is happening in JVM mode, whether it checks some resources, etc.
@sberyozkin - I am doing the magic with the breakpoint - I am setting it here: XWPFDocument.java:213
Then I am trying to dig into the classes actually used :)
Now it seems that the problem is here:
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTDocumentBase
On this line:
SchemaType type = (SchemaType)XmlBeans.typeSystemForClassLoader(CTDocumentBase.class.getClassLoader(), "schemaorg_apache_xmlbeans.system.sD023D6490046BA0250A839A9AD24C443").resolveHandle("ctdocumentbasedf5ctype");
schemaorg_apache_xmlbeans.system.sD023D6490046BA0250A839A9AD24C443 looks like a file to me, but I am unable to find it...
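One way to look for it is to treat the dotted name as a package and ask the class loader for a resource inside it. The sketch below assumes that XMLBeans keeps a generated TypeSystemHolder class in that package; the assumption comes from the error messages XMLBeans commonly reports, not from this thread. TypeSystemResourceSketch and its methods are hypothetical helper names.

```java
final class TypeSystemResourceSketch {

    /** Converts a dotted type-system name into the assumed holder class resource path. */
    static String holderResourcePath(String typeSystemName) {
        return typeSystemName.replace('.', '/') + "/TypeSystemHolder.class";
    }

    /** Returns true if the class loader can locate the given resource. */
    static boolean isOnClasspath(ClassLoader loader, String resourcePath) {
        return loader.getResource(resourcePath) != null;
    }

    public static void main(String[] args) {
        String path = holderResourcePath(
                "schemaorg_apache_xmlbeans.system.sD023D6490046BA0250A839A9AD24C443");
        ClassLoader loader = TypeSystemResourceSketch.class.getClassLoader();
        // The result depends on which schema jars are on the classpath.
        System.out.println(path + " present: " + isOnClasspath(loader, path));
    }
}
```

If the resource is found in JVM mode but the lookup fails in the native image, that would point to a missing resource or class registration rather than a missing jar.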
@tpenakov Perhaps this SO page can help, https://stackoverflow.com/questions/26854838/java-lang-nosuchmethoderror-org-openxmlformats-schemas-wordprocessingml-x2006-m/26856735
But it suggests a bigger schemas.jar may be required - however it is not clear why it would be needed in the native mode and not in the JVM mode
@sberyozkin - Over the weekend I was able to think first instead of only working :)
In my opinion - from an architecture point of view we need to have these two Apache projects as Quarkus extensions:
And then connect them to Apache Tika? What do you think?
Hi @tpenakov thanks but Tika supports thousand(s) of formats via dozens of libraries. I'm sorry, I don't see it as a scalable way to solve Tika related native issues :-)
So what is your suggestion for Tika in native mode - to have a whitelist or a blacklist for native format support?
We have a blacklist, but for the moment it seems that we need a whitelist instead, containing only the formats that are tested and supported?
What do you mean ? IMHO we need to fix this issue and keep doing so for the most popular formats once the issues get discovered. Native mode already works for PDF, OpenOffice, Excel (though not 100% sure which version) and hopefully many more.
The problem with the whitelist is that a given Tika parser may support N formats.
Currently I am trying to fix it to work for docx files.
Then I will have to fix it for xlsx files.
Then maybe for pptx files ...
As you say, there are many formats in a single parser, but in order to be certain that Tika works with a given parser/format we have to write a test to ensure it.
If we have a whitelist (you are right that we will probably have to put the format there too), the end user will be sure what works and what does not in native mode. And this is important if you want to put such a service into production use.
Speaking from my own needs so far:
I need to have a microservice that relies on two Apache Tika functionalities - getMetadata->Content-type and getText (if it is possible to read text with that content type). Because I am not sure which Content-Type of document I will receive from the microservice consumers, and knowing already how Tika behaves in native mode with unsupported/untested formats, I cannot afford to use that microservice compiled to native code in production.
IMO as a Quarkus end user - it would be much more helpful for me to know from the beginning that NOT everything works in native mode and to know exactly what works for sure.
Currently, if a Quarkus user reads the Apache Tika documentation, they will end up with the understanding that everything works fine in native mode. This happened to me initially, but it turns out that this is misleading.
It became a long post :) , but if I have to summarize it: Quarkus end users must know that the Apache Tika functionality is not covered in full in native mode and IMO - they must know exactly what is.
Do not get me wrong - I will fix the OOXML parser to work in native mode for most of the formats, but I am not going to use it in my service :)
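The kind of guard such a service could apply before calling getText can be sketched as an allowlist of content types that have been tested in native mode. The entries below are illustrative assumptions only, not a statement of what the extension supports, and NativeFormatAllowlistSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class NativeFormatAllowlistSketch {

    // Illustrative entries only; a real list would come from the tested formats.
    private static final Set<String> TESTED = new HashSet<>(Arrays.asList(
            "application/pdf",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document"));

    /** Returns true if the content type (ignoring parameters) is on the allowlist. */
    static boolean isTestedInNativeMode(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Strip parameters such as "; charset=UTF-8" and normalize case.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return TESTED.contains(base);
    }

    public static void main(String[] args) {
        System.out.println(isTestedInNativeMode("application/pdf; charset=UTF-8"));
        System.out.println(isTestedInNativeMode("application/x-unknown"));
    }
}
```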
Hi @tpenakov no problems, glad you are committed to fixing this issue :-)
I agree the more tests we have the better, but there are too many formats, I don't know all of them, and I haven't had time to cover even 50 with the tests. If we whitelist PDF, OpenOffice, OOXML, then we may have users opening issues about their format N no longer working. We don't even know if this issue is specific to your DOCX file or to DOCX in general.
@tpenakov I think we can come up with some documentation-level guidance. Let's try to resolve this issue, and then I'll open a new doc issue and CC you, thanks
Thank You @sberyozkin ,
I've just managed to fix it to work only for *.docx files.
It is implemented here.
Of course it is a very raw implementation, and the support for pdf and ooxml is not implemented as discussed.
I will probably continue to work on this task at the end of the week.
@tpenakov this is super, thanks a million. It may well be that the other OOXML formats will work now in the native mode too. It is nearly there, I'd just suggest grouping those 2 functions you added into one function, maybe called something like prepareOOXmlParser, and then adding a TikaParsersConfigBuildItem parserConfigItem parameter there as well, and just checking if the OOXML parser key is contained in the map and only running all that code if it is :-).
The same minor regrouping can be done for PDF (there are 3 PDF related functions there, but they should just become one, preparePDFParser for example).
Thanks, Sergey
Hi @sberyozkin ,
After a git rebase from master I've started to receive this exception (below). Registering the service provider javax.xml.transform.TransformerFactory, registering org.apache.xalan.processor.TransformerFactoryImpl and com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl for reflection, and adding the properties files org/apache/xalan/internal/res/XSLTInfo.properties and org/apache/xalan/res/XSLTInfo.properties doesn't solve it. Do you have any idea what might have caused it?
Caused by: javax.xml.transform.TransformerFactoryConfigurationError: Provider org.apache.xalan.processor.TransformerFactoryImpl not found
at javax.xml.transform.TransformerFactory.newInstance(Unknown Source)
at org.jboss.resteasy.plugins.providers.DocumentProvider.<init>(DocumentProvider.java:58)
at org.jboss.resteasy.plugins.providers.DocumentProvider.<init>(DocumentProvider.java:51)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at java.base/java.lang.Class.newInstance(Class.java:584)
at io.quarkus.arc.runtime.ArcRecorder$DefaultInstanceFactory.create(ArcRecorder.java:137)
at io.quarkus.resteasy.common.runtime.QuarkusConstructorInjector.construct(QuarkusConstructorInjector.java:41)
at org.jboss.resteasy.core.providerfactory.Utils.createProviderInstance(Utils.java:91)
at org.jboss.resteasy.core.providerfactory.ClientHelper.processProviderContracts(ClientHelper.java:146)
at org.jboss.resteasy.core.providerfactory.ResteasyProviderFactoryImpl.processProviderContracts(ResteasyProviderFactoryImpl.java:884)
at org.jboss.resteasy.core.providerfactory.ResteasyProviderFactoryImpl.registerProvider(ResteasyProviderFactoryImpl.java:876)
at org.jboss.resteasy.core.providerfactory.ResteasyProviderFactoryImpl.registerProvider(ResteasyProviderFactoryImpl.java:863)
at org.jboss.resteasy.plugins.providers.RegisterBuiltin.registerProviders(RegisterBuiltin.java:171)
at org.jboss.resteasy.plugins.providers.RegisterBuiltin.register(RegisterBuiltin.java:83)
at org.jboss.resteasy.core.ResteasyDeploymentImpl.startInternal(ResteasyDeploymentImpl.java:269)
at org.jboss.resteasy.core.ResteasyDeploymentImpl.start(ResteasyDeploymentImpl.java:90)
at io.quarkus.resteasy.runtime.standalone.ResteasyStandaloneRecorder.staticInit(ResteasyStandaloneRecorder.java:86)
at io.quarkus.deployment.steps.ResteasyStandaloneBuildStep$staticInit23.deploy_0(ResteasyStandaloneBuildStep$staticInit23.zig:730)
at io.quarkus.deployment.steps.ResteasyStandaloneBuildStep$staticInit23.deploy(ResteasyStandaloneBuildStep$staticInit23.zig:749)
at io.quarkus.runner.ApplicationImpl.<clinit>(ApplicationImpl.zig:324)
... 43 more
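The same kind of error can be reproduced in JVM mode with the JDK alone: when the javax.xml.transform.TransformerFactory system property names a class that cannot be loaded, TransformerFactory.newInstance() throws TransformerFactoryConfigurationError. A minimal sketch (TransformerLookupSketch and resolveFactory are hypothetical names; the missing class name is made up):

```java
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;

final class TransformerLookupSketch {

    static final String KEY = "javax.xml.transform.TransformerFactory";

    /**
     * Resolves the TransformerFactory implementation, optionally forcing a provider
     * class through the system property. Returns the implementation class name,
     * or a string starting with "ERROR" if the provider cannot be found.
     */
    static String resolveFactory(String providerClassName) {
        String previous = System.getProperty(KEY);
        try {
            if (providerClassName != null) {
                System.setProperty(KEY, providerClassName);
            }
            return TransformerFactory.newInstance().getClass().getName();
        } catch (TransformerFactoryConfigurationError e) {
            return "ERROR: " + e.getMessage();
        } finally {
            // Restore the original property value so other code is not affected.
            if (previous == null) {
                System.clearProperty(KEY);
            } else {
                System.setProperty(KEY, previous);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveFactory(null));
        System.out.println(resolveFactory("org.example.MissingTransformerFactory"));
    }
}
```

Printing the resolved class name in JVM mode shows which provider is actually picked up, which may help decide what needs to be present in the native image.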
@sberyozkin - I've managed to fix this one by adding the xalan dependency - maybe not in the right place, but we can discuss this later :)
The next error now is:
Error: com.oracle.svm.hosted.substitute.DeletedElementException: Unsupported method java.lang.ClassLoader.defineClass(String, byte[], int, int, ProtectionDomain) is reachable: The declaring class of this element has been substituted, but this element is not present in the substitution class
To diagnose the issue, you can add the option --report-unsupported-elements-at-runtime. The unsupported element is then reported at run time when it is accessed the first time.
Detailed message:
Trace:
at parsing com.sun.org.apache.xalan.internal.xsltc.trax.TemplatesImpl$TransletClassLoader.defineClass(TemplatesImpl.java:207)
Call path from entry point to com.sun.org.apache.xalan.internal.xsltc.trax.TemplatesImpl$TransletClassLoader.defineClass(byte[], ProtectionDomain):
at com.sun.org.apache.xalan.internal.xsltc.trax.TemplatesImpl$TransletClassLoader.defineClass(TemplatesImpl.java:207)
at com.sun.org.apache.xalan.internal.xsltc.trax.TemplatesImpl.defineTransletClasses(TemplatesImpl.java:514)
at com.sun.org.apache.xalan.internal.xsltc.trax.TemplatesImpl.getTransletInstance(TemplatesImpl.java:551)
at com.sun.org.apache.xalan.internal.xsltc.trax.TemplatesImpl.newTransformer(TemplatesImpl.java:584)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl.newTransformerHandler(TransformerFactoryImpl.java:1168)
at com.oracle.svm.reflect.TransformerFactoryImpl_newTransformerHandler_976de68b29ce82adebd0faf47ca3be047478766f_2667.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.xml.dtm.DTMException.printStackTrace(DTMException.java:365)
at org.apache.xml.dtm.DTMException.printStackTrace(DTMException.java:289)
at com.oracle.svm.jni.functions.JNIFunctions.ExceptionDescribe(JNIFunctions.java:752)
at com.oracle.svm.core.code.IsolateEnterStub.JNIFunctions_ExceptionDescribe_b5412f7570bccae90b000bc37855f00408b2ad73(generated:0)
@sberyozkin - I've added quarkus.native.additional-build-args=--report-unsupported-elements-at-runtime to integration-tests/tika/src/main/resources/application.properties
IMO - the error from the above comment is related to another issue, and for now I am concentrating on fixing OOXML to work in native mode
Hi @sberyozkin ,
I've managed to get it working for xlsx and pptx file types, but now I have 3 serious problems. The first two popped up after rebasing the code in my Quarkus fork last week. IMO they are not related to the current task and I will definitely need help in order to resolve them (a quick and dirty solution is applied for the moment :) ). Here are the problems together with some explanations:
1. The xalan dependency added to extensions/arc/runtime/pom.xml. I am certain that this is not the right place, but if I remove it the native tests fail. The error is described in my previous posts. This is just a hack so that I can concentrate on the current task.
2. The --report-unsupported-elements-at-runtime option added to integration-tests/tika/src/main/resources/application.properties. The error is shown in my previous post. This is just a hack so that I can concentrate on the current task.
3. The OOXML parser leads to OutOfMemoryError on my PC. Could you please advise me how to proceed with that?
The code is published on the same fork.
Need some help here.
Thanks in advance!
@tpenakov Hi, sorry for the delay, and thanks for continuing to spend time on the issue, it is really appreciated. I'm subscribed but I did not get a single notification... In fact I'm not getting notifications at all, which is strange...
Well, what do you think about going ahead with a new clean branch against the latest master and starting with a PR supporting the Docx format only, based on the work you showed me last week, just to move forward step by step, as it appears every new format in the OOXML family brings new issues?
What do you think ?
Cheers
@sberyozkin - no problem about the clean start with docx only.
I am almost certain that problems 1 and 2 from my previous post will be present there too.
I will let you know when I reach that point.
@tpenakov Yes, sounds good, let's get docx only working for the moment, I'm sure we will make it work :-). But please wait till #6752 is merged.
@tpenakov Hi, when you get time please start from a clean master. #6752 has been merged now, so it might also help with avoiding a few of the issues you've seen recently. As agreed, let's do DOCX first, thanks
Thank you @sberyozkin
Will let you know about the progress.
Hi @sberyozkin ,
I was thinking about the problem where the number of classes for docx, xlsx and pptx becomes too big for native compilation and the result is OutOfMemoryError.
What if we create a separate apache-tika extension per ooxml format? In that way we will have apache-tika-ooxml-docx, apache-tika-ooxml-xlsx, apache-tika-ooxml-pptx extensions.
What do you think - is there a chance this would solve the OutOfMemoryError?
Hi @tpenakov, OOM won't happen just because the native image is too big. Besides, with the parser configuration optimizations the tika extension will have a much slimmer native image, for example for PDF only, for DOCX only, etc.
Thanks
@tpenakov Hi, I've renamed this issue to focus it on the specific problem you reported with the Docx format. I will create a follow-up issue to check other OOXML formats in the native mode. Thanks
Hi @sberyozkin ,
Yep - that seems reasonable.
This week I wasn't able to work on this one, but I will hopefully finish it next week.
@tpenakov Hi, no problems, happy you are still OK with looking at this issue :-)
Hi @sberyozkin ,
The PR is created: https://github.com/quarkusio/quarkus/pull/7198
However, there are a few things to point out:
@ConfigProperty in io.quarkus.it.tika.TikaEmbeddedContentTest leads to an NPE in the native build. This one (#2061) claims that it is fixed, but I am still getting it.
Hi @tpenakov As noted in the PR, it is appreciated that you've spent so much time on this issue :-), I'll try to help now as well. By the way, please also watch #7171, which, if implemented, may help you more. Though as far as this extension is concerned, the POI issues will have to be fixed anyway.
I'll keep you up to date once I get to testing your PR, cheers
Thank you @sberyozkin - I am watching #7171 already. I also suggested that Apache POI together with XmlBeans become separate extensions.
BTW - the bigger part of Apache POI inclusion is done in this task...
Hi guys, is there a date when this will be fixed? I am planning to use Apache Tika with Quarkus in a microservice environment, and this bug is preventing the deployment of our stack.
https://github.com/apache/poi/blob/trunk/src/java/org/apache/poi/poifs/nio/CleanerUtil.java#L180 has to be addressed; I had to add --report-unsupported-elements-at-runtime to bypass the problem in order to upgrade to Tika 1.24.1 - which is ok-ish since POI does not work yet in the native mode. See also https://github.com/oracle/graal/issues/2761.
Update: a cleaner workaround is in place now thanks to @Sanne providing a CleanerUtil substitution.
@slpereira I'm not having enough time to prioritize on Tika issues, however, slowly but surely some issues are being addressed. I'll pick up this issue during the next round when I'll start looking at Tika issues. Thanks