Elasticsearch: ingest-attachment pipeline is rejecting documents where chars count > default 100k

Created on 8 Sep 2016 · 7Comments · Source: elastic/elasticsearch

Elasticsearch version: 5.0 alpha5

Plugins installed: [aggs-matrix-stats, ingest-common, lang-expression, lang-groovy, lang-mustache, lang-painless, percolator, reindex, transport-netty3, transport-netty4, discovery-ec2, ingest-attachment, lang-javascript, lang-python]

JVM version: JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_102/25.102-b14]

OS version: Linux/4.2.0-18-generic/amd64

Description of the problem including expected versus actual behavior:
I have been testing the ingest-attachment functionality against 5.0alpha5. I previously used mapper-attachment on earlier versions of ES. It seems that for larger files (I tested with PDFs), where there are greater than the 100k character default limit, instead of the ingest-attachment processing up to 100k characters (as it suggests), it rejects the document outright. It only works if I re-apply the pipeline to use indexed_chars = 1000000 (ie. 10x the limit).

The expected behaviour should be to just process up to the default lower limit, but this doesn't appear to be happening. The document only ends up getting processed if it is within the new higher limit for indexed_chars that I have set.

Steps to reproduce:
step1. Create a new ingest pipeline for attachment types, explicitly setting the index_chars to 100k.

curl -H 'Expect:' -XPUT "http://xxx:9200/_ingest/pipeline/files_ingest_attachment" -d '
{
  "description" : "An ingest-attachment pipeline into the files type",
  "processors" : [ {
    "attachment" : {
      "field": "blob",
      "indexed_chars": 100000
    }
  } ]
}'

step2. Push the document in base64 form to ES ... I'm using the _bulk API via elasticsearch-py to send my document ... which includes the base64 encoded field blob ... this works well.

step3. With 100k limit, the documented is rejected by ES with logs as per below.

step4. Apply a new indexed_chars limit of 1,000,000 characters.

curl -H 'Expect:' -XPUT "http://xxx:9200/_ingest/pipeline/files_ingest_attachment" -d '
{
  "description" : "An ingest-attachment pipeline into the files type",
  "processors" : [ {
    "attachment" : {
      "field": "blob",
      "indexed_chars": 1000000
    }
  } ]
}'

step5. Document processes into ES fine

Provide logs (if relevant):

master_1        | [2016-09-08 16:20:16,688][DEBUG][action.ingest            ] [kGGmfo7] failed to execute pipeline [files_ingest_attachment] for document [my-index/files/506c4eb31a91c1711039d6d1]
master_1        | ElasticsearchException[java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [blob]]; nested: TikaException[Unable to extract all PDF content]; nested: IOExceptionWithCause[Unable to write a string: short extract of text from document ]; nested: TaggedSAXException[Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).]; nested: WriteLimitReachedException[Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).];]; nested: IllegalArgumentException[ElasticsearchParseException[Error parsing document in field [blob]]; nested: TikaException[Unable to extract all PDF content]; nested: IOExceptionWithCause[Unable to write a string: short extract of text from the document ]; nested: TaggedSAXException[Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).]; nested: WriteLimitReachedException[Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).];]; nested: ElasticsearchParseException[Error parsing document in field [blob]]; nested: TikaException[Unable to extract all PDF content]; nested: IOExceptionWithCause[Unable to write a string: short extract of text ]; nested: TaggedSAXException[Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).]; nested: WriteLimitReachedException[Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).];
master_1        |   at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156)
master_1        |   at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107)
master_1        |   at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:52)
master_1        |   at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:166)
master_1        |   at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:41)
master_1        |   at org.elasticsearch.ingest.PipelineExecutionService$2.doRun(PipelineExecutionService.java:88)
master_1        |   at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:510)
master_1        |   at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
master_1        |   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
master_1        |   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
master_1        |   at java.lang.Thread.run(Thread.java:745)
master_1        | Caused by: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [blob]]; nested: TikaException[Unable to extract all PDF content]; nested: IOExceptionWithCause[Unable to write a string: short extract of text from document]; nested: TaggedSAXException[Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).]; nested: WriteLimitReachedException[Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).];
master_1        |   ... 11 more
master_1        | Caused by: ElasticsearchParseException[Error parsing document in field [blob]]; nested: TikaException[Unable to extract all PDF content]; nested: IOExceptionWithCause[short extract of text from the document ]; nested: TaggedSAXException[Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).]; nested: WriteLimitReachedException[Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).];
master_1        |   at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:126)
master_1        |   at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100)
master_1        |   ... 9 more
master_1        | Caused by: org.apache.tika.exception.TikaException: Unable to extract all PDF content
master_1        |   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:184)
master_1        |   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:144)
master_1        |   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
master_1        |   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
master_1        |   at org.apache.tika.Tika.parseToString(Tika.java:568)
master_1        |   at org.elasticsearch.ingest.attachment.TikaImpl$1.run(TikaImpl.java:94)
master_1        |   at org.elasticsearch.ingest.attachment.TikaImpl$1.run(TikaImpl.java:91)
master_1        |   at java.security.AccessController.doPrivileged(Native Method)
master_1        |   at org.elasticsearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:91)
master_1        |   at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:72)
master_1        |   ... 10 more

:CorFeatureIngest >bug

Source

Analect

All 7 comments

It's Tika 1.13 bug
https://issues.apache.org/jira/browse/TIKA-2098

alexshadow007 on 26 Sep 2016

❤1

@alexshadow007 thanks for catching this as the potential cause.

I can see from here that 1.13 is being used in ES. If someone can guide me how to replace that with Tika 1.14 in a build (I don't have much experience with java and its build systems), then I can certainly try to re-test against my body of documents. I can see from here, that the latest release is still showing 1.13 ... and that 1.14 is technically still unreleased. Is 1.14 expected to be formally released anytime soon, so that if it does fix this problem that it could get rolled into one the 5.0 beta releases?

Analect on 27 Sep 2016

I don't know when Tika 1,14 will be released.
As temporary solution it can be fixed without upgrade to 1.14.
We need set this property PDFParserConfig.setCatchIntermediateIOExceptions to false.

alexshadow007 on 27 Sep 2016

Seems like tika 1.14 will be released soon, so I think it is ok to wait, instead of adding a temporary workaround.

martijnvg on 30 Sep 2016

++ @martijnvg
I started it there (https://github.com/dadoonet/elasticsearch/commit/b98359000b9eb34b51d5cc15c2fcf7566465da82) and I'm waiting for the release :p