Elasticsearch: Add a language-detection processor to Ingest Node

Created on 15 Mar 2018 · 6 Comments · Source: elastic/elasticsearch

There are requests and existing solutions for providing language-detection
support within the ingest pipelines.

Request: https://github.com/elastic/elasticsearch/issues/23246
Alex's external plugin: https://github.com/spinscale/elasticsearch-ingest-langdetect

It would be nice to provide this as a separate processor within Elasticsearch, as a module or plugin.
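For illustration, a language-detection processor of the kind requested here would read text from one document field and write the detected language code to another. The Python sketch below only models that behavior. `langdetect_processor`, the `field` / `target_field` / `ignore_missing` parameters, and the stopword-based `toy_detect` stand-in are all hypothetical names chosen for this example; they are not the API of Elasticsearch or of the plugin linked above.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in detector: counts stopword hits per language.
# A real processor would delegate to a proper detection library instead.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "und", "ist", "das", "nicht"},
    "fr": {"le", "et", "est", "les", "des"},
}


def toy_detect(text: str) -> Optional[str]:
    """Return the language whose stopwords match most often, or None."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def langdetect_processor(
    doc: Dict[str, object],
    field: str,
    target_field: str,
    detect: Callable[[str], Optional[str]] = toy_detect,
    ignore_missing: bool = True,
) -> Dict[str, object]:
    """Return a copy of doc with the detected language stored in target_field."""
    if field not in doc:
        if ignore_missing:
            return dict(doc)
        raise KeyError(f"field [{field}] not present in document")
    result = dict(doc)
    language = detect(str(doc[field]))
    if language is not None:
        result[target_field] = language
    return result


enriched = langdetect_processor({"body": "the cat is on the roof"}, "body", "language")
print(enriched)
```

Passing the detector in as a callable keeps the processor logic independent of whichever library is eventually chosen, which is the open question in the rest of this thread.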

:CorFeatureIngest >feature CorFeatures help wanted

talevy

👍1

Most helpful comment

Please don't use Tika's built-in language detection. See https://issues.apache.org/jira/browse/TIKA-1723 for @kkrugler's work integrating Optimaize and why we prefer it to our own built-in language detection.

Given the other options available, our goal now is to make it easier to integrate other libraries. Right, cybozu isn't maintained. Optimaize, IIRC, is a fork of cybozu and is somewhat more recent, but has seen no activity in 2 years.

Y, you're right, CLD looks great, but JNI... It would be interesting to see a replication of Mike McCandless's evaluation with updated versions: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

As a side note @kkrugler has his own language detector: https://github.com/kkrugler/yalder :D

tballison on 18 Sep 2018

❤2

All 6 comments

Note that Tika uses a different library for language auto-detection:

Alex's plugin uses com.youcruit.com.cybozu.labs:langdetect:1.1.2-20151117
Tika uses org.apache.tika:tika-langdetect, which behind the scenes uses com.optimaize.languagedetector:language-detector:jar:0.5

While we are building an official lang-detect plugin, I think we should evaluate the pros and cons of the 2 libs (I have no idea TBH).

dadoonet on 16 Mar 2018

These aren't the only 2 libraries. There's also CLD2 and CLD3 (though the existing Java bindings aren't really great from what I've seen) and others. I think we should consider low heap utilization and language detection accuracy as the top 2 metrics to look into, with detection speed third, since the types of documents that really need language detection tend to have a relatively low index rate compared to the other types of documents we index.

I'm a big fan of this capability lying in an ingest node.

eskibars on 31 May 2018
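The three criteria in the comment above (accuracy, memory footprint, and speed) could be compared across candidate libraries with a small harness along these lines. This is only a sketch under stated assumptions: `evaluate`, `toy_detect`, and `SAMPLES` are hypothetical names, the labeled samples are made up for the example, and `tracemalloc` tracks Python allocations only, so it is a rough proxy and not a measurement of JVM heap for the Java libraries discussed in this thread.

```python
import time
import tracemalloc
from typing import Callable, Dict, Iterable, Optional, Tuple

Detector = Callable[[str], Optional[str]]


def evaluate(detect: Detector, samples: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Score a detector on (text, expected_language) pairs."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one labeled sample")
    tracemalloc.start()
    start = time.perf_counter()
    correct = sum(1 for text, expected in samples if detect(text) == expected)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "accuracy": correct / len(samples),
        # Peak Python allocations during detection (proxy for heap use).
        "peak_bytes": float(peak_bytes),
        "docs_per_second": len(samples) / elapsed if elapsed > 0 else float("inf"),
    }


def toy_detect(text: str) -> Optional[str]:
    """Hypothetical stand-in detector that only knows English and German."""
    words = set(text.lower().split())
    if words & {"the", "and", "is"}:
        return "en"
    if words & {"der", "und", "ist"}:
        return "de"
    return None


# Made-up labeled samples; a real comparison would need a large, varied corpus.
SAMPLES = [
    ("the house is small", "en"),
    ("der hund ist klein", "de"),
    ("le chat est petit", "fr"),
]

report = evaluate(toy_detect, SAMPLES)
print(report)
```

Each candidate library would be wrapped in a function with the same `Detector` signature so that the same samples and the same measurements apply to all of them. Memory used while loading language profiles is not captured unless loading happens inside the call being measured.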

It seems https://mvnrepository.com/artifact/com.youcruit.com.cybozu.labs/langdetect isn't maintained anymore.
Maybe not the best idea to start depending on that?

CLD seems to be somewhat superior (performance and accuracy) to Tika judging by a quick Google search. It does add a native/JNI dependency though.

=> Tika seems like the safest bet in terms of maintenance to me, but others probably know more here.

original-brownbear on 2 Jul 2018

We (ML team) are currently investigating this as part of a deployment of supervised models into the ingest pipeline. I'll add more details as we move this forward.

stevedodson on 6 Feb 2019

This is resolved by https://github.com/elastic/elasticsearch/pull/50292 . We (ML team) will also be publishing guidance on using the model in search use-cases (blog post pending).