There are requests and existing solutions for providing language-detection
support within the ingest pipelines.
Request: https://github.com/elastic/elasticsearch/issues/23246
Alex's external plugin: https://github.com/spinscale/elasticsearch-ingest-langdetect
It would be nice to provide this as a separate processor within Elasticsearch, as a module or plugin.
Note that Tika also provides a different library for language auto-detection:
com.youcruit.com.cybozu.labs:langdetect:1.1.2-20151117
org.apache.tika:tika-langdetect, which behind the scenes uses com.optimaize.languagedetector:language-detector:jar:0.5
While we are building an official lang-detect plugin, I think we should evaluate the pros/cons of the 2 libs (I have no idea TBH).
These aren't the only 2 libraries. There's also CLD2 and CLD3 (though the existing Java bindings aren't really great from what I've seen) and others. I think we should consider low heap utilization and language detection accuracy as the top 2 metrics to look into, and detection speed third, since the types of documents that really need language detection tend to have a relatively low index rate compared to the other types of documents we index.
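To make the comparison concrete, here is a minimal sketch of a harness that scores a detector on the three metrics mentioned above. Everything in it is an assumption for illustration: `evaluate`, `toy_detect` and `SAMPLES` are hypothetical names, the toy detector stands in for a real library, and `tracemalloc` only tracks Python allocations, so it is a rough stand-in for JVM heap measurements rather than an equivalent.

```python
import time
import tracemalloc
from typing import Callable, Dict, Iterable, Tuple


def evaluate(detect: Callable[[str], str],
             samples: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Score a detector on accuracy, throughput and peak Python memory.

    `detect` maps a text to a language code; `samples` holds
    (text, expected_language) pairs.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one labelled sample")

    tracemalloc.start()
    start = time.perf_counter()
    correct = sum(1 for text, lang in samples if detect(text) == lang)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "accuracy": correct / len(samples),
        # Guard against a zero-length timing window on very small inputs.
        "docs_per_sec": len(samples) / elapsed if elapsed > 0 else float("inf"),
        "peak_bytes": float(peak),
    }


def toy_detect(text: str) -> str:
    """Hypothetical stop-word detector; a placeholder for a real library."""
    words = set(text.lower().split())
    if words & {"the", "and", "is"}:
        return "en"
    if words & {"der", "und", "ist"}:
        return "de"
    return "unknown"


# Hypothetical labelled data; a real evaluation needs a proper corpus.
SAMPLES = [
    ("the cat is here", "en"),
    ("der Hund ist da", "de"),
    ("le chat est ici", "fr"),
]

if __name__ == "__main__":
    print(evaluate(toy_detect, SAMPLES))
```

Swapping `toy_detect` for a thin wrapper around each candidate library, and running all of them over the same labelled corpus, would give numbers that can be compared side by side.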
I'm a big fan of this capability lying in an ingest node.
It seems https://mvnrepository.com/artifact/com.youcruit.com.cybozu.labs/langdetect isn't maintained anymore.
Maybe not the best idea to start depending on that?
CLD seems to be somewhat superior (performance and accuracy) to Tika, judging by a quick Google search. It does add a native/JNI dependency though.
=> Tika seems like the safest bet in terms of maintenance to me, but others probably know more here.
Please don't use Tika's built-in language detection. See https://issues.apache.org/jira/browse/TIKA-1723 for @kkrugler's work integrating Optimaize and why we prefer it to our own built-in language detection.
Given the other options available, our goal now is to make it easier to integrate other libraries. Right, cybozu isn't maintained. Optimaize, IIRC, is a fork of cybozu and is somewhat more recent, but no activity in 2 years.
Y, you're right, CLD looks great, but JNI... It would be interesting to see a replication of Mike McCandless's evaluation with updated versions: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
As a side note @kkrugler has his own language detector: https://github.com/kkrugler/yalder :D
We (ML team) are currently investigating this as part of a deployment of supervised models into the ingest pipeline. I'll add more details as we move this forward.
This is resolved by https://github.com/elastic/elasticsearch/pull/50292. We (ML team) will also be publishing guidance on using the model in search use-cases (blog post pending).