Elasticsearch: Allow URLs for configuring Synonym Filter files

Created on 13 Jun 2017 · 7 Comments · Source: elastic/elasticsearch

Describe the feature: The Synonym Token Filter currently supports either synonyms_path or synonyms. Configuring synonyms_path requires direct control of the Elasticsearch cluster, which is not always possible, for example when using shared hosts.

This leaves us configuring the token filter inline using the synonyms option instead, which the docs suggest could increase the cluster size unnecessarily.

In this feature request, I'm requesting a third option, synonyms_url, which would load the synonyms file from a URL instead of the filesystem. This would allow us to configure synonyms even when we can't copy the synonyms file into the config directory.
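For reference, the two existing options can be sketched roughly as below. This is only an illustration of the request bodies, not code from the issue: the filter name my_synonyms, the analyzer name my_analyzer, the sample rules, and the path analysis/synonyms.txt are all made-up placeholders.

```python
import json

# Option 1: synonyms inlined in the index settings.
# The rules below are made-up examples in the Solr synonym format.
inline_filter = {
    "type": "synonym",
    "synonyms": [
        "laptop, notebook",
        "tv => television",
    ],
}

# Option 2: synonyms read from a file relative to the node's config directory.
# "analysis/synonyms.txt" is an assumed example path.
file_filter = {
    "type": "synonym",
    "synonyms_path": "analysis/synonyms.txt",
}


def index_settings(synonym_filter):
    """Wrap a synonym filter definition in a create-index request body."""
    return {
        "settings": {
            "analysis": {
                "filter": {"my_synonyms": synonym_filter},
                "analyzer": {
                    "my_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_synonyms"],
                    }
                },
            }
        }
    }


# Print the body that would be sent when creating the index.
print(json.dumps(index_settings(inline_filter), indent=2))
```

The inline variant needs nothing on the nodes' disks, while the file variant expects the file to exist in every node's config directory, which is the limitation described above.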

I am willing to contribute this feature if it is considered relevant. I think the patch would be in analysis/Analysis.java.

:SearcMapping

All 7 comments

Hey @prashnts, thanks for opening the issue. The feature you are describing has a couple of issues that we are not necessarily willing to buy into:

  • If you allow URLs, they might be pulled at different times (when indices / shards are allocated or relocated), which might result in different synonym files if the resource changes behind the scenes. BTW, the same problem exists with the file-based option, but it's already there and we might be able to improve it.
  • If you use a URL, you are basically opening a connection to an external system, which ES is not allowed to do by its core security permissions. We try really hard to keep it that way. From a security perspective, even with whitelists it's a risky thing to do.

A lot of users want to update the synonyms used for query-time synonym expansion. That is a problem that we need to solve better, but we don't want to fan out to external systems.

Thanks for the detailed explanation @s1monw! I see the rationale, and it makes sense.

If you find time, could you address a couple more questions regarding this? I would really appreciate it.

  • What would be considered a "big" inlined synonym set? Would, say, 2k lines of inlined synonyms be "big" enough to have a significant impact on cluster size? What about 10k lines?
  • Going through the source of Analysis.java, I see that both inlined and file-based synonyms are loaded into memory, so I am assuming inlining does not impact the runtime performance of the nodes. Is that correct?

Thank you!
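One rough way to put a number on the first question is to measure how many bytes the inlined rules take once serialized as JSON. The sketch below does only that: it does not say what threshold counts as "big", and the helper names load_synonym_rules and inline_settings_size are made up for illustration. It assumes the Solr synonym format, where blank lines and lines starting with # are ignored.

```python
import json


def load_synonym_rules(text):
    """Return the non-empty, non-comment lines of Solr-format synonym text."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rules.append(line)
    return rules


def inline_settings_size(rules):
    """Size in bytes of the JSON-encoded inline synonym filter definition."""
    body = {"type": "synonym", "synonyms": rules}
    return len(json.dumps(body).encode("utf-8"))


# Made-up sample content standing in for a real synonyms file.
sample = "# comment\nlaptop, notebook\n\ntv => television\n"
rules = load_synonym_rules(sample)
print(len(rules), "rules,", inline_settings_size(rules), "bytes when inlined")
```

Running this on a real 2k- or 10k-line file gives the serialized size of the filter definition, which is at least a concrete figure to compare against.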

@s1monw I stumbled upon this issue while looking for an easy way to update synonyms with the new Reload Analyzers API (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-reload-analyzers.html)

Being able to pull the necessary information from an API endpoint (hence introducing a synonyms_url) would be awesome.

Otherwise we would have to handle the distribution of the synonyms file to all Elasticsearch cluster nodes, which is not an easy task.
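For context, the file-based flow with the reload API looks roughly like the sketch below. The synonym filter is marked updateable (which restricts it to search-time analyzers), and a POST to the index's _reload_search_analyzers endpoint makes each node re-read its local copy of the file. The host, index name, and file path are assumptions, and the request is only built here, not sent.

```python
import urllib.request


def updateable_filter(path):
    """Synonym filter marked updateable; usable only in a search-time analyzer."""
    return {"type": "synonym", "synonyms_path": path, "updateable": True}


def reload_request(base_url, index):
    """Build (but do not send) the reload-search-analyzers request."""
    url = f"{base_url.rstrip('/')}/{index}/_reload_search_analyzers"
    return urllib.request.Request(url, method="POST")


# "http://localhost:9200" and "my-index" are placeholder values.
req = reload_request("http://localhost:9200", "my-index")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

The reload only picks up whatever file is present on each node, so the updated file still has to reach every node first.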

What's the current state (2 years after this issue was opened) regarding the introduction of a new setting? Is Elasticsearch still not allowed to call remote servers?

Hi Team,

Do you have any update on this issue?

I want to update synonym.txt dynamically and have it synced to all the nodes of the Elasticsearch cluster. Is there any feature in Elasticsearch to support this, or do I need to handle it manually in the application?

@sainine In case you, or anyone else, is interested, I created this linter for Elasticsearch synonyms files. It also works with the elasticsearch-dsl Python library.

This will at least let you be confident that there are no errors before pushing your synonyms file to the index in Elasticsearch. I haven't been following the development here, though, so unless the specs for it have changed, you can try it out.

https://github.com/agora-team/elasticsearch-synonyms

Example parser data: https://github.com/agora-team/elasticsearch-synonyms/blob/master/data/medical-terms.synonyms

(sorry for a shameless plug...)

Edit: @luflow I still think @s1monw's argument against making remote calls from a database process is rational. I would take managing clusters over risking another possible attack vector in some system. Just my 2 cents.

At my work we found a middle ground, though: propagating the synonym files through a Python script (there's an example in the repo I linked above) worked without issues in prod (2-node cluster) for two years. I don't work there anymore, so I can't say what the status is now.

Take it with a grain of salt: our cluster size was quite modest (<100k docs), and the 2-node cluster was simply a fail-safe; we didn't need it for performance. We also never needed to inline more than 1k lines of synonyms. So, YMMV.
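To give an idea of what such a propagation step can look like, here is a minimal sketch. It is not the script from the linked repo. It assumes each node's config directory is reachable as a filesystem path (for example through a network mount); in a real setup the copy would more likely go over scp or rsync. The function names and the analysis/synonyms.txt location are made up. After copying, it compares SHA-256 digests so that nodes ending up with different files are caught.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def propagate(source, node_dirs, relative="analysis/synonyms.txt"):
    """Copy source into each node directory; return the nodes whose copy differs."""
    expected = sha256_of(source)
    mismatched = []
    for node_dir in node_dirs:
        target = Path(node_dir) / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
        # Re-read the copy so a partial or failed write is detected.
        if sha256_of(target) != expected:
            mismatched.append(str(node_dir))
    return mismatched


# Demo with temporary directories standing in for two nodes' config dirs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "synonyms.txt"
    source.write_text("laptop, notebook\n", encoding="utf-8")
    nodes = [root / "node-1" / "config", root / "node-2" / "config"]
    print("nodes with mismatched copies:", propagate(source, nodes))
```

An empty result means every node received an identical copy, which addresses the "different synonym files on different nodes" concern for the file-based option.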

It would be great if we could manage the synonym file like we manage an index. Elasticsearch would be responsible for distributing out this file to every node in a similar fashion to how nodes have replicas of shards.

Any progress on this would be highly appreciated... AWS ES seems to have a feature where you can upload synonym files by uploading packages and triggering an update via the AWS ES APIs. Having something built into ES would be much nicer, of course.
