Elasticsearch-dsl-py: Allow index settings in Meta.index

Created on 1 Jan 2018 · 19 Comments · Source: elastic/elasticsearch-dsl-py

Currently, when defining a new DocType you have to create a separate Index object if you wish to configure any index-level settings. With the deprecation of multiple types in Elasticsearch 6.0 it might make more sense to just allow the index settings to be configurable on the DocType itself in some fashion.

This would then work even for situations where a given DocType is present in multiple indices (time-based indices, for example), meaning its Meta.index would specify a wildcard (products-*, for example) and provide settings, which would then allow it to generate and save an IndexTemplate and/or create any particular index via the init classmethod.

Any ideas and feedback are more than welcome!

Inspired by @3lnc's comment: https://github.com/elastic/elasticsearch-dsl-py/issues/779#issuecomment-350047587

enhancement discuss

All 19 comments

My random thoughts on this

  • I'd like to see DocType changed to Index or some abstract Model/Doc. DocType is kind of a useless abstraction already, and getting at index settings, the underlying IndicesClient, etc. is too nested w/o good reason (again, I know the motivation prior to ES6, but now it's stale, IMO)
  • a lot of stuff from DocType._doc_type should be pulled up in some way to have a clear public interface for inspecting the mapping, iterating/membership checks on fields, etc.
  • the thing I really want to have is some interface that allows implementing migration, index versioning, shrink/rollover. Maybe by wildcard name + some helpers, idk. But I see the need for a migration system in any project that uses ES for analytics (more than for full-text search), and every time it's implemented from the ground up.

I absolutely agree that Index/DocType distinction is no longer useful. What if we changed things around a bit, just brainstorming here so not all might make it to the target design or even make sense:

  • rename DocType to Document
  • remove all index-related options from Meta and ._doc_type

    • including init and index name

  • add class Index: as an (optional) section to Document that can contain definitions for:

    • index name (can contain a wildcard)

    • index settings/analyzers

    • method to determine index name for a document (like looking at a timestamp for example to go from my-idx-* to my-idx-2018.01.01)

    • template_name if this definition is to be defined via a template

    • rollover settings

    • alias setting allowing for some rules to

  • use those definitions to construct an Index object that will be accessible on the Document class (.index ??) which will:

    • store all the settings provided in the Index class

    • provide mechanisms for index management - create/create_template/delete/switch_alias (by default point an alias to latest index in the group)/rollover/shrink/allocation/...

    • have helper methods like migrate which will create a new index, reindex data from another index and swap alias
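
As a rough illustration of the bullet above about determining the index name for a document, the sketch below maps a wildcard pattern plus a document timestamp to a concrete daily index name. The helper name index_name_for and the %Y.%m.%d suffix are assumptions made for this example; nothing here is part of elasticsearch-dsl.

```python
from datetime import datetime


def index_name_for(pattern, timestamp, date_format="%Y.%m.%d"):
    """Turn a wildcard pattern such as 'my-idx-*' into a concrete index name.

    Hypothetical helper: only a single trailing '*' is supported.
    """
    if not pattern.endswith("*") or pattern.count("*") != 1:
        raise ValueError("expected exactly one trailing wildcard: %r" % pattern)
    # Replace the wildcard with the formatted date of the document.
    return pattern[:-1] + timestamp.strftime(date_format)


print(index_name_for("my-idx-*", datetime(2018, 1, 1)))  # my-idx-2018.01.01
```

A document class could call something like this at save time, while searches keep using the wildcard pattern itself.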

If someone wants to have multiple Document classes stored in a single index, the recommended way would be to use inheritance and override Meta.matches to dynamically determine which class to use for every hit returned. Question is how to then make sure the mappings get merged - I like the idea of a Document class iterating through all its subclasses and collecting mappings but it might be too confusing and dangerous so maybe we just do it explicitly somehow?

What do you think?

Hello,

I am currently trying to create a Search using an index pattern:

class LogIndex(DocType):
    pk = Integer()
    date = Date()
    log_message = Text(fields={'raw': Keyword()})

    class Meta:
        index = "logstash-*"

LogIndex.init()

and I get this error when the mapping is created by the init() call:

LogIndex.init()
monitoring-api_1  |   File "/usr/local/lib/python3.4/site-packages/elasticsearch_dsl/document.py", line 150, in init
monitoring-api_1  |     cls._doc_type.init(index, using)
monitoring-api_1  |   File "/usr/local/lib/python3.4/site-packages/elasticsearch_dsl/document.py", line 97, in init
monitoring-api_1  |     self.mapping.save(index or self.index, using=using or self.using)
monitoring-api_1  |   File "/usr/local/lib/python3.4/site-packages/elasticsearch_dsl/mapping.py", line 79, in save
monitoring-api_1  |     return index.save()
monitoring-api_1  |   File "/usr/local/lib/python3.4/site-packages/elasticsearch_dsl/index.py", line 250, in save
monitoring-api_1  |     self.put_mapping(doc_type=doc_type, body=mappings[doc_type])
monitoring-api_1  |   File "/usr/local/lib/python3.4/site-packages/elasticsearch_dsl/index.py", line 341, in put_mapping
monitoring-api_1  |     return self.connection.indices.put_mapping(index=self._name, **kwargs)
monitoring-api_1  |   File "/usr/local/lib/python3.4/site-packages/elasticsearch/client/utils.py", line 73, in _wrapped
monitoring-api_1  |     return func(*args, params=params, **kwargs)
monitoring-api_1  |   File "/usr/local/lib/python3.4/site-packages/elasticsearch/client/indices.py", line 282, in put_mapping
monitoring-api_1  |     '_mapping', doc_type), params=params, body=body)
monitoring-api_1  |   File "/usr/local/lib/python3.4/site-packages/elasticsearch/transport.py", line 312, in perform_request
monitoring-api_1  |     status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
monitoring-api_1  |   File "/usr/local/lib/python3.4/site-packages/elasticsearch/connection/http_requests.py", line 90, in perform_request
monitoring-api_1  |     self._raise_error(response.status_code, raw_data)
monitoring-api_1  |   File "/usr/local/lib/python3.4/site-packages/elasticsearch/connection/base.py", line 125, in _raise_error
monitoring-api_1  |     raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
monitoring-api_1  | elasticsearch.exceptions.NotFoundError: TransportError(404, 'index_not_found_exception', 'no such index')

I assume it is related to your proposal and that it is currently impossible to run a Search on an index pattern. Is that right?
Is there another way to do this in the meantime? If there is, I would appreciate your help, and in any case your PR seems like a good idea :-)

Thank you

@yogeek You cannot create an index with a wildcard in it. If you want a DocType pointing to a wildcard, that's fine, but then you need to pass in an explicit index when calling init or save:

LogIndex.init(index="logstash-2018.01.07")

@HonzaKral Thank you for your help.
I am using django-rest-elasticsearch to be able to do a search (it is based on elasticsearch-dsl, which is why I found this discussion on Meta.index) and the objects are defined as follows:

###  log_model.py
class LogModel(models.Model):
    date = models.DateTimeField(auto_now=True, blank=False, help_text=_("Publication date"))
    log_message = models.TextField(blank=False, help_text=_("Message content"))

###  search.py
class LogIndex(DocType):
    pk = Integer()
    date = Date()
    log_message = Text(fields={'raw': Keyword()})

    class Meta:
        index = "logstash-2018.01.07"
        using = Elasticsearch(hosts=['elasticsearch:9200/'],
                              connection_class=RequestsHttpConnection)

### log_view.py
class LogView(es_views.ListElasticAPIView):

    es_client = Elasticsearch(hosts=['elasticsearch:9200/'],
                              connection_class=RequestsHttpConnection)
    es_model = LogIndex
    es_pagination_class = es_pagination.ElasticLimitOffsetPagination

    es_filter_backends = (
        es_filters.ElasticFieldsFilter,
        es_filters.ElasticSearchFilter,
        es_filters.ElasticOrderingFilter,
    )

    es_ordering_fields = (
        "date",
    )

    es_filter_fields = (
        es_filters.ESFieldFilter('log_message', 'log_message'),
    )

    es_search_fields = (
        'log_message',
    )

When you say that a DocType can point to a wildcard: LogIndex inherits from DocType, so in theory it should be possible to pass index = "logstash-*" in the Meta?
Because it does not seem to work... Is it because of the abstraction layer of django-rest-elasticsearch?

I stumbled upon this looking for a way to have multiple Document classes stored in a single index.
@HonzaKral I saw this in your comment:

If someone wants to have multiple Document classes stored in a single index, the recommended way would be to use inheritance and override Meta.matches to dynamically determine which class to use for every hit returned

It seems to me that wouldn't work, because Meta.matches itself is tied to the index's mapping types (doc_type in the elasticsearch-dsl world). For each hit, the Request's get_result method iterates through the index's mapping types (see here and here) and calls the Meta.matches method corresponding to each mapping type. In a single-type index this means only one Meta.matches method gets called, and I won't get instances of different classes returned from Index.search.

Minimal example reproducing this at https://gist.github.com/afallou/1ef93050aec461f2695a7cfde23e3c14 - I'm defining custom types and putting two documents in the index; however, one of the results is returned as a generic Hit because its Meta.matches is not called.

Let me know if I have missed something; otherwise happy to help fixing this.

Thanks @afallou, I added a comment with a workaround to your gist. tl;dr - the Index code assumes doc_types have different names, we need to fix it

@yogeek using wildcards in Meta.index should work for search; problems might arise only when trying to create an index with a wildcard in it. Unfortunately I am not familiar enough with django-rest-elasticsearch to be able to tell you more atm.

btw I would also recommend you use the configuration options provided by elasticsearch-dsl (0) and use globally defined connections instead of creating them in your code. Also, the recommended connection class is the one based on urllib3; RequestsHttpConnection is slower and should only be used if you require some functionality of requests like auth plugins.

Hope this helps

0 - http://elasticsearch-dsl.readthedocs.io/en/latest/configuration.html

@HonzaKral thanks for your comment - even with your workaround, the assumption that doc_types have different names also means that when putting several DocTypes in the same index (and same mapping type) the index mapping won't be properly created; whichever DocType is declared last will override the mapping from the ones declared before (my example wasn't showing that because my models inherited from each other; I have since changed it).
Again, the solution probably is to maintain doc types as a list instead of a dict, as you mentioned.
I'm happy to work on a PR if it's something you can't pick up right now. Let me know.
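
To make the name collision concrete, here is a toy sketch that uses no elasticsearch-dsl code at all. The classes Post and Comment, the resolve function and the "kind" field are all made up for illustration; the only point is to contrast a registry keyed by doc type name with one kept as a list.

```python
class Hit(dict):
    """Generic fallback result, standing in for an untyped hit."""


class Post(dict):
    doc_type = "doc"

    @classmethod
    def matches(cls, hit):
        return hit["_source"].get("kind") == "post"


class Comment(dict):
    doc_type = "doc"  # same mapping type name as Post

    @classmethod
    def matches(cls, hit):
        return hit["_source"].get("kind") == "comment"


def resolve(hit, candidates):
    """Return the first class whose matches() accepts the hit, else Hit."""
    for cls in candidates:
        if cls.matches(hit):
            return cls
    return Hit


# Registry keyed by name: Comment silently replaces Post.
by_name = {}
for cls in (Post, Comment):
    by_name[cls.doc_type] = cls

# Registry kept as a list: both classes stay registered.
as_list = [Post, Comment]

post_hit = {"_type": "doc", "_source": {"kind": "post"}}
print(resolve(post_hit, by_name.values()).__name__)  # Hit
print(resolve(post_hit, as_list).__name__)           # Post
```

With the dict, only the class registered last is ever consulted, which mirrors both symptoms reported above: the generic Hit result and the overridden mapping.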

The problem with multiple DocTypes in an Index is that their mappings will have to get merged before creating the index, which can be a bit tricky. Though it should be doable using the merging mechanism we already have in the library.

Unfortunately I cannot work on it right now. If you want to take a stab at it I would be happy to help with feedback and any support I can.

A workaround would be to create a single DocType with the unified mappings manually, register that with the Index and then use different DocType classes to Search. That way you have the mappings under control and will be explicit (which is better than implicit ;) ) about the fact that the distinction between DocType classes in the same index exists purely in python
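
For a sense of what merging mappings involves, here is a minimal sketch over plain "properties" dictionaries. It is not the library's own merging mechanism; merge_properties and the sample fields are hypothetical, and the sketch simply refuses to merge fields whose definitions conflict.

```python
def merge_properties(base, extra, path=""):
    """Recursively merge two 'properties' dicts, refusing conflicting fields."""
    merged = dict(base)
    for name, definition in extra.items():
        field_path = path + name
        if name not in merged:
            merged[name] = definition
            continue
        existing = merged[name]
        if existing == definition:
            continue
        # Object fields on both sides: merge their sub-fields.
        # Any other keys are taken from the existing definition.
        if "properties" in existing and "properties" in definition:
            combined = dict(existing)
            combined["properties"] = merge_properties(
                existing["properties"], definition["properties"], field_path + "."
            )
            merged[name] = combined
            continue
        raise ValueError("conflicting definitions for field %r" % field_path)
    return merged


post = {
    "title": {"type": "text"},
    "author": {"properties": {"name": {"type": "keyword"}}},
}
comment = {
    "body": {"type": "text"},
    "author": {"properties": {"email": {"type": "keyword"}}},
}
unified = merge_properties(post, comment)
print(sorted(unified))                          # ['author', 'body', 'title']
print(sorted(unified["author"]["properties"]))  # ['email', 'name']
```

Failing loudly on a conflict keeps the unified mapping explicit, in the spirit of the workaround described above.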

I implemented some version of the multi-DocType mapping definition in a single index in a library built at our company that adds some Model and Store abstractions on top of elasticsearch_dsl (see the PR). I'll take a stab at porting that over here.

The multiple DocTypes in a single Index object have been addressed in 761f19f3838d7e933e187ba6bffbb02f2c2a29a5

Reading through this thread, it sounds like time-based indices are still not directly supported. I have some documents that I want to convert to ES-DSL which currently use aliases for both reads and writes (i.e. index_last_30_days for reads and index_current for the current index), combined with a template/curator to configure the new indices and adjust aliases as needed. What is the recommended way to plug this into DocType?

It sounds like I may need to manage the templates and aliases external to this library for the time being, and then maybe just set Meta.index to index_last_30_days or something and override save() to use the index index_current? Is there a better way? If this is the recommended route, I guess Meta.index would just be the default, and if I wanted to search another alias or time frame, I'd need to specify that on the search as well.

Ultimately I just need to spread big data out across multiple indices for performance, and to use aliases rather than wildcards to make re-indexing a bit easier and more transparent whenever mappings change, etc.

@rholloway I wonder what you mean by "supporting something like time-based indices"? You can do many things with time-based indices even now, without these planned improvements, see (0) for details. If there is something I am missing please do let me know; it is an important pattern (along with indices using the rollover API (1), which is a bit easier thanks to the aliases).

0 - http://elasticsearch-dsl.readthedocs.io/en/latest/persistence.html#indextemplate
1 - https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-rollover-index.html

Working draft of making DocType play nice with Index objects available in https://github.com/elastic/elasticsearch-dsl-py/pull/840

Any feedback is more than welcome!

It would be nice if the index property in DocType's Meta had a lazy option.

For example:
In our tests we want to change the index name from 'products' to 'test_products'. And we want it to work like in Django: regardless of settings, the database name is changed to test_<original-db-name>, so there is no chance of accidentally hitting the production database.

Currently DocType's index is a static property, and the @override_settings decorator takes effect too late to switch the index name. There are too many places that rely on cls._doc_type.index.

class MyType(DocType):
    class Meta:
        # `index` is still valid, but optional 
        # if absent - fallback to `get_index_name`

        @classmethod
        def get_index_name(cls):
            return settings.ES_INDEX_PRODUCTS_NAME

If there are other options to achieve such behavior, please advise.

@exslim why not extract get_index_name to your test framework and simply do something like

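from functools import lru_cache

@lru_cache(maxsize=None)  # bare @lru_cache needs Python 3.8+
def get_index_name(index):
    # settings.TEST is a project-specific flag, not a built-in Django setting
    return ('test_' if settings.TEST else '') + index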

...

class MyType(DocType):
    class Meta:
        index = get_index_name('my_type')

Code will be roughly the same. As for me – I don't see value in complicating metaparams.

@3lnc - while this may work, I don't like hacks for tests in production code (though I could move the get_index_name() function to the base settings.py - not the best idea, but thanks for the clue).
Guessing based on settings.TEST is also not so reliable; I would avoid it if possible.

My goal is to make this work without config changes (backward compatibility), which implies @override_settings is the only way to do this right.

The base testcase class would look something like this:

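import copy

TEST_ELASTICSEARCH_INDEX_PREFIX = 'test_'

def create_test_elasticsearch_settings(current_settings):
    # deep copy, so the nested per-index dicts of the real settings are not mutated
    test_settings = copy.deepcopy(current_settings)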
    # prepend 'test_' to every index name
    for key in test_settings.keys():
        idx_name = test_settings[key]['name']
        if idx_name.startswith(TEST_ELASTICSEARCH_INDEX_PREFIX):
            continue
        test_settings[key]['name'] = '{}{}'.format(TEST_ELASTICSEARCH_INDEX_PREFIX, idx_name)
    return test_settings


@override_settings(
    ELASTICSEARCH_INDICES=create_test_elasticsearch_settings(settings.ELASTICSEARCH_INDICES)
)
class BaseTestCase(TestCase):
    pass

Config looks like:

ELASTICSEARCH_INDICES = {
    'products': {
        'name': 'products',
        'mapping': os.path.join(BASE_DIR, 'schemas', 'v6', 'product.json')
    },
    ...
}

I think this is the right way to do it.

#840 is ready for review and contains some examples showcasing some of the techniques made possible by this change
