Sylius: [RFC] Replacing SearchBundle with Grid Elastica driver

Created on 31 Mar 2016 · 6 Comments · Source: Sylius/Sylius

Helloes,

SearchBundle has been great and provided us with search functionality when there was none, but it has some issues:

  • MySQL implementation of faceted search is quite complex and not very efficient
  • Finder is stateful and not clean
  • ProductController indexByTaxon method is quite messy

Goals:

Maintaining SQL implementation of search is a lot of work and most stores use ElasticaBundle and integrate search manually (which is quite simple) or use our not-so-perfect Elastic engine for SearchBundle. I'd like to simplify the whole thing.

Solution:

Product listing by category and the default product search would be handled by our GridBundle, because search is really just a grid of products with filtering and sorting. All of this can be handled by the Grid component now. Default Sylius would ship with a Doctrine ORM grid configured, which would provide basic search functionality (no faceted search):

Routing:

sylius_shop_product_index:
     path: /products/{taxonPermalink}
     defaults:
         _controller: sylius.controller.product:indexAction
         _sylius:
             template: SyliusShopBundle:Product:index.html.twig
             grid: sylius_shop_product

Grid (Doctrine/ORM):

sylius_grid:
    grids:
        sylius_shop_product:
            driver:
                name: doctrine/orm
                options:
                    class: "%sylius.model.product.class%"
                    repository:
                        method: createByChannelAndTaxonPermalinkQueryBuilder
                        arguments:
                            - "expr:service('sylius.context.channel').getChannel()"
                            - $taxonPermalink
            filters:
                search:
                    type: string
                    options:
                        fields: [translation.name, translation.description]
                # Other simple filters are totally possible, just without facets and with worse performance compared to Elasticsearch, of course.
            sorting:
                translation.name: asc
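The `createByChannelAndTaxonPermalinkQueryBuilder` repository method referenced above would still need to be written; a minimal sketch of what it could look like (the joins and property names are assumptions about the mapping, not part of this RFC):

```php
// Sketch for ProductRepository — the method name comes from the grid
// configuration above; the joins below assume the usual Sylius mapping.
public function createByChannelAndTaxonPermalinkQueryBuilder(ChannelInterface $channel, $taxonPermalink)
{
    return $this->createQueryBuilder('product')
        ->innerJoin('product.channels', 'channel')
        ->innerJoin('product.taxons', 'taxon')
        ->andWhere('channel = :channel')
        ->andWhere('taxon.permalink = :permalink')
        ->setParameter('channel', $channel)
        ->setParameter('permalink', $taxonPermalink);
}
```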

Grid (Elastica)

sylius_grid:
    grids:
        sylius_shop_product:
            driver:
                name: elastica
                options:
                    index: sylius
                    type: product
                    terms: # ???
                        channels: "expr:service('sylius.context.channel').getChannel()"
                        taxons: $taxonPermalink
            filters:
                search:
                    type: string # ??? or elastica (depends on how we can implement it)
                # Your facets configurations go here
                tshirt_color:
                    type: sylius_product_attribute
                    options:
                        code: tshirt_color

As you can see, using Grids allows us to reuse the whole system of filters and forms. If we implement an Elastica driver, we can use the same system and make it work with both Doctrine ORM and Elastica. Sounds crazy, but I already did that for Doctrine DBAL and Doctrine ORM. :dancer:

Pagerfanta supports Elastica and FOSElasticaBundle supports Pagerfanta, which makes the whole thing even easier.
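For illustration, the Elastica data source could hand back a pager along these lines (a sketch; `$searchable` and `$query` are placeholder variables for whatever the driver receives, and `ElasticaAdapter` is the adapter shipped with Pagerfanta):

```php
use Elastica\Query;
use Pagerfanta\Adapter\ElasticaAdapter;
use Pagerfanta\Pagerfanta;

// $searchable is the Elastica index or type to query; $query an Elastica\Query.
$pagerfanta = new Pagerfanta(new ElasticaAdapter($searchable, $query));
$pagerfanta->setMaxPerPage(20);
$pagerfanta->setCurrentPage(1);

foreach ($pagerfanta->getCurrentPageResults() as $result) {
    // each $result holds the indexed document data for one product
}
```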

@NeverResponse will be working on Elastica driver for grids. :)

Steps to implement a new driver in Sylius grids:

  1. Implement ElasticaDriver, which implements DriverInterface, example for ORM: https://github.com/Sylius/Sylius/blob/master/src/Sylius/Bundle/GridBundle/Doctrine/ORM/Driver.php
  2. Implement DataSource for this driver: https://github.com/Sylius/Sylius/blob/master/src/Sylius/Bundle/GridBundle/Doctrine/ORM/DataSource.php
  3. Implement ExpressionBuilder: https://github.com/Sylius/Sylius/blob/master/src/Sylius/Bundle/GridBundle/Doctrine/ORM/ExpressionBuilder.php
  4. Register in container with tag sylius.grid_driver: https://github.com/Sylius/Sylius/blob/master/src/Sylius/Bundle/GridBundle/Resources/config/drivers.xml
  5. Configure in your Grid :dancer:
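Step 1 might start from a skeleton roughly like the following (a sketch only; the exact interface method signatures and namespaces should be taken from the ORM Driver linked above, and everything else here is an assumption):

```php
namespace Sylius\Bundle\GridBundle\Elastica;

use Sylius\Component\Grid\Data\DriverInterface;
use Sylius\Component\Grid\Parameters;

// Sketch of the Elastica grid driver; mirror the real method
// signatures from the ORM Driver before relying on this.
final class ElasticaDriver implements DriverInterface
{
    const NAME = 'elastica';

    // Some Elastica "searchable" (index or type) injected via the container.
    private $searchable;

    public function __construct($searchable)
    {
        $this->searchable = $searchable;
    }

    public function getDataSource(array $configuration, Parameters $parameters)
    {
        // Translate the grid's driver options (index, type, terms) into an
        // Elastica query and hand it to the DataSource, which would build
        // a Pagerfanta-backed result set.
        return new ElasticaDataSource($this->searchable, $configuration);
    }
}
```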

Things to note:

  1. The Grid bundle should not depend on FOSElasticaBundle; the driver should only be enabled if FOSElasticaBundle is registered in the Kernel.
  2. The initial implementation needs to do only one thing: search by string with pagination. The findPaginated method should be helpful here; we do not need facets in the PoC.

Things to figure out in the future:

  1. How to pass Aggregates information to filters.
  2. How to take over the world with YAML programming.

Helpful links:

RFC

Most helpful comment

We've been working with Elasticsearch in Sylius for a long time, and I thought I would share our experiences, as there are a few major challenges and it's not quite as easy as you may think.

Our current results for search on the front-end of our site are really good, but they took a lot of tuning and the code architecture still needs work.

Indexing & Structure

The toughest challenge with Elasticsearch is synchronising the data with your primary relational database (MySQL).

FOSElasticaBundle provides serialization to help map the structures and a Doctrine lifecycle listener to keep the index up to date; unfortunately, the listener is only tied to the primary entity of concern, in this case Product (an example for Users is shown here: https://github.com/FriendsOfSymfony/FOSElasticaBundle/blob/master/Resources/doc/types.md).

The problem is that there's much more than just the basic entity describing the searchable aspects of a product. In fact, I doubt there's much of searchable relevance on the Product entity at all; it's mainly in other relations:

  • ProductTranslation
  • ProductAttribute
  • ProductAttributeValue
  • ProductOption
  • ProductVariant
  • ProductVariantTranslation
  • ProductTaxon

In our model at Reiss we are even more normalized with ProductColour and ProductSize as their own distinct entities which are grouped with ColourGroup and SizeGroup.

The problem is: how do you know what to index when you use the automatic listener? If an Attribute name is changed, you need to find all Products that use this Attribute and reindex them all. It becomes a complex cascading model of trying to work out when to index in order to keep data up to date.

Also, when you have such a complex 'web' of indexing dependencies, you can inadvertently find that saving a small value such as a name leads to indexing 5,000 products to Elasticsearch... There could be ways to alleviate the bottlenecks, but the point is that there is high complexity.

A brute-force alternative to the listener (and Doctrine lifecycle listeners can be an overhead anyway) is regular reindexing. We chose this approach.

In our system we have 25,000 products (150,000 variants) but only roughly 2,000 active at any time for customers to search as we are a seasonal business.

We are also offering translations for multiple languages, using channels representing different languages.

By writing a custom indexer that avoids the ORM (because it's crazy slow for any kind of batch operation), we can index our full active database for 6 channels (languages) in roughly 10 to 15 seconds.
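The ORM-free approach can be sketched roughly like this (assuming a plain DBAL connection in `$connection` and an Elastica type object in `$type`; the table and column names are illustrative, not our real schema):

```php
use Elastica\Document;

// Fetch flat product rows with plain SQL, skipping ORM hydration entirely,
// then push them to Elasticsearch in one bulk request.
$rows = $connection->fetchAll('SELECT id, name, description FROM sylius_product WHERE enabled = 1');

$documents = [];
foreach ($rows as $row) {
    $documents[] = new Document($row['id'], $row);
}

$type->addDocuments($documents); // bulk indexing, far faster than per-document adds
$type->getIndex()->refresh();
```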

The intention is to run this every 2 or 3 minutes to keep product data relatively 'fresh'.

We use separate Elasticsearch indexes for the different channels/languages to keep the content clean and dedicated. Although this may not be necessary, when you are trying to optimize searching for different languages using Elasticsearch you soon find that different analyzers are much more successful for different locales - having this separation can make a huge difference to the quality of results.
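In FOSElasticaBundle terms, that separation might look roughly like this (the index names and analyzer choices are illustrative only, not our actual configuration):

```yaml
fos_elastica:
    indexes:
        products_en:
            settings:
                index:
                    analysis:
                        analyzer:
                            default:
                                tokenizer: standard
                                filter: [lowercase, en_stemmer]
                        filter:
                            en_stemmer: { type: stemmer, language: english }
        products_fr:
            settings:
                index:
                    analysis:
                        analyzer:
                            default:
                                tokenizer: standard
                                filter: [lowercase, fr_stemmer]
                        filter:
                            fr_stemmer: { type: stemmer, language: french }
```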

We also index our entire product catalogue for admin searching in a simpler way with only our default language. This takes about 10 seconds to index the full 25,000 products.

I don't think our implementation is optimal, and it needs rework in the future; but regardless, indexing is hard. I would be interested to know how others have addressed this and what Sylius is planning here, because it's not trivial.

Querying

It took a lot of work to get good, domain-specific results for Reiss with Elasticsearch.

We tried fuzzy matching, but it has no way of prioritising direct matches over those that needed fuzziness to produce a result.

Eventually we used Edge Ngrams as our primary filter along with a set of fashion-specific synonyms to index content effectively. Here's our configuration so far for analyzers and filters:

                analysis:
                    filter:
                        reiss_synonym_filter:
                            # https://www.elastic.co/guide/en/elasticsearch/reference/1.4/analysis-synonym-tokenfilter.html
                            type: synonym
                            expand: true
                            ignore_case: true
                            synonyms: "%reiss.search.synonyms%"
                        reiss_snowball_filter:
                            # A filter that stems words using a Snowball-generated stemmer.
                            # https://www.elastic.co/guide/en/elasticsearch/reference/1.4/analysis-snowball-tokenfilter.html
                            type: snowball
                            # CONSIDER TRANSLATIONS!
                            language: English
                        reiss_edge_ngram_filter:
                            type: edgeNGram
                            min_gram: 4
                            max_gram: 12
                    analyzer:
                        # Use for names and short descriptions
                        reiss_keyword_analyzer:
                            # NOTE: Synonym MUST come before other filters
                            filter:
                                - reiss_synonym_filter
                                - reiss_edge_ngram_filter
                            tokenizer: lowercase
                        # Use for sentences and paragraphs
                        reiss_text_analyzer:
                            filter:
                                - reiss_synonym_filter
                                - reiss_snowball_filter
                            # CLASSIC ONLY WORKS WELL FOR ENGLISH
                            tokenizer: classic

With this configuration, for string searches we try two separate queries.
The first tries more specific matches using ngrams:

        $qb = new QueryBuilder();

        $query = new Query();
        $query->setQuery(
            $qb->query()->multi_match()
                ->setQuery($searchTerm)
                ->setType('cross_fields')
                ->setAnalyzer('reiss_keyword_analyzer')
                ->setFields([
                    "name^4",
                    "gender^3",
                    "colour^2",
                    "shortDescription^2",
                    "archetype_analyzed^2",
                    "street",
                    "city"
                ])
                ->setTieBreaker(0.3)
        );

If this fails to find a result, we then run a 'fallback' query using fuzziness to see if there were any spelling mistakes that might still return matches:

        $query = new Query();
        $query->setQuery(
            $qb->query()->multi_match()
                ->setQuery($searchTerm)
                ->setType('best_fields')
                ->setFields([
                    "name^4",
                    "gender^3",
                    "colour^2",
                    "shortDescription^2",
                    "archetype_analyzed^2",
                    "street",
                    "city"
                ])
                ->setFuzziness(1)
        );

We also have a set of preanalyzers that run business-logic analysis on the search string before passing it to Elasticsearch, e.g. if the input includes the word 'black', we remove it from the search string and apply Black as a colour filter to the query, rather than trying to search for it across all of the above fields. We also filter for integers, using a different query if someone searches for a product by code.
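A toy version of such a preanalyzer (the function name and the colour list are made up for illustration; the real implementation is more involved):

```php
/**
 * Pulls known colour words out of a free-text search string and turns them
 * into filters, so 'black dress' searches for 'dress' with colour = Black.
 */
function preanalyze($searchString, array $knownColours = ['black', 'white', 'red'])
{
    $filters = [];
    $remaining = [];

    foreach (preg_split('/\s+/', trim($searchString)) as $word) {
        if (in_array(strtolower($word), $knownColours, true)) {
            // Promote the colour word to a structured filter value.
            $filters['colour'][] = ucfirst(strtolower($word));
        } else {
            $remaining[] = $word;
        }
    }

    return ['query' => implode(' ', $remaining), 'filters' => $filters];
}
```

So `preanalyze('black dress')` yields the query string 'dress' plus a colour filter of ['Black'], which then becomes a term filter on the Elasticsearch query instead of a full-text clause.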

With all of this we _almost_ have an excellent quality of results. We still need to do a lot of work around predictive search.

Versions of Elasticsearch

Elasticsearch has had rapid development over the past few years and suffers from poor documentation and a community of drivers and libraries struggling to keep up.

There have been some enormous rewrites of features within Elasticsearch in the last 12 months, but you may not be able to use them if you depend on the FOSElastica -> Elastica -> Elasticsearch chain, as you need to wait for each layer to be updated to support the changes.

In our case we're also dependent on Platform.sh, which only has version 0.4 of Elasticsearch available (as we funded its development)...

Efficiency

Another important point for larger sites is to avoid depending on two storage services. We wrote a custom hydrator (it's horrible and needs to be rewritten!) that takes Elasticsearch results and hydrates them directly into model objects for rendering on the page.

This works great and means our search results and category listing pages are rendered directly from Elasticsearch.

You really want to avoid making a request to Elasticsearch and then going back to a MySQL database to fill in the rest of the data; it's horribly inefficient and defeats the purpose.

I see that the docs and features of FOSElastica are more advanced than they were when we started looking at this, so it seems much easier to ensure this now using serializers, but you then come back to efficiency of indexing and making sure your relational Product database is synchronised with Elasticsearch...

There are some really major challenges here, and my gut feeling is that you might find it's more than you anticipated to deliver really good Elasticsearch support for Sylius 1.0.

All 6 comments

Nice one, it will be useful. I have to work on my facets next week (almost 800 categories).

How can I help with that one?

@peteward Thank you so much for sharing this, it will definitely be helpful. I believe our default implementation will be a bit simpler than REISS's, but I agree there is a lot to figure out.

This sounds like a great plan! Just a little heads-up (basically in line with the warning from @peteward above): it seems like FOSElastica is at a crossroads now, not supporting newer versions of Elastica (and thus Elasticsearch), which would mean you would probably have to use or write something else, or help FOSElastica.
These problems affect the current solution too, unfortunately.

Some news on FOSElastica, 1069: they finally added a new member to the team and will support newer versions :tada:

It turned out that using Grids for that purpose is a good idea for large data sets in the admin, etc., but for frontend search it would require a lot more work to fit it into the genericness of Grids. So we are working on a proper implementation here: https://github.com/Lakion/SyliusElasticSearchBundle.
