_This is a proposal for a future pull request, pending product team and dev team approval._
At the moment, the Magento database becomes unmanageable when you have URL rewrites for products and categories combined. The problem is in the way the URL rewrite table is built up: the URL rewrite generator creates a row for every combination of product and category. As a result, a store with only 3,000 products and 200 categories can end up with a 600,000-record URL rewrite table. This discourages most Magento merchants from creating an SEO-valuable URL structure, because URL rewrite lookups take far more time than they should.
Use a combination algorithm at runtime instead of generating every possible URL combination. I already have a PoC with one of my existing customers on 1.x; I plan to port it to Magento 2.0 if the core team is interested in such a performance-improvement contribution.
How does the combination algorithm work? Let's consider the following URL:
path1/path2/path3/path4
Then we can chain multiple processors for URL matching. In the default implementation there will be two:
1. Simply looks up the full URL rewrite; if it does not return a match, the next processor is used.
2. Splits the given URL into two parts: path1/path2/path3 and path4. If there is a URL rewrite for the first part and it carries a flag marking it as combinable with another record, the processor then looks up the URL rewrite target for path4. Using the specific merge algorithm assigned to it (category + product, or more combinations in the future), it creates a virtual rewrite record that forwards the request to the needed target path.
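The two default processors above can be sketched as follows. This is a minimal illustration in Python, not the actual PHP implementation; the record layout, the `combinable` flag, and the category/product merge rule are assumptions made for the sketch.

```python
# Sketch of the chained URL-match processors: processor 1 tries a full match,
# processor 2 splits at the last "/" and combines two rewrite records.

def full_match(rewrites, request_path):
    """Processor 1: look up the full request path directly."""
    return rewrites.get(request_path)

def combination_match(rewrites, request_path):
    """Processor 2: split into prefix + last segment and combine the two records."""
    if "/" not in request_path:
        return None
    prefix, _, suffix = request_path.rpartition("/")
    left = rewrites.get(prefix)
    right = rewrites.get(suffix)
    # Only combine when the prefix rewrite is flagged as combinable.
    if left and right and left.get("combinable"):
        # Virtual rewrite record; the merge rule here (product target plus a
        # category parameter) is an illustrative category + product merge.
        return {"target_path": f"{right['target_path']}/category/{left['category_id']}"}
    return None

def resolve(rewrites, request_path):
    """Run the processor chain; the first match wins."""
    for processor in (full_match, combination_match):
        match = processor(rewrites, request_path)
        if match:
            return match
    return None

rewrites = {
    "shoes/sneakers": {"target_path": "catalog/category/view/id/15",
                       "combinable": True, "category_id": 15},
    "air-max": {"target_path": "catalog/product/view/id/42"},
}
print(resolve(rewrites, "shoes/sneakers/air-max")["target_path"])
# catalog/product/view/id/42/category/15
```

Note that an explicit full rewrite always wins, because processor 1 runs before the combination step.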
This algorithm allows specifying the canonical category even at runtime, and you don't need to create a separate index for it.
Because path1/path2/path3 and path4 must each be unique in the main table, combining them does not cause any issues, except when an explicit path1/path2/path3/path4 rewrite already exists.
It will use the same tables that exist at the moment.
Because the Magento URL generator strips all characters outside the Latin alphabet, it is impossible to use the full range of characters allowed by the URI specification (RFC 3986).
By lifting this restriction in the URL generator algorithm we can gain much broader SEO possibilities, including support for different writing systems (Cyrillic, Chinese, Arabic, etc.). You can easily use UTF-8 characters in a URL path as long as you escape them with rawurlencode when building the rewritten URL and convert them back with rawurldecode when processing the request. SEO experts will love that, as it gives better search result rankings for non-Latin alphabets.
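The encode-on-build, decode-on-match round trip can be sketched like this; Python's `urllib.parse.quote`/`unquote` stand in here for PHP's `rawurlencode`/`rawurldecode`, and the helper names are made up for the sketch.

```python
# Round-trip sketch of the proposed UTF-8 URL handling.
from urllib.parse import quote, unquote

def build_request_path(segments):
    """Percent-encode each UTF-8 segment when building the rewritten URL."""
    return "/".join(quote(s, safe="") for s in segments)

def decode_request_path(request_path):
    """Decode the incoming percent-encoded path back to UTF-8 for lookup."""
    return [unquote(s) for s in request_path.split("/")]

# Cyrillic category/product slugs survive the round trip unchanged.
path = build_request_path(["обувь", "кроссовки"])
print(path)                       # %D0%BE%D0%B1%D1%83%D0%B2%D1%8C/%D0%BA%D1%80%D0%BE%D1%81%D1%81%D0%BE%D0%B2%D0%BA%D0%B8
print(decode_request_path(path))  # ['обувь', 'кроссовки']
```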
This approach even makes it possible to generate URL rewrite slugs in a much simpler and faster manner, as you can simply replace all the defined prohibited characters with "-" (dash). That allows moving URL generation logic down to the pure database level.
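A minimal sketch of that simplified slug rule; the exact set of prohibited characters is an assumption here, the point is only that a single replace-with-dash pass is enough, so it can equally be expressed as a database-level REPLACE/REGEXP operation.

```python
import re

# Anything that is not a letter, a digit, or a non-ASCII character is treated
# as prohibited for this sketch, and each run of such characters becomes one dash.
PROHIBITED = re.compile(r"[^0-9A-Za-z\u00C0-\uFFFF]+")

def slugify(label):
    """Replace every run of prohibited characters with a single dash."""
    return PROHIBITED.sub("-", label).strip("-").lower()

print(slugify("Nike Air Max 90!"))  # nike-air-max-90
print(slugify("Спорт & отдых"))     # спорт-отдых
```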
There are no changes to already-translated URLs; reindexing might produce slightly different results, but then the redirect rewrite history will create automatic redirects.
Because request path matching is a string comparison, a huge database makes it much slower to look up the required record.
There is a simple solution that can dramatically improve performance on such datasets. CRC32 is a 32-bit integer checksum (perfect for an INT column); MySQL has a built-in function for calculating it, and PHP has one as well. It is not widely used because it has a much higher collision probability than other checksum algorithms, but it is a perfect fit for a fast candidate lookup: it can produce about 4.3 billion (2^32) distinct checksums. Collisions in a 100,000-record table are therefore rare; in the worst case a lookup returns a handful of colliding rows, and you can add a single-char field to narrow the candidates further to URLs starting with the same letter. Then on the PHP side you simply walk over the returned rows and pick the exact match. This option allows managing millions of URL rewrite entries with reasonable lookup time.
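The lookup scheme above can be sketched as follows. The in-memory dict stands in for an indexed INT `crc` column (the assumed column name) queried with `WHERE crc = CRC32(?)`; collision resolution happens in application code.

```python
import zlib

# Sketch of the CRC32-indexed lookup: store an integer checksum next to each
# request path, query by checksum, then resolve any collisions exactly in code.
class RewriteIndex:
    def __init__(self):
        self.by_crc = {}  # crc -> list of (request_path, target_path)

    def add(self, request_path, target_path):
        crc = zlib.crc32(request_path.encode("utf-8"))
        self.by_crc.setdefault(crc, []).append((request_path, target_path))

    def find(self, request_path):
        crc = zlib.crc32(request_path.encode("utf-8"))
        # The "WHERE crc = ?" step returns at most a handful of candidate rows;
        # walk them and keep only the exact string match.
        for path, target in self.by_crc.get(crc, []):
            if path == request_path:
                return target
        return None

index = RewriteIndex()
index.add("shoes/sneakers/air-max", "catalog/product/view/id/42")
print(index.find("shoes/sneakers/air-max"))  # catalog/product/view/id/42
```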
It won't change the existing table structure: a new table will be introduced, and existing records will be inserted by a simple SQL query during migration. New records will be handled by the URL indexer itself.
This sounds amazing and would tackle the core_url_rewrite problem. Not to mention open up possibilities to be more dynamic in your filtering options and category naming/structure.
+1
:+1:
that sounds awesome and would help with performance.
+1
+1
+1
:+1:
+1
+1
:+1:
Isn't (1) basically what Magento 1 EE already does?
@schmengler Can you please elaborate on this? I don't understand how this can be the same as in EE 1.x. You don't have combinable URL rewrites in any of the core versions at the moment.
What about Enterprise_Catalog_Model_Urlrewrite_Matcher_Product? It splits the request path at the last "/" and looks up the parts separately. AFAIK there are no combined values stored in the database.
@schmengler Sorry, my fault that I didn't check whether EE has something similar in the 1.x branch. It seems there are some nice improvements in it. I just checked, and true, there is a somewhat similar approach, but not exactly what I am talking about here.
My proposal is to create chained URL rewrite processors with priorities that are decoupled from each other. The only thing that looks similar to EE 1.x is the part related to combining request paths, but it is still a different approach than the one described here.
@IvanChepurnyi Hi Ivan, thank you for all proposals regarding URL rewrites.
Both the "Internationalization Problem" and the "Large Varchar Index Problem" definitely make sense. Both will be subjects of internal discussion and validation.
Regarding the "Scalability Problem" proposal, this one was part of EE 1.13.x. Currently we have a few solutions under consideration for solving this.
Ivan, I will keep you informed of the progress on this matter.
Best,
Anton.
@antboiko
Thanks for the official answer. There is something very important to consider when you read about the scalability problem. Maybe I was not clear enough in the description, but the combination processor is only a tiny part of the proposed solution. The whole scalability solution goes far beyond category + product URL lookups. At the moment (even with EE 1.13.x) I cannot create a request path match processor that gives me full control over how request_path is split up. I cannot have something like
path1/path2/path3/path4, where path1 is a category and path2, path3, path4 are applied filter URL slugs. At the moment, to make that possible, I have to wrap the whole Magento\UrlRewrite\Model\UrlFinderInterface implementation, but that should not be necessary. That is why I propose to create a layer on top of the storage that allows chaining of processors while keeping the same storage models as now. Doing it the proposed way makes it simple for third-party vendors to add their own URL processors without substituting the whole storage logic.
Adding this layer would not break backward compatibility, as every chained processor will still use Magento\UrlRewrite\Model\UrlFinderInterface.
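To illustrate what such a third-party processor in the chain could look like, here is a sketch of a matcher that treats the first segment as a category and the remaining segments as filter slugs. The category and filter-slug tables, names, and target-path format are all assumptions for the sketch, not Magento APIs.

```python
# Hypothetical third-party processor: path1 is a category, the remaining
# segments are layered-navigation filter slugs mapped to attribute filters.
CATEGORIES = {"shoes": 15}
FILTER_SLUGS = {"red": ("color", "5"), "nike": ("brand", "9")}

def filter_match(request_path):
    first, _, rest = request_path.partition("/")
    if first not in CATEGORIES or not rest:
        return None
    params = []
    for slug in rest.split("/"):
        if slug not in FILTER_SLUGS:
            return None  # unknown slug: let the next processor in the chain try
        attribute, value = FILTER_SLUGS[slug]
        params.append(f"{attribute}={value}")
    return f"catalog/category/view/id/{CATEGORIES[first]}?" + "&".join(params)

print(filter_match("shoes/red/nike"))  # catalog/category/view/id/15?color=5&brand=9
```

Because the processor returns `None` for anything it does not recognise, it slots into the same fall-through chain as the default full-match and combination processors.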
@IvanChepurnyi
Hello Ivan,
Following up on our discussion, let me inform you that your proposals were approved. I would like to summarize what we agreed on regarding the future PR:
1. Scalability Problem. This proposal makes sense, as it will decrease the number of records in the DB. However, it may cause collisions. Please make sure the solution covers this case as well and prevents the occurrence of collided URLs.
2. Internationalization Problem. This will definitely be a good option. Currently we have a transliteration mechanism for URL keys specified using non-ASCII characters. It would be great to have the option you propose as one that the user can turn on and use instead of the transliteration mechanism.
3. Large Varchar Index Problem. Definitely makes sense. Can be implemented in the way you described it.
Thank you for your initiative aimed at improving and simplifying the URL Rewrites mechanism.
We look forward to your contribution.
Best regards,
Anton.
Thank you for your submission.
We recently made some changes to the way we process GitHub submissions to more quickly identify and respond to core code issues.
Feature Requests and Improvements should now be submitted to the new Magento 2 Feature Requests and Improvements forum (see details here).
We are closing this GitHub ticket and have moved your request to the new forum.
Commenting here as there does not seem to be any activity in https://community.magento.com/t5/Magento-2-Feature-Requests-and/Architecture-Proposal-Simplify-URL-Rewrites-and-make-them-better/idi-p/48593.
I don't know whether matching was ever a bottleneck; however, I prefer precise solutions much more than approximate ones. With a data structure similar to a trie (https://en.wikipedia.org/wiki/Trie), lookups will take almost no time as well.
The main problem is that we need a data structure that allows fast detection of collisions during editing, for example when a category is moved to another place in the hierarchy. I believe a slightly more sophisticated version of a trie can handle this as well.
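A minimal version of that idea, sketched at the path-segment level: each trie node maps one segment to a child, and an editing-time collision surfaces as an attempted overwrite of an already-occupied terminal node. This is an illustration of the commenter's suggestion, not an actual Magento structure.

```python
# Segment-level trie: exact matches in O(number of segments), and collision
# detection at insert time when two different rewrites claim the same path.
class Trie:
    def __init__(self):
        self.children = {}
        self.target = None

    def insert(self, request_path, target):
        node = self
        for segment in request_path.split("/"):
            node = node.children.setdefault(segment, Trie())
        if node.target is not None and node.target != target:
            raise ValueError(f"collision: {request_path!r} is already mapped")
        node.target = target

    def find(self, request_path):
        node = self
        for segment in request_path.split("/"):
            node = node.children.get(segment)
            if node is None:
                return None
        return node.target

trie = Trie()
trie.insert("shoes/sneakers", "catalog/category/view/id/15")
trie.insert("shoes/sneakers/air-max", "catalog/product/view/id/42")
print(trie.find("shoes/sneakers/air-max"))  # catalog/product/view/id/42
```

Moving a category would mean detaching its subtree and reinserting it under the new parent, which is where the insert-time collision check pays off.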