_This is a proposal for a future pull request, pending product team and dev team approval._
At the moment, the Magento database becomes unmanageable when you have URL rewrites for products and categories combined. The problem is in the way the URL rewrite table is built up: the URL rewrite generator creates a row for every combination of product and category. As a result, a store with only 3,000 products and 200 categories can end up with a 600,000-record URL rewrite table. This discourages most Magento merchants from creating an SEO-valuable URL structure, because URL rewrite lookups take far more time than they should.
Use a combination algorithm at runtime instead of generating every possible URL combination. I already have a PoC with one of my existing customers on 1.x; I plan to port it to Magento 2.0 if the core team is interested in such a performance-improvement contribution.
How does the combination algorithm work? Let's consider the following URL:
path1/path2/path3/path4
Then we can chain multiple processors for URL matching. In the default implementation there will be two:
1. Simply looks up the full URL rewrite; if it does not return a match, the next processor is used.
2. Splits the given URL into two parts: path1/path2/path3 and path4. If there is a URL rewrite for the first part and it carries a flag marking it as combinable with another record, the processor then looks up the URL rewrite target for path4. Using the specific merge algorithm assigned to it (category + product, or more combinations in the future), it creates a virtual rewrite record that forwards the request to the needed target path.
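The two default processors above can be sketched as follows. This is a minimal illustration in Python, not the actual PHP implementation; the record layout, the `combinable` flag, and the category/product merge rule are assumptions made for the sketch.

```python
# Sketch of the chained URL-match processors: processor 1 tries a full match,
# processor 2 splits at the last "/" and combines two rewrite records.

def full_match(rewrites, request_path):
    """Processor 1: look up the full request path directly."""
    return rewrites.get(request_path)

def combination_match(rewrites, request_path):
    """Processor 2: split into prefix + last segment and combine the two records."""
    if "/" not in request_path:
        return None
    prefix, _, suffix = request_path.rpartition("/")
    left = rewrites.get(prefix)
    right = rewrites.get(suffix)
    # Only combine when the prefix rewrite is flagged as combinable.
    if left and right and left.get("combinable"):
        # Virtual rewrite record; the merge rule here (product target plus a
        # category parameter) is an illustrative category + product merge.
        return {"target_path": f"{right['target_path']}/category/{left['category_id']}"}
    return None

def resolve(rewrites, request_path):
    """Run the processor chain; the first match wins."""
    for processor in (full_match, combination_match):
        match = processor(rewrites, request_path)
        if match:
            return match
    return None

rewrites = {
    "shoes/sneakers": {"target_path": "catalog/category/view/id/15",
                       "combinable": True, "category_id": 15},
    "air-max": {"target_path": "catalog/product/view/id/42"},
}
print(resolve(rewrites, "shoes/sneakers/air-max")["target_path"])
# catalog/product/view/id/42/category/15
```

Note that an explicit full rewrite always wins, because processor 1 runs before the combination step.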
This algorithm allows specifying the canonical category even at runtime, and you don't need to create a separate index for it.
Because path1/path2/path3 and path4 must each be unique in the main table, combining them does not cause any issues, except when an explicit path1/path2/path3/path4 rewrite already exists.
It will use the same tables that exist at the moment.
Because the Magento URL generator strips all characters outside the Latin alphabet, it is impossible to use the full range of characters allowed by the URI specification (RFC 3986).
By lifting this restriction in the URL generator algorithm we can gain much broader SEO possibilities, including support for different writing systems (Cyrillic, Chinese, Arabic, etc.). You can easily use UTF-8 characters in a URL path as long as you escape them with rawurlencode when building the rewritten URL and convert them back with rawurldecode when processing the request. SEO experts will love that, as it gives better search result rankings for non-Latin alphabets.
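The encode-on-build, decode-on-match round trip can be sketched like this; Python's `urllib.parse.quote`/`unquote` stand in here for PHP's `rawurlencode`/`rawurldecode`, and the helper names are made up for the sketch.

```python
# Round-trip sketch of the proposed UTF-8 URL handling.
from urllib.parse import quote, unquote

def build_request_path(segments):
    """Percent-encode each UTF-8 segment when building the rewritten URL."""
    return "/".join(quote(s, safe="") for s in segments)

def decode_request_path(request_path):
    """Decode the incoming percent-encoded path back to UTF-8 for lookup."""
    return [unquote(s) for s in request_path.split("/")]

# Cyrillic category/product slugs survive the round trip unchanged.
path = build_request_path(["обувь", "кроссовки"])
print(path)                       # %D0%BE%D0%B1%D1%83%D0%B2%D1%8C/%D0%BA%D1%80%D0%BE%D1%81%D1%81%D0%BE%D0%B2%D0%BA%D0%B8
print(decode_request_path(path))  # ['обувь', 'кроссовки']
```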
This approach even makes it possible to generate URL rewrite slugs in a much simpler and faster manner, as you can simply replace all the defined prohibited characters with "-" (dash). That allows moving URL generation logic down to the pure database level.
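A minimal sketch of that simplified slug rule; the exact set of prohibited characters is an assumption here, the point is only that a single replace-with-dash pass is enough, so it can equally be expressed as a database-level REPLACE/REGEXP operation.

```python
import re

# Anything that is not a letter, a digit, or a non-ASCII character is treated
# as prohibited for this sketch, and each run of such characters becomes one dash.
PROHIBITED = re.compile(r"[^0-9A-Za-z\u00C0-\uFFFF]+")

def slugify(label):
    """Replace every run of prohibited characters with a single dash."""
    return PROHIBITED.sub("-", label).strip("-").lower()

print(slugify("Nike Air Max 90!"))  # nike-air-max-90
print(slugify("Спорт & отдых"))     # спорт-отдых
```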
There are no changes to already-translated URLs; reindexing might produce slightly different results, but then the redirect rewrite history will create automatic redirects.
Because request path matching is a string comparison, a huge database makes it much slower to look up the required record.
There is a simple solution that can dramatically improve performance on such datasets. CRC32 is a 32-bit integer checksum (perfect for an INT column); MySQL has a built-in function for calculating it, and PHP has one as well. It is not widely used because it has a much higher collision probability than other checksum algorithms, but it is a perfect fit for a fast candidate lookup: it can produce about 4.3 billion (2^32) distinct checksums. Collisions in a 100,000-record table are therefore rare; in the worst case a lookup returns a handful of colliding rows, and you can add a single-char field to narrow the candidates further to URLs starting with the same letter. Then on the PHP side you simply walk over the returned rows and pick the exact match. This option allows managing millions of URL rewrite entries with reasonable lookup time.
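The lookup scheme above can be sketched as follows. The in-memory dict stands in for an indexed INT `crc` column (the assumed column name) queried with `WHERE crc = CRC32(?)`; collision resolution happens in application code.

```python
import zlib

# Sketch of the CRC32-indexed lookup: store an integer checksum next to each
# request path, query by checksum, then resolve any collisions exactly in code.
class RewriteIndex:
    def __init__(self):
        self.by_crc = {}  # crc -> list of (request_path, target_path)

    def add(self, request_path, target_path):
        crc = zlib.crc32(request_path.encode("utf-8"))
        self.by_crc.setdefault(crc, []).append((request_path, target_path))

    def find(self, request_path):
        crc = zlib.crc32(request_path.encode("utf-8"))
        # The "WHERE crc = ?" step returns at most a handful of candidate rows;
        # walk them and keep only the exact string match.
        for path, target in self.by_crc.get(crc, []):
            if path == request_path:
                return target
        return None

index = RewriteIndex()
index.add("shoes/sneakers/air-max", "catalog/product/view/id/42")
print(index.find("shoes/sneakers/air-max"))  # catalog/product/view/id/42
```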
It won't change the existing table structure: a new table will be introduced, and existing records will be inserted by a simple SQL query during migration. New records will be handled by the URL indexer itself.
This sounds amazing and would tackle the core_url_rewrite problem. Not to mention open up possibilities to be more dynamic in your filtering options and category naming/structure.
+1
:+1:
that sounds awesome and would help with performance.
+1
+1
+1
:+1:
+1
+1
:+1:
Isn't (1) basically what Magento 1 EE already does?
@schmengler Can you please elaborate on this? I don't understand how this can be the same as in EE 1.x. You don't have combinable URL rewrites in any of the core versions at the moment.
What about Enterprise_Catalog_Model_Urlrewrite_Matcher_Product? It splits the request path at the last "/" and looks up the parts separately. AFAIK there are no combined values stored in the database.
@schmengler Sorry, my fault that I didn't check whether EE has something similar in the 1.x branch. It seems there are some nice improvements in it. I just checked, and true, there is a somewhat similar approach, but not exactly what I am talking about here.
My proposal is to create chained URL rewrite processors with priorities that are decoupled from each other. The only thing that looks similar to EE 1.x is the part related to combining request paths, but it is still a different approach than the one described here.
@IvanChepurnyi Hi Ivan, thank you for all proposals regarding URL rewrites.
Both the "Internationalization Problem" and the "Large Varchar Index Problem" definitely make sense. Both will be subjects of internal discussion and validation.
Regarding the "Scalability Problem" proposal, this one was part of EE 1.13.x. Currently we have a few solutions under consideration for solving this.
Ivan, I will keep you informed of the progress on this matter.
Best,
Anton.
@antboiko
Thanks for the official answer. There is something very important to consider when you read about the scalability problem. Maybe I was not clear enough in the description, but the combination processor is only a tiny part of the proposed solution. The whole scalability solution goes far beyond category + product URL lookups. At the moment (even with EE 1.13.x) I cannot create a request path match processor that gives me full control over how request_path is split up. I cannot have something like
path1/path2/path3/path4, where path1 is a category and path2, path3, path4 are applied filter URL slugs. At the moment, to make that possible, I have to wrap the whole Magento\UrlRewrite\Model\UrlFinderInterface implementation, but that should not be necessary. That is why I propose to create a layer on top of the storage that allows chaining of processors while keeping the same storage models as now. Doing it the proposed way makes it simple for third-party vendors to add their own URL processors without substituting the whole storage logic.
Adding this layer would not break backward compatibility, as every chained processor will still use Magento\UrlRewrite\Model\UrlFinderInterface.
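To illustrate what such a third-party processor in the chain could look like, here is a sketch of a matcher that treats the first segment as a category and the remaining segments as filter slugs. The category and filter-slug tables, names, and target-path format are all assumptions for the sketch, not Magento APIs.

```python
# Hypothetical third-party processor: path1 is a category, the remaining
# segments are layered-navigation filter slugs mapped to attribute filters.
CATEGORIES = {"shoes": 15}
FILTER_SLUGS = {"red": ("color", "5"), "nike": ("brand", "9")}

def filter_match(request_path):
    first, _, rest = request_path.partition("/")
    if first not in CATEGORIES or not rest:
        return None
    params = []
    for slug in rest.split("/"):
        if slug not in FILTER_SLUGS:
            return None  # unknown slug: let the next processor in the chain try
        attribute, value = FILTER_SLUGS[slug]
        params.append(f"{attribute}={value}")
    return f"catalog/category/view/id/{CATEGORIES[first]}?" + "&".join(params)

print(filter_match("shoes/red/nike"))  # catalog/category/view/id/15?color=5&brand=9
```

Because the processor returns `None` for anything it does not recognise, it slots into the same fall-through chain as the default full-match and combination processors.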
@IvanChepurnyi
Hello Ivan,
Following up on our discussion, let me inform you that your proposals were approved. I would like to summarize what we agreed on regarding the future PR:
1. Scalability Problem. This proposal makes sense, as it will decrease the number of records in the DB. However, it may cause collisions. Please make sure the solution covers this case as well and prevents the occurrence of collided URLs.
2. Internationalization Problem. This will definitely be a good option. Currently we have a transliteration mechanism for URL keys specified using non-ASCII characters. It would be great to have the option you propose as one that the user can turn on and use instead of the transliteration mechanism.
3. Large Varchar Index Problem. Definitely makes sense. Can be implemented in the way you described it.
Thank you for your initiative aimed at improving and simplifying the URL Rewrites mechanism.
We look forward to your contribution.
Best regards,
Anton.
Thank you for your submission.
We recently made some changes to the way we process GitHub submissions to more quickly identify and respond to core code issues.
Feature Requests and Improvements should now be submitted to the new Magento 2 Feature Requests and Improvements forum (see details here).
We are closing this GitHub ticket and have moved your request to the new forum.
Commenting here as there does not seem to be any activity in https://community.magento.com/t5/Magento-2-Feature-Requests-and/Architecture-Proposal-Simplify-URL-Rewrites-and-make-them-better/idi-p/48593.
I don't know whether matching was ever a bottleneck; however, I prefer precise solutions much more than approximate ones. With a data structure similar to a trie (https://en.wikipedia.org/wiki/Trie), lookups will take almost no time as well.
The main problem is that we need a data structure that allows fast detection of collisions during editing, for example when a category is moved to another place in the hierarchy. I believe a slightly more sophisticated version of a trie can handle this as well.
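A minimal version of that idea, sketched at the path-segment level: each trie node maps one segment to a child, and an editing-time collision surfaces as an attempted overwrite of an already-occupied terminal node. This is an illustration of the commenter's suggestion, not an actual Magento structure.

```python
# Segment-level trie: exact matches in O(number of segments), and collision
# detection at insert time when two different rewrites claim the same path.
class Trie:
    def __init__(self):
        self.children = {}
        self.target = None

    def insert(self, request_path, target):
        node = self
        for segment in request_path.split("/"):
            node = node.children.setdefault(segment, Trie())
        if node.target is not None and node.target != target:
            raise ValueError(f"collision: {request_path!r} is already mapped")
        node.target = target

    def find(self, request_path):
        node = self
        for segment in request_path.split("/"):
            node = node.children.get(segment)
            if node is None:
                return None
        return node.target

trie = Trie()
trie.insert("shoes/sneakers", "catalog/category/view/id/15")
trie.insert("shoes/sneakers/air-max", "catalog/product/view/id/42")
print(trie.find("shoes/sneakers/air-max"))  # catalog/product/view/id/42
```

Moving a category would mean detaching its subtree and reinserting it under the new parent, which is where the insert-time collision check pays off.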