In our use case, we mainly use position() function to determine whether a string starts with a pattern.
However, position() function is a little inefficient to determine it.
As far as I investigate, regardless of the length of 1st string argument, starts_with function is 20~30% faster than position function in our use case.
link: https://gist.github.com/yuokada/2c2f0c396be4f15a344802bf2a013bbd
And also, compared with the query plan using postion function and one using starts_with function,
There is the difference in filterPredicate section. I assume that we can save the resource of casting to BIGINT and comparison by using starts_with function compared to the position function.
- filterPredicate = ("@strpos|bigint|varchar(8)|varchar(4)@strpos(varchar(x),varchar(y)):bigint"("field", '/v1/') = BIGINT '1')
+ filterPredicate = "@starts_with|boolean|varchar(8)|varchar(4)@starts_with(varchar(x),varchar(y)):boolean"("field", '/v1/')
presto> EXPLAIN select * FROM (VALUES ('/v1/info'), ('/v1/node'), ('/v2/node')) AS t(path) WHERE position('/v1/' in path) = 1;
Query Plan
------------------------------------------------------------------------------------------------------------------------------------------
Output[path]
โ Layout: [field:varchar(8)]
โ Estimates: {rows: ? (?), cpu: 330, memory: 0B, network: 0B}
โ path := field
โโ Filter[filterPredicate = ("@strpos|bigint|varchar(8)|varchar(4)@strpos(varchar(x),varchar(y)):bigint"("field", '/v1/') = BIGINT '1')]
โ Layout: [field:varchar(8)]
โ Estimates: {rows: ? (?), cpu: 330, memory: 0B, network: 0B}
โโ LocalExchange[ROUND_ROBIN] ()
โ Layout: [field:varchar(8)]
โ Estimates: {rows: 3 (165B), cpu: 165, memory: 0B, network: 0B}
โโ Values
Layout: [field:varchar(8)]
Estimates: {rows: 3 (165B), cpu: 0, memory: 0B, network: 0B}
('/v1/info')
('/v1/node')
('/v2/node')
(1 row)
presto> EXPLAIN select * FROM (VALUES ('/v1/info'), ('/v1/node'), ('/v2/node')) AS t(path) WHERE starts_with(path, '/v1/');
Query Plan
---------------------------------------------------------------------------------------------------------------------------------------
Output[path]
โ Layout: [field:varchar(8)]
โ Estimates: {rows: ? (?), cpu: 330, memory: 0B, network: 0B}
โ path := field
โโ Filter[filterPredicate = "@starts_with|boolean|varchar(8)|varchar(4)@starts_with(varchar(x),varchar(y)):boolean"("field", '/v1/')]
โ Layout: [field:varchar(8)]
โ Estimates: {rows: ? (?), cpu: 330, memory: 0B, network: 0B}
โโ LocalExchange[ROUND_ROBIN] ()
โ Layout: [field:varchar(8)]
โ Estimates: {rows: 3 (165B), cpu: 165, memory: 0B, network: 0B}
โโ Values
Layout: [field:varchar(8)]
Estimates: {rows: 3 (165B), cpu: 0, memory: 0B, network: 0B}
('/v1/info')
('/v1/node')
('/v2/node')
(1 row)
A dedicated function for this shouldn't be necessary. You can achieve this easily with LIKE:
SELECT *
FROM (
VALUES
('/v1/info'),
('/v1/node'),
('/v2/node')) AS t(path)
WHERE path LIKE '/v1/%'
Is the string you are searching for a constant?
Is the string you are searching for a constant?
As far as I investigate, Almost all of my queries use a constant for 2nd argument.
So I assume I can use LIKE operator instead of starts_with function.
@nurse Just in case, could you explain the detail of our use case? I might not understand our use case and your intent fully.
It could search for a constant of a variable in real. We mainly focused on the performance as position() = 1 and like 'str%' were slower to find out whether an input starts with given string on billions of records.
like 'str%' were slower to find out whether an input starts with given string on billions of records.
That's something that could be optimized inside Presto. If it finds the user is checking for that pattern, it could swap the implementation for a more optimized approach.
The use case is... we develops a marketing tool. Marketers want to create a segment in which there're people who visited our web pages under /food/ for example. (Marketers will do some actions to those people) Since it queries about tens of million records, we want to improve performance.
It doesn't matter the performance improvement is achieved by dedicated function or certain optimizations like position(str) = 1 or str LIKE 'prefix%' if it's actually implemented.
Thanks for describing your use case. In that case, I would suggest that:
starts_with function as proposedxxx% and rewrites them to a call to starts_with. This can be done by updating the DesugarLikeRewriter optimizer.These two improvements can be done separately.
Most helpful comment
Thanks for describing your use case. In that case, I would suggest that:
starts_withfunction as proposedxxx%and rewrites them to a call tostarts_with. This can be done by updating the DesugarLikeRewriter optimizer.These two improvements can be done separately.