Presto: Add the starts_with function to determine whether a string starts with a pattern

Created on 9 Apr 2020  ยท  7Comments  ยท  Source: prestosql/presto

In our use case, we mainly use position() function to determine whether a string starts with a pattern.

However, position() function is a little inefficient to determine it.

As far as I investigate, regardless of the length of 1st string argument, starts_with function is 20~30% faster than position function in our use case.

link: https://gist.github.com/yuokada/2c2f0c396be4f15a344802bf2a013bbd

And also, compared with the query plan using postion function and one using starts_with function,
There is the difference in filterPredicate section. I assume that we can save the resource of casting to BIGINT and comparison by using starts_with function compared to the position function.

- filterPredicate = ("@strpos|bigint|varchar(8)|varchar(4)@strpos(varchar(x),varchar(y)):bigint"("field", '/v1/') = BIGINT '1')
+ filterPredicate = "@starts_with|boolean|varchar(8)|varchar(4)@starts_with(varchar(x),varchar(y)):boolean"("field", '/v1/')

Query plan using position function

presto> EXPLAIN select * FROM (VALUES ('/v1/info'), ('/v1/node'), ('/v2/node')) AS t(path) WHERE position('/v1/' in path) = 1;
                                                                Query Plan                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Output[path]                                                                                                                             
 โ”‚   Layout: [field:varchar(8)]                                                                                                           
 โ”‚   Estimates: {rows: ? (?), cpu: 330, memory: 0B, network: 0B}                                                                          
 โ”‚   path := field                                                                                                                        
 โ””โ”€ Filter[filterPredicate = ("@strpos|bigint|varchar(8)|varchar(4)@strpos(varchar(x),varchar(y)):bigint"("field", '/v1/') = BIGINT '1')] 
    โ”‚   Layout: [field:varchar(8)]                                                                                                        
    โ”‚   Estimates: {rows: ? (?), cpu: 330, memory: 0B, network: 0B}                                                                       
    โ””โ”€ LocalExchange[ROUND_ROBIN] ()                                                                                                      
       โ”‚   Layout: [field:varchar(8)]                                                                                                     
       โ”‚   Estimates: {rows: 3 (165B), cpu: 165, memory: 0B, network: 0B}                                                                 
       โ””โ”€ Values                                                                                                                          
              Layout: [field:varchar(8)]                                                                                                  
              Estimates: {rows: 3 (165B), cpu: 0, memory: 0B, network: 0B}                                                                
              ('/v1/info')                                                                                                                
              ('/v1/node')                                                                                                                
              ('/v2/node')                                                                                                                

(1 row)

Query plan using starts_with function

presto> EXPLAIN select * FROM (VALUES ('/v1/info'), ('/v1/node'), ('/v2/node')) AS t(path) WHERE starts_with(path, '/v1/');
                                                              Query Plan                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Output[path]                                                                                                                          
 โ”‚   Layout: [field:varchar(8)]                                                                                                        
 โ”‚   Estimates: {rows: ? (?), cpu: 330, memory: 0B, network: 0B}                                                                       
 โ”‚   path := field                                                                                                                     
 โ””โ”€ Filter[filterPredicate = "@starts_with|boolean|varchar(8)|varchar(4)@starts_with(varchar(x),varchar(y)):boolean"("field", '/v1/')] 
    โ”‚   Layout: [field:varchar(8)]                                                                                                     
    โ”‚   Estimates: {rows: ? (?), cpu: 330, memory: 0B, network: 0B}                                                                    
    โ””โ”€ LocalExchange[ROUND_ROBIN] ()                                                                                                   
       โ”‚   Layout: [field:varchar(8)]                                                                                                  
       โ”‚   Estimates: {rows: 3 (165B), cpu: 165, memory: 0B, network: 0B}                                                              
       โ””โ”€ Values                                                                                                                       
              Layout: [field:varchar(8)]                                                                                               
              Estimates: {rows: 3 (165B), cpu: 0, memory: 0B, network: 0B}                                                             
              ('/v1/info')                                                                                                             
              ('/v1/node')                                                                                                             
              ('/v2/node')                                                                                                             

(1 row)
enhancement

Most helpful comment

Thanks for describing your use case. In that case, I would suggest that:

  • we add the starts_with function as proposed
  • we add an optimization that detects patterns of the form xxx% and rewrites them to a call to starts_with. This can be done by updating the DesugarLikeRewriter optimizer.

These two improvements can be done separately.

All 7 comments

A dedicated function for this shouldn't be necessary. You can achieve this easily with LIKE:

SELECT * 
FROM (
    VALUES 
        ('/v1/info'), 
        ('/v1/node'), 
        ('/v2/node')) AS t(path) 
WHERE path LIKE '/v1/%'

Is the string you are searching for a constant?

Is the string you are searching for a constant?

As far as I investigate, Almost all of my queries use a constant for 2nd argument.
So I assume I can use LIKE operator instead of starts_with function.

@nurse Just in case, could you explain the detail of our use case? I might not understand our use case and your intent fully.

It could search for a constant of a variable in real. We mainly focused on the performance as position() = 1 and like 'str%' were slower to find out whether an input starts with given string on billions of records.

like 'str%' were slower to find out whether an input starts with given string on billions of records.

That's something that could be optimized inside Presto. If it finds the user is checking for that pattern, it could swap the implementation for a more optimized approach.

The use case is... we develops a marketing tool. Marketers want to create a segment in which there're people who visited our web pages under /food/ for example. (Marketers will do some actions to those people) Since it queries about tens of million records, we want to improve performance.

It doesn't matter the performance improvement is achieved by dedicated function or certain optimizations like position(str) = 1 or str LIKE 'prefix%' if it's actually implemented.

Thanks for describing your use case. In that case, I would suggest that:

  • we add the starts_with function as proposed
  • we add an optimization that detects patterns of the form xxx% and rewrites them to a call to starts_with. This can be done by updating the DesugarLikeRewriter optimizer.

These two improvements can be done separately.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

BruceKellan picture BruceKellan  ยท  4Comments

theoretical-olive picture theoretical-olive  ยท  5Comments

findepi picture findepi  ยท  4Comments

lxynov picture lxynov  ยท  4Comments

anismiles picture anismiles  ยท  3Comments