Presto: Add support for dynamic filtering

Created on 22 Feb 2017 · 16Comments · Source: prestodb/presto

fact_table - large fact table, partitioned on date_key
d_date - small dimension table, with significant correlation on it's columns and date_key.

Currently for the following query:

    count(*)
FROM 
    fact_table a 
JOIN d_date b 
ON a.date_key = b.date_key 
WHERE 
    b.year = 2017;

larger probe table is fully scanned. When presto after scaning build table (right) could extract information that for b.year = 2017 there are only small number of matching b.date_key values. This information could be used to narrow down the table scan. In other words, above query could be dynamically "rewritten" to equivalent of:

    count(*)
FROM 
    fact_table a 
JOIN d_date b 
ON a.date_key = b.date_key 
WHERE a.date_key IN (20170101,20170102, etc...)

Source

pnowojski

👍6

All 16 comments

Here's a doc with the design, please mind it's a work in progress ... https://docs.google.com/document/d/1NYhEqYfmOPUqYFVzmOWvIXds5reFlaAi7z-zeJWqSLE

ilfrin on 23 Feb 2017

Hello,
is there any progress?

Marfusios on 8 Jun 2017

Yes, @maciejgrzybek and @arhimondr are well advanced in this effort.

kbajda on 8 Jun 2017

Here's a PR addressing this issue: #8588. This is first version, handling inner joins on split pruning and stripe pruning level. Already after first pass of review.

maciejgrzybek on 29 Aug 2017

good issues

harbby on 12 Sep 2017

👍1

I'm facing a similar issue and was hoping that #8588 would address it, but I built Teradata/presto@c701552 and it looks like Presto is still doing full table scans even with dynamic_partition_pruning enabled.

In my case, I have a huge Hive table of comments partitioned on post_id, and a Postgres table of posts, each of which belongs to a user. I'd like to be able to efficiently get the comment count for a given user's posts:

SELECT COUNT(*)
FROM hive.default.comments c
JOIN postgresql.public.posts p
ON c.post_id = p.id
WHERE p.user_id = '123'

Running this query as-is does a full table scan (which eventually causes the Presto workers to exceed their memory limit), whereas running two separate queries (one query on posts to get the post ids, and a second on comments with a predicate on post_id) is lightning-fast.

Since this query is virtually identical to the original one mentioned in this issue, I figured that #8588 would fix the issue, but it doesn't seem to have made any difference. Enabling or disabling the dynamic_partition_pruning session variable has no effect on the query plan.

Am I missing something? Is there some other configuration setting that needs to be changed in order to use this feature?

Thanks!

afn on 29 Sep 2017

This could be solved by adding index join support to the PostgreSQL connector.

electrum on 29 Sep 2017

Is that to say that this PR is only relevant if the underlying connectors implement index joins?

afn on 29 Sep 2017

@afn Sorry for the confusion. I was responding to your comment specifically, not talking about this PR. Presto supports index joins for connectors that implement it. Your use case looks like a perfect fit for index joins. There is an example of this in IndexedTpchPlugin.

electrum on 29 Sep 2017

I have raised a new PR based on earlier PR:https://github.com/prestodb/presto/pull/9453
New PR has bug fixes as mentioned in my review comments + new operators for dynamic filtering.
Can someone please review it ? I can break this PR into few smaller PRs if that is a concern.

amoghmargoor on 30 Nov 2017

👍1

@electrum I find the dynamic filtering also solve this same case mentioned by @afn . When should we use index join and when should we use dynamic filtering?

hustnn on 11 Nov 2018

Is this feature still being added? I would love an update. I have seen there is an open pull request on the Presto master but there has been no activity for over a year

tastyqbit on 25 Feb 2019

@tastyqbit Presto Foundation actively works on getting this feature in (https://github.com/prestosql/presto/pull/91). Once it is there - we will cherry-pick it.

arhimondr on 25 Feb 2019

❤1

@tooptoop4 https://github.com/prestosql/presto/pull/91 is only the planner part. Are the remaining parts are still in review in Presto SQL?

arhimondr on 22 Jul 2019

@tooptoop4

If the feature is fully implemented please feel free to do the required cherry-picks assembling the whole feature and open a PR. We will be happy to do a quick review of it and have it in.

arhimondr on 22 Jul 2019

@maciejgrzybek @arhimondr Do we know if anyone working on this feature? Thanks!