Presto: Add support for mode(x) or approx_mode(x) in aggregate functions

Created on 12 Apr 2019 · 4Comments · Source: prestodb/presto

Postgres has a (very) handy mode() function which allows to do:

-- this will return the most common surname among all employees
select mode() within group (order by surname) from employees

I'm aware than the same output is achievable by using a subquery, but when dealing with complex scenarios, and especially when generating SQL from templates, having a one-liner like the above is extremely useful.

In Presto, it could take the form of:

select mode(surname) from employees

or even with some degree of tolerance:

select approx_mode(surname) from employees

Is there such a feature in the pipeline? Would it be hard to implement?

enhancement function

Source

JivanRoquet

👍1

Most helpful comment

I see this issue is still open even though it's been two years... I work with AWS Athena that uses Presto as a base. It would be really nice to have either mode(x) or approximate_mode(x) functions available. Right now I am prepared to use a workaround mentioned above... But hopefully next time when I need to calculate mode(x) I can use a real mode(x) function. Thanks!

evgenia82 on 16 Apr 2021

👍2

All 4 comments

There was a long discussion about an approx version of this in #11807. The exact version makes sense to me. Though if we implement it as an aggregation function, it won't be more efficient than writing a query. Here's PostgresSQL reference: https://wiki.postgresql.org/wiki/Aggregate_Mode

rongrong on 12 Apr 2019

👍1

Agree, the exact version makes sense.

The approx version would be useful only if it provides significant performance gains.

The main drawback is writing a vanilla query is not in the difficulty, but in the need to carry over all the context from the outside query when writing this particular line. Instead of simply calling mode(column) (and thus having to care only about the column), we have to bring a lot of external context instead into the subquery.

Again, when designing SQL generating templates and routines, the former is a breeze while the latter can quickly become nightmarish as complexity grows.

JivanRoquet on 12 Apr 2019

Comparison example between Postgres and Presto (current):

CREATE TABLE transactions (
  customer INT NOT NULL,
  subcategory VARCHAR(30) NOT NULL,
  category VARCHAR(30) NOT NULL
);

INSERT INTO transactions 
    (customer, subcategory, category) 
VALUES 
    (1,'cheddar', 'cheese'),
    (2,'blue', 'cheese'),
    (3,'cherry', 'tomato'),
    (4,'shredded', 'cheese'),
    (2,'white', 'coffee'),
    (1,'black', 'coffee'),
    (4,'blue', 'cheese'),
    (2,'blue', 'cheese'),
    (1,'cheddar', 'cheese'),
    (3,'shredded', 'cheese'),
    (3,'black', 'coffee'),
    (3,'cherry', 'tomato'),
    (2,'cheddar', 'cheese'),
    (2,'shredded', 'cheese');

-- Postgres
SELECT
    customer,
    MODE() WITHIN GROUP (ORDER BY subcategory) AS "fav_cheese"
FROM transactions
WHERE category = 'cheese'
GROUP BY customer;

-- Presto
SELECT customer, subcategory AS fav_cheese
FROM (SELECT customer, category, subcategory, COUNT(*) as cnt,
             ROW_NUMBER() OVER (PARTITION BY customer ORDER BY COUNT(*) DESC) as seqnum
      FROM transactions
      WHERE category = 'cheese'
      GROUP BY customer, category, subcategory
     ) t
WHERE seqnum = 1;

The latter is much harder to deal on a maintainability basis, and much harder to read. You know that it's about computing a mode, so you can understand it, but try reading without prior knowledge of what it's about.

Plus, the context is spread all over the place: the reference to seqnum is put at the very end while its initiation is within the subquery, the where category = 'cheese' is deeply buried instead of being surfaced...

Last but not least, if you don't want to use the table transactions directly, but instead a pre-aggregated table in the form of select ... from (select ... from transactions) where .... The complexity grows ten-folds in this case because you have to keep track of everything in the subquery which computes the mode.

JivanRoquet on 12 Apr 2019