Presto: Add support for select distinct on(a, b, c) ...

Created on 12 Apr 2019  Â·  9Comments  Â·  Source: prestodb/presto

The issue in Presto is that on one side, one can't use

select distinct on (a, b)
  c
from d

but one also cannot use:

select
  c
from d
group by a, b

Combining these two limitations together, makes deduplicating rows a relatively cumbersome process, needing resorting either to subqueries with window functions and retrieving the row number, or to array aggregations. Again, a lot of context to be carried over, a complexity which adds up exponentially as more elements get in, and much more error-prone than either of the cleaner solutions above.

Postgres implementation of select distinct on is very straightforward and even allows for custom sorting, e.g:

select distinct on (a, b)
  c
from d
order by
  e desc,
  f asc

All 9 comments

Do you have references to SQL spec or examples on how other query engines handle this? Thanks!

Can you explain how this is different from using arbitrary or max or max_by?
SELECT max_by(e, c) from d group by a, b

Can you explain how this is different from using arbitrary or max or max_by?
SELECT max_by(e, c) from d group by a, b

SELECT DISTINCT ON in Postgres lets you select entire rows, whereas max_by/min_by returns one result. I couldn't find a way to do this in Presto without creating a CTE.

See:
https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564
https://www.postgresql.org/docs/9.5/sql-select.html#SQL-DISTINCT

Found a solution from https://redshift-support.matillion.com/s/article/2822021

ROW_NUMBER() OVER ( PARTITION BY <<unique columns>> ORDER BY <<sort columns>>) as counts

And then select where counts=1 only.

Hope this can help

That's similar to what I ended up leaving it as, but I think it's still rather lengthy and inefficient compared to postgres' select distinct on for something that's used quite frequently.

On Apr 21, 2020, at 7:29 AM, Nicolas Guary notifications@github.com wrote:



Found a solution from https://redshift-support.matillion.com/s/article/2822021https://urldefense.proofpoint.com/v2/url?u=https-3A__redshift-2Dsupport.matillion.com_s_article_2822021&d=DwMCaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=sequIz2q--a9nk09NOXnvw&m=FaDjpZ1AYA6BtkwFTmXFQnLlZJT4YQYzhTR5Az-su_Q&s=CWrZqCbcRt3tPL4vbIhr-qGdd8e_52maXXqmIWYT8Fc&e=

ROW_NUMBER() OVER ( PARTITION BY <> ORDER BY <>) as counts

And then select where counts=1 only.

Hope this can help

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHubhttps://github.com/prestodb/presto/issues/12653#issuecomment-617215658, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AN25Q3Z2KUOAQ7FONQRTUTLRNWUSPANCNFSM4HFONFLQ.

@NicolasGuary if you read my original post

[need to resort to] subqueries with window functions and retrieving the row number

Plus, I'm not sure why quote a post about Redshit to demonstrate how Presto is working.

@JivanRoquet The solution from here work for me on Athena:

SELECT Name, MAX(Address), MAX(other field)...
FROM MyTable
GROUP BY Name

Will give you one row per Name.

https://stackoverflow.com/a/6792357/9225626

@Kligerr that wasn't probably clear enough in my original message, but the issue with this is that you need the Name field to be included in your column selection as well. For instance, the following wouldn't work in Presto:

  Age,
  Address,
  LastPurchaseDate
FROM MyTable
GROUP BY Name

To achieve that you would need to encapsulate your query into a wrapper like:

  Age,
  Address,
  LastPurchaseDate
FROM (
  SELECT
    Name,
    Age,
    Address,
    LastPurchaseDate
  FROM MyTable
  GROUP BY Name
) b

Which is, again, much more cumbersome and complex than the Postgres way:

SELECT DISTINCT ON (NAME)
  Age,
  Address,
  LastPurchaseDate
FROM MyTable

It's already a problem when you write each query manually, but above all it makes writing automated queries a much more complex process.

@rongrong

Do you have references to SQL spec or examples on how other query engines handle this? Thanks!

one example that I just stumbled upon illustrating the benefit of the distinct on syntax. if you take the following:

select
  vessel_id,
  timestamp,
  longitude,
  latitude
from (
  select
    rank() over (partition by vessel_id, ts_, lon_, lat_ order by timestamp asc) as r,
    *
  from (
    select
      vessel_id,
      timestamp,
      cast(timestamp / 7200 as integer) as ts_,
      cast(longitude * 100 as integer) as lon_,
      cast(latitude * 100 as integer) as lat_,
      longitude,
      latitude
    from position
  ) v
) q
where r = 1

instead of this, distinct on can allow rewriting the above query in a much simpler manner, avoiding one level of subquery

select
  distinct on (vessel_id, ts_, lon_, lat_)
  vessel_id,
  timestamp,
  longitude,
  latitude
from (
    select
      vessel_id,
      timestamp,
      cast(timestamp / 7200 as integer) as ts_,
      cast(longitude * 100 as integer) as lon_,
      cast(latitude * 100 as integer) as lat_,
      longitude,
      latitude
    from position
) v
order by vessel_id, ts_, lon_, lat_, timestamp desc
Was this page helpful?
0 / 5 - 0 ratings

Related issues

zsaltys picture zsaltys  Â·  4Comments

tesseract2048 picture tesseract2048  Â·  3Comments

aminalaee picture aminalaee  Â·  3Comments

xerial picture xerial  Â·  3Comments

shigechuanqi picture shigechuanqi  Â·  3Comments