The issue in Presto is that on one side, one can't use
select distinct on (a, b)
c
from d
but one also cannot use:
select
c
from d
group by a, b
Combining these two limitations together, makes deduplicating rows a relatively cumbersome process, needing resorting either to subqueries with window functions and retrieving the row number, or to array aggregations. Again, a lot of context to be carried over, a complexity which adds up exponentially as more elements get in, and much more error-prone than either of the cleaner solutions above.
Postgres implementation of select distinct on is very straightforward and even allows for custom sorting, e.g:
select distinct on (a, b)
c
from d
order by
e desc,
f asc
Do you have references to SQL spec or examples on how other query engines handle this? Thanks!
Can you explain how this is different from using arbitrary or max or max_by?
SELECT max_by(e, c) from
d
group by a, b
Can you explain how this is different from using arbitrary or max or max_by?
SELECT max_by(e, c) from d group by a, b
SELECT DISTINCT ON in Postgres lets you select entire rows, whereas max_by/min_by returns one result. I couldn't find a way to do this in Presto without creating a CTE.
See:
https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564
https://www.postgresql.org/docs/9.5/sql-select.html#SQL-DISTINCT
Found a solution from https://redshift-support.matillion.com/s/article/2822021
ROW_NUMBER() OVER (
PARTITION BY <<unique columns>>
ORDER BY <<sort columns>>) as counts
And then select where counts=1 only.
Hope this can help
That's similar to what I ended up leaving it as, but I think it's still rather lengthy and inefficient compared to postgres' select distinct on for something that's used quite frequently.
On Apr 21, 2020, at 7:29 AM, Nicolas Guary notifications@github.com wrote:

Found a solution from https://redshift-support.matillion.com/s/article/2822021https://urldefense.proofpoint.com/v2/url?u=https-3A__redshift-2Dsupport.matillion.com_s_article_2822021&d=DwMCaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=sequIz2q--a9nk09NOXnvw&m=FaDjpZ1AYA6BtkwFTmXFQnLlZJT4YQYzhTR5Az-su_Q&s=CWrZqCbcRt3tPL4vbIhr-qGdd8e_52maXXqmIWYT8Fc&e=
ROW_NUMBER() OVER ( PARTITION BY <
And then select where counts=1 only.
Hope this can help
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHubhttps://github.com/prestodb/presto/issues/12653#issuecomment-617215658, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AN25Q3Z2KUOAQ7FONQRTUTLRNWUSPANCNFSM4HFONFLQ.
@NicolasGuary if you read my original post
[need to resort to] subqueries with window functions and retrieving the row number
Plus, I'm not sure why quote a post about Redshit to demonstrate how Presto is working.
@JivanRoquet The solution from here work for me on Athena:
SELECT Name, MAX(Address), MAX(other field)...
FROM MyTable
GROUP BY Name
Will give you one row per Name.
@Kligerr that wasn't probably clear enough in my original message, but the issue with this is that you need the Name field to be included in your column selection as well. For instance, the following wouldn't work in Presto:
Age,
Address,
LastPurchaseDate
FROM MyTable
GROUP BY Name
To achieve that you would need to encapsulate your query into a wrapper like:
Age,
Address,
LastPurchaseDate
FROM (
SELECT
Name,
Age,
Address,
LastPurchaseDate
FROM MyTable
GROUP BY Name
) b
Which is, again, much more cumbersome and complex than the Postgres way:
SELECT DISTINCT ON (NAME)
Age,
Address,
LastPurchaseDate
FROM MyTable
It's already a problem when you write each query manually, but above all it makes writing automated queries a much more complex process.
@rongrong
Do you have references to SQL spec or examples on how other query engines handle this? Thanks!
one example that I just stumbled upon illustrating the benefit of the distinct on syntax. if you take the following:
select
vessel_id,
timestamp,
longitude,
latitude
from (
select
rank() over (partition by vessel_id, ts_, lon_, lat_ order by timestamp asc) as r,
*
from (
select
vessel_id,
timestamp,
cast(timestamp / 7200 as integer) as ts_,
cast(longitude * 100 as integer) as lon_,
cast(latitude * 100 as integer) as lat_,
longitude,
latitude
from position
) v
) q
where r = 1
instead of this, distinct on can allow rewriting the above query in a much simpler manner, avoiding one level of subquery
select
distinct on (vessel_id, ts_, lon_, lat_)
vessel_id,
timestamp,
longitude,
latitude
from (
select
vessel_id,
timestamp,
cast(timestamp / 7200 as integer) as ts_,
cast(longitude * 100 as integer) as lon_,
cast(latitude * 100 as integer) as lat_,
longitude,
latitude
from position
) v
order by vessel_id, ts_, lon_, lat_, timestamp desc