Hi,
Thanks for your cool project.
I need to do some grouping in a query. The docs suggest using a view (https://docs.hasura.io/1.0/graphql/manual/queries/derived-data.html).
My problem is that I want to filter the rows before they are grouped (using a WHERE).
For example, imagine you have a table of appointments. I want to filter them based on their date (e.g. events after November) and group them by their category.
If I use a view, the grouping happens first and I cannot access the date column afterwards; thus, I cannot filter.
For the moment, I solved this by writing my own function, but then I lose much of the power and simplicity that comes with Hasura (for example, I have to implement the filters myself).
Did I miss something? I cannot stop thinking there must be a more elegant solution I missed.
There are many ways to achieve this via pure SQL views. Please see GROUP BY, HAVING, or subqueries in any SQL reference :)
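As a sketch of what the subquery suggestion could look like with the appointments example (the view name here is made up, and note the filter value is still baked into the view definition):

```sql
-- Filter first in a derived table, then group only the filtered rows.
CREATE VIEW recent_appointments_by_category AS
SELECT category, COUNT(*) AS count
FROM (
  SELECT category
  FROM appointments
  WHERE date > '2017-01-01'   -- filter runs before the grouping
) AS filtered
GROUP BY category;
```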
Hi @tirumaraiselvan, thanks for the answer.
I know how HAVING and subqueries work, but how do you suggest using them?
I will explain my case in more detail.
Following the example from before: imagine you have a table of appointments. I want to filter them based on their date (e.g. events after November) and group them by their category.
```sql
CREATE VIEW appointments_by_category AS
SELECT category, COUNT(*) AS count
FROM appointments
GROUP BY category;
```
Calling it from the Hasura GraphQL endpoint:
```graphql
query MyQuery {
  appointments_by_category {
    category
    count
  }
}
```
But if I wanted to filter by date:
```graphql
query MyQuery {
  appointments_by_category(where: {date: {_gt: "2017-01-01"}}) {
    category
    count
  }
}
```
This will not work since the field "date" is not available in the results.
I could create the view like this:
```sql
CREATE VIEW appointments_by_category AS
SELECT category, COUNT(*) AS count
FROM appointments
WHERE date > '2017-01-01'
GROUP BY category;
```
But I want the filter to be dynamic. Obviously, I don't want to always filter by the same date.
Will a view like the below work?
```sql
CREATE VIEW appointments_by_category_and_date AS
SELECT category, date, COUNT(*) AS count
FROM appointments
GROUP BY category, date;
```
Nope :(
The results are different. I can have lots of groups if I group by date; even after filtering, they would need to be grouped again.
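For completeness, that second grouping is straightforward in plain SQL over the `appointments_by_category_and_date` view (filter by date first, then collapse back to one row per category); the catch is that Hasura's generated query API doesn't expose this re-aggregation. A sketch:

```sql
-- Re-group the per-(category, date) counts into per-category totals
SELECT category, SUM(count) AS count
FROM appointments_by_category_and_date
WHERE date > '2017-01-01'
GROUP BY category;
```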
@hector Right. The only solution I can imagine with "no code" is:
```sql
CREATE VIEW appointments_by_category_and_date AS
SELECT category, date, COUNT(*) AS count
FROM appointments
GROUP BY category, date;
```
And then run an auto-generated aggregate query on it:
```graphql
query {
  appointments_by_category_and_date_aggregate(
    where: {
      category: {_eq: "ABC"},
      date: {_gt: "2017-01-01"}
    }
  ) {
    aggregate {
      count
    }
  }
}
```
Mmm I see your point.
The thing is that when querying, you want the count of every category group; with your solution you only get the count of a single group.
I guess there is no way to do it without functions.
any solution?
@joni7777 if you have the same problem that I do, you will need to implement a SQL function that does the grouping for you. Check these docs: https://hasura.io/docs/1.0/graphql/manual/schema/custom-functions.html#custom-sql-functions
I can imagine Hasura might implement a solution for this but I understand it is not trivial. For starters, the type of data returned when grouping changes for every possible combination.
@hector Yeah, I saw that option and I did try it, but when you want to group the data and summarise it, AND filter by timestamp for specific subsets of the data, it's not possible because the grouped rows don't have the timestamp.
Do you have any idea?
For now I'm grouping in JS instead 😂
Example:

```text
user | views | timestamp
1    | 100   | "2020-03-01 ...."
1    | 200   | "2020-03-02 ...."
1    | 300   | "2020-03-03 ...."
2    | 100   | "2020-03-01 ...."
2    | 250   | "2020-03-02 ...."
3    | 300   | "2020-03-03 ...."
```

Results wanted for timestamp >= "2020-03-01 ...." AND <= "2020-03-02 ....":

```text
1 | 300
2 | 350
```
You need to build a SQL function which accepts the parameters you need. In your case I think it would be the dates, right?
Something like this:
```sql
-- Postgres has no "datetime" type, so the parameters use timestamptz
CREATE FUNCTION users_grouped_function(start_datetime timestamptz, end_datetime timestamptz)
RETURNS SETOF users_grouped AS $$
  SELECT users.id, sum(users.views)
  FROM users
  WHERE start_datetime <= users.timestamp AND end_datetime >= users.timestamp
  GROUP BY users.id;
$$ LANGUAGE sql STABLE;
```
I wrote this on the fly so there might be errors.
An important thing to note is that the return type has to be a table. This is one of the constraints of Hasura (https://hasura.io/docs/1.0/graphql/manual/schema/custom-functions.html#supported-sql-functions).
So you will need to create an empty table (I called it users_grouped in the example) with columns that have the types you want to return in your select.
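That companion table could look something like this (a sketch; the column types must match what the function's SELECT returns, assuming `views` is an integer column, so `sum(views)` yields `bigint`):

```sql
-- Empty table that only exists to give the function a return type;
-- track it in Hasura, then track users_grouped_function against it
CREATE TABLE users_grouped (
  id integer,   -- matches users.id
  views bigint  -- matches sum(users.views)
);
```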
Yeah, just saw it and used it; it's a good enough solution, but I really wish there were built-in group-by aggregations and joins with remote schemas 🙌🏻
Our app is heavy on data visualization, so grouping and filtering are super important, and JavaScript filtering is definitely not an option. We made a Postgres function that takes arguments to adjust the `GROUP BY` and `WHERE`, plus a matching Hasura-tracked table that has the same datatype columns as the output of the function. The extra special bit is that the `where` function argument is a JSON type that is interpreted with the same shape as the Hasura `where` in GraphQL queries. This allows us to use all the same front-end tooling for data filtering without redoing all the `where` generation logic.
With this setup we can also still use all of Hasura's relationship logic. Does Hasura rock or what??
To find all the work log entries that employees did on a job for a particular customer, grouped by whatever we like, and to graph the results without JS madness, we can do this:
```graphql
query {
  grouped_work_entry_aggregate(
    args: {
      where: {userId: {_in: ["id1", "id2"]}}
      groupByDay: false
      groupByUser: true
      groupByCustomer: false
    }
    order_by: {day: desc}
  ) {
    aggregate {
      count
    }
    nodes {
      duration
      avgDuration
      user { displayName }
      day
      customer {
        name
      }
    }
  }
}
```
SQL function:
Here you can see the burrowing into the JSON `where` argument to find values in a structure identical to Hasura's `where`.
```sql
DROP FUNCTION IF EXISTS public.grouped_work_entry;
CREATE FUNCTION public.grouped_work_entry(
  "where" json default null,
  "groupByDay" boolean default false,
  "groupByWeek" boolean default false,
  "groupByMonth" boolean default false,
  "groupByJob" boolean default false,
  "groupByUser" boolean default false,
  "groupByCustomer" boolean default false
)
-- Must return a Hasura-tracked table of matching shape/type
RETURNS SETOF work_entry_generic_group AS $$
SELECT
  -- Optional grouping setup
  -- Sets a field to NULL if the grouping flag arg was not set
  CASE WHEN "groupByCustomer" THEN "job"."customerId" ELSE null END as "gCustomer",
  CASE WHEN "groupByDay" THEN date_trunc('day', "work_entry"."start") ELSE null END as "gDay",
  CASE WHEN "groupByWeek" THEN date_trunc('week', "work_entry"."start") ELSE null END as "gWeek",
  CASE WHEN "groupByMonth" THEN date_trunc('month', "work_entry"."start") ELSE null END as "gMonth",
  CASE WHEN "groupByUser" THEN "work_entry"."userId" ELSE null END as "gUser",
  CASE WHEN "groupByJob" THEN "work_entry"."jobId" ELSE null END as "gJob",
  -- Counts
  COUNT(DISTINCT "work_entry"."userId"),
  COUNT(DISTINCT "work_entry"."id"),
  COUNT(DISTINCT "work_entry"."jobId"),
  COUNT(DISTINCT "job"."customerId"),
  -- Work entry duration stats
  SUM("work_entry"."duration"),
  AVG("work_entry"."duration"),
  MIN("work_entry"."duration"),
  MAX("work_entry"."duration"),
  -- More work entry time stats
  MIN("work_entry"."start"),
  MAX("work_entry"."end"),
  MIN("work_entry"."startTime"),
  MAX("work_entry"."endTime")
FROM "work_entry"
INNER JOIN "job" ON "work_entry"."jobId" = "job"."id"
WHERE
  -- Customer
  ("where"#>>'{job,customerId,_eq}' IS NULL OR ("job"."customerId" = "where"#>>'{job,customerId,_eq}'))
  AND ("where"#>>'{job,customerId,_in}' IS NULL OR ((("where"#>'{job,customerId,_in}')::jsonb) ? "job"."customerId"))
  -- User
  AND ("where"#>>'{userId,_eq}' IS NULL OR ("work_entry"."userId" = "where"#>>'{userId,_eq}'))
  AND ("where"#>>'{userId,_in}' IS NULL OR ((("where"#>'{userId,_in}')::jsonb) ? "work_entry"."userId"))
  -- Job
  AND ("where"#>>'{jobId,_eq}' IS NULL OR ("work_entry"."jobId" = "where"#>>'{jobId,_eq}'))
  AND ("where"#>>'{jobId,_in}' IS NULL OR ((("where"#>'{jobId,_in}')::jsonb) ? "work_entry"."jobId"))
  -- Times
  AND ("where"#>>'{start,_lte}' IS NULL OR ("work_entry"."start" <= ("where"#>>'{start,_lte}')::timestamp with time zone))
  AND ("where"#>>'{start,_gte}' IS NULL OR ("work_entry"."start" >= ("where"#>>'{start,_gte}')::timestamp with time zone))
GROUP BY
  "gDay",
  "gWeek",
  "gMonth",
  "gJob",
  "gCustomer",
  "gUser"
$$ LANGUAGE sql STABLE;
```
And the matching Hasura-tracked table:
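(The screenshot of the table definition is not reproduced here. A rough sketch of what the tracked table would have to look like, with column names inferred from the GraphQL query above and types guessed from the function's SELECT list; treat every name and type below as an assumption:)

```sql
-- Hypothetical shape of the tracked return table; the column order
-- and types must line up with the function's SELECT list.
CREATE TABLE work_entry_generic_group (
  "customerId"    text,         -- gCustomer
  "day"           timestamptz,  -- gDay
  "week"          timestamptz,  -- gWeek
  "month"         timestamptz,  -- gMonth
  "userId"        text,         -- gUser
  "jobId"         text,         -- gJob
  "userCount"     bigint,
  "entryCount"    bigint,
  "jobCount"      bigint,
  "customerCount" bigint,
  "duration"      interval,     -- SUM of work_entry.duration
  "avgDuration"   interval,
  "minDuration"   interval,
  "maxDuration"   interval,
  "minStart"      timestamptz,
  "maxEnd"        timestamptz,
  "minStartTime"  time,
  "maxEndTime"    time
);
```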

@nolandg I like your solution and I made something very similar :)
Still, I believe this ability to have "group bys" in Hasura is very valuable and would be much better than the custom function solution.
@nolandg How "generic" do you think that custom function is?
I'm wondering whether it would be possible to write a JS/TS script that uses introspection to auto-generate at least mostly functional copies of these group by functions per-table. Seems like it could be neat and fairly useful.
You think it might map cleanly from table + relationship definition to SQL function templates?
@GavinRay97 The function I showed above is not generic at all but yes, if we had more tables we needed that grouping on, we would definitely make some sort of code generator to produce the SQL.
I think if you defined a limited set of first-level fields/values/aggregations you wanted, it would not be hard to map them cleanly. Making it completely generic and covering all cases would produce a huge function of questionable performance and utility, I fear. What might work is describing the shape of aggregations and conditions you want and then making a parser to produce that function.
You would also need to handle the Hasura side and make a generator to create (and run?) the migrations required to create the Hasura-tracked table that mirrors the function--the function's output table type--with all the relations you want on it. Again, it's infeasible to make it completely generic and cover all cases, especially when it comes to relationships. You'd need to describe the limits you want.
In a perfect world, Hasura would do all this for us :-)
I would also love to have this feature built in. Adding functions to the database is not ideal, but is a workaround we will use for now.
Just to give another use case which would greatly benefit from server side group by: charts. Charts usually require aggregations.
Taking https://github.com/hasura/graphql2chartjs (https://www.youtube.com/watch?v=153iv1-qFuc) as an example, a bar chart showing how many albums an artist has. A view is used to count the nb of albums per artist.
If we go a bit beyond this simple example and imagine a more complex page, we would like to filter artists by time/genre/etc.
I understand that arbitrary group by will make hasura query performance highly unpredictable, but for me the discussion should be limited to medium datasets, < 1mil rows. For these Postgres can do arbitrary filtering and grouping in 1-2 seconds. Current solutions:
For extremely large data sets even filtering needs to be done on indexed fields and responsive aggregate queries usually requires pre-aggregation, which actually reduces the problem to medium datasets.
Also subscriptions with grouping might be postponed IMHO, if problematic to implement in the first iteration.
How does this work? You didn't select the id column of the matched table in your code example.
It would be helpful if you could post a gist with the complete code, then.
@revskill10 Not sure I understand what you mean by "select the column id". This thread is talking about grouping, and the code I posted groups rows by certain fields. What id would you select? You would need to use an aggregation function on the id column of the source table, things like sum or avg, which don't make any sense for id columns. The rows returned by databases from GROUP BY and similar operations cannot be identified by the original row ids, since they are grouped and aggregated.
If you need to uniquely identify the returned rows, then you can generate an id for each row in the above SQL. If non-deterministic ids are ok for your application, you can simply use a SQL UUID function. If you need deterministic ids for the grouped rows, then you could create a hash of your determining columns right in the SQL above. We eventually required deterministic ids because we started using subscriptions on this grouped table: if subscriptions don't return deterministic ids, it wreaks havoc with your Apollo cache and the UI continually refreshes. This is what we are doing now for generating the id column:
```sql
CONCAT(
  min("work_entry"."id"::text), '_',
  min("work_entry"."userId"), '_',
  min("work_entry"."jobId"), '_',
  min("work_entry"."taskId"), '_',
  min("work_entry"."start"), '_',
  CASE WHEN "groupByDay" THEN date_trunc('day', "work_entry"."start")::date ELSE null END, '_',
  CASE WHEN "groupByWeek" THEN date_trunc('week', "work_entry"."start")::date ELSE null END, '_',
  CASE WHEN "groupByMonth" THEN date_trunc('month', "work_entry"."start")::date ELSE null END
),
```
It combines all the fields our application deems as determining the identity of a unique grouped row.
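If the concatenated string gets unwieldy, the same determining fields can be collapsed into a fixed-length value with Postgres's built-in `md5` function (a sketch over the same fields as above, aliased as `id` for illustration):

```sql
-- Hash the determining fields into a fixed-length, deterministic id
md5(CONCAT(
  min("work_entry"."id"::text), '_',
  min("work_entry"."userId"), '_',
  min("work_entry"."jobId"), '_',
  min("work_entry"."taskId"), '_',
  min("work_entry"."start"), '_',
  CASE WHEN "groupByDay" THEN date_trunc('day', "work_entry"."start")::date ELSE null END
)) AS "id",
```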