Ecto: Existence query in `Repo`

Created on 2 Mar 2018 · 12Comments · Source: elixir-ecto/ecto

In #234 there were proposal for adding Repo.exists?/1 function that will check if given query will result with any data.

I think that now it is good idea to reopen that question as Ecto stabilised a little, we have adapters for some NoSQLs and there is Ecto 3.0 around the corner. However mine "SQL" solution is a little bit different from the original

  def exists?(query) do
    import Ecto.Query

    result =
      query
      |> exclude(:select)
      |> limit(1)
      |> select(1)
      |> one()

    not is_nil(result)
  end

Advanced Discussion

Source

hauleth

👍6

Most helpful comment

@josevalim I have tested performance in PosgreSQL using different syntaxes.

Script:

CREATE TEMPORARY TABLE
    tmp AS SELECT
        generate_series(1,1000000) AS id,
        round(random() * 100)::integer AS value;

\echo '========== WITHOUT INDEX ==========\n\n'
\echo 'COUNT'
EXPLAIN ANALYZE SELECT COUNT(*) FROM tmp WHERE value BETWEEN 40 AND 60;
\echo '\nLIMIT 1'
EXPLAIN ANALYZE SELECT 1 FROM tmp WHERE value BETWEEN 40 AND 60 LIMIT 1;
\echo '\nEXISTS'
EXPLAIN ANALYZE SELECT EXISTS (SELECT 1 FROM tmp WHERE value BETWEEN 40 AND 60);

\echo '\n\n'
\echo '==========  WITH INDEX   ==========\n\n'
CREATE INDEX ON tmp (value);
\echo 'COUNT'
EXPLAIN ANALYZE SELECT COUNT(*) FROM tmp WHERE value BETWEEN 40 AND 60;
\echo '\nLIMIT 1'
EXPLAIN ANALYZE SELECT 1 FROM tmp WHERE value BETWEEN 40 AND 60 LIMIT 1;
\echo '\nEXISTS'
EXPLAIN ANALYZE SELECT EXISTS (SELECT 1 FROM tmp WHERE value BETWEEN 40 AND 60);

Results:

SELECT 1000000
Time: 908,841 ms
========== WITHOUT INDEX ==========


COUNT
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=19438.25..19438.26 rows=1 width=8) (actual time=383.055..383.055 rows=1 loops=1)
   ->  Seq Scan on tmp  (cost=0.00..19425.75 rows=5000 width=0) (actual time=0.117..339.119 rows=209917 loops=1)
         Filter: ((value >= 40) AND (value <= 60))
         Rows Removed by Filter: 790083
 Planning time: 0.656 ms
 Execution time: 383.197 ms
(6 rows)

Time: 623,812 ms

LIMIT 1
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.89 rows=1 width=4) (actual time=0.050..0.051 rows=1 loops=1)
   ->  Seq Scan on tmp  (cost=0.00..19425.75 rows=5000 width=4) (actual time=0.049..0.049 rows=1 loops=1)
         Filter: ((value >= 40) AND (value <= 60))
 Planning time: 0.090 ms
 Execution time: 0.087 ms
(5 rows)

Time: 0,566 ms

EXISTS
                                                 QUERY PLAN                                                 
------------------------------------------------------------------------------------------------------------
 Result  (cost=3.89..3.90 rows=1 width=1) (actual time=0.018..0.018 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Seq Scan on tmp  (cost=0.00..19425.75 rows=5000 width=0) (actual time=0.016..0.016 rows=1 loops=1)
           Filter: ((value >= 40) AND (value <= 60))
 Planning time: 0.089 ms
 Execution time: 0.059 ms
(6 rows)

Time: 0,400 ms



==========  WITH INDEX   ==========


CREATE INDEX
Time: 1721,912 ms (00:01,722)
COUNT
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3321.93..3321.94 rows=1 width=8) (actual time=905.761..905.761 rows=1 loops=1)
   ->  Index Only Scan using tmp_value_idx on tmp  (cost=0.42..3309.43 rows=5000 width=0) (actual time=0.110..841.317 rows=209917 loops=1)
         Index Cond: ((value >= 40) AND (value <= 60))
         Heap Fetches: 209917
 Planning time: 0.435 ms
 Execution time: 905.842 ms
(6 rows)

Time: 906,816 ms

LIMIT 1
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..1.09 rows=1 width=4) (actual time=0.082..0.082 rows=1 loops=1)
   ->  Index Only Scan using tmp_value_idx on tmp  (cost=0.42..3309.43 rows=5000 width=4) (actual time=0.081..0.081 rows=1 loops=1)
         Index Cond: ((value >= 40) AND (value <= 60))
         Heap Fetches: 1
 Planning time: 0.092 ms
 Execution time: 0.121 ms
(6 rows)

Time: 0,491 ms

EXISTS
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=1.09..1.10 rows=1 width=1) (actual time=0.049..0.049 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Only Scan using tmp_value_idx on tmp  (cost=0.42..3309.43 rows=5000 width=0) (actual time=0.047..0.047 rows=1 loops=1)
           Index Cond: ((value >= 40) AND (value <= 60))
           Heap Fetches: 1
 Planning time: 0.150 ms
 Execution time: 0.076 ms
(7 rows)

Time: 0,663 ms

So it seems that for now LIMIT 1 version is Good Enough™ and in future it could be fixed when we gain support for EXISTS and/or FROM-less queries.

EDIT: @mememori if you are using PostgreSQL then I would encourage you to check this out and change your Repo.aggregate(Foo, :count, :id) to other solution as COUNT() in PSQL can be terribly slow.

hauleth on 7 Apr 2018

❤5

All 12 comments

Can you provide examples of where exists?/1 would be useful? In my mind, most cases I think about using exists? I would still need the data later... or even worse, I would check if something associated exists, leading to N+1 queries.

josevalim on 2 Mar 2018

Here on Xerpa, we have a couple of places where we need to check if something exists (but do not work with the data); right now we are using Repo.aggregate(query, :count, :id) and if id == 0

For example, we wanted to disallow closing/archiving the treatment of the employees work period if there was an older work period open to be treated and closed. For this, we didn't want to retrieve any particular data, just to know if there existed anything matching the query.

We have stumbled upon this several times already. A Repo.exists? or something like that would prove to be of use to us (naturally we could also implement this as a helper over Repo.aggregate on our repo module)

umamaistempo on 2 Mar 2018

@josevalim in my case it is that I need to run slightly different query depending on given user position (geoquery). And I cannot select data and detect by it as there is also pagination. So only solution is to always run check if there is something in given area and then build proper request basing on wether there is something to query at all.

hauleth on 2 Mar 2018

@mememori would the implementation above also work at Xerpa?

josevalim on 6 Mar 2018

Blog post about even more optimised query using EXISTS instead of LIMIT or COUNT, with cons of using COUNT approach.

hauleth on 18 Mar 2018

👍5

@hauleth my concern is how would we make it work. We don't support queries without a from and we don't support exists expressions. :(

josevalim on 7 Apr 2018

@josevalim I have tested performance in PosgreSQL using different syntaxes.

Script:

CREATE TEMPORARY TABLE
    tmp AS SELECT
        generate_series(1,1000000) AS id,
        round(random() * 100)::integer AS value;

\echo '========== WITHOUT INDEX ==========\n\n'
\echo 'COUNT'
EXPLAIN ANALYZE SELECT COUNT(*) FROM tmp WHERE value BETWEEN 40 AND 60;
\echo '\nLIMIT 1'
EXPLAIN ANALYZE SELECT 1 FROM tmp WHERE value BETWEEN 40 AND 60 LIMIT 1;
\echo '\nEXISTS'
EXPLAIN ANALYZE SELECT EXISTS (SELECT 1 FROM tmp WHERE value BETWEEN 40 AND 60);

\echo '\n\n'
\echo '==========  WITH INDEX   ==========\n\n'
CREATE INDEX ON tmp (value);
\echo 'COUNT'
EXPLAIN ANALYZE SELECT COUNT(*) FROM tmp WHERE value BETWEEN 40 AND 60;
\echo '\nLIMIT 1'
EXPLAIN ANALYZE SELECT 1 FROM tmp WHERE value BETWEEN 40 AND 60 LIMIT 1;
\echo '\nEXISTS'
EXPLAIN ANALYZE SELECT EXISTS (SELECT 1 FROM tmp WHERE value BETWEEN 40 AND 60);

Results:

SELECT 1000000
Time: 908,841 ms
========== WITHOUT INDEX ==========


COUNT
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=19438.25..19438.26 rows=1 width=8) (actual time=383.055..383.055 rows=1 loops=1)
   ->  Seq Scan on tmp  (cost=0.00..19425.75 rows=5000 width=0) (actual time=0.117..339.119 rows=209917 loops=1)
         Filter: ((value >= 40) AND (value <= 60))
         Rows Removed by Filter: 790083
 Planning time: 0.656 ms
 Execution time: 383.197 ms
(6 rows)

Time: 623,812 ms

LIMIT 1
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.89 rows=1 width=4) (actual time=0.050..0.051 rows=1 loops=1)
   ->  Seq Scan on tmp  (cost=0.00..19425.75 rows=5000 width=4) (actual time=0.049..0.049 rows=1 loops=1)
         Filter: ((value >= 40) AND (value <= 60))
 Planning time: 0.090 ms
 Execution time: 0.087 ms
(5 rows)

Time: 0,566 ms

EXISTS
                                                 QUERY PLAN                                                 
------------------------------------------------------------------------------------------------------------
 Result  (cost=3.89..3.90 rows=1 width=1) (actual time=0.018..0.018 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Seq Scan on tmp  (cost=0.00..19425.75 rows=5000 width=0) (actual time=0.016..0.016 rows=1 loops=1)
           Filter: ((value >= 40) AND (value <= 60))
 Planning time: 0.089 ms
 Execution time: 0.059 ms
(6 rows)

Time: 0,400 ms



==========  WITH INDEX   ==========


CREATE INDEX
Time: 1721,912 ms (00:01,722)
COUNT
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3321.93..3321.94 rows=1 width=8) (actual time=905.761..905.761 rows=1 loops=1)
   ->  Index Only Scan using tmp_value_idx on tmp  (cost=0.42..3309.43 rows=5000 width=0) (actual time=0.110..841.317 rows=209917 loops=1)
         Index Cond: ((value >= 40) AND (value <= 60))
         Heap Fetches: 209917
 Planning time: 0.435 ms
 Execution time: 905.842 ms
(6 rows)

Time: 906,816 ms

LIMIT 1
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..1.09 rows=1 width=4) (actual time=0.082..0.082 rows=1 loops=1)
   ->  Index Only Scan using tmp_value_idx on tmp  (cost=0.42..3309.43 rows=5000 width=4) (actual time=0.081..0.081 rows=1 loops=1)
         Index Cond: ((value >= 40) AND (value <= 60))
         Heap Fetches: 1
 Planning time: 0.092 ms
 Execution time: 0.121 ms
(6 rows)

Time: 0,491 ms

EXISTS
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=1.09..1.10 rows=1 width=1) (actual time=0.049..0.049 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Only Scan using tmp_value_idx on tmp  (cost=0.42..3309.43 rows=5000 width=0) (actual time=0.047..0.047 rows=1 loops=1)
           Index Cond: ((value >= 40) AND (value <= 60))
           Heap Fetches: 1
 Planning time: 0.150 ms
 Execution time: 0.076 ms
(7 rows)

Time: 0,663 ms

So it seems that for now LIMIT 1 version is Good Enough™ and in future it could be fixed when we gain support for EXISTS and/or FROM-less queries.

hauleth on 7 Apr 2018

❤5

Thank you @hauleth!

To clarify, my concern is that the query one is more generic. For example, we decided to add last to Ecto.Query rather than Ecto.Repo because it composes better with other things, such as subqueries and associations. So I think having some sort of exists in Ecto.Query would be nice but it is unclear how to support it.

josevalim on 7 Apr 2018

I would like to move ahead with this but I am not sure how to best expose the exists interface. Adding a function to the adapter seems like the best choice for now: Ecto.Repo.exists? will add LIMIT 1 and SELECT 1 to the query and return a boolean.

/cc @jeregrine

josevalim on 2 Jun 2018

@josevalim when would this function be published on hex.pm?