Ecto: timeout when query is executed inside GenServer callback

Created on 22 Nov 2016 · 25Comments · Source: elixir-ecto/ecto

Environment

Elixir: 1.3.0
PostgreSQL 9.5
Ecto: 2.1
OS: OSX

Current behavior

If it execute the following Ecto query which is called when invoking a callback in a GenServer.

If this callback is invoked from a channel, Postgrex times out after 15s.
If it is invoked from anywhere else, including an IEX session, with the exact same arguments, it finishes in a few milliseconds without problems.

def find_live_inapp_auto_messages_matching_visitor(query \\ AutoMessage, visitor_id) do
    query =
      query
      |> by_matching_visitors
      |> by_no_duplicate_events

    from [am, a, v] in query,
    where: v.id == ^visitor_id,
    preload: [:from, :assigned_user, :assigned_team],
    order_by: [desc: :delay],
    group_by: am.id
  end

def by_matching_visitors(query \\ AutoMessage) do
    query =
      query
      |> join(:inner, [am], a in assoc(am, :app))
      |> join(:inner, [am, a], v in assoc(a, :visitors))
      |> join(:left, [am, a, v], ame in assoc(am, :auto_message_events))
      |> join(:inner_lateral, [am, a, v],
        matches in fragment("SELECT * from find_matching_visitors(?, ?, ?)", am.app_id, am.formatted_default_filters, am.formatted_custom_filters),
        matches.id == v.id)
  end

  def by_no_duplicate_events(query \\ AutoMessage) do
    from [am, a, v] in query,
    where: fragment("not exists (
        select * from auto_message_events
        where auto_message_id = ? and (visitor_id = ? or visitor_user_id = ? or visitor_email = ?) and event = 'sent'
      )", am.id, v.id, v.user_id, v.email)
  end

Any ideas?

Needs more info

Source

madshargreave

Most helpful comment

@dirkholzapfel oh, I see. This is a test issue.

It happens because we use an ownership pool for tests and that process is checking out the connection and keeping it throughout the life-cycle of your test suite. I believe DBConnection 2.0 will have improved this because we automatically reclaim connections when you change the mode.

Other than that, you can set your ownership_timeout: to :infinity in your repo configuration. It should be fine given that your tests also have a timeout. So if you run into a race the tests won't block forever as they will eventually crash too.

josevalim on 19 Jun 2018

❤3

All 25 comments

No ideas. Can you provide a sample application that reproduces the error?

josevalim on 22 Nov 2016

I've tried, but unfortunately the error only appears in production

madshargreave on 22 Nov 2016

What is the exact error you get with stacktrace?

josevalim on 22 Nov 2016

When you invoke inside IEx, are you also invoking it in production? Can it be that in production that query is taking more than 15 seconds because of the database?

josevalim on 22 Nov 2016

That was more initial thought, but running from IEX in production works fine

10:09:41.824 [error] Postgrex.Protocol (#PID<0.480.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.30750.0> timed out because it checked out the connection for longer than 15000ms
2016-11-22T10:09:41.830218173Z app[web.1]: 10:09:41.830 [error] GenServer Resend.AutoMessages.TriggerMonitor terminating
2016-11-22T10:09:41.830258265Z app[web.1]: ** (DBConnection.ConnectionError) tcp recv: closed
2016-11-22T10:09:41.830267105Z app[web.1]:     (ecto) lib/ecto/adapters/postgres/connection.ex:109: Ecto.Adapters.Postgres.Connection.execute/4
2016-11-22T10:09:41.830273377Z app[web.1]:     (ecto) lib/ecto/adapters/sql.ex:244: Ecto.Adapters.SQL.sql_call/6
2016-11-22T10:09:41.830279062Z app[web.1]:     (ecto) lib/ecto/adapters/sql.ex:420: Ecto.Adapters.SQL.execute_or_reset/7
2016-11-22T10:09:41.830284377Z app[web.1]:     (ecto) lib/ecto/repo/queryable.ex:121: Ecto.Repo.Queryable.execute/5
2016-11-22T10:09:41.830289468Z app[web.1]:     (ecto) lib/ecto/repo/queryable.ex:35: Ecto.Repo.Queryable.all/4
2016-11-22T10:09:41.830294447Z app[web.1]:     (ecto) lib/ecto/repo/queryable.ex:59: Ecto.Repo.Queryable.one/4
2016-11-22T10:09:41.830299455Z app[web.1]:     (resend) lib/resend/auto_messages/monitors/trigger.ex:48: Resend.AutoMessages.TriggerMonitor.handle_cast/2

madshargreave on 22 Nov 2016

@madshargreave that looks like an error to me!

lukerollans on 22 Nov 2016

@madshargreave that error means the query has been running for long. Try increasing the timeout for those queries by passing Repo.all(query, timeout: 30_000) (or even larger) and see if it changes anything.

** (DBConnection.ConnectionError) tcp recv: closed

@fishcakez does the above mean the DB closed the connection on us or does it simply mean we closed it on error?

josevalim on 22 Nov 2016

I tried with 60000s and it still times out.

I've also tried running the exact same query as generated by Ecto manually in the production postgres terminal, and it also returns in a few milliseconds

madshargreave on 22 Nov 2016

Sorry, I am stumped. I have no idea what the root cause may be.

josevalim on 22 Nov 2016

Is the query in the GenServer run inside a transaction? Timeout only
applies to the transaction as a whole. The timeout option is ignored for
queries inside a transaction.

On Tuesday, 22 November 2016, José Valim [email protected] wrote:

Sorry, I am stumped.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/elixir-ecto/ecto/issues/1817#issuecomment-262213146,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AB6JTbK7n6m8G5AFECOSeEwPLHVnj0Dtks5rAs2XgaJpZM4K5MjZ
.

fishcakez on 22 Nov 2016

Ah the stacktrace says its not.

On Tuesday, 22 November 2016, James Fish [email protected] wrote:

Is the query in the GenServer run inside a transaction? Timeout only
applies to the transaction as a whole. The timeout option is ignored for
queries inside a transaction.

On Tuesday, 22 November 2016, José Valim <[email protected]

Sorry, I am stumped.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/elixir-ecto/ecto/issues/1817#issuecomment-262213146,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AB6JTbK7n6m8G5AFECOSeEwPLHVnj0Dtks5rAs2XgaJpZM4K5MjZ
.

fishcakez on 22 Nov 2016

@madshargreave if you find out how to reproduce this error, we would appreciate it, otherwise I am afraid there isn't much for us to do. Try, for example, using the same configuration options for prod and dev (except by the database credentials).

josevalim on 22 Nov 2016

No problem, I'll see if I can produce something and report back

madshargreave on 22 Nov 2016

We have same on production, It happens when we deal with big workloads that overloads connection pool, so timeouts start to happen. Thats because you can spawn GenServers that make queries much faster than DB/Repo pool can handle it.

You may want to increase pool_timeout (up to :infinity). And increase pool size. But you should be careful with it, because timeouts can appear later in other places and debugging will be even worse.

Our final solution was to use GenStage. We read data with back-pressure and create same amount of consumer-producer's as repo pool size.

Edited. I looked closer at your code, this is a different issue, sorry. Our is also hard to reproduce, but I doesn't have anything about channels.

AndrewDryga on 22 Nov 2016

This makes sense, but I still don't think it's the exact same issue.

For example, I am not hitting the pool_timeout limit, but rather, it's the query that is timing out, and it also wouldn't explain why I'm able to can invoke the same GenServer callback (that's executing query) from a production IEX session without problems

madshargreave on 22 Nov 2016

You may want to increase pool_timeout (up to :infinity).

In many situation you may benefit from decreasing the pool_timeout if you are getting timeouts because it keeps the queue shorter and it may not change the number of processes that crash significantly. Increasing the timeout is equivalent to increasing the maximum size of the queue, a longer queue takes longer to empty. This means periods of frequent timeouts can last longer and magnifies the affect on the rest of the app. Instead of being slow for a few seconds the app grinds to a halt for a minute. Just like congestion on the motorway.

If timeouts are proving a problem you may benefit from using the DBConnection.Sojourn pool. It has more queue management techniques to help the queue stay fast and minimise timeouts. It will nearly always opt to timeout sooner rather than later.

And increase pool size

This can only work up to a limit, its possible the pool is already too big and you can get better throughput from a smaller pool size. Moving it up or down can help.

A nice article is here: http://ferd.ca/queues-don-t-fix-overload.html

fishcakez on 22 Nov 2016

Since it has been a week, is there any extra information to reproduce this issue?

josevalim on 30 Nov 2016

Still haven't gotten further unfortunately

madshargreave on 3 Dec 2016

Ok, I will close this. Please drop a comment or reopen if you have information for us to reproduce it.

josevalim on 5 Dec 2016

@madshargreave @josevalim Hello! I reproduced this problem in dev environment. The code you can see here: https://github.com/rom8726/ecto_conn_timeout

rom8726 on 8 Dec 2017

@rom8726 you are using the SQL sandbox, which is a pool exclusive for testing. If you remove this configuration here you should be good to go.

josevalim on 8 Dec 2017

👍1

@josevalim i am running into the same problem and the links above are dead. could you clarify which configuration params are important to set? thank you!

dirkholzapfel on 19 Jun 2018

@dirkholzapfel it was a parameter specific to their setup, where they were accidentally setting the :pool in development to something they should not. Unless you have the :pool also set in dev and prod, which you should remove, I don't think their config will help.

josevalim on 19 Jun 2018

@josevalim we have the problem in test-env / sandbox-mode. we have a GenServer which makes a tiny SQL-Query (milliseconds) periodically. when running our test suite we get

[error] Postgrex.Protocol (#PID<0.651.0>) disconnected: ** (DBConnection.ConnectionError) owner #PID<0.656.0> timed out because it owned the connection for longer than 60000ms

it would be totally ok to not start this GenServer in test-env but i would like to understand the problem.

dirkholzapfel on 19 Jun 2018

@dirkholzapfel oh, I see. This is a test issue.

josevalim on 19 Jun 2018

❤3

Was this page helpful?

0 / 5 - 0 ratings