Ecto: Latency spikes after updating to Ecto 3

Created on 21 Jan 2019  路  34Comments  路  Source: elixir-ecto/ecto

Precheck

  • Do not use the issues tracker for help or support requests (try Elixir Forum, Stack Overflow, IRC or mailing lists, etc).
  • For proposing a new feature, please start a discussion on elixir-ecto.
  • For bugs, do a quick search and make sure the bug has not yet been reported.
  • Finally, be nice and have fun!

Environment

  • Elixir version (elixir -v):
Elixir 1.7.4
Erlang/OTP 21 [erts-10.2.2] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
  • Database and version (PostgreSQL 9.4, MongoDB 3.2, etc.): PostgreSQL 9.5
  • Ecto version (mix deps): 3.0.6
  • Database adapter and version (mix deps): postgrex 0.14.1
  • Operating system: Alpine Linux v3.7
  • DB pool size: 15

Current behavior

After updating one of our applications to Ecto 3 we're observing unexpected latency spikes on random queries.

In the application I created a subscription to Ecto events via Telemetry which emits data for each query to Datadog, here's what we can see all the time (this graph shows max query time)
screen shot 2019-01-21 at 09 32 35

Here's the graph from another application for which we also updated Ecto to 3.0
screen shot 2019-01-21 at 09 40 44
You can see that the max execution time jumped on Tuesday 15th which correlates with the time the changes were deployed.

I also set up both DB and application to log queries that took longer than 100ms. I can see these slow queries in the application logs, but not in the DB logs.

Also note, that the application is mostly read-only and the latency spikes happen even with "SELECT" queries (this excludes lock contention on the DB side). The queries are very simple and usually fast, tables have all necessary indexes.

Expected behavior

Query latency is stable, without spikes.

Most helpful comment

Quick update.

I've taken a look at our graphs and it turns out that we don't have this issue anymore. The latency spikes disappeared after our infrastructure team performed a failover exercise and migrated everything from AWS availability zone us-east-1b to us-east-1a. We don't really know what happened, but now everything works normally.

Here's how the graph looks like, you can clearly see the point when the failover happened.
image

So probably Ecto is not to blame, although it's still strange that Ecto 2 worked fine (or seemed to work fine) in AZ us-east-1b.

All 34 comments

Hi @take-five!

Thank you for the report. We need a bit more information:

  1. for those logged queries, can you also log the queue_time, decode_time and query_time? Those are all in the telemetry event metadata.

  2. Furthermore, do you have logs for the full life cycle? For example, if this is a web app, do you have logs for the whole request/response too? It would be good to know if those spikes are also translating to the whole request/response.

  3. Can you please let us know if you changed anything else during the deployment or was it only the Ecto version?

  4. Did you configure your Ecto pool before? If so, with which parameters? Did you configure it now? Or is everything running "stock"?

  5. Finally, can you check protocols are consolidated in production? You can log the value of Protocol.consolidated?(Enumerable) when your app starts.

We will be glad to look into this.

for those logged queries, can you also log the queue_time, decode_time and query_time? Those are all in the telemetry event metadata.

Yes, all values are logged, you can see them on this screenshot (units are milliseconds)
screen shot 2019-01-21 at 10 39 42

Furthermore, do you have logs for the full life cycle? For example, if this is a web app, do you have logs for the whole request/response too? It would be good to know if those spikes are also translating to the whole request/response.

Here's the example of logs for one of those requests
screen shot 2019-01-21 at 10 42 50

Can you please let us know if you changed anything else during the deployment or was it only the Ecto version?

It was Ecto, Postgrex and DBConnection versions + fixing some deprecation warnings about Ecto.Repo.update_all with returning: true

Did you configure your Ecto pool before? If so, with which parameters? Did you configure it now? Or is everything running "stock"?

We configure it in config/config.exs:

config :app, App.Repo,
  ...
  pool_size: 15

We used this setting before upgrading to Ecto 3 and didn't change it.

Finally, can you check protocols are consolidated in production?

Yes, they are. I SSH-ed into the running app in production and executed the command you suggested. It returns true.

@take-five perfect, thanks for the quick follow up!

Hi @take-five. We discussed this internally and unfortunately nothing came to mind as the root cause of this issue, so the next step would be to isolate it further - which we know can be very hard.

It is also very interesting that it happens in a certain frequency and always around value. Do you have anything periodically being executed in your app?

Other than that, can you reproduce it in development?

Hi @josevalim, thanks for the follow up

It is also very interesting that it happens in a certain frequency and always around value

Actually, I couldn't find any regularity in these spikes, it seems very random as you can see on the first graph in the opening post. The second graph represents wider range and has some regularity, but it's related to how traffic volume is distributed over the daily hours. The second graph is here to show how things changed after upgrading to Ecto 3.

Do you have anything periodically being executed in your app?

One app has periodically executing tasks, other doesn't. Given that performance changed in both apps after upgrading to Ecto 3, I don't think it's related.

Other than that, can you reproduce it in development?

To some extent, but I'm not sure if my method is correct.

  1. I made it so the queries that take longer than 10ms are logged in the application
  2. I turned on logging of all queries in Postgres
  3. I ran a very cheap query, which normally takes fractions of millisecond to complete, 100k times. Among these 100k queries I consistently get 5-15 outliers that take more than 10ms to complete.

This is what I see in application log:

queue_time=0 decode_time=0 query_time=14
queue_time=0 decode_time=0 query_time=10
queue_time=0 decode_time=0 query_time=17
queue_time=0 decode_time=0 query_time=21
queue_time=0 decode_time=0 query_time=13
queue_time=0 decode_time=0 query_time=32

This is what I see in the DB log:

$ grep duration /usr/local/var/log/postgres.log | grep -v "duration: 0"
[unknown] LOG:  duration: 6.105 ms
[unknown] LOG:  duration: 1.356 ms
[unknown] LOG:  duration: 4.686 ms
[unknown] LOG:  duration: 3.169 ms
[unknown] LOG:  duration: 1.116 ms
[unknown] LOG:  duration: 2.075 ms
[unknown] LOG:  duration: 1.416 ms
[unknown] LOG:  duration: 4.896 ms
[unknown] LOG:  duration: 2.460 ms
[unknown] LOG:  duration: 7.358 ms
[unknown] LOG:  duration: 2.143 ms
[unknown] LOG:  duration: 3.013 ms
[unknown] LOG:  duration: 2.189 ms
[unknown] LOG:  duration: 7.270 ms
[unknown] LOG:  duration: 3.284 ms
[unknown] LOG:  duration: 4.395 ms
[unknown] LOG:  duration: 4.016 ms
[unknown] LOG:  duration: 1.784 ms
[unknown] LOG:  duration: 1.773 ms
[unknown] LOG:  duration: 2.780 ms

As you can see, all queries took less than 8 ms to complete in DB, not saying about 32ms from application perspective 馃槂

In development I run the same PostgreSQL version as in production, it runs on the same machine, so there's no network latency involved. Could it be that we have long GC pauses on connection process?

After we reverted one of our apps to Ecto 2 in production, the graph looks stable again:
image

And 95th percentile looks like this (this big hump is a consequence of restart, I guess):
image

So it's not only about latency spikes, but also general performance degrade (even if it's not that big in absolute numbers).

@josevalim, let me know if I can provide any additional information

@take-five if you take the example app in ecto's repository, add some data to it, and run a bunch of simple queries, can you reproduce the issue?

To be clear, by reproduce the issue, I mean, reproduce the results you posted in your comment above in development.

@josevalim I ran the same test on the example app. I had to make some modifications, e.g. add instrumentation code, change default logger level to INFO (by the way, mix.exs in the example app is missing dependency ecto_sql).

I uploaded the instrumentation code here: https://gist.github.com/take-five/f62ba41adfb5c81d12f6c090283427f1

I ran the test twice, here's the result from the second run:

iex(2)> (1..100_000) |> Enum.each(fn _ -> Friends.Repo.get!(Friends.Person, 1) end)
19:44:03.657 [warn]  query_time=10
19:44:13.606 [warn]  query_time=11
19:44:21.507 [warn]  query_time=17
19:44:21.708 [warn]  query_time=21
19:44:22.354 [warn]  query_time=10

This time there are less outliers, but I guess that's because the table has only 3 rows and only 4 columns.

And here's the results from DB log:

$ grep duration /usr/local/var/log/postgres.log | grep -v "duration: 0"
[unknown] LOG:  duration: 2.042 ms
[unknown] LOG:  duration: 4.174 ms

I also ran the same test on Ecto 2.2 and results are slightly worse, but this time there are slow queries in DB log

iex(3)> (1..100_000) |> Enum.each(fn _ -> Friends.Repo.get!(Friends.Person, 1) end)

20:11:16.174 [warn]  query_time=18
20:11:20.515 [warn]  query_time=14
20:11:20.670 [warn]  query_time=47
20:11:30.422 [warn]  query_time=10
20:11:30.433 [warn]  query_time=10
20:11:31.200 [warn]  query_time=14
20:11:41.660 [warn]  query_time=11
$ grep duration /usr/local/var/log/postgres.log | grep -v "duration: 0"
[unknown] LOG:  duration: 13.873 ms
[unknown] LOG:  duration: 39.876 ms
[unknown] LOG:  duration: 2.380 ms
[unknown] LOG:  duration: 14.381 ms

It makes me think that the test wasn't correct

Thanks @take-five. Yes, I would guess those are about garbage collection, especially because they are running in the same process. We can try increasing the amount of data and see if they happen more often, as that would make it clearer it is indeed cause by garbage collection.

In this case, query_time is very poorly named, because it is not the time of the query but rather how long we spent with the database connection.

I asked earlier about the whole request/response time. Does it become worst on average with Ecto 3.0? Or is it roughly the same? If it is the same, then it could mean that for some reason garbage collection is being triggered more often inside Ecto, while before it would happen outside. However if you are having increases or spikes in the request/response, then it is definitely something to worry about.

I asked earlier about the whole request/response time. Does it become worst on average with Ecto 3.0?

Yes, it became worse, we're observing the same spikes on the application response time graphs since we deployed Ecto 3.0.

image

Thanks @take-five. Unfortunately I am not sure what are the next steps. Do you have any suggestions?

One option is to increase the data in the example app and see if it can reproduce it but that is just a guess.

@fishcakez do you have any suggestions?

Out of 3 dependencies that have been upgraded (ecto, db_connection, postgrex), which one, do you think, is most likely the culprit?

@take-five in a way it doesn't matter because they have to be upgraded together as we worked on them together. But if I had to guess ecto or db_connection, the work on postgrex was minimal.

Do you have any suggestions on how to approach this problem? Maybe some ideas about how to catch and profile these outliers on a live system?

@take-five the only next suggestion I have is to profile some of those queries using mix profile.eprof or mix profile.cprof. Are you deploying using Mix or Distillery?

Hi @josevalim ,

I deployed observer_cli to our production system, compared a picture on Ecto 2 and Ecto 3 applications, and found a strange pattern.

Here's what we can see on Ecto 3 application:

|205|<0.1838.0>  |    3.1797 KB |Elixir.DBConnection.Connection:init/1 |               386947| 0        |erlang:hibernate/3               |
|206|<0.1837.0>  |    3.1797 KB |Elixir.DBConnection.Connection:init/1 |               373573| 0        |erlang:hibernate/3               |
|207|<0.1836.0>  |    3.1797 KB |Elixir.DBConnection.Connection:init/1 |               383517| 0        |erlang:hibernate/3               |
|208|<0.1835.0>  |    3.1797 KB |Elixir.DBConnection.Connection:init/1 |               373575| 0        |erlang:hibernate/3               |

And here's what we can see on Ecto 2 application:

|29 |<0.908.0>   |   16.6289 KB |Elixir.DBConnection.Connection:init/1 |               108717| 0        |gen_server:loop/7                |
|30 |<0.906.0>   |   16.6289 KB |Elixir.DBConnection.Connection:init/1 |               108716| 0        |gen_server:loop/7                |
|31 |<0.905.0>   |   16.6289 KB |Elixir.DBConnection.Connection:init/1 |               108712| 0        |gen_server:loop/7                |
|32 |<0.904.0>   |   16.6289 KB |Elixir.DBConnection.Connection:init/1 |               108720| 0        |gen_server:loop/7                |

I assume, this line puts the connection process into hibernate mode. Could it be the culprit?

Interesting. Could you try a fork of db_connection without that hibernate line and see if it improves things? Maybe @fishcakez has some other ideas too.

Yeah, I'll experiment withdb_connection this week and let you know about the results.

I've let our application run with the patched db_connection for a bit, but there's no difference - we're still experiencing the same spikes.

If you start the console in production and run some of those queries, can
you reproduce those values from the console? If so, one option is to add
:mix and :tools to your release and invoke this function:

https://hexdocs.pm/mix/master/Mix.Tasks.Profile.Eprof.html#profile/2

Jos茅 Valimwww.plataformatec.com.br
http://www.plataformatec.com.br/Founder and Director of R&D

Hibernation is out of band of the normal request path, its only called after a connect and after a ping. It is a good use of hibernate because those processes aren't used unless we see disconnect/idle connection or stop. Idle connections do no occur concurrently unless ping time is very very big (order of seconds which should not happen in healthy state).

Hi @josevalim , sorry for the long response, now I finally could get back to this problem and run some experiments on a live system.

I added :tools app and package ExProf to our system and ran some benchmarks from the console on the production instance. Then I ran the same query against an empty table 2000 times and collected traces for the top 3 slowest of them.

I uploaded eprof reports here: https://gist.github.com/take-five/a8d5897dee3adf3617fc9f40ed4f1e60

Unfortunately, I couldn't catch any long running queries from the console. Running queries with eprof takes quite a lot of time, it slows down execution significantly (two orders of magnitude). Also, out of these 3 traces, 2 bottom ones seem like they were caused by preemption or garbage collection.

Thanks @take-five! Can you please share how and which command was run? From this code snippet, it seems that the eprof report was "corrupted" with evaluation information. Maybe we need to run this slightly different.

Sure, this is what I ran on the server (sorry for spaghetti-ish code)

import ExProf.Macro
operator = OperatorAuth.Repo.get!(OperatorAuth.Operator, "aa8ab5f1-779d-475e-a1a2-750a375c5109")

capture_io = fn fun ->
  original_gl = Process.group_leader()
  {:ok, capture_gl} = StringIO.open("", capture_prompt: true)

  try do
    Process.group_leader(self(), capture_gl)
    fun.()
  after
    #StringIO.close(capture_gl)
    Process.group_leader(self(), original_gl)
  end
end

profile! = fn threshold_ms ->
  {records, {total_time, result}} = capture_io.(fn ->
    profile do
      :timer.tc(fn -> OperatorAuth.OperatorRepository.allowed_login_from?(operator, "127.0.0.1") end)
    end
  end)

  call_time = Enum.reduce(records, 0, &(&1.time + &2)) / 1000

  if call_time > threshold_ms do
    {:slow, total_time / 1000, call_time, records}
  else
    {:fast, call_time, records}
  end
end

crunch! = fn threshold_ms, iterations ->
  started_at = System.monotonic_time()

  {_, _, records} = Enum.reduce_while(1..iterations, {0, 0, [nil, nil, nil]}, fn n, {sum, longest, [records1, records2, records3]} ->
    case profile!.(threshold_ms) do
      {:slow, total_time, call_time, records} ->
        {:halt, {sum, call_time, records}}

      {:fast, call_time, new_records} ->
        records =
          if call_time > longest do
            [new_records, records1, records2]
          else
            [records1, records2, records3]
          end

        if rem(n, 50) == 0 do
          elapsed = System.convert_time_unit(System.monotonic_time() - started_at, :native, :second)

          IO.puts "#{n} samples analyzed, call time avg = #{sum / 50} ms; " <>
                  "call time max = #{max(call_time, longest)} ms; " <>
                  "time elapsed = #{elapsed} s"
          {:cont, {0, max(call_time, longest), records}}
        else
          {:cont, {sum + call_time, max(call_time, longest), records}}
        end
    end
  end)

  records = Enum.reject(records, &is_nil/1)

  IO.puts "TOP #{length(records)} traces"
  IO.puts "-----------------------------"
  Enum.each(records, fn r ->
    ExProf.Analyzer.print(r)
    IO.puts "----"
    IO.puts "call_time: #{Enum.reduce(r, 0, &(&1.time + &2)) / 1000}"
    IO.puts "----"
    IO.puts ""
  end)
end

crunch!.(50, 1000)

I ran this code from the iex console on a live system (operator_auth/bin/operator_auth remote_console)

Did you run this on IEx or on a separate file?

Jos茅 Valimwww.plataformatec.com.br
http://www.plataformatec.com.br/Founder and Director of R&D

Did you run this on IEx or on a separate file?

I ran this in IEx (operator_auth/bin/operator_auth remote_console)

That evaluates the code and that interferes with the profiling. You can try
to put all of it inside a

defmodule Foo do
  def run do
    ...
  end
end

And invoke Foo.run. That should generate more prestine results. :)

If you can give it a try, it will be very appreciated, thanks!

Jos茅 Valimwww.plataformatec.com.br
http://www.plataformatec.com.br/Founder and Director of R&D

Hi @josevalim ,

I compiled everything as a module and ran the benchmark again. Again, I couldn't get any really slow traces, here are the 3 slowest ones: https://gist.github.com/take-five/75455509720a70fbc3c92572ab77692e

Thank you! It is so weird that so much time is spent on ETS. Quick question: how did you install and compile Erlang?

We use this Dockerfile to build the production image:

FROM elixir:1.7.4-alpine AS build

ENV MIX_ENV=prod

RUN apk add --no-cache git make gcc libc-dev

# Install Hex+Rebar
RUN mix local.hex --force && \
    mix local.rebar --force

WORKDIR /app

# Cache elixir deps
COPY mix.exs mix.lock ./

RUN mix deps.get --only prod
COPY config ./config
RUN mix deps.compile

COPY . .
RUN mix release --env=prod

# Runtime container
FROM alpine:3.9

RUN apk --no-cache add bash openssl

ENV MIX_ENV=prod REPLACE_OS_VARS=true

COPY --from=build /app/rel/operator_auth /operator_auth

CMD ["./operator_auth/bin/operator_auth", "foreground"]

We use ERTS included in the release.

Here's ours rel/vm.args.eex file:

## Name of the node
-name <%= release_name %>@127.0.0.1

## Cookie for distributed erlang
-setcookie <%= release.profile.cookie %>

## Enable kernel poll and a few async threads
+K true
+A 100

## Increase number of concurrent ports/sockets
-env ERL_MAX_PORTS 4096

# Enable SMP automatically based on availability
-smp auto

Quick update.

I've taken a look at our graphs and it turns out that we don't have this issue anymore. The latency spikes disappeared after our infrastructure team performed a failover exercise and migrated everything from AWS availability zone us-east-1b to us-east-1a. We don't really know what happened, but now everything works normally.

Here's how the graph looks like, you can clearly see the point when the failover happened.
image

So probably Ecto is not to blame, although it's still strange that Ecto 2 worked fine (or seemed to work fine) in AZ us-east-1b.

Since I don't think there's anything actionable in this issue right now, I'm going to close it. If the observed behaviour is replicated again, we can reopen then.

Was this page helpful?
0 / 5 - 0 ratings