@AmyKort commented on Tue Nov 28 2017
We experienced API outages on 11/15/17, and undertook a series of mitigation steps. https://github.com/18F/openFEC/issues/2756
One outstanding issue is to investigate slow queries, which we'll be undertaking now.
@AmyKort commented on Tue Nov 28 2017
@patphongs Could you tack on the post-mortem doc? Thanks!
@patphongs commented on Tue Nov 28 2017
Post mortem document started by @LindsayYoung is here :lock:
Added logging of queries that take longer than 1s to execute on all the RDS databases using the AWS console. Also created a PR https://github.com/18F/fec-infrastructure/pull/40 to make this logging permanent.
Used aws cli to download the logs from AWS using commands of the form:
aws rds download-db-log-file-portion --db-instance-identifier fec-govcloud-prod-replica-2 --log-file-name "error/postgresql.log.2017-12-04-01" --output text > postgresql.log.2017-12-04-01.log
Analyzed the downloaded logs for a 10 hour period on 2017-12-04 using pgBadger . pgBadger can be installed on OS X using Homebrew.
The command to analyze the logs (as RDS appears to use a custom log) is:
pgbadger --prefix '%t:%r:%u@%d:[%p]:' *.log
This yielded a nice HTML report that is easy to analyze. Attaching a sample: out.html.gz
The next step is to analyze the slow queries.
Tagging @ccostino, @LindsayYoung and @patphongs to see attached pgBadger report ^^^.
Thank you so much, @vrajmohan! Will take a look very soon!
Created a wiki page - https://github.com/18F/openFEC/wiki/Analyzing-slow-queries
Most helpful comment
Created a wiki page - https://github.com/18F/openFEC/wiki/Analyzing-slow-queries