Multiple reports from multiple users and enterprises that orders are failing because of slow server response times.
https://app.datadoghq.com/dash/host/965502139?from_ts=1614708659057&live=true&to_ts=1615313459057
Successful orders. Reasonable response times. Minimal use of swap
Failing orders - users redirected to homepage with error informing them to tell us.
Response times are insane if someone uses a report
bug-s1: a critical feature is broken: checkout, payments, signup, login
We're losing orders. This is s1.
Possible fixes to implement asap please:
1) Turn on Customer Balances performance feature for all users
2) Throw more metal at it. If we need to scale the server for 2 months let's do it.
3) Reduce the timeouts by either 1 or 2 mins please. We can support users to access the stupid reports.
The worst period today seemed to be 17:00 - 18:00. I had a look but there wasn't much use of reports during that period, and certainly none running near to the timeout. The biggest issues were around POST requests to admin order edit, and order submission at checkout. Both should be greatly improved by https://github.com/openfoodfoundation/openfoodnetwork/issues/7036
We can provision https://github.com/openfoodfoundation/ofn-install/pull/719 and restart the Rails server to toggle customer balances too :+1:
I think it wouldn't harm to upscale the VPS further and see where more RAM gets us. It's too easy not to try it.
We had another pretty severe spike between 19:45 and 21:00 last night.
We've got ~10k entries in the nginx logs between 19:15 and 21:15, so I guess it was a busy night. Ordering by highest response time shows all of the top ~30 worst requests are for the checkout.
so it may look like the bottleneck are the write operations. We need to investigate that further; from the top, the slow endpoints, to the bottom, PostgreSQL performance. We have the shared gut feeling that this may be related to PostgreSQL abusing disk, because of swap or something else.
I just run some manual VACUUMs on the most bloated tables, which should make it easier for PostgreSQL to read/write to those tables.
openfoodnetwork=> SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_all_tables
ORDER BY n_dead_tup / (
t8n_live_tup * current_setting('autovacuum_vacuum_scale_factor')::float8 + current_setting('autovacuum_vacuum_threshold')::float
) DESC LIMIT 10;
schemaname | relname | n_live_tup | n_dead_tup | last_autovacuum
------------+------------------------------+------------+------------+-------------------------------
pg_catalog | pg_attribute | 6 | 132 |
pg_catalog | pg_depend | 3 | 53 |
public | taggings | 2435 | 553 |
public | spree_inventory_units | 1142836 | 213777 |
public | distributors_payment_methods | 686 | 169 |
public | spree_orders | 315175 | 55054 | 2021-03-02 19:53:47.909817+00
public | sessions | 412047 | 71822 | 2021-03-01 04:44:23.330844+00
public | subscription_line_items | 544 | 138 |
public | spree_product_properties | 3723 | 683 |
public | enterprise_roles | 4255 | 743 |
(10 rows)
Results
openfoodnetwork=> VACUUM (VERBOSE) spree_inventory_units;
INFO: vacuuming "public.spree_inventory_units"
INFO: scanned index "spree_inventory_units_pkey" to remove 158053 row versions
DETAIL: CPU 0.09s/0.15u sec elapsed 0.58 sec
INFO: scanned index "index_inventory_units_on_order_id" to remove 158053 row versions
DETAIL: CPU 0.20s/0.22u sec elapsed 2.32 sec
INFO: scanned index "index_inventory_units_on_shipment_id" to remove 158053 row versions
DETAIL: CPU 0.07s/0.17u sec elapsed 0.56 sec
INFO: scanned index "index_inventory_units_on_variant_id" to remove 158053 row versions
DETAIL: CPU 0.12s/0.32u sec elapsed 0.85 sec
INFO: "spree_inventory_units": removed 158053 row versions in 10616 pages
DETAIL: CPU 0.04s/0.13u sec elapsed 0.31 sec
INFO: index "spree_inventory_units_pkey" now contains 1142836 row versions in 4918 pages
DETAIL: 157462 index row versions were removed.
1 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_inventory_units_on_order_id" now contains 1142836 row versions in 6878 pages
DETAIL: 158038 index row versions were removed.
10 index pages have been deleted, 9 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_inventory_units_on_shipment_id" now contains 1142836 row versions in 5545 pages
DETAIL: 135798 index row versions were removed.
3 index pages have been deleted, 3 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_inventory_units_on_variant_id" now contains 1142836 row versions in 5881 pages
DETAIL: 123048 index row versions were removed.
450 index pages have been deleted, 450 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "spree_inventory_units": found 67347 removable, 934480 nonremovable row versions in 13881 out of 16029 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 31114 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.56s/1.22u sec elapsed 4.92 sec.
VACUUM
openfoodnetwork=> VACUUM (VERBOSE) spree_orders;
INFO: vacuuming "public.spree_orders"
INFO: scanned index "spree_orders_pkey" to remove 50333 row versions
DETAIL: CPU 0.00s/0.03u sec elapsed 0.04 sec
INFO: scanned index "index_orders_on_number" to remove 50333 row versions
DETAIL: CPU 0.00s/0.11u sec elapsed 0.11 sec
INFO: scanned index "index_spree_orders_on_customer_id" to remove 50333 row versions
DETAIL: CPU 0.03s/0.06u sec elapsed 0.19 sec
INFO: scanned index "spree_orders_completed_at_user_id_created_by_id_created_at_idx" to remove 50333 row versions
DETAIL: CPU 0.00s/0.05u sec elapsed 0.06 sec
INFO: scanned index "index_spree_orders_on_order_cycle_id" to remove 50333 row versions
DETAIL: CPU 0.00s/0.05u sec elapsed 0.08 sec
INFO: scanned index "index_spree_orders_on_distributor_id" to remove 50333 row versions
DETAIL: CPU 0.01s/0.05u sec elapsed 0.10 sec
INFO: scanned index "index_spree_orders_on_user_id" to remove 50333 row versions
DETAIL: CPU 0.01s/0.06u sec elapsed 0.11 sec
INFO: "spree_orders": removed 50333 row versions in 3061 pages
DETAIL: CPU 0.00s/0.02u sec elapsed 0.03 sec
INFO: index "spree_orders_pkey" now contains 315178 row versions in 2646 pages
DETAIL: 24577 index row versions were removed.
3 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_orders_on_number" now contains 315178 row versions in 2419 pages
DETAIL: 50333 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_spree_orders_on_customer_id" now contains 315178 row versions in 2134 pages
DETAIL: 50235 index row versions were removed.
97 index pages have been deleted, 61 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "spree_orders_completed_at_user_id_created_by_id_created_at_idx" now contains 315178 row versions in 3763 pages
DETAIL: 49536 index row versions were removed.
107 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_spree_orders_on_order_cycle_id" now contains 315178 row versions in 2248 pages
DETAIL: 44893 index row versions were removed.
86 index pages have been deleted, 48 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_spree_orders_on_distributor_id" now contains 315178 row versions in 2201 pages
DETAIL: 49438 index row versions were removed.
225 index pages have been deleted, 192 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_spree_orders_on_user_id" now contains 315178 row versions in 2095 pages
DETAIL: 50251 index row versions were removed.
96 index pages have been deleted, 65 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "spree_orders": found 5575 removable, 260068 nonremovable row versions in 7668 out of 12535 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 101001 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.10s/0.51u sec elapsed 0.84 sec.
INFO: vacuuming "pg_toast.pg_toast_19906"
INFO: index "pg_toast_19906_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_19906": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUMINFO: vacuuming "public.spree_orders"
INFO: scanned index "spree_orders_pkey" to remove 50333 row versions
DETAIL: CPU 0.00s/0.03u sec elapsed 0.04 sec
INFO: scanned index "index_orders_on_number" to remove 50333 row versions
DETAIL: CPU 0.00s/0.11u sec elapsed 0.11 sec
INFO: scanned index "index_spree_orders_on_customer_id" to remove 50333 row versions
DETAIL: CPU 0.03s/0.06u sec elapsed 0.19 sec
INFO: scanned index "spree_orders_completed_at_user_id_created_by_id_created_at_idx" to remove 50333 row versions
DETAIL: CPU 0.00s/0.05u sec elapsed 0.06 sec
INFO: scanned index "index_spree_orders_on_order_cycle_id" to remove 50333 row versions
DETAIL: CPU 0.00s/0.05u sec elapsed 0.08 sec
INFO: scanned index "index_spree_orders_on_distributor_id" to remove 50333 row versions
DETAIL: CPU 0.01s/0.05u sec elapsed 0.10 sec
INFO: scanned index "index_spree_orders_on_user_id" to remove 50333 row versions
DETAIL: CPU 0.01s/0.06u sec elapsed 0.11 sec
INFO: "spree_orders": removed 50333 row versions in 3061 pages
DETAIL: CPU 0.00s/0.02u sec elapsed 0.03 sec
INFO: index "spree_orders_pkey" now contains 315178 row versions in 2646 pages
DETAIL: 24577 index row versions were removed.
3 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_orders_on_number" now contains 315178 row versions in 2419 pages
DETAIL: 50333 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_spree_orders_on_customer_id" now contains 315178 row versions in 2134 pages
DETAIL: 50235 index row versions were removed.
97 index pages have been deleted, 61 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "spree_orders_completed_at_user_id_created_by_id_created_at_idx" now contains 315178 row versions in 3763 pages
DETAIL: 49536 index row versions were removed.
107 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_spree_orders_on_order_cycle_id" now contains 315178 row versions in 2248 pages
DETAIL: 44893 index row versions were removed.
86 index pages have been deleted, 48 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_spree_orders_on_distributor_id" now contains 315178 row versions in 2201 pages
DETAIL: 49438 index row versions were removed.
225 index pages have been deleted, 192 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_spree_orders_on_user_id" now contains 315178 row versions in 2095 pages
DETAIL: 50251 index row versions were removed.
96 index pages have been deleted, 65 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "spree_orders": found 5575 removable, 260068 nonremovable row versions in 7668 out of 12535 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 101001 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.10s/0.51u sec elapsed 0.84 sec.
INFO: vacuuming "pg_toast.pg_toast_19906"
INFO: index "pg_toast_19906_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_19906": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
openfoodnetwork=> VACUUM (VERBOSE) sessions;
INFO: vacuuming "public.sessions"
INFO: scanned index "sessions_pkey" to remove 69633 row versions
DETAIL: CPU 0.06s/0.08u sec elapsed 0.62 sec
INFO: scanned index "index_sessions_on_session_id" to remove 69633 row versions
DETAIL: CPU 0.26s/0.30u sec elapsed 2.51 sec
INFO: scanned index "index_sessions_on_updated_at" to remove 69633 row versions
DETAIL: CPU 0.07s/0.09u sec elapsed 0.75 sec
INFO: "sessions": removed 69633 row versions in 6764 pages
DETAIL: CPU 0.00s/0.02u sec elapsed 0.03 sec
INFO: index "sessions_pkey" now contains 416942 row versions in 12840 pages
DETAIL: 55430 index row versions were removed.
10984 index pages have been deleted, 10878 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_sessions_on_session_id" now contains 416942 row versions in 42263 pages
DETAIL: 63326 index row versions were removed.
441 index pages have been deleted, 412 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "index_sessions_on_updated_at" now contains 416943 row versions in 14976 pages
DETAIL: 69633 index row versions were removed.
13162 index pages have been deleted, 12985 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "sessions": found 11993 removable, 133517 nonremovable row versions in 12159 out of 115198 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 421619 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.41s/0.56u sec elapsed 3.98 sec.
INFO: vacuuming "pg_toast.pg_toast_19769"
INFO: index "pg_toast_19769_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_19769": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUU
Thanks everyone for the excellent team work on this!
One more suggestion before we close this S1 and see how things go over the coming days...
Can we remove old data from:
Removing old carts and associated objects has been done in the past:
https://github.com/openfoodfoundation/openfoodnetwork/blob/master/lib/tasks/data/remove_transient_data.rb#L16-L20
@sauloperez suggests that we:
Ideally we'll set up a task to do this regularly (once every 3 months) however if that is out of scope for this s1 I'll create another issue.
Thank you :pray:
@lin-d-hop Does this delay the last release we want to do before merging Rails 5.0? In that case, there is no rush to draft a release today? ping @andrewpbrett
I don't think this should delay the release :pray:
I know I'm taking a lot of capacity with this... but if a few tweaks can buy us a few months then I think it is time well spent. If you would prefer I'll make the above into a separate issue and we can wait to see if this release makes any impact before trying.
@lin-d-hop so it would look like this, please tell me if I understood correctly:
(hopefully) today UK is upgraded with latests performance improvement
Rails 5 is merge today or tomorrow
The list of performance improvement listed here cannot made it to UK prod before Rails 5 roll-out is complete
Is that correct?
Ahhh I understand you were talking about the Rails 5 release now, rather than our mini release today.
I believe that the Remove Transient Data task does not need a release because it can be run as a once off from the Rails app. As a scheduled task run once every 3 months, I'm not sure how this is implemented, perhaps a dev can help explain.
But in summary - let's not block Rails 5 merge. If things get critical again we can try deploying the Rails5 release to UK before AU and FR as that should include a few performance improvements too :-)
Okay I think we've met the criteria for closing this. The more performant customer balances are switched on, bundle exec rake ofn:data:remove_transient_data is run, and #7036 is merged and will be deployed once testing is complete.
I think we should sort out automating the data cleanup tasks in a separate issue.
Most helpful comment
Ahhh I understand you were talking about the Rails 5 release now, rather than our mini release today.
I believe that the
Remove Transient Datatask does not need a release because it can be run as a once off from the Rails app. As a scheduled task run once every 3 months, I'm not sure how this is implemented, perhaps a dev can help explain.But in summary - let's not block Rails 5 merge. If things get critical again we can try deploying the Rails5 release to UK before AU and FR as that should include a few performance improvements too :-)