We have existing post-mortem issues for debugging the cause of replication lag incidents. Today, doing the underlying WAL analysis for one of those is a manual job:

- Set up WAL-E with production decryption keys
- Run wal-e to download segments for the relevant time period
- Build and run xlogdump across the relevant files
- Analyze them using hand-written grep, sed, awk, uniq -c combinations (a rough sketch of this pipeline is given at the end of this note)

Instead it seems like it would be just as easy to do this in a scalable, repeatable way - and one that also doesn't put production data on random laptops or elsewhere:

- Set up a GCP instance that uses the standard bootstrap.sh to obtain the production decryption keys via KMS etc.
- Run wal-e to continuously fetch new logs as they appear - or even set up pg_receivexlog to stream them (second sketch below)
- Write a simple mtail script to keep interesting counters (third sketch below)

That would be about the same as the above, and about the same amount of work too. It would have the benefit of being documented and repeatable just by rerunning the same bootstrap script and installing the same wal-e and mtail scripts from the repo.

The next step would then be to leave the instance running indefinitely (add it to the chef scripts as a pet so it has the same security policies as any other machine) and then:

- Build a simple dashboard in Grafana showing the rates of the counters.

That would not be very hard, and it would let us quickly see, whenever we want to understand an xlog spike, where the extra xlog is coming from. I think it would be a valuable addition to our dashboards. Causes of xlog spikes we have seen include:

- vacuuming a table full of pages that need to be vacuumed (typically after a large batch update or delete).
- vacuum freeze on a large table that hasn't needed to be vacuumed recently (typically insert-only tables).
- a large batch update or insert (typically a migration in our case).
- heavy lock contention on foreign key locks.
- full page write spikes after a checkpoint.
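For concreteness, here is a minimal sketch of the ad-hoc pipeline. It assumes WAL-E is already configured through its usual environment (archive prefix plus the decryption keys) and that xlogdump prints one record per line with the resource manager name in a fixed field - that field position is an assumption, and the segment names are placeholders.

```bash
#!/usr/bin/env bash
# Ad-hoc WAL analysis sketch: fetch a few segments covering the incident
# window, dump the records, and count them by resource manager.
set -euo pipefail

workdir=$(mktemp -d)

# wal-e wal-fetch pulls a single named segment out of the archive.
# The segment names here are placeholders, not real ones.
for seg in 000000010000002A00000090 000000010000002A00000091; do
  wal-e wal-fetch "$seg" "$workdir/$seg"
done

# Dump all records and count by resource manager, most frequent first.
# The awk field position is an assumption about xlogdump's output format.
for seg in "$workdir"/*; do
  xlogdump "$seg"
done | awk '/rmgr:/ { print $2 }' | sort | uniq -c | sort -rn
```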
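For the streaming half of the always-on setup, a sketch using pg_receivexlog; the hostname, user, and landing directory are assumptions, and on PostgreSQL 10+ the tool is named pg_receivewal.

```bash
#!/usr/bin/env bash
# Stream WAL from the primary onto the analysis instance instead of
# polling the archive with wal-e. Connection details are placeholders.
set -euo pipefail

WAL_DIR=/var/lib/analysis/wal   # hypothetical landing directory
mkdir -p "$WAL_DIR"

# Runs forever, writing each segment into $WAL_DIR as it fills.
# A dedicated replication slot (--slot) would make restarts lossless,
# at the cost of pinning WAL on the primary if this instance dies.
exec pg_receivexlog \
  --host=primary.db.internal \
  --username=replication \
  --directory="$WAL_DIR"
```

The wal-e alternative would simply be a loop calling wal-e wal-fetch for each newly archived segment.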
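The counter step would really be an mtail program matching on xlogdump output; as a stand-in, here is the same counting logic in shell, emitting Prometheus textfile-format counters. The metric name and file paths are hypothetical.

```bash
#!/usr/bin/env bash
# Stand-in for the mtail script: aggregate record counts per resource
# manager and expose them as Prometheus-style counters. In the real
# setup mtail would do this incrementally as new lines appear.
set -euo pipefail

WAL_DIR=/var/lib/analysis/wal               # matches the sketch above
METRICS=/var/lib/node_exporter/xlog.prom    # hypothetical textfile path

for seg in "$WAL_DIR"/*; do
  xlogdump "$seg"
done | awk '
  /rmgr:/ { n[$2]++ }
  END {
    for (r in n)
      printf "postgres_xlog_records_total{rmgr=\"%s\"} %d\n", r, n[r]
  }' > "$METRICS.tmp" && mv "$METRICS.tmp" "$METRICS"
```

A Grafana panel over something like `rate(postgres_xlog_records_total[5m])` would then break an xlog spike down by record type, which is exactly the dashboard described above.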