Published on July 6th, 2012 | by Michael4
Post-mortem of yesterday’s service disruptions.
Yesterday we released an update to Formstack that included a near rewrite of our MySQL database connectivity layer. The new layer lets us split reads and writes across our databases, which should improve performance as our database grows.
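For readers unfamiliar with read/write splitting, the idea can be sketched roughly as follows. This is a minimal illustration, not the actual Formstack code: the `ReadWriteRouter` class, the verb list, and the connection names are all hypothetical, and a real layer would also have to handle transactions and replication lag.

```python
import random


class ReadWriteRouter:
    """Send writes to the master and spread reads across replicas (illustrative only)."""

    # Hypothetical list of statement verbs treated as writes.
    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP")

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)

    def is_write(self, sql):
        # Look only at the first keyword of the statement.
        parts = sql.lstrip().split(None, 1)
        return bool(parts) and parts[0].upper() in self.WRITE_VERBS

    def pick(self, sql):
        # Writes always go to the master; reads fall back to it if no replica exists.
        if self.is_write(sql) or not self.replicas:
            return self.master
        return random.choice(self.replicas)


router = ReadWriteRouter("master-db", ["replica-1", "replica-2"])
write_target = router.pick("UPDATE forms SET name = 'x' WHERE id = 1")
read_target = router.pick("SELECT * FROM forms WHERE id = 1")
```

Here `write_target` is always the master, while `read_target` is one of the two replicas chosen at random.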
After the software update, about 25% of our users' forms failed to load due to stale entries in our cache (incompatible DB objects). We quickly recognized this and flushed our Memcache cluster. Forms were restored within 10 minutes.
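One common way to guard against this class of failure is to put a version number in every cache key, so that a release which changes the shape of cached objects simply misses the old entries. The sketch below shows that technique under stated assumptions; it is not what we did yesterday (we flushed the cache), and `SCHEMA_VERSION`, `DictCache`, and `load_form` are hypothetical names, with a plain dictionary standing in for Memcache.

```python
import pickle

# Hypothetical: bump this whenever the layout of cached objects changes.
SCHEMA_VERSION = 2


def cache_key(kind, object_id, version=SCHEMA_VERSION):
    """Build a key that is unique per object and per schema version."""
    return f"v{version}:{kind}:{object_id}"


class DictCache:
    """In-memory stand-in for a Memcache client (get/set only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def load_form(cache, form_id, fetch_from_db):
    """Return the cached form, or fetch and cache it under the current version."""
    key = cache_key("form", form_id)
    blob = cache.get(key)
    if blob is not None:
        return pickle.loads(blob)
    form = fetch_from_db(form_id)
    cache.set(key, pickle.dumps(form))
    return form


cache = DictCache()
# An entry written by the previous release, under the old version prefix.
cache.set(cache_key("form", 7, version=1), pickle.dumps({"legacy": True}))
# The new release never sees it, so it falls through to the database.
form = load_form(cache, 7, lambda form_id: {"id": form_id})
```

The trade-off is that old entries linger until they expire or are evicted, so the cache temporarily holds two generations of data.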
In the afternoon we noticed that one of our master databases (the ones we write to) was performing poorly. Inserts and updates to forms began to lock critical tables, making the form builder sluggish. Shortly after, the master database began to swap. We rebooted the database at 9:00PM EST, causing about 1 minute of complete downtime.
After the reboot, database performance was restored. However, after an hour, the degraded write performance came back. At 11:00PM EST, we rolled back our software changes. Since the rollback, we have regained our performance and stability.
We are working to reproduce these issues in our development environment and are analyzing MySQL’s slow query logs to determine if our software update introduced a bad SQL query.
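As an illustration of the kind of slow query log analysis involved, here is a small sketch that pulls the query time, lock time, and statement out of each log entry and ranks the slowest ones. It assumes the standard `# Query_time: ... Lock_time: ...` header line that MySQL writes for each entry; the function names and the sample entries (including the `forms` table) are made up for the example.

```python
import re

# Matches the per-entry timing header in MySQL's slow query log.
HEADER = re.compile(r"^# Query_time: (?P<query>[\d.]+)\s+Lock_time: (?P<lock>[\d.]+)")


def parse_slow_log(lines):
    """Return one dict per log entry with query_time, lock_time and sql."""
    entries, current = [], None
    for raw in lines:
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = {
                "query_time": float(match["query"]),
                "lock_time": float(match["lock"]),
                "sql": "",
            }
            entries.append(current)
        elif line.startswith("#"):
            # Any other comment line (# Time, # User@Host) starts a new entry.
            current = None
        elif current is not None and line and not line.startswith("SET timestamp="):
            current["sql"] = (current["sql"] + " " + line).strip()
    return entries


def slowest(entries, count=5):
    """Return the entries with the highest query time, slowest first."""
    return sorted(entries, key=lambda e: e["query_time"], reverse=True)[:count]


# Made-up sample in the slow query log layout.
sample = [
    "# Time: 120705 21:03:11",
    "# User@Host: app[app] @ web1 [10.0.0.5]",
    "# Query_time: 3.250000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 900",
    "SET timestamp=1341536591;",
    "SELECT * FROM forms WHERE id = 1;",
    "# User@Host: app[app] @ web2 [10.0.0.6]",
    "# Query_time: 12.500000  Lock_time: 9.750000 Rows_sent: 0  Rows_examined: 4500000",
    "SET timestamp=1341536600;",
    "UPDATE forms",
    "SET name = 'x' WHERE id = 1;",
]
entries = parse_slow_log(sample)
worst = slowest(entries, 1)[0]
```

Sorting by `lock_time` instead of `query_time` would surface the statements that spend the longest waiting on table locks.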
Once we have determined the root cause of the poor performance, we will re-launch the new database functionality. We apologize for any inconvenience.