Post-mortem of yesterday’s service disruptions.
Yesterday we released an update to Formstack that included a near-complete rewrite of our MySQL database connectivity layer. The new layer lets us split reads and writes across our databases, which should improve performance as our database grows.
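A minimal sketch of what read/write splitting means, written in Python for illustration and not taken from the Formstack code base. The `ReadWriteRouter` class and the host names are hypothetical; the assumption is that plain reads can go to replicas while writes, locking reads, and anything inside a transaction must go to the master.

```python
import itertools


class ReadWriteRouter:
    """Pick a database for each SQL statement (illustrative only)."""

    READ_VERBS = ("SELECT", "SHOW", "DESCRIBE", "EXPLAIN")

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas; fall back to the primary if there are none.
        self._replicas = itertools.cycle(replicas or [primary])

    def target_for(self, sql, in_transaction=False):
        words = sql.lstrip().split(None, 1)
        verb = words[0].upper() if words else ""
        is_read = verb in self.READ_VERBS
        # SELECT ... FOR UPDATE takes locks, and reads inside a transaction
        # must see that transaction's own writes, so both stay on the primary.
        if is_read and not in_transaction and "FOR UPDATE" not in sql.upper():
            return next(self._replicas)
        return self.primary


# Hypothetical host names, used only to show the routing decision.
router = ReadWriteRouter("primary-db", ["replica-1", "replica-2"])
```

With this router, `router.target_for("UPDATE forms SET name = 'x'")` returns the primary, while successive plain `SELECT` statements alternate between the replicas.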
After the software update, about 25% of our users’ forms failed to load because of stale entries in our cache (DB objects that were incompatible with the new code). We quickly recognized this and flushed our Memcache cluster. Forms were restored within 10 minutes.
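One common way to avoid this class of failure is to put a version number in every cache key, so that objects written by the old code are simply never read by the new code. The sketch below is an assumption-laden illustration in Python, not what Formstack runs: `SCHEMA_VERSION`, `cache_key`, and `VersionedCache` are hypothetical names, and a plain dict stands in for Memcache.

```python
import json

# Hypothetical: bump this whenever the shape of cached objects changes.
SCHEMA_VERSION = 2


def cache_key(kind, object_id, version=SCHEMA_VERSION):
    """Build a key such as 'v2:form:42'."""
    return f"v{version}:{kind}:{object_id}"


class VersionedCache:
    """Wrap a dict-like store so each code version sees only its own entries."""

    def __init__(self, backend, version=SCHEMA_VERSION):
        self.backend = backend  # a dict here; a Memcache client in practice
        self.version = version

    def get(self, kind, object_id):
        raw = self.backend.get(cache_key(kind, object_id, self.version))
        return json.loads(raw) if raw is not None else None

    def set(self, kind, object_id, value):
        key = cache_key(kind, object_id, self.version)
        self.backend[key] = json.dumps(value)
```

After a deploy that bumps the version, old entries look like cache misses and are rebuilt from the database, so no manual flush is needed. The trade-off is a burst of database reads right after the release while the cache refills.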
In the afternoon we noticed that one of our master databases (the ones we write to) was performing poorly. Inserts and updates to forms began to lock critical tables, making the form builder sluggish. Shortly after, the master database server began to swap. We rebooted the database at 9:00 PM EST, causing about 1 minute of complete downtime.
After the reboot, database performance was restored. However, about an hour later, the degraded write performance returned. At 11:00 PM EST, we rolled back our software changes. Since the rollback, performance and stability have been back to normal.
We are working to reproduce these issues in our development environment and are analyzing MySQL’s slow query logs to determine if our software update introduced a bad SQL query.
Once we have determined the root cause of the poor performance, we will re-launch the new database functionality. We apologize for the inconvenience.

According to this, you were only down for 11 minutes? My experience was otherwise.
For me, the entire day was basically lost. 75% of the time my landing page forms were not displaying. In the future, it would be nice to have some transparent Tweets when you’re going down. It would help us decide what to do that day (e.g., suspend our ad campaigns).
Hi Steve,
I’m sorry, the 11 minutes was for forms. Landing pages were also affected, and for much longer than the forms. Restoring access to them depended on which URL was being used and whether a custom CNAME was involved, so it’s hard to give an exact time for when pages were restored; it varied greatly.
We agree on the need for transparency. We did not do a great job of that yesterday.
Thanks,
Michael
Please remember that you have international customers, so any update may land during someone’s busy hours. If you are doing any maintenance, please do it in a scheduled maintenance window and inform everyone beforehand.
The best that can be said about this is that you owned up to it and explained. I’ve had to apologise to my users too. A really big test database (copy of current?) sounds good before the next update.