Postmortem on the Simple In/Out Outage of March 6, 2023
March 7, 2023
On March 6, 2023, at approximately 9:30 am Central Standard Time, Simple In/Out experienced a 5-hour outage. Our website was largely unavailable, and our various apps experienced errors. Microsoft Teams presence integrations were also disconnected.
Yesterday’s outage was the second-longest in our history, surpassed only by the so-called “Netflix” Christmas Eve outage of 2012.
First, allow me to offer an apology. Unlike other outages, this one was entirely our fault. We know many customers use Simple In/Out for mission-critical tasks. Yesterday we let our customers down. We’re sorry for the disruptions during your Monday.
Below we’ll break down what happened, how we resolved the situation, what you can do, and what we intend to do to ensure this doesn’t happen again.
What happened?
As many may have guessed, this was related to the new Quick Picks update we shipped the night before. The update included a shift in how we keep a user’s status visibly current on both the web board and the Microsoft Teams Status tab.
A bug in this code made this operation more expensive on our system than anticipated. Another bug deployed simultaneously caused this expensive code to loop in an unanticipated place. Combined, these two bugs put an enormous load on our systems. We design our systems to absorb such things, so Simple In/Out worked great for the first 17 hours.
All this changed when we deployed an unrelated bug fix Monday morning. Any time we deploy changes to Simple In/Out, we do so one server at a time, so during deployments we’re always one server short-handed. That is a non-issue on any other day, but yesterday these bugs were already stressing our servers. This little one-line-of-code unrelated bug fix became the straw that broke the camel’s back.
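To make the arithmetic behind "one server short-handed" concrete, here is a minimal sketch. The server count, per-server capacity, load figures, and the 80% utilization limit are all hypothetical numbers chosen for illustration; they are not our actual values, and the function names are invented for this example.

```python
def utilization_during_deploy(total_load, servers, per_server_capacity,
                              out_of_rotation=1):
    """Fraction of its capacity each remaining server must carry
    while `out_of_rotation` servers are being deployed to."""
    remaining = servers - out_of_rotation
    if remaining <= 0:
        raise ValueError("at least one server must stay in rotation")
    return total_load / (remaining * per_server_capacity)


def safe_to_deploy(total_load, servers, per_server_capacity,
                   max_utilization=0.8):
    """True if a one-at-a-time deploy keeps every remaining server
    at or below the (hypothetical) utilization limit."""
    used = utilization_during_deploy(total_load, servers, per_server_capacity)
    return used <= max_utilization


# Hypothetical fleet: 4 servers, each able to handle 100 requests/second.
normal_load = 200    # requests/second on a typical day (made up)
inflated_load = 330  # the same traffic made more expensive by the bugs (made up)

print(safe_to_deploy(normal_load, 4, 100))    # typical day
print(safe_to_deploy(inflated_load, 4, 100))  # day of the outage
```

With these made-up numbers, the inflated load still fits on all four servers (330 of 400), which mirrors how the system kept working for 17 hours. Removing one server for the deploy leaves 300 of capacity for 330 of load, and the remaining servers cannot keep up.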
What followed was a total system collapse. The web servers were overwhelmed with internal requests and started rejecting requests they could not complete promptly. These error responses led to our load balancers disconnecting servers due to too many errors. Next, our DNS provider realized we were throwing excessive errors at our customers and moved to protect us, which led to what many customers saw: scary DNS error pages.
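The collapse described above can be sketched as a very simplified simulation. It assumes that an overloaded server starts returning errors and is then removed by the load balancer, which pushes its share of traffic onto the servers that remain. Real load balancers eject servers based on health checks and error rates over time, so this is an illustration of the feedback loop only, with hypothetical numbers and a function name invented for the example.

```python
def simulate_cascade(total_load, servers, per_server_capacity):
    """Return the number of servers still in rotation after each round
    of ejections, starting with the initial count.

    Simplifying assumption: whenever the per-server load exceeds
    capacity, one server errors out and is ejected, and the same total
    load is spread across the servers that are left.
    """
    healthy = servers
    history = [healthy]
    while healthy > 0:
        per_server_load = total_load / healthy
        if per_server_load <= per_server_capacity:
            break  # the remaining servers can cope, the cascade stops
        healthy -= 1  # an overloaded server is ejected
        history.append(healthy)
    return history


# Hypothetical: 330 requests/second, servers that handle 100 each.
print(simulate_cascade(330, 4, 100))  # all four servers in rotation
print(simulate_cascade(330, 3, 100))  # one server out for a deploy
```

In this toy model the full fleet is stable, but starting one server short tips the load over capacity, and every ejection makes the next one more likely until nothing is left in rotation.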
How did we fix it?
We solved these issues by working with our hosting providers to identify bottlenecks (CPU) and spool up more resources. Adding more servers sounds like an easy solution, but we couldn’t say definitively that more resources were the solution and wouldn’t worsen the situation. It took most of our outage time to identify the resources impacted. Once we identified the two software bugs that were the real culprits, everything made sense. Fixing those bugs restored the typical workload to our system.
What can I do?
No action is required unless you use the Microsoft Teams presence integration. Those users will need to reconnect their integrations here.
What are we doing to stop this from happening again?
We will establish more rigorous requirements for new features that increase our server traffic. We will explore more anomaly detection to catch traffic increases that are extraordinary or outside of our typical ebb and flow. We will also add new requirements when taking a server out of rotation during peak hours.
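As one illustration of what anomaly detection on traffic could look like, here is a minimal sketch that flags a sample when it sits far above the average of the samples just before it. This is a common, generic technique (a trailing-window z-score), not a description of what we run or will run; the window size, threshold, sample data, and function name are all assumptions made for the example.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return the indexes of samples that exceed the trailing-window
    mean by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            continue  # perfectly flat baseline: no z-score can be computed
        if (samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical requests-per-minute readings, with a jump at the end.
traffic = [100, 102, 98, 101, 99, 100, 250]
print(find_anomalies(traffic))
```

A trailing window like this catches the moment traffic jumps, but a sustained increase soon becomes part of the baseline. Catching load that stays elevated for hours would need a comparison against a longer history, such as the same hour on previous days.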
I hope the above deep dive into our operations provides the context behind this outage. We are firm believers in communication and transparency at Simple In/Out. Our worldwide customers trust us, and we work hard every day to maintain that trust. While we failed our customers yesterday, we’ll do our best to ensure it doesn’t happen again.