2018-11-13 Incident post-mortem report

Jonathan Oliver

November 16, 2018

Tags

Announcement

Programming

Customer experience

At approximately 1:30 PM Mountain Time (3:30 PM Eastern) on November 13, 2018, our external monitoring tools, which we configure to access the load balancing tier of our cloud-based APIs, reported a significant latency spike. These monitoring tools provide full, end-to-end testing and are meant to simulate a complete user experience with our application.

By design our systems can easily process in excess of 25x the usual amount of traffic we receive. We do this because of our customer usage patterns wherein we may observe a 10-fold increase in traffic within a short period of time—usually a few minutes or even a few seconds. While technically wasteful to run this much excess capacity, our goal is to ensure that our APIs remain available and responsive even with massive amounts of requests flowing through them.

As we observed the latency spike moving from around 100ms to over 1 second and longer for many requests, we checked all of our other metrics to ensure it wasn't an errant metric. Sometimes the external monitoring providers will report latency numbers from South Africa or Australia rather than from within North America. Most of our internal metrics showed that our systems were humming along at 10-15% CPU utilization. Our load balancing tier also showed healthy numbers in terms of CPU utilization, memory usage, and total number of bytes flowing through the network.

We checked our own tools to see if we could confirm the latency spike. Regardless of the numbers reported by our systems internally, we could see that calling the system through our load balancers was resulting in many requests returning far slower than is acceptable.

Part of our strategy around an increase in latency is to invoke cloud APIs which provision new hardware and bring that hardware into production service. We followed this process and immediately doubled our capacity—despite the fact that our internal metrics showed only nominal levels of activity within the system.

As these new load balancers came online, the external latency numbers dropped to normal and expected levels. However, even though these numbers appeared normal, we still continued to receive a smaller, yet steady stream of reports from customers stating that they were unable to connect with our servers.

While working with these customers, we discovered that a significant number of them were unable to resolve the DNS records for our various APIs, e.g. us-street.api.smarty.com. Their respective systems would return the equivalent of NXDOMAIN. This was puzzling because we clearly had that domain defined. We finally found that many of these customers had upstream providers running misconfigured or buggy DNS server implementations (of which there are a lot) that were unable to handle more than 20 or so entries within a single A record. In our case, we had close to 40 IPs listed within a given A record.
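As a rough illustration of why a long A record answer can trip up older or stricter resolvers, the sketch below estimates the size of a DNS response as the number of addresses grows. It assumes the classic 512-byte limit for DNS over UDP without EDNS0 and the usual 16 bytes per A record when the name is compressed. We did not confirm that this limit was the cause for any particular provider, so treat it as one possible contributing factor; the function names are hypothetical.

```python
# Rough size estimate for a DNS response carrying N A records.
# Assumption: plain DNS over UDP without EDNS0, where messages are
# limited to 512 bytes. This is illustrative, not a confirmed root cause.

DNS_HEADER_BYTES = 12
CLASSIC_UDP_LIMIT_BYTES = 512


def question_bytes(name):
    """Size of the question section for a fully qualified name."""
    labels = name.rstrip(".").split(".")
    # Each label costs one length byte plus its characters, followed by a
    # single zero byte for the root, then 2 bytes QTYPE and 2 bytes QCLASS.
    return sum(1 + len(label) for label in labels) + 1 + 4


def a_response_bytes(name, record_count):
    """Estimated response size with name compression in the answers."""
    # Per A record: 2 (name pointer) + 2 (type) + 2 (class) + 4 (TTL)
    # + 2 (RDLENGTH) + 4 (IPv4 address) = 16 bytes.
    return DNS_HEADER_BYTES + question_bytes(name) + 16 * record_count


def fits_classic_udp(name, record_count):
    return a_response_bytes(name, record_count) <= CLASSIC_UDP_LIMIT_BYTES


name = "us-street.api.smarty.com"
for count in (20, 40):
    print(count, a_response_bytes(name, count), fits_classic_udp(name, count))
```

Under these assumptions a 40-address answer no longer fits in a single classic UDP message, while a 20-address answer does. A resolver that mishandles truncation or the retry over TCP could then fail the lookup outright.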

To combat this issue, at about 4:30 PM Mountain Time we reduced the number of IPs listed in a single A record to a more reasonable level. As we did so, we saw an immediate return of the spike in latency numbers as reported by our external monitoring tools.

All of this meant we were stuck between a rock and a hard place. If we didn't have enough load balancers, customers would be affected by significantly higher than expected latency. If we had too many servers listed, some customers would be negatively impacted because their DNS servers weren't designed to handle large numbers of IPs within a single DNS record.

At about 4:35 PM Mountain Time, we decided to split our DNS resolution into several parts so that only a portion of the IPs for a given record were returned at any given time. After we made this change, the latency numbers dropped to expected levels, the availability reports stopped, and all internal and external metrics showed that things had stabilized.
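The idea of returning only a portion of the IPs per answer can be sketched as follows. This is a minimal illustration and not our actual DNS configuration: the pool uses addresses from a documentation range, the subset size of 10 is arbitrary, and the function names are hypothetical.

```python
# Sketch: split a large pool of load balancer IPs into smaller subsets
# and hand out one subset per DNS answer in round-robin order.
# All addresses and sizes here are made up for illustration.


def split_into_subsets(ips, max_per_answer):
    """Partition the pool into consecutive subsets of at most max_per_answer IPs."""
    if max_per_answer < 1:
        raise ValueError("max_per_answer must be at least 1")
    return [ips[i:i + max_per_answer] for i in range(0, len(ips), max_per_answer)]


def answer_for_query(subsets, query_number):
    """Choose which subset to return for the nth query (round-robin)."""
    return subsets[query_number % len(subsets)]


# Hypothetical pool of 40 addresses (203.0.113.0/24 is reserved for documentation).
pool = [f"203.0.113.{n}" for n in range(1, 41)]
subsets = split_into_subsets(pool, 10)

for query_number in range(5):
    answer = answer_for_query(subsets, query_number)
    print(query_number, len(answer), answer[0], answer[-1])
```

Every answer stays small enough for resolvers that struggle with long records, while the full pool still receives traffic across many queries.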

At around 9 AM the following morning, with no configuration changes having been applied during the intervening 12+ hours, we started to receive reports of reduced connectivity from some customers. After some diagnostics with these customers, we determined that each had special regulatory requirements for their respective corporate firewalls. Bringing so much new hardware online had exposed a mismatch between regulatory compliance, which depends on a stable and known set of IPs, and cloud scaling, which adds and removes IPs on demand. We therefore decided to bring the original set of IPs back into rotation. After doing so and closely watching all available metrics, all connectivity reports ceased and all metrics and tools showed normal connectivity.

Now that all systems were behaving normally, we turned our attention to understanding exactly what was happening within our system at the time of the original incident. After investigating CPU utilization, memory, disk, application logs, and every other resource we had available, we found that a particular piece of software designed to facilitate secure server-to-server communication between our systems was misbehaving. The software in question provides a "mesh network" so that our production systems can talk to each other over the internet in a secure fashion. As best we are able to determine, this errant behavior was not caused by misconfiguration but by a latent bug within the software itself.

This bug would result in a "flapping" connection: a TCP connection between our load balancing tier and our application servers would be established, traffic would begin to flow over it, and then the connection would terminate unexpectedly. We had anticipated that this connection could become unavailable and had designed our software to use an alternate secure channel in that case. However, the system was also designed to prefer the mesh network connection whenever it was available. Because the connection was flapping on and off, traffic kept returning to it each time it came back up, only to be cut off again. This caused many requests to fail while others would succeed.
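One common way to keep a flapping link from repeatedly attracting traffic is a hold-down period: the preferred channel is used again only after it has stayed up continuously for some time. The sketch below illustrates that general technique. It is not the logic our systems ran, and the class name, channel labels, and 30-second value are assumptions made for the example.

```python
# Sketch of failover with a hold-down timer. The primary ("mesh") channel
# is preferred only after it has been continuously up for hold_down_seconds;
# otherwise traffic stays on the fallback channel. Names and values are
# hypothetical.


class ChannelSelector:
    def __init__(self, hold_down_seconds):
        self.hold_down_seconds = hold_down_seconds
        # Time at which the primary channel last came up; None while it is down.
        self.primary_up_since = None

    def report_primary(self, is_up, now):
        """Record a health observation for the primary channel at time `now`."""
        if is_up:
            if self.primary_up_since is None:
                self.primary_up_since = now
        else:
            # Any drop resets the stability clock.
            self.primary_up_since = None

    def choose(self, now):
        """Return which channel a request sent at time `now` should use."""
        stable = (
            self.primary_up_since is not None
            and now - self.primary_up_since >= self.hold_down_seconds
        )
        return "mesh" if stable else "fallback"


selector = ChannelSelector(hold_down_seconds=30)
selector.report_primary(True, now=0)   # link comes up
selector.report_primary(False, now=5)  # link drops again (a flap)
selector.report_primary(True, now=6)   # link comes back
print(selector.choose(now=10))  # still inside the hold-down window
print(selector.choose(now=36))  # link has now been up for 30 seconds
```

With a plain "prefer the primary whenever it is up" rule, the request at time 10 would have gone over the link that had just flapped. With the hold-down it stays on the fallback channel until the link has proven stable.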

In the months prior to the incident, we had been looking to remove this particular software from our stack to eliminate an additional dependency. Instead of using a peer-to-peer mesh network VPN, we had decided to have our various software components communicate directly through standard TLS-based communication channels, and we had already begun that process.

While making a configuration change of this nature isn't trivial, it was already something we had been preparing for. Further, we had concerns about the potential for another incident during the fourth quarter of the year, which is often the busiest time for many of our customers and the worst possible time for additional incidents. We tested and deployed these configuration changes on Wednesday and watched all available metrics closely during all stages of the deployment.

As of this writing, all elements of the offending software have been removed from our infrastructure.

Key Changes

One critical item we learned from this is that we need to be much faster about communicating what's happening to our customers. This has been tricky because our desire is to fix the problem as quickly as possible, and the individuals who know the most about what's going on are the ones most engaged on the front lines of solving it. Even so, when customers are negatively affected, we need to report things quickly. Therefore, in the case of any future incident, we will post a status update immediately upon observing any errant behavior, and we will give incident updates in a streaming fashion with no more than 30 minutes between updates until the issue is fully resolved.

Another takeaway is aligning all of our metrics and tooling around the customer experience. Regardless of internal numbers, if requests are failing, customers are negatively impacted. We are looking at ways to better configure monitoring and alerting to ensure we can get a consistent view of the complete application behavior.
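To make the point concrete, an alert rule built around the customer experience would look only at what external probes observe, namely latency and failed requests, and would ignore internal numbers such as CPU utilization. The following is a minimal sketch of such a rule. The thresholds, function names, and sample data are assumptions for illustration, not our production alerting configuration.

```python
import math

# Sketch: decide whether to alert using only externally observed results.
# Internal metrics (CPU, memory) are deliberately not inputs here.
# Thresholds and sample values are hypothetical.


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_alert(latencies_ms, failures, p95_threshold_ms=500, max_failure_rate=0.01):
    """Alert when external probes see too many failures or a slow 95th percentile."""
    total = len(latencies_ms) + failures
    if total == 0:
        return False  # no probe data in this window
    if failures / total > max_failure_rate:
        return True
    return percentile(latencies_ms, 95) > p95_threshold_ms


# Most probes are fast, but one in ten takes over a second.
window = [100] * 90 + [1200] * 10
print(should_alert(window, failures=0))
```

In this example the average latency is only about 210ms, yet the rule still fires because the slowest requests are what affected customers actually experience.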

One thing we spent a significant amount of time on is engineering around the principles of availability—especially availability in the face of failure. There is a lot more work to be done in this area. We are fully committed to ensuring a positive and uneventful customer experience even in the face of software and hardware failure.

Notes

As a reminder for customers who have regulatory or other compliance-based requirements around having a stable and known set of IPs, we offer a specific solution as part of your subscription. It is called our Forward Proxy API (https://www.smarty.com/docs/cloud/forward-proxy-api). As part of our guarantee around this API, we will not add new IPs into production without giving at least two weeks (14 days) advance notice as found in this JSON contract: https://proxy.api.smarty.com/ip-ranges.json. Because of the vast range of traffic we receive, it's often necessary for us to provision new hardware quickly in anticipation of increased customer demand (https://www.smarty.com/docs/cloud/requirements#dns). The result of this is that new IPs come and go with our cloud APIs. Those customers using our Forward Proxy API will talk through that system on a known set of IPs and it will dynamically resolve our servers on your behalf and send your request in a fully encrypted fashion to our application servers.
