
2018-11-13 Incident post-mortem report

Updated October 29, 2025

At approximately 1:30 PM Mountain Time (3:30 PM Eastern) on November 13, 2018, we observed a significant latency spike from the external monitoring tools we have configured to access the load balancing tier of our cloud-based APIs. These monitoring tools provide full, end-to-end testing and are meant to simulate a complete user experience with our application.
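
For context, the sketch below shows the general shape of such an end-to-end probe. The endpoint, timeout, and threshold here are illustrative rather than our actual monitoring configuration:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Illustrative endpoint and threshold; the real probes run from several
	// external providers and exercise a complete user request.
	const endpoint = "https://us-street.api.smarty.com/street-address"
	const threshold = 1 * time.Second

	client := &http.Client{Timeout: 10 * time.Second}

	start := time.Now()
	resp, err := client.Get(endpoint)
	elapsed := time.Since(start)
	if err != nil {
		fmt.Printf("probe failed after %v: %v\n", elapsed, err)
		return
	}
	resp.Body.Close()

	fmt.Printf("status=%d latency=%v\n", resp.StatusCode, elapsed)
	if elapsed > threshold {
		fmt.Println("latency above threshold; raise an alert")
	}
}
```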

By design, our systems can easily process more than 25x the amount of traffic we usually receive. We maintain this headroom because of our customer usage patterns, wherein we may observe a 10-fold increase in traffic within a short period of time, usually a few minutes or even a few seconds. While it is technically wasteful to run this much excess capacity, our goal is to ensure that our APIs remain available and responsive even with massive amounts of requests flowing through them.

As we observed the latency spike moving from around 100ms to over 1 second and longer for many requests, we checked all of our other metrics to ensure it wasn't an errant metric. Sometimes the external monitoring providers will report latency numbers from South Africa or Australia rather than from within North America. Most of our internal metrics showed that our systems were humming along at 10-15% CPU utilization. Our load balancing tier also showed healthy numbers in terms of CPU utilization, memory usage, and total number of bytes flowing through the network.

We checked our own tools to see if we could confirm the latency spike. Regardless of the numbers reported by our systems internally, we could see that calling the system through our load balancers was resulting in many requests returning far slower than is acceptable.

Part of our strategy for responding to an increase in latency is to invoke cloud APIs that provision new hardware and bring it into production service. We followed this process and immediately doubled our capacity, despite the fact that our internal metrics showed only nominal levels of activity within the system.
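
For illustration only, here is roughly what doubling a load-balancer group's capacity looks like with an AWS-style autoscaling SDK. The group name is hypothetical and this is a sketch rather than our actual provisioning code:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	// Hypothetical group name for the load balancing tier.
	const group = "lb-tier"

	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// Look up the current desired capacity of the group.
	out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(group)},
	})
	if err != nil || len(out.AutoScalingGroups) == 0 {
		log.Fatalf("describe failed: %v", err)
	}
	current := aws.Int64Value(out.AutoScalingGroups[0].DesiredCapacity)

	// Double it, which is effectively what "immediately doubled our capacity" means.
	_, err = svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String(group),
		DesiredCapacity:      aws.Int64(current * 2),
		HonorCooldown:        aws.Bool(false),
	})
	if err != nil {
		log.Fatalf("scale-up failed: %v", err)
	}
	log.Printf("desired capacity raised from %d to %d", current, current*2)
}
```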

As these new load balancers came online, the external latency numbers dropped to normal and expected levels. However, even though these numbers appeared normal, we still continued to receive a smaller, yet steady stream of reports from customers stating that they were unable to connect with our servers.

While working with these customers, we discovered that a significant number of them were unable to resolve the DNS records for our various APIs, e.g. us-street.api.smarty.com. Their systems would return the equivalent of NXDOMAIN. This was puzzling because we clearly had that domain defined. We eventually found that many of these customers had upstream providers running misconfigured or buggy DNS server implementations (of which there are many) that could not handle more than 20 or so entries within a single A record. In our case, we had close to 40 IPs listed within a given A record.
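
The symptom is easy to observe from the client side with a simple resolver check like the sketch below (not a diagnostic we ship). A record set with dozens of addresses can also exceed the classic 512-byte UDP response limit when EDNS isn't negotiated, forcing a TCP retry that some resolver implementations handle poorly:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	const host = "us-street.api.smarty.com"

	ips, err := net.LookupIP(host)
	if err != nil {
		// A buggy upstream resolver may surface this as NXDOMAIN / "no such host"
		// even though the record is defined.
		fmt.Printf("lookup failed: %v\n", err)
		return
	}

	fmt.Printf("%s resolved to %d addresses\n", host, len(ips))
	for _, ip := range ips {
		fmt.Println("  ", ip)
	}
}
```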

To combat this issue, at about 4:30 PM Mountain Time we reduced the number of IPs listed in a single A record to a more reasonable level. As we did so, we saw an immediate return of the spike in latency numbers as reported by our external monitoring tools.

All of this meant we were stuck between a rock and a hard place. On the one hand, if we didn't have enough load balancers, customers would be affected by significantly higher than expected latency. On the other, if we listed too many servers, some customers would be negatively impacted because their DNS servers weren't designed to handle large numbers of IPs within a single DNS record.

At about 4:35 PM Mountain Time, we decided to split our DNS resolution into several parts so that only a portion of the IPs for a given record were returned at any given time. After making this change, the latency numbers dropped to expected levels, the availability reports stopped, and all internal and external metrics showed that things had stabilized.
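
Conceptually, the change behaves like the rotation sketched below, where each query sees only a small, rotating window of the full IP pool. This is a simplified illustration, not our actual DNS configuration:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// answerSubset returns a rotating window of at most size addresses from pool,
// so no single DNS response carries the full list.
func answerSubset(pool []string, counter *uint64, size int) []string {
	if size > len(pool) {
		size = len(pool)
	}
	start := int(atomic.AddUint64(counter, 1) % uint64(len(pool)))
	out := make([]string, 0, size)
	for i := 0; i < size; i++ {
		out = append(out, pool[(start+i)%len(pool)])
	}
	return out
}

func main() {
	// Hypothetical pool; the real record held close to 40 addresses.
	pool := []string{"192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4", "192.0.2.5", "192.0.2.6"}
	var counter uint64

	// Three consecutive "queries" each see a different 3-address window.
	for q := 0; q < 3; q++ {
		fmt.Println(answerSubset(pool, &counter, 3))
	}
}
```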

At around 9 AM the following morning, without any configuration changes having been applied during the intervening 12+ hours, we started to receive reports of reduced connectivity from some customers. After some diagnostics with these customers, we determined that each had special regulatory requirements for their corporate firewalls, which only permit traffic to a known, approved set of IPs. Bringing so much new hardware online had exposed the mismatch between regulatory compliance and cloud scaling. From this we decided to bring the original set of IPs back into rotation. After doing so and closely watching all available metrics, the connectivity reports ceased and all metrics and tools showed normal connectivity.

Now that all systems were behaving normally, we turned our attention to understanding exactly what had happened within our system at the time of the original incident. After investigating CPU utilization, memory, disk, application logs, and every other resource we had available, we found that a particular piece of software designed to facilitate secure server-to-server communication between our systems was misbehaving. The software in question provides a "mesh network" so that our production systems can talk to each other over the internet in a secure fashion. As best we can determine, this errant behavior was not caused by misconfiguration but by a latent bug within the software itself. The bug resulted in a "flapping" connection: a TCP connection between our load balancing tier and our application servers would be established, traffic would begin to flow over it, and then the connection would terminate unexpectedly. We had anticipated that this connection might be unavailable and had designed our software to fall back to an alternate secure channel. However, the system was also designed to prefer the mesh network connection whenever it was available, so as the connection flapped on and off, many requests failed while others succeeded.
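
The failure mode is easier to see in simplified form. In the sketch below (addresses and timeouts are illustrative, not our production values), the dialer prefers the mesh route whenever the dial succeeds, so a link that flaps between up and down keeps pulling traffic back onto itself and strands requests when it drops mid-flight:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialBackend prefers the mesh route when it is reachable and only falls back
// to the alternate secure channel when the mesh dial fails outright. With a
// flapping mesh link, the dial often succeeds and the connection dies moments
// later, so requests keep landing on a link that is about to drop.
func dialBackend(meshAddr, fallbackAddr string) (net.Conn, error) {
	if conn, err := net.DialTimeout("tcp", meshAddr, 500*time.Millisecond); err == nil {
		return conn, nil // mesh looked healthy at dial time
	}
	return net.DialTimeout("tcp", fallbackAddr, 500*time.Millisecond)
}

func main() {
	// Hypothetical addresses for illustration.
	conn, err := dialBackend("10.8.0.12:8443", "203.0.113.12:8443")
	if err != nil {
		fmt.Println("both routes failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected via", conn.RemoteAddr())
}
```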

In the months prior to the incident, we had been looking to remove this particular software from our stack in order to eliminate an additional dependency. Instead of using a peer-to-peer mesh network VPN, we had decided to have our various software components communicate directly through standard TLS-based communication channels, and we had already begun that transition.
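
In practical terms, that means each component dials its peers directly with mutually authenticated TLS instead of tunneling through a VPN. A minimal sketch, with placeholder hostnames and certificate paths rather than our real internal CA and endpoints:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

func main() {
	// Placeholder paths and address for illustration.
	caPEM, err := os.ReadFile("ca.pem")
	if err != nil {
		fmt.Println("read CA:", err)
		return
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	cert, err := tls.LoadX509KeyPair("client.pem", "client-key.pem")
	if err != nil {
		fmt.Println("load client cert:", err)
		return
	}

	// Server-to-server traffic travels over a plain TLS connection, with both
	// ends presenting certificates, instead of riding a mesh VPN.
	conn, err := tls.Dial("tcp", "app-server.internal.example:8443", &tls.Config{
		RootCAs:      pool,
		Certificates: []tls.Certificate{cert},
	})
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()
	fmt.Println("TLS established with", conn.ConnectionState().PeerCertificates[0].Subject)
}
```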

While making a configuration change of this nature isn't trivial, it was already something we had been preparing for. Further, we were concerned about the potential for another incident during the fourth quarter of the year, which is often the busiest time for many of our customers and the worst possible time for additional incidents. We tested and deployed these configuration changes on Wednesday and watched all available metrics closely during every stage of the deployment.

As of this writing, all elements of the offending software have been removed from our infrastructure.

Key Changes

One critical item we learned from this is that we need to be much faster about communicating with our customers about what's happening. This has been tricky because our desire is to fix the problem as quickly as possible, and the individuals who know the most about what's going on are the most engaged on the front lines of solving the problem. Even so, when customers are negatively affected, we need to report things quickly. Therefore, in the case of any future incident, we will post a status update immediately upon observing any errant behavior, and we will provide ongoing incident updates with no more than 30 minutes between updates until the issue is fully resolved.

Another takeaway is the need to align all of our metrics and tooling around the customer experience. Regardless of what our internal numbers say, if requests are failing, customers are negatively impacted. We are looking at ways to better configure our monitoring and alerting so that we get a consistent view of complete application behavior.

One thing we spent a significant amount of time on is engineering around the principles of availability—especially availability in the face of failure. There is a lot more work to be done in this area. We are fully committed to ensuring a positive and uneventful customer experience even in the face of software and hardware failure.

Notes

As a reminder for customers who have regulatory or other compliance-based requirements around a stable and known set of IPs, we offer a specific solution as part of your subscription: our Forward Proxy API (https://www.smarty.com/docs/cloud/forward-proxy-api). As part of our guarantee around this API, we will not add new IPs into production without giving at least two weeks (14 days) advance notice, as published in this JSON contract: https://proxy.api.smarty.com/ip-ranges.json. Because of the vast range of traffic we receive, it is often necessary for us to provision new hardware quickly in anticipation of increased customer demand (https://www.smarty.com/docs/cloud/requirements#dns), so new IPs come and go with our cloud APIs. Customers using the Forward Proxy API talk to that system on a known set of IPs; it dynamically resolves our servers on your behalf and sends your requests in a fully encrypted fashion to our application servers.
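
For teams that automate their firewall rules, polling that contract is straightforward. The sketch below assumes the document exposes a simple list of ranges; the field name is an assumption, so consult the live document for the actual schema:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// ipRanges models the published contract. The field name here is an assumption
// for illustration only; check the live document for the real structure.
type ipRanges struct {
	Ranges []string `json:"ranges"`
}

func main() {
	resp, err := http.Get("https://proxy.api.smarty.com/ip-ranges.json")
	if err != nil {
		fmt.Println("fetch:", err)
		return
	}
	defer resp.Body.Close()

	var doc ipRanges
	if err := json.NewDecoder(resp.Body).Decode(&doc); err != nil {
		fmt.Println("decode:", err)
		return
	}

	// Feed these into firewall rules; new entries are announced at least
	// 14 days before they serve production traffic.
	for _, r := range doc.Ranges {
		fmt.Println(r)
	}
}
```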
