Just over a month ago, we ran into an alarming issue: a growing number of our API requests were being abandoned, some vanishing without a trace in our logs. As our traffic continued to surge, this had a serious impact on our system’s performance. This blog is a case study of that problem (SNAT Port Exhaustion): the Root Cause Analysis (RCA), the fix we implemented, and the lessons we learned along the way. Join us on this journey and find out how we crossed our first hurdle on the path to scalability.
But first let us understand a little bit of background.
We are Kirana Club, an early-stage startup building an online community for Kirana store owners. Our server (the one facing the issue described above) is built on Node JS and deployed as a container on Microsoft Azure’s App Service. Just before we caught this issue, we released some new features that were received very well by our target audience, and traffic began rising at an exponential rate.
Now that we know the background, let us understand the problem.
As the traffic on our servers increased, so did the number of 5xx responses. At one point, more than 3% of all the requests we received resulted in a 5xx. There were absolutely no logs in the system that could point us towards the cause (partly because our logs were not exhaustive, and partly because of the nature of the problem).
The detailed error, "Request abandoned with subcode 0", was also not at all helpful.
After spending 2-3 days on the problem, we started discovering a pattern.
Only the APIs that made an external HTTP call were facing this issue. This was something to work with.
Our first step was to read the documentation of all the external services we were using. We went through all of it and incorporated some best practices such as timeouts and rate limiting, but unfortunately the problem persisted.
We then went through the docs of our cloud provider (Azure) and the service we were using (App Service) in detail. We were specifically curious about App Service networking and how its VNet infrastructure communicates with the outside network. And this is where we understood the root cause.
Root Cause: SNAT Port Exhaustion
So, the first thing we understood was that App Service does not make outbound network calls directly. It goes through a load balancer that performs Source Network Address Translation (SNAT).
This is because an App Service web application runs on one or more App Service worker instances inside the site’s scale unit, or stamp. These worker instances don’t have their own Internet IP addresses. Instead, they depend on the stamp’s load balancer to perform the translation, which is what enables them to connect to external IP addresses. You can read more about it here.
SNAT operates by changing the source IP address and port number of outgoing TCP packets. Here’s how it works in a nutshell:
- The App Service application sends a TCP packet to an Internet IP address. The packet’s source IP address and port number are initially internal.
- The TCP packet is then routed from a worker instance to the SNAT load balancer, where SNAT replaces the packet’s source IP and port with its own.
- This modified packet is subsequently sent out to the Internet, and SNAT keeps a record of the mapping.
- When the Internet server responds, it uses the IP address and port of the load balancer as the destination.
- Upon receiving the response, the load balancer changes the destination IP and port back to those of the worker instance, so the packet reaches its intended recipient.
This entire process occurs transparently to both the worker instance and the Internet server, with the load balancer handling all the necessary address translation.
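The steps above can be sketched as a toy translation table. This is only an illustration, not how Azure implements its load balancer: the `SnatTable` class, the IP addresses, and the starting port are all made up, and the sketch hands out a fresh port for every flow instead of sharing ports across destinations.

```javascript
// Toy model of a SNAT load balancer. All names and addresses are illustrative.
class SnatTable {
  constructor(publicIp, firstPort = 1024) {
    this.publicIp = publicIp;
    this.nextPort = firstPort;
    this.byPublicPort = new Map(); // public port -> internal { ip, port }
  }

  // Outbound: replace the internal source with the public IP and a SNAT port,
  // and remember the mapping.
  translateOutbound(packet) {
    const publicPort = this.nextPort++;
    this.byPublicPort.set(publicPort, { ip: packet.srcIp, port: packet.srcPort });
    return { ...packet, srcIp: this.publicIp, srcPort: publicPort };
  }

  // Inbound: look up the mapping and restore the worker's address as destination.
  translateInbound(packet) {
    const internal = this.byPublicPort.get(packet.dstPort);
    if (!internal) return null; // no mapping recorded, nothing to deliver
    return { ...packet, dstIp: internal.ip, dstPort: internal.port };
  }
}

const snat = new SnatTable('203.0.113.10');

// Worker instance sends a packet to an external server.
const outbound = snat.translateOutbound({
  srcIp: '10.0.0.5', srcPort: 50000, dstIp: '198.51.100.7', dstPort: 443,
});

// The server replies to the load balancer's address, not the worker's.
const reply = {
  srcIp: '198.51.100.7', srcPort: 443,
  dstIp: outbound.srcIp, dstPort: outbound.srcPort,
};
const delivered = snat.translateInbound(reply);

console.log(outbound.srcIp, outbound.srcPort);   // 203.0.113.10 1024
console.log(delivered.dstIp, delivered.dstPort); // 10.0.0.5 50000
```

Neither the worker nor the external server ever sees the other's real address; only the table in the middle knows both sides.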
Now, since the load balancer is a shared entity, it soon becomes a bottleneck. In fact, the actual bottleneck is the number of ports on the SNAT load balancer. Each App Service instance, by default, is allocated 128 SNAT ports. According to the Azure documentation, this is how the ports are used:
- One SNAT port is consumed per flow to a single destination IP address, port. For multiple TCP flows to the same destination IP address, port, and protocol, each TCP flow consumes a single SNAT port. This ensures that the flows are unique when they originate from the same public IP address and go to the same destination IP address, port, and protocol.
- Multiple flows, each to a different destination IP address, port, and protocol, share a single SNAT port. The destination IP address, port, and protocol make flows unique without the need for additional source ports to distinguish flows in the public IP address space.
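To make the two rules concrete, here is a rough model of them. It assumes the simplest reading of the rules: flows to the same destination each need their own port, and flows to different destinations can all share one, so the ports needed equal the largest number of concurrent flows to any single destination. The function name and the sample addresses are hypothetical.

```javascript
// Rough model of the two rules above (a simplification, not Azure's allocator).
function snatPortsNeeded(flows) {
  const perDestination = new Map();
  for (const { dstIp, dstPort, protocol } of flows) {
    const key = `${protocol}:${dstIp}:${dstPort}`;
    perDestination.set(key, (perDestination.get(key) || 0) + 1);
  }
  // The busiest destination decides how many distinct ports are required.
  return Math.max(0, ...perDestination.values());
}

// 200 concurrent flows to one third-party API endpoint.
const sameDestination = Array.from({ length: 200 }, () => ({
  dstIp: '198.51.100.7', dstPort: 443, protocol: 'tcp',
}));

// 200 concurrent flows, each to a different destination IP.
const differentDestinations = Array.from({ length: 200 }, (_, i) => ({
  dstIp: `198.51.100.${i + 1}`, dstPort: 443, protocol: 'tcp',
}));

console.log(snatPortsNeeded(sameDestination));       // 200
console.log(snatPortsNeeded(differentDestinations)); // 1
```

Under this model, the same 200 flows either blow well past a 128-port budget or barely touch it, depending only on where they are going.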
The first point was our problem: we were hitting the same destination IP addresses again and again, using up too many SNAT ports. Once we crossed 128, requests started getting abandoned until a SNAT port was released (which happens after 4 minutes of being idle).
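A quick back-of-the-envelope calculation shows how little traffic it takes to hit this limit. It is a worst-case sketch that assumes every request opens a fresh connection to the same destination and that each port stays held for the full 4 minutes; in practice ports may be freed sooner, so treat the number as an order of magnitude only.

```javascript
// Worst-case estimate, assuming one new connection per request and a port
// that stays reserved for the whole idle window.
const SNAT_PORTS = 128;
const IDLE_RELEASE_SECONDS = 4 * 60;

function maxNewConnectionsPerSecond(ports, holdSeconds) {
  return ports / holdSeconds;
}

const rate = maxNewConnectionsPerSecond(SNAT_PORTS, IDLE_RELEASE_SECONDS);
console.log(rate.toFixed(2)); // 0.53
```

Roughly one new connection every two seconds to a single destination is enough to keep all 128 ports busy under these assumptions.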
So, now that we knew what the actual problem was, we needed to find a solution. We were opening connections to the same destination IP:port again and again, using a SNAT port on every attempt. But did we need to make a new connection every time?
The answer is no. Once a connection is established, we can simply reuse it. This not only saves a port but also reduces latency, since no new connection (and therefore no new TCP/TLS handshake) is needed.
How to implement this? The answer is simple: we needed connection pooling, so that connections get reused.
We were already pooling our connections to MySQL and Redis. Next, we identified the most frequently used third-party services and started reusing connections to them.
Node JS provides a built-in HTTP agent with a keep-alive option, which is very useful for keeping connections open in high-traffic environments.
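As a minimal sketch of what this can look like with the core `https` module: the idea is to create one agent per process and pass it to every outbound request. The socket limits, the timeout, and the `getJson` helper below are illustrative assumptions, not the exact values or code we deployed.

```javascript
const https = require('https');

// One shared agent for the whole process, so sockets are pooled and reused
// instead of being opened per request.
const keepAliveAgent = new https.Agent({
  keepAlive: true,    // keep idle sockets open for later requests
  maxSockets: 50,     // illustrative cap on concurrent sockets per host
  maxFreeSockets: 10, // illustrative cap on idle sockets kept per host
});

// Hypothetical helper: GET a URL through the shared agent and parse JSON.
function getJson(url) {
  return new Promise((resolve, reject) => {
    const req = https.get(url, { agent: keepAliveAgent, timeout: 5000 }, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        try {
          resolve(JSON.parse(body));
        } catch (err) {
          reject(err);
        }
      });
    });
    // The timeout option only emits an event; destroy the request ourselves.
    req.on('timeout', () => req.destroy(new Error('request timed out')));
    req.on('error', reject);
  });
}
```

Because every socket maps to at most one SNAT port, capping `maxSockets` also puts an upper bound on how many ports a single destination can consume from one process. Many HTTP client libraries accept a custom agent as well, so the same agent can usually be shared with them.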
Projections are very important for scalable systems. You should always know your numbers: how many resources a request will consume, how many requests your system can handle, how long your system should take to process a request, how much data your API will crunch, and so on. Then plan your system accordingly.
Making an external API call is probably the simplest thing one does while building a backend system, but at scale even the simplest things become complex. If the basics are clear, though, one can navigate these issues quite easily. We needed one week to diagnose the problem, but the fix was done and deployed in one day. And our systems are now more resilient and much more robust than they were.
And this is how we solved SNAT Port Exhaustion, our first hurdle on the path to scalability.
See how we built a realtime chat app on Node JS without using Sockets here