Using Lambda in AWS? There’s Something You Should Know About VPC NAT Timeouts
A valuable lesson
If you’re running Kubernetes (K8s) infrastructure in an AWS VPC behind a NAT gateway and need to make long-lived API calls to AWS services, you need to configure OS socket options in some way (e.g., an init container setting net.ipv4.tcp_keepalive_time, EC2 instance configuration on your worker plane, etc.) to ensure TCP keep-alives are sent. Otherwise, the NAT gateway’s fixed 5-minute-50-second idle timeout will kill your outbound client connection.
We owe a big shout-out to the AWS teams involved (Support, Lambda Engineering, SDK Engineering) for their responsiveness in helping us find a solution. Our partnership with AWS has been great, and this work is proof. Thanks again, AWS!
Use Case and Symptom
Open Raven leverages AWS Lambda to run serverless functions that execute scanning technology. With data governance rules at play, the functions must run in customer accounts, which can have an adverse impact on customers’ Lambda quota consumption. To avoid this problem, we use AWS Lambda’s synchronous invocation pattern, so the AWS service client waits for a result from the function execution. This approach makes it easy to control, on the client side, how many Lambdas are being invoked at any point in time. That control is what allows Open Raven to provide confidence that serverless architectures will not be impacted by its scanning.
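For reference, a synchronous invocation with the AWS SDK for Java v2 looks roughly like the sketch below; the function name and payload are placeholders, not our production values. The call blocks until the function returns, so the number of in-flight invocations is simply the number of threads making calls.

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.InvocationType;
import software.amazon.awssdk.services.lambda.model.InvokeRequest;
import software.amazon.awssdk.services.lambda.model.InvokeResponse;

// Hypothetical example: invoke a scanner function synchronously and wait for its result
LambdaClient lambda = LambdaClient.create();
InvokeRequest request = InvokeRequest.builder()
        .functionName("scanner-function")                      // placeholder function name
        .invocationType(InvocationType.REQUEST_RESPONSE)       // synchronous invocation
        .payload(SdkBytes.fromUtf8String("{\"job\":\"123\"}")) // placeholder payload
        .build();
InvokeResponse response = lambda.invoke(request);               // blocks until the function returns
String result = response.payload().asUtf8String();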
However, a number of Lambda invocations produced ApiCallTimeoutExceptions. Reviewing our code, we found that the socket, API call, and API call attempt timeouts were all set to the maximum value of 15 minutes (900 seconds). In other words, single API calls were timing out at 900 seconds, on the nose.
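In SDK-for-Java-v2 terms, that configuration looked roughly like the following sketch (not our exact code):

import java.time.Duration;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.http.apache.ApacheHttpClient;

// All three timeouts pinned to the 900-second Lambda maximum
ApacheHttpClient.Builder httpClient = ApacheHttpClient.builder()
        .socketTimeout(Duration.ofSeconds(900));         // socket read timeout
ClientOverrideConfiguration timeouts = ClientOverrideConfiguration.builder()
        .apiCallTimeout(Duration.ofSeconds(900))         // total time allowed for the API call
        .apiCallAttemptTimeout(Duration.ofSeconds(900))  // time allowed for a single attempt
        .build();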
The Investigation
In response to this telemetry, we implemented an internal timeout of 7 minutes to see if we could trade runtime duration for increased completions. However, this change had little impact. In fact, our debug logs showed the Lambda completing successfully, followed 900 seconds later by an ApiCallTimeoutException.
Since the issue appeared to be on the Lambda side, we reviewed the docs for any other API contract violations we might be making. One contract that caught our attention is that the maximum payload size for a synchronous response is 6 MB. Thinking the payloads being emitted were too large (and seeing some indicators of that in our client-side telemetry), we implemented a size check, compression, and additional telemetry. Unfortunately, we noticed that even modest payloads (just over 100 KB) would fail with an ApiCallTimeoutException. It was at this point that we engaged AWS Support.
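The guard we added was conceptually simple; a minimal sketch (hypothetical threshold and GZIP fallback, not our production code) looks like this:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class ResponseGuard {
    // Hypothetical threshold matching Lambda's 6 MB synchronous response limit
    private static final int MAX_RESPONSE_BYTES = 6 * 1024 * 1024;

    static byte[] prepareResponse(String json) throws IOException {
        byte[] raw = json.getBytes(StandardCharsets.UTF_8);
        if (raw.length < MAX_RESPONSE_BYTES) {
            return raw;                      // small enough to return as-is
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);                 // compress oversized payloads
        }
        return buffer.toByteArray();         // caller records size telemetry and decompresses client-side
    }
}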
AWS Support walked us through a very similar set of diagnostics. Their first investigation focused largely on the function code, which was an interesting direction to take: from our observations, the function code seemed immaterial to the problem, since memory utilization was within limits and the logs suggested the Lambda was running to completion. Their next approach was to dig into potential retries. There, we learned that the client retries retryable operations by default, and that the duration of those retries counts against the total API call timeout. Armed with this information, we configured our client not to retry, both to stay aligned with our customer posture and to prevent unintentional quota consumption. To our dismay, after 900 seconds: the same ApiCallTimeoutException.
AWS Support then wanted to dig into our environment to see if they could recreate the issue with the Lambda Engineering team’s involvement. They built a simple serverless app whose function slept for 7 minutes and did NOT see the ApiCallTimeoutException. However, when we dug into the experiment, we found that the client code was run from an EC2 instance that reached Lambda through an internet gateway, whereas our infrastructure runs in a VPC behind a NAT gateway. When AWS Support reran the experiment from behind a NAT gateway, they saw the same timeout. Success!
Upon learning about this 5-minute-50-second idle timeout, we immediately rolled out a hotfix to our fleet, reducing our internal timeout from 7 minutes to 5 minutes. This shifted load in a way we had to account for, as our scanning technology now needed more retries due to the shortened run time. We also set the maximum payload size boundary to 5 MB (1 MB of headroom) and removed compression, opting to spend the CPU time on scanning instead.
While we shifted load in response to this reduced scanning window, AWS Support continued their investigation. With an easily repeatable experiment in hand, they quickly dialed in on the VPC NAT gateway’s timeout for idle connections: every AWS VPC NAT gateway has a hardcoded maximum idle connection timeout of 5 minutes 50 seconds. From their EC2 instance in a VPC behind a NAT gateway, they configured the SDK to send TCP keep-alives with the following code:
import java.time.Duration;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.services.lambda.LambdaClient;

// Enable TCP keep-alive on the underlying Apache HTTP client
ApacheHttpClient.Builder httpClientBuilder = ApacheHttpClient.builder()
        .tcpKeepAlive(true)
        .connectionTimeout(Duration.ofSeconds(500))
        .socketTimeout(Duration.ofSeconds(500));

// Disable retries and align the API call timeouts with the socket timeout
ClientOverrideConfiguration overrideConfig = ClientOverrideConfiguration.builder()
        .apiCallTimeout(Duration.ofSeconds(500))
        .apiCallAttemptTimeout(Duration.ofSeconds(500))
        .retryPolicy(RetryPolicy.none())
        .build();

LambdaClient awsLambda = LambdaClient.builder()
        .region(region)                        // your target region
        .httpClientBuilder(httpClientBuilder)
        .overrideConfiguration(overrideConfig)
        .build();
Unfortunately, even after setting the socket configuration within the SDK, we were still getting ApiCallTimeoutExceptions. In another escalation, AWS Support engaged the SDK engineering team to assist. It turns out that these keep-alive values are not honored by the SDK; they must instead be configured at the OS level:
# Start sending keep-alive probes after 30 seconds of idle time
net.ipv4.tcp_keepalive_time=30
# Drop the connection after 5 unacknowledged probes
net.ipv4.tcp_keepalive_probes=5
# Wait 10 seconds between probes
net.ipv4.tcp_keepalive_intvl=10
And huzzah! With keep-alive probes now flowing after just 30 seconds of idle time, well inside the NAT gateway’s 5-minute-50-second limit, we received the response from Lambda after 7 minutes! Problem identified and solved.
We then operationalized this solution in our deployments using an initContainer for each pod that makes long-running API calls to AWS services, ensuring that our outbound client connections don’t get terminated at the NAT gateway.
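For illustration, a pod spec along these lines applies the keep-alive sysctls to the pod’s network namespace before the application container starts. The names and images are placeholders, and the privileged init container assumes your cluster’s pod security settings allow it:

# Sketch of an init container that applies the TCP keep-alive sysctls for the pod
apiVersion: v1
kind: Pod
metadata:
  name: scanner              # placeholder
spec:
  initContainers:
    - name: set-tcp-keepalive
      image: busybox:1.36    # any image with the sysctl utility works
      securityContext:
        privileged: true     # needed to write net.* sysctls in the pod's network namespace
      command:
        - sh
        - -c
        - |
          sysctl -w net.ipv4.tcp_keepalive_time=30
          sysctl -w net.ipv4.tcp_keepalive_probes=5
          sysctl -w net.ipv4.tcp_keepalive_intvl=10
  containers:
    - name: scanner          # placeholder: the container making long-lived AWS API calls
      image: example/scanner:latest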