Building Resilient Webhooks for Serverless Infrastructures
Building resilient webhooks for serverless infrastructures is not optional; it is a hard requirement. If your system accepts inbound webhooks from third-party providers (Stripe, GitHub, Twilio) and you rely on synchronous, immediate processing, your architecture is already broken. In a serverless environment, concurrency limits, cold starts, and downstream API failures make synchronous webhook processing a recipe for dropped events and inconsistent state.
In this post, I will show you exactly how to design, build, and deploy an asynchronous webhook ingestion pipeline that guarantees at-least-once delivery, handles massive traffic spikes without dropping payloads, and recovers gracefully from downstream outages.
The Problem: Synchronous Ingestion Fails
Most engineers build webhook endpoints like this: a POST request hits an API Gateway, triggers an AWS Lambda function, which then validates the payload, queries a database, calls an external API, and finally returns a 200 OK.
This works in development. In production, it falls apart fast.
Why?
- Timeouts: API Gateway enforces a 29-second integration timeout by default. If your downstream database is slow or the external API you are calling hangs, the request times out before the Lambda function finishes. The third-party provider sees a failure and either retries aggressively (causing a thundering herd) or drops the payload entirely.
- Concurrency Limits: Serverless functions scale fast, but they are not immune to concurrency limits. If you get a sudden burst of 10,000 webhooks, you might exhaust your account's concurrent execution limit, and the throttled requests come back to the sender as errors (429 Too Many Requests or a 5xx, depending on where the throttle is applied). Some providers retry these, but others do not.
- Partial Failures: If your Lambda function crashes midway through processing (say, after updating the database but before sending the confirmation email), you are left with inconsistent state.
Why It's Hard
Handling webhooks is inherently difficult because you do not control the rate of inbound requests. The sender decides when and how fast to send data. If your system cannot absorb the shock, it breaks.
Serverless compounds this issue. While serverless platforms auto-scale, the resources they connect to (RDS databases, third-party APIs) usually do not. You end up with a highly scalable compute layer hammering a brittle storage layer.
To solve this, you need decoupling. The ingestion of the webhook must be completely separated from the processing of the webhook.
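To make the idea concrete, here is a minimal in-memory sketch of decoupling: ingestion only appends to a buffer, and a fixed number of workers drain it. Everything in it (`IngestionBuffer`, `simulateBurst`, the 1 ms fake downstream call) is a hypothetical stand-in for illustration, not part of the real pipeline built later in this post.

```typescript
// In-memory model of decoupled ingestion: accept fast, process at a capped rate.
// All names are hypothetical; the real pipeline uses SQS and Lambda instead.

type WebhookEvent = { id: string };

class IngestionBuffer {
  private readonly queue: WebhookEvent[] = [];

  // Ingestion only appends; it never waits on downstream work.
  accept(event: WebhookEvent): void {
    this.queue.push(event);
  }

  // Drain the buffer with at most `maxConcurrency` handlers in flight.
  async drain(
    handler: (event: WebhookEvent) => Promise<void>,
    maxConcurrency: number,
  ): Promise<number> {
    let processed = 0;
    const worker = async (): Promise<void> => {
      // Each worker pulls one event at a time until the buffer is empty.
      for (let next = this.queue.shift(); next !== undefined; next = this.queue.shift()) {
        await handler(next);
        processed += 1;
      }
    };
    await Promise.all(Array.from({ length: maxConcurrency }, () => worker()));
    return processed;
  }
}

async function simulateBurst(
  burstSize: number,
  maxConcurrency: number,
): Promise<{ processed: number; peakInFlight: number }> {
  const buffer = new IngestionBuffer();
  for (let i = 0; i < burstSize; i++) buffer.accept({ id: `evt_${i}` });

  let inFlight = 0;
  let peakInFlight = 0;
  const slowDownstream = async (_event: WebhookEvent): Promise<void> => {
    inFlight += 1;
    peakInFlight = Math.max(peakInFlight, inFlight);
    await new Promise<void>((resolve) => setTimeout(resolve, 1)); // stand-in for a DB write
    inFlight -= 1;
  };

  const processed = await buffer.drain(slowDownstream, maxConcurrency);
  return { processed, peakInFlight };
}
```

Calling `simulateBurst(10000, 5)` accepts the whole burst immediately, yet the fake downstream never sees more than five concurrent calls. That is the property the rest of this post builds with managed services.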
Architecture: The Ingestion Pipeline
The only correct way to handle webhooks in serverless is asynchronous ingestion. The architecture looks like this:
- Ingestion Layer: API Gateway receives the POST request.
- Queueing Layer: API Gateway pushes the payload directly to an Amazon SQS queue. No Lambda function is involved yet.
- Processing Layer: An event-source mapping triggers a Lambda function to process messages from the SQS queue at a controlled concurrency.
- Dead Letter Queue (DLQ): If a message fails processing repeatedly, it is moved to a DLQ for manual inspection.
This pattern is non-negotiable for high-throughput systems. By integrating API Gateway directly with SQS, you eliminate the ingestion Lambda, saving costs and removing a potential point of failure. API Gateway will reliably return a 200 OK within milliseconds, ensuring the webhook provider considers the delivery successful.
Implementation: AWS CDK (v2.100.0)
Let's build this using AWS CDK (TypeScript). We will define the SQS queue, the DLQ, the processing Lambda, and the API Gateway direct integration.
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as lambdaEventSources from 'aws-cdk-lib/aws-lambda-event-sources';
import * as iam from 'aws-cdk-lib/aws-iam';
export class WebhookIngestionStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // 1. Create the Dead Letter Queue
    const webhookDlq = new sqs.Queue(this, 'WebhookDlq', {
      retentionPeriod: cdk.Duration.days(14),
    });

    // 2. Create the Main Ingestion Queue
    const webhookQueue = new sqs.Queue(this, 'WebhookQueue', {
      // AWS recommends a visibility timeout of at least 6x the function timeout (6 x 10s)
      visibilityTimeout: cdk.Duration.seconds(60),
      deadLetterQueue: {
        queue: webhookDlq,
        maxReceiveCount: 3,
      },
    });

    // 3. Create the Processing Lambda
    const processorLambda = new lambda.Function(this, 'ProcessorLambda', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/processor'),
      timeout: cdk.Duration.seconds(10),
    });

    // 4. Attach SQS to Lambda with controlled concurrency
    processorLambda.addEventSource(
      new lambdaEventSources.SqsEventSource(webhookQueue, {
        batchSize: 10,
        maxConcurrency: 5, // Prevent overwhelming downstream services
        // Required for the handler's { batchItemFailures } response to be honored
        reportBatchItemFailures: true,
      })
    );

    // 5. API Gateway to SQS Direct Integration
    const api = new apigateway.RestApi(this, 'WebhookApi', {
      restApiName: 'Webhook Ingestion Service',
    });

    const integrationRole = new iam.Role(this, 'ApiGatewaySqsRole', {
      assumedBy: new iam.ServicePrincipal('apigateway.amazonaws.com'),
    });
    webhookQueue.grantSendMessages(integrationRole);

    const sqsIntegration = new apigateway.AwsIntegration({
      service: 'sqs',
      path: `${cdk.Aws.ACCOUNT_ID}/${webhookQueue.queueName}`,
      integrationHttpMethod: 'POST',
      options: {
        credentialsRole: integrationRole,
        passthroughBehavior: apigateway.PassthroughBehavior.NEVER,
        requestParameters: {
          'integration.request.header.Content-Type': `'application/x-www-form-urlencoded'`,
        },
        requestTemplates: {
          'application/json': 'Action=SendMessage&MessageBody=$util.urlEncode($input.body)',
        },
        integrationResponses: [
          {
            // Default mapping: SQS accepted the message
            statusCode: '200',
            responseTemplates: {
              'application/json': '{"status":"received"}',
            },
          },
          {
            // Surface SQS errors instead of masking them as 200, so the provider retries
            statusCode: '500',
            selectionPattern: '[45]\\d{2}',
            responseTemplates: {
              'application/json': '{"status":"error"}',
            },
          },
        ],
      },
    });

    api.root.addResource('webhooks').addMethod('POST', sqsIntegration, {
      methodResponses: [{ statusCode: '200' }, { statusCode: '500' }],
    });
  }
}
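The request template in step 5 is easy to misread, so here is a rough local equivalent of what it does to an incoming body. This is only an illustration: `buildSendMessageBody` is a hypothetical helper, and `encodeURIComponent` is assumed to be close enough to API Gateway's `$util.urlEncode` for typical JSON payloads (the two may differ for some characters).

```typescript
// Rough local equivalent of the mapping template:
//   Action=SendMessage&MessageBody=$util.urlEncode($input.body)
// Assumption: encodeURIComponent stands in for $util.urlEncode.

function buildSendMessageBody(rawWebhookBody: string): string {
  // The SQS query API expects form-encoded parameters, so the JSON body
  // must be escaped before it is placed in the MessageBody parameter.
  return `Action=SendMessage&MessageBody=${encodeURIComponent(rawWebhookBody)}`;
}

// A payload containing '&' and '=' would corrupt the form body if left unescaped.
const example = buildSendMessageBody('{"q":"a&b=c"}');
console.log(example);
```

The escaping matters because a raw `&` or `=` inside the webhook JSON would otherwise be parsed as a parameter separator, and SQS would receive a truncated message body.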
The Processing Logic (Node.js 18.x)
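Your Lambda function now receives messages from SQS in batches. Because we set batchSize: 10, you must handle partial batch failures. If one message out of 10 fails and you throw an error, all 10 messages return to the queue and are retried. This is inefficient and dangerous.
Instead, return the specific failed message IDs so only those are retried. This only works when ReportBatchItemFailures is enabled on the event source mapping (reportBatchItemFailures: true in CDK); without it, Lambda ignores the response and treats the whole batch as successful.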
// lambda/processor/index.js
exports.handler = async (event) => {
  const batchItemFailures = [];

  for (const record of event.Records) {
    try {
      const payload = JSON.parse(record.body);

      // Implement idempotency check here!
      // await checkIdempotency(payload.id);

      console.log('Processing webhook:', payload);

      // Simulate database write or external API call
      // await processWebhook(payload);
    } catch (error) {
      console.error(`Failed to process record ${record.messageId}`, error);
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }

  return { batchItemFailures };
};
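The idempotency check left as a comment in the handler could look roughly like the sketch below. The `IdempotencyStore` interface, `InMemoryStore`, and `processOnce` are hypothetical names introduced here for illustration. A production version would back `claim` with a conditional write to a durable store such as DynamoDB rather than a `Set` in memory.

```typescript
// Idempotency guard sketch. All names are hypothetical; the in-memory store
// is only a stand-in for a durable store with conditional writes.

interface IdempotencyStore {
  // Resolves true if this call claimed the ID, false if it was already claimed.
  claim(eventId: string): Promise<boolean>;
  // Releases a claim so that a later retry can process the event again.
  release(eventId: string): Promise<void>;
}

class InMemoryStore implements IdempotencyStore {
  private readonly seen = new Set<string>();

  async claim(eventId: string): Promise<boolean> {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }

  async release(eventId: string): Promise<void> {
    this.seen.delete(eventId);
  }
}

async function processOnce<T extends { id: string }>(
  store: IdempotencyStore,
  payload: T,
  work: (payload: T) => Promise<void>,
): Promise<"processed" | "duplicate"> {
  // Claim first, so a concurrent duplicate sees the ID as taken.
  if (!(await store.claim(payload.id))) return "duplicate";
  try {
    await work(payload);
    return "processed";
  } catch (error) {
    // Give the ID back so the SQS retry is not mistaken for a duplicate.
    await store.release(payload.id);
    throw error;
  }
}
```

One caveat with this claim-then-release shape: if the function dies between the claim and the end of the work, the ID stays claimed and the retry is skipped. Adding an expiry to each claim is one common way to bound that window.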
Pitfalls
This architecture is vastly superior to synchronous ingestion, but it introduces new challenges.
- Idempotency: SQS standard queues provide at-least-once delivery, and many providers retry on their side too, so sooner or later your Lambda will receive the same webhook more than once. If you process a Stripe payment charge twice, you are going to have very angry customers. You must store the webhook ID in a fast data store (like DynamoDB) and check if it has already been processed before executing your logic.
- Order of Events: SQS standard queues do not guarantee ordering. If a user updates their profile twice in one second, the second update might be processed before the first. If ordering is critical, use an SQS FIFO queue. However, FIFO queues have lower throughput limits and require careful configuration of MessageGroupId. In most cases, standard queues combined with a "last updated" timestamp check in your database are sufficient.
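- Payload Size Constraints: SQS caps message size (256 KB as of this writing; check the current quota). Most webhooks are smaller than this, but if you expect massive JSON payloads, the direct API Gateway integration is not enough: you will need a component in front of the queue (a small Lambda, for example) that stores the payload in S3 and passes the S3 object key to SQS.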
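The "last updated" timestamp check from the ordering pitfall can be sketched as follows. `ProfileUpdate` and `ProfileTable` are hypothetical, and the `Map` stands in for a real table. In a relational database the same guard would typically be a conditional update that only matches rows with an older timestamp.

```typescript
// Out-of-order guard sketch: ignore any update older than what is stored.
// The types and the in-memory table are hypothetical stand-ins.

type ProfileUpdate = { userId: string; updatedAt: number; name: string };

class ProfileTable {
  private readonly rows = new Map<string, ProfileUpdate>();

  // Applies the update only if it is newer than the stored row.
  // Returns true when the write happened, false when it was stale.
  applyIfNewer(update: ProfileUpdate): boolean {
    const current = this.rows.get(update.userId);
    if (current !== undefined && current.updatedAt >= update.updatedAt) {
      return false;
    }
    this.rows.set(update.userId, update);
    return true;
  }

  get(userId: string): ProfileUpdate | undefined {
    return this.rows.get(userId);
  }
}

// Two updates for the same user arrive in the wrong order.
const table = new ProfileTable();
const appliedSecond = table.applyIfNewer({ userId: "u1", updatedAt: 2000, name: "Second edit" });
const appliedFirst = table.applyIfNewer({ userId: "u1", updatedAt: 1000, name: "First edit" });
```

Here the stale first edit is rejected and the stored row keeps the second edit. This assumes the provider includes a trustworthy event timestamp in the payload; the time your Lambda received the message will not do, since it reflects delivery order rather than event order.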
Outcome
By implementing this ingestion pipeline, you transform your system from a fragile, synchronous endpoint into a durable, highly concurrent processing machine.
API Gateway handles the initial shock, responding to providers in milliseconds. SQS acts as a shock absorber, queueing the payloads. Your Lambda functions process the queue at a sustainable rate, protecting your relational databases from being overwhelmed. And when things inevitably fail, the Dead Letter Queue catches the fallout, holding failed messages for inspection and replay until its retention period (14 days in the stack above) runs out.
Building resilient webhooks for serverless infrastructures takes more upfront work, but the operational peace of mind is worth every line of code. Stop building synchronous webhooks. Start decoupling your ingestion from your processing.
