Saga Pattern with aws-cdk, Lambda, and Step Functions


The saga pattern is useful when you have transactions that require a series of steps to complete successfully, where a failure in any step requires associated rollback processes to run. This post will cover the saga pattern with aws-cdk, leveraging AWS Step Functions and Lambda.

If you need an introduction to the saga pattern in an easy-to-understand format, I found this GOTO conference session by Caitie McCaffrey very informative.

Another useful resource with regard to the saga pattern and AWS Step Functions is this post over at theburningmonk.com.

Saga Pattern with aws-cdk

I’ll be taking things one step further by automating the setup and deployment of a sample app which uses the saga pattern with aws-cdk.

I’ve started using aws-cdk fairly frequently, though I realise it comes with vendor lock-in. For Step Functions in particular I found it nice to work with, especially in the way you construct step chains.

Saga Pattern with Step Functions

Here is the Step Functions state machine you’ll create using the fairly simple saga pattern aws-cdk app I’ve set up.

A successful transaction run

Above you see a successful transaction run, where all records are saved to a DynamoDB table entry.

The sample data written by a successful transaction run. Each step has a ‘Sample’ map entry with ‘Data’ and a timestamp.

If one of those steps were to fail, you need to manage the rollback process of your transaction from that step backwards.

Illustrating Failure Rollback

As mentioned above, with the saga pattern you’ll want to roll back any steps that have run, from the point of failure backward.

The example app has three steps:

  • Process Records
  • Transform Records
  • Commit Records

Each step is a simple lambda function that writes some data to a DynamoDB table with a primary partition key of TransactionId.
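
Here’s a minimal sketch of what one of those step handlers might look like, assuming an aws-sdk DocumentClient and a TABLE_NAME environment variable injected by the stack (the names here are illustrative, not the repository’s exact code):

import { DynamoDB } from 'aws-sdk';

const db = new DynamoDB.DocumentClient();

export const handler = async (event: any) => {
  const { TransactionId } = event.Payload.TransactionDetails;

  // Upsert this step's sample data under its own attribute, keyed by TransactionId.
  await db.update({
    TableName: process.env.TABLE_NAME!, // assumed to be set by the CDK stack
    Key: { TransactionId },
    UpdateExpression: 'SET ProcessRecordsSample = :sample',
    ExpressionAttributeValues: {
      ':sample': { Data: 'Sample', Timestamp: new Date().toISOString() },
    },
  }).promise();

  return event; // pass the payload along to the next step
};

The matching rollback handler would do the inverse: a REMOVE update expression that deletes that step’s attribute from the table entry.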

In the screenshot below, TransformRecords has a simulated failure, which causes the lambda function to throw an error.

A catch step is linked to each process step to handle its rollback. Above, TransformRecordsRollbackTask is run when TransformRecordsTask fails.

The rollback steps cascade backward to the first ‘business logic’ step ProcessRecordsTask. Any steps that have run up to that point will therefore have their associated rollback tasks run.
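
In aws-cdk terms, the wiring looks roughly like this (a sketch with illustrative construct names, assuming Lambda functions defined earlier in the stack):

import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

// Inside the stack: transformFn / transformRollbackFn are lambda.Function
// constructs, and processRollback is the previous step's rollback task.
const transformTask = new tasks.LambdaInvoke(this, 'TransformRecordsTask', {
  lambdaFunction: transformFn,
});

const transformRollback = new tasks.LambdaInvoke(this, 'TransformRecordsRollbackTask', {
  lambdaFunction: transformRollbackFn,
});

// On failure, hand off to this step's rollback task...
transformTask.addCatch(transformRollback, { resultPath: '$.error' });
// ...which then cascades backward to the previous step's rollback.
transformRollback.next(processRollback);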

Here is what an entry looks like in DynamoDB for a failed transaction:

A failed transaction has no written data, because the data written up to the point of failure was ‘rolled back’.

You’ll notice this one does not have the ‘Sample’ data that you see in the previously shown successful transaction. In reality, for a brief moment it does have that sample data. As each rollback step is run, the associated data for that step is removed from the table entry, resulting in the above entry for TransactionId 18.

Deploying the Sample Saga Pattern App with aws-cdk

Clone the source code for the saga pattern aws-cdk app here.

You’ll need to install dependencies and compile the TypeScript first. From the root of the project:

npm install && npm run build

Now you can deploy using aws-cdk.

# Check what you'll deploy / modify first with a diff
cdk diff
# Deploy
cdk deploy

With the stack deployed, you’ll now have the following resources:

  • Step Function / State Machine
  • Various Lambda functions for transaction start, finish, the process steps, and each process rollback step.
  • A DynamoDB table for the data
  • IAM role(s) created for the above

Testing the Saga Pattern Sample App

To test, head over to the Step Functions AWS Console and navigate to the newly created SagaStateMachineExample state machine.

Click New Execution, and paste the following for the input:

{
  "Payload": {
    "TransactionDetails": {
      "TransactionId": "1"
    }
  }
}

Click Start Execution.

In a few short moments, you should have a successful execution and you should see your transaction and sample data in DynamoDB.
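
If you prefer the command line, you can start an execution with the AWS CLI instead (substituting your own state machine ARN):

aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:SagaStateMachineExample \
  --input '{"Payload":{"TransactionDetails":{"TransactionId":"1"}}}'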

Moving on, to simulate a random failure, try executing again, but this time with the following payload:

{
  "Payload": {
    "TransactionDetails": {
      "TransactionId": "2",
      "simulateFail": true
    }
  }
}

The lambda functions check the input payload for the simulateFail flag; if it’s set, they do a Math.random() check to introduce a chance of failure in one of the process steps.
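
The check could look something like this inside a process step (illustrative only; the repository’s actual failure probability may differ):

const { simulateFail } = event.Payload.TransactionDetails;

if (simulateFail && Math.random() < 0.3) {
  // Throwing marks the task as failed, which triggers its catch step
  // and kicks off the rollback cascade.
  throw new Error('Simulated failure');
}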

Taking it Further

To take this example further, you’ll want to manage step outputs more carefully using Step Functions’ ResultPath configuration. This ensures that your steps don’t overwrite data in the state machine and that steps further down the line have access to the data they need.
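
For example, a LambdaInvoke task can tuck its result under its own key (a sketch, with an illustrative path):

const processTask = new tasks.LambdaInvoke(this, 'ProcessRecordsTask', {
  lambdaFunction: processFn,
  // Keep the existing state intact and store this step's output under its
  // own key, so later steps can still read everything they need.
  resultPath: '$.ProcessRecordsResult',
});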

You’ll probably also want a final step for the failure case, which runs after all rollback steps have completed. It can handle notifications or other tasks that should run when a transaction fails.

Ingest CloudWatch Logs to a Splunk HEC with Lambda and Serverless


I recently came across a scenario requiring CloudWatch log ingestion to a private Splunk HEC (HTTP Event Collector).

The first and preferred method of ingesting CloudWatch Logs into Splunk is AWS Firehose. The problem, though, is that Firehose appears to support only endpoints that are open to the public.

This is a problem if you have a Splunk HEC that is only available inside of a VPC and there is no option to proxy public connections back to it.

The next thing I looked at was the Splunk AWS Lambda function template for ingesting CloudWatch logs from Log Group events. At a quick look it seems pretty out of date, with synchronous functions and libraries in use.

So, I decided to put together a small AWS Lambda Serverless project to improve on what is currently out there.

You can find the code over on Github.

The new version has:

  • async / await, with promise wrappers around non-promise libraries like zlib.
  • A module that handles identification of Log Group names based on a custom regex pattern. If events come from log groups that don’t match the naming convention, then they get rejected. The idea is that you can write another small function that auto-subscribes Log Groups.
  • Secrets Manager integration for loading the Splunk HEC token, with a fallback to a plain environment variable if you prefer (see the sketch after this list).
  • Serverless framework wrapper. Pass in your Security Group ID, Subnet IDs and tags, and let serverless CLI deploy the function for you.
  • Lambda VPC support by default. The assumption is that most enterprises run their own internal Splunk inside their corporate / VPC network, so the function deploys into a VPC. If you do happen to have a public-facing Splunk, remove the VPC section in serverless.yml.
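
As a hypothetical sketch of that token lookup (environment variable names here are illustrative, not necessarily the repository’s):

import { SecretsManager } from 'aws-sdk';

const sm = new SecretsManager();

async function getHecToken(): Promise<string> {
  // Prefer Secrets Manager when a secret name is configured...
  if (process.env.SECRET_MANAGER_ITEM_NAME) {
    const secret = await sm
      .getSecretValue({ SecretId: process.env.SECRET_MANAGER_ITEM_NAME })
      .promise();
    return secret.SecretString!;
  }
  // ...otherwise fall back to a plain environment variable.
  return process.env.SPLUNK_HEC_TOKEN!;
}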

You deploy it using Serverless framework, passing in your VPC details and a few other options for customisation.

serverless deploy --stage test \
  --iamRole arn:aws:iam::123456789012:role/lambda-vpc-execution-role \
  --securityGroupId sg-12345 \
  --privateSubnetA subnet-123 \
  --privateSubnetB subnet-456 \
  --privateSubnetC subnet-789 \
  --splunkHecUrl https://your-splunk-hec:8088/services/collector \
  --secretManagerItemName your/secretmanager/entry/here

Once configured, it’ll pick up any log events coming in from Log Groups you’ve ‘subscribed’ it to (Lambda CloudWatch Logs Triggers).

Add your Lambda CloudWatch Logs triggers and enable them for automatic ingestion to Splunk

These events get enriched with extra metadata defined in the function. The metadata is derived by default from the naming convention used in the CloudWatch Log Groups. Take a close look at the included Regex pattern to ensure you name your Log Groups appropriately. Finally, they’re sent to your Splunk HEC for ingestion.
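
As a purely illustrative example (the repository ships its own pattern, so defer to that), a convention like /aws/lambda/app-env-service could be parsed along these lines:

// Hypothetical naming convention: /aws/lambda/<app>-<env>-<service>
const LOG_GROUP_PATTERN = /^\/aws\/lambda\/(?<app>[^-]+)-(?<env>[^-]+)-(?<service>.+)$/;

const match = LOG_GROUP_PATTERN.exec('/aws/lambda/myapp-prod-worker');
if (match?.groups) {
  // app = 'myapp', env = 'prod', service = 'worker';
  // attached to each event as Splunk metadata before sending to the HEC.
  const { app, env, service } = match.groups;
}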

For an automated Log Group ingestion story, write another small helper function that:

  • Looks for Log Groups that are not yet subscribed as CloudWatch Logs Triggers.
  • Adds them to your CloudWatch to Splunk HEC function as a trigger and enables it.
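
A rough sketch of such a helper, using the CloudWatch Logs API (passing the ingest function’s ARN via an environment variable is an assumption on my part):

import { CloudWatchLogs } from 'aws-sdk';

const logs = new CloudWatchLogs();

export const handler = async () => {
  const { logGroups = [] } = await logs.describeLogGroups().promise();

  for (const group of logGroups) {
    if (!group.logGroupName) continue;

    // Skip groups that already have a subscription filter.
    const { subscriptionFilters = [] } = await logs
      .describeSubscriptionFilters({ logGroupName: group.logGroupName })
      .promise();
    if (subscriptionFilters.length > 0) continue;

    // Subscribe the group to the CloudWatch-to-Splunk function.
    await logs.putSubscriptionFilter({
      logGroupName: group.logGroupName,
      filterName: 'splunk-hec-ingest',
      filterPattern: '', // empty pattern forwards all events
      destinationArn: process.env.INGEST_FUNCTION_ARN!,
    }).promise();
  }
};

Note that the ingest function also needs a resource-based policy allowing logs.amazonaws.com to invoke it.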

In the future I might add this ‘automatic trigger adding function’ to the Github repository, so stay tuned!