Monitoring

This document describes application monitoring methods based on the configuration of this catalog.

Overview

Application monitoring is built on Amazon CloudWatch, the standard AWS monitoring service. Specifically, standard metrics from API Gateway and AWS Lambda are displayed on a CloudWatch Dashboard, and a CloudWatch Alarm is configured for each metric to send notifications by email or to a messaging tool (Slack).

The following metrics are added to the CloudWatch Dashboard and monitored by setting thresholds on them as alarm targets, as described later.

Amazon API Gateway

Metric Name | Notes
Count | Number of API calls
4XXError | Treated as "warning" level because it is a client error; consider sending an anomaly notification when many occur within a short period
5XXError | Treated as "error" level because it is a server error; notify as an anomaly when the count exceeds a certain number within a short period

AWS Lambda

Metric Name | Notes
Throttles | Number of throttled invocations; detects when the burst limit is reached
Duration | Detects unusually long execution times, or an unresponsive partner system during external system integration
ConcurrentExecutions | Monitors the concurrent execution count while the application is running; optimize the application or request a limit increase when approaching the upper limit

For metrics not listed above, add them to the Dashboard and configure Alarms for them as needed.
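As a rough illustration of how the metrics listed above could be placed on a dashboard, the following sketch builds a dashboard body (the JSON document that describes the widgets of a CloudWatch Dashboard). The region, the function name, the chosen statistics, and the widget layout are assumptions made for this example, not values prescribed by the catalog.

```python
import json

# Assumed values for illustration only; replace with your own.
REGION = "us-east-1"
API_DIMENSIONS = {"ApiName": "prod-webapi", "Stage": "prod"}
FUNCTION_DIMENSIONS = {"FunctionName": "shopping-cart-api-lambda"}

# (title, namespace, metric name, dimensions, statistic)
METRIC_SPECS = [
    ("API calls", "AWS/ApiGateway", "Count", API_DIMENSIONS, "Sum"),
    ("API 4xx errors", "AWS/ApiGateway", "4XXError", API_DIMENSIONS, "Sum"),
    ("API 5xx errors", "AWS/ApiGateway", "5XXError", API_DIMENSIONS, "Sum"),
    ("Lambda throttles", "AWS/Lambda", "Throttles", FUNCTION_DIMENSIONS, "Sum"),
    ("Lambda duration", "AWS/Lambda", "Duration", FUNCTION_DIMENSIONS, "Maximum"),
    ("Lambda concurrency", "AWS/Lambda", "ConcurrentExecutions",
     FUNCTION_DIMENSIONS, "Maximum"),
]


def metric_widget(index, title, namespace, metric, dimensions, stat):
    """Build one metric widget; three widgets per row on the 24-column grid."""
    flat_dimensions = [item for pair in dimensions.items() for item in pair]
    return {
        "type": "metric",
        "x": (index % 3) * 8,
        "y": (index // 3) * 6,
        "width": 8,
        "height": 6,
        "properties": {
            "title": title,
            "region": REGION,
            "period": 60,
            "stat": stat,
            "metrics": [[namespace, metric, *flat_dimensions]],
        },
    }


widgets = [metric_widget(i, *spec) for i, spec in enumerate(METRIC_SPECS)]
dashboard_body = json.dumps({"widgets": widgets}, indent=2)
print(dashboard_body)
```

The resulting JSON string is the kind of value that could be supplied as the DashboardBody of an AWS::CloudWatch::Dashboard resource or pasted into the dashboard source editor in the console.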

[B] Application Log (Lambda) Monitoring

Applications are expected to output JSON structured logs like the following, using the Logger from the AWS Lambda Powertools library.

{
  "cold_start": true,
  "function_arn": "arn:aws:lambda:us-east-1:123456789012:function:shopping-cart-api-lambda",
  "function_memory_size": 128,
  "function_request_id": "c6af9ac6-7b61-11e6-9a41-93e812345678",
  "function_name": "shopping-cart-api-lambda",
  "level": "ERROR",
  "message": "This is an ERROR log with some context",
  "service": "shopping-cart-api-handler",
  "timestamp": "2023-12-12T21:21:08.921Z",
  "xray_trace_id": "abcdef123456abcdef123456abcdef123456"
}
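To make the shape of such a log line concrete, here is a minimal standard-library sketch that emits one JSON object per log call with the level, message, service, and timestamp fields. It is a stand-in for illustration only, not how the Powertools Logger is implemented; the JsonFormatter class and build_logger helper are hypothetical names, and the Lambda context fields (cold_start, function_arn, and so on) are left out.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
            "timestamp": created.isoformat(timespec="milliseconds"),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to the standard error stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


logger = build_logger("shopping-cart-api-handler")
logger.error("This is an ERROR log with some context")
```

The point of the sketch is the "level" field: because every line is a JSON object with a predictable key, a filter can select error logs by field value rather than by free-text search.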

These structured logs can be searched from the CloudWatch Logs console by specifying field values with a filter pattern such as { $.level = "ERROR" }. The same mechanism also supports the "Subscription Filter" feature, which checks structured log fields in a CloudWatch Logs log group and forwards matching logs to other services (Kinesis, Lambda, or S3 via Firehose, etc.). For monitoring purposes, however, we use a "Metric Filter", which turns matching logs into CloudWatch metrics.

A Metric Filter can be created from the CloudWatch Logs console. For the configuration values, refer to the following sample written as a CloudFormation template.

# "Core" is an example of Lambda function name or alias
CoreErrorLogMetricFilter:
  Type: AWS::Logs::MetricFilter
  DependsOn: CoreLogGroup
  Properties:
    LogGroupName: !Ref CoreLogGroup
    FilterPattern: '{ $.level = "ERROR" }'
    MetricTransformations:
      - MetricValue: 1
        MetricNamespace: !Join [ '', [ 'Logs/', !Ref CoreLambdaFunction ]]
        MetricName: Errors

CoreWarnLogMetricFilter:
  # This sample code is included in the catalog AMI.

When a log with level ERROR is output, the CloudWatch Logs Metric Filter records it as a custom metric. By configuring an Alarm on that metric, you can build a monitoring mechanism that sends a notification when, for example, the number of error logs exceeds a threshold within 10 minutes.
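The effect of a Metric Filter with the pattern { $.level = "ERROR" } and a metric value of 1 can be approximated locally as follows. This is a simplified sketch, not the CloudWatch implementation: it treats each log event as a (timestamp, message) pair, counts only messages that parse as JSON with "level" equal to "ERROR", and groups the counts by minute, which corresponds to reading the metric with the Sum statistic over 60-second periods. The sample events and function names are made up for illustration.

```python
import json
from collections import Counter
from datetime import datetime


def is_error_event(message: str) -> bool:
    """Return True for JSON log messages whose "level" field is "ERROR"."""
    try:
        record = json.loads(message)
    except json.JSONDecodeError:
        return False  # plain-text lines are not JSON and never match
    return isinstance(record, dict) and record.get("level") == "ERROR"


def errors_per_minute(events):
    """Count matching events per minute; each match contributes a value of 1."""
    counts = Counter()
    for timestamp, message in events:
        if is_error_event(message):
            minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
            counts[minute] += 1
    return counts


# Hypothetical log events for illustration.
events = [
    ("2023-12-12T21:21:08", '{"level": "ERROR", "message": "payment failed"}'),
    ("2023-12-12T21:21:40", '{"level": "ERROR", "message": "payment failed"}'),
    ("2023-12-12T21:21:55", '{"level": "INFO", "message": "cart updated"}'),
    ("2023-12-12T21:22:03", "END RequestId: c6af9ac6-7b61-11e6-9a41-93e812345678"),
    ("2023-12-12T21:22:10", '{"level": "ERROR", "message": "timeout"}'),
]

counts = errors_per_minute(events)
for minute, count in sorted(counts.items()):
    print(minute.isoformat(), count)
```

Minutes without any matching log produce no entry at all, which mirrors the fact that a metric fed only by matches has gaps rather than zeros during quiet periods.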

Configure Alarms to Automate Anomaly Detection

A CloudWatch Alarm sets a threshold on a metric and stays in the OK state during normal operation. If the metric value exceeds (or falls below) the threshold over a specified time frame, the state changes to ALARM. This state change triggers a notification via email or SNS, with a message describing the new state and how the data points relate to the threshold. Detection is possible not only for the OK to ALARM transition but also for the reverse, ALARM to OK.

Alarms for standard metrics are configured as follows:

ApiAll5xxAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: prod-webapi-ApiAll5xx-Alarm
    AlarmDescription: Total 5xx count of all APIs
    AlarmActions:
      - !Ref MonitoringTopic # SNS Topic
    Namespace: AWS/ApiGateway
    Dimensions:
      - Name: ApiName
        Value: prod-webapi
      - Name: Stage
        Value: prod
    EvaluationPeriods: 5 # 5 consecutive periods (5 minutes in total)
    MetricName: 5XXError
    Period: 60 # each period is 60 seconds
    Statistic: Sum
    Threshold: 10 # more than 10 5xx responses per period, in all 5 periods
    ComparisonOperator: GreaterThanThreshold

ApiAll4xxAlarm:
  # This sample code is included in the catalog AMI.

GetUsers5xxAlarm:
  # This sample code is included in the catalog AMI.

PostUsers5xxAlarm:
  # This sample code is included in the catalog AMI.

CoreLambdaThrottlesAlarm:
  # This sample code is included in the catalog AMI.

CoreLambdaDurationAlarm:
  # This sample code is included in the catalog AMI.

CoreLambdaConcurrentExecutionAlarm:
  # This sample code is included in the catalog AMI.

The same configuration applies not only to standard metrics but also to custom metrics created from structured logs with a Metric Filter, as described earlier.

CoreLambdaErrorsAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: prod-webapi-CoreLambdaErrors-Alarm
    AlarmDescription: "Log level `ERROR` count in CoreLambdaFunction"
    AlarmActions:
      - !Ref MonitoringTopic # SNS Topic
    Namespace: !Join [ '', [ 'Logs/', !Ref CoreLambdaFunction ]]
    EvaluationPeriods: 5 # 5 consecutive periods (5 minutes in total)
    MetricName: Errors
    Period: 60 # each period is 60 seconds
    Statistic: Sum
    Threshold: 10 # more than 10 error logs per period, in all 5 periods
    ComparisonOperator: GreaterThanThreshold
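How Period, EvaluationPeriods, Threshold, and GreaterThanThreshold interact in the alarm samples can be sketched with a small model. It assumes that DatapointsToAlarm is not set, so every one of the most recent evaluation periods must breach the threshold, and it ignores CloudWatch's handling of missing data. The alarm_state function is a hypothetical helper for illustration, not an AWS API.

```python
from typing import Sequence


def alarm_state(period_sums: Sequence[float], threshold: float,
                evaluation_periods: int) -> str:
    """Simplified alarm evaluation for ComparisonOperator GreaterThanThreshold.

    period_sums holds one Sum value per period, oldest first.
    """
    if len(period_sums) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = period_sums[-evaluation_periods:]
    # Strictly greater than: a value equal to the threshold does not breach.
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


# Per-minute sums with Period: 60, EvaluationPeriods: 5, Threshold: 10.
sustained_errors = [12, 15, 11, 20, 13]
single_burst = [0, 0, 30, 0, 0]

print(alarm_state(sustained_errors, threshold=10, evaluation_periods=5))
print(alarm_state(single_burst, threshold=10, evaluation_periods=5))
```

Under this simplified model, a single burst of 30 errors in one minute does not raise the alarm, because the other four periods stay below the threshold. If the intent is to alert on the total count over five minutes, one option might be Period: 300 with EvaluationPeriods: 1; treat that as a suggestion to verify against your own requirements.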