Monitoring
This document describes application monitoring methods based on this catalog's configuration.
Overview
Application monitoring is built on Amazon CloudWatch, a standard AWS service. Specifically, the standard metrics of API Gateway and AWS Lambda are placed on a CloudWatch Dashboard, and a CloudWatch Alarm is set on each metric so that notifications are sent by email or to a messaging tool (Slack).
[A] Recommended Standard Metrics for Monitoring
The following metrics are added to the CloudWatch Dashboard, and thresholds are set on them as Alarm targets, as described later.
Amazon API Gateway
Metric Name | Notes |
---|---|
Count | Number of API calls |
4XXError | Client-side errors, treated as "warning" level; consider sending a notification when many occur within a short period |
5XXError | Server-side errors, treated as "error" level; notify as abnormal when the count exceeds a set number within a short period |
AWS Lambda
Metric Name | Notes |
---|---|
Throttles | Number of throttled invocations; detects when the burst limit is hit |
Duration | Detects unusually long execution times, for example when an integrated external system is unresponsive |
ConcurrentExecutions | Number of concurrent executions of the application; optimize the application or request a limit increase when approaching the upper limit |
For metrics not listed above, add them to the Dashboard and Alarms as needed for your monitoring operations.
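As a hedged illustration of placing the metrics above on a dashboard, the following CloudFormation fragment is a minimal sketch, not the catalog's actual template. The resource name MonitoringDashboard, the dashboard name, and the widget layout are assumptions for illustration; the API name prod-webapi, the stage prod, and CoreLambdaFunction are taken from the alarm samples in this document.

```yaml
# Minimal sketch; resource name, dashboard name, and layout are assumptions.
MonitoringDashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: prod-webapi-monitoring  # hypothetical name
    # DashboardBody is a JSON string; !Sub fills in the region and function name.
    DashboardBody: !Sub |
      {
        "widgets": [
          {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
              "title": "API Gateway requests and errors",
              "region": "${AWS::Region}",
              "stat": "Sum",
              "period": 60,
              "metrics": [
                ["AWS/ApiGateway", "Count", "ApiName", "prod-webapi", "Stage", "prod"],
                ["AWS/ApiGateway", "4XXError", "ApiName", "prod-webapi", "Stage", "prod"],
                ["AWS/ApiGateway", "5XXError", "ApiName", "prod-webapi", "Stage", "prod"]
              ]
            }
          },
          {
            "type": "metric",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
              "title": "Lambda throttles",
              "region": "${AWS::Region}",
              "stat": "Sum",
              "period": 60,
              "metrics": [
                ["AWS/Lambda", "Throttles", "FunctionName", "${CoreLambdaFunction}"]
              ]
            }
          },
          {
            "type": "metric",
            "x": 0, "y": 6, "width": 12, "height": 6,
            "properties": {
              "title": "Lambda duration (ms)",
              "region": "${AWS::Region}",
              "stat": "Maximum",
              "period": 60,
              "metrics": [
                ["AWS/Lambda", "Duration", "FunctionName", "${CoreLambdaFunction}"]
              ]
            }
          },
          {
            "type": "metric",
            "x": 12, "y": 6, "width": 12, "height": 6,
            "properties": {
              "title": "Lambda concurrent executions",
              "region": "${AWS::Region}",
              "stat": "Maximum",
              "period": 60,
              "metrics": [
                ["AWS/Lambda", "ConcurrentExecutions", "FunctionName", "${CoreLambdaFunction}"]
              ]
            }
          }
        ]
      }
```

Each widget shows one group of metrics from the tables above; the statistics (Sum for counts, Maximum for duration and concurrency) are one reasonable choice and can be adjusted to your needs.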
[B] Application Log (Lambda) Monitoring
The application generally outputs structured JSON logs like the following, using the Logger from the AWS Lambda Powertools library.
    {
      "cold_start": true,
      "function_arn": "arn:aws:lambda:us-east-1:123456789012:function:shopping-cart-api-lambda",
      "function_memory_size": 128,
      "function_request_id": "c6af9ac6-7b61-11e6-9a41-93e812345678",
      "function_name": "shopping-cart-api-lambda",
      "level": "ERROR",
      "message": "This is an ERROR log with some context",
      "service": "shopping-cart-api-handler",
      "timestamp": "2023-12-12T21:21:08.921Z",
      "xray_trace_id": "abcdef123456abcdef123456abcdef123456"
    }
Logs structured in this way can be searched from the CloudWatch Logs console by specifying field values in a format like { $.level = "ERROR" }. The same pattern syntax is used by the "Subscription Filter" feature, which checks structured log fields in a CloudWatch Logs Log Group and forwards matching logs to other services (S3 via Firehose, Kinesis, Lambda, etc.). For monitoring purposes, however, we use a "Metric Filter", which turns matching logs into CloudWatch metrics.
A Metric Filter can be created from the CloudWatch Logs console. For configuration values, refer to the following code sample written in CloudFormation template format.
    # "Core" is an example of a Lambda function name or alias
    CoreErrorLogMetricFilter:
      Type: AWS::Logs::MetricFilter
      DependsOn: CoreLogGroup
      Properties:
        LogGroupName: !Ref CoreLogGroup
        FilterPattern: '{ $.level = "ERROR" }'
        MetricTransformations:
          - MetricValue: 1
            MetricNamespace: !Join [ '', [ 'Logs/', !Ref CoreLambdaFunction ]]
            MetricName: Errors
    CoreWarnLogMetricFilter:
      # This sample code is included in the catalog AMI.
When logs with level ERROR are output, the Metric Filter records them as a custom metric. If an Alarm is set on that metric, you can build a monitoring mechanism that sends a notification when the number of error logs exceeds a threshold within, for example, 10 minutes.
Set Up Alarms to Automate Anomaly Detection
A CloudWatch Alarm sets a threshold for a metric and normally remains in the OK state. If the metric value exceeds (or falls below) the threshold within a given time frame, the state changes to ALARM, and this state change triggers a notification through SNS (for example, to email) that contains the new state and how the data points relate to the threshold. State changes can be detected not only for OK → ALARM but also for the reverse case.
Alarms are configured for standard metrics as follows.
    ApiAll5xxAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: prod-webapi-ApiAll5xx-Alarm
        AlarmDescription: Total 5xx count of all APIs
        AlarmActions:
          - !Ref MonitoringTopic # SNS Topic
        Namespace: AWS/ApiGateway
        Dimensions:
          - Name: ApiName
            Value: prod-webapi
          - Name: Stage
            Value: prod
        EvaluationPeriods: 5 # 5 consecutive periods (5 minutes in total)
        MetricName: 5XXError
        Period: 60 # 60 seconds per period
        Statistic: Sum
        Threshold: 10 # compared per period: ALARM when the sum exceeds 10 in all 5 consecutive 1-minute periods
        ComparisonOperator: GreaterThanThreshold
    ApiAll4xxAlarm:
      # This sample code is included in the catalog AMI.
    GetUsers5xxAlarm:
      # This sample code is included in the catalog AMI.
    PostUsers5xxAlarm:
      # This sample code is included in the catalog AMI.
    CoreLambdaThrottlesAlarm:
      # This sample code is included in the catalog AMI.
    CoreLambdaDurationAlarm:
      # This sample code is included in the catalog AMI.
    CoreLambdaConcurrentExecutionAlarm:
      # This sample code is included in the catalog AMI.
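The alarm description earlier in this section notes that state changes can be detected in both directions. As a minimal sketch of that idea, the fragment below adds OKActions so that recovery is also sent to the topic. The resource name CoreLambdaInvocationErrorsAlarm, the alarm name, the choice of the standard AWS/Lambda Errors metric, the threshold, and the TreatMissingData setting are all assumptions for illustration and are not part of the catalog's samples.

```yaml
# Hypothetical alarm; names, metric choice, and threshold are illustrative assumptions.
CoreLambdaInvocationErrorsAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: prod-webapi-CoreLambdaInvocationErrors-Alarm
    AlarmDescription: Failed invocations of CoreLambdaFunction
    AlarmActions:
      - !Ref MonitoringTopic # notified when the state changes to ALARM
    OKActions:
      - !Ref MonitoringTopic # notified when the state changes to OK (recovery)
    Namespace: AWS/Lambda
    Dimensions:
      - Name: FunctionName
        Value: !Ref CoreLambdaFunction
    MetricName: Errors
    Statistic: Sum
    Period: 60 # 60 seconds per period
    EvaluationPeriods: 5 # all 5 consecutive periods must breach
    Threshold: 0 # any failed invocation in a period counts as breaching
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching # periods without data are treated as healthy
```

Whether recovery notifications are useful depends on the alarm; for noisy metrics they can double the message volume, so this is a design choice rather than a default.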
Alarms can be set not only on standard metrics but also on custom metrics created from structured logs and Metric Filters, as described earlier.
    CoreLambdaErrorsAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: prod-webapi-CoreLambdaErrors-Alarm
        AlarmDescription: "Log level `ERROR` count in CoreLambdaFunction"
        AlarmActions:
          - !Ref MonitoringTopic # SNS Topic
        Namespace: !Join [ '', [ 'Logs/', !Ref CoreLambdaFunction ]]
        EvaluationPeriods: 5 # 5 consecutive periods (5 minutes in total)
        MetricName: Errors
        Period: 60 # 60 seconds per period
        Statistic: Sum
        Threshold: 10 # compared per period: ALARM when more than 10 error logs occur in all 5 consecutive 1-minute periods
        ComparisonOperator: GreaterThanThreshold
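The alarm samples send state changes to MonitoringTopic, whose definition is not shown in this document. A minimal sketch of such a topic with an email subscription might look like the following. The OpsEmailAddress parameter and the topic name are hypothetical, and delivery to Slack would need additional setup that this sketch does not cover.

```yaml
# Minimal sketch; parameter name and topic name are assumptions.
Parameters:
  OpsEmailAddress: # hypothetical parameter
    Type: String
    Description: Email address that receives alarm notifications

Resources:
  MonitoringTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: prod-webapi-monitoring-topic # hypothetical name
      Subscription:
        - Protocol: email
          Endpoint: !Ref OpsEmailAddress
```

An email subscription only starts receiving notifications after the recipient confirms it through the confirmation message that SNS sends.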