Monitoring

This document describes how to monitor an application built with this catalog's configuration.

Overview

Application monitoring is built on Amazon CloudWatch, a standard AWS service. The standard metrics of API Gateway and AWS Lambda are placed on a CloudWatch dashboard, and a CloudWatch alarm is set on each metric so that notifications are sent to email or a messaging tool (Slack).

The following metrics are added to the CloudWatch dashboard. Thresholds are set on them as alarm targets, as described later.

Amazon API Gateway

Metric Name | Notes
Count | Number of API calls
4XXError | Treated as "warning" level because it is a client error; consider notifying as abnormal when many occur in a short period
5XXError | Treated as "error" level because it is a server error; notify as abnormal when the count exceeds a set number within a short time

AWS Lambda

Metric Name | Notes
Throttles | Number of throttled invocations; used to detect that the burst limit was reached
Duration | Detects unusually long execution times, or unresponsive external systems when integrating with them
ConcurrentExecutions | Concurrent execution count of the application; optimize, or request a quota increase, when it approaches the upper limit

For metrics not listed above, add them to the dashboard and alarms as your monitoring needs require.
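As an illustration of how the metrics above could be laid out on a dashboard, the following sketch builds a CloudWatch dashboard body as JSON. The API name and stage are taken from the alarm sample later in this document; the function name, region, widget titles, and layout are placeholders chosen for this example and are not part of the catalog.

```python
import json

# Assumptions for illustration only.
API_NAME = "prod-webapi"         # from the alarm sample in this document
STAGE = "prod"                   # from the alarm sample in this document
FUNCTION_NAME = "core-function"  # hypothetical Lambda function name
REGION = "us-east-1"             # region used in the sample log's ARN


def metric_widget(title, metrics, stat, x, y):
    """Return one metric widget definition for a dashboard body."""
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": 12,
        "height": 6,
        "properties": {
            "title": title,
            "metrics": metrics,
            "stat": stat,
            "period": 60,
            "region": REGION,
        },
    }


def build_dashboard_body():
    """Return the dashboard body as a JSON string."""
    api = ["ApiName", API_NAME, "Stage", STAGE]
    fn = ["FunctionName", FUNCTION_NAME]
    api_metrics = [
        ["AWS/ApiGateway", name, *api]
        for name in ("Count", "4XXError", "5XXError")
    ]
    widgets = [
        metric_widget("API calls and errors", api_metrics, "Sum", 0, 0),
        metric_widget("Lambda throttles",
                      [["AWS/Lambda", "Throttles", *fn]], "Sum", 12, 0),
        metric_widget("Lambda duration",
                      [["AWS/Lambda", "Duration", *fn]], "Average", 0, 6),
        metric_widget("Lambda concurrency",
                      [["AWS/Lambda", "ConcurrentExecutions", *fn]], "Maximum", 12, 6),
    ]
    return json.dumps({"widgets": widgets}, indent=2)


if __name__ == "__main__":
    print(build_dashboard_body())
```

The resulting string could be supplied as the DashboardBody of an AWS::CloudWatch::Dashboard resource. The statistic chosen for each widget is a judgment call and should be adjusted to what you want to see.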

[B] Application Log (Lambda) Monitoring

The application generally outputs structured JSON logs such as the following, using the Logger from the AWS Lambda Powertools library.

{
  "cold_start": true,
  "function_arn": "arn:aws:lambda:us-east-1:123456789012:function:shopping-cart-api-lambda",
  "function_memory_size": 128,
  "function_request_id": "c6af9ac6-7b61-11e6-9a41-93e812345678",
  "function_name": "shopping-cart-api-lambda",
  "level": "ERROR",
  "message": "This is an ERROR log with some context",
  "service": "shopping-cart-api-handler",
  "timestamp": "2023-12-12T21:21:08.921Z",
  "xray_trace_id": "abcdef123456abcdef123456abcdef123456"
}

Logs structured in this way can be searched from the CloudWatch Logs console by specifying field values in a format such as { $.level = "ERROR" }. The same mechanism underlies the "Subscription Filter" feature, which checks the structured fields of log events in a CloudWatch Logs log group and forwards the matching logs to other services (Kinesis, Lambda, S3 by way of Firehose, etc.). For monitoring purposes, however, we use a "Metric Filter", which turns the matches into a CloudWatch metric.
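To make the pattern { $.level = "ERROR" } concrete, here is a small local emulation of the comparison it expresses. It is only a sketch for reasoning about which log lines would match: the function name is made up for this example, only the single equality test is emulated, and treating non-JSON lines as non-matching is a simplification of this sketch rather than a statement about CloudWatch itself.

```python
import json


def matches_level(log_line, level="ERROR"):
    """Emulate the test { $.level = "<level>" } against one log line."""
    try:
        record = json.loads(log_line)
    except json.JSONDecodeError:
        # Simplification: this sketch treats lines that are not JSON as non-matching.
        return False
    return isinstance(record, dict) and record.get("level") == level


if __name__ == "__main__":
    lines = [
        '{"level": "ERROR", "message": "This is an ERROR log with some context"}',
        '{"level": "INFO", "message": "request handled"}',
        "START RequestId: c6af9ac6-7b61-11e6-9a41-93e812345678",
    ]
    for line in lines:
        print(matches_level(line), line)
```

Only the first line matches. This is the kind of test a Metric Filter applies to each log event before it increments a metric.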

A metric filter can be created from the CloudWatch Logs console. For the configuration values, refer to the following sample written in CloudFormation template format.

# "Core" is an example of Lambda function name or alias
CoreErrorLogMetricFilter:
  Type: AWS::Logs::MetricFilter
  DependsOn: CoreLogGroup
  Properties:
    LogGroupName: !Ref CoreLogGroup
    FilterPattern: '{ $.level = "ERROR" }'
    MetricTransformations:
      - MetricValue: 1
        MetricNamespace: !Join [ '', [ 'Logs/', !Ref CoreLambdaFunction ]]
        MetricName: Errors

CoreWarnLogMetricFilter:
  # This sample code is included in the catalog AMI.

When a log with level ERROR is output, the CloudWatch Logs metric filter records it in a custom metric. If an alarm is set on that metric, you get a monitoring mechanism that sends a notification when the number of error logs exceeds a threshold within a given window, for example 10 minutes.
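The way matched logs accumulate into metric data points can be pictured with the sketch below. It assumes each matched log contributes a value of 1 (as with MetricValue: 1) and groups matches into one-minute buckets. The function name is hypothetical, and the sketch uses the timestamp field written by the application as a stand-in for the event time.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout of the sample log in this document, e.g. 2023-12-12T21:21:08.921Z
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def datapoints_per_minute(timestamps, metric_value=1):
    """Sum metric_value per one-minute bucket for the given matched-log timestamps."""
    sums = Counter()
    for text in timestamps:
        moment = datetime.strptime(text, TIMESTAMP_FORMAT)
        bucket = moment.strftime("%Y-%m-%dT%H:%M")  # truncate to the minute
        sums[bucket] += metric_value
    return dict(sums)


if __name__ == "__main__":
    matched = [
        "2023-12-12T21:21:08.921Z",
        "2023-12-12T21:21:40.100Z",
        "2023-12-12T21:22:01.000Z",
    ]
    print(datapoints_per_minute(matched))
```

With the three sample timestamps, two matches fall into the 21:21 bucket and one into 21:22. An alarm using the Sum statistic with a 60-second period would compare each of those per-minute sums against its threshold.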

Set Up Alarms to Automate Anomaly Detection

A CloudWatch alarm sets a threshold on a metric and normally stays in the OK state. If the metric value exceeds (or falls below) the threshold over the configured evaluation window, the state changes to ALARM. This state change triggers a notification, for example to email through an SNS topic, with a message describing the new state and how the data points relate to the threshold. The reverse transition, from ALARM to OK, can be detected as well as the transition from OK to ALARM.

Alarms for standard metrics are configured as follows.
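How the threshold, period, and number of evaluation periods interact can be sketched as follows. The defaults mirror the values used by the samples in this document (Threshold 10, EvaluationPeriods 5, comparison GreaterThanThreshold), and each list entry stands for the Sum over one 60-second period. This is a simplified model under stated assumptions: it requires every evaluated period to breach, which is the behaviour when DatapointsToAlarm is not set, and it does not model CloudWatch's handling of missing data. The function name is hypothetical.

```python
def alarm_state(period_sums, threshold=10, evaluation_periods=5):
    """Return the state implied by the most recent per-period sums.

    ALARM requires every one of the last `evaluation_periods` values to be
    strictly greater than `threshold` (GreaterThanThreshold).
    """
    recent = period_sums[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        # Simplification: too few periods to evaluate.
        return "INSUFFICIENT_DATA"
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


if __name__ == "__main__":
    print(alarm_state([11, 12, 15, 11, 20]))  # every period breaches
    print(alarm_state([0, 0, 0, 0, 50]))      # one burst in a single period
    print(alarm_state([11, 12]))              # fewer than 5 periods
```

Under this model, sustained errors in all five periods produce ALARM, while a single one-minute burst of 50 errors stays OK. If the intent is "more than 10 errors in total within 5 minutes", one alternative is a single evaluation period with a 300-second period.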

ApiAll5xxAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: prod-webapi-ApiAll5xx-Alarm
    AlarmDescription: Total 5xx count of all APIs
    AlarmActions:
      - !Ref MonitoringTopic # SNS Topic
    Namespace: AWS/ApiGateway
    Dimensions:
      - Name: ApiName
        Value: prod-webapi
      - Name: Stage
        Value: prod
    EvaluationPeriods: 5 # 5 consecutive periods (5 minutes in total)
    MetricName: 5XXError
    Period: 60 # each period is 60 seconds
    Statistic: Sum
    Threshold: 10 # a period breaches when its Sum exceeds 10; ALARM needs all 5 periods to breach
    ComparisonOperator: GreaterThanThreshold

ApiAll4xxAlarm:
  # This sample code is included in the catalog AMI.

GetUsers5xxAlarm:
  # This sample code is included in the catalog AMI.

PostUsers5xxAlarm:
  # This sample code is included in the catalog AMI.

CoreLambdaThrottlesAlarm:
  # This sample code is included in the catalog AMI.

CoreLambdaDurationAlarm:
  # This sample code is included in the catalog AMI.

CoreLambdaConcurrentExecutionAlarm:
  # This sample code is included in the catalog AMI.

The same approach applies not only to standard metrics but also to custom metrics created from structured logs and metric filters, as described earlier.

CoreLambdaErrorsAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: prod-webapi-CoreLambdaErrors-Alarm
    AlarmDescription: "Log level `ERROR` count in CoreLambdaFunction"
    AlarmActions:
      - !Ref MonitoringTopic # SNS Topic
    Namespace: !Join [ '', [ 'Logs/', !Ref CoreLambdaFunction ]]
    EvaluationPeriods: 5 # 5 consecutive periods (5 minutes in total)
    MetricName: Errors
    Period: 60 # each period is 60 seconds
    Statistic: Sum
    Threshold: 10 # a period breaches when its Sum exceeds 10; ALARM needs all 5 periods to breach
    ComparisonOperator: GreaterThanThreshold
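When an alarm changes state, the SNS topic referenced by AlarmActions receives a JSON message describing the change. As a rough sketch of how a subscriber might turn that message into a short text for a messaging tool, the function below reads the AlarmName, NewStateValue, and NewStateReason fields from an SNS event as delivered to a Lambda function. The function name and the output layout are assumptions for this example, and the catalog may deliver notifications by a different mechanism.

```python
import json


def format_alarm_notifications(sns_event):
    """Return one text line per SNS record that carries an alarm state change."""
    lines = []
    for record in sns_event.get("Records", []):
        # The alarm details arrive as a JSON string in the SNS message body.
        alarm = json.loads(record["Sns"]["Message"])
        lines.append(
            f'[{alarm["NewStateValue"]}] {alarm["AlarmName"]}: {alarm["NewStateReason"]}'
        )
    return lines


if __name__ == "__main__":
    # Hand-written sample event for illustration; the reason text is made up.
    sample_event = {
        "Records": [
            {
                "Sns": {
                    "Message": json.dumps(
                        {
                            "AlarmName": "prod-webapi-ApiAll5xx-Alarm",
                            "NewStateValue": "ALARM",
                            "NewStateReason": "Threshold Crossed",
                        }
                    )
                }
            }
        ]
    }
    for line in format_alarm_notifications(sample_event):
        print(line)
```

Sending the resulting text to Slack is left out on purpose, since the webhook or integration used depends on your environment.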