Monitoring

This document describes how to monitor applications built with this catalog's configuration.

Overview

Monitoring for the application is built on CloudWatch, a standard AWS service. Specifically, the standard metrics of API Gateway and AWS Lambda are placed on a CloudWatch Dashboard, and a CloudWatch Alarm is set on each metric so that notifications are sent to email or a messaging tool (Slack).

[A] Recommended Standard Metrics for Monitoring

The following metrics are added to the CloudWatch Dashboard, and thresholds are set on them as Alarm targets, as described later.

Amazon API Gateway

Metric Name	Notes
Count	Number of API calls
4XXError	Client-side errors, treated as "warning" level. Consider an anomaly notification when they occur in bursts over a short period.
5XXError	Server-side errors, treated as "error" level. Notify as an anomaly when the count exceeds a set number within a short period.

AWS Lambda

Metric Name	Notes
Throttles	Number of throttled invocations; used to detect that the burst limit has been reached
Duration	Detects unusually long execution times, including unresponsive external systems when integrating with them
ConcurrentExecutions	Concurrent execution count of the application; optimize, or request a quota increase, when it approaches the upper limit

For metrics not listed above, add them to the Dashboard and Alarms as your monitoring needs require.
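
As an illustration of placing these metrics on a Dashboard, the following is a minimal sketch in CloudFormation template format rather than the catalog's actual template. The logical ID `MonitoringDashboard`, the dashboard name, and the widget layout are assumptions; the `ApiName` and `Stage` values are taken from the alarm sample in this document, and `CoreLambdaFunction` is assumed to be the Lambda function resource referenced in the samples.

```yaml
# Hypothetical dashboard; names and layout are assumptions, not catalog values
MonitoringDashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: prod-webapi-monitoring
    DashboardBody: !Sub |
      {
        "widgets": [
          {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
              "title": "API Gateway: Count / 4XXError / 5XXError",
              "region": "${AWS::Region}",
              "stat": "Sum",
              "period": 60,
              "metrics": [
                [ "AWS/ApiGateway", "Count", "ApiName", "prod-webapi", "Stage", "prod" ],
                [ "AWS/ApiGateway", "4XXError", "ApiName", "prod-webapi", "Stage", "prod" ],
                [ "AWS/ApiGateway", "5XXError", "ApiName", "prod-webapi", "Stage", "prod" ]
              ]
            }
          },
          {
            "type": "metric",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
              "title": "Lambda: Throttles / Duration / ConcurrentExecutions",
              "region": "${AWS::Region}",
              "period": 60,
              "metrics": [
                [ "AWS/Lambda", "Throttles", "FunctionName", "${CoreLambdaFunction}", { "stat": "Sum" } ],
                [ "AWS/Lambda", "Duration", "FunctionName", "${CoreLambdaFunction}", { "stat": "Maximum" } ],
                [ "AWS/Lambda", "ConcurrentExecutions", "FunctionName", "${CoreLambdaFunction}", { "stat": "Maximum" } ]
              ]
            }
          }
        ]
      }
```

Metric names are case sensitive, so the widgets use `4XXError`, `5XXError`, and `ConcurrentExecutions` exactly as CloudWatch publishes them.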

[B] Application Log (Lambda) Monitoring

The application generally outputs JSON structured logs like the following, using the Logger from the AWS Lambda Powertools library.

{
  "cold_start": true,
  "function_arn": "arn:aws:lambda:us-east-1:123456789012:function:shopping-cart-api-lambda",
  "function_memory_size": 128,
  "function_request_id": "c6af9ac6-7b61-11e6-9a41-93e812345678",
  "function_name": "shopping-cart-api-lambda",
  "level": "ERROR",
  "message": "This is an ERROR log with some context",
  "service": "shopping-cart-api-handler",
  "timestamp": "2023-12-12T21:21:08.921Z",
  "xray_trace_id": "abcdef123456abcdef123456abcdef123456"
}

Logs structured in this way can be searched from the CloudWatch Logs console by specifying field values in a pattern such as { $.level = "ERROR" }. The same pattern syntax drives two features of CloudWatch Logs log groups: a "Subscription Filter", which forwards matching logs to other services (Kinesis Data Streams, Data Firehose, Lambda, etc.), and a "Metric Filter", which turns matching logs into CloudWatch metrics. For monitoring purposes, we use the Metric Filter.

A Metric Filter can be created from the CloudWatch Logs console. For the configuration values, refer to the following sample written in CloudFormation template format.

# "Core" is an example of Lambda function name or alias
CoreErrorLogMetricFilter:
  Type: AWS::Logs::MetricFilter
  DependsOn: CoreLogGroup
  Properties:
    LogGroupName: !Ref CoreLogGroup
    FilterPattern: '{ $.level = "ERROR" }'
    MetricTransformations:
    - MetricValue: 1
      MetricNamespace: !Join [ '', [ 'Logs/', !Ref CoreLambdaFunction ]]
      MetricName: Errors

CoreWarnLogMetricFilter:
  # This sample code is included in the catalog AMI.

When a log with level ERROR is output, the Metric Filter records it in a custom metric. Setting an Alarm on that metric creates a monitoring mechanism that sends a notification when the number of error logs exceeds a threshold within a given window, for example 10 minutes.
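
The Metric Filter sample refers to a log group named `CoreLogGroup` without showing its definition. A minimal sketch of such a resource might look like the following, assuming that `CoreLambdaFunction` is the Lambda function resource and that the function writes to Lambda's default log group name. The retention period is an arbitrary illustrative value, not a catalog setting.

```yaml
# Hypothetical definition of the log group referenced by the Metric Filter
CoreLogGroup:
  Type: AWS::Logs::LogGroup
  Properties:
    # Lambda's default log group name is /aws/lambda/<function name>
    LogGroupName: !Sub /aws/lambda/${CoreLambdaFunction}
    RetentionInDays: 30   # assumption: adjust to your retention policy
```

Declaring the log group in the same template lets the Metric Filter reference it with `!Ref` and keeps retention under template control.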

Set Up Alarms to Automate Anomaly Detection

A CloudWatch Alarm sets a threshold on a metric and normally remains in the OK state. If the metric value exceeds (or falls below) the threshold within a given time frame, the state changes to ALARM. The state change triggers a notification through an SNS topic (for example, to email), with a message describing the new state and how the data points relate to the threshold. The reverse transition, ALARM → OK, can be detected in the same way.
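
The alarm samples in this document send their notifications to an SNS topic referenced as `MonitoringTopic`. A minimal sketch of such a topic with an email subscription could look like this. The topic name and the email address are placeholders, not values from the catalog.

```yaml
# Hypothetical SNS topic that receives alarm state changes
MonitoringTopic:
  Type: AWS::SNS::Topic
  Properties:
    TopicName: prod-webapi-monitoring   # assumption: any unique topic name works
    Subscription:
    - Protocol: email
      Endpoint: ops@example.com         # placeholder address
```

An email subscription must be confirmed by the recipient before SNS delivers messages to it.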

Alarms for standard metrics are configured as follows.

ApiAll5xxAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: prod-webapi-ApiAll5xx-Alarm
    AlarmDescription: Total 5xx count of all APIs
    AlarmActions:
    - !Ref MonitoringTopic
    Namespace: AWS/ApiGateway
    Dimensions:
    - Name: ApiName
      Value: prod-webapi
    - Name: Stage
      Value: prod
    EvaluationPeriods: 5 # evaluate the 5 most recent periods (5 minutes in total)
    MetricName: 5XXError
    Period: 60 # length of one period in seconds
    Statistic: Sum
    Threshold: 10 # ALARM when the per-minute 5xx count exceeds 10 in all 5 periods
    ComparisonOperator: GreaterThanThreshold

ApiAll4xxAlarm:
  # This sample code is included in the catalog AMI.

GetUsers5xxAlarm:
  # This sample code is included in the catalog AMI.

PostUsers5xxAlarm:
  # This sample code is included in the catalog AMI.

CoreLambdaThrottlesAlarm:
  # This sample code is included in the catalog AMI.

CoreLambdaDurationAlarm:
  # This sample code is included in the catalog AMI.

CoreLambdaConcurrentExecutionAlarm:
  # This sample code is included in the catalog AMI.

The same approach applies not only to standard metrics but also to custom metrics created from structured logs and Metric Filters, as described earlier.

CoreLambdaErrorsAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: prod-webapi-CoreLambdaErrors-Alarm
    AlarmDescription: "Log level `ERROR` count in CoreLambdaFunction"
    AlarmActions:
    - !Ref MonitoringTopic # SNS Topic
    Namespace: !Join [ '', [ 'Logs/', !Ref CoreLambdaFunction ]]
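    EvaluationPeriods: 5 # evaluate the 5 most recent periods (5 minutes in total)
    MetricName: Errors
    Period: 60 # length of one period in seconds
    Statistic: Sum
    Threshold: 10 # ALARM when the per-minute error log count exceeds 10 in all 5 periods
    ComparisonOperator: GreaterThanThreshold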
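
With `Period: 60` and `EvaluationPeriods: 5`, and no `DatapointsToAlarm` set, CloudWatch moves to ALARM only when each of the five one-minute sums breaches the threshold. If the intent is instead "more than 10 error logs in total within one 5-minute window", a variant along the following lines may be closer. This is a sketch under that assumption, not the catalog's configuration. The logical ID and alarm name are hypothetical, and `OKActions` is added to show how the ALARM → OK transition can also be notified.

```yaml
# Hypothetical variant: one 5-minute window instead of five 1-minute periods
CoreLambdaErrorsWindowAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: prod-webapi-CoreLambdaErrorsWindow-Alarm
    AlarmDescription: "More than 10 `ERROR` logs within one 5-minute window"
    AlarmActions:
    - !Ref MonitoringTopic   # notify on OK -> ALARM
    OKActions:
    - !Ref MonitoringTopic   # notify on ALARM -> OK
    Namespace: !Join [ '', [ 'Logs/', !Ref CoreLambdaFunction ]]
    MetricName: Errors
    Statistic: Sum
    Period: 300              # one 5-minute window
    EvaluationPeriods: 1
    Threshold: 10
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching   # no matching logs means no data; treat that as healthy
```

A Metric Filter publishes a data point only when log events arrive, so `TreatMissingData: notBreaching` lets the alarm return to OK once error logs stop.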
