Monitoring

This document describes how to monitor applications built on the configuration of this catalog.

Overview

The application monitoring we build uses CloudWatch, the standard AWS monitoring service. Specifically, we place the standard metrics from API Gateway and AWS Lambda on a CloudWatch Dashboard and configure a CloudWatch Alarm for each metric so that notifications are sent by email or to a messaging tool (Slack).

[A] Recommended Standard Metrics for Monitoring

The following metrics are added to the CloudWatch Dashboard and monitored by setting thresholds on them as alarm targets, as described later.

Amazon API Gateway

Metric Name	Notes
Count	Number of API calls
4XXError	Client-side error, treated as "warning" level; consider notifying as an anomaly when many occur within a short period
5XXError	Server-side error, treated as "error" level; notify as an anomaly when the count exceeds a set number within a short period

AWS Lambda

Metric Name	Notes
Throttles	Number of throttled invocations; detects when the concurrency or burst limit is reached
Duration	Detects unusually long execution times, for example when an external partner system does not respond
ConcurrentExecutions	Number of concurrent executions during operation; when it approaches the upper limit, optimize the application or request a limit increase

For metrics not listed above, add them to the Dashboard and configure Alarms as needed for your monitoring purposes.
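As an illustration of how the metrics above could be placed on a dashboard, here is a minimal sketch in CloudFormation template format. The resource name MonitoringDashboard, the dashboard name, and the widget layout are assumptions made for this example; the API name, stage, and CoreLambdaFunction follow the samples used elsewhere in this document.

```yaml
# Hypothetical sketch: resource name, dashboard name, and layout are assumptions.
MonitoringDashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: prod-webapi-monitoring
    # DashboardBody is a JSON string; !Sub fills in the region and function name.
    DashboardBody: !Sub |
      {
        "widgets": [
          {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
              "title": "API Gateway",
              "region": "${AWS::Region}",
              "stat": "Sum",
              "period": 60,
              "metrics": [
                ["AWS/ApiGateway", "Count", "ApiName", "prod-webapi", "Stage", "prod"],
                [".", "4XXError", ".", ".", ".", "."],
                [".", "5XXError", ".", ".", ".", "."]
              ]
            }
          },
          {
            "type": "metric",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
              "title": "Lambda",
              "region": "${AWS::Region}",
              "period": 60,
              "metrics": [
                ["AWS/Lambda", "Throttles", "FunctionName", "${CoreLambdaFunction}", {"stat": "Sum"}],
                [".", "Duration", ".", ".", {"stat": "Maximum"}],
                [".", "ConcurrentExecutions", ".", ".", {"stat": "Maximum"}]
              ]
            }
          }
        ]
      }
```

The "." entries repeat the value from the same position in the previous metric row, which keeps the widget definition short when several metrics share a namespace and dimensions.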

[B] Application Log (Lambda) Monitoring

Applications are expected to output JSON structured logs such as the following, using the Logger from the AWS Lambda Powertools library.

{
  "cold_start": true,
  "function_arn": "arn:aws:lambda:us-east-1:123456789012:function:shopping-cart-api-lambda",
  "function_memory_size": 128,
  "function_request_id": "c6af9ac6-7b61-11e6-9a41-93e812345678",
  "function_name": "shopping-cart-api-lambda",
  "level": "ERROR",
  "message": "This is an ERROR log with some context",
  "service": "shopping-cart-api-handler",
  "timestamp": "2023-12-12T21:21:08.921Z",
  "xray_trace_id": "abcdef123456abcdef123456abcdef123456"
}

These structured logs can be searched in the CloudWatch Logs console by specifying field values in a format such as { $.level = "ERROR" }. The same pattern syntax can be used with the "Subscription Filter" feature, which inspects the fields of structured logs in a CloudWatch Logs log group and forwards matching logs to other services (such as Kinesis, Lambda, or S3 by way of Firehose). For monitoring purposes, however, we use a "Metric Filter", which turns matching logs into CloudWatch metrics.

A Metric Filter can be created from the CloudWatch Logs console. For the configuration values, refer to the following sample written in CloudFormation template format.

# "Core" is an example of Lambda function name or alias
CoreErrorLogMetricFilter:
  Type: AWS::Logs::MetricFilter
  DependsOn: CoreLogGroup
  Properties:
    LogGroupName: !Ref CoreLogGroup
    FilterPattern: '{ $.level = "ERROR" }'
    MetricTransformations:
    - MetricValue: 1
      MetricNamespace: !Join [ '', [ 'Logs/', !Ref CoreLambdaFunction ]]
      MetricName: Errors

CoreWarnLogMetricFilter:
  # This sample code is included in the catalog AMI.

When logs at the ERROR level are output, the CloudWatch Logs Metric Filter registers them as a custom metric. By configuring an Alarm on that metric, you can build a monitoring mechanism that sends a notification when, for example, the number of error logs exceeds a certain threshold within 10 minutes.
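The Metric Filter sample references a log group named CoreLogGroup, whose definition is not shown in this document. A minimal sketch follows, assuming the function writes to the default /aws/lambda/<function name> log group; the retention period is an arbitrary value chosen for illustration.

```yaml
# Hypothetical sketch: the retention period is an assumption.
CoreLogGroup:
  Type: AWS::Logs::LogGroup
  Properties:
    # Default log group name that Lambda writes to for this function
    LogGroupName: !Sub /aws/lambda/${CoreLambdaFunction}
    RetentionInDays: 30
```

If Lambda has already created a log group with this name before the stack is deployed, stack creation fails because the log group already exists, so defining it in the same template as the function is usually the simpler path.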

Configure Alarms to Automate Anomaly Detection

A CloudWatch Alarm sets a threshold on a metric and stays in the OK state during normal operation. When the metric value exceeds (or falls below) the threshold over the configured evaluation period, the state changes to ALARM. This state change triggers a notification through Amazon SNS (for example by email) with a message that describes the new state and how the data points relate to the threshold. Notifications can be sent not only for OK → ALARM but also for the reverse transition, ALARM → OK.
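The alarm samples in this document send their notifications to an SNS topic named MonitoringTopic, whose definition is not shown. A minimal sketch of such a topic with an email subscription follows; the topic name and the email address are placeholders assumed for this example.

```yaml
# Hypothetical sketch: topic name and email address are placeholders.
MonitoringTopic:
  Type: AWS::SNS::Topic
  Properties:
    TopicName: prod-webapi-monitoring
    Subscription:
    - Protocol: email
      Endpoint: ops@example.com  # recipient must confirm the subscription by email
```

To also be notified when an alarm recovers, the same topic can be listed under the alarm's OKActions property in addition to AlarmActions.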

Alarms for standard metrics are configured as follows:

ApiAll5xxAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: prod-webapi-ApiAll5xx-Alarm
    AlarmDescription: Total 5xx count of all APIs
    AlarmActions:
    - !Ref MonitoringTopic
    Namespace: AWS/ApiGateway
    Dimensions:
    - Name: ApiName
      Value: prod-webapi
    - Name: Stage
      Value: prod
    EvaluationPeriods: 5 # 5 consecutive periods (5 minutes in total)
    MetricName: 5XXError
    Period: 60 # 60 seconds
    Statistic: Sum
    Threshold: 10 # more than 10 5xx responses in each 60-second period
    ComparisonOperator: GreaterThanThreshold

ApiAll4xxAlarm:
  # This sample code is included in the catalog AMI.

GetUsers5xxAlarm:
  # This sample code is included in the catalog AMI.

PostUsers5xxAlarm:
  # This sample code is included in the catalog AMI.

CoreLambdaThrottlesAlarm:
  # This sample code is included in the catalog AMI.

CoreLambdaDurationAlarm:
  # This sample code is included in the catalog AMI.

CoreLambdaConcurrentExecutionAlarm:
  # This sample code is included in the catalog AMI.
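With Period: 60 and EvaluationPeriods: 5, and no DatapointsToAlarm specified, the ApiAll5xxAlarm sample enters the ALARM state only when the per-minute sum exceeds the threshold in all five consecutive periods. If the intent is instead to alarm on the total over a single five-minute window, one possible variant is sketched below. The resource name ApiAll5xxWindowAlarm and the TreatMissingData choice are assumptions for this example, not part of the catalog sample.

```yaml
# Hypothetical variant, not part of the catalog sample:
# alarm when the total over one 5-minute window exceeds 10.
ApiAll5xxWindowAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: prod-webapi-ApiAll5xxWindow-Alarm
    AlarmDescription: Total 5xx count of all APIs over a 5-minute window
    AlarmActions:
    - !Ref MonitoringTopic
    Namespace: AWS/ApiGateway
    Dimensions:
    - Name: ApiName
      Value: prod-webapi
    - Name: Stage
      Value: prod
    MetricName: 5XXError
    Statistic: Sum
    Period: 300 # one 300-second datapoint
    EvaluationPeriods: 1 # evaluate only the most recent datapoint
    Threshold: 10 # more than 10 5xx responses within the 5-minute window
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching # no datapoint (no traffic) is treated as healthy
```

The consecutive-period form is less sensitive to a single short burst, while the single-window form reacts to the overall count; which one fits depends on how noisy the API's error rate is in practice.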

The same configuration applies not only to standard metrics but also to custom metrics created from structured logs and a Metric Filter, as described earlier.

CoreLambdaErrorsAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: prod-webapi-CoreLambdaErrors-Alarm
    AlarmDescription: "Log level `ERROR` count in CoreLambdaFunction"
    AlarmActions:
    - !Ref MonitoringTopic # SNS Topic
    Namespace: !Join [ '', [ 'Logs/', !Ref CoreLambdaFunction ]]
    EvaluationPeriods: 5 # 5 consecutive periods (5 minutes in total)
    MetricName: Errors
    Period: 60 # 60 seconds
    Statistic: Sum
    Threshold: 10 # more than 10 error logs in each 60-second period
    ComparisonOperator: GreaterThanThreshold
