
AWS Monitoring, Troubleshooting & Audit

Why Monitoring is Important

  • We know how to deploy applications
    • Safely
    • Automatically
    • Using Infrastructure as Code
    • Leveraging the best AWS components!
  • Our applications are deployed, and our users don't care how we did it...
  • Our users only care that the application is working!
    • Application latency: will it increase over time?
    • Application outages: customer experience should not be degraded
    • Users contacting the IT department or complaining is not a good outcome
    • Troubleshooting and remediation
  • Internal monitoring:
    • Can we prevent issues before they happen?
    • Performance and Cost
    • Trends (scaling patterns)
    • Learning and Improvement

Monitoring in AWS

  • AWS CloudWatch:
    • Metrics: Collect and track key metrics
    • Logs: Collect, monitor, analyze and store log files
    • Events: Send notifications when certain events happen in your AWS
    • Alarms: React in real-time to metrics / events
  • AWS X-Ray:
    • Troubleshooting application performance and errors
    • Distributed tracing of microservices
  • AWS CloudTrail:
    • Internal monitoring of API calls being made
    • Audit changes to AWS Resources by your users

AWS CloudWatch Metrics

  • CloudWatch provides metrics for every service in AWS
  • A metric is a variable to monitor (CPUUtilization, NetworkIn...)
  • Metrics belong to namespaces
  • A dimension is an attribute of a metric (instance id, environment, etc...)
  • Up to 10 dimensions per metric
  • Metrics have timestamps
  • Can create CloudWatch dashboards of metrics

EC2 Detailed monitoring

  • By default, EC2 instances send metrics "every 5 minutes"
  • With detailed monitoring (for a cost), you get data "every 1 minute"
  • Use detailed monitoring if you want to scale faster for your ASG!
  • The AWS Free Tier allows us to have 10 detailed monitoring metrics
  • Note: EC2 Memory usage is by default not pushed (must be pushed from inside the instance as a custom metric)

CloudWatch Custom Metrics

  • Possibility to define and send your own custom metrics to CloudWatch
  • Example: memory (RAM) usage, disk space, number of logged in users ...
  • Use API call PutMetricData
  • Ability to use dimensions (attributes) to segment metrics
    • Instance.id
    • Environment.name
  • Metric resolution (StorageResolution API parameter - two possible values):
    • Standard: 1 minute (60 seconds)
    • High Resolution: 1/5/10/30 second(s) - Higher cost
  • Important: Accepts metric data points two weeks in the past and two hours in the future (make sure to configure your EC2 instance time correctly)
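As a sketch, a PutMetricData request body (as passed to the AWS SDK for JavaScript) might look like this; the namespace, metric name and dimension values are illustrative:

```javascript
// Illustrative PutMetricData request body: a custom high-resolution
// memory metric with two dimensions (all names/values are examples).
const putMetricDataParams = {
  Namespace: "MyApp", // custom namespace (cannot start with "AWS/")
  MetricData: [
    {
      MetricName: "MemoryUsage",
      Value: 73.5, // e.g. percent of RAM used
      Unit: "Percent",
      Timestamp: new Date(), // < 2 weeks in the past / < 2 hours in the future
      StorageResolution: 1, // 1 = high resolution (seconds), 60 = standard
      Dimensions: [
        { Name: "InstanceId", Value: "i-1234567890abcdef0" },
        { Name: "Environment", Value: "production" },
      ],
    },
  ],
};
```

You would then pass this object to the SDK's `putMetricData` call (requires `cloudwatch:PutMetricData` IAM permission).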

AWS CloudWatch Logs

  • Applications can send logs to CloudWatch using the SDK

  • CloudWatch can collect logs from:

    • Elastic Beanstalk: collection of logs from application
    • ECS: collection from containers
    • AWS Lambda: collection from function logs
    • VPC Flow Logs: VPC specific logs
    • API Gateway
    • CloudTrail based on filter
    • CloudWatch log agents: for example on EC2 machines
    • Route53: Log DNS queries
  • CloudWatch Logs can go to:

    • Batch exporter to S3 for archival
    • Stream to ElasticSearch cluster for further analytics
  • CloudWatch Logs can use filter expressions

  • Logs storage architecture:

    • Log groups: arbitrary name, usually representing an application
    • Log stream: instances within application / log files / containers
  • Can define log expiration policies (never expire, 30 days, etc.)

  • Using the AWS CLI we can tail CloudWatch logs

  • To send logs to CloudWatch, make sure IAM permissions are correct!

  • Security: encryption of logs using KMS at the Group Level
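A sketch of the PutLogEvents request an application (or agent) sends to CloudWatch Logs; the log group and stream names are examples, and the stream is assumed to already exist:

```javascript
// Illustrative PutLogEvents request: ship one application log line.
// Log group ~ application, log stream ~ instance / container / log file.
const putLogEventsParams = {
  logGroupName: "/my-app/production",
  logStreamName: "i-1234567890abcdef0",
  logEvents: [
    {
      timestamp: Date.now(), // milliseconds since epoch
      message: "ERROR Unable to connect to database",
    },
  ],
};
```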

CloudWatch Logs for EC2

  • By default, no logs from your EC2 machine will go to CloudWatch
  • You need to run a CloudWatch agent on EC2 to push the log files you want
  • Make sure IAM permissions are correct
  • The CloudWatch log agent can be setup on-premises too

CloudWatch Logs Agent & Unified Agent

  • For virtual servers (EC2 instances, on-premise servers...)
  • CloudWatch Logs Agent
    • Old version of the agent
    • Can only send to CloudWatch Logs
  • CloudWatch Unified Agent
    • Collect additional system-level metrics such as RAM, processes, etc...
    • Collect logs to send to CloudWatch Logs
    • Centralized configuration using SSM Parameter Store

CloudWatch Unified Agent - Metrics

  • Collected directly on your Linux server / EC2 instance
  • CPU (active, guest, idle, system, user, steal)
  • Disk metrics (free, used, total), Disk IO (writes, reads, bytes, iops)
  • RAM (free, inactive, used, total, cached)
  • Netstat (number of TCP and UDP connections, net packets, bytes)
  • Processes (total, dead, blocked, idle, running, sleep)
  • Swap Space (free, used, used %)
  • Reminder: out-of-the box metrics for EC2 - disk, CPU, network (high level)

CloudWatch Logs Metric Filter

  • CloudWatch Logs can use filter expressions
    • For example, find a specific IP inside of a log
    • Or count occurrences of "ERROR" in your logs
    • Metric filters can be used to trigger alarms
  • Filters do not retroactively filter data. Filters only publish the metric data points for events that happen after the filter was created.
  • Flow: CloudWatch Logs Agent (EC2 Instance) --stream--> CloudWatch Logs --> Metric Filters --> CloudWatch Alarm --> SNS
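The "count occurrences of ERROR" example can be sketched as a PutMetricFilter request; the log group, filter and metric names are illustrative:

```javascript
// Illustrative PutMetricFilter request: count log events containing
// "ERROR" and publish the count as a custom metric (names are examples).
const putMetricFilterParams = {
  logGroupName: "/my-app/production",
  filterName: "ErrorCount",
  filterPattern: "ERROR", // matches log events containing the term ERROR
  metricTransformations: [
    {
      metricName: "ErrorCount",
      metricNamespace: "MyApp",
      metricValue: "1", // add 1 to the metric per matching log event
    },
  ],
};
```

An alarm can then be created on the resulting `MyApp/ErrorCount` metric.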

CloudWatch Alarms

  • Alarms are used to trigger notifications for any metric
  • Various options (sampling, %, max, min, etc...)
  • Alarm States:
    • OK
    • INSUFFICIENT_DATA
    • ALARM
  • Period:
    • Length of time in seconds to evaluate the metric
    • High resolution custom metrics: 10 sec, 30 sec or multiples of 60 sec
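As a sketch, a PutMetricAlarm request combining the options above (statistic, period, threshold); the alarm name, instance id and SNS topic ARN are examples:

```javascript
// Illustrative PutMetricAlarm request: go to ALARM when average
// CPUUtilization stays above 80% for 2 consecutive 5-minute periods.
const putMetricAlarmParams = {
  AlarmName: "HighCpu",
  Namespace: "AWS/EC2",
  MetricName: "CPUUtilization",
  Dimensions: [{ Name: "InstanceId", Value: "i-1234567890abcdef0" }],
  Statistic: "Average",
  Period: 300, // seconds: length of each evaluation window
  EvaluationPeriods: 2,
  Threshold: 80,
  ComparisonOperator: "GreaterThanThreshold",
  AlarmActions: ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
};
```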

CloudWatch Alarm Targets

  • Stop, Terminate, Reboot, or Recover an EC2 Instance
  • Trigger Auto Scaling Action
  • Send notification to SNS (from which you can do pretty much anything)

EC2 Instance Recovery

  • Status Check:
    • Instance status = check the EC2 VM
    • System status = check the underlying hardware
  • Recovery: Same Private, Public, Elastic IP, metadata, placement group

CloudWatch Alarm: good to know

  • Alarms can be created based on CloudWatch Logs Metrics Filters

  • To test alarms and notifications, set the alarm state to Alarm using CLI:

    ```bash
    aws cloudwatch set-alarm-state --alarm-name "myalarm" --state-value ALARM --state-reason "testing purposes"
    ```

CloudWatch Events

  • Event Pattern: Intercept events from AWS services (Sources)
    • Example sources: EC2 Instance Start, CodeBuild Failure, S3, Trusted Advisor
    • Can intercept any API call with CloudTrail integration
  • Schedule or Cron (example: create an event every 4 hours)
  • A JSON payload is created from the event and passed to a target...
    • Compute: Lambda, Batch, ECS task
    • Integration: SQS, SNS, Kinesis Data Streams, Kinesis Data Firehose
    • Orchestration: Step Functions, CodePipeline, CodeBuild
    • Maintenance: SSM, EC2 Actions
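A sketch of the two rule types above: an event pattern (intercepting an EC2 state change, as used in a PutRule call) and a schedule expression; the specific state and schedule are examples:

```javascript
// Illustrative event pattern: match EC2 instances entering "stopped".
const eventPattern = {
  source: ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  detail: { state: ["stopped"] },
};

// Illustrative schedule expression (rate or cron syntax).
const scheduleExpression = "rate(4 hours)"; // e.g. also: cron(0 12 * * ? *)
```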

Amazon EventBridge

  • EventBridge is the next evolution of CloudWatch Events
  • Default event bus: generated by AWS services (CloudWatch Events)
  • Partner event bus: receive events from SaaS service or applications (Zendesk, DataDog, Segment, Auth0...)
  • Custom Event buses: for your own applications
  • Event buses can be accessed by other AWS accounts
  • Rules: how to process the events (similar to CloudWatch Events)
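Publishing to a custom event bus can be sketched as a PutEvents request; the bus name, source and detail payload are illustrative:

```javascript
// Illustrative PutEvents entry: publish a custom application event to a
// custom EventBridge bus (bus name, source and detail are examples).
const putEventsParams = {
  Entries: [
    {
      EventBusName: "my-app-bus", // custom event bus
      Source: "com.mycompany.orders", // free-form source for your app
      DetailType: "OrderPlaced",
      Detail: JSON.stringify({ orderId: "1234", amount: 99.9 }),
    },
  ],
};
```

Rules on `my-app-bus` can then match on `Source` / `DetailType` to route the event to targets.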

Amazon EventBridge Schema Registry

  • EventBridge can analyze the events in your bus and infer the schema
  • The Schema Registry allows you to generate code for your application, that will know in advance how data is structured in the event bus
  • Schema can be versioned

Amazon EventBridge vs CloudWatch Events

  • Amazon EventBridge builds upon and extends CloudWatch Events.
  • It uses the same service API and endpoint, and the same underlying service infrastructure.
  • EventBridge allows extension to add event buses for your custom applications and your third-party SaaS apps.
  • EventBridge has the Schema Registry capability
  • EventBridge has a different name to mark the new capabilities
  • Over time, the CloudWatch Events name will be replaced with EventBridge.

AWS X-Ray

  • Debugging in Production, the good old way:
    • Test locally
    • Add log statements everywhere
    • Re-deploy in production
  • Log formats differ across applications, which makes analytics in CloudWatch hard.
  • Debugging: monolith "easy", distributed services "hard"
  • No common views of your entire architecture!
  • Enter... AWS X-Ray!

AWS X-Ray advantages

  • Troubleshooting performance (bottlenecks)
  • Understand dependencies in a microservice architecture
  • Pinpoint service issues
  • Review request behaviour
  • Find errors and exceptions
  • Are we meeting time SLA?
  • Where am I throttled?
  • Identify users that are impacted

X-Ray compatibility

  • AWS Lambda
  • Elastic Beanstalk
  • ECS
  • ELB
  • API Gateway
  • EC2 Instances or any application server (even on premise)

AWS X-Ray Leverages Tracing

  • Tracing is an end-to-end way to follow a "request"
  • Each component dealing with the request adds its own "trace"
  • Tracing is made of segments (+ sub segments)
  • Annotations can be added to traces to provide extra-information
  • Ability to trace:
    • Every request
    • Sample request (as a % for example or a rate per minute)
  • X-Ray Security:
    • IAM for authorization
    • KMS for encryption at rest

Enable AWS X-Ray

  1. Your code (Java, Python, Go, Node.js, .NET) must import the AWS X-Ray SDK:

    • Very little code modification needed
    • The application SDK will then capture:
      • Calls to AWS services
      • HTTP / HTTPS requests
      • Database Calls (MySQL, PostgreSQL, DynamoDB)
      • Queue calls (SQS)
  2. Install the X-Ray daemon or enable X-Ray AWS Integration

    • X-Ray daemon works as a low level UDP packet interceptor (Linux / Windows / Mac...)
    • AWS Lambda / other AWS services already run the X-Ray daemon for you
    • Each application must have the IAM rights to write data to X-Ray

The X-Ray magic

  • X-Ray service collects data from all the different services
  • Service map is computed from all the segments and traces
  • X-Ray is graphical, so even non technical people can help troubleshoot

AWS X-Ray Troubleshooting

  • If X-Ray is not working on EC2
    • Ensure the EC2 IAM Role has the proper permissions
    • Ensure the EC2 instance is running the X-Ray Daemon
  • To enable on AWS Lambda:
    • Ensure it has an IAM execution role with proper policy (AWSXrayWriteOnlyAccess)
    • Ensure that X-Ray is imported in the code

X-Ray Instrumentation in your code

  • Instrumentation means measuring a product's performance, diagnosing errors, and writing trace information.
  • To instrument your application code, you use the X-Ray SDK
  • Many SDKs require only configuration changes
  • You can modify your application code to customize and annotate the data that the SDK sends to X-Ray, using interceptors, filters, handlers, middleware...
```javascript
const express = require("express");
const AWSXRay = require("aws-xray-sdk");

const app = express();

// Open an X-Ray segment for every incoming request
app.use(AWSXRay.express.openSegment("MyApp"));

app.get("/", (req, res) => {
  res.render("index");
});

// Close the segment once the routes have run
app.use(AWSXRay.express.closeSegment());
```

X-Ray Concepts

  • Segments: each application / service will send them
  • Subsegments: if you need more details in your segment
  • Trace: segments collected together to form an end-to-end trace
  • Sampling: decrease the amount of requests sent to X-Ray, reduce cost
  • Annotations: Key Value pairs used to index traces and use with filters
  • Metadata: Key Value pairs, not indexed, not used for searching
  • The X-Ray daemon / agent has a config to send traces cross account:
    • make sure the IAM permissions are correct - the agent will assume the role
    • This allows you to have a central account for all your application tracing

X-Ray Sampling Rules

  • With sampling rules, you control the amount of data that you record
  • You can modify sampling rules without changing your code
  • By default, the X-Ray SDK records the first request each second, and five percent of any additional requests.
  • One request per second is the reservoir, which ensures that at least one trace is recorded each second as long as the service is serving requests.
  • Five percent is the rate at which additional requests beyond the reservoir size are sampled.
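The reservoir-plus-rate behaviour can be sketched as a small function (the averaging here is a simplification of how the SDK actually samples):

```javascript
// Expected number of traces recorded per second under an X-Ray sampling
// rule: the reservoir is always sampled, then `rate` applies to the rest.
function expectedSampled(requestsPerSecond, reservoir = 1, rate = 0.05) {
  if (requestsPerSecond <= reservoir) return requestsPerSecond;
  return reservoir + (requestsPerSecond - reservoir) * rate;
}

// With the default rule and 100 requests/second:
// 1 (reservoir) + 99 * 0.05 = 5.95 traces/second on average
const sampled = expectedSampled(100);
```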

X-Ray Custom Sampling Rules

  • You can create your own rules with the reservoir and rate

  • Example: Higher minimum rate for POSTs

    • Rule name: POST minimum
    • Priority: 100
    • Reservoir: 10
    • Rate: 0.10
    • Service name: *
    • Service type: *
    • Host: *
    • HTTP method: POST
    • URL path: *
    • Resource ARN: *
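The "POST minimum" rule above can be sketched as a CreateSamplingRule request (values taken from the slide; `Version: 1` is the document version the API expects):

```javascript
// Illustrative CreateSamplingRule request matching the "POST minimum"
// example: trace at least 10 POSTs per second, then 10% of the rest.
const createSamplingRuleParams = {
  SamplingRule: {
    RuleName: "POST minimum",
    Priority: 100, // lower number = evaluated first
    ReservoirSize: 10,
    FixedRate: 0.1,
    ServiceName: "*",
    ServiceType: "*",
    Host: "*",
    HTTPMethod: "POST",
    URLPath: "*",
    ResourceARN: "*",
    Version: 1,
  },
};
```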

X-Ray Write APIs (used by the X-Ray daemon)

  • PutTraceSegments: Uploads segment documents to AWS X-Ray
  • PutTelemetryRecords: Used by the AWS X-Ray daemon to upload telemetry.
    • SegmentsReceivedCount, SegmentsRejectedCounts, BackendConnectionErrors...
  • GetSamplingRules: Retrieve all sampling rules (to know what/when to send)
  • GetSamplingTargets & GetSamplingStatisticSummaries: advanced
  • The X-Ray daemon needs to have an IAM policy authorizing the correct API calls to function correctly
```json
{
  "Effect": "Allow",
  "Action": [
    "xray:PutTraceSegments",
    "xray:PutTelemetryRecords",
    "xray:GetSamplingRules",
    "xray:GetSamplingTargets",
    "xray:GetSamplingStatisticSummaries"
  ],
  "Resource": ["*"]
}
```
  • arn:aws:iam::aws:policy/AWSXrayWriteOnlyAccess
  • GetServiceGraph: main graph
  • BatchGetTraces: Retrieves a list of traces specified by ID. Each trace is a collection of segment documents that originates from a single request.
  • GetTraceSummaries: Retrieves IDs and annotations for traces available for a specified time frame using an optional filter. To get the full traces, pass the trace IDs to BatchGetTraces.
  • GetTraceGraph: Retrieves a service graph for one or more specific trace IDs.
```json
{
  "Effect": "Allow",
  "Action": [
    "xray:GetSamplingRules",
    "xray:GetSamplingTargets",
    "xray:GetSamplingStatisticSummaries",
    "xray:BatchGetTraces",
    "xray:GetServiceGraph",
    "xray:GetTraceGraph",
    "xray:GetTraceSummaries",
    "xray:GetGroups",
    "xray:GetGroup",
    "xray:GetTimeSeriesServiceStatistics"
  ],
  "Resource": ["*"]
}
```

X-Ray with Elastic Beanstalk

  • AWS Elastic Beanstalk platforms include the X-Ray daemon

  • You can run the daemon by setting an option in the Elastic Beanstalk console or with a configuration file (in .ebextensions/xray-daemon.config)

    ```yaml
    option_settings:
      aws:elasticbeanstalk:xray:
        XRayEnabled: true
    ```
  • Make sure to give your instance profile the correct IAM permissions so that the X-Ray daemon can function correctly

  • Then make sure your application code is instrumented with the X-Ray SDK

  • Note: The X-Ray daemon is not provided for Multicontainer Docker

ECS + X-Ray integration options

ECS Cluster:

  1. X-Ray Container as a Daemon: the X-Ray Daemon runs once on each EC2 Instance that is part of the ECS Cluster, so all the containers running on that instance connect to this one X-Ray Daemon

  2. X-Ray Container as a "Side Car": the X-Ray Daemon runs alongside each application container, not once per EC2 Instance

  3. Fargate Cluster: we don't have control over the underlying infrastructure, so the "Side Car" pattern is used

Example: ECS + X-Ray Task Definition

```json
{
  "portMappings": [
    {
      "hostPort": 0,
      "containerPort": 2000,
      "protocol": "udp"
    }
  ],
  "environment": [
    {
      "name": "AWS_XRAY_DAEMON_ADDRESS",
      "value": "xray-daemon:2000"
    }
  ],
  "links": ["xray-daemon"]
}
```

AWS CloudTrail

  • Provides governance, compliance and audit for your AWS Account
  • CloudTrail is enabled by default!
  • Get a history of events / API calls made within your AWS Account by:
    • Console
    • SDK
    • CLI
    • AWS Services
  • Can put logs from CloudTrail into CloudWatch Logs or S3
  • A trail can be applied to All Regions (default) or a single Region.
  • If a resource is deleted in AWS, investigate CloudTrail first!

CloudTrail Events

  • Management Events:
    • Operations that are performed on resources in your AWS account
    • Examples:
      • Configuring security (IAM AttachRolePolicy)
      • Configuring rules for routing data (Amazon EC2 CreateSubnet)
      • Setting up logging (AWS CloudTrail CreateTrail)
    • By default, trails are configured to log management events.
    • Can separate Read Events (that don't modify resources) from Write Events (that may modify resources)
  • Data Events:
    • By default, data events are not logged (because of high-volume operations)
    • Amazon S3 object-level activity (ex: GetObject, DeleteObject, PutObject): can separate Read and Write Events
    • AWS Lambda function execution activity (the Invoke API)
  • CloudTrail Insights Events
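Investigating "who deleted this resource?" can be sketched as a LookupEvents request; the attribute value and time window are illustrative:

```javascript
// Illustrative CloudTrail LookupEvents request: find DeleteBucket API
// calls from the last 24 hours (event name / window are examples).
const lookupEventsParams = {
  LookupAttributes: [
    { AttributeKey: "EventName", AttributeValue: "DeleteBucket" },
  ],
  StartTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
  EndTime: new Date(),
};
```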

CloudTrail Insights

  • Enable CloudTrail Insights to detect unusual activity in your account:
    • inaccurate resource provisioning
    • hitting service limits
    • Bursts of AWS IAM actions
    • Gaps in periodic maintenance activity
  • CloudTrail Insights analyzes normal management events to create a baseline
  • And then continuously analyzes write events to detect unusual patterns
    • Anomalies appear in the CloudTrail console
    • Event is sent to Amazon S3
    • An EventBridge event is generated (for automation needs)

CloudTrail Events Retention

  • Events are stored for 90 days in CloudTrail
  • To keep events beyond this period, log them to S3 and use Athena
  • Flow: All Events --> CloudTrail (90-day retention) --log--> S3 Bucket (long-term retention) --analyze--> Athena

CloudTrail vs CloudWatch vs X-Ray

  • CloudTrail:
    • Audit API calls made by users / services / AWS console
    • Useful to detect unauthorized calls or root cause of changes
  • CloudWatch:
    • CloudWatch Metrics over time for monitoring
    • CloudWatch Logs for storing application log
    • CloudWatch Alarms to send notifications in case of unexpected metrics
  • X-Ray:
    • Automated Trace Analysis & Central Service Map Visualization
    • Latency, Errors and Fault analysis
    • Request tracking across distributed systems