Monitoring & Logging: CloudWatch, Stackdriver, Azure Monitor, Alerts, and Metrics
📅 Published: Feb 2026
⏱️ Estimated Reading Time: 16 minutes
🏷️ Tags: Cloud Monitoring, CloudWatch, Azure Monitor, Stackdriver, Alerts, Metrics, Observability
Introduction: Why Monitoring Matters
You cannot fix what you cannot see. Monitoring is the eyes and ears of your cloud infrastructure. It tells you what is happening, what has happened, and what might happen next.
Without monitoring, you are flying blind. You discover problems when users complain, not when they start. You cannot optimize what you cannot measure. You cannot prove your service is meeting its commitments.
Cloud providers offer comprehensive monitoring services that collect metrics, aggregate logs, and trigger alerts. This guide covers the essential monitoring services across AWS, Azure, and Google Cloud.
The Three Pillars of Observability
Modern cloud monitoring is built on three pillars:
Metrics
Numerical measurements over time. CPU utilization, request count, error rate, latency. Metrics answer "what is happening right now?"
Logs
Detailed records of discrete events. Application logs, system logs, audit logs. Logs answer "what happened, and why?"
Traces
Records of requests as they flow through distributed systems. A single user request might touch a load balancer, two microservices, a database, and a cache. Traces answer "where is the delay?"
This guide focuses on metrics and logs, the foundation of cloud monitoring.
Cloud Monitoring Services Across Providers
| Provider | Metrics & Alarms | Logs | Dashboards |
|---|---|---|---|
| AWS | CloudWatch | CloudWatch Logs | CloudWatch Dashboards |
| Azure | Azure Monitor Metrics | Log Analytics | Azure Dashboards |
| Google Cloud | Cloud Monitoring | Cloud Logging | Monitoring Dashboards |
While the names differ, the capabilities are remarkably similar across providers.
CloudWatch (AWS)
What is CloudWatch?
CloudWatch is AWS's monitoring and observability service. It collects metrics from AWS services, your applications, and on-premises resources. It stores logs, sets alarms, and creates dashboards.
CloudWatch Metrics
Metrics are data points about your resources. Every AWS service automatically sends metrics to CloudWatch.
Common Metrics by Service:
| Service | Key Metrics |
|---|---|
| EC2 | CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps |
| RDS | CPUUtilization, DatabaseConnections, FreeStorageSpace, ReadLatency |
| Lambda | Invocations, Duration, Errors, Throttles |
| S3 | NumberOfObjects, BucketSizeBytes, 4xxErrors, 5xxErrors |
| ELB | RequestCount, TargetResponseTime, HTTPCode_ELB_5XX_Count |
| SQS | ApproximateNumberOfMessagesVisible, ApproximateNumberOfMessagesNotVisible |
Metric Characteristics:
Standard Metrics: Included at no extra charge, 5-minute granularity with basic monitoring, 1-minute with detailed monitoring (a paid option for some services, such as EC2)
Custom Metrics: Paid, you can publish any data, with high-resolution granularity down to 1 second
Retention: 15 months for all metrics; older data points are automatically rolled up to coarser resolution
CloudWatch Alarms
An alarm watches a single metric over a specified period and performs actions when the metric crosses a threshold.
Alarm Components:
Metric: What to watch (CPUUtilization, RequestCount, etc.)
Statistic: How to evaluate (Average, Maximum, Minimum, Sum, SampleCount)
Period: How long to evaluate (60 seconds, 300 seconds, etc.)
Threshold: The value that triggers the alarm (CPU > 80%)
Evaluation Periods: How many consecutive periods before triggering (3 out of 3)
Actions: What to do when triggered (SNS notification, Auto Scaling, EC2 action)
Alarm States:
OK: Metric is within the threshold
ALARM: Metric crossed the threshold for the evaluation period
INSUFFICIENT_DATA: Not enough data to evaluate (new instance, no traffic)
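The components and states above can be tied together in a short sketch. This is a simplified model of how an alarm evaluates its metric, not the actual CloudWatch implementation; the function name `evaluate_alarm` is illustrative:

```python
# Simplified sketch of CloudWatch-style alarm evaluation (illustrative only).
# Each period's datapoints are aggregated with the chosen statistic; the alarm
# enters ALARM only after N consecutive breaching evaluation periods.

def evaluate_alarm(period_values, threshold, evaluation_periods, statistic="Average"):
    """period_values: chronological list of per-period datapoint lists (None = missing)."""
    stats = {"Average": lambda v: sum(v) / len(v), "Maximum": max,
             "Minimum": min, "Sum": sum, "SampleCount": len}
    agg = stats[statistic]

    recent = period_values[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(p is None or not p for p in recent):
        return "INSUFFICIENT_DATA"   # new instance, no traffic, or agent down
    if all(agg(p) > threshold for p in recent):
        return "ALARM"
    return "OK"
```

For example, with a threshold of CPU > 80 and 3 evaluation periods, three consecutive breaching periods yield ALARM, while a single dip back under the threshold resets the state to OK.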
CloudWatch Logs
CloudWatch Logs centralize log storage from AWS services and your applications.
Log Sources:
AWS service logs (Lambda, VPC Flow Logs, RDS logs)
EC2 instances with CloudWatch Agent
On-premises servers with CloudWatch Agent
Custom application logging
Log Features:
Log Groups: Container for log streams (one per application or component)
Log Streams: Sequence of log events from a single source (one per instance)
Log Retention: Configurable from 1 day to 10 years, or set to never expire; export to S3 for cheaper long-term archival
Metric Filters: Extract metrics from log data (count errors, sum request sizes)
Subscription Filters: Stream logs in near real time to Lambda, Kinesis, or Amazon OpenSearch Service
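A metric filter is essentially a pattern applied to every log event, with matches counted into a metric per time period. The core idea can be sketched like this (illustrative; this is not CloudWatch's filter pattern syntax):

```python
# Sketch of a metric filter: count log events matching a pattern,
# bucketed into fixed periods (illustrative, not CloudWatch's implementation).
from collections import Counter

def error_count_per_period(events, pattern="ERROR", period_seconds=60):
    """events: list of (unix_timestamp, message) tuples."""
    counts = Counter()
    for ts, message in events:
        if pattern in message:
            counts[ts - ts % period_seconds] += 1  # floor timestamp to period start
    return dict(counts)
```

The resulting per-period counts behave like any other metric: you can graph them on a dashboard or attach an alarm (for example, "more than 10 ERROR lines in 5 minutes").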
CloudWatch Dashboards
Dashboards create customizable views of your metrics in one place.
Dashboard Features:
Widgets for metrics, logs, and text
Cross-account views (view multiple accounts in one dashboard)
Automatic refresh at configurable intervals
Sharing with teams or public (read-only)
Azure Monitor
What is Azure Monitor?
Azure Monitor is Microsoft's comprehensive monitoring solution. It collects, analyzes, and acts on telemetry from Azure resources, on-premises environments, and applications.
Azure Monitor Metrics
Metrics are lightweight numeric values collected at regular intervals and stored in a time-series database optimized for near real-time alerting and fast retrieval.
Metric Types:
Platform Metrics: Provided by Azure services automatically
Custom Metrics: You can send your own metrics
Common Metrics by Service:
| Service | Key Metrics |
|---|---|
| Virtual Machines | Percentage CPU, Available Memory, Disk Read Bytes, Network In Total |
| App Service | Response Time, Requests, HTTP 5xx, Memory Working Set |
| SQL Database | DTU Consumption, CPU Percentage, Sessions Count, Deadlocks |
| Storage | Transactions, Ingress, Egress, Success E2E Latency |
Azure Monitor Alerts
Alerts proactively notify you when conditions are found in your monitoring data.
Alert Types:
Metric Alerts: Trigger when a metric crosses a threshold
Log Alerts: Trigger when a log query returns results
Activity Log Alerts: Trigger when specific events occur
Alert Actions:
Action Groups define what happens when an alert triggers:
Email and SMS notifications
Webhook calls to external systems
ITSM integration (ServiceNow, etc.)
Automation runbooks
Azure Functions
Voice calls
Log Analytics
Log Analytics is the query engine for Azure Monitor logs. It uses Kusto Query Language (KQL), a read-only language designed for exploring logs and telemetry.
Log Sources:
Azure resources (diagnostic settings enabled)
Virtual machines with Azure Monitor Agent
Application Insights
Custom logs
Common Queries:
```kql
// Find 5xx errors in the last hour
AppServiceConsoleLogs
| where TimeGenerated > ago(1h)
| where Level == "Error"
| where Message contains "500"
| project TimeGenerated, RoleInstance, Message
```

```kql
// CPU usage across all VMs
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCPU = avg(CounterValue) by Computer
| order by AvgCPU desc
```
Application Insights
Application Insights is Azure Monitor's application performance monitoring (APM) service. It monitors live web applications to detect performance anomalies and diagnose issues.
Capabilities:
Request rates, response times, and failure rates
Dependency tracking (database, HTTP calls, external services)
Exception logging with stack traces
Performance counters (CPU, memory, network)
Live Metrics with near real-time performance data
Availability tests (ping tests from global locations)
Google Cloud Operations (formerly Stackdriver)
What is Google Cloud Operations?
Google Cloud Operations is Google's integrated monitoring, logging, and diagnostics suite. It brings together metrics, logs, and traces in a unified interface.
Cloud Monitoring (formerly Stackdriver Monitoring)
Cloud Monitoring collects metrics from Google Cloud services, applications, and on-premises infrastructure.
Metric Types:
Agent metrics: From installed monitoring agent (CPU, disk, network, process)
Platform metrics: From Google Cloud services
Custom metrics: From applications via the API
Workspaces (now called metrics scopes):
A metrics scope organizes which projects' monitoring data you can view together
Can monitor multiple projects from a single scoping project
Supports AWS accounts and on-premises systems
Uptime Checks
Uptime checks verify that your services are available and responsive by probing them from multiple global locations.
Check Types:
HTTP/HTTPS: Verifies status code and, optionally, response content
TCP: Verifies port connectivity
Alerting Policies
Alerting policies define conditions that trigger notifications.
Conditions:
Metric threshold: When a metric exceeds a value
Metric absence: When a metric stops reporting data
Logs-based: When a log query returns results
Uptime check: When a check fails
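The metric-absence condition is worth a closer look because it inverts the usual logic: the alert fires when data stops arriving. A minimal sketch of that check (illustrative, not the Cloud Monitoring API):

```python
# Sketch of a "metric absence" condition: fire when no datapoint has
# arrived within the absence window (illustrative only).

def metric_absent(last_datapoint_ts, now_ts, absence_window_seconds):
    """True if the metric has not reported within the window."""
    if last_datapoint_ts is None:  # the metric has never reported at all
        return True
    return now_ts - last_datapoint_ts > absence_window_seconds
```

Absence conditions catch failure modes that threshold alerts miss entirely: a crashed agent or a deleted instance sends no breaching datapoints, so no threshold alert ever fires.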
Notification Channels:
Email
SMS
Slack
PagerDuty
Webhooks
Pub/Sub
Cloud Logging (formerly Stackdriver Logging)
Cloud Logging stores, searches, and analyzes log data from Google Cloud services and your applications.
Log Sources:
Google Cloud services (GKE, Compute Engine, Cloud Run, etc.)
Custom logs from applications
On-premises and other cloud logs
Log Explorer:
Powerful query interface for searching logs
Save and share queries
Create metrics from logs
Export to BigQuery, Pub/Sub, or Cloud Storage
Log-Based Metrics:
Create metrics from log data
Use for alerting and dashboarding
Count occurrences of specific log messages
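Log-based metrics apply the same idea as CloudWatch metric filters: derive a counter from matching log entries. Cloud Logging works naturally with JSON-structured entries, so a sketch might group by severity (illustrative; the `severity` field follows Cloud Logging's structured log convention):

```python
# Sketch: derive a counter metric from JSON-structured log lines,
# grouped by severity (illustrative, not the Cloud Logging API).
import json
from collections import Counter

def count_by_severity(log_lines):
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured lines
        counts[entry.get("severity", "DEFAULT")] += 1
    return dict(counts)
```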
Alerting Best Practices
What to Alert On
Alert on symptoms, not causes.
Bad: CPU > 80% (this is a cause)
Good: Error rate > 1% for 5 minutes (this is a symptom users experience)
Alert on actionable conditions.
If an alert fires, someone should be able to do something about it. Avoid alert fatigue from non-actionable notifications.
Use multiple thresholds.
Warning: Something is degrading, needs attention soon
Critical: Immediate action required
Info: Informational, no action needed
Alert Design Principles
One alert, one purpose.
Each alert should indicate a single problem with a clear remediation path.
Include remediation steps in the notification.
Tell the on-call engineer what to do. A link to a runbook is ideal.
Set appropriate thresholds.
Too sensitive → alert fatigue. Too insensitive → missed problems.
Use multiple evaluation periods.
Require 3 of 5 periods to prevent flapping.
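The "M of N" rule is a simple debounce: a single noisy datapoint cannot trigger the alert, but a sustained problem still fires quickly. A sketch of the logic (illustrative):

```python
# Sketch of "M of N" alerting: fire only when at least M of the last N
# evaluation periods breached the threshold (prevents flapping).

def should_alert(breaches, m, n):
    """breaches: chronological list of booleans, one per evaluation period."""
    window = breaches[-n:]
    return len(window) == n and sum(window) >= m
```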
Alert on absence of data.
If a critical instance stops sending metrics, that is a problem.
Common Alert Examples
High Error Rate
Metric: HTTP 5xx count
Statistic: Sum
Period: 5 minutes
Threshold: > 10 errors
Evaluation: 2 of 2 periods
Action: Page on-call engineer
High Latency
Metric: Response time (p99)
Statistic: Percentile 99
Period: 5 minutes
Threshold: > 2000 ms
Evaluation: 3 of 5 periods
Action: Page on-call engineer
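A p99 of 2000 ms means 99% of requests completed in 2 seconds or less; the slowest 1% took longer. A minimal nearest-rank percentile sketch makes the statistic concrete (real backends typically use interpolation or streaming approximations instead):

```python
# Nearest-rank percentile (illustrative). Production monitoring backends
# usually interpolate or approximate rather than sort every sample.
import math

def percentile(samples, p):
    """p in (0, 100]; returns the value at the nearest-rank percentile."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]
```

Percentiles matter for latency alerts because averages hide tail pain: a handful of 10-second requests barely moves the mean but is exactly what your unhappiest users experience.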
Low Disk Space
Metric: Free disk space %
Statistic: Minimum
Period: 5 minutes
Threshold: < 10%
Evaluation: 2 of 2 periods
Action: Send warning email
Instance Down
Metric: Status Check Failed (Instance)
Statistic: Maximum
Period: 1 minute
Threshold: 1
Evaluation: 1 of 1 period
Action: Page on-call engineer, auto-recovery
Real-World Monitoring Scenarios
Scenario 1: Web Application Monitoring
A web application runs on EC2 instances behind a load balancer.
Metrics to Monitor:
Load balancer request count
Load balancer 5xx errors
Target response time
EC2 CPU utilization
EC2 status checks
Alerts:
5xx errors > 1% for 5 minutes → page on-call
Response time > 1 second for 5 minutes → page on-call
CPU > 80% for 10 minutes → scale up, notify
Instance status check failed → auto-recovery, notify
Dashboards:
Overview dashboard: request rate, error rate, response time
Instance dashboard: CPU, memory, disk per instance
Business dashboard: orders per minute, conversion rate
Scenario 2: Database Monitoring
A PostgreSQL database running on RDS.
Metrics to Monitor:
CPU utilization
Database connections
Free storage space
Read/Write latency
Replica lag (if replicas configured)
Alerts:
Connections > 80% of max → page on-call
Free storage < 10GB → page on-call
Replica lag > 60 seconds → page on-call
CPU > 90% for 15 minutes → page on-call
Connection failures → page on-call
Scenario 3: Multi-Environment Monitoring
Development, staging, and production environments require different alerting strategies.
Production:
Aggressive thresholds
Page on-call for critical issues
24/7 monitoring
Staging:
Same metrics as production
Email alerts only
No paging (unless staging is customer-facing)
Development:
Basic metrics
No alerts (teams monitor during work hours)
Dashboard Design Principles
Show what matters.
Do not put every metric on the dashboard. Include only metrics that indicate service health.
Use appropriate granularity.
Real-time dashboards for incident response: 1-minute granularity
Operational dashboards: 5-minute granularity
Business dashboards: hourly or daily aggregates
Provide context.
Include annotations for deployments, configuration changes, and known incidents.
Design for your audience.
Engineering dashboards: detailed metrics, logs, traces
Management dashboards: high-level SLIs, business metrics
Executive dashboards: uptime, revenue impact
Monitoring Costs
Monitoring services have costs that can scale with usage.
Cost Drivers:
Number of metrics collected (custom metrics, high-resolution metrics)
Number of API calls (putting metrics, querying logs)
Log ingestion volume (GB per day)
Log retention duration
Number of dashboards and alerts
Cost Optimization:
Use standard metrics when possible (they are free or low cost)
Set appropriate log retention (30-90 days is typical)
Archive logs to cheaper storage (S3, Blob, GCS) for long-term retention
Avoid high-resolution metrics unless needed
Consolidate dashboards to reduce API calls
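The interaction between ingestion volume and retention is easy to estimate with back-of-envelope arithmetic. The sketch below uses placeholder prices (the `price_*` defaults are hypothetical, not any provider's actual rates; check current pricing pages):

```python
# Back-of-envelope log cost estimate. Prices are PLACEHOLDERS for
# illustration, not actual provider pricing.

def monthly_log_cost(gb_per_day, retention_days,
                     price_ingest_per_gb=0.50, price_storage_per_gb_month=0.03):
    ingest = gb_per_day * 30 * price_ingest_per_gb
    # steady-state stored volume is roughly daily volume * retention window
    storage = gb_per_day * retention_days * price_storage_per_gb_month
    return round(ingest + storage, 2)
```

Running the numbers shows why ingestion usually dominates: cutting retention shrinks only the storage term, so filtering noisy logs before ingestion often saves more than shortening retention.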
Summary
| Provider | Metrics | Logs | Alerts | Dashboards |
|---|---|---|---|---|
| AWS | CloudWatch Metrics | CloudWatch Logs | CloudWatch Alarms | CloudWatch Dashboards |
| Azure | Azure Monitor Metrics | Log Analytics | Azure Monitor Alerts | Azure Dashboards |
| Google Cloud | Cloud Monitoring | Cloud Logging | Alerting Policies | Monitoring Dashboards |
The principles are the same across providers:
Collect metrics automatically from services
Add custom metrics for application-specific data
Centralize logs from all sources
Alert on symptoms users experience
Create dashboards for different audiences
Practice Questions
A web application is experiencing intermittent 5xx errors. How would you set up monitoring to detect this and alert the on-call engineer?
A production database is running low on disk space. What metric would you monitor, and what threshold would you set?
You need to create a dashboard for executives showing application uptime over the last 30 days. How would you design this?
Your team is getting too many alerts and starting to ignore them. How would you improve the alerting strategy?
A developer wants to log application-specific events and create alerts based on them. How would you implement this across providers?
Learn More
Practice cloud monitoring with hands-on exercises in our interactive labs:
https://devops.trainwithsky.com/