Monitoring & Logging: CloudWatch, Stackdriver, Azure Monitor, Alerts, and Metrics
📅 Published: Feb 2026
⏱️ Estimated Reading Time: 16 minutes
🏷️ Tags: Cloud Monitoring, CloudWatch, Azure Monitor, Stackdriver, Alerts, Metrics, Observability
Introduction: Why Monitoring Matters
You cannot fix what you cannot see. Monitoring is the eyes and ears of your cloud infrastructure. It tells you what is happening, what has happened, and what might happen next.
Without monitoring, you are flying blind. You discover problems when users complain, not when they start. You cannot optimize what you cannot measure. You cannot prove your service is meeting its commitments.
Cloud providers offer comprehensive monitoring services that collect metrics, aggregate logs, and trigger alerts. This guide covers the essential monitoring services across AWS, Azure, and Google Cloud.
The Three Pillars of Observability
Modern cloud monitoring is built on three pillars:
Metrics
Numerical measurements over time. CPU utilization, request count, error rate, latency. Metrics answer "what is happening right now?"
Logs
Detailed records of discrete events. Application logs, system logs, audit logs. Logs answer "what happened, and why?"
Traces
Records of requests as they flow through distributed systems. A single user request might touch a load balancer, two microservices, a database, and a cache. Traces answer "where is the delay?"
This guide focuses on metrics and logs, the foundation of cloud monitoring.
Cloud Monitoring Services Across Providers
| Provider | Metrics & Alarms | Logs | Dashboards |
|---|---|---|---|
| AWS | CloudWatch | CloudWatch Logs | CloudWatch Dashboards |
| Azure | Azure Monitor Metrics | Log Analytics | Azure Dashboards |
| Google Cloud | Cloud Monitoring | Cloud Logging | Monitoring Dashboards |
While the names differ, the capabilities are remarkably similar across providers.
CloudWatch (AWS)
What is CloudWatch?
CloudWatch is AWS's monitoring and observability service. It collects metrics from AWS services, your applications, and on-premises resources. It stores logs, sets alarms, and creates dashboards.
CloudWatch Metrics
Metrics are data points about your resources. Every AWS service automatically sends metrics to CloudWatch.
Common Metrics by Service:
| Service | Key Metrics |
|---|---|
| EC2 | CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps |
| RDS | CPUUtilization, DatabaseConnections, FreeStorageSpace, ReadLatency |
| Lambda | Invocations, Duration, Errors, Throttles |
| S3 | NumberOfObjects, BucketSizeBytes, 4xxErrors, 5xxErrors |
| ELB | RequestCount, TargetResponseTime, HTTPCode_ELB_5XX_Count |
| SQS | ApproximateNumberOfMessagesVisible, ApproximateNumberOfMessagesNotVisible |
Metric Characteristics:
Standard Metrics: Included at no extra charge, 5-minute granularity with basic monitoring, 1-minute with detailed monitoring (a paid option for some services, such as EC2)
Custom Metrics: Paid, you can publish any data, with high-resolution granularity down to 1 second
Retention: 15 months for all metrics; older data points are automatically rolled up to coarser resolution
CloudWatch Alarms
An alarm watches a single metric over a specified period and performs actions when the metric crosses a threshold.
Alarm Components:
Metric: What to watch (CPUUtilization, RequestCount, etc.)
Statistic: How to evaluate (Average, Maximum, Minimum, Sum, SampleCount)
Period: How long to evaluate (60 seconds, 300 seconds, etc.)
Threshold: The value that triggers the alarm (CPU > 80%)
Evaluation Periods: How many consecutive periods before triggering (3 out of 3)
Actions: What to do when triggered (SNS notification, Auto Scaling, EC2 action)
Alarm States:
OK: Metric is within the threshold
ALARM: Metric crossed the threshold for the evaluation period
INSUFFICIENT_DATA: Not enough data to evaluate (new instance, no traffic)
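The components and states above can be tied together in a short sketch. This is a simplified model of how an alarm evaluates its metric, not the actual CloudWatch implementation; the function name `evaluate_alarm` is illustrative:

```python
# Simplified sketch of CloudWatch-style alarm evaluation (illustrative only).
# Each period's datapoints are aggregated with the chosen statistic; the alarm
# enters ALARM only after N consecutive breaching evaluation periods.

def evaluate_alarm(period_values, threshold, evaluation_periods, statistic="Average"):
    """period_values: chronological list of per-period datapoint lists (None = missing)."""
    stats = {"Average": lambda v: sum(v) / len(v), "Maximum": max,
             "Minimum": min, "Sum": sum, "SampleCount": len}
    agg = stats[statistic]

    recent = period_values[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(p is None or not p for p in recent):
        return "INSUFFICIENT_DATA"   # new instance, no traffic, or agent down
    if all(agg(p) > threshold for p in recent):
        return "ALARM"
    return "OK"
```

For example, with a threshold of CPU > 80 and 3 evaluation periods, three consecutive breaching periods yield ALARM, while a single dip back under the threshold resets the state to OK.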
CloudWatch Logs
CloudWatch Logs centralize log storage from AWS services and your applications.
Log Sources:
AWS service logs (Lambda, VPC Flow Logs, RDS logs)
EC2 instances with CloudWatch Agent
On-premises servers with CloudWatch Agent
Custom application logging
Log Features:
Log Groups: Container for log streams (one per application or component)
Log Streams: Sequence of log events from a single source (one per instance)
Log Retention: Configurable from 1 day to 10 years, or set to never expire; export to S3 for cheaper long-term archival
Metric Filters: Extract metrics from log data (count errors, sum request sizes)
Subscription Filters: Stream logs in near real time to Lambda, Kinesis, or Amazon OpenSearch Service
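A metric filter is essentially a pattern applied to every log event, with matches counted into a metric per time period. The core idea can be sketched like this (illustrative; this is not CloudWatch's filter pattern syntax):

```python
# Sketch of a metric filter: count log events matching a pattern,
# bucketed into fixed periods (illustrative, not CloudWatch's implementation).
from collections import Counter

def error_count_per_period(events, pattern="ERROR", period_seconds=60):
    """events: list of (unix_timestamp, message) tuples."""
    counts = Counter()
    for ts, message in events:
        if pattern in message:
            counts[ts - ts % period_seconds] += 1  # floor timestamp to period start
    return dict(counts)
```

The resulting per-period counts behave like any other metric: you can graph them on a dashboard or attach an alarm (for example, "more than 10 ERROR lines in 5 minutes").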
CloudWatch Dashboards
Dashboards create customizable views of your metrics in one place.
Dashboard Features:
Widgets for metrics, logs, and text
Cross-account views (view multiple accounts in one dashboard)
Automatic refresh at configurable intervals
Sharing with teams or public (read-only)
Azure Monitor
What is Azure Monitor?
Azure Monitor is Microsoft's comprehensive monitoring solution. It collects, analyzes, and acts on telemetry from Azure resources, on-premises environments, and applications.
Azure Monitor Metrics
Metrics are lightweight numeric values collected at regular intervals and stored in a time-series database optimized for near real-time alerting and fast retrieval.
Metric Types:
Platform Metrics: Provided by Azure services automatically
Custom Metrics: You can send your own metrics
Common Metrics by Service:
| Service | Key Metrics |
|---|---|
| Virtual Machines | Percentage CPU, Available Memory, Disk Read Bytes, Network In Total |
| App Service | Response Time, Requests, HTTP 5xx, Memory Working Set |
| SQL Database | DTU Consumption, CPU Percentage, Sessions Count, Deadlocks |
| Storage | Transactions, Ingress, Egress, Success E2E Latency |
Azure Monitor Alerts
Alerts proactively notify you when conditions are found in your monitoring data.
Alert Types:
Metric Alerts: Trigger when a metric crosses a threshold
Log Alerts: Trigger when a log query returns results
Activity Log Alerts: Trigger when specific events occur
Alert Actions:
Action Groups define what happens when an alert triggers:
Email and SMS notifications
Webhook calls to external systems
ITSM integration (ServiceNow, etc.)
Automation runbooks
Azure Functions
Voice calls
Log Analytics
Log Analytics is the query engine for Azure Monitor logs. It uses Kusto Query Language (KQL), a read-only language designed for exploring logs and telemetry.
Log Sources:
Azure resources (diagnostic settings enabled)
Virtual machines with Azure Monitor Agent
Application Insights
Custom logs
Common Queries:
```kql
// Find 5xx errors in the last hour
AppServiceConsoleLogs
| where TimeGenerated > ago(1h)
| where Level == "Error"
| where Message contains "500"
| project TimeGenerated, RoleInstance, Message
```

```kql
// CPU usage across all VMs
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCPU = avg(CounterValue) by Computer
| order by AvgCPU desc
```
Application Insights
Application Insights is Azure Monitor's application performance monitoring (APM) service. It monitors live web applications to detect performance anomalies and diagnose issues.
Capabilities:
Request rates, response times, and failure rates
Dependency tracking (database, HTTP calls, external services)
Exception logging with stack traces
Performance counters (CPU, memory, network)
Live Metrics with near real-time performance data
Availability tests (ping tests from global locations)
Google Cloud Operations (formerly Stackdriver)
What is Google Cloud Operations?
Google Cloud Operations is Google's integrated monitoring, logging, and diagnostics suite. It brings together metrics, logs, and traces in a unified interface.
Cloud Monitoring (formerly Stackdriver Monitoring)
Cloud Monitoring collects metrics from Google Cloud services, applications, and on-premises infrastructure.
Metric Types:
Agent metrics: From installed monitoring agent (CPU, disk, network, process)
Platform metrics: From Google Cloud services
Custom metrics: From applications via the API
Workspaces (now called metrics scopes):
A metrics scope organizes which projects' monitoring data you can view together
Can monitor multiple projects from a single scoping project
Supports AWS accounts and on-premises systems
Uptime Checks
Uptime checks verify that your services are available and responsive by probing them from multiple global locations.
Check Types:
HTTP/HTTPS: Verifies status code and, optionally, response content
TCP: Verifies port connectivity
Alerting Policies
Alerting policies define conditions that trigger notifications.
Conditions:
Metric threshold: When a metric exceeds a value
Metric absence: When a metric stops reporting data
Logs-based: When a log query returns results
Uptime check: When a check fails
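The metric-absence condition is worth a closer look because it inverts the usual logic: the alert fires when data stops arriving. A minimal sketch of that check (illustrative, not the Cloud Monitoring API):

```python
# Sketch of a "metric absence" condition: fire when no datapoint has
# arrived within the absence window (illustrative only).

def metric_absent(last_datapoint_ts, now_ts, absence_window_seconds):
    """True if the metric has not reported within the window."""
    if last_datapoint_ts is None:  # the metric has never reported at all
        return True
    return now_ts - last_datapoint_ts > absence_window_seconds
```

Absence conditions catch failure modes that threshold alerts miss entirely: a crashed agent or a deleted instance sends no breaching datapoints, so no threshold alert ever fires.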
Notification Channels:
Email
SMS
Slack
PagerDuty
Webhooks
Pub/Sub
Cloud Logging (formerly Stackdriver Logging)
Cloud Logging stores, searches, and analyzes log data from Google Cloud services and your applications.
Log Sources:
Google Cloud services (GKE, Compute Engine, Cloud Run, etc.)
Custom logs from applications
On-premises and other cloud logs
Log Explorer:
Powerful query interface for searching logs
Save and share queries
Create metrics from logs
Export to BigQuery, Pub/Sub, or Cloud Storage
Log-Based Metrics:
Create metrics from log data
Use for alerting and dashboarding
Count occurrences of specific log messages
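Log-based metrics apply the same idea as CloudWatch metric filters: derive a counter from matching log entries. Cloud Logging works naturally with JSON-structured entries, so a sketch might group by severity (illustrative; the `severity` field follows Cloud Logging's structured log convention):

```python
# Sketch: derive a counter metric from JSON-structured log lines,
# grouped by severity (illustrative, not the Cloud Logging API).
import json
from collections import Counter

def count_by_severity(log_lines):
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured lines
        counts[entry.get("severity", "DEFAULT")] += 1
    return dict(counts)
```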
Alerting Best Practices
What to Alert On
Alert on symptoms, not causes.
Bad: CPU > 80% (this is a cause)
Good: Error rate > 1% for 5 minutes (this is a symptom users experience)
Alert on actionable conditions.
If an alert fires, someone should be able to do something about it. Avoid alert fatigue from non-actionable notifications.
Use multiple thresholds.
Warning: Something is degrading, needs attention soon
Critical: Immediate action required
Info: Informational, no action needed
Alert Design Principles
One alert, one purpose.
Each alert should indicate a single problem with a clear remediation path.
Include remediation steps in the notification.
Tell the on-call engineer what to do. A link to a runbook is ideal.
Set appropriate thresholds.
Too sensitive → alert fatigue. Too insensitive → missed problems.
Use multiple evaluation periods.
Require 3 of 5 periods to prevent flapping.
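The "M of N" rule is a simple debounce: a single noisy datapoint cannot trigger the alert, but a sustained problem still fires quickly. A sketch of the logic (illustrative):

```python
# Sketch of "M of N" alerting: fire only when at least M of the last N
# evaluation periods breached the threshold (prevents flapping).

def should_alert(breaches, m, n):
    """breaches: chronological list of booleans, one per evaluation period."""
    window = breaches[-n:]
    return len(window) == n and sum(window) >= m
```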
Alert on absence of data.
If a critical instance stops sending metrics, that is a problem.
Common Alert Examples
High Error Rate
Metric: HTTP 5xx count
Statistic: Sum
Period: 5 minutes
Threshold: > 10 errors
Evaluation: 2 of 2 periods
Action: Page on-call engineer
High Latency
Metric: Response time (p99)
Statistic: Percentile 99
Period: 5 minutes
Threshold: > 2000 ms
Evaluation: 3 of 5 periods
Action: Page on-call engineer
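A p99 of 2000 ms means 99% of requests completed in 2 seconds or less; the slowest 1% took longer. A minimal nearest-rank percentile sketch makes the statistic concrete (real backends typically use interpolation or streaming approximations instead):

```python
# Nearest-rank percentile (illustrative). Production monitoring backends
# usually interpolate or approximate rather than sort every sample.
import math

def percentile(samples, p):
    """p in (0, 100]; returns the value at the nearest-rank percentile."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]
```

Percentiles matter for latency alerts because averages hide tail pain: a handful of 10-second requests barely moves the mean but is exactly what your unhappiest users experience.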
Low Disk Space
Metric: Free disk space %
Statistic: Minimum
Period: 5 minutes
Threshold: < 10%
Evaluation: 2 of 2 periods
Action: Send warning email
Instance Down
Metric: Status Check Failed (Instance)
Statistic: Maximum
Period: 1 minute
Threshold: 1
Evaluation: 1 of 1 period
Action: Page on-call engineer, auto-recovery
Real-World Monitoring Scenarios
Scenario 1: Web Application Monitoring
A web application runs on EC2 instances behind a load balancer.
Metrics to Monitor:
Load balancer request count
Load balancer 5xx errors
Target response time
EC2 CPU utilization
EC2 status checks
Alerts:
5xx errors > 1% for 5 minutes → page on-call
Response time > 1 second for 5 minutes → page on-call
CPU > 80% for 10 minutes → scale up, notify
Instance status check failed → auto-recovery, notify
Dashboards:
Overview dashboard: request rate, error rate, response time
Instance dashboard: CPU, memory, disk per instance
Business dashboard: orders per minute, conversion rate
Scenario 2: Database Monitoring
A PostgreSQL database running on RDS.
Metrics to Monitor:
CPU utilization
Database connections
Free storage space
Read/Write latency
Replica lag (if replicas configured)
Alerts:
Connections > 80% of max → page on-call
Free storage < 10GB → page on-call
Replica lag > 60 seconds → page on-call
CPU > 90% for 15 minutes → page on-call
Connection failures → page on-call
Scenario 3: Multi-Environment Monitoring
Development, staging, and production environments require different alerting strategies.
Production:
Aggressive thresholds
Page on-call for critical issues
24/7 monitoring
Staging:
Same metrics as production
Email alerts only
No paging (unless staging is customer-facing)
Development:
Basic metrics
No alerts (teams monitor during work hours)
Dashboard Design Principles
Show what matters.
Do not put every metric on the dashboard. Include only metrics that indicate service health.
Use appropriate granularity.
Real-time dashboards for incident response: 1-minute granularity
Operational dashboards: 5-minute granularity
Business dashboards: hourly or daily aggregates
Provide context.
Include annotations for deployments, configuration changes, and known incidents.
Design for your audience.
Engineering dashboards: detailed metrics, logs, traces
Management dashboards: high-level SLIs, business metrics
Executive dashboards: uptime, revenue impact
Monitoring Costs
Monitoring services have costs that can scale with usage.
Cost Drivers:
Number of metrics collected (custom metrics, high-resolution metrics)
Number of API calls (putting metrics, querying logs)
Log ingestion volume (GB per day)
Log retention duration
Number of dashboards and alerts
Cost Optimization:
Use standard metrics when possible (they are free or low cost)
Set appropriate log retention (30-90 days is typical)
Archive logs to cheaper storage (S3, Blob, GCS) for long-term retention
Avoid high-resolution metrics unless needed
Consolidate dashboards to reduce API calls
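The interaction between ingestion volume and retention is easy to estimate with back-of-envelope arithmetic. The sketch below uses placeholder prices (the `price_*` defaults are hypothetical, not any provider's actual rates; check current pricing pages):

```python
# Back-of-envelope log cost estimate. Prices are PLACEHOLDERS for
# illustration, not actual provider pricing.

def monthly_log_cost(gb_per_day, retention_days,
                     price_ingest_per_gb=0.50, price_storage_per_gb_month=0.03):
    ingest = gb_per_day * 30 * price_ingest_per_gb
    # steady-state stored volume is roughly daily volume * retention window
    storage = gb_per_day * retention_days * price_storage_per_gb_month
    return round(ingest + storage, 2)
```

Running the numbers shows why ingestion usually dominates: cutting retention shrinks only the storage term, so filtering noisy logs before ingestion often saves more than shortening retention.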
Summary
| Provider | Metrics | Logs | Alerts | Dashboards |
|---|---|---|---|---|
| AWS | CloudWatch Metrics | CloudWatch Logs | CloudWatch Alarms | CloudWatch Dashboards |
| Azure | Azure Monitor Metrics | Log Analytics | Azure Monitor Alerts | Azure Dashboards |
| Google Cloud | Cloud Monitoring | Cloud Logging | Alerting Policies | Monitoring Dashboards |
The principles are the same across providers:
Collect metrics automatically from services
Add custom metrics for application-specific data
Centralize logs from all sources
Alert on symptoms users experience
Create dashboards for different audiences
Practice Questions
A web application is experiencing intermittent 5xx errors. How would you set up monitoring to detect this and alert the on-call engineer?
A production database is running low on disk space. What metric would you monitor, and what threshold would you set?
You need to create a dashboard for executives showing application uptime over the last 30 days. How would you design this?
Your team is getting too many alerts and starting to ignore them. How would you improve the alerting strategy?
A developer wants to log application-specific events and create alerts based on them. How would you implement this across providers?
Learn More
Practice cloud monitoring with hands-on exercises in our interactive labs:
https://devops.trainwithsky.com/