 Monitoring & Logging: CloudWatch, Stackdriver, Azure Monitor, Alerts, and Metrics

📅 Published: Feb 2026
⏱️ Estimated Reading Time: 16 minutes
🏷️ Tags: Cloud Monitoring, CloudWatch, Azure Monitor, Stackdriver, Alerts, Metrics, Observability


Introduction: Why Monitoring Matters

You cannot fix what you cannot see. Monitoring is the eyes and ears of your cloud infrastructure. It tells you what is happening, what has happened, and what might happen next.

Without monitoring, you are flying blind. You discover problems when users complain, not when they start. You cannot optimize what you cannot measure. You cannot prove your service is meeting its commitments.

Cloud providers offer comprehensive monitoring services that collect metrics, aggregate logs, and trigger alerts. This guide covers the essential monitoring services across AWS, Azure, and Google Cloud.


The Three Pillars of Observability

Modern cloud monitoring is built on three pillars:

Metrics
Numerical measurements over time. CPU utilization, request count, error rate, latency. Metrics answer "what is happening right now?"

Logs
Detailed records of discrete events. Application logs, system logs, audit logs. Logs answer "what happened, and why?"

Traces
Records of requests as they flow through distributed systems. A single user request might touch a load balancer, two microservices, a database, and a cache. Traces answer "where is the delay?"

This guide focuses on metrics and logs, the foundation of cloud monitoring.


Cloud Monitoring Services Across Providers

| Provider | Metrics & Alarms | Logs | Dashboards |
|---|---|---|---|
| AWS | CloudWatch | CloudWatch Logs | CloudWatch Dashboards |
| Azure | Azure Monitor Metrics | Log Analytics | Azure Dashboards |
| Google Cloud | Cloud Monitoring | Cloud Logging | Monitoring Dashboards |

While the names differ, the capabilities are remarkably similar across providers.


CloudWatch (AWS)

What is CloudWatch?

CloudWatch is AWS's monitoring and observability service. It collects metrics from AWS services, your applications, and on-premises resources. It stores logs, sets alarms, and creates dashboards.

CloudWatch Metrics

Metrics are data points about your resources. Most AWS services send metrics to CloudWatch automatically.

Common Metrics by Service:

| Service | Key Metrics |
|---|---|
| EC2 | CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps |
| RDS | CPUUtilization, DatabaseConnections, FreeStorageSpace, ReadLatency |
| Lambda | Invocations, Duration, Errors, Throttles |
| S3 | NumberOfObjects, BucketSizeBytes, 4xxErrors, 5xxErrors |
| ELB | RequestCount, TargetResponseTime, HTTPCode_ELB_5XX_Count |
| SQS | ApproximateNumberOfMessagesVisible, ApproximateNumberOfMessagesNotVisible |

Metric Characteristics:

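  • Standard Metrics: Sent by AWS services at no extra charge; 5-minute granularity with basic monitoring, 1-minute granularity with detailed monitoring (detailed monitoring on EC2 is billed)

  • Custom Metrics: Paid, you can send any data, 1-second granularity possible with high-resolution metrics

  • Retention: Fixed by granularity and not configurable, for both standard and custom metrics: sub-minute data points are kept for 3 hours, 1-minute for 15 days, 5-minute for 63 days, and 1-hour aggregates for 15 months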
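To make the link between granularity and retention concrete, here is a minimal sketch of a lookup based on the rollup schedule AWS documents for CloudWatch. The table and the function name `retention_for_period` are illustrative only and are not part of any AWS SDK.

```python
from datetime import timedelta

# Retention per data-point period, following the schedule AWS documents for
# CloudWatch. The table and function name are illustrative, not an SDK API.
RETENTION_BY_MIN_PERIOD = [
    (3600, timedelta(days=455)),  # 1-hour aggregates: about 15 months
    (300, timedelta(days=63)),    # 5-minute data points
    (60, timedelta(days=15)),     # 1-minute data points
    (1, timedelta(hours=3)),      # high-resolution (sub-minute) data points
]


def retention_for_period(period_seconds: int) -> timedelta:
    """Return how long data points of the given period stay queryable."""
    if period_seconds < 1:
        raise ValueError("period must be at least 1 second")
    for min_period, retention in RETENTION_BY_MIN_PERIOD:
        if period_seconds >= min_period:
            return retention
    raise AssertionError("unreachable: the table covers every period >= 1")
```

In practice this means a 1-second custom metric is only useful for short-term troubleshooting; for trend analysis over months you are reading hourly aggregates.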


CloudWatch Alarms

An alarm watches a single metric over a specified period and performs actions when the metric crosses a threshold.

Alarm Components:

  • Metric: What to watch (CPUUtilization, RequestCount, etc.)

  • Statistic: How to evaluate (Average, Maximum, Minimum, Sum, SampleCount)

  • Period: How long to evaluate (60 seconds, 300 seconds, etc.)

  • Threshold: The value that triggers the alarm (CPU > 80%)

  • Evaluation Periods: How many recent periods are evaluated, and how many of them must breach the threshold before the alarm triggers (for example, 3 out of 3)

  • Actions: What to do when triggered (SNS notification, Auto Scaling, EC2 action)

Alarm States:

  • OK: Metric is within the threshold

  • ALARM: Metric crossed the threshold for the evaluation period

  • INSUFFICIENT_DATA: Not enough data to evaluate (new instance, no traffic)
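The way these components combine into one of the three states can be sketched in a few lines of Python. This is a simplified model, not CloudWatch's actual evaluation logic: it assumes a "greater than" comparison, and it ignores the configurable treatment of missing data that the real service offers. The function and parameter names are made up for illustration.

```python
from typing import Optional, Sequence


def evaluate_alarm(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    evaluation_periods: int,
    datapoints_to_alarm: int,
) -> str:
    """Return OK, ALARM, or INSUFFICIENT_DATA for the most recent periods.

    Each entry in `datapoints` is the statistic for one period (oldest first);
    None marks a period with no data.
    """
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("need 1 <= datapoints_to_alarm <= evaluation_periods")
    # Only the most recent `evaluation_periods` entries are considered.
    window = list(datapoints)[-evaluation_periods:]
    present = [value for value in window if value is not None]
    if not present:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in present if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"
```

For example, `evaluate_alarm([50, 85, 90, 95], threshold=80, evaluation_periods=3, datapoints_to_alarm=3)` looks only at the last three values, all of which exceed 80, so the alarm fires.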


CloudWatch Logs

CloudWatch Logs centralizes log storage from AWS services and your applications.

Log Sources:

  • AWS service logs (Lambda, VPC Flow Logs, RDS logs)

  • EC2 instances with CloudWatch Agent

  • On-premises servers with CloudWatch Agent

  • Custom application logging

Log Features:

  • Log Groups: Container for log streams (one per application or component)

  • Log Streams: Sequence of log events from a single source (one per instance)

  • Log Retention: Logs never expire by default; retention can be set per log group from 1 day to 10 years, with export to S3 for cheaper long-term archival

  • Metric Filters: Extract metrics from log data (count errors, sum request sizes)

  • Subscription Filters: Stream logs in near real time to Lambda, Kinesis Data Streams, Amazon Data Firehose, or Amazon OpenSearch Service (formerly Elasticsearch Service)
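The idea behind a metric filter, turning matching log events into a number you can graph and alarm on, can be illustrated locally. In the sketch below, Python regular expressions stand in for the provider's own filter-pattern syntax, and the sample log lines and function names are invented for the example.

```python
import re
from typing import Iterable


def count_matching_events(log_lines: Iterable[str], pattern: str) -> int:
    """Count log events that match a regular expression (an error counter)."""
    matcher = re.compile(pattern)
    return sum(1 for line in log_lines if matcher.search(line))


def sum_extracted_values(log_lines: Iterable[str], pattern: str) -> float:
    """Sum the first capture group of every matching event (e.g. sizes)."""
    matcher = re.compile(pattern)
    total = 0.0
    for line in log_lines:
        match = matcher.search(line)
        if match:
            total += float(match.group(1))
    return total


# Made-up sample events, used only to exercise the two functions.
sample_events = [
    "2026-02-01T10:00:00Z INFO request served bytes=512",
    "2026-02-01T10:00:01Z ERROR upstream timeout bytes=0",
    "2026-02-01T10:00:02Z INFO request served bytes=2048",
    "2026-02-01T10:00:03Z ERROR database unavailable bytes=0",
]
error_count = count_matching_events(sample_events, r"\bERROR\b")
bytes_served = sum_extracted_values(sample_events, r"bytes=(\d+)")
```

The first function corresponds to "count errors" and the second to "sum request sizes"; a managed metric filter does the same work continuously as events arrive.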


CloudWatch Dashboards

Dashboards create customizable views of your metrics in one place.

Dashboard Features:

  • Widgets for metrics, logs, and text

  • Cross-account views (view multiple accounts in one dashboard)

  • Automatic refresh (intervals from 10 seconds to 15 minutes)

  • Sharing with teams or public (read-only)


Azure Monitor

What is Azure Monitor?

Azure Monitor is Microsoft's comprehensive monitoring solution. It collects, analyzes, and acts on telemetry from Azure resources, on-premises environments, and applications.

Azure Monitor Metrics

Metrics are lightweight, near real-time values that require no querying. They are stored in a time-series database optimized for fast retrieval.

Metric Types:

  • Platform Metrics: Provided by Azure services automatically

  • Custom Metrics: You can send your own metrics

Common Metrics by Service:

| Service | Key Metrics |
|---|---|
| Virtual Machines | Percentage CPU, Available Memory, Disk Read Bytes, Network In Total |
| App Service | Response Time, Requests, HTTP 5xx, Memory Working Set |
| SQL Database | DTU Consumption, CPU Percentage, Sessions Count, Deadlocks |
| Storage | Transactions, Ingress, Egress, Success E2E Latency |

Azure Monitor Alerts

Alerts proactively notify you when the conditions you define are met in your monitoring data.

Alert Types:

  • Metric Alerts: Trigger when a metric crosses a threshold

  • Log Alerts: Trigger when a log query returns results

  • Activity Log Alerts: Trigger when specific events occur

Alert Actions:

  • Action Groups define what happens when an alert triggers:

    • Email and SMS notifications

    • Webhook calls to external systems

    • ITSM integration (ServiceNow, etc.)

    • Automation runbooks

    • Azure Functions

    • Voice calls


Log Analytics

Log Analytics is the query engine for Azure Monitor logs. It uses the Kusto Query Language (KQL), a powerful query language for log analysis.

Log Sources:

  • Azure resources (diagnostic settings enabled)

  • Virtual machines with Azure Monitor Agent

  • Application Insights

  • Custom logs

Common Queries:

kusto
// Find 5xx responses in the last hour
// (requires App Service HTTP logs to be sent to the workspace via diagnostic settings)
AppServiceHTTPLogs
| where TimeGenerated > ago(1h)
| where ScStatus >= 500
| project TimeGenerated, CsHost, CsMethod, CsUriStem, ScStatus, TimeTaken

// CPU usage across all VMs
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCPU = avg(CounterValue) by Computer
| order by AvgCPU desc

Application Insights

Application Insights is Azure Monitor's application performance monitoring (APM) service. It monitors live web applications to detect performance anomalies and diagnose issues.

Capabilities:

  • Request rates, response times, and failure rates

  • Dependency tracking (database, HTTP calls, external services)

  • Exception logging with stack traces

  • Performance counters (CPU, memory, network)

  • Live Metrics with near real-time performance data

  • Availability tests (ping tests from global locations)


Google Cloud Operations (formerly Stackdriver)

What is Google Cloud Operations?

Google Cloud Operations is Google's integrated monitoring, logging, and diagnostics suite. It brings together metrics, logs, and traces in a unified interface.

Cloud Monitoring (formerly Stackdriver Monitoring)

Cloud Monitoring collects metrics from Google Cloud services, applications, and on-premises infrastructure.

Metric Types:

  • Agent metrics: From installed monitoring agent (CPU, disk, network, process)

  • Platform metrics: From Google Cloud services

  • Custom metrics: From applications via the API

Metrics Scopes (formerly Workspaces):

  • A metrics scope defines the set of projects whose metrics you can view together

  • Can monitor multiple projects from a single scoping project

  • Can also hold metrics from other clouds and on-premises systems, written through agents or the Monitoring API


Uptime Checks

Uptime checks verify that your services are available and responsive by probing them at regular intervals from multiple locations. (Service Monitoring is a separate Cloud Monitoring feature for defining and tracking SLOs.)

Check Types:

  • HTTP/HTTPS: Verifies status code and response content

  • TCP: Verifies port connectivity
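What an HTTP check verifies can be sketched with the Python standard library. This is an illustration of the idea, not how the managed service is implemented; `probe` and `http_check_passes` are hypothetical helper names, and a status of 0 is this sketch's own convention for "could not connect".

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def probe(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch a URL and return (status_code, body); status 0 means unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions but still carry a status.
        return error.code, error.read().decode("utf-8", errors="replace")
    except OSError:
        # DNS failures, refused connections, and timeouts.
        return 0, ""


def http_check_passes(
    status_code: int,
    body: str,
    expected_status: int = 200,
    must_contain: Optional[str] = None,
) -> bool:
    """Return True when a response satisfies the check's expectations."""
    if status_code != expected_status:
        return False
    if must_contain is not None and must_contain not in body:
        return False
    return True
```

Checking the body as well as the status code matters: a misconfigured service can return 200 with an error page, which a status-only check would miss.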

Alerting Policies

Alerting policies define conditions that trigger notifications.

Conditions:

  • Metric threshold: When a metric exceeds a value

  • Metric absence: When a metric stops reporting data

  • Logs-based: When a log query returns results

  • Uptime check: When a check fails

Notification Channels:

  • Email

  • SMS

  • Slack

  • PagerDuty

  • Webhooks

  • Pub/Sub


Cloud Logging (formerly Stackdriver Logging)

Cloud Logging stores, searches, and analyzes log data from Google Cloud services and your applications.

Log Sources:

  • Google Cloud services (GKE, Compute Engine, Cloud Run, etc.)

  • Custom logs from applications

  • On-premises and other cloud logs

Logs Explorer:

  • Powerful query interface for searching logs

  • Save and share queries

  • Create metrics from logs

  • Export to BigQuery, Pub/Sub, or Cloud Storage

Log-Based Metrics:

  • Create metrics from log data

  • Use for alerting and dashboarding

  • Count occurrences of specific log messages


Alerting Best Practices

What to Alert On

Alert on symptoms, not causes.

  • Bad: CPU > 80% (this is a cause)

  • Good: Error rate > 1% for 5 minutes (this is a symptom users experience)
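As a rough sketch of the "good" example, the error rate can be derived from request counts and checked over consecutive one-minute windows. The function names and the choice of one-minute windows are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def error_rate_percent(total_requests: int, failed_requests: int) -> float:
    """Percentage of requests that failed; 0.0 when there was no traffic."""
    if total_requests < 0 or failed_requests < 0:
        raise ValueError("counts must be non-negative")
    if failed_requests > total_requests:
        raise ValueError("failed_requests cannot exceed total_requests")
    if total_requests == 0:
        return 0.0
    return 100.0 * failed_requests / total_requests


def sustained_breach(
    per_minute_counts: Sequence[Tuple[int, int]],
    threshold_percent: float = 1.0,
    minutes: int = 5,
) -> bool:
    """True when each of the last `minutes` windows exceeds the threshold.

    Each entry is a (total_requests, failed_requests) pair for one minute.
    """
    recent = per_minute_counts[-minutes:]
    if len(recent) < minutes:
        return False  # not enough history yet
    return all(
        error_rate_percent(total, failed) > threshold_percent
        for total, failed in recent
    )
```

Using a rate rather than a raw count keeps the alert meaningful as traffic grows: 10 errors out of 100 requests is an incident, 10 out of a million usually is not.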

Alert on actionable conditions.
If an alert fires, someone should be able to do something about it. Avoid alert fatigue from non-actionable notifications.

Use multiple severity levels.

  • Info: Informational, no action needed

  • Warning: Something is degrading, needs attention soon

  • Critical: Immediate action required
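The warning and critical levels can be expressed as two thresholds on the same metric. The following is a minimal sketch that assumes higher values are worse; the function name and the level strings are illustrative.

```python
def classify_severity(value: float, warning_at: float, critical_at: float) -> str:
    """Map a metric value to a level, assuming higher values are worse."""
    if warning_at >= critical_at:
        raise ValueError("warning_at must be below critical_at")
    if value >= critical_at:
        return "critical"
    if value >= warning_at:
        return "warning"
    return "ok"
```

With `warning_at=1.0` and `critical_at=5.0` on an error-rate percentage, a reading of 2.3 would raise a warning and leave the pager quiet.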


Alert Design Principles

One alert, one purpose.
Each alert should indicate a single problem with a clear remediation path.

Include remediation steps in the notification.
Tell the on-call engineer what to do. A link to a runbook is ideal.

Set appropriate thresholds.
Too sensitive → alert fatigue. Too insensitive → missed problems.

Use multiple evaluation periods.
Require the threshold to be breached in several periods, for example 3 of the last 5, so that a single spike does not make the alert flap.

Alert on absence of data.
If a critical instance stops sending metrics, that is a problem.
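Absence detection amounts to comparing each source's last report time against a maximum allowed silence. A minimal sketch follows; the source names and the five-minute window in the usage note are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_silent_sources(
    last_seen: Dict[str, datetime],
    now: datetime,
    max_silence: timedelta,
) -> List[str]:
    """Return the names of sources that have not reported within max_silence."""
    return sorted(
        name for name, seen in last_seen.items() if now - seen > max_silence
    )
```

If `web-1` reported one minute ago and `web-2` twelve minutes ago, calling this with `max_silence=timedelta(minutes=5)` returns `["web-2"]`. The managed equivalents are alarms that treat missing data as breaching (AWS) and metric-absence conditions (Google Cloud).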


Common Alert Examples

High Error Rate

  • Metric: HTTP 5xx count

  • Statistic: Sum

  • Period: 5 minutes

  • Threshold: > 10 errors

  • Evaluation: 2 of 2 periods

  • Action: Page on-call engineer
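On AWS, the settings above could map onto the parameters of CloudWatch's PutMetricAlarm operation roughly as follows. This is one possible mapping rather than the only one: it assumes an Application Load Balancer and that paging is wired up through an SNS topic. The alarm name, load balancer identifier, and topic ARN are placeholders.

```python
from typing import Any, Dict


def high_error_rate_alarm(load_balancer: str, sns_topic_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's PutMetricAlarm operation."""
    return {
        "AlarmName": "high-5xx-count",  # hypothetical name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",  # 5xx returned by targets
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": 300,  # 5 minutes, in seconds
        "EvaluationPeriods": 2,
        "DatapointsToAlarm": 2,  # 2 of 2 periods
        "Threshold": 10.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # This metric is only reported when errors occur, so a missing
        # data point is treated as "no errors" rather than as a problem.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


params = high_error_rate_alarm(
    load_balancer="app/example-lb/0123456789abcdef",  # placeholder value
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:on-call",  # placeholder ARN
)
# With boto3 installed and credentials configured, the alarm could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the definition as data makes it easy to review in version control and to reuse across load balancers.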

High Latency

  • Metric: Response time (p99)

  • Statistic: Percentile 99

  • Period: 5 minutes

  • Threshold: > 2000 ms

  • Evaluation: 3 of 5 periods

  • Action: Page on-call engineer
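To see why the latency alert uses p99 rather than an average, here is a small sketch of a nearest-rank percentile. Monitoring services may compute percentiles with different interpolation or approximation methods, so treat this as an illustration of the concept rather than a reproduction of any provider's arithmetic.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up sample: 98 fast requests and 2 very slow ones, in milliseconds.
latencies_ms = [100] * 98 + [2500, 3000]
mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile(latencies_ms, 99)
```

In this sample the mean is 153 ms, which looks healthy, while the p99 is 2500 ms and would breach the 2000 ms threshold. The average hides exactly the slow requests that users notice.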

Low Disk Space

  • Metric: Free disk space %

  • Statistic: Minimum

  • Period: 5 minutes

  • Threshold: < 10%

  • Evaluation: 2 of 2 periods

  • Action: Send warning email
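Free disk space is typically not among the metrics a hypervisor-level service can see; on AWS, for instance, it has to be reported from inside the instance by an agent. A minimal sketch of the underlying measurement, using only the Python standard library (the function names are illustrative):

```python
import shutil


def free_percent(total_bytes: int, free_bytes: int) -> float:
    """Free space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def disk_free_percent(path: str = "/") -> float:
    """Measure free space on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return free_percent(usage.total, usage.free)
```

A percentage threshold such as 10% works across volumes of different sizes; on very large volumes an absolute threshold in GB may be the better signal.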

Instance Down

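  • Metric: StatusCheckFailed_Instance (EC2)

  • Statistic: Maximum

  • Period: 1 minute

  • Threshold: >= 1

  • Evaluation: 1 of 1 period

  • Action: Page on-call engineer, reboot the instance (the EC2 recover action applies to system status check failures, StatusCheckFailed_System)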


Real-World Monitoring Scenarios

Scenario 1: Web Application Monitoring

A web application runs on EC2 instances behind a load balancer.

Metrics to Monitor:

  • Load balancer request count

  • Load balancer 5xx errors

  • Target response time

  • EC2 CPU utilization

  • EC2 status checks

Alerts:

  • 5xx errors > 1% for 5 minutes → page on-call

  • Response time > 1 second for 5 minutes → page on-call

  • CPU > 80% for 10 minutes → scale up, notify

  • Instance status check failed → auto-recovery, notify

Dashboards:

  • Overview dashboard: request rate, error rate, response time

  • Instance dashboard: CPU, memory, disk per instance

  • Business dashboard: orders per minute, conversion rate


Scenario 2: Database Monitoring

A PostgreSQL database running on RDS.

Metrics to Monitor:

  • CPU utilization

  • Database connections

  • Free storage space

  • Read/Write latency

  • Replica lag (if replicas configured)

Alerts:

  • Connections > 80% of max → page on-call

  • Free storage < 10GB → page on-call

  • Replica lag > 60 seconds → page on-call

  • CPU > 90% for 15 minutes → page on-call

  • Connection failures → page on-call


Scenario 3: Multi-Environment Monitoring

Development, staging, and production environments require different alerting strategies.

Production:

  • Aggressive thresholds

  • Page on-call for critical issues

  • 24/7 monitoring

Staging:

  • Same metrics as production

  • Email alerts only

  • No paging (unless staging is customer-facing)

Development:

  • Basic metrics

  • No alerts (teams monitor during work hours)
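One way to keep these differences explicit is a small routing table keyed by environment and severity. The sketch below is an assumption about how such a table might look; the channel names ("page", "email", "none") and the mapping itself are illustrative.

```python
from typing import Dict

# Which channel each severity goes to, per environment (illustrative values).
ROUTING: Dict[str, Dict[str, str]] = {
    "production": {"critical": "page", "warning": "email"},
    "staging": {"critical": "email", "warning": "email"},
    "development": {},  # no alerts; teams watch dashboards during work hours
}


def route_alert(environment: str, severity: str) -> str:
    """Return the notification channel for an alert, or "none"."""
    if environment not in ROUTING:
        raise KeyError(f"unknown environment: {environment}")
    return ROUTING[environment].get(severity, "none")
```

Because staging uses the same metrics and alert definitions as production and differs only in routing, a broken alert rule is likely to be noticed in staging before it matters.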


Dashboard Design Principles

Show what matters.
Do not put every metric on the dashboard. Include only metrics that indicate service health.

Use appropriate granularity.

  • Real-time dashboards for incident response: 1-minute granularity

  • Operational dashboards: 5-minute granularity

  • Business dashboards: hourly or daily aggregates
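Moving from one granularity to another is a matter of aggregating consecutive data points. A minimal sketch, assuming evenly spaced samples and averaging as the aggregate (the function name is hypothetical):

```python
from typing import List, Sequence


def downsample_average(values: Sequence[float], factor: int) -> List[float]:
    """Average every `factor` consecutive points, e.g. 1-minute into 5-minute.

    A trailing partial bucket is averaged over the points it contains.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    buckets = (values[i:i + factor] for i in range(0, len(values), factor))
    return [sum(bucket) / len(bucket) for bucket in buckets]
```

Averaging smooths out short spikes, which is the point on an operational dashboard but a drawback during an incident; for metrics where the peak matters, aggregating with the maximum is often the better choice.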

Provide context.
Include annotations for deployments, configuration changes, and known incidents.

Design for your audience.

  • Engineering dashboards: detailed metrics, logs, traces

  • Management dashboards: high-level SLIs, business metrics

  • Executive dashboards: uptime, revenue impact


Monitoring Costs

Monitoring services have costs that can scale with usage.

Cost Drivers:

  • Number of metrics collected (custom metrics, high-resolution metrics)

  • Number of API calls (putting metrics, querying logs)

  • Log ingestion volume (GB per day)

  • Log retention duration

  • Number of dashboards and alerts

Cost Optimization:

  • Use standard metrics when possible (they are free or low cost)

  • Set appropriate log retention (30-90 days is typical)

  • Archive logs to cheaper storage (S3, Blob, GCS) for long-term retention

  • Avoid high-resolution metrics unless needed

  • Consolidate dashboards to reduce API calls
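A back-of-the-envelope model shows how ingestion volume and retention interact. The sketch below assumes a steady daily volume and a simple two-part price (per GB ingested, per GB stored per month); it ignores compression, free tiers, and query charges. The prices in the usage note are placeholders, not any provider's actual rates.

```python
def monthly_log_cost(
    gb_per_day: float,
    retention_days: int,
    ingest_price_per_gb: float,
    storage_price_per_gb_month: float,
    days_in_month: int = 30,
) -> float:
    """Estimate the steady-state monthly cost of ingesting and retaining logs."""
    if gb_per_day < 0 or retention_days < 0:
        raise ValueError("volume and retention must be non-negative")
    ingested_gb = gb_per_day * days_in_month
    # At steady state, the stored volume is the daily volume times retention.
    stored_gb = gb_per_day * retention_days
    return ingested_gb * ingest_price_per_gb + stored_gb * storage_price_per_gb_month
```

With placeholder prices of 0.50 per GB ingested and 0.03 per GB-month stored, 10 GB per day kept for 30 days comes to about 159 per month, of which 150 is ingestion. In a model like this, reducing what you log in the first place saves far more than shortening retention.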


Summary

| Provider | Metrics | Logs | Alerts | Dashboards |
|---|---|---|---|---|
| AWS | CloudWatch Metrics | CloudWatch Logs | CloudWatch Alarms | CloudWatch Dashboards |
| Azure | Azure Monitor Metrics | Log Analytics | Azure Monitor Alerts | Azure Dashboards |
| Google Cloud | Cloud Monitoring | Cloud Logging | Alerting Policies | Monitoring Dashboards |

The principles are the same across providers:

  • Collect metrics automatically from services

  • Add custom metrics for application-specific data

  • Centralize logs from all sources

  • Alert on symptoms users experience

  • Create dashboards for different audiences


Practice Questions

  1. A web application is experiencing intermittent 5xx errors. How would you set up monitoring to detect this and alert the on-call engineer?

  2. A production database is running low on disk space. What metric would you monitor, and what threshold would you set?

  3. You need to create a dashboard for executives showing application uptime over the last 30 days. How would you design this?

  4. Your team is getting too many alerts and starting to ignore them. How would you improve the alerting strategy?

  5. A developer wants to log application-specific events and create alerts based on them. How would you implement this across providers?


Learn More

Practice cloud monitoring with hands-on exercises in our interactive labs:
https://devops.trainwithsky.com/
