Saturday, January 10, 2026

System Monitoring & Performance Guide - DevOps

System Monitoring & Performance Guide

Published: December 2025 | Topic: System Monitoring & Performance for DevOps

Monitoring is the eyes and ears of your infrastructure. Effective monitoring helps you detect issues before they become outages, optimize resource utilization, and ensure system reliability. This guide covers essential monitoring tools and techniques for Linux systems.

1. System Resource Monitoring

Key Metrics to Monitor

Metric          Threshold      Meaning
CPU Usage       ≤ 80%          Warning threshold
Memory Usage    ≤ 85%          Warning threshold
Disk Usage      ≤ 90%          Warning threshold
Load Average    ≤ CPU cores    Optimal range
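
The thresholds above can be turned into a small reusable check. This is a minimal sketch, not part of the original guide: the function name check_threshold and the sample readings are hypothetical, and the comparison is delegated to awk so fractional values work.

```shell
#!/bin/bash
# check_threshold NAME VALUE LIMIT  (hypothetical helper)
# Prints "WARN ..." and returns 1 when VALUE exceeds LIMIT, else prints "OK ...".
# awk performs the comparison so fractional values such as 91.5 are handled.
check_threshold() {
    local name="$1" value="$2" limit="$3"
    if awk -v v="$value" -v l="$limit" 'BEGIN { exit !(v > l) }'; then
        echo "WARN $name $value > $limit"
        return 1
    fi
    echo "OK $name $value <= $limit"
}

# Made-up readings for illustration only:
check_threshold "cpu_pct" 91.5 80 || true
check_threshold "mem_pct" 62 85
check_threshold "disk_pct" 90 90
```

The non-zero return code lets a caller chain an action, for example check_threshold cpu_pct "$value" 80 || notify_someone, where notify_someone stands for whatever alerting hook you use.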

CPU Monitoring

top / htop

$ top
# Interactive process viewer
$ htop # Enhanced version (install if needed)

# Batch mode - non-interactive
$ top -bn1 | head -20

# Key metrics in top:
# load average: 1min, 5min, 15min
# %Cpu(s): us(user), sy(system), ni(nice), id(idle), wa(I/O wait)
# Mem: total, free, used, buff/cache

vmstat - Virtual Memory Stats

$ vmstat 2 5
# Display every 2 seconds, 5 times

# Output columns:
# r: Runnable processes (running or waiting for CPU time)
# b: Processes in uninterruptible sleep (usually blocked on I/O)
# swpd: Swap used (KB)
# free: Free memory (KB)
# si: Memory swapped in from disk (KB/s)
# so: Memory swapped out to disk (KB/s)
# us: User CPU time (%)
# sy: System CPU time (%)
# id: Idle CPU time (%)
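
As a rough illustration of how the si and so columns can be checked automatically, the sketch below counts vmstat samples that show swap traffic. The function name swap_activity is hypothetical; it finds the columns from the header row rather than assuming fixed positions.

```shell
#!/bin/bash
# swap_activity (hypothetical helper): read vmstat output on stdin and report
# how many samples show swap-in or swap-out traffic.
# Column positions are looked up from the header row, because the column set
# can differ between procps versions.
swap_activity() {
    awk '
        $1 == "r" {                          # header row: locate si and so
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        si && $1 ~ /^[0-9]+$/ {              # data rows only
            n++
            if ($si > 0 || $so > 0) hits++
        }
        END { printf "samples=%d swapping=%d\n", n, hits }'
}

# Live run, guarded so the sketch also works where vmstat is not installed.
if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 3 | swap_activity
fi
```

Keep in mind that the first row vmstat prints reports averages since boot, so a non-zero value there may reflect past rather than current swapping.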

mpstat - CPU Statistics

$ mpstat -P ALL 2 3
# Monitor all CPUs, every 2s, 3 times

# Key metrics:
# %usr: User CPU percentage
# %nice: Nice user CPU percentage
# %sys: System CPU percentage
# %iowait: I/O wait percentage
# %irq: Interrupt handling
# %soft: Software interrupts
# %idle: Idle percentage

# Check CPU frequency scaling:
$ cat /proc/cpuinfo | grep MHz

Memory Monitoring

free - Memory Usage

$ free -h # Human readable
$ free -m # Megabytes

# Watch memory continuously:
$ watch -n 2 free -m

# Understanding output:
# total: Total installed memory
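# used: Memory in use (excludes buffers/cache on current procps versions)
# free: Completely unused memory
# shared: Memory used mostly by tmpfs
# buffers: Memory used by kernel buffers
# cache: Page cache and slabs (shown combined with buffers as buff/cache unless you pass -w)
# available: Estimate of memory available to new applications without swapping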

/proc/meminfo - Detailed Info

$ cat /proc/meminfo

# Key fields:
# MemTotal: Total usable RAM
# MemFree: Completely unused RAM
# MemAvailable: Estimated available memory
# Buffers: Temporary storage for raw disk blocks
# Cached: In-memory cache for files read from disk
# SwapCached: Memory that was swapped out, read back in, and still has a copy in swap
# Active: Memory recently used
# Inactive: Memory not recently used
# SwapTotal: Total swap space
# SwapFree: Unused swap space
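
To show how these fields combine into a single usage figure, here is a minimal sketch that computes (MemTotal - MemAvailable) / MemTotal. The function name mem_used_percent is hypothetical, and the sketch assumes a kernel recent enough to expose MemAvailable; without that field it would report 100%.

```shell
#!/bin/bash
# mem_used_percent (hypothetical helper): read /proc/meminfo-style text on
# stdin and print the percentage of RAM in use, computed as
# (MemTotal - MemAvailable) * 100 / MemTotal.
mem_used_percent() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0) { print "MemTotal not found" > "/dev/stderr"; exit 1 }
            printf "%.1f\n", (total - avail) * 100 / total
        }'
}

# Live reading, guarded so the sketch also runs where /proc is absent.
if [ -r /proc/meminfo ]; then
    echo "Memory in use: $(mem_used_percent < /proc/meminfo)%"
fi
```

Basing the figure on MemAvailable rather than MemFree avoids counting reclaimable page cache as used memory.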

Disk I/O Monitoring

iostat - Disk Statistics

$ iostat -xz 2 5
# Extended stats, every 2s, 5 times

# Key metrics:
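# %util: Percentage of elapsed time the device was busy servicing I/O requests
# r/s: Reads per second
# w/s: Writes per second
# rkB/s: KB read per second
# wkB/s: KB written per second
# await: Average time per I/O request, including queue time (ms); newer sysstat releases report r_await and w_await instead
# svctm: Average service time for I/O (ms); deprecated and no longer shown by recent sysstat releases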

# Disk space usage:
$ df -h # Human readable
$ df -i # Inode usage
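
The df output can be filtered against the disk threshold with a few lines of awk. This is a sketch under stated assumptions: the function name disks_over is hypothetical, and mount points are assumed not to contain spaces.

```shell
#!/bin/bash
# disks_over LIMIT (hypothetical helper): read `df -P` output on stdin and
# print every mount point whose Capacity column exceeds LIMIT percent.
# Assumes mount points contain no spaces.
disks_over() {
    local limit="$1"
    awk -v limit="$limit" '
        NR == 1 { next }                     # skip the header row
        {
            pct = $5
            sub(/%$/, "", pct)               # "91%" -> "91"
            if (pct + 0 > limit + 0) print $6, pct "%"
        }'
}

# -P selects the POSIX format, which keeps each filesystem on a single line.
df -P 2>/dev/null | disks_over 90
```

No output means nothing is above the limit. The same filter should also work on inode usage (df -P -i), since that report keeps the percentage in the fifth column.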

iotop - I/O by Process

$ iotop
# Interactive view of disk I/O by process
$ iotop -o # Only show processes doing I/O
$ iotop -b -n 3 # Batch mode, 3 iterations

# Key columns:
# TID: Thread ID
# PRIO: I/O priority
# USER: Process owner
# DISK READ: Read rate
# DISK WRITE: Write rate
# SWAPIN: Percentage of time spent swapping in
# IO: Percentage of time spent waiting on I/O

Network Monitoring

iftop - Bandwidth Usage

$ iftop -i eth0 # Monitor specific interface
$ iftop -n # Don't resolve hostnames
$ iftop -N # Don't resolve port numbers to service names

# Text (non-interactive) mode: print one report after 10 seconds, up to 100 lines:
$ iftop -t -s 10 -L 100

# Key information:
# Shows bandwidth usage per connection
# Real-time traffic display
# Sorted by bandwidth usage
# Source and destination IPs

nload & ifstat

$ nload eth0
# Visual network monitor

$ ifstat -i eth0 -t 2 10
# Interface statistics with timestamps, every 2s, 10 times

# Basic network stats:
$ netstat -i # Interface table (net-tools; use "ip -s link" where netstat is not installed)
$ ss -s # Socket statistics
$ cat /proc/net/dev # Network device info

# Check connections:
$ ss -tunap # All TCP/UDP connections
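
A per-state summary is often easier to read than the full connection list. The sketch below counts TCP sockets by state from ss output; the function name tcp_states is hypothetical.

```shell
#!/bin/bash
# tcp_states (hypothetical helper): read `ss -tan` output on stdin and count
# sockets per state, most frequent first (ties sorted by state name).
tcp_states() {
    awk 'NR > 1 { count[$1]++ }
         END { for (s in count) print count[s], s }' | sort -k1,1nr -k2,2
}

# Live run, guarded so the sketch also works where ss is unavailable.
if command -v ss >/dev/null 2>&1; then
    ss -tan 2>/dev/null | tcp_states
fi
```

A steadily growing count of CLOSE-WAIT sockets usually points at an application that is not closing its connections, so that row is worth watching over time.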

2. Performance Monitoring Tools

sar - System Activity Reporter

# Install sysstat package first
$ sudo apt install sysstat # Debian/Ubuntu
$ sudo yum install sysstat # RHEL/CentOS

# Enable data collection on Debian/Ubuntu (edit /etc/default/sysstat):
ENABLED="true"
# Then enable and start the collector:
$ sudo systemctl enable --now sysstat

# Common sar commands:
$ sar -u 2 5 # CPU every 2s, 5 times
$ sar -r 1 3 # Memory every 1s, 3 times
$ sar -b 2 4 # I/O and transfer rate
$ sar -n DEV 2 3 # Network devices
$ sar -q 2 5 # Queue length and load

# View historical data:
$ sar -u -f /var/log/sa/sa10 # CPU for day 10 of the month (RHEL path; Debian/Ubuntu use /var/log/sysstat/)
$ sar -A # All statistics

# Restrict a report to a time range:
$ sar -u -s 10:00:00 -e 13:00:00 # CPU between 10:00 and 13:00 today

dstat - Versatile Resource Statistics

$ dstat -cdngy 1 10
# CPU, disk, network, paging, system stats
# Update every 1s, 10 times

# Custom combinations:
$ dstat -c -d -n -m -y -t 2 5
# cpu, disk, net, memory, system, timestamp

# Advanced usage:
$ dstat --top-cpu # Top CPU consuming processes
$ dstat --top-mem # Top memory consumers
$ dstat --top-io # Top I/O processes
$ dstat --output /tmp/dstat.csv # Export to CSV

# Color-coded output:
$ dstat -cdngy --color

Performance Analysis Script

Complete Performance Monitoring Script

#!/bin/bash
# performance_monitor.sh - Comprehensive system monitoring

LOG_FILE="/var/log/performance_$(date +%Y%m%d).log"
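ALERT_THRESHOLD_CPU=80
ALERT_THRESHOLD_MEM=85
ALERT_THRESHOLD_DISK=90

# Placeholder alert hook: records the alert in the log and on stderr.
# Replace the body with mail, a webhook call, or similar.
send_alert() {
    local subject="$1" message="$2"
    echo "[ALERT] $subject - $message" | tee -a "$LOG_FILE" >&2
}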

collect_metrics() {
    local timestamp="$(date '+%Y-%m-%d %H:%M:%S')"
    
    # CPU Metrics (us + sy from top's summary line; its layout varies by top version and locale)
    local cpu_usage="$(top -bn1 | grep "Cpu(s)" | awk '{print $2 + $4}')"
    local load_avg="$(cut -d' ' -f1-3 /proc/loadavg)"
    
    # Memory Metrics
    local mem_info="$(free -m)"
    local mem_total="$(echo "$mem_info" | grep Mem | awk '{print $2}')"
    local mem_used="$(echo "$mem_info" | grep Mem | awk '{print $3}')"
    local mem_percent="$(echo "scale=2; $mem_used * 100 / $mem_total" | bc)"
    
    # Disk Metrics (-P keeps the entry on one line even for long device names)
    local disk_usage="$(df -P / | tail -1 | awk '{print $5}' | sed 's/%//')"
    
    # Output results
    cat >> "$LOG_FILE" << EOF
=== $timestamp ===
CPU Usage: ${cpu_usage}%
Load Average: $load_avg
Memory: ${mem_used}MB/${mem_total}MB (${mem_percent}%)
Disk Usage: ${disk_usage}%
EOF
    
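    # Check thresholds
    if (( $(echo "$cpu_usage > $ALERT_THRESHOLD_CPU" | bc -l) )); then
        send_alert "High CPU" "CPU usage: ${cpu_usage}%"
    fi
    
    if (( $(echo "$mem_percent > $ALERT_THRESHOLD_MEM" | bc -l) )); then
        send_alert "High Memory" "Memory usage: ${mem_percent}%"
    fi
    
    # disk_usage is an integer, so plain shell arithmetic is enough
    if (( disk_usage > ALERT_THRESHOLD_DISK )); then
        send_alert "High Disk" "Disk usage: ${disk_usage}%"
    fi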
}

monitor_processes() {
    echo "=== Top Processes ===" >> "$LOG_FILE"
    ps aux --sort=-%cpu | head -6 >> "$LOG_FILE"
    echo "" >> "$LOG_FILE"
}

main() {
    echo "Starting performance monitoring..."
    
    while true; do
        collect_metrics
        monitor_processes
        sleep 60 # Collect every minute
    done
}

main "$@"
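
The script above reads CPU usage from top's summary line, whose layout differs between versions and locales. As an alternative sketch (not the original author's method), utilization can be derived from two samples of the aggregate cpu line in /proc/stat. The function name cpu_busy_percent is hypothetical; idle and iowait are both treated as idle time here, which is one common convention rather than the only one.

```shell
#!/bin/bash
# cpu_busy_percent LINE1 LINE2  (hypothetical helper)
# Each argument is the aggregate "cpu ..." line from /proc/stat, taken at two
# points in time. Prints the share of non-idle time between the two samples.
cpu_busy_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i   # user .. steal
            idle = $5 + $6                                    # idle + iowait
            if (NR == 1) { t1 = total; i1 = idle } else { t2 = total; i2 = idle }
        }
        END {
            dt = t2 - t1
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", (dt - (i2 - i1)) * 100 / dt
        }'
}

# Live measurement over one second, guarded for systems without /proc/stat.
if [ -r /proc/stat ]; then
    first="$(grep '^cpu ' /proc/stat)"
    sleep 1
    second="$(grep '^cpu ' /proc/stat)"
    echo "CPU busy over the last second: $(cpu_busy_percent "$first" "$second")%"
fi
```

Because the counters are cumulative since boot, only the difference between two samples is meaningful; a single reading would give the average since boot.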

3. Log Monitoring & Analysis

System Log Locations

# System logs location:
/var/log/syslog # General system messages (Debian/Ubuntu)
/var/log/messages # General system messages (RHEL/CentOS)
/var/log/auth.log # Authentication logs (Debian/Ubuntu)
/var/log/kern.log # Kernel messages (Debian/Ubuntu)
/var/log/dmesg # Kernel ring buffer snapshot from boot
/var/log/secure # Authentication and security messages (RHEL/CentOS)
/var/log/boot.log # System boot messages
/var/log/cron # Cron job logs (RHEL/CentOS)
/var/log/maillog # Mail server logs (RHEL/CentOS)
/var/log/httpd/ # Apache logs (RHEL/CentOS; /var/log/apache2/ on Debian/Ubuntu)
/var/log/nginx/ # Nginx logs
/var/log/mysql/ # MySQL logs

# Check log rotation configuration:
$ ls -la /etc/logrotate.d/

journalctl - Systemd Journal

Basic journalctl Usage

# View all logs:
$ journalctl

# Follow new entries:
$ journalctl -f

# Show logs since boot:
$ journalctl -b

# Show logs for specific unit:
$ journalctl -u nginx.service
$ journalctl -u docker.service

# Show kernel messages:
$ journalctl -k

Advanced journalctl Queries

# Filter by priority:
$ journalctl -p err # Errors and anything more severe
$ journalctl -p err..err # Errors only
$ journalctl -p warning # Warnings and above

# Filter by time:
$ journalctl --since "2025-12-01 09:00:00"
$ journalctl --since "1 hour ago"
$ journalctl --since yesterday

# Choose an output format:
$ journalctl -o json # JSON output
$ journalctl -o json-pretty
$ journalctl -o verbose # All available fields

# Disk usage:
$ journalctl --disk-usage

Log Analysis Commands

grep - Pattern Searching

# Basic searching:
$ grep "error" /var/log/syslog
$ grep -i "failed" /var/log/auth.log

# Multiple patterns:
$ grep -E "error|fail|critical" /var/log/syslog

# Show context:
$ grep -B2 -A2 "out of memory" /var/log/kern.log
# Show 2 lines before and after match

# Count occurrences:
$ grep -c "Connection refused" /var/log/syslog

# Search in multiple files:
$ grep -r "authentication failure" /var/log/

awk & sed - Advanced Processing

# Extract specific fields:
$ awk '{print $1, $5}' /var/log/nginx/access.log

# Count unique IPs in access log:
$ awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr

# Find top URLs accessed:
$ awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20

# Replace text with sed:
$ sed 's/old/new/g' file.log

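# Extract a time range (prints from the first 09:xx line through the first 10:xx line;
# the traditional syslog timestamp pads single-digit days with a space, not a zero):
$ awk '/^Dec  1 09:/, /^Dec  1 10:/' /var/log/syslog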
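
The same field-extraction approach extends to summarizing responses by HTTP status class. This sketch assumes the status code is field 9, as it is in the widely used "combined" access log format where the URL is field 7; the function name status_summary is hypothetical, and a custom log_format would need a different field number.

```shell
#!/bin/bash
# status_summary (hypothetical helper): read access-log lines on stdin and
# print request counts per HTTP status class (2xx, 3xx, 4xx, 5xx).
# Assumes the status code is field 9, as in the "combined" log format.
status_summary() {
    awk '$9 ~ /^[1-5][0-9][0-9]$/ { class[substr($9, 1, 1) "xx"]++ }
         END { for (c in class) print c, class[c] }' | sort
}

# Live run, guarded so the sketch is harmless where the log does not exist.
if [ -r /var/log/nginx/access.log ]; then
    status_summary < /var/log/nginx/access.log
fi
```

A sudden rise in the 5xx row relative to 2xx is a useful first signal before digging into the error log.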

Log Monitoring Script

Automated Log Analysis Script

#!/bin/bash
# log_monitor.sh - Real-time log monitoring

LOG_FILES=(
    "/var/log/syslog"
    "/var/log/auth.log"
    "/var/log/nginx/error.log"
)

PATTERNS=(
    "error"
    "failed"
    "critical"
    "authentication failure"
    "Connection refused"
)

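# Placeholder notification hook; replace the body with mail, a webhook, etc.
send_notification() {
    local subject="$1" message="$2"
    echo "[NOTIFY] $subject: $message" >&2
}

monitor_logs() {
    local logfile="$1"
    
    # -n 0 skips lines already in the file; -F keeps following across log rotation
    tail -n 0 -F "$logfile" | while IFS= read -r line; do
        for pattern in "${PATTERNS[@]}"; do
            # Compare in lowercase so "ERROR" and "error" both match (bash 4+)
            if [[ "${line,,}" =~ ${pattern,,} ]]; then
                local timestamp="$(date '+%Y-%m-%d %H:%M:%S')"
                echo "[$timestamp] [ALERT] $logfile: $line"
                send_notification "Log Alert" "Pattern '$pattern' found in $logfile"
                break # One alert per line, even if several patterns match
            fi
        done
    done
}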

analyze_logs() {
    local logfile="$1"
    local days="${2:-1}" # Reserved: date filtering is not implemented, the whole file is scanned
    
    echo "=== Analysis for $logfile ==="
    
    # Count errors. grep -c already prints 0 when nothing matches (while exiting 1),
    # so appending "|| echo 0" would produce a second line of output.
    local error_count
    error_count="$(grep -c -i "error" "$logfile" 2>/dev/null)"
    echo "Total errors: ${error_count:-0}"
    
    # Most common error messages
    echo "Top error messages:"
    grep -i "error" "$logfile" 2>/dev/null | \
        awk -F: '{print $NF}' | sort | uniq -c | sort -nr | head -5
}

main() {
    echo "Starting log monitoring..."
    
    # Start monitoring in background
    for logfile in "${LOG_FILES[@]}"; do
        if [[ -f "$logfile" ]]; then
            monitor_logs "$logfile" &
        fi
    done
    
    # Daily analysis
    while true; do
        for logfile in "${LOG_FILES[@]}"; do
            analyze_logs "$logfile" 1
        done
        sleep 86400 # Run daily
    done
}

main "$@"

4. Troubleshooting Boot & Performance Issues

Boot Process Troubleshooting

Boot Logs & Systemd

# View boot process:
$ journalctl -b # Current boot
$ journalctl -b -1 # Previous boot
$ journalctl --list-boots # List all boots

# Kernel messages during boot:
$ dmesg | head -50
$ dmesg | grep -i "error\|fail"

# Systemd boot analysis:
$ systemd-analyze time # Boot time breakdown
$ systemd-analyze blame # Services by boot time
$ systemd-analyze critical-chain # Critical path
$ systemd-analyze plot > boot.svg # Create boot chart

Common Boot Issues

# Filesystem check (run only on an unmounted filesystem, for example from rescue media):
$ fsck -y /dev/sda1

# Check disk for bad blocks (read-only scan by default):
$ badblocks -v /dev/sda

# Rescue mode access:
# 1. Reboot and edit kernel parameters
# 2. Add "single" or "init=/bin/bash"
# 3. Mount filesystems read-write:
$ mount -o remount,rw /

# Fix GRUB bootloader:
$ grub-install /dev/sda
$ update-grub # Debian/Ubuntu

Performance Issue Diagnosis

Slow System Symptoms

# Check load average:
$ uptime
# A load that stays above the number of CPU cores means tasks are queuing for CPU or I/O

# Check memory pressure:
$ vmstat 1 5
# Sustained non-zero si/so indicates swapping

# Check I/O wait:
$ iostat -x 1 3
# High %util or await indicates disk bottleneck

# Check process states:
$ ps aux | awk '$8 ~ /D/ {print}'
# D state = uninterruptible sleep (usually I/O)
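
The load-versus-cores rule of thumb can be expressed as a short check. This is a sketch with assumptions: the function name load_status is hypothetical, and the 0.7 "busy" cutoff is an arbitrary illustrative value, not a figure from this guide.

```shell
#!/bin/bash
# load_status LOAD CORES  (hypothetical helper)
# Classifies a load average relative to the number of CPU cores.
# The 0.7 "busy" cutoff is an illustrative assumption; tune it to your baseline.
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        ratio = load / cores
        state = (ratio > 1.0) ? "overloaded" : ((ratio > 0.7) ? "busy" : "ok")
        printf "%.2f per core: %s\n", ratio, state
    }'
}

# Live check using the 5-minute load average, guarded for systems without /proc.
if [ -r /proc/loadavg ]; then
    read -r one five fifteen _ < /proc/loadavg
    load_status "$five" "$(nproc)"
fi
```

On Linux the load average also counts tasks in uninterruptible sleep, so a high ratio with idle CPUs usually points at I/O rather than computation.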

Quick Diagnostics

# One-liner system check:
$ echo "Load: $(uptime | awk -F'load average:' '{print $2}')"; \
    echo "Memory: $(free -h | grep Mem | awk '{print $3"/"$2}')"; \
    echo "Disk: $(df -h / | tail -1 | awk '{print $5}')"; \
    echo "Top CPU: $(ps aux --sort=-%cpu | head -2 | tail -1 | awk '{print $11}')"

# Check for zombie processes:
$ ps aux | awk '$8 ~ /Z/ {print}'
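# Zombies cannot be killed directly. Ask their parents to reap them
# (restart the parent service if it ignores SIGCHLD):
$ ps -A -o stat=,ppid= | awk '$1 ~ /^Z/ {print $2}' | sort -u | xargs -r kill -s CHLD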

# Clear page cache (careful: needs root, and performance dips while caches refill):
$ sync; echo 1 | sudo tee /proc/sys/vm/drop_caches

Troubleshooting Checklist

System Performance Troubleshooting Flow

  1. Identify Symptoms: High load, slow response, errors
  2. Check Load Average: uptime, check if > CPU cores
  3. Monitor CPU: top/htop, check us/sy/idle percentages
  4. Check Memory: free -m, then vmstat to watch for swapping (si/so)
  5. Monitor Disk I/O: iostat, check await and %util
  6. Check Network: iftop/nload, look for bottlenecks
  7. Identify Top Processes: ps aux --sort=-%cpu or -%mem
  8. Check Logs: journalctl, /var/log/ for errors
  9. Verify Services: systemctl status for failed services
  10. Review Recent Changes: Package updates, config changes

Emergency Recovery Commands

# When system becomes unresponsive:
# Magic SysRq keys (if enabled):
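# Press in this order (R-E-I-S-U-B), pausing a few seconds between keys:
# Alt+SysRq+r - Take the keyboard out of raw mode
# Alt+SysRq+e - Send SIGTERM to all processes except init
# Alt+SysRq+i - Send SIGKILL to all processes except init
# Alt+SysRq+s - Sync all mounted filesystems
# Alt+SysRq+u - Remount all filesystems read-only
# Alt+SysRq+b - Reboot immediately, without syncing or unmounting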

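# Force kill the process using the most CPU (check what it is first,
# and try a plain kill before resorting to -9):
$ ps -eo pid,pcpu,comm --sort=-pcpu | head -2
$ kill -9 $(ps -eo pid= --sort=-pcpu | head -1)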

# Emergency disk space cleanup:
$ du -sh /* 2>/dev/null | sort -hr | head -10
# Truncate logs over 100 MB (this discards their contents; archive first if you need them):
$ find /var/log -name "*.log" -size +100M -exec truncate -s 0 {} \;
$ journalctl --vacuum-size=100M # Reduce journal size

Monitoring Command Quick Reference

CPU Monitoring

$ top -bn1 | grep "Cpu(s)"
$ mpstat -P ALL 1 3
$ sar -u 2 5
$ uptime # Load average

Memory Monitoring

$ free -m
$ vmstat 1 5
$ sar -r 1 3
$ cat /proc/meminfo

Disk Monitoring

$ df -h
$ iostat -xz 1 3
$ iotop
$ du -sh /* 2>/dev/null | sort -hr

Network Monitoring

$ iftop -i eth0
$ nload eth0
$ ss -tunap
$ netstat -i

Log Monitoring

$ journalctl -f
$ tail -f /var/log/syslog
$ grep -i "error" /var/log/syslog
$ dmesg | tail -20

Process Monitoring

$ ps aux --sort=-%cpu
$ ps aux --sort=-%mem
$ htop
$ pstree -p

Best Practices Summary

  • Monitor proactively: Don't wait for outages; set up monitoring and alerting before you need them
  • Establish baselines: Know normal behavior to identify anomalies
  • Use the right tool: Different tools for different monitoring needs
  • Set meaningful thresholds: Avoid alert fatigue with proper thresholds
  • Centralize logs: Use centralized logging for distributed systems
  • Automate responses: Script common recovery actions
  • Document procedures: Create runbooks for common issues
  • Regular reviews: Periodically review monitoring and alerting setups
  • Test recovery: Regularly test backup and recovery procedures
  • Stay updated: Keep monitoring tools and scripts current
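
One way to make thresholds meaningful and reduce alert fatigue is to alert only on sustained breaches rather than single spikes. The sketch below illustrates that idea; the function name breach_streak and the sample readings are hypothetical.

```shell
#!/bin/bash
# breach_streak LIMIT NEEDED  (hypothetical helper)
# Reads one metric value per line on stdin and prints "ALERT at sample N" once
# the value has exceeded LIMIT for NEEDED consecutive samples. A single value
# at or under LIMIT resets the streak, and each streak alerts only once.
breach_streak() {
    awk -v limit="$1" -v needed="$2" '
        ($1 + 0) >  (limit + 0) { streak++ }
        ($1 + 0) <= (limit + 0) { streak = 0 }
        streak == needed        { print "ALERT at sample " NR }'
}

# Made-up CPU readings: one isolated spike, then a sustained breach.
printf '%s\n' 50 95 60 85 88 91 93 | breach_streak 80 3
# prints "ALERT at sample 6"
```

The isolated 95 at sample 2 is ignored, while the run of 85, 88, 91 triggers a single alert. The number of consecutive samples to require depends on your collection interval and how quickly you need to react.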

Effective system monitoring and performance management requires understanding both the tools available and the metrics that matter most for your specific environment. Regular monitoring, combined with proper alerting and automated responses, can prevent most issues from becoming outages.
