System Monitoring & Performance Guide

System Monitoring & Performance: The DevOps Monitoring Mastery Guide

Learn to monitor, analyze, and optimize Linux system performance like a seasoned system administrator.

📅 Published: Feb 2026
⏱️ Estimated Reading Time: 18 minutes
🏷️ Tags: System Monitoring, Performance, Troubleshooting, Linux Administration, DevOps

📊 Monitoring CPU, Memory, Disk, Network Usage

Understanding System Resources

Think of your Linux server like a busy restaurant kitchen:

CPU = The chefs (processors cooking tasks)
Memory (RAM) = Counter space (working area for chefs)
Disk (Storage) = Refrigerator and pantry (long-term storage)
Network = Delivery trucks (moving data in and out)

When any of these gets overloaded, your whole "restaurant" slows down. Monitoring helps you see bottlenecks before customers (users) complain!

Real-time vs Historical Monitoring

Real-time monitoring = Watching live camera feeds in the kitchen
Historical monitoring = Reviewing yesterday's kitchen logs

You need both:

Real-time to fix problems happening NOW
Historical to spot trends and prevent future issues

🔧 Essential Monitoring Tools

1. top & htop: The System Dashboard

# Basic system overview
top

# Inside top:
# Press '1' = Show all CPU cores
# Press 'M' = Sort by memory usage
# Press 'P' = Sort by CPU usage
# Press 'k' = Kill a process (enter PID)
# Press 'q' = Quit

# htop (nicer, color-coded)
sudo apt install htop    # Install if needed
htop

# htop features:
# F2 = Setup (configure display)
# F3 = Search for process
# F4 = Filter processes
# F5 = Tree view
# F6 = Sort by column
# F9 = Kill process

Understanding key metrics in top:

top - 10:30:00 up 30 days,  3:15,  1 user,  load average: 0.50, 0.75, 1.00
Tasks: 125 total,   1 running, 124 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.2 us,  1.5 sy,  0.0 ni, 93.3 id,  0.0 wa,  0.0 hi,  0.0 si
MiB Mem :   7845.2 total,   1024.5 free,   2048.3 used,   4772.4 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.   5120.2 avail Mem

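Load average = Average number of processes running or waiting for CPU, plus (on Linux) processes blocked in uninterruptible I/O, over the last 1, 5, and 15 minutes (1.00 per CPU core ≈ fully busy)
%Cpu(s) = User (us), System (sy), Nice (ni), Idle (id), I/O wait (wa), hardware/software interrupt (hi/si) percentages
Memory = Total, free, used, buff/cache (cached disk data)
Swap = Swap space on disk (sustained high usage = memory pressure); "avail Mem" estimates the memory available to new applications without swapping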
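
To make the "1.00 per core" rule concrete, here is a minimal sketch that divides a load average by the core count. The helper name load_per_core is made up for this example, and the live usage line assumes a Linux system with /proc/loadavg and nproc.

```shell
#!/bin/bash
# Normalize a load average by the number of CPU cores.
# load_per_core is a hypothetical helper name, not a standard tool.
load_per_core() {
    local load=$1 cores=$2
    awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

# Demo with fixed numbers: a load of 2.00 on a 4-core machine.
load_per_core 2.00 4

# On a live Linux system (assumes /proc/loadavg and nproc exist):
# load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A result that stays near or above 1.00 suggests the machine has more runnable work than cores to run it.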

2. vmstat: Virtual Memory Statistics

vmstat shows processes, memory, paging, block IO, traps, and CPU activity.

# Basic vmstat
vmstat

# Refresh every 2 seconds, show 5 updates
vmstat 2 5

# Show with timestamps
vmstat -t 2 3

# Understanding vmstat output:
# Procs: r=run queue, b=blocked
# Memory: swpd=swap used, free, buff, cache
# Swap: si=swap in, so=swap out
# IO: bi=blocks in, bo=blocks out
# System: in=interrupts, cs=context switches
# CPU: us, sy, id, wa, st
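
One detail worth knowing: the first line vmstat prints reports averages since boot, so it is usually ignored when looking at current activity. The sketch below averages a few columns over the remaining samples. It assumes the default procps column order (r b swpd free buff cache si so bi bo in cs us sy id wa st); vmstat_summary is a made-up name and the sample numbers are illustrative.

```shell
#!/bin/bash
# Average selected vmstat columns, skipping the first sample
# (which reports averages since boot rather than current activity).
# Assumes the default procps vmstat column order; vmstat_summary is a hypothetical name.
vmstat_summary() {
    awk '
        $1 !~ /^[0-9]+$/ { next }        # skip the two header lines
        ++n == 1 { next }                # skip the since-boot sample
        { r += $1; si += $7; so += $8; wa += $16; m++ }
        END {
            if (m == 0) { print "no samples"; exit 1 }
            printf "samples=%d r=%.1f si=%.1f so=%.1f wa=%.1f\n", m, r/m, si/m, so/m, wa/m
        }'
}

# Demo on captured sample output (illustrative values).
vmstat_summary <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 104857  20480 488636    0    0    12    30  150  300  5  2 93  0  0
 2  0      0 104800  20480 488640    0    0     0    40  200  420 10  3 85  2  0
 4  1      0 104700  20480 488650    0    8     0   900  260  510 12  4 78  6  0
EOF
```

On a live system the same function can be fed directly: vmstat 2 5 | vmstat_summary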

3. iostat: Input/Output Statistics

iostat shows CPU and disk I/O statistics.

# Install sysstat package first
sudo apt install sysstat

# Show CPU and disk statistics
iostat

# Show only disk statistics
iostat -d

# Show extended statistics
iostat -x

# Show with human readable units
iostat -h

# Continuous monitoring (every 2 seconds)
iostat 2

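# Understanding key metrics (shown with -x):
# %util = Percentage of time the device was busy (near 100% suggests saturation on a single disk; SSDs, NVMe, and RAID arrays serve requests in parallel and may still have headroom)
# r_await, w_await = Average time in ms for read/write requests, including queue time (older sysstat versions show a combined await)
# r/s, w/s = Reads/writes per second
# rkB/s, wkB/s = KB read/written per second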
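
Since the column order of iostat -x differs between sysstat versions, scripts are safer when they look up %util by its header name. Here is a rough sketch of that idea; busy_disks is a made-up helper name, the 80% default is just an assumption, and the sample rows are illustrative.

```shell
#!/bin/bash
# Print devices whose %util exceeds a threshold, given `iostat -dx` output on stdin.
# The %util column is located by its header name because the column order
# differs between sysstat versions. busy_disks is a hypothetical helper name.
busy_disks() {
    awk -v limit="${1:-80}" '
        $1 ~ /^Device/ {                 # header row: find the %util column
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 > limit { printf "%s %.1f\n", $1, $col }'
}

# Demo on illustrative sample output.
busy_disks 80 <<'EOF'
Device            r/s     rkB/s   r_await     w/s     wkB/s   w_await  aqu-sz  %util
sda             12.00    480.00      1.20  150.00   9600.00     45.30    6.80  97.50
sdb              0.50      8.00      0.40    1.00     16.00      0.90    0.00   0.40
EOF
```

On a live system: iostat -dx 2 3 | busy_disks 80 (the first report covers the time since boot, so later reports are the interesting ones).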

4. free: Memory Usage

# Basic memory info
free

# Human readable (MB, GB)
free -h

# Show in megabytes
free -m

# Show total
free -t

# Show wide output (separate buffers and cache columns)
free -wh

# Understanding output:
# total = Total installed memory
# used = Memory used by processes
# free = Completely unused memory
# shared = Memory used by tmpfs
# buff/cache = Memory used for disk caching
# available = Memory available for new applications

Key insight: Linux uses otherwise idle memory for disk caching, so a low "free" value is normal as long as "available" stays high. A low "available" value is the real sign of memory pressure.
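
The "available" figure comes from the MemAvailable field in /proc/meminfo, so it is easy to check from a script. A minimal sketch follows, with a made-up helper name (mem_available_pct) and illustrative sample values in kB.

```shell
#!/bin/bash
# Report MemAvailable as a percentage of MemTotal.
# Reads a meminfo-style file (default: /proc/meminfo on Linux).
# mem_available_pct is a hypothetical helper name.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) printf "%.0f\n", avail * 100 / total
            else { print "unknown"; exit 1 }
        }' "${1:-/proc/meminfo}"
}

# Demo with illustrative values (kB) written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
MemTotal:        8000000 kB
MemFree:         1000000 kB
MemAvailable:    5200000 kB
EOF
mem_available_pct "$sample"
rm -f "$sample"
```

Called without arguments on a Linux host, the function reads the live /proc/meminfo.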

5. sar: System Activity Reporter

sar collects and reports system activity information. It's like having a flight data recorder for your server.

# Install sysstat
sudo apt install sysstat

# Enable data collection (on Debian/Ubuntu, first set ENABLED="true" in /etc/default/sysstat)
sudo systemctl enable sysstat
sudo systemctl start sysstat

# View today's CPU statistics
sar

# View memory usage
sar -r

# View swap usage
sar -S

# View disk I/O
sar -d

# View network statistics
sar -n DEV

# View specific time (e.g., 10:00 AM to 11:00 AM)
sar -s 10:00:00 -e 11:00:00

# Save report to file
sar -A > system-report.txt

6. dstat: Versatile Resource Statistics

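dstat combines vmstat, iostat, ifstat, and more in one tool. Note that the original dstat is no longer maintained; on newer distributions the dstat command may be provided by Performance Co-Pilot (pcp-dstat), and some flags can behave differently.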

# Install dstat
sudo apt install dstat

# Basic dstat (refreshes every second)
dstat

# Show CPU, disk, network, paging, system
dstat -cdngy

# Show with colors
dstat -ta --color

# Show specific plugins
dstat --cpu --mem --net --disk

# Save to CSV for analysis
dstat --output /tmp/dstat.csv 5

# Key columns:
# cpu: usr, sys, idl, wai, hiq, siq
# dsk: read, writ
# net: recv, send
# paging: in, out
# system: int, csw

7. iotop: Monitor Disk I/O by Process

While iostat shows disk-level stats, iotop shows which processes are doing the I/O.

# Install iotop
sudo apt install iotop

# Run with sudo (needs root for process info)
sudo iotop

# Only show processes doing I/O
sudo iotop -o

# Show accumulated I/O
sudo iotop -a

# Batch mode (for scripting)
sudo iotop -b -n 5

# Understanding output:
# TID = Thread ID
# PRIO = I/O priority
# USER = Process owner
# DISK READ/DISK WRITE = Throughput
# SWAPIN = Percentage of time the process spent swapping in
# IO = Percentage of time the process spent waiting on I/O

8. iftop: Network Bandwidth Monitor

iftop shows real-time network bandwidth usage.

# Install iftop
sudo apt install iftop

# Monitor all interfaces
sudo iftop

# Monitor specific interface
sudo iftop -i eth0

# Show port numbers
sudo iftop -P

# Text mode (no interactive display): print one report after 10 seconds, then exit
sudo iftop -t -s 10

# Don't resolve hostnames (faster)
sudo iftop -n

# Understanding display:
# TX = Transmit (sending)
# RX = Receive (receiving)
# Cumulative = Total in session
# Peak = Highest rate
# Rates = Current rates

📝 Log Monitoring with journalctl & /var/log/

System Logs: Your Server's Diary

System logs are like a detailed diary of everything happening on your server. When something goes wrong, the logs usually know why.

Key Log Directories

# System logs (Ubuntu/Debian)
/var/log/syslog        # General system messages
/var/log/auth.log      # Authentication logs
/var/log/kern.log      # Kernel messages

# System logs (Red Hat/CentOS)
/var/log/messages      # General system messages
/var/log/secure        # Authentication logs

# Application logs
/var/log/nginx/        # Nginx web server
/var/log/apache2/      # Apache web server
/var/log/mysql/        # MySQL database
/var/lib/docker/containers/   # Docker container logs (default json-file driver; note: not under /var/log)

# Service logs (systemd)
sudo journalctl        # All journal logs

journalctl: The Modern Log Viewer

journalctl is the standard tool for viewing logs on systemd-based systems.

# View all logs (oldest first, opens in a pager)
sudo journalctl

# View in reverse (newest first)
sudo journalctl -r

# Follow new logs (like tail -f)
sudo journalctl -f

# View logs for specific service
sudo journalctl -u nginx
sudo journalctl -u docker.service

# View logs since boot
sudo journalctl -b

# View logs from specific time
sudo journalctl --since "2024-02-10 09:00:00"
sudo journalctl --since "1 hour ago"
sudo journalctl --since yesterday

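# Filter by priority/severity (-p shows the given level and everything more severe)
sudo journalctl -p err               # Errors and worse (err, crit, alert, emerg)
sudo journalctl -p warning           # Warnings and worse
sudo journalctl -p info              # Info and worse
sudo journalctl -p warning..warning  # Only warnings (explicit range)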

# Show kernel messages
sudo journalctl -k

# Show with full details
sudo journalctl -o verbose

# Export logs to file
sudo journalctl --since "today" > /tmp/today-logs.txt

# Combine filters
sudo journalctl -u nginx --since "2 hours ago" -p err
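
When a filter still returns a lot of lines, a quick count per source shows where to look first. The sketch below assumes journalctl's default short output format ("Mon DD HH:MM:SS host ident[pid]: message"); errors_by_source is a made-up name and the log lines are invented samples.

```shell
#!/bin/bash
# Count journal lines per source (e.g. nginx, sshd) from `journalctl -o short` output on stdin.
# Assumes the default short format: "Mon DD HH:MM:SS host ident[pid]: message".
# errors_by_source is a hypothetical helper name.
errors_by_source() {
    awk 'NF >= 5 && $5 ~ /:$/ {
             src = $5
             sub(/\[[0-9]+\]:$/, "", src)   # strip "[pid]:"
             sub(/:$/, "", src)             # strip a bare trailing ":"
             count[src]++
         }
         END { for (s in count) printf "%d %s\n", count[s], s }' | sort -k1,1nr -k2,2
}

# Demo on invented sample lines.
errors_by_source <<'EOF'
Feb 10 09:00:01 web01 nginx[1234]: upstream timed out
Feb 10 09:00:05 web01 sshd[2201]: error: maximum authentication attempts exceeded
Feb 10 09:01:12 web01 nginx[1234]: connect() failed
EOF
```

On a live system: sudo journalctl -p err --since "1 hour ago" --no-pager -o short | errors_by_source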

Traditional Log Files

# View system log
sudo tail -f /var/log/syslog

# View authentication log
sudo tail -f /var/log/auth.log

# Search for errors in logs
sudo grep -i "error\|fail" /var/log/syslog

# Count occurrences of a pattern (e.g. failed SSH logins)
sudo grep -c "Failed password" /var/log/auth.log

# Monitor multiple logs
sudo tail -f /var/log/syslog /var/log/auth.log /var/log/nginx/error.log

# Find large log files
sudo find /var/log -type f -size +100M

# Rotate logs manually
sudo logrotate -f /etc/logrotate.conf

Log Monitoring Script

#!/bin/bash
# log-monitor.sh

LOG_DIRS="/var/log /var/log/nginx /var/log/mysql"
ALERT_EMAIL="admin@example.com"
ERROR_PATTERNS="error\|fail\|critical\|panic\|segmentation fault"

# Count matching lines in a log file (scans the whole file, not only recent entries)
check_errors() {
    local logfile=$1
    local error_count
    error_count=$(sudo grep -c -i "$ERROR_PATTERNS" "$logfile" 2>/dev/null)
    error_count=${error_count:-0}    # grep prints nothing if the file is unreadable
    
    if [ "$error_count" -gt 0 ]; then
        echo "Found $error_count errors in $logfile"
        
        # Show the five most recent matches
        sudo grep -i "$ERROR_PATTERNS" "$logfile" 2>/dev/null | tail -5 | while read -r error; do
            echo "  - $error"
        done
    fi
}

# Monitor disk usage of logs
check_log_disk_usage() {
    local total_size=$(sudo du -sh /var/log | cut -f1)
    echo "Total log disk usage: $total_size"
    
    # Check individual large log files
    sudo find /var/log -type f -size +100M 2>/dev/null | while read logfile; do
        size=$(sudo du -h "$logfile" | cut -f1)
        echo "Large log: $logfile ($size)"
    done
}

# Main monitoring
echo "=== Log Monitoring Report ==="
echo "Time: $(date)"
echo

# Check each log directory
for dir in $LOG_DIRS; do
    if [ -d "$dir" ]; then
        echo "Checking $dir:"
        
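        # Find .log files directly in this directory (subdirectories are listed separately in LOG_DIRS, so -maxdepth 1 avoids scanning them twice)
        find "$dir" -maxdepth 1 -name "*.log" -type f 2>/dev/null | while read -r logfile; do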
            check_errors "$logfile"
        done
        
        echo
    fi
done

# Check disk usage
check_log_disk_usage

echo "=== Report Complete ==="
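
The script above scans each log file from the beginning on every run. If you only want entries written since the previous run, one common approach is to remember the byte offset of the last scan. Here is a rough sketch of that idea; count_new_matches, its arguments, and the state file layout are all made up for illustration.

```shell
#!/bin/bash
# Count pattern matches only in the part of a log written since the previous run.
# The byte offset of the last scan is kept in a state file.
# count_new_matches and its arguments are hypothetical names for this sketch.
count_new_matches() {
    local logfile=$1 statefile=$2 pattern=$3
    local last=0 size
    [ -f "$statefile" ] && last=$(cat "$statefile")
    size=$(( $(wc -c < "$logfile") ))
    # A file smaller than last time suggests rotation or truncation: start over.
    [ "$size" -lt "$last" ] && last=0
    # Read only the bytes between the old offset and the size measured above.
    tail -c "+$((last + 1))" "$logfile" | head -c "$((size - last))" \
        | grep -ciE "$pattern" || true
    echo "$size" > "$statefile"
}

# Demo with a throwaway log file.
demo=$(mktemp -d)
printf 'started\nERROR disk full\n' > "$demo/app.log"
count_new_matches "$demo/app.log" "$demo/app.offset" 'error|fail'   # first run: whole file
printf 'request failed\nall good\n' >> "$demo/app.log"
count_new_matches "$demo/app.log" "$demo/app.offset" 'error|fail'   # second run: new lines only
rm -rf "$demo"
```

For services that log to the journal, journalctl --since "10 minutes ago" achieves the same thing without a state file.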

🔍 Troubleshooting Boot & Performance Issues

The Systematic Troubleshooting Approach

When troubleshooting, follow this methodical approach:

Identify symptoms (What's wrong?)
Check basic health (Is anything obviously broken?)
Review logs (What do the logs say?)
Monitor resources (What's overloaded?)
Test fixes (Does changing X help?)
Document (What was the solution?)

Boot Issues Troubleshooting

Common Boot Problems:

# 1. System won't boot at all
# Check: Power, cables, hardware

# 2. Boots to emergency/rescue mode
# Check: Filesystem errors, missing files

# 3. Boots but services don't start
# Check: Service configuration, dependencies

Boot Troubleshooting Commands:

# View boot messages (if you can get to console)
dmesg | less
dmesg | grep -i "error\|fail"

# View systemd boot process
journalctl -b
journalctl -xb  # With extra details

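# Check filesystem (run only on an unmounted filesystem, e.g. from rescue mode or a live USB)
sudo fsck /dev/sda1

# Check disk for bad sectors (read-only test by default)
sudo badblocks -sv /dev/sda

# Check boot loader
sudo update-grub  # Regenerate the GRUB configuration (Debian/Ubuntu)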

# Boot to single-user/recovery mode
# At GRUB menu: edit kernel line, add "single" or "init=/bin/bash"

Performance Issues Troubleshooting

Step 1: Check Basic System Health

# Quick health check
uptime                    # Load average
free -h                  # Memory usage
df -h                    # Disk space
sudo ss -tulpn           # Listening ports (sudo shows the owning processes)
systemctl --failed       # Failed services

Step 2: Identify Resource Bottlenecks

# CPU Bottleneck Signs:
# - Load average consistently > CPU cores
# - High %us/%sy with low %id in top
# - Long run queue (r column in vmstat)

# Memory Bottleneck Signs:
# - Sustained swap activity (si/so in vmstat)
# - Out of memory errors in dmesg
# - Processes killed by OOM killer

# Disk Bottleneck Signs:
# - High %wa (I/O wait) in top
# - High %util in iostat (>80%)
# - High await time in iostat
# - Slow file operations

# Network Bottleneck Signs:
# - High packet loss (ping)
# - Slow transfer speeds
# - Connection timeouts
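
The signs above can be folded into a rough first-pass check. The sketch below is only that: the function name suspect_bottleneck and the 20% I/O wait threshold are assumptions for illustration, not universal limits. Swapping is tested first because it also raises I/O wait and load, and I/O wait is tested before load because blocked I/O inflates the Linux load average.

```shell
#!/bin/bash
# Rough first-pass classification from numbers reported by uptime, nproc, and vmstat.
# Thresholds are illustrative assumptions, not universal limits.
# Arguments: load average, CPU cores, %wa, swap-in + swap-out per second.
suspect_bottleneck() {
    awk -v load="$1" -v cores="$2" -v wa="$3" -v swap="$4" 'BEGIN {
        if (swap > 0)            print "memory (active swapping)"
        else if (wa > 20)        print "disk (high I/O wait)"
        else if (load > cores)   print "cpu (load exceeds core count)"
        else                     print "none obvious"
    }'
}

# Demo with made-up readings.
suspect_bottleneck 6.2 4 3 0
suspect_bottleneck 6.2 4 35 0
suspect_bottleneck 1.1 4 2 120
```

Treat the result as a hint about where to look next, then confirm with the tool for that resource.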

Step 3: Performance Analysis Commands

# Find CPU-intensive processes
ps aux --sort=-%cpu | head -10

# Find memory-intensive processes
ps aux --sort=-%mem | head -10

# Check I/O wait
vmstat 2 5  # Look at wa column

# Check disk latency
iostat -x 1  # Look at r_await, w_await, and %util (svctm is deprecated and removed in recent sysstat)

# Check network connectivity
ping -c 5 google.com
traceroute google.com
mtr google.com  # Combines ping + traceroute

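# Check for too many open files
cat /proc/sys/fs/file-nr   # System-wide: allocated, unused, maximum
ulimit -n                  # Per-process limit for the current shell

# Check for zombie processes (STAT column starts with Z)
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'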

Common Performance Problems & Solutions

Problem 1: High CPU Usage

# Identify culprit
top  # Look for process with high %CPU

# If it's a specific process, investigate
strace -p <PID>  # See system calls
perf top         # CPU profiling (install linux-tools)

# If it's many small processes
# Check for fork bombs or misbehaving scripts
pstree           # View process tree

# Solutions:
# 1. Kill/restart problematic process
# 2. Adjust process priority (nice/renice)
# 3. Add more CPU capacity
# 4. Optimize application code

Problem 2: High Memory Usage

# Check memory usage
free -h
cat /proc/meminfo

# Check what's using memory
ps aux --sort=-%mem | head -10
sudo slabtop     # Kernel memory usage

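# Check for memory leaks
# Monitor a process over time (a steadily growing RSS suggests a leak):
watch -n 1 'ps -p <PID> -o pid,ppid,cmd,%mem,%cpu,rss,vsz'

# Drop caches (rarely needed: the kernel reclaims cache on demand; mainly useful for benchmarking)
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches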

# Solutions:
# 1. Add more RAM
# 2. Adjust swappiness (vm.swappiness)
# 3. Fix memory leaks in applications
# 4. Add swap space

Problem 3: Slow Disk I/O

# Check disk performance
iostat -x 2
sudo iotop -o

# Check for disk errors
dmesg | grep -i "disk\|sda\|error"
sudo smartctl -a /dev/sda  # SMART data (requires smartmontools)

# Check filesystem fragmentation (ext4)
sudo e4defrag -c /  # Check fragmentation level

# Check for too many small files (under 1 KiB; note that "-size -1k" would match only empty files)
sudo find / -xdev -type f -size -1024c 2>/dev/null | wc -l

# Solutions:
# 1. Upgrade to SSD
# 2. Use RAID for better performance
# 3. Optimize database/config settings
# 4. Move frequently accessed data to faster disk

Problem 4: Network Issues

# Basic connectivity check
ping -c 5 8.8.8.8
ping -c 5 google.com

# Check DNS
nslookup google.com
dig google.com

# Check bandwidth usage (both tools need root)
sudo iftop
sudo nethogs  # Bandwidth by process

# Check connection limits
ss -s  # Socket statistics
cat /proc/sys/net/ipv4/tcp_max_syn_backlog

# Check for packet loss
mtr google.com

# Solutions:
# 1. Check physical connections
# 2. Adjust network card settings
# 3. Optimize application network usage
# 4. Upgrade network infrastructure

🎯 Real-World Monitoring Scenarios

Scenario 1: Production Server Monitoring Dashboard

#!/bin/bash
# server-dashboard.sh

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Header
header() {
    echo -e "${BLUE}========================================${NC}"
    echo -e "${BLUE}    $1${NC}"
    echo -e "${BLUE}========================================${NC}"
}

# Status indicator
status() {
    if [ $1 -eq 0 ]; then
        echo -e "${GREEN}✓ OK${NC}"
    else
        echo -e "${RED}✗ ISSUE${NC}"
    fi
}

# System Information
header "System Information"
echo "Hostname: $(hostname)"
echo "Uptime: $(uptime -p)"
echo "Load Average: $(uptime | awk -F'load average:' '{print $2}')"
echo "Processes: $(ps -e --no-headers | wc -l)"

# CPU Information
header "CPU Information"
CPU_CORES=$(nproc)
LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | tr -d ',')
echo "Cores: $CPU_CORES"
echo -n "Load: $LOAD - "
if (( $(echo "$LOAD > $CPU_CORES" | bc -l) )); then
    echo -e "${RED}High Load${NC}"
else
    echo -e "${GREEN}Normal${NC}"
fi

# Memory Information
header "Memory Information"
MEM_TOTAL=$(free -h | awk '/^Mem:/ {print $2}')
MEM_USED=$(free -h | awk '/^Mem:/ {print $3}')
MEM_PERCENT=$(free | awk '/^Mem:/ {printf("%.0f"), $3/$2*100}')
echo "Total: $MEM_TOTAL"
echo "Used: $MEM_USED ($MEM_PERCENT%)"
if [ $MEM_PERCENT -gt 90 ]; then
    echo -e "${RED}CRITICAL: High memory usage${NC}"
elif [ $MEM_PERCENT -gt 70 ]; then
    echo -e "${YELLOW}WARNING: High memory usage${NC}"
else
    echo -e "${GREEN}Normal${NC}"
fi

# Disk Information
header "Disk Information"
df -h / | awk 'NR==2 {print "Mount: "$6, "Size: "$2, "Used: "$3, "Avail: "$4, "Use%: "$5}'
DISK_PERCENT=$(df / | awk 'NR==2{print $5}' | sed 's/%//')
if [ $DISK_PERCENT -gt 90 ]; then
    echo -e "${RED}CRITICAL: Disk almost full${NC}"
elif [ $DISK_PERCENT -gt 80 ]; then
    echo -e "${YELLOW}WARNING: Disk usage high${NC}"
fi

# Service Status
header "Service Status"
services=("ssh" "nginx" "mysql" "docker" "redis")
for service in "${services[@]}"; do
    echo -n "$service: "
    if systemctl is-active --quiet "$service" 2>/dev/null; then
        echo -e "${GREEN}Running${NC}"
    else
        echo -e "${RED}Stopped${NC}"
    fi
done

# Network Status
header "Network Status"
IP_ADDR=$(hostname -I | awk '{print $1}')
echo "IP Address: $IP_ADDR"
echo -n "Internet Connectivity: "
if ping -c 1 -W 2 8.8.8.8 >/dev/null 2>&1; then
    echo -e "${GREEN}Connected${NC}"
else
    echo -e "${RED}Disconnected${NC}"
fi

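# Recent Errors
header "Recent Errors (last 10 minutes, up to 10 entries)"
# Capture the output first: with "cmd | tail || echo", the fallback never runs because tail always succeeds
RECENT_ERRORS=$(sudo journalctl --since "10 minutes ago" -p err --no-pager -q 2>/dev/null | tail -10)
echo "${RECENT_ERRORS:-No recent errors}"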

echo -e "\n${BLUE}Report generated: $(date)${NC}"
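
The dashboard above checks only the root filesystem. If the server has separate data or log volumes, a small filter over df -P output can flag every mount over a threshold. This is a sketch under assumptions: full_filesystems is a made-up helper name, the sample table is illustrative, and mount points containing spaces are not handled.

```shell
#!/bin/bash
# List filesystems at or above a usage threshold, given `df -P` output on stdin.
# -P (POSIX format) keeps each filesystem on one line, which makes parsing safer.
# full_filesystems is a hypothetical helper name; mount points with spaces are not handled.
full_filesystems() {
    awk -v limit="${1:-80}" '
        NR > 1 {
            pct = $5; sub(/%$/, "", pct)
            if (pct + 0 >= limit) printf "%s %d%%\n", $6, pct
        }'
}

# Demo on illustrative sample output.
full_filesystems 80 <<'EOF'
Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/sda1         41152736 37860517   1178076      97% /
/dev/sdb1        103081248 20616249  77205639      22% /data
tmpfs              4016740        0   4016740       0% /run
EOF
```

On a live system with GNU df: df -P -x tmpfs -x devtmpfs | full_filesystems 80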

Scenario 2: Automated Performance Alert System

#!/bin/bash
# performance-alert.sh

# Configuration
ALERT_EMAIL="admin@example.com"
LOG_FILE="/var/log/performance-alerts.log"

# Thresholds
CPU_THRESHOLD=80
MEM_THRESHOLD=85
DISK_THRESHOLD=90
LOAD_THRESHOLD=$(nproc)

# Log function
log_alert() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Check CPU
check_cpu() {
    # Total CPU usage = 100 - idle, taken from the second vmstat sample
    # (the first sample reports averages since boot)
    CPU_USAGE=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')
    
    if (( $(echo "$CPU_USAGE > $CPU_THRESHOLD" | bc -l) )); then
        log_alert "CPU ALERT: Usage at ${CPU_USAGE}%"
        
        # Get top CPU processes
        TOP_PROCESSES=$(ps aux --sort=-%cpu | head -6)
        log_alert "Top processes:"$'\n'"$TOP_PROCESSES"
        
        return 1
    fi
    return 0
}

# Check Memory
check_memory() {
    MEM_USAGE=$(free | awk '/^Mem/{printf("%.0f", $3/$2*100)}')
    # Guard against division by zero on systems without swap
    SWAP_USAGE=$(free | awk '/^Swap/{ if ($2 > 0) printf("%.0f", $3/$2*100); else print 0 }')
    
    if [ "$MEM_USAGE" -gt "$MEM_THRESHOLD" ]; then
        log_alert "MEMORY ALERT: Usage at ${MEM_USAGE}%"
        
        # Get top memory processes
        TOP_PROCESSES=$(ps aux --sort=-%mem | head -6)
        log_alert "Top memory users:"$'\n'"$TOP_PROCESSES"
        
        return 1
    fi
    
    if [ $SWAP_USAGE -gt 50 ]; then
        log_alert "SWAP ALERT: Swap usage at ${SWAP_USAGE}%"
        return 1
    fi
    return 0
}

# Check Disk
check_disk() {
    DISK_USAGE=$(df / | awk 'NR==2{print $5}' | sed 's/%//')
    
    if [ $DISK_USAGE -gt $DISK_THRESHOLD ]; then
        log_alert "DISK ALERT: Usage at ${DISK_USAGE}%"
        
        # Find large files on the root filesystem (-xdev stays on this filesystem)
        LARGE_FILES=$(sudo find / -xdev -type f -size +100M 2>/dev/null | head -10)
        log_alert "Large files found:"$'\n'"$LARGE_FILES"
        
        return 1
    fi
    return 0
}

# Check Load
check_load() {
    LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | tr -d ',')
    
    if (( $(echo "$LOAD > $LOAD_THRESHOLD" | bc -l) )); then
        log_alert "LOAD ALERT: Load average at $LOAD"
        return 1
    fi
    return 0
}

# Main monitoring
log_alert "=== Starting performance check ==="

ALERT_COUNT=0

check_cpu || ((ALERT_COUNT++))
check_memory || ((ALERT_COUNT++))
check_disk || ((ALERT_COUNT++))
check_load || ((ALERT_COUNT++))

if [ $ALERT_COUNT -gt 0 ]; then
    # Send email alert
    SUBJECT="Performance Alert: $ALERT_COUNT issues detected"
    BODY=$(tail -20 "$LOG_FILE")
    
    echo "$BODY" | mail -s "$SUBJECT" "$ALERT_EMAIL"
    
    log_alert "Alert email sent to $ALERT_EMAIL"
fi

log_alert "=== Performance check complete ==="
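
The CPU check above depends on parsing the text output of another tool. A format-independent alternative is to sample the first line of /proc/stat twice and compare the counters. The sketch below shows the arithmetic only; cpu_busy_pct is a made-up helper name, and it treats idle plus iowait as "not busy", which is one reasonable convention among several.

```shell
#!/bin/bash
# Compute CPU busy percentage from two "cpu ..." lines of /proc/stat.
# Fields after "cpu": user nice system idle iowait irq softirq steal guest guest_nice.
# cpu_busy_pct is a hypothetical helper name.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i   # user..steal
            idle = $5 + $6                                     # idle + iowait
            if (NR == 1) { total0 = total; idle0 = idle }
        }
        END {
            dt = total - total0
            if (dt > 0) printf "%.1f\n", (dt - (idle - idle0)) * 100 / dt
            else print "0.0"
        }'
}

# Demo with two made-up samples.
cpu_busy_pct "cpu 100 0 100 800 0 0 0 0 0 0" "cpu 200 0 150 1600 50 0 0 0 0 0"

# On a live Linux system:
# a=$(grep '^cpu ' /proc/stat); sleep 1; b=$(grep '^cpu ' /proc/stat)
# cpu_busy_pct "$a" "$b"
```

The result can be compared against CPU_THRESHOLD in the same way as the existing check.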

Scenario 3: Boot Performance Analysis

#!/bin/bash
# boot-analyzer.sh

# Analyze boot performance

echo "=== Boot Performance Analysis ==="
echo

# 1. Check boot time
echo "1. Boot Time Analysis:"
systemd-analyze
echo

# 2. Check what's slowing boot
echo "2. Slowest Boot Units:"
systemd-analyze blame | head -10
echo

# 3. Check boot chain
echo "3. Boot Chain Graph:"
systemd-analyze critical-chain
echo

# 4. Check kernel messages (case-insensitive match; dmesg may require sudo on some systems)
echo "4. Kernel Boot Messages (errors/warnings):"
dmesg | grep -iE "error|fail|warning" | head -10
echo

# 5. Check failed services
echo "5. Failed Services:"
systemctl --failed
echo

# 6. Check startup time of key services
echo "6. Key Services Startup Time:"
for service in ssh nginx mysql docker; do
    # systemd-analyze blame prints "<time> <unit>" on each line
    TIME=$(systemd-analyze blame 2>/dev/null | awk -v u="$service.service" '$NF == u { $NF = ""; print; exit }')
    if [ -n "$TIME" ]; then
        echo "  $service: $TIME"
    fi
done
echo

# 7. Check for unnecessary services
echo "7. Enabled Services (consider disabling if not needed):"
systemctl list-unit-files --type=service --state=enabled | grep -v "@" | head -20
echo

echo "=== Analysis Complete ==="
echo "Recommendations:"
echo "1. Disable unnecessary services"
echo "2. Check slow services from 'systemd-analyze blame'"
echo "3. Consider using 'systemd-analyze plot > boot.svg' for detailed graph"
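
To go one step beyond reading the blame list by eye, the sketch below filters it for units that took longer than a chosen number of seconds. It assumes blame lines of the form "<time> <unit>", with times written like 523ms, 5.123s, or 1min 2.345s; slow_units is a made-up helper name, and other units (such as hours) are not handled.

```shell
#!/bin/bash
# Filter `systemd-analyze blame` output (on stdin) for units slower than a threshold in seconds.
# Handles times written as "523ms", "5.123s", or "1min 2.345s"; other units are ignored.
# slow_units is a hypothetical helper name.
slow_units() {
    awk -v limit="${1:-1}" '
        NF >= 2 {
            secs = 0
            for (i = 1; i < NF; i++) {           # every field except the unit name
                if      ($i ~ /^[0-9.]+ms$/)  secs += $i / 1000
                else if ($i ~ /^[0-9.]+min$/) secs += $i * 60
                else if ($i ~ /^[0-9.]+s$/)   secs += $i
            }
            if (secs >= limit) printf "%.3f %s\n", secs, $NF
        }'
}

# Demo on illustrative sample output.
slow_units 1 <<'EOF'
1min 2.345s apt-daily.service
     5.123s NetworkManager-wait-online.service
      523ms ssh.service
EOF
```

On a live system: systemd-analyze blame | slow_units 2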

Scenario 4: Continuous Monitoring with sar

#!/bin/bash
# sar-monitor.sh

# Generate daily performance report using sar

REPORT_DIR="/var/log/sar-reports"
REPORT_FILE="$REPORT_DIR/daily-report-$(date +%Y%m%d).txt"

# Create report directory
mkdir -p "$REPORT_DIR"

echo "=== Daily Performance Report ===" > "$REPORT_FILE"
echo "Generated: $(date)" >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# CPU Usage Summary
echo "CPU Usage (latest samples and daily average):" >> "$REPORT_FILE"
sar | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Memory Usage Summary
echo "Memory Usage:" >> "$REPORT_FILE"
sar -r | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Swap Usage Summary
echo "Swap Usage:" >> "$REPORT_FILE"
sar -S | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Disk I/O Summary
echo "Disk I/O (/dev/sda):" >> "$REPORT_FILE"
sar -d | grep "sda" | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Network Summary
echo "Network (eth0):" >> "$REPORT_FILE"
sar -n DEV | grep "eth0" | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Load Average
echo "Load Average:" >> "$REPORT_FILE"
sar -q | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

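# Peak Usage Times (samples with the lowest %idle, the last column; this works for both 12-hour and 24-hour timestamps)
echo "Peak CPU Usage Times (lowest %idle first):" >> "$REPORT_FILE"
sar | awk '/^[0-9]/ && $NF ~ /^[0-9.]+$/ {print $NF, $0}' | sort -n | head -5 | cut -d' ' -f2- >> "$REPORT_FILE"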
echo >> "$REPORT_FILE"

echo "Report saved to: $REPORT_FILE"

📋 Quick Reference Cheat Sheet

Task	Commands	Purpose
CPU Monitoring	`top`, `htop`, `mpstat`	Real-time CPU usage
Memory Monitoring	`free -h`, `vmstat`, `sar -r`	Memory usage
Disk Monitoring	`iostat -x`, `iotop`, `df -h`	Disk I/O and space
Network Monitoring	`iftop`, `nethogs`, `sar -n DEV`	Network bandwidth
Process Monitoring	`ps aux`, `pstree`, `lsof`	Process information
Load Average	`uptime`, `sar -q`	System load
Boot Analysis	`systemd-analyze`, `dmesg`	Boot performance
Log Monitoring	`journalctl`, `tail -f /var/log/syslog`	System logs
Continuous Monitoring	`sar`, `dstat`	Historical data
I/O Wait	`vmstat 1`, `iostat 1`	Disk bottleneck detection
Swap Monitoring	`swapon --show`, `free -h`	Swap usage
Service Monitoring	`systemctl status`, `systemctl --failed`	Service health

🚀 Practice Exercises

Exercise 1: Create a Resource Monitor

#!/bin/bash
# resource-monitor.sh

watch -n 2 '
echo "=== System Resource Monitor ==="
echo "Time: $(date)"
echo
echo "CPU Load:"
uptime
echo
echo "Memory Usage:"
free -h | grep -E "^Mem:"
echo
echo "Disk Usage:"
df -h / | tail -1
echo
echo "Top 5 Processes by CPU:"
ps aux --sort=-%cpu | head -6 | awk "{print \$1, \$2, \$3, \$11}"
echo
echo "Top 5 Processes by Memory:"
ps aux --sort=-%mem | head -6 | awk "{print \$1, \$2, \$4, \$11}"
'

Exercise 2: Analyze Boot Performance

# 1. Check current boot time
systemd-analyze

# 2. See what's taking time
systemd-analyze blame | head -10

# 3. Generate boot chart
systemd-analyze plot > boot.svg
# Transfer boot.svg to view graphically

# 4. Check for failed services
systemctl --failed

# 5. Disable a non-essential service (example: bluetooth)
sudo systemctl disable bluetooth.service
sudo systemctl mask bluetooth.service

# 6. Re-check boot time after reboot

Exercise 3: Monitor Disk I/O in Real-time

# Terminal 1: Monitor overall disk I/O
watch -n 1 'iostat -x 1 2 | tail -10'

# Terminal 2: Monitor per-process I/O
sudo iotop -o

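# Terminal 3: Generate disk load (conv=fdatasync forces the data to disk; if /tmp is a RAM-backed tmpfs, write to a path on a real disk instead)
dd if=/dev/zero of=/tmp/testfile bs=1M count=1000 conv=fdatasync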

# Observe what happens in Terminals 1 and 2
# Then clean up: rm /tmp/testfile

Exercise 4: Network Performance Test

# 1. Check current bandwidth
sudo iftop -i eth0

# 2. Test network speed
# On server: iperf3 -s
# On client: iperf3 -c server-ip

# 3. Check for packet loss
ping -c 100 google.com | grep "packet loss"

# 4. Trace route with statistics
mtr google.com

# 5. Check active connections
ss -Htun | wc -l                     # All TCP/UDP sockets (-H omits the header line)
ss -Htn state established | wc -l    # Established TCP connections
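
A per-state breakdown is often more telling than a single total, for example when TIME-WAIT or CLOSE-WAIT sockets pile up. The sketch below counts states from ss -tan output; tcp_state_counts is a made-up helper name and the sample addresses are invented.

```shell
#!/bin/bash
# Count TCP sockets per state from `ss -tan` output on stdin.
# With -t only, the first column is the state. tcp_state_counts is a hypothetical helper name.
tcp_state_counts() {
    awk '$1 != "State" && NF { count[$1]++ }
         END { for (s in count) printf "%s %d\n", s, count[s] }' | sort
}

# Demo on invented sample output.
tcp_state_counts <<'EOF'
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
ESTAB      0      0      10.0.0.5:22          10.0.0.9:51234
ESTAB      0      0      10.0.0.5:443         203.0.113.7:40112
TIME-WAIT  0      0      10.0.0.5:443         203.0.113.8:40200
LISTEN     0      128    0.0.0.0:22           0.0.0.0:*
EOF
```

On a live system: ss -tan | tcp_state_counts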

🔗 Master System Monitoring with Hands-on Labs

System monitoring is what separates junior from senior DevOps engineers. The ability to quickly identify and resolve performance issues is invaluable.

👉 Practice system monitoring, performance analysis, and troubleshooting in our interactive labs at:
https://devops.trainwithsky.com/

Our platform provides:

Real server environments to monitor
Performance bottleneck scenarios
Troubleshooting challenges
Production-like monitoring setups
Guided exercises with expert feedback

Frequently Asked Questions

Q: What's a "good" load average?
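A: As a rule of thumb, sustained load average should stay below the number of CPU cores. Example: on a 4-core CPU, a load below 4 is fine. On Linux, load also counts processes waiting on disk I/O, so check %wa before blaming the CPU.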

Q: How often should I check system metrics?
A: In real time during incidents, every 5-15 minutes for routine checks, and daily or weekly for historical trend reviews.

Q: What's more important: CPU or memory?
A: It depends! Database servers need memory, compute servers need CPU. Monitor both.

Q: Should I be worried about high buff/cache memory?
A: No! Linux uses free memory for disk caching. This is good - it speeds up disk access.

Q: How do I find what caused a server crash yesterday?
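A: Check journalctl -b -1 (logs from the previous boot, if the journal is persistent), journalctl --since yesterday, and /var/log/syslog (Debian/Ubuntu) or /var/log/messages (Red Hat/CentOS). Note that dmesg only covers the current boot.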

Q: What's the difference between %wa and %id in CPU stats?
A: %wa = waiting for I/O (disk bottleneck), %id = idle (CPU has nothing to do).

Q: How can I monitor a script's resource usage?
A: Use /usr/bin/time -v ./script.sh, or run the script and monitor it with top -p "$(pgrep -d, -f scriptname)" (top expects a comma-separated PID list).
