System Monitoring & Performance Guide - DevOps

 System Monitoring & Performance: The DevOps Monitoring Mastery Guide

Learn to monitor, analyze, and optimize Linux system performance like a seasoned system administrator.

📅 Published: Feb 2026
⏱️ Estimated Reading Time: 18 minutes
🏷️ Tags: System Monitoring, Performance, Troubleshooting, Linux Administration, DevOps


📊 Monitoring CPU, Memory, Disk, Network Usage

Understanding System Resources

Think of your Linux server like a busy restaurant kitchen:

  • CPU = The chefs (processors cooking tasks)

  • Memory (RAM) = Counter space (working area for chefs)

  • Disk (Storage) = Refrigerator and pantry (long-term storage)

  • Network = Delivery trucks (moving data in and out)

When any of these gets overloaded, your whole "restaurant" slows down. Monitoring helps you see bottlenecks before customers (users) complain!

Real-time vs Historical Monitoring

Real-time monitoring = Watching live camera feeds in the kitchen
Historical monitoring = Reviewing yesterday's kitchen logs

You need both:

  • Real-time to fix problems happening NOW

  • Historical to spot trends and prevent future issues


🔧 Essential Monitoring Tools

1. top & htop: The System Dashboard

bash
# Basic system overview
top

# Inside top:
# Press '1' = Show all CPU cores
# Press 'M' = Sort by memory usage
# Press 'P' = Sort by CPU usage
# Press 'k' = Kill a process (enter PID)
# Press 'q' = Quit

# htop (nicer, color-coded)
sudo apt install htop    # Install if needed
htop

# htop features:
# F2 = Setup (configure display)
# F3 = Search for process
# F4 = Filter processes
# F5 = Tree view
# F6 = Sort by column
# F9 = Kill process

Understanding key metrics in top:

text
top - 10:30:00 up 30 days,  3:15,  1 user,  load average: 0.50, 0.75, 1.00
Tasks: 125 total,   1 running, 124 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.2 us,  1.5 sy,  0.0 ni, 93.3 id,  0.0 wa,  0.0 hi,  0.0 si
MiB Mem :   7845.2 total,   1024.5 free,   2048.3 used,   4772.4 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.   5120.2 avail Mem
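
  • Load average = Average number of processes running or waiting for CPU, plus (on Linux) processes in uninterruptible I/O wait, over the last 1, 5, and 15 minutes (1.00 per CPU core = fully busy)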

  • %Cpu(s) = User (us), System (sy), Idle (id), Wait I/O (wa) percentages

  • Memory = Total, used, free, buff/cache (cached disk data)

  • Swap = Virtual memory on disk (high usage = memory pressure)
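To make the "1.00 per CPU core" rule concrete, here is a minimal sketch that divides a load average by the core count. The function name `load_per_core` is hypothetical (it is not part of top or any standard tool); a result near or above 1.00 suggests the cores are fully busy.

```bash
# load_per_core LOAD CORES -> prints LOAD divided by CORES with two decimals.
# Hypothetical helper for illustration only.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Example with live values (1-minute load from /proc/loadavg):
# load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

For instance, a load of 2.00 on a 4-core machine gives 0.50, meaning about half of the CPU capacity is in demand.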

2. vmstat: Virtual Memory Statistics

vmstat shows processes, memory, paging, block IO, traps, and CPU activity.

bash
# Basic vmstat
vmstat

# Refresh every 2 seconds, show 5 updates
# (the first line reports averages since boot; read the later lines)
vmstat 2 5

# Show with timestamps
vmstat -t 2 3

# Understanding vmstat output:
# Procs: r=run queue, b=blocked
# Memory: swpd=swap used, free, buff, cache
# Swap: si=swap in, so=swap out
# IO: bi=blocks in, bo=blocks out
# System: in=interrupts, cs=context switches
# CPU: us, sy, id, wa, st
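
Reading several vmstat samples by eye gets tedious, so here is a rough sketch that condenses them into one line. It assumes the standard column order listed above (r first, si and so in columns 7 and 8, wa in column 16). The function name `vmstat_summary` is made up for this example.

```bash
# vmstat_summary -> reads vmstat output on stdin and prints the largest run
# queue, the average I/O wait, and the total swap-in/swap-out activity.
# Hypothetical helper; assumes the default vmstat column layout.
vmstat_summary() {
    awk '
        $1 !~ /^[0-9]+$/ { next }   # skip header lines
        { rows++ }
        rows == 1 { next }          # skip first sample (averages since boot)
        {
            if ($1 + 0 > max_r + 0) max_r = $1
            wa_sum += $16
            swap += $7 + $8
            n++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "max_r=%d avg_wa=%.1f swap_activity=%d\n", max_r, wa_sum / n, swap
        }'
}

# Usage: vmstat 2 5 | vmstat_summary
```

A max_r above the core count hints at CPU pressure, a high avg_wa points at disk, and any swap_activity above zero is worth a closer look at memory.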

3. iostat: Input/Output Statistics

iostat shows CPU and disk I/O statistics.

bash
# Install sysstat package first
sudo apt install sysstat

# Show CPU and disk statistics
iostat

# Show only disk statistics
iostat -d

# Show extended statistics
iostat -x

# Show with human readable units
iostat -h

# Continuous monitoring (every 2 seconds)
iostat 2

# Understanding key metrics:
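# %util = Percentage of time the device was busy (near 100% suggests saturation on a single disk; SSDs and RAID arrays that serve requests in parallel can show 100% with headroom left)
# await = Average time (ms) an I/O request takes, including queue time (newer sysstat versions split this into r_await and w_await)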
# r/s, w/s = Reads/writes per second
# rkB/s, wkB/s = KB read/written per second
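
Because the iostat -x column layout differs between sysstat versions, a script should find %util by its header name instead of a fixed position. The sketch below does that; `busy_disks` is a hypothetical name, and the 80 in the usage line is just an example threshold.

```bash
# busy_disks LIMIT -> reads `iostat -dx` output on stdin and prints each
# device whose %util is at or above LIMIT. Hypothetical helper.
busy_disks() {
    awk -v limit="$1" '
        # Header row: remember which column holds %util
        $1 ~ /^Device/ {
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # Data rows: only once the header has been seen
        col && NF >= col && $col + 0 >= limit { print $1, $col }'
}

# Usage (LC_ALL=C keeps decimal points instead of locale-specific commas):
# LC_ALL=C iostat -dx 2 2 | busy_disks 80
```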

4. free: Memory Usage

bash
# Basic memory info
free

# Human readable (MB, GB)
free -h

# Show in megabytes
free -m

# Show total
free -t

# Show wide output (all details)
free -wh

# Understanding output:
# total = Total installed memory
# used = Memory used by processes
# free = Completely unused memory
# shared = Memory used by tmpfs
# buff/cache = Memory used for disk caching
# available = Memory available for new applications

Key insight: Linux uses otherwise idle memory for disk caching, so a low "free" value is normal as long as "available" stays high. Low "available" memory is the real warning sign.
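
Since "available" is the number that matters, here is a small sketch that reports it as a percentage of total memory. It reads the MemTotal and MemAvailable fields from /proc/meminfo (the same source free uses); the function name `mem_available_pct` is an assumption for this example.

```bash
# mem_available_pct [FILE] -> prints MemAvailable as a whole-number percentage
# of MemTotal. FILE defaults to /proc/meminfo. Hypothetical helper.
mem_available_pct() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END { if (total > 0) printf "%.0f\n", avail / total * 100 }
    ' "${1:-/proc/meminfo}"
}

# Usage: echo "Available memory: $(mem_available_pct)%"
```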

5. sar: System Activity Reporter

sar collects and reports system activity information. It's like having a flight data recorder for your server.

bash
# Install sysstat
sudo apt install sysstat

# Enable data collection (on Debian/Ubuntu, also set ENABLED="true" in /etc/default/sysstat)
sudo systemctl enable sysstat
sudo systemctl start sysstat

# View today's CPU statistics
sar

# View memory usage
sar -r

# View swap usage
sar -S

# View disk I/O
sar -d

# View network statistics
sar -n DEV

# View specific time (e.g., 10:00 AM to 11:00 AM)
sar -s 10:00:00 -e 11:00:00

# Save report to file
sar -A > system-report.txt

6. dstat: Versatile Resource Statistics

dstat combines vmstat, iostat, ifstat, and more in one tool.

bash
# Install dstat
sudo apt install dstat

# Basic dstat (refreshes every second)
dstat

# Show CPU, disk, network, paging, system
dstat -cdngy

# Show with colors
dstat -ta --color

# Show specific plugins
dstat --cpu --mem --net --disk

# Save to CSV for analysis
dstat --output /tmp/dstat.csv 5

# Key columns:
# cpu: usr, sys, idl, wai, hiq, siq
# dsk: read, writ
# net: recv, send
# paging: in, out
# system: int, csw

7. iotop: Monitor Disk I/O by Process

While iostat shows disk-level stats, iotop shows which processes are doing the I/O.

bash
# Install iotop
sudo apt install iotop

# Run with sudo (needs root for process info)
sudo iotop

# Only show processes doing I/O
sudo iotop -o

# Show accumulated I/O
sudo iotop -a

# Batch mode (for scripting)
sudo iotop -b -n 5

# Understanding output:
# TID = Thread ID
# PRIO = I/O priority
# USER = Process owner
# DISK READ/DISK WRITE = Throughput
# SWAPIN = Percentage of time the thread spent swapping in
# IO = Percentage of time the thread spent waiting on I/O

8. iftop: Network Bandwidth Monitor

iftop shows real-time network bandwidth usage.

bash
# Install iftop
sudo apt install iftop

# Monitor all interfaces
sudo iftop

# Monitor specific interface
sudo iftop -i eth0

# Show port numbers
sudo iftop -P

# Show in batch mode
sudo iftop -t -s 10

# Don't resolve hostnames (faster)
sudo iftop -n

# Understanding display:
# TX = Transmit (sending)
# RX = Receive (receiving)
# Cumulative (cum) = Total transferred since iftop started
# Peak = Highest rate seen
# Rates = Average rates over the last 2, 10, and 40 seconds

📝 Log Monitoring with journalctl & /var/log/

System Logs: Your Server's Diary

System logs are like a detailed diary of everything happening on your server. When something goes wrong, the logs usually know why.

Key Log Directories

bash
# System logs (Ubuntu/Debian)
/var/log/syslog        # General system messages
/var/log/auth.log      # Authentication logs
/var/log/kern.log      # Kernel messages

# System logs (Red Hat/CentOS)
/var/log/messages      # General system messages
/var/log/secure        # Authentication logs

# Application logs
/var/log/nginx/        # Nginx web server
/var/log/apache2/      # Apache web server
/var/log/mysql/        # MySQL database
/var/lib/docker/containers/  # Docker container logs (default json-file driver)

# Service logs (systemd)
sudo journalctl        # All journal logs

journalctl: The Modern Log Viewer

journalctl is the standard tool for viewing logs on systemd-based systems.

bash
# View all logs (oldest first)
sudo journalctl

# View in reverse (newest first)
sudo journalctl -r

# Follow new logs (like tail -f)
sudo journalctl -f

# View logs for specific service
sudo journalctl -u nginx
sudo journalctl -u docker.service

# View logs since boot
sudo journalctl -b

# View logs from specific time
sudo journalctl --since "2024-02-10 09:00:00"
sudo journalctl --since "1 hour ago"
sudo journalctl --since yesterday

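# Filter by priority/severity (shows the given level and anything more severe)
sudo journalctl -p err          # Errors and worse (err, crit, alert, emerg)
sudo journalctl -p warning      # Warnings and worse
sudo journalctl -p info         # Info and worse
sudo journalctl -p warning..warning  # Only warnings (explicit range)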

# Show kernel messages
sudo journalctl -k

# Show with full details
sudo journalctl -o verbose

# Export logs to file
sudo journalctl --since "today" > /tmp/today-logs.txt

# Combine filters
sudo journalctl -u nginx --since "2 hours ago" -p err
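
When a filter still returns hundreds of lines, it helps to see which program produced most of them. The sketch below counts lines per source, assuming journalctl's default short output format (timestamp, hostname, then `name[pid]:`). The name `count_by_source` is hypothetical.

```bash
# count_by_source -> reads short-format journal lines on stdin and prints
# "COUNT SOURCE" pairs, busiest source first. Hypothetical helper.
count_by_source() {
    awk '
        NF >= 5 && $1 != "--" {
            src = $5
            sub(/\[[0-9]+\]:?$/, "", src)   # "nginx[123]:" -> "nginx"
            sub(/:$/, "", src)              # "kernel:"     -> "kernel"
            count[src]++
        }
        END { for (s in count) print count[s], s }
    ' | sort -rn
}

# Usage: sudo journalctl -p err -b -q --no-pager -o short | count_by_source
```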

Traditional Log Files

bash
# View system log
sudo tail -f /var/log/syslog

# View authentication log
sudo tail -f /var/log/auth.log

# Search for errors in logs
sudo grep -i "error\|fail" /var/log/syslog

# Count occurrences of a pattern
sudo grep -c "Connection refused" /var/log/auth.log

# Monitor multiple logs
sudo tail -f /var/log/syslog /var/log/auth.log /var/log/nginx/error.log

# Find large log files
sudo find /var/log -type f -size +100M

# Rotate logs manually
sudo logrotate -f /etc/logrotate.conf

Log Monitoring Script

bash
#!/bin/bash
# log-monitor.sh

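LOG_DIRS="/var/log"   # find searches subdirectories (nginx, mysql, ...) recursively
ALERT_EMAIL="admin@example.com"
ERROR_PATTERNS="error\|fail\|critical\|panic\|segmentation fault"

# Count error lines in a log file (scans the whole file, not just recent entries)
check_errors() {
    local logfile=$1
    local error_count
    error_count=$(sudo grep -c -i "$ERROR_PATTERNS" "$logfile" 2>/dev/null)
    error_count=${error_count:-0}   # grep prints nothing if the file is unreadable
    
    if [ "$error_count" -gt 0 ]; then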
        echo "Found $error_count errors in $logfile"
        
        # Get sample errors
        sudo grep -i "$ERROR_PATTERNS" "$logfile" 2>/dev/null | tail -5 | while read error; do
            echo "  - $error"
        done
    fi
}

# Monitor disk usage of logs
check_log_disk_usage() {
    local total_size=$(sudo du -sh /var/log | cut -f1)
    echo "Total log disk usage: $total_size"
    
    # Check individual large log files
    sudo find /var/log -type f -size +100M 2>/dev/null | while read logfile; do
        size=$(sudo du -h "$logfile" | cut -f1)
        echo "Large log: $logfile ($size)"
    done
}

# Main monitoring
echo "=== Log Monitoring Report ==="
echo "Time: $(date)"
echo

# Check each log directory
for dir in $LOG_DIRS; do
    if [ -d "$dir" ]; then
        echo "Checking $dir:"
        
        # Find all .log files
        find "$dir" -name "*.log" -type f 2>/dev/null | while read logfile; do
            check_errors "$logfile"
        done
        
        echo
    fi
done

# Check disk usage
check_log_disk_usage

echo "=== Report Complete ==="

🔍 Troubleshooting Boot & Performance Issues

The Systematic Troubleshooting Approach

When troubleshooting, follow this methodical approach:

  1. Identify symptoms (What's wrong?)

  2. Check basic health (Is anything obviously broken?)

  3. Review logs (What do the logs say?)

  4. Monitor resources (What's overloaded?)

  5. Test fixes (Does changing X help?)

  6. Document (What was the solution?)

Boot Issues Troubleshooting

Common Boot Problems:

bash
# 1. System won't boot at all
# Check: Power, cables, hardware

# 2. Boots to emergency/rescue mode
# Check: Filesystem errors, missing files

# 3. Boots but services don't start
# Check: Service configuration, dependencies

Boot Troubleshooting Commands:

bash
# View boot messages (if you can get to console)
dmesg | less
dmesg | grep -i "error\|fail"

# View systemd boot process
journalctl -b
journalctl -xb  # With extra details

# Check filesystem (run only on an unmounted filesystem, e.g. from rescue mode or a live USB)
sudo fsck /dev/sda1  # Check and repair filesystem

# Check disk for bad sectors
badblocks -sv /dev/sda

# Check boot loader
sudo update-grub  # Reconfigure GRUB

# Boot to single-user/recovery mode
# At GRUB menu: edit kernel line, add "single" or "init=/bin/bash"

Performance Issues Troubleshooting

Step 1: Check Basic System Health

bash
# Quick health check
uptime                    # Load average
free -h                  # Memory usage
df -h                    # Disk space
ss -tulpn                # Listening ports (use sudo to see process names)
systemctl --failed       # Failed services
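
The disk-space part of this health check is easy to automate. As a sketch, the function below reads `df -P` output (the POSIX format, one line per filesystem) and lists mount points above a given usage percentage. `full_filesystems` is a hypothetical name, and mount points containing spaces are not handled.

```bash
# full_filesystems LIMIT -> reads `df -P` output on stdin and prints every
# mount point whose usage is above LIMIT percent. Hypothetical helper.
full_filesystems() {
    awk -v limit="$1" '
        NR == 1 { next }                 # skip the header row
        {
            use = $5
            sub(/%$/, "", use)           # "95%" -> "95"
            if (use + 0 > limit) print $6, use "%"
        }'
}

# Usage: df -P | full_filesystems 80
```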

Step 2: Identify Resource Bottlenecks

bash
# CPU Bottleneck Signs:
# - Load average > CPU cores
# - High %us/%sy with low %id in top (high %wa points to disk, not CPU)
# - Long running processes

# Memory Bottleneck Signs:
# - High swap usage (si/so in vmstat)
# - Out of memory errors in dmesg
# - Processes killed by OOM killer

# Disk Bottleneck Signs:
# - High %util in iostat (>80%)
# - High await time in iostat
# - Slow file operations

# Network Bottleneck Signs:
# - High packet loss (ping)
# - Slow transfer speeds
# - Connection timeouts

Step 3: Performance Analysis Commands

bash
# Find CPU-intensive processes
ps aux --sort=-%cpu | head -10

# Find memory-intensive processes
ps aux --sort=-%mem | head -10

# Check I/O wait
vmstat 2 5  # Look at wa column

# Check disk latency
iostat -x 1  # Look at r_await, w_await, aqu-sz (svctm was removed from recent sysstat versions)

# Check network connectivity
ping -c 5 google.com
traceroute google.com
mtr google.com  # Combines ping + traceroute

# Check for too many open files
lsof | wc -l
ulimit -n  # Current limit

# Check for zombie processes (state column starts with Z)
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'

Common Performance Problems & Solutions

Problem 1: High CPU Usage

bash
# Identify culprit
top  # Look for process with high %CPU

# If it's a specific process, investigate
strace -p <PID>  # See system calls
perf top         # CPU profiling (install linux-tools)

# If it's many small processes
# Check for fork bombs or misbehaving scripts
pstree           # View process tree

# Solutions:
# 1. Kill/restart problematic process
# 2. Adjust process priority (nice/renice)
# 3. Add more CPU capacity
# 4. Optimize application code

Problem 2: High Memory Usage

bash
# Check memory usage
free -h
cat /proc/meminfo

# Check what's using memory
ps aux --sort=-%mem | head -10
sudo slabtop     # Kernel memory usage

# Check for memory leaks
# Monitor a process over time:
watch -n 1 'ps -p <PID> -o pid,ppid,%mem,%cpu,rss,cmd'  # steadily growing RSS suggests a leak

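# Drop caches (diagnostic only: cached memory is already reclaimable, so this
# rarely fixes memory pressure and slows disk access until the cache refills)
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches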

# Solutions:
# 1. Add more RAM
# 2. Adjust swappiness (vm.swappiness)
# 3. Fix memory leaks in applications
# 4. Add swap space
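
To check a suspected leak with numbers instead of watching the screen, you can collect RSS samples and compare the first with the last. This is only a sketch: `rss_trend` is a hypothetical helper that expects one RSS value in KiB per line, such as the output of `ps -o rss= -p <PID>`.

```bash
# rss_trend -> reads one RSS value (KiB) per line on stdin and prints the
# first value, the last value, and the difference. Hypothetical helper.
rss_trend() {
    awk '
        NF { if (!n++) first = $1; last = $1 }
        END {
            if (n) printf "first=%d last=%d delta=%d samples=%d\n", first, last, last - first, n
        }'
}

# Usage: sample a process once a minute for five minutes
# for i in 1 2 3 4 5; do ps -o rss= -p "$PID"; sleep 60; done | rss_trend
```

A delta that keeps growing across repeated runs, with no matching growth in workload, is a reasonable hint of a leak.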

Problem 3: Slow Disk I/O

bash
# Check disk performance
iostat -x 2
sudo iotop -o

# Check for disk errors
dmesg | grep -i "disk\|sda\|error"
sudo smartctl -a /dev/sda  # SMART data (smartmontools package)

# Check filesystem fragmentation (ext4)
sudo e4defrag -c /  # Check fragmentation level

# Check for too many small files (under 1 KiB)
# Note: -size -1k only matches empty files because find rounds sizes up
find / -xdev -type f -size -1024c 2>/dev/null | wc -l

# Solutions:
# 1. Upgrade to SSD
# 2. Use RAID for better performance
# 3. Optimize database/config settings
# 4. Move frequently accessed data to faster disk

Problem 4: Network Issues

bash
# Basic connectivity check
ping -c 5 8.8.8.8
ping -c 5 google.com

# Check DNS
nslookup google.com
dig google.com

# Check bandwidth usage
sudo iftop
sudo nethogs  # Bandwidth by process

# Check connection limits
ss -s  # Socket statistics
cat /proc/sys/net/ipv4/tcp_max_syn_backlog

# Check for packet loss
mtr google.com

# Solutions:
# 1. Check physical connections
# 2. Adjust network card settings
# 3. Optimize application network usage
# 4. Upgrade network infrastructure
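
For scripted checks, the packet loss figure can be pulled out of ping's summary line. The sketch below assumes the usual Linux (iputils) summary format, for example "5 packets transmitted, 5 received, 0% packet loss". The function name `packet_loss_pct` is made up for this example.

```bash
# packet_loss_pct -> reads ping output on stdin and prints the packet loss
# percentage without the % sign. Hypothetical helper; assumes the iputils
# summary line format.
packet_loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage: ping -c 5 8.8.8.8 | packet_loss_pct
```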

🎯 Real-World Monitoring Scenarios

Scenario 1: Production Server Monitoring Dashboard

bash
#!/bin/bash
# server-dashboard.sh

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Header
header() {
    echo -e "${BLUE}========================================${NC}"
    echo -e "${BLUE}    $1${NC}"
    echo -e "${BLUE}========================================${NC}"
}

# Status indicator
status() {
    if [ $1 -eq 0 ]; then
        echo -e "${GREEN}✓ OK${NC}"
    else
        echo -e "${RED}✗ ISSUE${NC}"
    fi
}

# System Information
header "System Information"
echo "Hostname: $(hostname)"
echo "Uptime: $(uptime -p)"
echo "Load Average: $(uptime | awk -F'load average:' '{print $2}')"
echo "Processes: $(ps -e --no-headers | wc -l)"

# CPU Information
header "CPU Information"
CPU_CORES=$(nproc)
LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | tr -d ',')
echo "Cores: $CPU_CORES"
echo -n "Load: $LOAD - "
if (( $(echo "$LOAD > $CPU_CORES" | bc -l) )); then
    echo -e "${RED}High Load${NC}"
else
    echo -e "${GREEN}Normal${NC}"
fi

# Memory Information
header "Memory Information"
MEM_TOTAL=$(free -h | awk '/^Mem:/ {print $2}')
MEM_USED=$(free -h | awk '/^Mem:/ {print $3}')
MEM_PERCENT=$(free | awk '/^Mem:/ {printf("%.0f"), $3/$2*100}')
echo "Total: $MEM_TOTAL"
echo "Used: $MEM_USED ($MEM_PERCENT%)"
if [ $MEM_PERCENT -gt 90 ]; then
    echo -e "${RED}CRITICAL: High memory usage${NC}"
elif [ $MEM_PERCENT -gt 70 ]; then
    echo -e "${YELLOW}WARNING: High memory usage${NC}"
else
    echo -e "${GREEN}Normal${NC}"
fi

# Disk Information
header "Disk Information"
df -h / | awk 'NR==2 {print "Mount: "$6, "Size: "$2, "Used: "$3, "Avail: "$4, "Use%: "$5}'
DISK_PERCENT=$(df / | awk 'NR==2{print $5}' | sed 's/%//')
if [ $DISK_PERCENT -gt 90 ]; then
    echo -e "${RED}CRITICAL: Disk almost full${NC}"
elif [ $DISK_PERCENT -gt 80 ]; then
    echo -e "${YELLOW}WARNING: Disk usage high${NC}"
fi

# Service Status
header "Service Status"
services=("ssh" "nginx" "mysql" "docker" "redis")
for service in "${services[@]}"; do
    echo -n "$service: "
    if systemctl is-active --quiet "$service" 2>/dev/null; then
        echo -e "${GREEN}Running${NC}"
    else
        echo -e "${RED}Stopped${NC}"
    fi
done

# Network Status
header "Network Status"
IP_ADDR=$(hostname -I | awk '{print $1}')
echo "IP Address: $IP_ADDR"
echo -n "Internet Connectivity: "
if ping -c 1 -W 2 8.8.8.8 >/dev/null 2>&1; then
    echo -e "${GREEN}Connected${NC}"
else
    echo -e "${RED}Disconnected${NC}"
fi

# Recent Errors
header "Recent Errors (Last 10)"
RECENT_ERRORS=$(sudo journalctl --since "10 minutes ago" -p err -q --no-pager 2>/dev/null | tail -10)
echo "${RECENT_ERRORS:-No recent errors}"

echo -e "\n${BLUE}Report generated: $(date)${NC}"
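
The memory and disk checks in this dashboard repeat the same warning/critical comparison. One possible refactor, shown here as a sketch and not as part of the original script, is a small function that takes a value and two thresholds. The name `level` is hypothetical, and like the script it uses strict "greater than" on whole numbers.

```bash
# level VALUE WARN CRIT -> prints OK, WARNING, or CRITICAL (integers only).
# Hypothetical helper mirroring the dashboard's if/elif checks.
level() {
    local value=$1 warn=$2 crit=$3
    if [ "$value" -gt "$crit" ]; then
        echo "CRITICAL"
    elif [ "$value" -gt "$warn" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# Usage: echo "Memory: $(level "$MEM_PERCENT" 70 90)"
```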

Scenario 2: Automated Performance Alert System

bash
#!/bin/bash
# performance-alert.sh

# Configuration
ALERT_EMAIL="admin@example.com"
LOG_FILE="/var/log/performance-alerts.log"

# Thresholds
CPU_THRESHOLD=80
MEM_THRESHOLD=85
DISK_THRESHOLD=90
LOAD_THRESHOLD=$(nproc)

# Log function
log_alert() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Check CPU
check_cpu() {
    # Busy % = 100 - idle, taken from the second vmstat sample
    # (the first sample is an average since boot)
    CPU_USAGE=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')
    
    if (( $(echo "$CPU_USAGE > $CPU_THRESHOLD" | bc -l) )); then
        log_alert "CPU ALERT: Usage at ${CPU_USAGE}%"
        
        # Get top CPU processes
        TOP_PROCESSES=$(ps aux --sort=-%cpu | head -6)
        log_alert "Top processes:"$'\n'"$TOP_PROCESSES"
        
        return 1
    fi
    return 0
}

# Check Memory
check_memory() {
    MEM_USAGE=$(free | awk '/Mem/{printf("%.0f"), $3/$2*100}')
    # Guard against division by zero on systems without swap
    SWAP_USAGE=$(free | awk '/^Swap:/{printf "%.0f", ($2 > 0 ? $3 / $2 * 100 : 0)}')
    
    if [ $MEM_USAGE -gt $MEM_THRESHOLD ]; then
        log_alert "MEMORY ALERT: Usage at ${MEM_USAGE}%"
        
        # Get top memory processes
        TOP_PROCESSES=$(ps aux --sort=-%mem | head -6)
        log_alert "Top memory users:"$'\n'"$TOP_PROCESSES"
        
        return 1
    fi
    
    if [ $SWAP_USAGE -gt 50 ]; then
        log_alert "SWAP ALERT: Swap usage at ${SWAP_USAGE}%"
        return 1
    fi
    return 0
}

# Check Disk
check_disk() {
    DISK_USAGE=$(df / | awk 'NR==2{print $5}' | sed 's/%//')
    
    if [ $DISK_USAGE -gt $DISK_THRESHOLD ]; then
        log_alert "DISK ALERT: Usage at ${DISK_USAGE}%"
        
        # Find large files/directories
        LARGE_FILES=$(sudo find / -xdev -type f -size +100M 2>/dev/null | head -10)
        log_alert "Large files found:"$'\n'"$LARGE_FILES"
        
        return 1
    fi
    return 0
}

# Check Load
check_load() {
    LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | tr -d ',')
    
    if (( $(echo "$LOAD > $LOAD_THRESHOLD" | bc -l) )); then
        log_alert "LOAD ALERT: Load average at $LOAD"
        return 1
    fi
    return 0
}

# Main monitoring
log_alert "=== Starting performance check ==="

ALERT_COUNT=0

check_cpu || ((ALERT_COUNT++))
check_memory || ((ALERT_COUNT++))
check_disk || ((ALERT_COUNT++))
check_load || ((ALERT_COUNT++))

if [ $ALERT_COUNT -gt 0 ]; then
    # Send email alert
    SUBJECT="Performance Alert: $ALERT_COUNT issues detected"
    BODY=$(tail -20 "$LOG_FILE")
    
    echo "$BODY" | mail -s "$SUBJECT" "$ALERT_EMAIL"
    
    log_alert "Alert email sent to $ALERT_EMAIL"
fi

log_alert "=== Performance check complete ==="

Scenario 3: Boot Performance Analysis

bash
#!/bin/bash
# boot-analyzer.sh

# Analyze boot performance

echo "=== Boot Performance Analysis ==="
echo

# 1. Check boot time
echo "1. Boot Time Analysis:"
systemd-analyze
echo

# 2. Check what's slowing boot
echo "2. Slowest Boot Units:"
systemd-analyze blame | head -10
echo

# 3. Check boot chain
echo "3. Boot Chain Graph:"
systemd-analyze critical-chain
echo

# 4. Check kernel messages
echo "4. Kernel Boot Messages (errors/warnings):"
dmesg | grep -iE "error|fail|warning" | head -10
echo

# 5. Check failed services
echo "5. Failed Services:"
systemctl --failed
echo

# 6. Check startup time of key services
echo "6. Key Services Startup Time:"
for service in ssh nginx mysql docker; do
    # critical-chain prints no "reached after" line, so read the start time from blame
    TIME=$(systemd-analyze blame 2>/dev/null | awk -v u="$service.service" '$NF == u { $NF = ""; print }')
    if [ -n "$TIME" ]; then
        echo "  $service: $TIME"
    fi
done
echo

# 7. Check for unnecessary services
echo "7. Enabled Services (consider disabling if not needed):"
systemctl list-unit-files --type=service --state=enabled | grep -v "@" | head -20
echo

echo "=== Analysis Complete ==="
echo "Recommendations:"
echo "1. Disable unnecessary services"
echo "2. Check slow services from 'systemd-analyze blame'"
echo "3. Consider using 'systemd-analyze plot > boot.svg' for detailed graph"
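
To act on the "check slow services" recommendation, it can help to filter the blame list by a time limit. The sketch below converts the times printed by `systemd-analyze blame` (such as "1min 2.500s", "5.123s", or "456ms") into seconds. `slow_units` is a hypothetical name, and units of an hour or more are not handled.

```bash
# slow_units LIMIT -> reads `systemd-analyze blame` output on stdin and prints
# "SECONDS UNIT" for every unit that took at least LIMIT seconds to start.
# Hypothetical helper; ignores hour-scale values.
slow_units() {
    awk -v limit="$1" '
        {
            secs = 0
            # Every field except the last is part of the time
            for (i = 1; i < NF; i++) {
                f = $i
                if      (f ~ /min$/) secs += (f + 0) * 60
                else if (f ~ /ms$/)  secs += (f + 0) / 1000
                else if (f ~ /s$/)   secs += (f + 0)
            }
            if (secs >= limit) printf "%.3f %s\n", secs, $NF
        }'
}

# Usage: systemd-analyze blame | slow_units 5
```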

Scenario 4: Continuous Monitoring with sar

bash
#!/bin/bash
# sar-monitor.sh

# Generate daily performance report using sar

REPORT_DIR="/var/log/sar-reports"
REPORT_FILE="$REPORT_DIR/daily-report-$(date +%Y%m%d).txt"

# Create report directory
mkdir -p "$REPORT_DIR"

echo "=== Daily Performance Report ===" > "$REPORT_FILE"
echo "Generated: $(date)" >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# CPU Usage Summary
echo "CPU Usage (24-hour average):" >> "$REPORT_FILE"
sar | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Memory Usage Summary
echo "Memory Usage:" >> "$REPORT_FILE"
sar -r | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Swap Usage Summary
echo "Swap Usage:" >> "$REPORT_FILE"
sar -S | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Disk I/O Summary
echo "Disk I/O (/dev/sda):" >> "$REPORT_FILE"
sar -dp | grep "sda" | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"  # -p prints device names such as sda
echo >> "$REPORT_FILE"

# Network Summary
echo "Network (eth0):" >> "$REPORT_FILE"
sar -n DEV | grep "eth0" | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Load Average
echo "Load Average:" >> "$REPORT_FILE"
sar -q | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Peak Usage Times (lowest %idle first; LC_ALL=C forces 24-hour timestamps so the columns line up)
echo "Peak CPU Usage Times:" >> "$REPORT_FILE"
LC_ALL=C sar | awk '$2 == "all" && $1 != "Average:"' | sort -k8 -n | head -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

echo "Report saved to: $REPORT_FILE"

📋 Quick Reference Cheat Sheet

Tool | Command | Purpose
CPU Monitoring | top, htop, mpstat | Real-time CPU usage
Memory Monitoring | free -h, vmstat, sar -r | Memory usage
Disk Monitoring | iostat -x, iotop, df -h | Disk I/O and space
Network Monitoring | iftop, nethogs, sar -n DEV | Network bandwidth
Process Monitoring | ps aux, pstree, lsof | Process information
Load Average | uptime, sar -q | System load
Boot Analysis | systemd-analyze, dmesg | Boot performance
Log Monitoring | journalctl, tail -f /var/log/syslog | System logs
Continuous Monitoring | sar, dstat | Historical data
I/O Wait | vmstat 1, iostat 1 | Disk bottleneck detection
Swap Monitoring | swapon --show, free -h | Swap usage
Service Monitoring | systemctl status, systemctl --failed | Service health

🚀 Practice Exercises

Exercise 1: Create a Resource Monitor

bash
#!/bin/bash
# resource-monitor.sh

watch -n 2 '
echo "=== System Resource Monitor ==="
echo "Time: $(date)"
echo
echo "CPU Load:"
uptime
echo
echo "Memory Usage:"
free -h | grep -E "^Mem:"
echo
echo "Disk Usage:"
df -h / | tail -1
echo
echo "Top 5 Processes by CPU:"
ps aux --sort=-%cpu | head -6 | awk "{print \$1, \$2, \$3, \$11}"
echo
echo "Top 5 Processes by Memory:"
ps aux --sort=-%mem | head -6 | awk "{print \$1, \$2, \$4, \$11}"
'

Exercise 2: Analyze Boot Performance

bash
# 1. Check current boot time
systemd-analyze

# 2. See what's taking time
systemd-analyze blame | head -10

# 3. Generate boot chart
systemd-analyze plot > boot.svg
# Transfer boot.svg to view graphically

# 4. Check for failed services
systemctl --failed

# 5. Disable a non-essential service (example: bluetooth)
sudo systemctl disable bluetooth.service
sudo systemctl mask bluetooth.service

# 6. Re-check boot time after reboot

Exercise 3: Monitor Disk I/O in Real-time

bash
# Terminal 1: Monitor overall disk I/O
watch -n 1 'iostat -x 1 2 | tail -10'

# Terminal 2: Monitor per-process I/O
sudo iotop -o

# Terminal 3: Generate disk load
dd if=/dev/zero of=/tmp/testfile bs=1M count=1000

# Observe what happens in Terminals 1 and 2
# Then clean up: rm /tmp/testfile

Exercise 4: Network Performance Test

bash
# 1. Check current bandwidth
sudo iftop -i eth0

# 2. Test network speed
# On server: iperf3 -s
# On client: iperf3 -c server-ip

# 3. Check for packet loss
ping -c 100 google.com | grep "packet loss"

# 4. Trace route with statistics
mtr google.com

# 5. Check active connections
ss -Htun | wc -l   # -H drops the header line from the count
ss -tun | grep ESTAB | wc -l

🔗 Master System Monitoring with Hands-on Labs

System monitoring is what separates junior from senior DevOps engineers. The ability to quickly identify and resolve performance issues is invaluable.

👉 Practice system monitoring, performance analysis, and troubleshooting in our interactive labs at:
https://devops.trainwithsky.com/

Our platform provides:

  • Real server environments to monitor

  • Performance bottleneck scenarios

  • Troubleshooting challenges

  • Production-like monitoring setups

  • Guided exercises with expert feedback


Frequently Asked Questions

Q: What's a "good" load average?
A: Load average should be less than the number of CPU cores. Example: on a 4-core CPU, a load below 4 is good.

Q: How often should I check system metrics?
A: In real time during issues, every 5-15 minutes for routine checks, and review historical data daily or weekly.

Q: What's more important: CPU or memory?
A: It depends! Database servers need memory, compute servers need CPU. Monitor both.

Q: Should I be worried about high buff/cache memory?
A: No! Linux uses free memory for disk caching. This is good - it speeds up disk access.

Q: How do I find what caused a server crash yesterday?
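A: Check /var/log/messages (or /var/log/syslog on Debian/Ubuntu), journalctl --since yesterday, and the kernel messages from the previous boot with journalctl -k -b -1 if the journal is persistent (dmesg only covers the current boot).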

Q: What's the difference between %wa and %id in CPU stats?
A: %wa = waiting for I/O (disk bottleneck), %id = idle (CPU has nothing to do).

Q: How can I monitor a script's resource usage?
A: Use /usr/bin/time -v ./script.sh, or run the script and watch it with top -p $(pgrep -d, -f scriptname).


Having monitoring or performance issues? Share your metrics and symptoms in the comments below! 💬
