System Monitoring & Performance Guide
Published: December 2025 | Topic: System Monitoring & Performance for DevOps
Monitoring is the eyes and ears of your infrastructure. Effective monitoring helps you detect issues before they become outages, optimize resource utilization, and ensure system reliability. This guide covers essential monitoring tools and techniques for Linux systems.
1. System Resource Monitoring
Key Metrics to Monitor
CPU Monitoring
top / htop
# Interactive process viewer
$ top
$ htop # Enhanced version (install if needed)
# Batch mode - non-interactive
$ top -bn1 | head -20
# Key metrics in top:
# load average: 1min, 5min, 15min
# %Cpu(s): us(user), sy(system), ni(nice), id(idle)
# Memory: total, used, free, buffers
vmstat - Virtual Memory Stats
# Display every 2 seconds, 5 times
$ vmstat 2 5
# Output columns:
# r: Runnable processes (running or waiting for CPU)
# b: Processes blocked in uninterruptible sleep
# swpd: Swap used (KB)
# free: Free memory (KB)
# si: Swap in from disk (KB/s)
# so: Swap out to disk (KB/s)
# us: User CPU time (%)
# sy: System CPU time (%)
# id: Idle CPU time (%)
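The si/so columns above are the quickest swap-pressure signal, and averaging them over a sampling window can be sketched as follows. The function name `swap_summary` is hypothetical, and the sketch assumes the default Linux vmstat layout (si in column 7, so in column 8). It skips the first data row because that row reports averages since boot rather than the current interval.

```shell
# Average the si/so columns of `vmstat` output read on stdin.
# Assumption: default Linux vmstat layout (si = column 7, so = column 8).
# Rows 1-2 are headers; row 3 holds since-boot averages, so it is skipped.
swap_summary() {
    awk 'NR > 3 { si += $7; so += $8; n++ }
         END { if (n) printf "avg_si=%.1f avg_so=%.1f samples=%d\n", si / n, so / n, n }'
}
```

Typical use would be `vmstat 2 5 | swap_summary`; sustained non-zero averages suggest the system is actively swapping.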
mpstat - CPU Statistics
# Monitor all CPUs, every 2s, 3 times
$ mpstat -P ALL 2 3
# Key metrics:
# %usr: User CPU percentage
# %nice: Nice user CPU percentage
# %sys: System CPU percentage
# %iowait: I/O wait percentage
# %irq: Interrupt handling
# %soft: Software interrupts
# %idle: Idle percentage
# Check CPU frequency scaling:
$ cat /proc/cpuinfo | grep MHz
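Per-CPU output is most useful for spotting a single saturated core. As a rough sketch, the hypothetical helper `hottest_cpu` below reads mpstat output on stdin and reports the core with the lowest average %idle. It assumes the `Average:` summary rows are present and that %idle is the last column, which holds for common sysstat versions but is not guaranteed everywhere.

```shell
# Report the CPU with the lowest average %idle from `mpstat -P ALL` output.
# Assumptions: summary rows start with "Average:", column 2 is the CPU id,
# and the last column is %idle.
hottest_cpu() {
    awk '/^Average:/ && $2 ~ /^[0-9]+$/ {
             idle = $NF + 0
             if (!seen || idle < min) { min = idle; cpu = $2; seen = 1 }
         }
         END { if (seen) printf "cpu=%s busy=%.2f\n", cpu, 100 - min }'
}
```

For example: `mpstat -P ALL 2 3 | hottest_cpu`.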
Memory Monitoring
free - Memory Usage
$ free -m # Megabytes
# Watch memory continuously:
$ watch -n 2 free -m
# Understanding output:
# total: Total installed memory
# used: Used memory (excludes buffers/cache on current procps-ng versions)
# free: Unused memory
# shared: Memory used by tmpfs
# buffers: Memory used by kernel buffers
# cache: Page cache and slabs
# available: Estimation of memory available
/proc/meminfo - Detailed Info
$ cat /proc/meminfo
# Key fields:
# MemTotal: Total usable RAM
# MemFree: Completely unused RAM
# MemAvailable: Estimated available memory
# Buffers: Temporary storage for raw disk blocks
# Cached: In-memory cache for files read from disk
# SwapCached: Swapped-out memory that has been read back in but is still held in swap
# Active: Memory recently used
# Inactive: Memory not recently used
# SwapTotal: Total swap space
# SwapFree: Unused swap space
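Because MemFree ignores reclaimable cache, a usage percentage based on MemAvailable is usually closer to real memory pressure. A minimal sketch, with the hypothetical function name `mem_used_percent` and an optional file argument added only so it can be tried on saved data:

```shell
# Print memory in use as a percentage, computed from MemTotal and MemAvailable.
# $1 is optional and defaults to /proc/meminfo.
mem_used_percent() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", (total - avail) * 100 / total }' \
        "${1:-/proc/meminfo}"
}
```

Calling `mem_used_percent` with no argument reads the live values. MemAvailable is itself a kernel estimate, so treat the result as an approximation.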
Disk I/O Monitoring
iostat - Disk Statistics
# Extended stats, every 2s, 5 times
$ iostat -x 2 5
# Key metrics:
# %util: Percentage of elapsed time the device was busy with I/O requests
# r/s: Reads per second
# w/s: Writes per second
# rkB/s: KB read per second
# wkB/s: KB written per second
# await: Average wait time for I/O (ms)
# svctm: Average service time for I/O (ms); deprecated and absent from recent sysstat releases
# Disk space usage:
$ df -h # Human readable
$ df -i # Inode usage
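To turn the disk space check into something scriptable, the sketch below filters `df -P` output for filesystems at or above a usage threshold. The name `disks_over` is hypothetical, and the parsing assumes mount points without spaces.

```shell
# List mount points whose usage is at or above a threshold (percent).
# Reads `df -P` output on stdin; $1 is the threshold.
# Assumption: mount points contain no whitespace.
disks_over() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}
```

For example, `df -P | disks_over 90` prints one line per filesystem that is at least 90% full, and `df -Pi | disks_over 90` does the same for inode usage.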
iotop - I/O by Process
# Interactive view of disk I/O by process
$ iotop -o # Only show processes doing I/O
$ iotop -b -n 3 # Batch mode, 3 iterations
# Key columns:
# TID: Thread ID
# PRIO: I/O priority
# USER: Process owner
# DISK READ: Read rate
# DISK WRITE: Write rate
# SWAPIN: Percentage of time spent swapping in
# IO: Percentage of time spent waiting on I/O
Network Monitoring
iftop - Bandwidth Usage
$ iftop -n # Don't resolve hostnames
$ iftop -N # Don't resolve port numbers
# Text (non-interactive) mode: print one report after 10 s, at most 100 lines:
$ iftop -t -s 10 -L 100
# Key information:
# Shows bandwidth usage per connection
# Real-time traffic display
# Sorted by bandwidth usage
# Source and destination IPs
nload & ifstat
# Visual network monitor
$ nload eth0
# Interface statistics
$ ifstat -i eth0 -t 2 10
# Basic network stats:
$ netstat -i # Interface table
$ ss -s # Socket statistics
$ cat /proc/net/dev # Network device info
# Check connections:
$ ss -tunap # All TCP/UDP connections
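A raw connection listing becomes easier to read once it is grouped by state. The hypothetical helper `tcp_state_counts` below is one way to do that. It assumes the TCP-only layout of `ss -tan`, where State is the first column (with `-u` added, ss prepends a Netid column and the field number would shift).

```shell
# Count TCP connections per state from `ss -tan` output read on stdin.
# Assumption: State is column 1, as in TCP-only ss output.
tcp_state_counts() {
    awk 'NR > 1 { count[$1]++ }
         END { for (state in count) print count[state], state }' |
        LC_ALL=C sort -k1,1nr -k2,2
}
```

For example: `ss -tan | tcp_state_counts`. An unusually large TIME-WAIT or SYN-RECV count is often the first visible sign of a connection-handling problem.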
2. Performance Monitoring Tools
sar - System Activity Reporter
$ sudo apt install sysstat # Debian/Ubuntu
$ sudo yum install sysstat # RHEL/CentOS
# Enable data collection (edit /etc/default/sysstat)
ENABLED="true"
# Common sar commands:
$ sar -u 2 5 # CPU every 2s, 5 times
$ sar -r 1 3 # Memory every 1s, 3 times
$ sar -b 2 4 # I/O and transfer rate
$ sar -n DEV 2 3 # Network devices
$ sar -q 2 5 # Queue length and load
# View historical data:
$ sar -u -f /var/log/sa/sa10 # CPU for the 10th (RHEL path; Debian/Ubuntu use /var/log/sysstat/)
$ sar -A # All statistics
# Generate reports:
$ sar -u -s 10:00:00 -e 13:00:00 # Specific time range
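For scripting, the `Average:` row of a sar CPU report is usually the only line needed. A small sketch follows, with the hypothetical name `sar_avg_busy`. It assumes %idle is the last column and that the locale uses a dot as the decimal separator.

```shell
# Print average CPU busy percentage (100 - %idle) from `sar -u` output on stdin.
# Assumptions: the summary row starts with "Average:" and %idle is the last column.
sar_avg_busy() {
    awk '/^Average:/ && $2 == "all" { printf "%.2f\n", 100 - $NF }'
}
```

For example: `sar -u 2 5 | sar_avg_busy`.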
dstat - Versatile Resource Statistics
# CPU, disk, network, paging, system stats
$ dstat
# Update every 1s, 10 times
$ dstat 1 10
# Custom combinations:
$ dstat -c -d -n -m -y -t 2 5
# cpu, disk, net, memory, system, timestamp
# Advanced usage:
$ dstat --top-cpu # Top CPU consuming processes
$ dstat --top-mem # Top memory consumers
$ dstat --top-io # Top I/O processes
$ dstat --output /tmp/dstat.csv # Export to CSV
# Color-coded output:
$ dstat -cdngy --color
Performance Analysis Script
Complete Performance Monitoring Script
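#!/bin/bash
# performance_monitor.sh - Comprehensive system monitoring
LOG_FILE="/var/log/performance_$(date +%Y%m%d).log"
ALERT_THRESHOLD_CPU=80
ALERT_THRESHOLD_MEM=85
ALERT_THRESHOLD_DISK=90
# Placeholder alert hook (writes to syslog); swap in mail, chat webhook, etc.
send_alert() {
logger -t performance_monitor "ALERT: $1 - $2"
}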
collect_metrics() {
local timestamp="$(date '+%Y-%m-%d %H:%M:%S')"
# CPU Metrics
local cpu_usage="$(top -bn1 | grep "Cpu(s)" | awk '{print $2 + $4}')"
local load_avg="$(uptime | awk -F'load average:' '{print $2}' | tr -d ',')"
# Memory Metrics
local mem_info="$(free -m)"
local mem_total="$(echo "$mem_info" | grep Mem | awk '{print $2}')"
local mem_used="$(echo "$mem_info" | grep Mem | awk '{print $3}')"
local mem_percent="$(echo "scale=2; $mem_used * 100 / $mem_total" | bc)"
# Disk Metrics
local disk_usage="$(df / | tail -1 | awk '{print $5}' | sed 's/%//')"
# Output results
cat >> "$LOG_FILE" << EOF
=== $timestamp ===
CPU Usage: ${cpu_usage}%
Load Average: $load_avg
Memory: ${mem_used}MB/${mem_total}MB (${mem_percent}%)
Disk Usage: ${disk_usage}%
EOF
# Check thresholds
if (( $(echo "$cpu_usage > $ALERT_THRESHOLD_CPU" | bc -l) )); then
send_alert "High CPU" "CPU usage: ${cpu_usage}%"
fi
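if (( $(echo "$mem_percent > $ALERT_THRESHOLD_MEM" | bc -l) )); then
send_alert "High Memory" "Memory usage: ${mem_percent}%"
fi
if (( disk_usage > ALERT_THRESHOLD_DISK )); then
send_alert "High Disk" "Disk usage on /: ${disk_usage}%"
fi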
}
monitor_processes() {
echo "=== Top Processes ===" >> "$LOG_FILE"
ps aux --sort=-%cpu | head -6 >> "$LOG_FILE"
echo "" >> "$LOG_FILE"
}
main() {
echo "Starting performance monitoring..."
while true; do
collect_metrics
monitor_processes
sleep 60 # Collect every minute
done
}
main "$@"
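The script above derives CPU usage by parsing `top`, which depends on the output layout of the installed version. A possible alternative, sketched here under stated assumptions, is to diff two samples of the aggregate `cpu` line from /proc/stat. The function name `cpu_busy_percent` is hypothetical, and the sketch treats iowait as idle time, which is a common convention but a choice rather than a rule.

```shell
# Compute CPU busy percentage between two samples of the first line of /proc/stat.
# $1 = earlier "cpu ..." line, $2 = later "cpu ..." line.
# Fields after the label: user nice system idle iowait irq softirq steal (guest
# fields are skipped because the kernel already counts them inside user/nice).
cpu_busy_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total[NR] = 0
            for (i = 2; i <= 9 && i <= NF; i++) total[NR] += $i
            idle[NR] = $5 + $6          # idle + iowait
        }
        END {
            dt = total[2] - total[1]
            di = idle[2] - idle[1]
            if (dt > 0) printf "%d\n", (dt - di) * 100 / dt
            else print 0
        }'
}
```

A typical call would be `a="$(head -n 1 /proc/stat)"; sleep 1; b="$(head -n 1 /proc/stat)"; cpu_busy_percent "$a" "$b"`. The result is an integer, so it can be compared against a threshold without `bc`.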
3. Log Monitoring & Analysis
System Log Locations
/var/log/syslog # General system messages (Debian/Ubuntu)
/var/log/messages # General system messages (RHEL/CentOS)
/var/log/auth.log # Authentication logs
/var/log/kern.log # Kernel messages
/var/log/dmesg # Kernel ring buffer
/var/log/secure # Security-related messages (RHEL)
/var/log/boot.log # System boot messages
/var/log/cron # Cron job logs
/var/log/maillog # Mail server logs
/var/log/httpd/ # Apache logs
/var/log/nginx/ # Nginx logs
/var/log/mysql/ # MySQL logs
# Check log rotation configuration:
$ ls -la /etc/logrotate.d/
journalctl - Systemd Journal
Basic journalctl Usage
$ journalctl
# Follow new entries:
$ journalctl -f
# Show logs since boot:
$ journalctl -b
# Show logs for specific unit:
$ journalctl -u nginx.service
$ journalctl -u docker.service
# Show kernel messages:
$ journalctl -k
Advanced journalctl Queries
# Filter by priority:
$ journalctl -p err # Errors and more severe
$ journalctl -p warning # Warnings and more severe
# Filter by time:
$ journalctl --since "2025-12-01 09:00:00"
$ journalctl --since "1 hour ago"
$ journalctl --since yesterday
# Show specific fields:
$ journalctl -o json # JSON output
$ journalctl -o json-pretty
$ journalctl -o verbose # All available fields
# Disk usage:
$ journalctl --disk-usage
Log Analysis Commands
grep - Pattern Searching
$ grep "error" /var/log/syslog
$ grep -i "failed" /var/log/auth.log
# Multiple patterns:
$ grep -E "error|fail|critical" /var/log/syslog
# Show context:
$ grep -B2 -A2 "out of memory" /var/log/kern.log
# Show 2 lines before and after match
# Count occurrences:
$ grep -c "Connection refused" /var/log/syslog
# Search in multiple files:
$ grep -r "authentication failure" /var/log/
awk & sed - Advanced Processing
$ awk '{print $1, $5}' /var/log/nginx/access.log
# Count unique IPs in access log:
$ awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr
# Find top URLs accessed:
$ awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20
# Replace text with sed:
$ sed 's/old/new/g' file.log
# Extract time range:
# (traditional syslog timestamps pad single-digit days with a space, e.g. "Dec  1")
$ awk '/^Dec  1 09:/, /^Dec  1 10:/' /var/log/syslog
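The same awk field-splitting approach extends to simple rates. As an illustration, the hypothetical helper `http_5xx_rate` below computes the share of 5xx responses in an access log. It assumes the default nginx "combined" log format, where the status code is field 9; a custom `log_format` would need a different field number.

```shell
# Print the percentage of requests that returned a 5xx status.
# Reads an access log on stdin.
# Assumption: default "combined" log format (status code in field 9).
http_5xx_rate() {
    awk '$9 ~ /^5[0-9][0-9]$/ { errors++ }
         END { if (NR) printf "%.1f\n", errors * 100 / NR; else print "0.0" }'
}
```

For example: `http_5xx_rate < /var/log/nginx/access.log`.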
Log Monitoring Script
Automated Log Analysis Script
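#!/bin/bash
# log_monitor.sh - Real-time log monitoring
# Placeholder notification hook (writes to syslog); swap in mail, chat webhook, etc.
send_notification() {
logger -t log_monitor "$1: $2"
}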
LOG_FILES=(
"/var/log/syslog"
"/var/log/auth.log"
"/var/log/nginx/error.log"
)
PATTERNS=(
"error"
"failed"
"critical"
"authentication failure"
"Connection refused"
)
monitor_logs() {
local logfile="$1"
# -n 0 skips lines already in the file so only new entries trigger alerts
tail -n 0 -F "$logfile" | while IFS= read -r line; do
for pattern in "${PATTERNS[@]}"; do
if [[ "$line" =~ $pattern ]]; then
local timestamp="$(date '+%Y-%m-%d %H:%M:%S')"
echo "[$timestamp] [ALERT] $logfile: $line"
send_notification "Log Alert" "Pattern '$pattern' found in $logfile"
fi
done
done
}
analyze_logs() {
local logfile="$1"
local days="${2:-1}"
echo "=== Analysis for $logfile (last $days days) ==="
# Count errors
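# grep -c already prints 0 when nothing matches (and exits non-zero),
# so "|| echo 0" would produce a second 0; default only when the file is unreadable
local error_count
error_count="$(grep -c -i "error" "$logfile" 2>/dev/null)"
echo "Total errors: ${error_count:-0}"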
# Most common error messages
echo "Top error messages:"
grep -i "error" "$logfile" 2>/dev/null | \
awk -F: '{print $NF}' | sort | uniq -c | sort -nr | head -5
}
main() {
echo "Starting log monitoring..."
# Start monitoring in background
for logfile in "${LOG_FILES[@]}"; do
if [[ -f "$logfile" ]]; then
monitor_logs "$logfile" &
fi
done
# Daily analysis
while true; do
for logfile in "${LOG_FILES[@]}"; do
analyze_logs "$logfile" 1
done
sleep 86400 # Run daily
done
}
main "$@"
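The monitoring loop above tests every line against each pattern in turn. For one-off counts, a single case-insensitive `grep -E` over a joined pattern may be simpler. The sketch below shows that idea; `join_patterns` and `count_matches` are hypothetical names, and the patterns are assumed to be valid extended regular expressions that contain no `|` of their own.

```shell
# Join all arguments into one alternation: a b c -> a|b|c
join_patterns() {
    local IFS='|'
    printf '%s\n' "$*"
}

# Count lines in a file that match any of the given patterns (case-insensitive).
# $1 = file, remaining arguments = patterns.
count_matches() {
    local file="$1"
    shift
    grep -E -i -c -- "$(join_patterns "$@")" "$file"
}
```

With the script's array this would be called as `count_matches /var/log/syslog "${PATTERNS[@]}"`. A line that matches several patterns is counted once here, whereas the loop above raises one alert per matching pattern.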
4. Troubleshooting Boot & Performance Issues
Boot Process Troubleshooting
Boot Logs & Systemd
$ journalctl -b # Current boot
$ journalctl -b -1 # Previous boot
$ journalctl --list-boots # List all boots
# Kernel messages during boot:
$ dmesg | head -50
$ dmesg | grep -i "error\|fail"
# Systemd boot analysis:
$ systemd-analyze time # Boot time breakdown
$ systemd-analyze blame # Services by boot time
$ systemd-analyze critical-chain # Critical path
$ systemd-analyze plot > boot.svg # Create boot chart
Common Boot Issues
# Check and repair a filesystem (run only on an unmounted filesystem):
$ fsck -y /dev/sda1
# Check disk for bad blocks:
$ badblocks -v /dev/sda
# Rescue mode access:
# 1. Reboot and edit kernel parameters
# 2. Add "single" or "init=/bin/bash"
# 3. Mount filesystems read-write:
$ mount -o remount,rw /
# Fix GRUB bootloader:
$ grub-install /dev/sda
$ update-grub # Debian/Ubuntu
Performance Issue Diagnosis
Slow System Symptoms
# Check load average:
$ uptime
# If load stays above the number of CPU cores (see nproc), the system is overloaded
# Check memory pressure:
$ vmstat 1 5
# High si/so indicates swapping
# Check I/O wait:
$ iostat -x 1 3
# High %util or await indicates disk bottleneck
# Check process states:
$ ps aux | awk '$8 ~ /D/ {print}'
# D state = uninterruptible sleep (usually I/O)
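The load-versus-cores rule from the first check can be expressed as a ratio, which makes it comparable across machines of different sizes. A minimal sketch, with the hypothetical name `load_ratio` and the threshold of 1.0 taken from the rule above:

```shell
# Print load per core and a verdict.
# $1 = load average, $2 = number of CPU cores.
load_ratio() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) exit 1
        ratio = load / cores
        printf "%.2f %s\n", ratio, (ratio > 1 ? "overloaded" : "ok")
    }'
}
```

On a live system: `load_ratio "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"`. Load on Linux also counts tasks in uninterruptible sleep, so a high ratio can point to I/O stalls as well as CPU saturation.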
Quick Diagnostics
$ echo "Load: $(uptime | awk -F'load average:' '{print $2}')"; \
echo "Memory: $(free -h | grep Mem | awk '{print $3"/"$2}')"; \
echo "Disk: $(df -h / | tail -1 | awk '{print $5}')"; \
echo "Top CPU: $(ps aux --sort=-%cpu | head -2 | tail -1 | awk '{print $11}')"
# Check for zombie processes:
$ ps aux | awk '$8 ~ /Z/ {print}'
# Nudge the parents of zombies (a zombie is reaped only when its parent calls wait() or exits):
$ ps -A -o stat,ppid | awk '$1 ~ /^Z/ {print $2}' | sort -u | xargs -r kill -HUP
# Clear page cache (careful! requires root):
$ sync; echo 1 | sudo tee /proc/sys/vm/drop_caches
Troubleshooting Checklist
System Performance Troubleshooting Flow
- Identify Symptoms: High load, slow response, errors
- Check Load Average: uptime, check if > CPU cores
- Monitor CPU: top/htop, check us/sy/idle percentages
- Check Memory: free -m for usage, vmstat for swapping (si/so)
- Monitor Disk I/O: iostat, check await and %util
- Check Network: iftop/nload, look for bottlenecks
- Identify Top Processes: ps aux --sort=-%cpu or -%mem
- Check Logs: journalctl, /var/log/ for errors
- Verify Services: systemctl status for failed services
- Review Recent Changes: Package updates, config changes
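The first steps of this flow can be collected into one triage pass. The sketch below is an assumption about how that might look, not a fixed procedure: `run_if_present` and `triage` are hypothetical names, and each tool is skipped with a note if it is not installed, so the script still completes on minimal systems.

```shell
# Run a command only if its executable exists; otherwise note that it was skipped.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $* =="
        "$@"
    else
        echo "skipped: $1 (not installed)"
    fi
}

# Walk the checklist in order: load, memory, swap activity, disk I/O,
# failed services, recent errors.
triage() {
    run_if_present uptime
    run_if_present free -m
    run_if_present vmstat 1 5
    run_if_present iostat -x 1 3
    run_if_present systemctl --failed
    run_if_present journalctl -p err -b -n 20 --no-pager
}
```

Running `triage > /tmp/triage_$(date +%Y%m%d_%H%M%S).txt` keeps a snapshot that can be compared with a baseline taken when the system was healthy.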
Emergency Recovery Commands
# Magic SysRq keys (if enabled):
# Alt+SysRq+r - Take keyboard out of raw mode (back to XLATE)
# Alt+SysRq+e - Send SIGTERM to all processes
# Alt+SysRq+i - Send SIGKILL to all processes
# Alt+SysRq+s - Sync filesystems
# Alt+SysRq+u - Remount filesystems read-only
# Alt+SysRq+b - Reboot system
# Force kill process eating CPU:
$ kill -9 $(ps aux --sort=-%cpu | head -2 | tail -1 | awk '{print $2}')
# Emergency disk space cleanup:
$ du -sh /* 2>/dev/null | sort -hr | head -10
$ find /var/log -name "*.log" -size +100M -exec truncate -s 0 {} \;
$ journalctl --vacuum-size=100M # Reduce journal size
Monitoring Command Quick Reference
CPU Monitoring
$ mpstat -P ALL 1 3
$ sar -u 2 5
$ uptime # Load average
Memory Monitoring
$ vmstat 1 5
$ sar -r 1 3
$ cat /proc/meminfo
Disk Monitoring
$ iostat -xz 1 3
$ iotop
$ du -sh /* 2>/dev/null | sort -hr
Network Monitoring
$ nload eth0
$ ss -tunap
$ netstat -i
Log Monitoring
$ tail -f /var/log/syslog
$ grep -i "error" /var/log/syslog
$ dmesg | tail -20
Process Monitoring
$ ps aux --sort=-%mem
$ htop
$ pstree -p
Best Practices Summary
- Monitor proactively: Don't wait for outages to reveal problems; set up comprehensive monitoring
- Establish baselines: Know normal behavior to identify anomalies
- Use the right tool: Different tools for different monitoring needs
- Set meaningful thresholds: Avoid alert fatigue with proper thresholds
- Centralize logs: Use centralized logging for distributed systems
- Automate responses: Script common recovery actions
- Document procedures: Create runbooks for common issues
- Regular reviews: Periodically review monitoring and alerting setups
- Test recovery: Regularly test backup and recovery procedures
- Stay updated: Keep monitoring tools and scripts current
Effective system monitoring and performance management requires understanding both the tools available and the metrics that matter most for your specific environment. Regular monitoring, combined with proper alerting and automated responses, can prevent most issues from becoming outages.