System Monitoring & Performance: The DevOps Monitoring Mastery Guide
Learn to monitor, analyze, and optimize Linux system performance like a seasoned system administrator.
📅 Published: Feb 2026
⏱️ Estimated Reading Time: 18 minutes
🏷️ Tags: System Monitoring, Performance, Troubleshooting, Linux Administration, DevOps
📊 Monitoring CPU, Memory, Disk, Network Usage
Understanding System Resources
Think of your Linux server like a busy restaurant kitchen:
CPU = The chefs (processors cooking tasks)
Memory (RAM) = Counter space (working area for chefs)
Disk (Storage) = Refrigerator and pantry (long-term storage)
Network = Delivery trucks (moving data in and out)
When any of these gets overloaded, your whole "restaurant" slows down. Monitoring helps you see bottlenecks before customers (users) complain!
Real-time vs Historical Monitoring
Real-time monitoring = Watching live camera feeds in the kitchen
Historical monitoring = Reviewing yesterday's kitchen logs
You need both:
Real-time to fix problems happening NOW
Historical to spot trends and prevent future issues
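As a rough illustration of the historical side, the sketch below appends one timestamped load-average sample to a CSV file each time it is called. The function name `record_load_sample`, the CSV layout, and the optional second parameter are assumptions made for this example, not part of any standard tool.

```shell
# record_load_sample CSV_FILE [LOADAVG_FILE]
# Appends "timestamp,1min,5min,15min" to CSV_FILE.
# LOADAVG_FILE defaults to /proc/loadavg; the parameter exists only so
# the function can be tried against a fixture file.
record_load_sample() {
    csv_file=$1
    loadavg_file=${2:-/proc/loadavg}
    # The file starts with the 1-, 5- and 15-minute load averages.
    read -r one five fifteen _rest < "$loadavg_file" || return 1
    printf '%s,%s,%s,%s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" \
        "$one" "$five" "$fifteen" >> "$csv_file"
}

# Typical use (the path is hypothetical), e.g. from a cron job:
#   record_load_sample /var/tmp/load-history.csv
```

Calling this every few minutes builds a small history you can graph or grep later. For anything beyond a quick experiment, sar (covered below) does the same job more thoroughly.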
🔧 Essential Monitoring Tools
1. top & htop: The System Dashboard
# Basic system overview
top

# Inside top:
# Press '1' = Show all CPU cores
# Press 'M' = Sort by memory usage
# Press 'P' = Sort by CPU usage
# Press 'k' = Kill a process (enter PID)
# Press 'q' = Quit

# htop (nicer, color-coded)
sudo apt install htop    # Install if needed
htop

# htop features:
# F2 = Setup (configure display)
# F3 = Search for process
# F4 = Filter processes
# F5 = Tree view
# F6 = Sort by column
# F9 = Kill process
Understanding key metrics in top:
top - 10:30:00 up 30 days, 3:15, 1 user, load average: 0.50, 0.75, 1.00
Tasks: 125 total, 1 running, 124 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.2 us, 1.5 sy, 0.0 ni, 93.3 id, 0.0 wa, 0.0 hi, 0.0 si
MiB Mem : 7845.2 total, 1024.5 free, 2048.3 used, 4772.4 buff/cache
MiB Swap: 2048.0 total, 2048.0 free, 0.0 used. 5120.2 avail Mem
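Load average = Average number of processes running or waiting to run over the last 1, 5, and 15 minutes. On Linux it also counts processes blocked in uninterruptible I/O. A value of 1.00 per CPU core means the cores are fully busy.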
%Cpu(s) = User (us), System (sy), Idle (id), Wait I/O (wa) percentages
Memory = Total, used, free, buff/cache (cached disk data)
Swap = Virtual memory on disk (high usage = memory pressure)
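To make the "1.00 per CPU core" rule concrete, here is a minimal sketch that divides a load average by the core count. The function name `load_per_core` and its two-argument interface are made up for this example.

```shell
# load_per_core LOAD CORES
# Prints the load average divided by the core count, to two decimals.
# A result below 1.00 suggests spare CPU capacity; above 1.00 means
# work is queueing.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) { print "cores must be positive" > "/dev/stderr"; exit 1 }
        printf "%.2f\n", load / cores
    }'
}

# Typical use on a live system:
#   load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

With the sample output above, a 1-minute load of 0.50 on a 4-core machine is far below saturation, while the same 0.50 on a single core means the CPU is busy half the time.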
2. vmstat: Virtual Memory Statistics
vmstat shows processes, memory, paging, block IO, traps, and CPU activity.
# Basic vmstat
vmstat

# Refresh every 2 seconds, show 5 updates
vmstat 2 5

# Show with timestamps
vmstat -t 2 3

# Understanding vmstat output:
# Procs: r=run queue, b=blocked
# Memory: swpd=swap used, free, buff, cache
# Swap: si=swap in, so=swap out
# IO: bi=blocks in, bo=blocks out
# System: in=interrupts, cs=context switches
# CPU: us, sy, id, wa, st
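The si/so columns are the quickest way to spot active swapping. The sketch below filters vmstat output for samples where either is non-zero. It assumes the default column order (si and so as fields 7 and 8), and the function name `swap_activity` is illustrative.

```shell
# swap_activity: reads `vmstat` output on stdin and prints one line for
# each sample where pages were swapped in (si) or out (so).
# Assumes the default vmstat layout, where si and so are fields 7 and 8.
swap_activity() {
    awk '$1 ~ /^[0-9]+$/ {
        sample++                      # count data rows, skipping headers
        if ($7 > 0 || $8 > 0)
            printf "sample %d: si=%s so=%s\n", sample, $7, $8
    }'
}

# Typical use:
#   vmstat 2 5 | swap_activity
```

Keep in mind that the first row vmstat prints is an average since boot, so a hit on sample 1 alone says little about what is happening right now.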
3. iostat: Input/Output Statistics
iostat shows CPU and disk I/O statistics.
# Install sysstat package first
sudo apt install sysstat

# Show CPU and disk statistics
iostat

# Show only disk statistics
iostat -d

# Show extended statistics
iostat -x

# Show with human readable units
iostat -h

# Continuous monitoring (every 2 seconds)
iostat 2

# Understanding key metrics:
# %util = Disk utilization percentage (100% = saturated)
# await = Average wait time for I/O requests
# r/s, w/s = Reads/writes per second
# rkB/s, wkB/s = KB read/written per second
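For a quick "which disk is saturated?" check, the following sketch scans iostat -x output for devices above a utilization threshold. It assumes %util is the last column of the device table (verify against your own iostat header), and the function name `busy_disks` is hypothetical.

```shell
# busy_disks THRESHOLD: reads `iostat -x` output on stdin and prints
# "device %util" for every device at or above THRESHOLD percent.
# Assumes %util is the last column of the device table.
busy_disks() {
    awk -v limit="$1" '
        /^Device/ { in_table = 1; next }   # header row of a device table
        NF == 0   { in_table = 0; next }   # a blank line ends the table
        in_table && $NF + 0 >= limit + 0 { print $1, $NF }
    '
}

# Typical use:
#   iostat -x 2 3 | busy_disks 80
```

Reading the last field instead of a fixed column number keeps the filter working even though the number of columns in iostat -x differs between sysstat versions.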
4. free: Memory Usage
# Basic memory info
free

# Human readable (MB, GB)
free -h

# Show in megabytes
free -m

# Show total
free -t

# Show wide output (all details)
free -wh

# Understanding output:
# total = Total installed memory
# used = Memory used by processes
# free = Completely unused memory
# shared = Memory used by tmpfs
# buff/cache = Memory used for disk caching
# available = Memory available for new applications
Key insight: Linux uses otherwise idle memory for disk caching, so a low "free" value on its own is not a problem. Low "free" with high "available" is normal (caching). Low "available" means real memory pressure.
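Since "available" is the number that matters, a small helper can turn it into a percentage. This is a sketch that assumes the procps-ng layout of free, where "available" is the seventh field of the "Mem:" row. The function name `mem_available_pct` is invented for the example.

```shell
# mem_available_pct: reads `free` output on stdin and prints
# "available" as a whole-number percentage of total memory.
# Assumes "available" is field 7 of the "Mem:" row (procps-ng layout).
mem_available_pct() {
    awk '/^Mem:/ {
        if ($2 > 0) printf "%.0f\n", $7 / $2 * 100
    }'
}

# Typical use:
#   free | mem_available_pct
```

A server showing 10% "free" but 70% "available" is healthy. One showing 10% "available" deserves a closer look.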
5. sar: System Activity Reporter
sar collects and reports system activity information. It's like having a flight data recorder for your server.
# Install sysstat
sudo apt install sysstat

# Enable data collection (usually already enabled)
sudo systemctl enable sysstat
sudo systemctl start sysstat

# View today's CPU statistics
sar

# View memory usage
sar -r

# View swap usage
sar -S

# View disk I/O
sar -d

# View network statistics
sar -n DEV

# View specific time (e.g., 10:00 AM to 11:00 AM)
sar -s 10:00:00 -e 11:00:00

# Save report to file
sar -A > system-report.txt
6. dstat: Versatile Resource Statistics
dstat combines vmstat, iostat, ifstat, and more in one tool.
# Install dstat
sudo apt install dstat

# Basic dstat (refreshes every second)
dstat

# Show CPU, disk, network, paging, system
dstat -cdngy

# Show with colors
dstat -ta --color

# Show specific plugins
dstat --cpu --mem --net --disk

# Save to CSV for analysis
dstat --output /tmp/dstat.csv 5

# Key columns:
# cpu: usr, sys, idl, wai, hiq, siq
# dsk: read, writ
# net: recv, send
# paging: in, out
# system: int, csw
7. iotop: Monitor Disk I/O by Process
While iostat shows disk-level stats, iotop shows which processes are doing the I/O.
# Install iotop
sudo apt install iotop

# Run with sudo (needs root for process info)
sudo iotop

# Only show processes doing I/O
sudo iotop -o

# Show accumulated I/O
sudo iotop -a

# Batch mode (for scripting)
sudo iotop -b -n 5

# Understanding output:
# TID = Thread ID
# PRIO = I/O priority
# USER = Process owner
# DISK READ/DISK WRITE = Throughput
# SWAPIN = Percentage of time spent swapping in
# IO = Percentage of time spent waiting on I/O
8. iftop: Network Bandwidth Monitor
iftop shows real-time network bandwidth usage.
# Install iftop
sudo apt install iftop

# Monitor all interfaces
sudo iftop

# Monitor specific interface
sudo iftop -i eth0

# Show port numbers
sudo iftop -P

# Show in batch mode
sudo iftop -t -s 10

# Don't resolve hostnames (faster)
sudo iftop -n

# Understanding display:
# TX = Transmit (sending)
# RX = Receive (receiving)
# Cumulative = Total in session
# Peak = Highest rate
# Rates = Current rates
📝 Log Monitoring with journalctl & /var/log/
System Logs: Your Server's Diary
System logs are like a detailed diary of everything happening on your server. When something goes wrong, the logs usually know why.
Key Log Directories
# System logs (Ubuntu/Debian)
/var/log/syslog      # General system messages
/var/log/auth.log    # Authentication logs
/var/log/kern.log    # Kernel messages

# System logs (Red Hat/CentOS)
/var/log/messages    # General system messages
/var/log/secure      # Authentication logs

# Application logs
/var/log/nginx/      # Nginx web server
/var/log/apache2/    # Apache web server
/var/log/mysql/      # MySQL database
/var/lib/docker/containers/    # Docker container logs (default json-file driver)

# Service logs (systemd)
sudo journalctl      # All journal logs
journalctl: The Modern Log Viewer
journalctl is the standard tool for viewing logs on systemd-based systems.
# View all logs (oldest first; this is the default)
sudo journalctl

# View in reverse (newest first)
sudo journalctl -r

# Follow new logs (like tail -f)
sudo journalctl -f

# View logs for specific service
sudo journalctl -u nginx
sudo journalctl -u docker.service

# View logs from the current boot
sudo journalctl -b

# View logs from specific time
sudo journalctl --since "2024-02-10 09:00:00"
sudo journalctl --since "1 hour ago"
sudo journalctl --since yesterday

# Filter by priority/severity (-p shows the given level and anything more severe)
sudo journalctl -p err         # Errors and worse (err, crit, alert, emerg)
sudo journalctl -p warning     # Warnings and worse
sudo journalctl -p info        # Info and worse
sudo journalctl -p err..err    # Only the err level

# Show kernel messages
sudo journalctl -k

# Show with full details
sudo journalctl -o verbose

# Export logs to file
sudo journalctl --since "today" > /tmp/today-logs.txt

# Combine filters
sudo journalctl -u nginx --since "2 hours ago" -p err
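One detail that trips people up: journalctl -p LEVEL matches that level and everything more severe, not the single level alone. The sketch below lists which syslog severities a given -p value includes. The helper name `levels_at_or_above` is made up for illustration. The level names themselves are the standard syslog ones.

```shell
# levels_at_or_above LEVEL: prints the syslog severities that
# `journalctl -p LEVEL` includes, most severe first.
levels_at_or_above() {
    included=""
    for level in emerg alert crit err warning notice info debug; do
        included="${included:+$included }$level"
        if [ "$level" = "$1" ]; then
            echo "$included"
            return 0
        fi
    done
    echo "unknown level: $1" >&2
    return 1
}

# Example:
#   levels_at_or_above warning
```

To see a single level only, journalctl accepts a range such as -p warning..warning.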
Traditional Log Files
# View system log
sudo tail -f /var/log/syslog

# View authentication log
sudo tail -f /var/log/auth.log

# Search for errors in logs
sudo grep -i "error\|fail" /var/log/syslog

# Count occurrences of a pattern
sudo grep -c "Connection refused" /var/log/auth.log

# Monitor multiple logs
sudo tail -f /var/log/syslog /var/log/auth.log /var/log/nginx/error.log

# Find large log files
sudo find /var/log -type f -size +100M

# Rotate logs manually
sudo logrotate -f /etc/logrotate.conf
Log Monitoring Script
#!/bin/bash
# log-monitor.sh

LOG_DIRS="/var/log /var/log/nginx /var/log/mysql"
ALERT_EMAIL="admin@example.com"    # Reserved for emailing the report; not used below
ERROR_PATTERNS="error|fail|critical|panic|segmentation fault"

# Count error-pattern matches in a log file and show the five most recent.
# Note: this scans the whole file, not just a recent time window.
check_errors() {
    local logfile=$1
    local error_count
    error_count=$(sudo grep -c -i -E "$ERROR_PATTERNS" "$logfile" 2>/dev/null)
    # grep prints nothing if the file is unreadable, so default to 0
    if [ "${error_count:-0}" -gt 0 ]; then
        echo "Found $error_count errors in $logfile"
        # Get sample errors
        sudo grep -i -E "$ERROR_PATTERNS" "$logfile" 2>/dev/null | tail -5 | while IFS= read -r error; do
            echo "  - $error"
        done
    fi
}

# Monitor disk usage of logs
check_log_disk_usage() {
    local total_size
    total_size=$(sudo du -sh /var/log | cut -f1)
    echo "Total log disk usage: $total_size"
    # Check individual large log files
    sudo find /var/log -type f -size +100M 2>/dev/null | while IFS= read -r logfile; do
        size=$(sudo du -h "$logfile" | cut -f1)
        echo "Large log: $logfile ($size)"
    done
}

# Main monitoring
echo "=== Log Monitoring Report ==="
echo "Time: $(date)"
echo

# Check each log directory
for dir in $LOG_DIRS; do
    if [ -d "$dir" ]; then
        echo "Checking $dir:"
        # Find .log files in this directory only; subdirectories such as
        # /var/log/nginx are listed separately in LOG_DIRS
        find "$dir" -maxdepth 1 -name "*.log" -type f 2>/dev/null | while IFS= read -r logfile; do
            check_errors "$logfile"
        done
        echo
    fi
done

# Check disk usage
check_log_disk_usage
echo "=== Report Complete ==="
🔍 Troubleshooting Boot & Performance Issues
The Systematic Troubleshooting Approach
When troubleshooting, follow this methodical approach:
1. Identify symptoms (What's wrong?)
2. Check basic health (Is anything obviously broken?)
3. Review logs (What do the logs say?)
4. Monitor resources (What's overloaded?)
5. Test fixes (Does changing X help?)
6. Document (What was the solution?)
Boot Issues Troubleshooting
Common Boot Problems:
# 1. System won't boot at all
# Check: Power, cables, hardware

# 2. Boots to emergency/rescue mode
# Check: Filesystem errors, missing files

# 3. Boots but services don't start
# Check: Service configuration, dependencies
Boot Troubleshooting Commands:
# View boot messages (if you can get to console)
dmesg | less
dmesg | grep -i "error\|fail"

# View systemd boot process
journalctl -b
journalctl -xb    # With extra details

# Check filesystem (run only on an unmounted filesystem, e.g. from rescue mode)
sudo fsck /dev/sda1    # Check and repair filesystem

# Check disk for bad sectors (read-only test by default)
sudo badblocks -sv /dev/sda

# Check boot loader
sudo update-grub    # Regenerate the GRUB configuration (Debian/Ubuntu)

# Boot to single-user/recovery mode
# At GRUB menu: edit kernel line, add "single" or "init=/bin/bash"
Performance Issues Troubleshooting
Step 1: Check Basic System Health
# Quick health check
uptime               # Load average
free -h              # Memory usage
df -h                # Disk space
ss -tulpn            # Open ports
systemctl --failed   # Failed services
Step 2: Identify Resource Bottlenecks
# CPU Bottleneck Signs:
# - Load average > CPU cores
# - High %us/%sy with low %id in top
# - Long running processes

# Memory Bottleneck Signs:
# - Sustained swap activity (si/so in vmstat)
# - Out of memory errors in dmesg
# - Processes killed by OOM killer

# Disk Bottleneck Signs:
# - High %wa (I/O wait) in top
# - High %util in iostat (>80%)
# - High await time in iostat
# - Slow file operations

# Network Bottleneck Signs:
# - High packet loss (ping)
# - Slow transfer speeds
# - Connection timeouts
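The distinction between "CPU is busy" and "CPU is waiting on disk" can be sketched as a tiny classifier over the percentages from top or vmstat. The 20% I/O-wait and 80% busy cut-offs below are illustrative assumptions, not standard thresholds, and `classify_cpu` is a hypothetical helper name.

```shell
# classify_cpu USER SYS IOWAIT
# Labels one sample of CPU percentages (us, sy, wa).
# The 20% and 80% cut-offs are assumptions chosen for illustration.
classify_cpu() {
    awk -v us="$1" -v sy="$2" -v wa="$3" 'BEGIN {
        if (wa >= 20)            print "io-bound"    # CPU mostly waiting on I/O
        else if (us + sy >= 80)  print "cpu-bound"   # CPU doing real work
        else                     print "ok"
    }'
}

# Example with the sample top output shown earlier (5.2 us, 1.5 sy, 0.0 wa):
#   classify_cpu 5.2 1.5 0.0
```

Checking I/O wait first matters: a high load average with high %wa usually points at storage, and adding CPU cores would not help.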
Step 3: Performance Analysis Commands
# Find CPU-intensive processes
ps aux --sort=-%cpu | head -10

# Find memory-intensive processes
ps aux --sort=-%mem | head -10

# Check I/O wait
vmstat 2 5    # Look at wa column

# Check disk latency
iostat -x 1   # Look at r_await/w_await (await on older sysstat versions)

# Check network connectivity
ping -c 5 google.com
traceroute google.com
mtr google.com    # Combines ping + traceroute

# Check for too many open files
lsof | wc -l
ulimit -n         # Current per-process limit

# Check for zombie processes (state column starts with Z)
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'
Common Performance Problems & Solutions
Problem 1: High CPU Usage
# Identify culprit
top    # Look for process with high %CPU

# If it's a specific process, investigate
strace -p <PID>    # See system calls
perf top           # CPU profiling (install linux-tools)

# If it's many small processes
# Check for fork bombs or misbehaving scripts
pstree    # View process tree

# Solutions:
# 1. Kill/restart problematic process
# 2. Adjust process priority (nice/renice)
# 3. Add more CPU capacity
# 4. Optimize application code
Problem 2: High Memory Usage
# Check memory usage
free -h
cat /proc/meminfo

# Check what's using memory
ps aux --sort=-%mem | head -10
sudo slabtop    # Kernel memory usage

# Check for memory leaks
# Monitor a process over time (a steadily growing rss suggests a leak):
watch -n 1 'ps -p <PID> -o pid,ppid,%mem,%cpu,rss,cmd'

# Clear caches (temporary measure; the kernel frees cache on its own when needed)
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches

# Solutions:
# 1. Add more RAM
# 2. Adjust swappiness (vm.swappiness)
# 3. Fix memory leaks in applications
# 4. Add swap space
Problem 3: Slow Disk I/O
# Check disk performance
iostat -x 2
sudo iotop -o

# Check for disk errors
dmesg | grep -i "disk\|sda\|error"
sudo smartctl -a /dev/sda    # SMART data (smartmontools package)

# Check filesystem fragmentation (ext4)
sudo e4defrag -c /    # Check fragmentation level

# Check for too many small files (under 1 KiB)
# Note: -size -1k would match only empty files, because find rounds sizes up
sudo find / -xdev -type f -size -1024c 2>/dev/null | wc -l

# Solutions:
# 1. Upgrade to SSD
# 2. Use RAID for better performance
# 3. Optimize database/config settings
# 4. Move frequently accessed data to faster disk
Problem 4: Network Issues
# Basic connectivity check
ping -c 5 8.8.8.8       # By IP (tests routing)
ping -c 5 google.com    # By name (also tests DNS)

# Check DNS
nslookup google.com
dig google.com

# Check bandwidth usage
sudo iftop
sudo nethogs    # Bandwidth by process

# Check connection limits
ss -s    # Socket statistics
cat /proc/sys/net/ipv4/tcp_max_syn_backlog

# Check for packet loss
mtr google.com

# Solutions:
# 1. Check physical connections
# 2. Adjust network card settings
# 3. Optimize application network usage
# 4. Upgrade network infrastructure
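When packet loss needs to feed a script or an alert, the percentage can be pulled out of ping's summary line. This sketch assumes the usual "N% packet loss" wording in the summary. The function name `packet_loss_pct` is illustrative.

```shell
# packet_loss_pct: reads `ping` output on stdin and prints the packet
# loss percentage from the summary line, without the % sign.
# Assumes the summary contains the phrase "N% packet loss".
packet_loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Typical use:
#   ping -c 20 8.8.8.8 | packet_loss_pct
```

Anything consistently above zero on a wired network is worth investigating with mtr, which shows at which hop the loss starts.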
🎯 Real-World Monitoring Scenarios
Scenario 1: Production Server Monitoring Dashboard
#!/bin/bash
# server-dashboard.sh

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Header
header() {
    echo -e "${BLUE}========================================${NC}"
    echo -e "${BLUE} $1${NC}"
    echo -e "${BLUE}========================================${NC}"
}

# Status indicator
status() {
    if [ "$1" -eq 0 ]; then
        echo -e "${GREEN}✓ OK${NC}"
    else
        echo -e "${RED}✗ ISSUE${NC}"
    fi
}

# System Information
header "System Information"
echo "Hostname: $(hostname)"
echo "Uptime: $(uptime -p)"
echo "Load Average: $(cut -d' ' -f1-3 /proc/loadavg)"
echo "Processes: $(ps -e --no-headers | wc -l)"

# CPU Information
header "CPU Information"
CPU_CORES=$(nproc)
LOAD=$(cut -d' ' -f1 /proc/loadavg)
echo "Cores: $CPU_CORES"
echo -n "Load: $LOAD - "
# awk does the floating-point comparison, so bc is not required
if awk -v load="$LOAD" -v cores="$CPU_CORES" 'BEGIN { exit !(load > cores) }'; then
    echo -e "${RED}High Load${NC}"
else
    echo -e "${GREEN}Normal${NC}"
fi

# Memory Information
header "Memory Information"
MEM_TOTAL=$(free -h | awk '/^Mem:/ {print $2}')
MEM_USED=$(free -h | awk '/^Mem:/ {print $3}')
MEM_PERCENT=$(free | awk '/^Mem:/ {printf "%.0f", $3/$2*100}')
echo "Total: $MEM_TOTAL"
echo "Used: $MEM_USED ($MEM_PERCENT%)"
if [ "$MEM_PERCENT" -gt 90 ]; then
    echo -e "${RED}CRITICAL: High memory usage${NC}"
elif [ "$MEM_PERCENT" -gt 70 ]; then
    echo -e "${YELLOW}WARNING: High memory usage${NC}"
else
    echo -e "${GREEN}Normal${NC}"
fi

# Disk Information
header "Disk Information"
df -h / | awk 'NR==2 {print "Mount: "$6, "Size: "$2, "Used: "$3, "Avail: "$4, "Use%: "$5}'
DISK_PERCENT=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_PERCENT" -gt 90 ]; then
    echo -e "${RED}CRITICAL: Disk almost full${NC}"
elif [ "$DISK_PERCENT" -gt 80 ]; then
    echo -e "${YELLOW}WARNING: Disk usage high${NC}"
fi

# Service Status (adjust the list; e.g. the SSH unit is "sshd" on Red Hat/CentOS)
header "Service Status"
services=("ssh" "nginx" "mysql" "docker" "redis")
for service in "${services[@]}"; do
    echo -n "$service: "
    if systemctl is-active --quiet "$service" 2>/dev/null; then
        echo -e "${GREEN}Running${NC}"
    else
        echo -e "${RED}Stopped${NC}"
    fi
done

# Network Status
header "Network Status"
IP_ADDR=$(hostname -I | awk '{print $1}')
echo "IP Address: $IP_ADDR"
echo -n "Internet Connectivity: "
if ping -c 1 -W 2 8.8.8.8 >/dev/null 2>&1; then
    echo -e "${GREEN}Connected${NC}"
else
    echo -e "${RED}Disconnected${NC}"
fi

# Recent Errors
header "Recent Errors (Last 10)"
# Capture the output first: piping straight into tail would hide an empty result
ERRORS=$(sudo journalctl --since "10 minutes ago" -p err --no-pager -q 2>/dev/null | tail -10)
echo "${ERRORS:-No recent errors}"

echo -e "\n${BLUE}Report generated: $(date)${NC}"
Scenario 2: Automated Performance Alert System
#!/bin/bash
# performance-alert.sh
# Run as root (or change LOG_FILE): the script writes under /var/log.

# Configuration
ALERT_EMAIL="admin@example.com"
LOG_FILE="/var/log/performance-alerts.log"

# Thresholds
CPU_THRESHOLD=80
MEM_THRESHOLD=85
DISK_THRESHOLD=90
LOAD_THRESHOLD=$(nproc)

# Log function
log_alert() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Succeeds when $1 > $2 (floating-point comparison via awk, no bc needed)
is_greater() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !(a > b) }'
}

# Check CPU
check_cpu() {
    # The second vmstat sample covers the last second; column 15 is %idle
    CPU_USAGE=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')
    if is_greater "$CPU_USAGE" "$CPU_THRESHOLD"; then
        log_alert "CPU ALERT: Usage at ${CPU_USAGE}%"
        # Get top CPU processes
        TOP_PROCESSES=$(ps aux --sort=-%cpu | head -6)
        log_alert "Top processes:"$'\n'"$TOP_PROCESSES"
        return 1
    fi
    return 0
}

# Check Memory
check_memory() {
    MEM_USAGE=$(free | awk '/^Mem:/ {printf "%.0f", $3/$2*100}')
    # Guard against division by zero on systems without swap
    SWAP_USAGE=$(free | awk '/^Swap:/ {if ($2 > 0) printf "%.0f", $3/$2*100; else print 0}')
    if [ "$MEM_USAGE" -gt "$MEM_THRESHOLD" ]; then
        log_alert "MEMORY ALERT: Usage at ${MEM_USAGE}%"
        # Get top memory processes
        TOP_PROCESSES=$(ps aux --sort=-%mem | head -6)
        log_alert "Top memory users:"$'\n'"$TOP_PROCESSES"
        return 1
    fi
    if [ "$SWAP_USAGE" -gt 50 ]; then
        log_alert "SWAP ALERT: Swap usage at ${SWAP_USAGE}%"
        return 1
    fi
    return 0
}

# Check Disk
check_disk() {
    DISK_USAGE=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
    if [ "$DISK_USAGE" -gt "$DISK_THRESHOLD" ]; then
        log_alert "DISK ALERT: Usage at ${DISK_USAGE}%"
        # Find large files on the root filesystem only (-xdev)
        LARGE_FILES=$(sudo find / -xdev -type f -size +100M 2>/dev/null | head -10)
        log_alert "Large files found:"$'\n'"$LARGE_FILES"
        return 1
    fi
    return 0
}

# Check Load
check_load() {
    LOAD=$(cut -d' ' -f1 /proc/loadavg)
    if is_greater "$LOAD" "$LOAD_THRESHOLD"; then
        log_alert "LOAD ALERT: Load average at $LOAD"
        return 1
    fi
    return 0
}

# Main monitoring
log_alert "=== Starting performance check ==="
ALERT_COUNT=0
check_cpu    || ALERT_COUNT=$((ALERT_COUNT + 1))
check_memory || ALERT_COUNT=$((ALERT_COUNT + 1))
check_disk   || ALERT_COUNT=$((ALERT_COUNT + 1))
check_load   || ALERT_COUNT=$((ALERT_COUNT + 1))

if [ "$ALERT_COUNT" -gt 0 ]; then
    # Send email alert (requires a working mail command, e.g. from mailutils)
    SUBJECT="Performance Alert: $ALERT_COUNT issues detected"
    BODY=$(tail -20 "$LOG_FILE")
    echo "$BODY" | mail -s "$SUBJECT" "$ALERT_EMAIL"
    log_alert "Alert email sent to $ALERT_EMAIL"
fi
log_alert "=== Performance check complete ==="
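If you would rather not parse top or vmstat output for the CPU figure, one alternative is to compute it from two readings of the aggregate "cpu" line in /proc/stat. The sketch below is one way to do that, not the method the script above uses. The function name `cpu_busy_pct` is invented for this example, and it treats both idle and iowait time as "not busy".

```shell
# cpu_busy_pct BEFORE AFTER
# BEFORE and AFTER are two aggregate "cpu ..." lines from /proc/stat,
# taken some interval apart. Prints the busy share of that interval as
# a whole-number percentage.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Fields 2-9: user nice system idle iowait irq softirq steal.
            # Guest columns are skipped because user time already includes them.
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            idle = $5 + $6                # idle + iowait
            if (NR == 1) { total0 = total; idle0 = idle; next }
            dt = total - total0
            if (dt <= 0) exit 1
            printf "%.0f\n", (dt - (idle - idle0)) / dt * 100
        }'
}

# Typical use:
#   before=$(grep "^cpu " /proc/stat); sleep 1
#   after=$(grep "^cpu " /proc/stat)
#   cpu_busy_pct "$before" "$after"
```

Because the counters in /proc/stat are cumulative since boot, only the difference between two readings says anything about current usage.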
Scenario 3: Boot Performance Analysis
#!/bin/bash
# boot-analyzer.sh
# Analyze boot performance

echo "=== Boot Performance Analysis ==="
echo

# 1. Check boot time
echo "1. Boot Time Analysis:"
systemd-analyze
echo

# 2. Check what's slowing boot
echo "2. Slowest Boot Units:"
systemd-analyze blame | head -10
echo

# 3. Check boot chain
echo "3. Boot Chain Graph:"
systemd-analyze critical-chain
echo

# 4. Check kernel messages (case-insensitive; dmesg may need root)
echo "4. Kernel Boot Messages (errors/warnings):"
sudo dmesg | grep -iE "error|fail|warning" | head -10
echo

# 5. Check failed services
echo "5. Failed Services:"
systemctl --failed
echo

# 6. Check startup time of key services
echo "6. Key Services Startup Time:"
for service in ssh nginx mysql docker; do
    # blame prints "<time> <unit>"; keep the time for this unit
    TIME=$(systemd-analyze blame 2>/dev/null | awk -v unit="$service.service" '$NF == unit { $NF = ""; print }')
    if [ -n "$TIME" ]; then
        echo " $service: $TIME"
    fi
done
echo

# 7. Check for unnecessary services
echo "7. Enabled Services (consider disabling if not needed):"
systemctl list-unit-files --type=service --state=enabled | grep -v "@" | head -20
echo

echo "=== Analysis Complete ==="
echo "Recommendations:"
echo "1. Disable unnecessary services"
echo "2. Check slow services from 'systemd-analyze blame'"
echo "3. Consider using 'systemd-analyze plot > boot.svg' for detailed graph"
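The output of systemd-analyze blame mixes units (for example "1min 2.345s", "5.500s", "523ms"), which makes it awkward to filter by a threshold. The sketch below converts each entry to seconds and prints only the slow ones. The function name `slow_units` is hypothetical, and the parser assumes each line is a series of duration tokens followed by the unit name.

```shell
# slow_units SECONDS: reads `systemd-analyze blame` output on stdin and
# prints "unit seconds" for each unit that took longer than SECONDS.
slow_units() {
    awk -v limit="$1" '
        NF >= 2 {
            secs = 0
            # Every field except the last is a duration token such as
            # "1min", "2.345s" or "523ms"; the last field is the unit name.
            # awk reads the leading number of each token in arithmetic.
            for (i = 1; i < NF; i++) {
                if      ($i ~ /ms$/)  secs += $i / 1000
                else if ($i ~ /us$/)  secs += $i / 1000000
                else if ($i ~ /min$/) secs += $i * 60
                else if ($i ~ /h$/)   secs += $i * 3600
                else if ($i ~ /s$/)   secs += $i
            }
            if (secs > limit) printf "%s %.3f\n", $NF, secs
        }'
}

# Typical use:
#   systemd-analyze blame | slow_units 5
```

Remember that blame reports how long each unit took to initialize, not how much it delayed the boot. Units that start in parallel may look slow without being on the critical chain.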
Scenario 4: Continuous Monitoring with sar
#!/bin/bash
# sar-monitor.sh
# Generate daily performance report using sar
# Run as root (or change REPORT_DIR): the script writes under /var/log.

REPORT_DIR="/var/log/sar-reports"
REPORT_FILE="$REPORT_DIR/daily-report-$(date +%Y%m%d).txt"

# Create report directory
mkdir -p "$REPORT_DIR"

echo "=== Daily Performance Report ===" > "$REPORT_FILE"
echo "Generated: $(date)" >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# CPU Usage Summary (latest samples plus the daily average line)
echo "CPU Usage (latest samples and daily average):" >> "$REPORT_FILE"
sar | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Memory Usage Summary
echo "Memory Usage:" >> "$REPORT_FILE"
sar -r | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Swap Usage Summary
echo "Swap Usage:" >> "$REPORT_FILE"
sar -S | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Disk I/O Summary
# Some sysstat versions print names like dev8-0 unless -p is added to sar -d
echo "Disk I/O (/dev/sda):" >> "$REPORT_FILE"
sar -d | grep "sda" | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Network Summary (replace eth0 with your interface name, see: ip link)
echo "Network (eth0):" >> "$REPORT_FILE"
sar -n DEV | grep "eth0" | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Load Average
echo "Load Average:" >> "$REPORT_FILE"
sar -q | grep -E "^Average:|^[0-9]" | tail -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

# Peak Usage Times
# Sorts on column 4, which is %user when sar prints 24-hour timestamps
echo "Peak CPU Usage Times:" >> "$REPORT_FILE"
sar | grep -E "^[0-9]" | grep -v "%user" | sort -k4 -rn | head -5 >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

echo "Report saved to: $REPORT_FILE"
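Sorting sar output by a fixed column number is fragile, because 12-hour timestamps add an AM/PM field that shifts every column. A possible alternative, sketched below, ranks samples by %idle read from the last column. The function name `busiest_samples` is an assumption for this example.

```shell
# busiest_samples N: reads `sar` CPU output on stdin and prints the N
# samples with the lowest %idle, busiest first.
# %idle is read as the last column, so an AM/PM field does not shift it.
busiest_samples() {
    awk '$1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $NF ~ /^[0-9.]+$/ {
        print $NF, $0       # prefix each data row with its %idle value
    }' | sort -n | head -n "$1" | cut -d" " -f2-
}

# Typical use:
#   sar | busiest_samples 5
```

Header rows and the "Average:" line are skipped because their first or last field is not in the expected form.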
📋 Quick Reference Cheat Sheet
| Tool | Command | Purpose |
|---|---|---|
| CPU Monitoring | top, htop, mpstat | Real-time CPU usage |
| Memory Monitoring | free -h, vmstat, sar -r | Memory usage |
| Disk Monitoring | iostat -x, iotop, df -h | Disk I/O and space |
| Network Monitoring | iftop, nethogs, sar -n DEV | Network bandwidth |
| Process Monitoring | ps aux, pstree, lsof | Process information |
| Load Average | uptime, sar -q | System load |
| Boot Analysis | systemd-analyze, dmesg | Boot performance |
| Log Monitoring | journalctl, tail -f /var/log/syslog | System logs |
| Continuous Monitoring | sar, dstat | Historical data |
| I/O Wait | vmstat 1, iostat 1 | Disk bottleneck detection |
| Swap Monitoring | swapon --show, free -h | Swap usage |
| Service Monitoring | systemctl status, systemctl --failed | Service health |
🚀 Practice Exercises
Exercise 1: Create a Resource Monitor
#!/bin/bash
# resource-monitor.sh
# watch redraws the screen itself, so no explicit clear is needed
watch -n 2 '
echo "=== System Resource Monitor ==="
echo "Time: $(date)"
echo
echo "CPU Load:"
uptime
echo
echo "Memory Usage:"
free -h | grep -E "^Mem:"
echo
echo "Disk Usage:"
df -h / | tail -1
echo
echo "Top 5 Processes by CPU:"
ps aux --sort=-%cpu | head -6 | awk "{print \$1, \$2, \$3, \$11}"
echo
echo "Top 5 Processes by Memory:"
ps aux --sort=-%mem | head -6 | awk "{print \$1, \$2, \$4, \$11}"
'
Exercise 2: Analyze Boot Performance
# 1. Check current boot time
systemd-analyze

# 2. See what's taking time
systemd-analyze blame | head -10

# 3. Generate boot chart
systemd-analyze plot > boot.svg
# Transfer boot.svg to view graphically

# 4. Check for failed services
systemctl --failed

# 5. Disable a non-essential service (example: bluetooth)
sudo systemctl disable bluetooth.service
sudo systemctl mask bluetooth.service

# 6. Re-check boot time after reboot
Exercise 3: Monitor Disk I/O in Real-time
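# Terminal 1: Monitor overall disk I/O
watch -n 1 'iostat -x 1 2 | tail -10'

# Terminal 2: Monitor per-process I/O
sudo iotop -o

# Terminal 3: Generate disk load
# conv=fdatasync forces the data to disk before dd exits.
# If /tmp is a tmpfs (RAM-backed), write to a path on a real disk instead.
dd if=/dev/zero of=/tmp/testfile bs=1M count=1000 conv=fdatasync

# Observe what happens in Terminals 1 and 2
# Then clean up:
rm /tmp/testfile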
Exercise 4: Network Performance Test
# 1. Check current bandwidth
sudo iftop -i eth0

# 2. Test network speed
# On server: iperf3 -s
# On client: iperf3 -c server-ip

# 3. Check for packet loss
ping -c 100 google.com | grep "packet loss"

# 4. Trace route with statistics
mtr google.com

# 5. Check active connections
ss -tun | wc -l
ss -tun | grep ESTAB | wc -l
🔗 Master System Monitoring with Hands-on Labs
System monitoring is what separates junior from senior DevOps engineers. The ability to quickly identify and resolve performance issues is invaluable.
👉 Practice system monitoring, performance analysis, and troubleshooting in our interactive labs at:
https://devops.trainwithsky.com/
Our platform provides:
Real server environments to monitor
Performance bottleneck scenarios
Troubleshooting challenges
Production-like monitoring setups
Guided exercises with expert feedback
Frequently Asked Questions
Q: What's a "good" load average?
A: As a rule of thumb, the load average should stay below the number of CPU cores. Example: on a 4-core CPU, a sustained load under 4 is fine.
Q: How often should I check system metrics?
A: In real time during incidents, every 5-15 minutes for routine checks, and review historical data daily or weekly.
Q: What's more important: CPU or memory?
A: It depends! Database servers need memory, compute servers need CPU. Monitor both.
Q: Should I be worried about high buff/cache memory?
A: No! Linux uses free memory for disk caching. This is good - it speeds up disk access.
Q: How do I find what caused a server crash yesterday?
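A: Check /var/log/syslog (or /var/log/messages on Red Hat/CentOS) and journalctl --since yesterday. If the server rebooted, journalctl -b -1 shows the previous boot (this requires a persistent journal). dmesg only covers the current boot.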
Q: What's the difference between %wa and %id in CPU stats?
A: %wa = waiting for I/O (disk bottleneck), %id = idle (CPU has nothing to do).
Q: How can I monitor a script's resource usage?
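A: Use /usr/bin/time -v ./script.sh for a summary after it finishes, or run the script and watch it live with top -p "$(pgrep -d, -f scriptname)". The -d, option joins multiple PIDs with commas, which is the format top -p expects.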
Having monitoring or performance issues? Share your metrics and symptoms in the comments below! 💬