Troubleshooting & Best Practices: The Complete DevOps Field Guide
Master the art of troubleshooting and learn industry-proven best practices for resilient, maintainable systems.
📅 Published: Feb 2026
⏱️ Estimated Reading Time: 22 minutes
🏷️ Tags: Troubleshooting, Root Cause Analysis, System Recovery, DevOps Best Practices, Production Support
🚨 Common Boot Issues: When Your Server Won't Start
Understanding the Boot Process
Think of server boot as a relay race: each component must successfully pass the baton to the next. When a server won't boot, one runner in the chain has dropped it:
BIOS/UEFI → Bootloader (GRUB) → Kernel → Initramfs → Root Filesystem → init/systemd → Services
Critical Boot Failure Scenarios
Scenario 1: GRUB Rescue Mode
Symptoms: Black screen with grub rescue> prompt, server won't boot past bootloader.
Why it happens: GRUB (Grand Unified Bootloader) can't find its configuration or the kernel image. This is like a GPS that lost its map.
```bash
# At the grub rescue> prompt, first find your root partition
grub rescue> ls
# Output: (hd0) (hd0,msdos1) (hd0,msdos2)

# Figure out which partition has /boot
grub rescue> ls (hd0,msdos1)/
# If you see /boot/grub, that's your boot partition

# Set root and prefix
grub rescue> set root=(hd0,msdos1)
grub rescue> set prefix=(hd0,msdos1)/boot/grub

# Load normal mode
grub rescue> insmod normal
grub rescue> normal

# Once booted, reinstall GRUB
sudo grub-install /dev/sda
sudo update-grub
```
Prevention: Always test kernel updates in staging. Keep a known-good kernel entry in GRUB.
Scenario 2: Filesystem Corruption
Symptoms: Drops to (initramfs) prompt, "Kernel panic - not syncing", "mount: can't read superblock".
Why it happens: Filesystem metadata corruption from improper shutdown, hardware issues, or disk errors.
```bash
# At the initramfs prompt, check the filesystem
initramfs> e2fsck -y /dev/sda1
# -y = automatically answer yes to all prompts

# If e2fsck fails, force a check
initramfs> e2fsck -fy /dev/sda1
# -f = force check even if filesystem seems clean
# -y = assume yes

# Mount manually to verify
initramfs> mount /dev/sda1 /mnt
initramfs> ls /mnt
# Should see root filesystem contents

# Exit to continue boot
initramfs> exit
```
Prevention: Enable periodic filesystem checks, use a UPS for power protection, and monitor disk SMART data.
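For the SMART part of that prevention list, a minimal sketch of a `smartd` directive works well; the device name, self-test schedule, and alert address below are placeholders for illustration:

```
# /etc/smartd.conf
# Monitor /dev/sda: enable automatic offline testing (-o) and attribute
# autosave (-S), run a short self-test daily at 02:00 and a long test
# Saturdays at 03:00, and mail on failure.
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
```

Restart the `smartd`/`smartmontools` service after editing so the schedule takes effect.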
Scenario 3: FSTAB Configuration Error
Symptoms: Drops to emergency mode, with "Failed to mount" or "Timed out waiting for device" errors for an /etc/fstab entry.
Why it happens: A mount entry in /etc/fstab references a device that doesn't exist or has incorrect options.
```bash
# When you see the emergency mode prompt:
# 1. Enter the root password

# 2. Check fstab for errors
cat /etc/fstab
# Look for:
# - Incorrect UUIDs
# - Missing devices
# - Invalid mount options

# 3. Temporarily comment out the problematic line
#    (use nano or vi; add # at the beginning of the bad mount entry)

# 4. Test fstab
mount -a
# Should mount all filesystems without error

# 5. Reboot
reboot

# To find the correct UUID for a device:
blkid /dev/sda1
```
Prevention: Use UUIDs instead of device names (/dev/sda1), since device names can change between boots. Always test fstab changes with mount -a before rebooting.
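A UUID-based fstab entry looks like the sketch below; the UUID and mount point are placeholders. Adding `nofail` is also worth considering for non-critical mounts, so a missing disk degrades gracefully instead of blocking boot:

```
# /etc/fstab
# <file system>                            <mount>  <type>  <options>        <dump> <pass>
UUID=0a3407de-014b-458b-b5c1-848e92a327a3  /data    ext4    defaults,nofail  0      2
```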
Scenario 4: Failed Systemd Service
Symptoms: Boot completes but critical service (network, SSH) doesn't start. Server boots but is unreachable.
```bash
# From console or out-of-band management:

# Check failed services
systemctl --failed

# View a specific service failure
systemctl status networking
journalctl -u networking -xe

# Common fixes:
# 1. Fix the configuration error
nano /etc/netplan/01-netcfg.yaml
netplan apply

# 2. Re-enable the service
systemctl enable networking
systemctl start networking

# 3. Reset failed state
systemctl reset-failed networking
```
Scenario 5: Full /boot Partition
Symptoms: Boot fails after kernel update, error about no space left on device during boot.
Why it happens: /boot partition is full of old kernels (common with automatic updates).
```bash
# From recovery mode or live CD:

# 1. Check space
df -h /boot

# 2. List installed kernels
dpkg -l | grep linux-image   # Ubuntu/Debian
rpm -qa | grep kernel        # Red Hat/CentOS

# 3. Remove old kernels (keep current and one previous)
# Ubuntu:
sudo apt autoremove --purge
# Manual removal:
sudo rm -rf /boot/*-3.13.*   # Replace with the old version number

# 4. Update GRUB
sudo update-grub
```
Prevention: Set up automatic kernel cleanup, increase the /boot partition size, or stop using a separate /boot partition altogether.
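On Ubuntu/Debian, that automatic cleanup can be enabled through unattended-upgrades. A sketch of the relevant options is below; verify the exact names against the commented template in your distribution's `/etc/apt/apt.conf.d/50unattended-upgrades`:

```
// /etc/apt/apt.conf.d/50unattended-upgrades
Unattended-Upgrade::Remove-Unused-Kernel-Packages "true";
Unattended-Upgrade::Remove-Unused-Dependencies "true";
```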
Boot Recovery Toolkit
```bash
#!/bin/bash
# boot-recovery-toolkit.sh
# Run from a recovery environment or live CD

echo "=== Boot Recovery Toolkit ==="
echo

# 1. Check filesystems BEFORE mounting (fsck must not run on a mounted filesystem)
echo "1. Checking filesystems..."
fsck -y /dev/sda1
fsck -y /dev/sda2

# 2. Mount the root filesystem
echo "2. Mounting root filesystem..."
mkdir -p /mnt/root
mount /dev/sda1 /mnt/root
mount --bind /dev  /mnt/root/dev
mount --bind /proc /mnt/root/proc
mount --bind /sys  /mnt/root/sys

# 3. Run repairs inside a chroot
#    (commands written after a bare `chroot /mnt/root /bin/bash` would only
#    run after the chroot shell exits, so feed them in explicitly)
chroot /mnt/root /bin/bash <<'EOF'
echo "3. Reinstalling GRUB..."
grub-install /dev/sda
update-grub

echo "4. Updating initramfs..."
update-initramfs -u -k all

# Note: systemctl cannot restart services inside a chroot; fix network
# configuration files here and restart services after rebooting.
EOF

echo "Boot recovery completed!"
```
🔍 Log Analysis Techniques: Finding Needles in Haystacks
The Log Analysis Mindset
Think of logs as your system's black box: they recorded everything leading up to the crash. Your job as the detective is to reconstruct what happened.
Log Analysis Workflow
1. What's the symptom? → 2. When did it start? → 3. What changed? → 4. What's the pattern? → 5. Root cause → 6. Fix
Essential Log Locations
```bash
# System logs
/var/log/syslog       # Ubuntu/Debian (everything)
/var/log/messages     # Red Hat/CentOS (everything)
/var/log/auth.log     # Ubuntu/Debian (authentication)
/var/log/secure       # Red Hat/CentOS (authentication)
/var/log/kern.log     # Kernel messages
/var/log/dmesg        # Boot messages
/var/log/boot.log     # Boot process log

# Application logs
/var/log/nginx/       # Nginx web server
/var/log/apache2/     # Apache web server
/var/log/mysql/       # MySQL database
/var/log/postgresql/  # PostgreSQL database
/var/log/redis/       # Redis cache
/var/log/docker/      # Docker containers

# User logs
~/.bash_history       # Command history
~/.xsession-errors    # GUI application errors
```
Log Analysis Command Patterns
```bash
# 1. Find errors in the last hour
journalctl --since "1 hour ago" -p err

# 2. Find a specific error pattern
grep -r "OutOfMemoryError\|OOM" /var/log/

# 3. Find error frequency
grep "Failed password" /var/log/auth.log | cut -d' ' -f1-3 | sort | uniq -c | sort -rn

# 4. Correlate timestamps
# First, find the error time
ERROR_TIME=$(grep "connection refused" /var/log/nginx/error.log | tail -1 | cut -d' ' -f1-2)
# Then check other logs around that time
# (journalctl needs a concrete timestamp for --until; narrow as needed)
journalctl --since "$ERROR_TIME"

# 5. Watch for specific errors in real time
tail -f /var/log/syslog | grep -i "error\|fail\|critical"

# 6. Extract structured data from logs
grep "GET" /var/log/nginx/access.log | awk '{print $1, $7, $9}' | sort | uniq -c

# 7. Check error counts per hour (zero-padded hours 00-23)
for hour in $(seq -w 0 23); do
    COUNT=$(grep "error" /var/log/app.log | grep -c " $hour:")
    echo "Hour $hour: $COUNT errors"
done
```
Real-World Log Analysis Scenarios
Scenario 1: Performance Degradation
Symptoms: Application slow, users complaining. No obvious errors.
```bash
# 1. Check system logs for resource issues
journalctl --since "1 hour ago" | grep -i "overcommit\|oom\|swap"

# 2. Look for slow queries in database logs
grep "slow query" /var/log/mysql/mysql-slow.log | tail -20

# 3. Check web server response times (assumes request time is the last field)
tail -1000 /var/log/nginx/access.log | awk '{print $NF}' | sort -rn | head -10

# 4. Look for connection buildup
grep "connect" /var/log/nginx/error.log | tail -20

# 5. Check for garbage collection in Java apps
grep "Full GC" /var/log/app.log | tail -10
```
Scenario 2: Security Breach Investigation
Symptoms: Unusual activity, unknown processes, unexpected outbound connections.
```bash
#!/bin/bash
# security-audit.sh

echo "=== Security Incident Investigation ==="
echo "Time: $(date)"
echo

# 1. Check authentication failures
echo "1. Failed SSH attempts (last 24h):"
grep "Failed password" /var/log/auth.log | grep "$(date +%b\ %d)" | \
    awk '{print $11}' | sort | uniq -c | sort -rn

# 2. Check successful logins at unusual times
echo -e "\n2. Successful logins after hours:"
grep "Accepted" /var/log/auth.log | grep -E "(0[0-5]|23):" | tail -20

# 3. Check sudo commands
echo -e "\n3. Suspicious sudo commands:"
grep "sudo" /var/log/auth.log | \
    grep -i "useradd\|usermod\|passwd\|chmod 777" | tail -20

# 4. Check for privilege escalation
#    (parentheses are needed so -newermt applies to both -perm tests)
echo -e "\n4. SUID/SGID files modified recently:"
find / -type f \( -perm /4000 -o -perm /2000 \) -newermt "24 hours ago" 2>/dev/null

# 5. Check crontabs for unauthorized entries
echo -e "\n5. Unexpected cron jobs:"
for user in $(cut -f1 -d: /etc/passwd); do
    crontab -u "$user" -l 2>/dev/null | grep -v "^#" && echo "  (user: $user)"
done
```
Scenario 3: Database Connection Issues
Symptoms: Application can't connect to database, intermittent connection resets.
```bash
# 1. Check database logs for connection errors
sudo journalctl -u mysql --since "1 hour ago" | grep -i "connect\|accept\|refused"

# 2. Check for connection limit exceeded
grep "max_connections" /var/log/mysql/error.log

# 3. Look for aborted connections
grep "Aborted_connects" /var/log/mysql/error.log

# 4. Check system limits for the database process
MYSQL_PID=$(pidof mysqld)
cat /proc/$MYSQL_PID/limits | grep "open files"

# 5. Monitor current connections
mysql -e "show processlist;" | wc -l
mysql -e "show status like '%connect%';"

# 6. Check network connectivity from the app server
telnet db-server 3306
nc -zv db-server 3306
```
Centralized Logging Preparation
```bash
# 1. Configure rsyslog to forward logs to a central server
#    (single @ = UDP, double @@ = TCP, which is more reliable; choose one)
cat >> /etc/rsyslog.conf << 'EOF'
*.* @logserver.example.com:514
*.* @@logserver.example.com:514
EOF
systemctl restart rsyslog

# 2. Configure journald for persistent storage
mkdir -p /var/log/journal
systemctl restart systemd-journald

# 3. Log rotation policies
cat > /etc/logrotate.d/custom << 'EOF'
/var/log/myapp/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 640 appuser appgroup
    sharedscripts
    postrotate
        systemctl reload myapp
    endscript
}
EOF
```
💾 Backup & Restore Strategies: Your Safety Net
The Backup Pyramid
Think of backup strategies as layers of protection:
```
Level 3: Off-site Backups  (Disaster Recovery)
Level 2: On-site Backups   (Hardware Failure)
Level 1: Snapshots         (Accidental Deletion)
Level 0: RAID              (Disk Failure)
```
3-2-1 Backup Rule Explained
3 copies of your data
2 different media types
1 copy off-site
This is not just a rule; it is survival insurance for your data.
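A rule is only useful if it is checked. Below is a minimal, hedged sketch of a freshness check: a shell function that warns when the newest file under a backup directory is older than a threshold. The directory layout and the 26-hour threshold in the usage example are assumptions, not part of any standard tool:

```shell
#!/bin/bash
# backup-freshness.sh - warn if the newest backup is older than a threshold

check_backup_freshness() {
    local dir=$1 max_age_hours=$2
    # Newest modification time (epoch seconds) of any file under $dir
    local newest
    newest=$(find "$dir" -type f -printf '%T@\n' 2>/dev/null | sort -rn | head -1)
    if [ -z "$newest" ]; then
        echo "MISSING: no backups found in $dir"
        return 2
    fi
    # Strip fractional seconds, then compute the age in whole hours
    local age_hours=$(( ( $(date +%s) - ${newest%.*} ) / 3600 ))
    if [ "$age_hours" -gt "$max_age_hours" ]; then
        echo "STALE: newest backup in $dir is ${age_hours}h old"
        return 1
    fi
    echo "OK: newest backup in $dir is ${age_hours}h old"
}
```

Run it from cron against each copy, e.g. `check_backup_freshness /backup 26`, and alert on a non-zero exit status.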
Backup Strategy by Data Type
```bash
#!/bin/bash
# backup-strategy.sh
# Different strategies for different data types

# 1. DATABASES - use native tools, hot backups
backup_database() {
    DB_NAME=$1
    BACKUP_PATH="/backup/db/$(date +%Y%m%d)"
    mkdir -p "$BACKUP_PATH"

    # MySQL/MariaDB
    mysqldump --single-transaction \
              --routines \
              --triggers \
              --events \
              "$DB_NAME" | gzip > "$BACKUP_PATH/$DB_NAME.sql.gz"

    # PostgreSQL
    pg_dump -Fc "$DB_NAME" > "$BACKUP_PATH/$DB_NAME.dump"

    # MongoDB
    mongodump --db "$DB_NAME" --out "$BACKUP_PATH"
}

# 2. APPLICATION CONFIGURATION - version control
backup_config() {
    # Store configs in git
    cd /etc
    git init
    git add .
    git commit -m "Config backup $(date)"
    git bundle create "/backup/config/config-$(date +%Y%m%d).bundle" --all
}

# 3. USER DATA - incremental backups
backup_user_data() {
    rsync -avz --delete \
        --link-dest=/backup/users/yesterday \
        /home/ \
        "/backup/users/$(date +%Y%m%d)/"

    # Create symlink to latest
    ln -snf "/backup/users/$(date +%Y%m%d)" /backup/users/latest
}

# 4. SYSTEM FILES - image-based backup
backup_system() {
    # Create a raw disk image
    dd if=/dev/sda1 of="/backup/system/root-$(date +%Y%m%d).img" bs=4M

    # Or use dump
    dump -0uf "/backup/system/root-$(date +%Y%m%d).dump" /dev/sda1
}
```
Automated Backup Framework
```bash
#!/bin/bash
# backup-framework.sh
# Enterprise-grade backup automation

# Configuration
BACKUP_ROOT="/backup"
BACKUP_HOST="backup01.example.com"
BACKUP_USER="backup"
RETENTION_DAYS=30
LOG_FILE="/var/log/backup.log"
ALERT_EMAIL="admin@example.com"

# Logging
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Error handler
error_exit() {
    log "ERROR: $1"
    echo "Backup failed: $1" | mail -s "BACKUP FAILED" "$ALERT_EMAIL"
    exit 1
}

# Pre-backup checks
pre_checks() {
    log "Running pre-backup checks..."

    # Check disk space
    SPACE=$(df "$BACKUP_ROOT" | awk 'NR==2{print $4}')
    if [ "$SPACE" -lt 10485760 ]; then   # 10 GB in KB
        error_exit "Insufficient backup space"
    fi

    # Check connectivity to the backup host
    ping -c 1 "$BACKUP_HOST" >/dev/null || error_exit "Backup host unreachable"

    # Check backup directory permissions
    touch "$BACKUP_ROOT/test.$$" || error_exit "Cannot write to backup directory"
    rm "$BACKUP_ROOT/test.$$"
}

# Database backup
backup_databases() {
    log "Starting database backup..."

    # MySQL: all databases
    if command -v mysql >/dev/null; then
        mysqldump --all-databases \
                  --single-transaction \
                  --routines \
                  --triggers \
                  --events 2>/dev/null | gzip > \
            "$BACKUP_ROOT/mysql-full-$(date +%Y%m%d-%H%M%S).sql.gz"
    fi

    # PostgreSQL
    if command -v psql >/dev/null; then
        sudo -u postgres pg_dumpall | gzip > \
            "$BACKUP_ROOT/postgresql-full-$(date +%Y%m%d-%H%M%S).sql.gz"
    fi
}

# File system backup
backup_files() {
    log "Starting file system backup..."

    # Important directories
    rsync -avz --delete \
        --exclude='.cache' \
        --exclude='tmp' \
        --exclude='*.log' \
        /etc/ /home/ /var/www/ /opt/ \
        "$BACKUP_HOST:$BACKUP_ROOT/$(hostname)/files/$(date +%Y%m%d)/" \
        || error_exit "File backup failed"
}

# Application-specific backup
backup_applications() {
    log "Starting application backup..."

    # Back up Docker volumes
    if command -v docker >/dev/null; then
        docker run --rm \
            -v /var/lib/docker/volumes:/source \
            -v "$BACKUP_ROOT/docker":/target \
            alpine tar czf "/target/volumes-$(date +%Y%m%d).tar.gz" -C /source .
    fi

    # Back up Kubernetes resources
    if command -v kubectl >/dev/null; then
        kubectl get all --all-namespaces -o yaml > \
            "$BACKUP_ROOT/kubernetes-$(date +%Y%m%d).yaml"
    fi
}

# Retention policy
apply_retention() {
    log "Applying retention policy..."

    # Delete backups older than RETENTION_DAYS
    find "$BACKUP_ROOT" -type f -name "*.gz"   -mtime +$RETENTION_DAYS -delete
    find "$BACKUP_ROOT" -type f -name "*.dump" -mtime +$RETENTION_DAYS -delete
    find "$BACKUP_ROOT" -type d -empty -delete
}

# Verification
verify_backups() {
    log "Verifying backups..."
    VERIFY_LOG="$BACKUP_ROOT/verify-$(date +%Y%m%d).log"

    # Test compressed database backups
    for backup in "$BACKUP_ROOT"/*.sql.gz; do
        if [ -f "$backup" ]; then
            gunzip -t "$backup" && echo "✓ $backup" >> "$VERIFY_LOG" || \
                echo "✗ $backup (corrupt)" >> "$VERIFY_LOG"
        fi
    done
}

# Main execution
main() {
    log "=== Starting Backup Process ==="
    pre_checks
    backup_databases
    backup_files
    backup_applications
    apply_retention
    verify_backups
    log "=== Backup Completed Successfully ==="
}

# Run if executed directly
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
    main
fi
```
Disaster Recovery Drill
```bash
#!/bin/bash
# disaster-recovery-drill.sh
# Practice restoring from backups

SCENARIO=$1

case $SCENARIO in
    "database")
        echo "Simulating: Database corruption"
        echo "Step 1: Stop application"
        echo "Step 2: Stop database service"
        echo "Step 3: Restore from latest backup"
        echo "Step 4: Verify data integrity"
        echo "Step 5: Start services"
        echo "RTO: 15 minutes"
        echo "RPO: 5 minutes (point-in-time recovery)"
        ;;
    "server")
        echo "Simulating: Complete server loss"
        echo "Step 1: Provision new instance"
        echo "Step 2: Apply base configuration"
        echo "Step 3: Restore data from backup"
        echo "Step 4: Verify application functionality"
        echo "Step 5: Switch DNS/Load Balancer"
        echo "RTO: 1 hour"
        echo "RPO: 24 hours"
        ;;
    "file")
        echo "Simulating: Accidental file deletion"
        echo "Step 1: Locate file in backup"
        echo "Step 2: Restore single file"
        echo "Step 3: Verify permissions"
        echo "Step 4: Notify user"
        echo "RTO: 5 minutes"
        echo "RPO: Real-time"
        ;;
esac

# Actual restore test
restore_test() {
    echo "Performing quarterly restore test..."

    # Create an isolated test environment (a MySQL container, so the
    # restore target actually has a running database)
    docker run -d --name restore-test -e MYSQL_ALLOW_EMPTY_PASSWORD=yes mysql:8
    sleep 30   # give the server time to initialize

    # Restore the backup into the test container
    gunzip -c /backup/latest.sql.gz | docker exec -i restore-test mysql

    # Run integrity checks
    docker exec restore-test mysqlcheck --all-databases

    # Clean up
    docker rm -f restore-test
}
```
⚠️ Kernel Panics & Recovery: When the Kernel Dies
Understanding Kernel Panics
A kernel panic is the Linux equivalent of the blue screen of death: the kernel's last resort when it encounters a fatal error it cannot recover from.
Common Kernel Panic Causes
Hardware failure (bad RAM, overheating, disk failure)
Corrupted kernel image (bad update, disk corruption)
Incompatible kernel module (faulty driver)
Filesystem corruption (superblock damage)
Out of memory (OOM killer triggered too late)
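The OOM case in particular leaves a distinctive trail in the kernel log. Here is a hedged sketch of extracting OOM-killer victims from that output; it is fed a sample string here, and in practice you would pipe in `journalctl -k` or `dmesg` instead (the exact message prefix varies slightly between kernel versions, but the `Killed process <pid> (<name>)` portion is stable):

```shell
#!/bin/bash
# oom-victims.sh - list processes the OOM killer terminated

oom_victims() {
    # Kernel OOM lines look like:
    #   Out of memory: Killed process 1234 (mysqld) total-vm:...
    grep -oE 'Killed process [0-9]+ \([^)]+\)' | \
        sed -E 's/Killed process ([0-9]+) \(([^)]+)\)/pid=\1 name=\2/'
}

# Example with sample log text; replace with: journalctl -k | oom_victims
sample='Jan 10 03:12:44 host kernel: Out of memory: Killed process 1234 (mysqld) total-vm:8123456kB
Jan 10 03:12:45 host kernel: oom_reaper: reaped process 1234 (mysqld)'
echo "$sample" | oom_victims
# → pid=1234 name=mysqld
```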
Capturing Crash Information
```bash
# 1. Check for crash dumps
ls -la /var/crash/

# 2. Install kdump (kernel crash dumping)
sudo apt install linux-crashdump   # Ubuntu/Debian
sudo yum install kexec-tools       # Red Hat/CentOS

# 3. Configure crash kernel parameters
sudo nano /etc/default/grub
# Add to GRUB_CMDLINE_LINUX:
#   crashkernel=256M
sudo update-grub

# 4. Trigger a test crash (WARNING: the system will reboot!)
echo c > /proc/sysrq-trigger
# System will crash, save a dump, and reboot

# 5. Analyze the crash dump (requires kernel debug symbols)
sudo crash /usr/lib/debug/boot/vmlinux-$(uname -r) /var/crash/202402101230/dump
```
Kernel Panic Recovery Procedure
```bash
#!/bin/bash
# kernel-panic-recovery.sh

echo "=== Kernel Panic Recovery ==="
echo

# 1. Try to capture the panic message
echo "1. Capturing panic message..."
dmesg -T | tail -50 > /root/panic-$(date +%Y%m%d-%H%M%S).log

# 2. Force reboot (if the system is frozen)
echo "2. Attempting reboot..."
echo b > /proc/sysrq-trigger
# Or press: Alt + SysRq + R, E, I, S, U, B
# ("Raising Elephants Is So Utterly Boring")

# 3. After reboot, select an older kernel at the GRUB menu:
#    hold Shift during boot, select "Advanced options for Ubuntu",
#    then choose the previous kernel version

# 4. Once booted, check logs
echo "3. Analyzing panic cause..."
journalctl -k --since "1 hour ago" | grep -i "panic\|oops\|error"

# 5. Check hardware
echo "4. Running hardware diagnostics..."
# Memory test
sudo memtester 1G 1
# Disk health
sudo smartctl -a /dev/sda | grep -i "reallocated\|pending\|uncorrectable"
# CPU errors
sudo mcelog --ignorenodev

# 6. Fix the issue (set PANIC_CAUSE based on the analysis above)
echo "5. Applying fixes..."
case $PANIC_CAUSE in
    "kernel_module")
        # Blacklist the problematic module
        echo "blacklist badmodule" >> /etc/modprobe.d/blacklist.conf
        update-initramfs -u
        ;;
    "filesystem")
        # Check and repair the filesystem (from recovery, not while mounted)
        fsck -fy /dev/sda1
        ;;
    "memory")
        # Reserve the bad memory area
        # (add this to GRUB_CMDLINE_LINUX in /etc/default/grub)
        echo 'memmap=2M$8M' >> /etc/default/grub
        update-grub
        ;;
esac
```
Preventing Kernel Panics
```bash
# 1. Use kernel live patching for critical security updates
# Canonical Livepatch (Ubuntu)
sudo snap install canonical-livepatch
sudo canonical-livepatch enable YOUR_TOKEN

# 2. Monitor kernel logs proactively
#!/bin/bash
# kernel-monitor.sh
journalctl -k -f | while read line; do
    if echo "$line" | grep -qi "error\|fail\|warning\|oops"; then
        echo "Kernel issue detected: $line"
        # Send alert
        echo "Kernel warning on $(hostname): $line" | \
            mail -s "Kernel Alert" admin@example.com
    fi
done

# 3. Set panic timeout (auto-reboot 10 seconds after a panic)
echo "10" > /proc/sys/kernel/panic
# Make permanent
echo "kernel.panic=10" >> /etc/sysctl.conf

# 4. Enable kernel crash dumps
sudo systemctl enable kdump-tools
sudo systemctl start kdump-tools

# 5. Test kernel updates in staging
# Always have a rollback plan!
```
✅ Best Practices for DevOps Environments
1. Infrastructure as Code (IaC)
Don't: "SSH into production and fix things manually"
Do: "All infrastructure defined in code, reviewed, versioned, and tested"
```yaml
# infrastructure.yml (Ansible)
---
- name: Configure web servers
  hosts: web
  vars:
    nginx_port: 8080
    max_connections: 1000
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present
    - name: Configure nginx
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: restart nginx
  handlers:
    - name: restart nginx
      service:
        name: nginx
        state: restarted
```
2. Immutable Infrastructure
Concept: Once deployed, servers are never modified. Updates = replace.
```bash
#!/bin/bash
# immutable-deployment.sh

# Old way (mutable):
ssh prod-server "apt update && apt upgrade nginx"

# New way (immutable):

# 1. Build a new AMI/container image with Packer
packer build -var 'version=2.0.1' web-server.json

# 2. Deploy a new instance
aws ec2 run-instances --image-id ami-12345 --instance-type t3.micro

# 3. Shift traffic
aws elbv2 register-targets --target-group-arn arn:aws:elasticloadbalancing:... \
    --targets Id=i-1234567890abcdef0

# 4. Verify health
curl -f https://new-instance/health

# 5. Deregister the old instance
aws elbv2 deregister-targets --target-group-arn arn:aws:elasticloadbalancing:... \
    --targets Id=i-0987654321fedcba0

# 6. Terminate the old instance
aws ec2 terminate-instances --instance-ids i-0987654321fedcba0
```
3. Observability: Beyond Monitoring
```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']   # Node exporter
  - job_name: 'app'
    metrics_path: /metrics
    static_configs:
      - targets: ['app1:8080', 'app2:8080']

# Visualized via Grafana dashboards.
# Three pillars of observability:
# 1. METRICS - numerical measurements (Prometheus)
# 2. LOGS    - event records (ELK/Loki)
# 3. TRACES  - request flow (Jaeger)
```
4. Security by Default
```bash
#!/bin/bash
# security-baseline.sh
# Applied to every new server

# 1. Minimal installation
apt-get install --no-install-recommends ...

# 2. Remove unnecessary packages
apt-get remove telnet rsh-server yp-tools tftpd

# 3. SSH hardening
cat >> /etc/ssh/sshd_config << 'EOF'
Protocol 2
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 2
AllowUsers deploy
EOF

# 4. Kernel hardening
cat >> /etc/sysctl.conf << 'EOF'
# IP spoofing protection
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
# Ignore ICMP redirects
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
# Disable source packet routing
net.ipv4.conf.all.accept_source_route = 0
net.ipv6.conf.all.accept_source_route = 0
# Log martian packets
net.ipv4.conf.all.log_martians = 1
EOF

# 5. File permissions
chmod 750 /home/*
chmod 600 /etc/shadow
chmod 644 /etc/passwd
chmod 644 /etc/group

# 6. Automatic security updates
apt-get install unattended-upgrades
dpkg-reconfigure -plow unattended-upgrades
```
5. Deployment Best Practices
```yaml
# .gitlab-ci.yml
stages:
  - test
  - build
  - staging
  - production

variables:
  APP_VERSION: $CI_COMMIT_SHORT_SHA

test:
  stage: test
  script:
    - npm install
    - npm test
    - npm run lint

build:
  stage: build
  script:
    - docker build -t myapp:$APP_VERSION .
    - docker tag myapp:$APP_VERSION myregistry.com/myapp:$APP_VERSION
    - docker push myregistry.com/myapp:$APP_VERSION

staging:
  stage: staging
  script:
    - kubectl set image deployment/myapp myapp=myregistry.com/myapp:$APP_VERSION
    - kubectl rollout status deployment/myapp
    - curl -f https://staging.example.com/health

production:
  stage: production
  script:
    - kubectl set image deployment/myapp myapp=myregistry.com/myapp:$APP_VERSION
    - kubectl rollout status deployment/myapp
    - curl -f https://example.com/health
  when: manual
  environment:
    name: production
```
6. Monitoring & Alerting Best Practices
```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'pagerduty'

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

# Alert design principles:
# 1. Alert on symptoms, not causes
#    BAD:  "High CPU"
#    GOOD: "5xx errors > 1% for 5 minutes"
# 2. Every alert must have a runbook
# 3. Test alerts regularly
# 4. Avoid alert fatigue
```
7. Post-Mortem Culture
```markdown
# Post-Mortem Template

## Incident Summary
- Date: YYYY-MM-DD
- Duration: 2 hours 15 minutes
- Impact: 5% of users unable to checkout

## Timeline
- 14:23 - PagerDuty alert: Checkout API 5xx errors
- 14:25 - On-call engineer acknowledges
- 14:30 - Identified database connection pool exhaustion
- 14:45 - Scaled up application servers
- 15:10 - Errors continue, root cause not resolved
- 15:30 - Identified connection leak in payment service
- 15:45 - Rolled back payment service deployment
- 16:38 - All systems recovered

## Root Cause
Connection leak in payment service version 2.1.0 when handling timeout
errors from the third-party gateway.

## Action Items
- [ ] Add connection pool metrics dashboard
- [ ] Implement circuit breaker for payment gateway
- [ ] Add integration test for timeout scenarios
- [ ] Review rollback procedure documentation

## Blameless Culture Statement
No individual is at fault. We improve our systems together.
```
8. Capacity Planning
```bash
#!/bin/bash
# capacity-planning.sh

echo "=== Capacity Planning Report ==="
echo "Period: $(date -d '30 days ago' +%Y-%m-%d) to $(date +%Y-%m-%d)"
echo

# 1. Trend analysis
echo "1. Growth Trends:"
echo "   - CPU:    $(sar | awk '/Average:/ {print $3}')% average"
echo "   - Memory: $(sar -r | awk '/Average:/ {print $4}')% average"
echo "   - Disk:   $(df -h / | awk 'NR==2{print $5}')"

# 2. Peak analysis
echo "2. Peak Usage (last 30 days):"
echo "   - Peak CPU:     $(sar | sort -k4 -rn | head -1 | awk '{print $3}')%"
echo "   - Peak Memory:  $(sar -r | sort -k5 -rn | head -1 | awk '{print $4}')%"
echo "   - Peak Network: $(sar -n DEV | grep eth0 | sort -k6 -rn | head -1 | awk '{print $5}') MB/s"

# 3. Projection (next 90 days)
echo "3. 90-Day Projection:"
CURRENT_DISK=$(df / | awk 'NR==2{print $3}')          # KB used
DAILY_GROWTH=$(find / -type f -newermt "-7 days" -printf "%s\n" 2>/dev/null | \
    awk '{sum+=$1} END{print sum/7/1024/1024}')       # MB/day
PROJECTED_90=$(echo "$CURRENT_DISK + ($DAILY_GROWTH * 90 * 1024)" | bc)   # KB
echo "   - Disk usage in 90 days: $(echo "$PROJECTED_90/1024/1024" | bc) GB"
```
9. Documentation as Code
````markdown
# README.md
# Infrastructure Documentation

## Architecture
- 3 web servers behind a load balancer
- 2 PostgreSQL in active-passive replication
- Redis cache cluster
- Object storage for assets

## Deployment
```bash
# Deploy web service
make deploy-web

# Rollback
make rollback-web
```

## Common Operations

### Scaling
```bash
kubectl scale deployment web --replicas=5
```

### Backup Verification
```bash
./scripts/verify-backup.sh
```

## Troubleshooting
| Symptom      | Likely Cause               | Resolution                                   |
|--------------|----------------------------|----------------------------------------------|
| High latency | Connection pool exhaustion | `kubectl scale deployment api --replicas=10` |
| Disk full    | Logs                       | `./scripts/cleanup-logs.sh`                  |

## Runbooks

### Incident: Database Connection Failure
1. Check database status: `systemctl status postgresql`
2. Check replication lag: `./scripts/check-replication.sh`
3. Failover to standby: `./scripts/failover-db.sh`

## Contact
- On-call: #devops-oncall
- Escalation: devops-lead@company.com
````
10. Chaos Engineering
```python
# chaos-monkey.py
# Deliberately inject failures to test resilience
import random
import subprocess
import time

TARGETS = ['web-01', 'web-02', 'db-01', 'cache-01']

def inject_failure(target):
    experiments = [
        f"docker stop {target}",
        f"docker kill -s SIGTERM {target}",
        f"docker pause {target}",
        f"docker network disconnect bridge {target}"
    ]
    experiment = random.choice(experiments)
    print(f"Injecting: {experiment}")
    subprocess.run(experiment, shell=True)

    # Wait and check auto-recovery
    time.sleep(30)
    # Note: {{ }} in an f-string emits literal { }, so four braces are
    # needed to produce docker's {{.State.Status}} template
    status = subprocess.run(
        f"docker inspect -f '{{{{.State.Status}}}}' {target}",
        shell=True, capture_output=True, text=True)
    print(f"Status: {status.stdout.strip()}")

if __name__ == "__main__":
    print("Starting Chaos Experiment...")
    while True:
        target = random.choice(TARGETS)
        inject_failure(target)
        time.sleep(random.randint(60, 300))
```

📋 Quick Reference Troubleshooting Matrix
| Symptom | First Command | Deep Dive | Most Common Fix |
|---|---|---|---|
| Won't boot | journalctl -xb | fsck /dev/sda1 | Repair filesystem |
| High CPU | top -o %CPU | ps aux --sort=-%cpu | Restart process |
| Memory leak | free -h | ps aux --sort=-%mem | Update app, add swap |
| Disk full | df -h | du -sh /* \| sort -rh | Clean logs, resize |
| Network down | ip a | ping gateway, traceroute | Restart network |
| DNS issues | cat /etc/resolv.conf | dig google.com | Add 8.8.8.8 |
| Service down | systemctl status | journalctl -u service | systemctl restart |
| Kernel panic | dmesg \| tail | journalctl -k -b -1 | Boot older kernel |
| Database slow | mysqladmin status | SHOW PROCESSLIST | Add index, optimize |
| Container fails | docker logs | docker inspect | Check resources |
🚀 Practice Exercises
Exercise 1: Simulate and Fix Boot Issue
```bash
# 1. Break GRUB (in a VM!)
sudo dd if=/dev/zero of=/dev/sda bs=446 count=1

# 2. Boot from live CD/USB

# 3. Mount root
sudo mount /dev/sda1 /mnt
sudo mount --bind /dev /mnt/dev
sudo mount --bind /proc /mnt/proc
sudo mount --bind /sys /mnt/sys

# 4. Chroot and fix
sudo chroot /mnt /bin/bash
grub-install /dev/sda
update-grub
exit

# 5. Reboot
sudo reboot
```
Exercise 2: Create Recovery USB
```bash
#!/bin/bash
# create-recovery-usb.sh

USB_DEVICE="/dev/sdb"
echo "Creating recovery USB on $USB_DEVICE"

# Download Ubuntu minimal
wget http://archive.ubuntu.com/ubuntu/dists/focal/main/installer-amd64/current/legacy-images/netboot/mini.iso

# Write to USB
sudo dd if=mini.iso of=$USB_DEVICE bs=4M status=progress

# Add custom scripts
sudo mkdir -p /mnt/recovery
sudo mount ${USB_DEVICE}1 /mnt/recovery
sudo cp recovery-scripts/* /mnt/recovery/
sudo umount /mnt/recovery

echo "Recovery USB created!"
```
Exercise 3: Chaos Monkey Test
```bash
#!/bin/bash
# chaos-test.sh

echo "Starting Chaos Test - Web Service"

# Ensure the service is running
kubectl scale deployment web --replicas=3
kubectl rollout status deployment/web

# Chaos test 1: kill pods randomly
for i in {1..10}; do
    POD=$(kubectl get pods -l app=web -o name | shuf -n1)
    echo "Killing $POD"
    kubectl delete "$POD"
    sleep 5
    kubectl get pods -l app=web
done

# Chaos test 2: network latency
POD=$(kubectl get pods -l app=web -o name | head -1)
kubectl exec "$POD" -- tc qdisc add dev eth0 root netem delay 100ms
echo "Added 100ms latency"
sleep 30
kubectl exec "$POD" -- tc qdisc del dev eth0 root netem

# Check recovery
curl -f http://web-service/health || echo "Service unhealthy!"
```
🔗 Master Troubleshooting with Hands-on Labs
The best way to learn troubleshooting is through real failures in safe environments. Practice these scenarios until they become second nature.
👉 Practice troubleshooting and incident response in our interactive labs at:
https://devops.trainwithsky.com/
Our platform provides:
Real broken systems to fix
Production incident simulations
Time-pressured troubleshooting challenges
Guided post-mortems
Community incident reviews
Frequently Asked Questions
Q: How do I stay calm during production incidents?
A: Follow your runbook, don't rush, communicate clearly, and remember: no one died from a server being down.
Q: What should be in an emergency runbook?
A: Clear steps, contact list, service dependencies, rollback procedures, and where to find logs.
Q: How often should I test backups?
A: Quarterly full restore tests, weekly integrity checks, daily verification of backup completion.
Q: What's the most common cause of kernel panics?
A: Hardware issues (especially memory) and buggy kernel modules from third-party drivers.
Q: How do I prove root cause to management?
A: Timeline with evidence, correlated logs, and a reproduction of the issue in a test environment.
Q: When should I reboot a production server?
A: Only when you know exactly why, and you're prepared for it not coming back up.
Q: What's the best monitoring tool?
A: The one you actually use and maintain. Simple tools with proper dashboards beat complex ones nobody understands.
Have you faced a unique troubleshooting challenge? Share your war story in the comments below! 💬