Troubleshooting & Best Practices: The Complete DevOps Field Guide
Master the art of troubleshooting and learn industry-proven best practices for resilient, maintainable systems.
📅 Published: Feb 2026
⏱️ Estimated Reading Time: 22 minutes
🏷️ Tags: Troubleshooting, Root Cause Analysis, System Recovery, DevOps Best Practices, Production Support
🚨 Common Boot Issues: When Your Server Won't Start
Understanding the Boot Process
Think of server boot as a relay race: each component must successfully pass the baton to the next. When a server won't boot, one runner in the chain has dropped it:
BIOS/UEFI → Bootloader (GRUB) → Kernel → Initramfs → Root Filesystem → init/systemd → Services
Critical Boot Failure Scenarios
Scenario 1: GRUB Rescue Mode
Symptoms: Black screen with grub rescue> prompt, server won't boot past bootloader.
Why it happens: GRUB (Grand Unified Bootloader) can't find its configuration or the kernel image. This is like a GPS that lost its map.
```bash
# At the grub rescue> prompt, first find your root partition
grub rescue> ls
# Output: (hd0) (hd0,msdos1) (hd0,msdos2)

# Figure out which partition has /boot
grub rescue> ls (hd0,msdos1)/
# If you see /boot/grub, that's your boot partition

# Set root and prefix
grub rescue> set root=(hd0,msdos1)
grub rescue> set prefix=(hd0,msdos1)/boot/grub

# Load normal mode
grub rescue> insmod normal
grub rescue> normal

# Once booted, reinstall GRUB
sudo grub-install /dev/sda
sudo update-grub
```
Prevention: Always test kernel updates in staging. Keep a known-good kernel entry in GRUB.
Scenario 2: Filesystem Corruption
Symptoms: Drops to (initramfs) prompt, "Kernel panic - not syncing", "mount: can't read superblock".
Why it happens: Filesystem metadata corruption from improper shutdown, hardware issues, or disk errors.
```bash
# At the initramfs prompt, check the filesystem
initramfs> e2fsck -y /dev/sda1
# -y = automatically answer yes to all prompts

# If e2fsck fails, force a check
initramfs> e2fsck -fy /dev/sda1
# -f = force check even if filesystem seems clean
# -y = assume yes

# Mount manually to verify
initramfs> mount /dev/sda1 /mnt
initramfs> ls /mnt
# Should see root filesystem contents

# Exit to continue boot
initramfs> exit
```
Prevention: Enable periodic filesystem checks, use a UPS for power protection, and monitor disk SMART data.
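For the SMART part of that prevention list, a minimal sketch of a `smartd` directive works well; the device name, self-test schedule, and alert address below are placeholders for illustration:

```
# /etc/smartd.conf
# Monitor /dev/sda: enable automatic offline testing (-o) and attribute
# autosave (-S), run a short self-test daily at 02:00 and a long test
# Saturdays at 03:00, and mail on failure.
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
```

Restart the `smartd`/`smartmontools` service after editing so the schedule takes effect.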
Scenario 3: FSTAB Configuration Error
Symptoms: Drops to emergency mode, with "Failed to mount" or "Timed out waiting for device" errors for an /etc/fstab entry.
Why it happens: A mount entry in /etc/fstab references a device that doesn't exist or has incorrect options.
```bash
# When you see the emergency mode prompt:
# 1. Enter the root password

# 2. Check fstab for errors
cat /etc/fstab
# Look for:
# - Incorrect UUIDs
# - Missing devices
# - Invalid mount options

# 3. Temporarily comment out the problematic line
#    (use nano or vi; add # at the beginning of the bad mount entry)

# 4. Test fstab
mount -a
# Should mount all filesystems without error

# 5. Reboot
reboot

# To find the correct UUID for a device:
blkid /dev/sda1
```
Prevention: Use UUIDs instead of device names (/dev/sda1), since device names can change between boots. Always test fstab changes with mount -a before rebooting.
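A UUID-based fstab entry looks like the sketch below; the UUID and mount point are placeholders. Adding `nofail` is also worth considering for non-critical mounts, so a missing disk degrades gracefully instead of blocking boot:

```
# /etc/fstab
# <file system>                            <mount>  <type>  <options>        <dump> <pass>
UUID=0a3407de-014b-458b-b5c1-848e92a327a3  /data    ext4    defaults,nofail  0      2
```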
Scenario 4: Failed Systemd Service
Symptoms: Boot completes but critical service (network, SSH) doesn't start. Server boots but is unreachable.
```bash
# From console or out-of-band management:

# Check failed services
systemctl --failed

# View a specific service failure
systemctl status networking
journalctl -u networking -xe

# Common fixes:
# 1. Fix the configuration error
nano /etc/netplan/01-netcfg.yaml
netplan apply

# 2. Re-enable the service
systemctl enable networking
systemctl start networking

# 3. Reset failed state
systemctl reset-failed networking
```
Scenario 5: Full /boot Partition
Symptoms: Boot fails after kernel update, error about no space left on device during boot.
Why it happens: /boot partition is full of old kernels (common with automatic updates).
```bash
# From recovery mode or live CD:

# 1. Check space
df -h /boot

# 2. List installed kernels
dpkg -l | grep linux-image   # Ubuntu/Debian
rpm -qa | grep kernel        # Red Hat/CentOS

# 3. Remove old kernels (keep current and one previous)
# Ubuntu:
sudo apt autoremove --purge
# Manual removal:
sudo rm -rf /boot/*-3.13.*   # Replace with the old version number

# 4. Update GRUB
sudo update-grub
```
Prevention: Set up automatic kernel cleanup, increase the /boot partition size, or stop using a separate /boot partition altogether.
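On Ubuntu/Debian, that automatic cleanup can be enabled through unattended-upgrades. A sketch of the relevant options is below; verify the exact names against the commented template in your distribution's `/etc/apt/apt.conf.d/50unattended-upgrades`:

```
// /etc/apt/apt.conf.d/50unattended-upgrades
Unattended-Upgrade::Remove-Unused-Kernel-Packages "true";
Unattended-Upgrade::Remove-Unused-Dependencies "true";
```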
Boot Recovery Toolkit
```bash
#!/bin/bash
# boot-recovery-toolkit.sh
# Run from a recovery environment or live CD

echo "=== Boot Recovery Toolkit ==="
echo

# 1. Check filesystems BEFORE mounting (fsck must not run on a mounted filesystem)
echo "1. Checking filesystems..."
fsck -y /dev/sda1
fsck -y /dev/sda2

# 2. Mount the root filesystem
echo "2. Mounting root filesystem..."
mkdir -p /mnt/root
mount /dev/sda1 /mnt/root
mount --bind /dev  /mnt/root/dev
mount --bind /proc /mnt/root/proc
mount --bind /sys  /mnt/root/sys

# 3. Run repairs inside a chroot
#    (commands written after a bare `chroot /mnt/root /bin/bash` would only
#    run after the chroot shell exits, so feed them in explicitly)
chroot /mnt/root /bin/bash <<'EOF'
echo "3. Reinstalling GRUB..."
grub-install /dev/sda
update-grub

echo "4. Updating initramfs..."
update-initramfs -u -k all

# Note: systemctl cannot restart services inside a chroot; fix network
# configuration files here and restart services after rebooting.
EOF

echo "Boot recovery completed!"
```
🔍 Log Analysis Techniques: Finding Needles in Haystacks
The Log Analysis Mindset
Think of logs as your system's black box: they recorded everything leading up to the crash. Your job as the detective is to reconstruct what happened.
Log Analysis Workflow
1. What's the symptom? → 2. When did it start? → 3. What changed? → 4. What's the pattern? → 5. Root cause → 6. Fix
Essential Log Locations
```bash
# System logs
/var/log/syslog       # Ubuntu/Debian (everything)
/var/log/messages     # Red Hat/CentOS (everything)
/var/log/auth.log     # Ubuntu/Debian (authentication)
/var/log/secure       # Red Hat/CentOS (authentication)
/var/log/kern.log     # Kernel messages
/var/log/dmesg        # Boot messages
/var/log/boot.log     # Boot process log

# Application logs
/var/log/nginx/       # Nginx web server
/var/log/apache2/     # Apache web server
/var/log/mysql/       # MySQL database
/var/log/postgresql/  # PostgreSQL database
/var/log/redis/       # Redis cache
/var/log/docker/      # Docker containers

# User logs
~/.bash_history       # Command history
~/.xsession-errors    # GUI application errors
```
Log Analysis Command Patterns
```bash
# 1. Find errors in the last hour
journalctl --since "1 hour ago" -p err

# 2. Find a specific error pattern
grep -r "OutOfMemoryError\|OOM" /var/log/

# 3. Find error frequency
grep "Failed password" /var/log/auth.log | cut -d' ' -f1-3 | sort | uniq -c | sort -rn

# 4. Correlate timestamps
# First, find the error time
ERROR_TIME=$(grep "connection refused" /var/log/nginx/error.log | tail -1 | cut -d' ' -f1-2)
# Then check other logs around that time
# (journalctl needs a concrete timestamp for --until; narrow as needed)
journalctl --since "$ERROR_TIME"

# 5. Watch for specific errors in real time
tail -f /var/log/syslog | grep -i "error\|fail\|critical"

# 6. Extract structured data from logs
grep "GET" /var/log/nginx/access.log | awk '{print $1, $7, $9}' | sort | uniq -c

# 7. Check error counts per hour (zero-padded hours 00-23)
for hour in $(seq -w 0 23); do
    COUNT=$(grep "error" /var/log/app.log | grep -c " $hour:")
    echo "Hour $hour: $COUNT errors"
done
```
Real-World Log Analysis Scenarios
Scenario 1: Performance Degradation
Symptoms: Application slow, users complaining. No obvious errors.
```bash
# 1. Check system logs for resource issues
journalctl --since "1 hour ago" | grep -i "overcommit\|oom\|swap"

# 2. Look for slow queries in database logs
grep "slow query" /var/log/mysql/mysql-slow.log | tail -20

# 3. Check web server response times (assumes request time is the last field)
tail -1000 /var/log/nginx/access.log | awk '{print $NF}' | sort -rn | head -10

# 4. Look for connection buildup
grep "connect" /var/log/nginx/error.log | tail -20

# 5. Check for garbage collection in Java apps
grep "Full GC" /var/log/app.log | tail -10
```
Scenario 2: Security Breach Investigation
Symptoms: Unusual activity, unknown processes, unexpected outbound connections.
```bash
#!/bin/bash
# security-audit.sh

echo "=== Security Incident Investigation ==="
echo "Time: $(date)"
echo

# 1. Check authentication failures
echo "1. Failed SSH attempts (last 24h):"
grep "Failed password" /var/log/auth.log | grep "$(date +%b\ %d)" | \
    awk '{print $11}' | sort | uniq -c | sort -rn

# 2. Check successful logins at unusual times
echo -e "\n2. Successful logins after hours:"
grep "Accepted" /var/log/auth.log | grep -E "(0[0-5]|23):" | tail -20

# 3. Check sudo commands
echo -e "\n3. Suspicious sudo commands:"
grep "sudo" /var/log/auth.log | \
    grep -i "useradd\|usermod\|passwd\|chmod 777" | tail -20

# 4. Check for privilege escalation
#    (parentheses are needed so -newermt applies to both -perm tests)
echo -e "\n4. SUID/SGID files modified recently:"
find / -type f \( -perm /4000 -o -perm /2000 \) -newermt "24 hours ago" 2>/dev/null

# 5. Check crontabs for unauthorized entries
echo -e "\n5. Unexpected cron jobs:"
for user in $(cut -f1 -d: /etc/passwd); do
    crontab -u "$user" -l 2>/dev/null | grep -v "^#" && echo "  (user: $user)"
done
```
Scenario 3: Database Connection Issues
Symptoms: Application can't connect to database, intermittent connection resets.
```bash
# 1. Check database logs for connection errors
sudo journalctl -u mysql --since "1 hour ago" | grep -i "connect\|accept\|refused"

# 2. Check for connection limit exceeded
grep "max_connections" /var/log/mysql/error.log

# 3. Look for aborted connections
grep "Aborted_connects" /var/log/mysql/error.log

# 4. Check system limits for the database process
MYSQL_PID=$(pidof mysqld)
cat /proc/$MYSQL_PID/limits | grep "open files"

# 5. Monitor current connections
mysql -e "show processlist;" | wc -l
mysql -e "show status like '%connect%';"

# 6. Check network connectivity from the app server
telnet db-server 3306
nc -zv db-server 3306
```
Centralized Logging Preparation
```bash
# 1. Configure rsyslog to forward logs to a central server
#    (single @ = UDP, double @@ = TCP, which is more reliable; choose one)
cat >> /etc/rsyslog.conf << 'EOF'
*.* @logserver.example.com:514
*.* @@logserver.example.com:514
EOF
systemctl restart rsyslog

# 2. Configure journald for persistent storage
mkdir -p /var/log/journal
systemctl restart systemd-journald

# 3. Log rotation policies
cat > /etc/logrotate.d/custom << 'EOF'
/var/log/myapp/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 640 appuser appgroup
    sharedscripts
    postrotate
        systemctl reload myapp
    endscript
}
EOF
```
💾 Backup & Restore Strategies: Your Safety Net
The Backup Pyramid
Think of backup strategies as layers of protection:
```
Level 3: Off-site Backups  (Disaster Recovery)
Level 2: On-site Backups   (Hardware Failure)
Level 1: Snapshots         (Accidental Deletion)
Level 0: RAID              (Disk Failure)
```
3-2-1 Backup Rule Explained
3 copies of your data
2 different media types
1 copy off-site
This is not just a rule; it is survival insurance for your data.
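A rule is only useful if it is checked. Below is a minimal, hedged sketch of a freshness check: a shell function that warns when the newest file under a backup directory is older than a threshold. The directory layout and the 26-hour threshold in the usage example are assumptions, not part of any standard tool:

```shell
#!/bin/bash
# backup-freshness.sh - warn if the newest backup is older than a threshold

check_backup_freshness() {
    local dir=$1 max_age_hours=$2
    # Newest modification time (epoch seconds) of any file under $dir
    local newest
    newest=$(find "$dir" -type f -printf '%T@\n' 2>/dev/null | sort -rn | head -1)
    if [ -z "$newest" ]; then
        echo "MISSING: no backups found in $dir"
        return 2
    fi
    # Strip fractional seconds, then compute the age in whole hours
    local age_hours=$(( ( $(date +%s) - ${newest%.*} ) / 3600 ))
    if [ "$age_hours" -gt "$max_age_hours" ]; then
        echo "STALE: newest backup in $dir is ${age_hours}h old"
        return 1
    fi
    echo "OK: newest backup in $dir is ${age_hours}h old"
}
```

Run it from cron against each copy, e.g. `check_backup_freshness /backup 26`, and alert on a non-zero exit status.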
Backup Strategy by Data Type
```bash
#!/bin/bash
# backup-strategy.sh
# Different strategies for different data types

# 1. DATABASES - use native tools, hot backups
backup_database() {
    DB_NAME=$1
    BACKUP_PATH="/backup/db/$(date +%Y%m%d)"
    mkdir -p "$BACKUP_PATH"

    # MySQL/MariaDB
    mysqldump --single-transaction \
              --routines \
              --triggers \
              --events \
              "$DB_NAME" | gzip > "$BACKUP_PATH/$DB_NAME.sql.gz"

    # PostgreSQL
    pg_dump -Fc "$DB_NAME" > "$BACKUP_PATH/$DB_NAME.dump"

    # MongoDB
    mongodump --db "$DB_NAME" --out "$BACKUP_PATH"
}

# 2. APPLICATION CONFIGURATION - version control
backup_config() {
    # Store configs in git
    cd /etc
    git init
    git add .
    git commit -m "Config backup $(date)"
    git bundle create "/backup/config/config-$(date +%Y%m%d).bundle" --all
}

# 3. USER DATA - incremental backups
backup_user_data() {
    rsync -avz --delete \
        --link-dest=/backup/users/yesterday \
        /home/ \
        "/backup/users/$(date +%Y%m%d)/"

    # Create symlink to latest
    ln -snf "/backup/users/$(date +%Y%m%d)" /backup/users/latest
}

# 4. SYSTEM FILES - image-based backup
backup_system() {
    # Create a raw disk image
    dd if=/dev/sda1 of="/backup/system/root-$(date +%Y%m%d).img" bs=4M

    # Or use dump
    dump -0uf "/backup/system/root-$(date +%Y%m%d).dump" /dev/sda1
}
```
Automated Backup Framework
```bash
#!/bin/bash
# backup-framework.sh
# Enterprise-grade backup automation

# Configuration
BACKUP_ROOT="/backup"
BACKUP_HOST="backup01.example.com"
BACKUP_USER="backup"
RETENTION_DAYS=30
LOG_FILE="/var/log/backup.log"
ALERT_EMAIL="admin@example.com"

# Logging
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Error handler
error_exit() {
    log "ERROR: $1"
    echo "Backup failed: $1" | mail -s "BACKUP FAILED" "$ALERT_EMAIL"
    exit 1
}

# Pre-backup checks
pre_checks() {
    log "Running pre-backup checks..."

    # Check disk space
    SPACE=$(df "$BACKUP_ROOT" | awk 'NR==2{print $4}')
    if [ "$SPACE" -lt 10485760 ]; then   # 10 GB in KB
        error_exit "Insufficient backup space"
    fi

    # Check connectivity to the backup host
    ping -c 1 "$BACKUP_HOST" >/dev/null || error_exit "Backup host unreachable"

    # Check backup directory permissions
    touch "$BACKUP_ROOT/test.$$" || error_exit "Cannot write to backup directory"
    rm "$BACKUP_ROOT/test.$$"
}

# Database backup
backup_databases() {
    log "Starting database backup..."

    # MySQL: all databases
    if command -v mysql >/dev/null; then
        mysqldump --all-databases \
                  --single-transaction \
                  --routines \
                  --triggers \
                  --events 2>/dev/null | gzip > \
            "$BACKUP_ROOT/mysql-full-$(date +%Y%m%d-%H%M%S).sql.gz"
    fi

    # PostgreSQL
    if command -v psql >/dev/null; then
        sudo -u postgres pg_dumpall | gzip > \
            "$BACKUP_ROOT/postgresql-full-$(date +%Y%m%d-%H%M%S).sql.gz"
    fi
}

# File system backup
backup_files() {
    log "Starting file system backup..."

    # Important directories
    rsync -avz --delete \
        --exclude='.cache' \
        --exclude='tmp' \
        --exclude='*.log' \
        /etc/ /home/ /var/www/ /opt/ \
        "$BACKUP_HOST:$BACKUP_ROOT/$(hostname)/files/$(date +%Y%m%d)/" \
        || error_exit "File backup failed"
}

# Application-specific backup
backup_applications() {
    log "Starting application backup..."

    # Back up Docker volumes
    if command -v docker >/dev/null; then
        docker run --rm \
            -v /var/lib/docker/volumes:/source \
            -v "$BACKUP_ROOT/docker":/target \
            alpine tar czf "/target/volumes-$(date +%Y%m%d).tar.gz" -C /source .
    fi

    # Back up Kubernetes resources
    if command -v kubectl >/dev/null; then
        kubectl get all --all-namespaces -o yaml > \
            "$BACKUP_ROOT/kubernetes-$(date +%Y%m%d).yaml"
    fi
}

# Retention policy
apply_retention() {
    log "Applying retention policy..."

    # Delete backups older than RETENTION_DAYS
    find "$BACKUP_ROOT" -type f -name "*.gz"   -mtime +$RETENTION_DAYS -delete
    find "$BACKUP_ROOT" -type f -name "*.dump" -mtime +$RETENTION_DAYS -delete
    find "$BACKUP_ROOT" -type d -empty -delete
}

# Verification
verify_backups() {
    log "Verifying backups..."
    VERIFY_LOG="$BACKUP_ROOT/verify-$(date +%Y%m%d).log"

    # Test compressed database backups
    for backup in "$BACKUP_ROOT"/*.sql.gz; do
        if [ -f "$backup" ]; then
            gunzip -t "$backup" && echo "✓ $backup" >> "$VERIFY_LOG" || \
                echo "✗ $backup (corrupt)" >> "$VERIFY_LOG"
        fi
    done
}

# Main execution
main() {
    log "=== Starting Backup Process ==="
    pre_checks
    backup_databases
    backup_files
    backup_applications
    apply_retention
    verify_backups
    log "=== Backup Completed Successfully ==="
}

# Run if executed directly
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
    main
fi
```
Disaster Recovery Drill
```bash
#!/bin/bash
# disaster-recovery-drill.sh
# Practice restoring from backups

SCENARIO=$1

case $SCENARIO in
    "database")
        echo "Simulating: Database corruption"
        echo "Step 1: Stop application"
        echo "Step 2: Stop database service"
        echo "Step 3: Restore from latest backup"
        echo "Step 4: Verify data integrity"
        echo "Step 5: Start services"
        echo "RTO: 15 minutes"
        echo "RPO: 5 minutes (point-in-time recovery)"
        ;;
    "server")
        echo "Simulating: Complete server loss"
        echo "Step 1: Provision new instance"
        echo "Step 2: Apply base configuration"
        echo "Step 3: Restore data from backup"
        echo "Step 4: Verify application functionality"
        echo "Step 5: Switch DNS/Load Balancer"
        echo "RTO: 1 hour"
        echo "RPO: 24 hours"
        ;;
    "file")
        echo "Simulating: Accidental file deletion"
        echo "Step 1: Locate file in backup"
        echo "Step 2: Restore single file"
        echo "Step 3: Verify permissions"
        echo "Step 4: Notify user"
        echo "RTO: 5 minutes"
        echo "RPO: Real-time"
        ;;
esac

# Actual restore test
restore_test() {
    echo "Performing quarterly restore test..."

    # Create an isolated test environment (a MySQL container, so the
    # restore target actually has a running database)
    docker run -d --name restore-test -e MYSQL_ALLOW_EMPTY_PASSWORD=yes mysql:8
    sleep 30   # give the server time to initialize

    # Restore the backup into the test container
    gunzip -c /backup/latest.sql.gz | docker exec -i restore-test mysql

    # Run integrity checks
    docker exec restore-test mysqlcheck --all-databases

    # Clean up
    docker rm -f restore-test
}
```
⚠️ Kernel Panics & Recovery: When the Kernel Dies
Understanding Kernel Panics
A kernel panic is the Linux equivalent of the blue screen of death: the kernel's last resort when it encounters a fatal error it cannot recover from.
Common Kernel Panic Causes
Hardware failure (bad RAM, overheating, disk failure)
Corrupted kernel image (bad update, disk corruption)
Incompatible kernel module (faulty driver)
Filesystem corruption (superblock damage)
Out of memory (OOM killer triggered too late)
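The OOM case in particular leaves a distinctive trail in the kernel log. Here is a hedged sketch of extracting OOM-killer victims from that output; it is fed a sample string here, and in practice you would pipe in `journalctl -k` or `dmesg` instead (the exact message prefix varies slightly between kernel versions, but the `Killed process <pid> (<name>)` portion is stable):

```shell
#!/bin/bash
# oom-victims.sh - list processes the OOM killer terminated

oom_victims() {
    # Kernel OOM lines look like:
    #   Out of memory: Killed process 1234 (mysqld) total-vm:...
    grep -oE 'Killed process [0-9]+ \([^)]+\)' | \
        sed -E 's/Killed process ([0-9]+) \(([^)]+)\)/pid=\1 name=\2/'
}

# Example with sample log text; replace with: journalctl -k | oom_victims
sample='Jan 10 03:12:44 host kernel: Out of memory: Killed process 1234 (mysqld) total-vm:8123456kB
Jan 10 03:12:45 host kernel: oom_reaper: reaped process 1234 (mysqld)'
echo "$sample" | oom_victims
# → pid=1234 name=mysqld
```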
Capturing Crash Information
```bash
# 1. Check for crash dumps
ls -la /var/crash/

# 2. Install kdump (kernel crash dumping)
sudo apt install linux-crashdump   # Ubuntu/Debian
sudo yum install kexec-tools       # Red Hat/CentOS

# 3. Configure crash kernel parameters
sudo nano /etc/default/grub
# Add to GRUB_CMDLINE_LINUX:
#   crashkernel=256M
sudo update-grub

# 4. Trigger a test crash (WARNING: the system will reboot!)
echo c > /proc/sysrq-trigger
# System will crash, save a dump, and reboot

# 5. Analyze the crash dump (requires kernel debug symbols)
sudo crash /usr/lib/debug/boot/vmlinux-$(uname -r) /var/crash/202402101230/dump
```
Kernel Panic Recovery Procedure
```bash
#!/bin/bash
# kernel-panic-recovery.sh

echo "=== Kernel Panic Recovery ==="
echo

# 1. Try to capture the panic message
echo "1. Capturing panic message..."
dmesg -T | tail -50 > /root/panic-$(date +%Y%m%d-%H%M%S).log

# 2. Force reboot (if the system is frozen)
echo "2. Attempting reboot..."
echo b > /proc/sysrq-trigger
# Or press: Alt + SysRq + R, E, I, S, U, B
# ("Raising Elephants Is So Utterly Boring")

# 3. After reboot, select an older kernel at the GRUB menu:
#    hold Shift during boot, select "Advanced options for Ubuntu",
#    then choose the previous kernel version

# 4. Once booted, check logs
echo "3. Analyzing panic cause..."
journalctl -k --since "1 hour ago" | grep -i "panic\|oops\|error"

# 5. Check hardware
echo "4. Running hardware diagnostics..."
# Memory test
sudo memtester 1G 1
# Disk health
sudo smartctl -a /dev/sda | grep -i "reallocated\|pending\|uncorrectable"
# CPU errors
sudo mcelog --ignorenodev

# 6. Fix the issue (set PANIC_CAUSE based on the analysis above)
echo "5. Applying fixes..."
case $PANIC_CAUSE in
    "kernel_module")
        # Blacklist the problematic module
        echo "blacklist badmodule" >> /etc/modprobe.d/blacklist.conf
        update-initramfs -u
        ;;
    "filesystem")
        # Check and repair the filesystem (from recovery, not while mounted)
        fsck -fy /dev/sda1
        ;;
    "memory")
        # Reserve the bad memory area
        # (add this to GRUB_CMDLINE_LINUX in /etc/default/grub)
        echo 'memmap=2M$8M' >> /etc/default/grub
        update-grub
        ;;
esac
```
Preventing Kernel Panics
```bash
# 1. Use kernel live patching for critical security updates
# Canonical Livepatch (Ubuntu)
sudo snap install canonical-livepatch
sudo canonical-livepatch enable YOUR_TOKEN

# 2. Monitor kernel logs proactively
#!/bin/bash
# kernel-monitor.sh
journalctl -k -f | while read line; do
    if echo "$line" | grep -qi "error\|fail\|warning\|oops"; then
        echo "Kernel issue detected: $line"
        # Send alert
        echo "Kernel warning on $(hostname): $line" | \
            mail -s "Kernel Alert" admin@example.com
    fi
done

# 3. Set panic timeout (auto-reboot 10 seconds after a panic)
echo "10" > /proc/sys/kernel/panic
# Make permanent
echo "kernel.panic=10" >> /etc/sysctl.conf

# 4. Enable kernel crash dumps
sudo systemctl enable kdump-tools
sudo systemctl start kdump-tools

# 5. Test kernel updates in staging
# Always have a rollback plan!
```
✅ Best Practices for DevOps Environments
1. Infrastructure as Code (IaC)
Don't: "SSH into production and fix things manually"
Do: "All infrastructure defined in code, reviewed, versioned, and tested"
```yaml
# infrastructure.yml (Ansible)
---
- name: Configure web servers
  hosts: web
  vars:
    nginx_port: 8080
    max_connections: 1000
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present
    - name: Configure nginx
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: restart nginx
  handlers:
    - name: restart nginx
      service:
        name: nginx
        state: restarted
```
2. Immutable Infrastructure
Concept: Once deployed, servers are never modified. Updates = replace.
```bash
#!/bin/bash
# immutable-deployment.sh

# Old way (mutable):
ssh prod-server "apt update && apt upgrade nginx"

# New way (immutable):

# 1. Build a new AMI/container image with Packer
packer build -var 'version=2.0.1' web-server.json

# 2. Deploy a new instance
aws ec2 run-instances --image-id ami-12345 --instance-type t3.micro

# 3. Shift traffic
aws elbv2 register-targets --target-group-arn arn:aws:elasticloadbalancing:... \
    --targets Id=i-1234567890abcdef0

# 4. Verify health
curl -f https://new-instance/health

# 5. Deregister the old instance
aws elbv2 deregister-targets --target-group-arn arn:aws:elasticloadbalancing:... \
    --targets Id=i-0987654321fedcba0

# 6. Terminate the old instance
aws ec2 terminate-instances --instance-ids i-0987654321fedcba0
```
3. Observability: Beyond Monitoring
```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']   # Node exporter
  - job_name: 'app'
    metrics_path: /metrics
    static_configs:
      - targets: ['app1:8080', 'app2:8080']

# Visualized via Grafana dashboards.
# Three pillars of observability:
# 1. METRICS - numerical measurements (Prometheus)
# 2. LOGS    - event records (ELK/Loki)
# 3. TRACES  - request flow (Jaeger)
```
4. Security by Default
```bash
#!/bin/bash
# security-baseline.sh
# Applied to every new server

# 1. Minimal installation
apt-get install --no-install-recommends ...

# 2. Remove unnecessary packages
apt-get remove telnet rsh-server yp-tools tftpd

# 3. SSH hardening
cat >> /etc/ssh/sshd_config << 'EOF'
Protocol 2
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 2
AllowUsers deploy
EOF

# 4. Kernel hardening
cat >> /etc/sysctl.conf << 'EOF'
# IP spoofing protection
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
# Ignore ICMP redirects
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
# Disable source packet routing
net.ipv4.conf.all.accept_source_route = 0
net.ipv6.conf.all.accept_source_route = 0
# Log martian packets
net.ipv4.conf.all.log_martians = 1
EOF

# 5. File permissions
chmod 750 /home/*
chmod 600 /etc/shadow
chmod 644 /etc/passwd
chmod 644 /etc/group

# 6. Automatic security updates
apt-get install unattended-upgrades
dpkg-reconfigure -plow unattended-upgrades
```
5. Deployment Best Practices
```yaml
# .gitlab-ci.yml
stages:
  - test
  - build
  - staging
  - production

variables:
  APP_VERSION: $CI_COMMIT_SHORT_SHA

test:
  stage: test
  script:
    - npm install
    - npm test
    - npm run lint

build:
  stage: build
  script:
    - docker build -t myapp:$APP_VERSION .
    - docker tag myapp:$APP_VERSION myregistry.com/myapp:$APP_VERSION
    - docker push myregistry.com/myapp:$APP_VERSION

staging:
  stage: staging
  script:
    - kubectl set image deployment/myapp myapp=myregistry.com/myapp:$APP_VERSION
    - kubectl rollout status deployment/myapp
    - curl -f https://staging.example.com/health

production:
  stage: production
  script:
    - kubectl set image deployment/myapp myapp=myregistry.com/myapp:$APP_VERSION
    - kubectl rollout status deployment/myapp
    - curl -f https://example.com/health
  when: manual
  environment:
    name: production
```
6. Monitoring & Alerting Best Practices
```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'pagerduty'

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

# Alert design principles:
# 1. Alert on symptoms, not causes
#    BAD:  "High CPU"
#    GOOD: "5xx errors > 1% for 5 minutes"
# 2. Every alert must have a runbook
# 3. Test alerts regularly
# 4. Avoid alert fatigue
```
7. Post-Mortem Culture
```markdown
# Post-Mortem Template

## Incident Summary
- Date: YYYY-MM-DD
- Duration: 2 hours 15 minutes
- Impact: 5% of users unable to checkout

## Timeline
- 14:23 - PagerDuty alert: Checkout API 5xx errors
- 14:25 - On-call engineer acknowledges
- 14:30 - Identified database connection pool exhaustion
- 14:45 - Scaled up application servers
- 15:10 - Errors continue, root cause not resolved
- 15:30 - Identified connection leak in payment service
- 15:45 - Rolled back payment service deployment
- 16:38 - All systems recovered

## Root Cause
Connection leak in payment service version 2.1.0 when handling timeout
errors from the third-party gateway.

## Action Items
- [ ] Add connection pool metrics dashboard
- [ ] Implement circuit breaker for payment gateway
- [ ] Add integration test for timeout scenarios
- [ ] Review rollback procedure documentation

## Blameless Culture Statement
No individual is at fault. We improve our systems together.
```
8. Capacity Planning
```bash
#!/bin/bash
# capacity-planning.sh

echo "=== Capacity Planning Report ==="
echo "Period: $(date -d '30 days ago' +%Y-%m-%d) to $(date +%Y-%m-%d)"
echo

# 1. Trend analysis
echo "1. Growth Trends:"
echo "   - CPU:    $(sar | awk '/Average:/ {print $3}')% average"
echo "   - Memory: $(sar -r | awk '/Average:/ {print $4}')% average"
echo "   - Disk:   $(df -h / | awk 'NR==2{print $5}')"

# 2. Peak analysis
echo "2. Peak Usage (last 30 days):"
echo "   - Peak CPU:     $(sar | sort -k4 -rn | head -1 | awk '{print $3}')%"
echo "   - Peak Memory:  $(sar -r | sort -k5 -rn | head -1 | awk '{print $4}')%"
echo "   - Peak Network: $(sar -n DEV | grep eth0 | sort -k6 -rn | head -1 | awk '{print $5}') MB/s"

# 3. Projection (next 90 days)
echo "3. 90-Day Projection:"
CURRENT_DISK=$(df / | awk 'NR==2{print $3}')          # KB used
DAILY_GROWTH=$(find / -type f -newermt "-7 days" -printf "%s\n" 2>/dev/null | \
    awk '{sum+=$1} END{print sum/7/1024/1024}')       # MB/day
PROJECTED_90=$(echo "$CURRENT_DISK + ($DAILY_GROWTH * 90 * 1024)" | bc)   # KB
echo "   - Disk usage in 90 days: $(echo "$PROJECTED_90/1024/1024" | bc) GB"
```
9. Documentation as Code
````markdown
# README.md
# Infrastructure Documentation

## Architecture
- 3 web servers behind a load balancer
- 2 PostgreSQL in active-passive replication
- Redis cache cluster
- Object storage for assets

## Deployment
```bash
# Deploy web service
make deploy-web

# Rollback
make rollback-web
```

## Common Operations

### Scaling
```bash
kubectl scale deployment web --replicas=5
```

### Backup Verification
```bash
./scripts/verify-backup.sh
```

## Troubleshooting
| Symptom      | Likely Cause               | Resolution                                   |
|--------------|----------------------------|----------------------------------------------|
| High latency | Connection pool exhaustion | `kubectl scale deployment api --replicas=10` |
| Disk full    | Logs                       | `./scripts/cleanup-logs.sh`                  |

## Runbooks

### Incident: Database Connection Failure
1. Check database status: `systemctl status postgresql`
2. Check replication lag: `./scripts/check-replication.sh`
3. Failover to standby: `./scripts/failover-db.sh`

## Contact
- On-call: #devops-oncall
- Escalation: devops-lead@company.com
````
10. Chaos Engineering
```python
# chaos-monkey.py
# Deliberately inject failures to test resilience
import random
import subprocess
import time

TARGETS = ['web-01', 'web-02', 'db-01', 'cache-01']

def inject_failure(target):
    experiments = [
        f"docker stop {target}",
        f"docker kill -s SIGTERM {target}",
        f"docker pause {target}",
        f"docker network disconnect bridge {target}"
    ]
    experiment = random.choice(experiments)
    print(f"Injecting: {experiment}")
    subprocess.run(experiment, shell=True)

    # Wait and check auto-recovery
    time.sleep(30)
    # Note: {{ }} in an f-string emits literal { }, so four braces are
    # needed to produce docker's {{.State.Status}} template
    status = subprocess.run(
        f"docker inspect -f '{{{{.State.Status}}}}' {target}",
        shell=True, capture_output=True, text=True)
    print(f"Status: {status.stdout.strip()}")

if __name__ == "__main__":
    print("Starting Chaos Experiment...")
    while True:
        target = random.choice(TARGETS)
        inject_failure(target)
        time.sleep(random.randint(60, 300))
```

📋 Quick Reference Troubleshooting Matrix
| Symptom | First Command | Deep Dive | Most Common Fix |
|---|---|---|---|
| Won't boot | journalctl -xb | fsck /dev/sda1 | Repair filesystem |
| High CPU | top -o %CPU | ps aux --sort=-%cpu | Restart process |
| Memory leak | free -h | ps aux --sort=-%mem | Update app, add swap |
| Disk full | df -h | du -sh /* \| sort -rh | Clean logs, resize |
| Network down | ip a | ping gateway, traceroute | Restart network |
| DNS issues | cat /etc/resolv.conf | dig google.com | Add 8.8.8.8 |
| Service down | systemctl status | journalctl -u service | systemctl restart |
| Kernel panic | dmesg \| tail | journalctl -k -b -1 | Boot older kernel |
| Database slow | mysqladmin status | SHOW PROCESSLIST | Add index, optimize |
| Container fails | docker logs | docker inspect | Check resources |
🚀 Practice Exercises
Exercise 1: Simulate and Fix Boot Issue
```bash
# 1. Break GRUB (in a VM!)
sudo dd if=/dev/zero of=/dev/sda bs=446 count=1

# 2. Boot from live CD/USB

# 3. Mount root
sudo mount /dev/sda1 /mnt
sudo mount --bind /dev /mnt/dev
sudo mount --bind /proc /mnt/proc
sudo mount --bind /sys /mnt/sys

# 4. Chroot and fix
sudo chroot /mnt /bin/bash
grub-install /dev/sda
update-grub
exit

# 5. Reboot
sudo reboot
```
Exercise 2: Create Recovery USB
```bash
#!/bin/bash
# create-recovery-usb.sh

USB_DEVICE="/dev/sdb"
echo "Creating recovery USB on $USB_DEVICE"

# Download Ubuntu minimal
wget http://archive.ubuntu.com/ubuntu/dists/focal/main/installer-amd64/current/legacy-images/netboot/mini.iso

# Write to USB
sudo dd if=mini.iso of=$USB_DEVICE bs=4M status=progress

# Add custom scripts
sudo mkdir -p /mnt/recovery
sudo mount ${USB_DEVICE}1 /mnt/recovery
sudo cp recovery-scripts/* /mnt/recovery/
sudo umount /mnt/recovery

echo "Recovery USB created!"
```
Exercise 3: Chaos Monkey Test
```bash
#!/bin/bash
# chaos-test.sh

echo "Starting Chaos Test - Web Service"

# Ensure the service is running
kubectl scale deployment web --replicas=3
kubectl rollout status deployment/web

# Chaos test 1: kill pods randomly
for i in {1..10}; do
    POD=$(kubectl get pods -l app=web -o name | shuf -n1)
    echo "Killing $POD"
    kubectl delete "$POD"
    sleep 5
    kubectl get pods -l app=web
done

# Chaos test 2: network latency
POD=$(kubectl get pods -l app=web -o name | head -1)
kubectl exec "$POD" -- tc qdisc add dev eth0 root netem delay 100ms
echo "Added 100ms latency"
sleep 30
kubectl exec "$POD" -- tc qdisc del dev eth0 root netem

# Check recovery
curl -f http://web-service/health || echo "Service unhealthy!"
```
🔗 Master Troubleshooting with Hands-on Labs
The best way to learn troubleshooting is through real failures in safe environments. Practice these scenarios until they become second nature.
👉 Practice troubleshooting and incident response in our interactive labs at:
https://devops.trainwithsky.com/
Our platform provides:
Real broken systems to fix
Production incident simulations
Time-pressured troubleshooting challenges
Guided post-mortems
Community incident reviews
Frequently Asked Questions
Q: How do I stay calm during production incidents?
A: Follow your runbook, don't rush, communicate clearly, and remember: no one died from a server being down.
Q: What should be in an emergency runbook?
A: Clear steps, contact list, service dependencies, rollback procedures, and where to find logs.
Q: How often should I test backups?
A: Quarterly full restore tests, weekly integrity checks, daily verification of backup completion.
Q: What's the most common cause of kernel panics?
A: Hardware issues (especially memory) and buggy kernel modules from third-party drivers.
Q: How do I prove root cause to management?
A: Timeline with evidence, correlated logs, and a reproduction of the issue in a test environment.
Q: When should I reboot a production server?
A: Only when you know exactly why, and you're prepared for it not coming back up.
Q: What's the best monitoring tool?
A: The one you actually use and maintain. Simple tools with proper dashboards beat complex ones nobody understands.
Have you faced a unique troubleshooting challenge? Share your war story in the comments below! 💬