Saturday, January 17, 2026

Linux Troubleshooting & DevOps Best Practices

Mastering System Diagnostics, Recovery, and Production Operations

⚠️ Emergency Notice

This guide contains critical troubleshooting information. In production emergencies, always follow your organization's incident response procedures and maintain clear communication with your team.

1. Common Boot Issues & Recovery

Linux Boot Process Flow

1. BIOS/UEFI — hardware initialization. On failure: check hardware and boot order.
2. Bootloader (GRUB) — kernel loading. On failure: GRUB rescue, kernel parameters.
3. Kernel initialization — hardware detection. On failure: driver issues, kernel panic.
4. Init system (systemd) — service startup. On failure: service dependencies, targets.

GRUB Boot Issues

1. GRUB Not Loading (CRITICAL)
# Boot from a live USB and repair GRUB:
$ sudo mount /dev/sdX1 /mnt         # Replace with your root partition
$ sudo mount --bind /dev /mnt/dev
$ sudo mount --bind /proc /mnt/proc
$ sudo mount --bind /sys /mnt/sys
$ sudo chroot /mnt

# Reinstall GRUB:
$ grub-install /dev/sdX             # Replace with your disk
$ update-grub
$ exit
$ sudo umount -R /mnt
$ sudo reboot
2. GRUB Rescue Mode (HIGH)
# At the GRUB rescue prompt:
grub rescue> set prefix=(hd0,gpt1)/boot/grub
grub rescue> set root=(hd0,gpt1)
grub rescue> insmod normal
grub rescue> normal

# If the above works, repair permanently from Linux:
$ sudo grub-install --recheck /dev/sda
$ sudo update-grub

# Alternative: use boot-repair from a live USB:
$ sudo add-apt-repository ppa:yannubuntu/boot-repair
$ sudo apt update
$ sudo apt install boot-repair
$ boot-repair
3. Kernel Panic at Boot (CRITICAL)
# In GRUB, edit the kernel parameters: press 'e' on the boot entry, then add:
linux /boot/vmlinuz-xxx root=/dev/sdXY ro single
# Or, for systemd:
linux /boot/vmlinuz-xxx root=/dev/sdXY systemd.unit=rescue.target

# Once booted to single-user/rescue mode:
$ journalctl -xb            # Check boot logs
$ systemctl --failed        # Check failed services
$ dmesg | grep -i error     # Check kernel errors

# Fix common issues:
$ fsck /dev/sdXY            # Check filesystem
$ mount -o remount,rw /     # Remount root read-write

Filesystem & Mount Issues

Filesystem Corruption

# Check filesystem integrity:
$ sudo fsck /dev/sdX1       # Interactive check
$ sudo fsck -y /dev/sdX1    # Auto-repair
$ sudo fsck -f /dev/sdX1    # Force check even if clean

# Filesystem-specific tools:
$ sudo xfs_repair /dev/sdX1             # XFS
$ sudo e2fsck -f /dev/sdX1              # ext2/3/4
$ sudo btrfs check --repair /dev/sdX1   # Btrfs (use with caution!)

# Check the disk for bad sectors:
$ sudo badblocks -v /dev/sdX > bad-sectors.txt
$ sudo smartctl -a /dev/sdX         # SMART data
$ sudo smartctl -t short /dev/sdX   # Short self-test
$ sudo smartctl -t long /dev/sdX    # Long self-test

Mount Issues & fstab Problems

# Check /etc/fstab for errors:
$ sudo findmnt --verify     # Verify fstab entries
$ sudo systemctl daemon-reload
$ sudo mount -a             # Test mounting all fstab entries

# Emergency fstab repair:
$ sudo blkid                # Get UUIDs
$ sudo nano /etc/fstab      # Edit fstab
# Example fstab entries:
# UUID=xxxx-xxxx /      ext4  defaults,noatime,errors=remount-ro  0 1
# UUID=xxxx-xxxx /home  ext4  defaults,noatime                    0 2
# UUID=xxxx-xxxx none   swap  sw                                  0 0

# Mount options for recovery:
$ sudo mount -o ro /dev/sdX1 /mnt   # Read-only mount
$ sudo mount -o remount,rw /        # Remount root read-write
$ sudo mount -t proc proc /proc
$ sudo mount -t sysfs sys /sys
$ sudo mount -t devtmpfs dev /dev

⚠️ Critical Boot Parameters for Recovery

# Add these to the kernel command line in GRUB (press 'e'):
single                              # Single-user mode (SysV init); or: 1
systemd.unit=rescue.target          # Rescue mode (systemd)
systemd.unit=emergency.target       # Emergency mode (minimal)
systemd.mask=service_name.service   # Skip a specific service
debug                               # Debug early boot; or: earlyprintk=vga,keep
nomodeset                           # Disable graphical mode (graphics driver issues)
fsck.mode=skip                      # Skip filesystem checks
ro                                  # Read-only root mount

# Example combined parameters:
linux /boot/vmlinuz-xxx root=/dev/sdXY ro systemd.unit=rescue.target nomodeset

2. Log Analysis Techniques

Log Analysis Philosophy: "Logs don't lie. Learn to ask them the right questions." Start with the most recent logs and work backwards. Correlate timestamps across different log sources.
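One practical way to ask logs the right questions is to merge events from different sources onto a single timeline. A minimal sketch, using hypothetical sample files in place of real logs (ISO-8601 timestamps sort correctly as plain text):

```shell
# Hypothetical stand-ins for a kernel log and an application log:
cat > /tmp/syslog.sample <<'EOF'
2024-01-15T10:30:01 kernel: eth0 link down
2024-01-15T10:30:05 kernel: eth0 link up
EOF
cat > /tmp/app.sample <<'EOF'
2024-01-15T10:30:02 app: upstream timeout
2024-01-15T10:30:03 app: retrying request
EOF

# Merge both sources into one timeline, sorted by timestamp:
sort -k1,1 /tmp/syslog.sample /tmp/app.sample
```

Here the merged view makes the causal chain obvious: the link drop precedes the application timeouts by one second.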

System Log Locations

Systemd Journal

$ journalctl -xe                  # Detailed log with pager
$ journalctl -f                   # Follow logs (like tail -f)
$ journalctl -u nginx.service     # Service-specific logs
$ journalctl --since "2024-01-01 00:00:00"
$ journalctl --until "1 hour ago"
$ journalctl -p err               # Show only errors
$ journalctl -k                   # Kernel logs only
$ journalctl --list-boots         # List boot sessions
$ journalctl -b -1                # Previous boot logs

Traditional Log Files

/var/log/syslog        # General system logs
/var/log/auth.log      # Authentication logs
/var/log/kern.log      # Kernel messages
/var/log/dmesg         # Kernel ring buffer
/var/log/boot.log      # Boot process logs
/var/log/nginx/        # Nginx logs
/var/log/apache2/      # Apache logs
/var/log/mysql/        # MySQL logs
/var/log/postgresql/   # PostgreSQL logs
/var/log/apt/          # Package management
/var/log/dpkg.log      # dpkg operations
/var/log/faillog       # Failed login attempts
/var/log/lastlog       # Last login times

# Application-specific:
/var/log/docker/*      # Docker logs
/var/log/containers/*  # Container logs
/var/log/cloud-init*   # Cloud initialization

Advanced Log Analysis Commands

Real-time Monitoring

# Monitor multiple logs simultaneously:
$ multitail /var/log/syslog /var/log/auth.log /var/log/kern.log

# Follow logs with highlighting:
$ tail -f /var/log/syslog | grep --color -E "(error|fail|critical|panic)"

# Watch log file changes:
$ watch -n 2 'tail -20 /var/log/syslog'

# Monitor the systemd journal in real time:
$ journalctl -f | grep -E "(error|failed|panic|oom)"

# Monitor authentication attempts:
$ tail -f /var/log/auth.log | grep -E "(Failed|Accepted|invalid)"

# Use lnav for advanced log navigation:
$ sudo apt install lnav
$ lnav /var/log/syslog /var/log/auth.log

Log Filtering & Analysis

# Find errors in the last hour:
$ journalctl --since "1 hour ago" -p err..emerg

# Search for specific patterns:
$ grep -i "error\|fail\|panic\|critical" /var/log/syslog
$ grep -B5 -A5 "error" /var/log/syslog    # Context around errors

# Count error occurrences:
$ grep -c "error" /var/log/syslog
$ journalctl -u nginx --since yesterday | grep -c "500"

# Extract unique errors:
$ grep "error" /var/log/syslog | sort | uniq -c | sort -rn

# Parse JSON logs (common with containers):
$ cat /var/log/containers/*.log | jq '. | select(.level=="error")'

# Time-based analysis:
$ awk '/error/ {print $1, $2, $3}' /var/log/syslog | sort | uniq -c
$ journalctl --since "00:00" --until "06:00" -p err

Common Log Patterns & Meanings

Critical Errors

kernel: Out of memory: Kill process    # OOM killer activated
kernel: BUG: unable to handle kernel   # Kernel bug
kernel: CPU#0: Possible SYN flooding   # SYN flood attack
systemd: Failed to start Service       # Service failure
sshd: Failed password for root         # Brute-force attempt
smartd: Device: failed                 # Disk failure predicted

Common Errors

kernel: ata1.00: exception Emask 0x0                # Disk I/O error
nginx: connect() failed (111: Connection refused)
mysql: Too many connections
systemd: Dependency failed for ...
audit: SYSCALL arch=c000003e success=no exit=-13    # Permission denied
kernel: nouveau 0000:01:00.0: gr: TRAP_READ_ERROR   # GPU error

Warnings to Monitor

# Disk space warnings:
systemd: /var/log full
kernel: EXT4-fs (sda1): warning: mounting fs with errors

# Network warnings:
kernel: eth0: Too much work at interrupt
kernel: TCP: time wait bucket table overflow

# Memory warnings:
kernel: kswapd0: page allocation failure
kernel: swapper: page allocation failure

# Service warnings:
systemd: Service holdoff time over, scheduling restart
cron: (root) MAIL (mailed 8 bytes of output but got status 0x004b)

Automated Log Analysis Script

#!/bin/bash
# log_analyzer.sh - Automated log analysis and reporting

LOG_DIR="/var/log"
REPORT_FILE="/tmp/log_analysis_$(date +%Y%m%d_%H%M%S).txt"
ALERT_THRESHOLD=10   # Number of errors that triggers an alert

analyze_system_logs() {
    echo "=== SYSTEM LOG ANALYSIS REPORT ===" > "$REPORT_FILE"
    echo "Generated: $(date)" >> "$REPORT_FILE"
    echo >> "$REPORT_FILE"

    # 1. Critical errors in the journal
    echo "1. CRITICAL ERRORS (last 24h):" >> "$REPORT_FILE"
    journalctl --since "24 hours ago" -p crit >> "$REPORT_FILE"
    echo >> "$REPORT_FILE"

    # 2. Failed services
    echo "2. FAILED SERVICES:" >> "$REPORT_FILE"
    systemctl --failed >> "$REPORT_FILE"
    echo >> "$REPORT_FILE"

    # 3. Authentication failures
    echo "3. AUTHENTICATION FAILURES (most recent):" >> "$REPORT_FILE"
    grep "Failed password" "$LOG_DIR"/auth.log 2>/dev/null | \
        tail -20 >> "$REPORT_FILE"
    echo >> "$REPORT_FILE"

    # 4. Disk space analysis
    echo "4. DISK USAGE:" >> "$REPORT_FILE"
    df -h >> "$REPORT_FILE"
    echo >> "$REPORT_FILE"

    # 5. Memory usage
    echo "5. MEMORY USAGE:" >> "$REPORT_FILE"
    free -h >> "$REPORT_FILE"
    echo >> "$REPORT_FILE"

    # 6. Top error patterns
    echo "6. TOP ERROR PATTERNS:" >> "$REPORT_FILE"
    journalctl --since "1 hour ago" -p err | \
        grep -oE "[A-Za-z]+[A-Za-z ]*error" | \
        sort | uniq -c | sort -rn >> "$REPORT_FILE"
}

check_alerts() {
    error_count=$(journalctl --since "1 hour ago" -p err | wc -l)
    if [ "$error_count" -gt "$ALERT_THRESHOLD" ]; then
        echo "ALERT: High error count detected: $error_count errors in last hour"
        echo "Sending notification..."
        # Send alert (configure for your environment):
        if command -v mail >/dev/null 2>&1; then
            echo "High error count: $error_count" | \
                mail -s "System Alert: High Error Rate" admin@example.com
        fi
    fi
}

main() {
    analyze_system_logs
    check_alerts
    echo "Log analysis complete. Report saved to: $REPORT_FILE"
    echo "Preview of report:"
    head -50 "$REPORT_FILE"
}

main "$@"

3. Backup & Restore Strategies

3-2-1 Backup Rule

3 — Keep three copies of your data (the primary plus two backups).
2 — Store them on two different media (e.g. disk plus cloud or tape).
1 — Keep one backup offsite, in a geographically separate location.
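The rule can be sketched end to end with throwaway local paths standing in for real media. All paths below are hypothetical: in production, the second copy would live on a different medium and the third on an offsite or cloud target, typically written with rsync or borg rather than cp.

```shell
# Local demonstration of the 3-2-1 idea with stand-in paths.
SRC=/tmp/321-demo/src         # primary data
COPY2=/tmp/321-demo/disk2     # stands in for a second medium
COPY3=/tmp/321-demo/offsite   # stands in for the offsite copy
mkdir -p "$SRC" "$COPY2" "$COPY3"
echo "important data" > "$SRC/data.txt"

# Make the two backup copies:
cp -a "$SRC/." "$COPY2/"
cp -a "$SRC/." "$COPY3/"

# Verify all three copies match byte for byte:
cmp -s "$SRC/data.txt" "$COPY2/data.txt" &&
cmp -s "$SRC/data.txt" "$COPY3/data.txt" &&
echo "3 copies verified"
```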

Backup Strategies & Tools

💾 Full Backup — complete system copy:

$ sudo tar -czpf backup-full-$(date +%Y%m%d).tar.gz \
    --one-file-system \
    --exclude=/proc --exclude=/sys --exclude=/dev /

$ sudo dd if=/dev/sda of=/backup/full.img bs=4M   # Image the disk (boot from live media first)
🔄 Incremental Backup — changes since the last backup:

$ rsync -av --link-dest=/backup/last / /backup/current
$ borg create --stats backup::'{now}' /

Differential Backup — changes since the last full backup:

$ tar -czf diff-$(date +%Y%m%d).tar.gz \
    --newer=/backup/full-timestamp /
📸 Snapshot Backup — point-in-time copies:

$ sudo lvcreate -s -L 10G -n rootsnap /dev/vg0/root   # Origin LV (adjust VG/LV names), not the mountpoint
$ sudo btrfs subvolume snapshot / /.snapshots/root-$(date +%F)

Backup Implementation Examples

rsync - The Swiss Army Knife

# Basic rsync backup:
$ rsync -av --delete /source/ /backup/

# Incremental with hard links (like Time Machine):
$ rsync -av --link-dest=../latest \
    --delete \
    /source/ \
    /backup/backup-$(date +%Y%m%d_%H%M%S)
$ ln -snf "$(ls -d /backup/backup-* | tail -n1)" /backup/latest   # Point 'latest' at the newest snapshot

# Remote backup over SSH:
$ rsync -avz -e ssh \
    --progress \
    /source/ \
    user@remote:/backup/

# Exclude patterns:
$ rsync -av \
    --exclude='*.tmp' \
    --exclude='cache/*' \
    --exclude='.git/' \
    /source/ /backup/

# Bandwidth limiting (KB/s):
$ rsync -av --bwlimit=1000 /source/ /backup/

# Resume an interrupted transfer:
$ rsync -av --partial /source/ /backup/

BorgBackup - Deduplication

# Initialize a repository:
$ borg init --encryption=repokey /backup/repo

# Create a backup:
$ borg create \
    --stats \
    --progress \
    /backup/repo::'{hostname}-{now}' \
    /home /etc /var

# List backups:
$ borg list /backup/repo

# Mount a backup for browsing:
$ borg mount /backup/repo::archive-name /mnt

# Extract files:
$ borg extract /backup/repo::archive-name path/to/file

# Prune old backups (keep 7 daily, 4 weekly, 6 monthly):
$ borg prune \
    --keep-daily=7 \
    --keep-weekly=4 \
    --keep-monthly=6 \
    --stats \
    /backup/repo

# Remote backup over SSH:
$ borg init user@backup-server:/backup/repo
$ borg create user@backup-server:/backup/repo::backup-name /data

System Recovery Procedures

1. Full System Restore

# From a live USB, restore a full system backup:
$ sudo mkdir /mnt/root
$ sudo mount /dev/sdX1 /mnt/root   # Root partition
$ cd /mnt/root

# Restore from a tar backup:
$ sudo tar -xzpvf /backup/full-backup.tar.gz

# Or restore from a dd image (overwrites the whole disk):
$ sudo dd if=/backup/full-system.img of=/dev/sdX bs=4M

# Reinstall GRUB:
$ sudo mount --bind /dev /mnt/root/dev
$ sudo mount --bind /proc /mnt/root/proc
$ sudo mount --bind /sys /mnt/root/sys
$ sudo chroot /mnt/root
$ grub-install /dev/sdX
$ update-grub
$ exit
$ sudo umount -R /mnt/root
2. Partial File Recovery

# Restore specific files from backup.
# From a tar backup:
$ tar -tzf backup.tar.gz                 # List contents
$ tar -xzvf backup.tar.gz path/to/file   # Extract a specific file

# From an rsync backup:
$ rsync -av /backup/path/to/file /destination/

# From a Borg backup:
$ borg list backup::archive              # List files
$ borg extract backup::archive path/to/file

# Use find to locate backups containing a file:
$ find /backup -name "*backup*" -type f | \
    xargs -I {} tar -tzf {} | grep filename

# Recover deleted files from ext3/4:
$ sudo debugfs /dev/sdX1
debugfs> lsdel                           # List deleted inodes
debugfs> dump <inode> /recovered/file    # Dump an inode to a file
3. Database Recovery

# MySQL/MariaDB backup and restore:
$ mysqldump -u root -p --all-databases > full-backup.sql
$ mysql -u root -p < full-backup.sql

# PostgreSQL backup and restore:
$ pg_dumpall -U postgres > full-backup.sql
$ psql -U postgres -f full-backup.sql

# MongoDB backup and restore:
$ mongodump --out /backup/mongodb
$ mongorestore /backup/mongodb

# Redis backup (RDB file):
$ redis-cli save                          # Force a synchronous RDB dump first
$ cp /var/lib/redis/dump.rdb /backup/
# To restore, stop Redis, copy the file back, then restart:
$ sudo systemctl stop redis
$ sudo cp /backup/dump.rdb /var/lib/redis/
$ sudo systemctl start redis

# Point-in-time recovery:
# Use WAL archiving (PostgreSQL) or binary logs (MySQL).

Backup Validation Checklist

Regularly test backup restoration process
Verify backup integrity with checksums
Monitor backup job completion and alerts
Store backup encryption keys securely
Document recovery procedures
Test disaster recovery scenarios quarterly
Keep multiple backup generations
Verify backup media periodically
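The integrity item on this checklist can be automated by recording a checksum alongside each archive at backup time and verifying it before any restore. A minimal sketch (file names are illustrative):

```shell
# At backup time, record a checksum next to the archive:
mkdir -p /tmp/chk-demo && cd /tmp/chk-demo
echo "archive payload" > backup-20240115.tar.gz   # stand-in for a real archive
sha256sum backup-20240115.tar.gz > backup-20240115.tar.gz.sha256

# Later, before restoring, verify the archive has not been corrupted:
sha256sum -c backup-20240115.tar.gz.sha256
```

`sha256sum -c` exits non-zero on a mismatch, so the check drops cleanly into a cron job or backup script.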

4. Kernel Panics & Recovery

⚠️ Kernel Panic Immediate Actions

  1. Document the panic message (take photo if possible)
  2. Note any recent system changes (updates, hardware)
  3. Attempt safe reboot (if automatic reboot doesn't work)
  4. Boot into recovery/single-user mode
  5. Collect crash dumps and logs

Understanding Kernel Panic Messages

Sample Kernel Panic Analysis

Kernel panic - not syncing: Fatal exception
    # Generic fatal panic message

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    # NULL pointer dereference - buggy driver or kernel module

Oops: 0002 [#1] SMP
    # Oops message - non-fatal, but indicates a serious problem

general protection fault: 0000 [#1] SMP
    # Memory protection violation

kernel BUG at /path/to/file.c:123!
    # Specific kernel bug location

Out of memory: Killed process 1234 (process-name)
    # OOM killer activated - not a panic, but serious

# The stack trace that follows shows the calling functions:
Call Trace:
 [<address>] function_name+0x123/0x456
 [<address>] another_function+0x89/0xab

Kernel Debugging Tools

Crash Dump — kdump & crash

# Install kdump tools:
$ sudo apt install kdump-tools crash

# Configure kdump:
$ sudo systemctl enable kdump
$ sudo systemctl start kdump

# Test kdump (WARNING: this deliberately crashes the machine):
$ echo c | sudo tee /proc/sysrq-trigger

# Analyze the crash dump:
$ crash /var/crash/vmcore /usr/lib/debug/boot/vmlinux

# In the crash utility:
crash> bt        # Backtrace
crash> ps        # Process list
crash> kmem -i   # Kernel memory info
Live Debugging — Magic SysRq Keys

# Enable SysRq:
$ echo 1 | sudo tee /proc/sys/kernel/sysrq

# Common SysRq commands (Alt+SysRq+key):
# r - Switch keyboard to raw mode
# e - Send SIGTERM to all processes
# i - Send SIGKILL to all processes
# s - Sync all mounted filesystems
# u - Remount filesystems read-only
# b - Immediate reboot
# c - Crash dump (if kdump configured)
# m - Show memory info
# p - Show registers
# t - Show task list

# Emergency reboot sequence (hold Alt+SysRq, type slowly):
# r e i s u b
Kernel Tracing — ftrace & perf

# Use ftrace:
$ cd /sys/kernel/debug/tracing
$ echo function > current_tracer
$ echo 1 > tracing_on
$ cat trace

# Use perf for profiling:
$ sudo perf record -g -a sleep 10
$ sudo perf report

# Trace system calls:
$ sudo strace -p <pid>
$ sudo strace -f command

# Trace syscalls system-wide with perf:
$ sudo perf trace -a

Common Kernel Panic Causes & Solutions

Symptom | Likely Cause | Immediate Action | Long-term Fix
NULL pointer dereference | Buggy kernel module | Boot with the module blacklisted | Update or blacklist the module
Out of memory | Memory leak or insufficient RAM | Kill the process, increase swap | Fix the memory leak, add RAM
Filesystem corruption | Unclean shutdown, hardware fault | fsck the filesystem | Replace failing hardware
CPU soft lockup | Kernel bug, hardware issue | Reboot, check cooling | Update kernel, check hardware
Kernel stack overflow | Deep recursion, buggy driver | Reboot with an older kernel | Update driver, report bug
IRQ handler issues | Hardware conflict | Boot with noapic | Update BIOS, check hardware
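When the diagnosis points at a buggy module, blacklisting it survives reboots, unlike a one-off rmmod. A sketch of a modprobe.d drop-in (nouveau is only an example module name; substitute the culprit on your system):

```
# /etc/modprobe.d/blacklist-example.conf
blacklist nouveau
options nouveau modeset=0

# Rebuild the initramfs so the blacklist also applies at early boot:
#   Debian/Ubuntu: sudo update-initramfs -u
#   RHEL/Fedora:   sudo dracut --force
```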

Emergency Kernel Recovery Procedure

1. Immediate Response — document the panic message, note recent changes, attempt a safe reboot.
2. Boot to Recovery — use GRUB to boot an older kernel, or recovery mode with systemd.unit=rescue.target.
3. Collect Evidence — gather logs: journalctl -xb, dmesg, and the contents of /var/crash/.
4. Identify the Culprit — check recent updates, new hardware, and modified configurations.
5. Apply the Fix — remove the problematic kernel/modules, restore from backup, update the system.
6. Test & Monitor — reboot, monitor stability, implement preventive measures.

#!/bin/bash
# kernel_panic_recovery.sh - Automated recovery steps

collect_diagnostics() {
    echo "Collecting diagnostic information..."
    # Save kernel messages
    dmesg > /tmp/dmesg-$(date +%s).log
    # Save the journal from the previous boot
    journalctl -b -1 > /tmp/journal-previous-boot.log
    # Back up any crash dumps
    if [ -d /var/crash ]; then
        cp -r /var/crash /tmp/crash-backup-$(date +%s)
    fi
    # List loaded kernel modules
    lsmod > /tmp/lsmod-$(date +%s).log
}

recover_system() {
    echo "Attempting recovery..."
    # 1. Check the filesystem (replace sdX1 with your partition)
    sudo fsck -y /dev/sdX1
    # 2. Remove potentially problematic modules
    sudo rmmod nouveau   # Commonly problematic GPU driver
    sudo rmmod nvidia
    # 3. Boot with safe parameters next time (grubby: RHEL/Fedora)
    sudo grubby --update-kernel=ALL \
        --args="nomodeset noapic nolapic"
    # 4. Update the system
    sudo apt update && sudo apt upgrade -y
    echo "Recovery steps completed. Please reboot."
}

main() {
    collect_diagnostics
    recover_system
}

main "$@"

5. DevOps Best Practices for Linux Environments

DevOps Philosophy: Infrastructure as Code, Continuous Monitoring, Automated Recovery, and Blameless Postmortems. Focus on reproducibility, observability, and resilience.

Infrastructure as Code (IaC)

📁 Configuration Management

# Ansible playbook for server hardening:
- name: Harden Linux Server
  hosts: all
  become: yes
  tasks:
    - name: Update all packages
      apt:
        update_cache: yes
        upgrade: dist
        autoremove: yes

    - name: Configure SSH
      template:
        src: sshd_config.j2
        dest: /etc/ssh/sshd_config
      notify: restart ssh

    - name: Configure firewall
      ufw:
        rule: allow
        port: '{{ item }}'
        proto: tcp
      loop:
        - 22
        - 80
        - 443

    - name: Install monitoring agent
      apt:
        name: prometheus-node-exporter
        state: present

  handlers:
    - name: restart ssh
      service:
        name: ssh
        state: restarted

🐳 Container Orchestration

# Kubernetes Deployment with best practices:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:{{ .Values.imageTag }}
          ports:
            - containerPort: 8080
          # Resource limits
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 256Mi
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

🏗️ Infrastructure Provisioning

# Terraform for cloud infrastructure:
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  # Security groups
  vpc_security_group_ids = [aws_security_group.web.id]

  # Monitoring
  monitoring = true

  # Root volume encryption
  root_block_device {
    encrypted   = true
    volume_size = 20
  }

  # User data for bootstrap
  user_data = file("bootstrap.sh")

  tags = {
    Name        = "web-server"
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_security_group" "web" {
  name = "web-sg"

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Monitoring & Observability

Metrics Collection

# Prometheus node exporter metrics:
$ curl http://localhost:9100/metrics

# Key metrics to monitor:
# - node_load1, node_load5, node_load15
# - node_memory_MemTotal_bytes, node_memory_MemFree_bytes
# - node_filesystem_size_bytes, node_filesystem_free_bytes
# - node_network_receive_bytes_total
# - node_cpu_seconds_total

# Custom metrics via the pushgateway:
$ cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/myjob
# TYPE my_metric gauge
my_metric{instance="server1"} 42
EOF

# Alert rules (Prometheus):
groups:
  - name: instance
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

Logging & Tracing

# Structured logging with JSON:
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "api-gateway",
  "message": "Failed to connect to database",
  "error": "connection refused",
  "attempt": 3,
  "duration_ms": 1250,
  "trace_id": "abc123def456",
  "span_id": "def456",
  "user_id": "user-123",
  "request_id": "req-789"
}

# Distributed tracing with OpenTelemetry (Python):
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("http-request"):
    # Your code here
    pass

# Log aggregation with fluentd/fluent-bit:
<source>
  @type tail
  path /var/log/nginx/access.log
  pos_file /var/log/fluentd/nginx.pos
  tag nginx.access
  format json
</source>

<match nginx.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  logstash_format true
</match>

Security Hardening

Linux Server Hardening Checklist

Disable root SSH login (PermitRootLogin no)
Use SSH key authentication only
Configure firewall (ufw/iptables/firewalld)
Enable automatic security updates
Install and configure fail2ban
Set up auditd for security auditing
Configure apparmor/selinux profiles
Disable unnecessary services
Set proper file permissions
Enable disk encryption
Configure log rotation and monitoring
Regular vulnerability scanning
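The first two checklist items translate into a short sshd_config fragment. The values below are a common baseline, not a mandate; always validate the config before reloading so a typo cannot lock you out:

```
# /etc/ssh/sshd_config — baseline matching the checklist above
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
MaxAuthTries 3

# Validate, then reload:
#   sudo sshd -t && sudo systemctl reload ssh
```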

Automation & CI/CD

GitLab CI/CD Pipeline

# .gitlab-ci.yml
stages:
  - test
  - build
  - security-scan
  - deploy

variables:
  DOCKER_IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA

unit-test:
  stage: test
  image: node:16
  script:
    - npm ci
    - npm test
    - npm run test:coverage
  artifacts:
    reports:
      junit: junit.xml
      coverage_report:
        coverage_format: cobertura
        path: coverage/cobertura-coverage.xml

docker-build:
  stage: build
  image: docker:20.10
  services:
    - docker:20.10-dind
  script:
    - docker build -t $DOCKER_IMAGE .
    - docker push $DOCKER_IMAGE
  only:
    - main
    - merge_requests

security-scan:
  stage: security-scan
  image: aquasec/trivy:latest
  script:
    - trivy image --exit-code 1 --severity HIGH,CRITICAL $DOCKER_IMAGE
    - trivy fs --exit-code 1 --severity HIGH,CRITICAL .

deploy-production:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - kubectl set image deployment/myapp myapp=$DOCKER_IMAGE
    - kubectl rollout status deployment/myapp
  environment:
    name: production
    url: https://myapp.example.com
  only:
    - main
  when: manual   # Manual approval for production

Infrastructure Testing

# ServerSpec tests for infrastructure:
require 'serverspec'

describe package('nginx') do
  it { should be_installed }
end

describe service('nginx') do
  it { should be_enabled }
  it { should be_running }
end

describe port(80) do
  it { should be_listening }
end

describe file('/etc/nginx/nginx.conf') do
  it { should be_file }
  it { should be_owned_by 'root' }
  it { should be_mode 644 }
end

describe command('curl -s http://localhost/health') do
  its(:stdout) { should match /"status":"ok"/ }
  its(:exit_status) { should eq 0 }
end

# Terraform testing with terratest (Go):
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
)

func TestTerraform(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../",
    }
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)
    // Add assertions here
}

⚠️ Incident Response & Postmortem

1. Detection & Assessment — monitor alerts, assess impact, classify severity (SEV-1, SEV-2, etc.).
2. Containment — isolate affected systems, implement workarounds, prevent spread.
3. Eradication — identify the root cause, apply fixes, verify resolution.
4. Recovery — restore services, validate functionality, monitor stability.
5. Postmortem — blameless analysis, document lessons, implement improvements.

📊 Key Performance Indicators (KPIs)

Metric | Target | Measurement | Importance
MTTR (Mean Time to Repair) | < 1 hour | Incident duration | High
MTBF (Mean Time Between Failures) | > 30 days | Uptime between incidents | High
Deployment Frequency | Daily | Deployments per day | Medium
Change Failure Rate | < 5% | Failed deployments / total | High
Recovery Time Objective (RTO) | < 4 hours | Maximum acceptable downtime | Critical
Recovery Point Objective (RPO) | < 15 minutes | Maximum acceptable data loss | Critical
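The change failure rate row can be computed directly from a deployment log. A minimal sketch over hypothetical data (one line per deployment, second field either ok or failed):

```shell
# Hypothetical deployment log:
cat > /tmp/deploys.log <<'EOF'
2024-01-10 ok
2024-01-11 failed
2024-01-12 ok
2024-01-13 ok
EOF

# Change failure rate = failed deployments / total deployments:
awk '{ total++; if ($2 == "failed") failed++ }
     END { printf "change failure rate: %.0f%%\n", 100 * failed / total }' /tmp/deploys.log
```

For this sample (1 failure out of 4 deployments) the script prints a rate of 25%, well over the 5% target, which would flag the pipeline for review.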

Troubleshooting Tools Reference

System Monitoring

htop, atop, nmon

$ htop     # Interactive process viewer
$ atop     # Advanced system & process monitor
$ nmon     # Performance monitoring
$ iotop    # I/O monitoring
$ iftop    # Network traffic monitoring
Network Debugging

tcpdump, netstat, ss

$ tcpdump -i eth0 port 80   # Capture HTTP traffic
$ netstat -tulpn            # Listening ports
$ ss -tulpn                 # Modern netstat
$ nc -zv host port          # Port testing
$ mtr host                  # Network diagnostics
Disk & Filesystem

iostat, df, du, ncdu

$ iostat -x 1   # Disk I/O stats
$ df -h         # Disk space
$ du -sh /*     # Directory sizes
$ ncdu /        # Interactive disk usage
$ lsblk         # Block devices
Process Analysis

strace, lsof, pstree

$ strace -p <pid>   # System calls of a running process
$ lsof -p <pid>     # Open files of a process
$ pstree -p         # Process tree
$ ps auxf           # Process list
$ pidstat 1         # Process stats
Memory Analysis

free, vmstat, pmap

$ free -h              # Memory usage
$ vmstat 1             # Virtual memory stats
$ pmap -x <pid>        # Process memory map
$ slabtop              # Kernel slab cache info
$ numactl --hardware   # NUMA info
Security Tools

fail2ban, auditd, lynis

$ fail2ban-client status
$ ausearch -m avc      # SELinux denials
$ lynis audit system   # Security audit
$ chkrootkit           # Rootkit check
$ rkhunter --check     # Rootkit hunter

🚨 Emergency Recovery Quick Reference

System Won't Boot

  1. Boot from live USB/DVD
  2. Mount root partition: mount /dev/sdX1 /mnt
  3. Chroot: chroot /mnt
  4. Check logs: journalctl -xb
  5. Reinstall GRUB: grub-install /dev/sdX
  6. Update GRUB: update-grub

Forgotten Root Password

  1. Reboot and edit GRUB (press 'e')
  2. Add init=/bin/bash to kernel line
  3. Press Ctrl+X to boot
  4. Remount root: mount -o remount,rw /
  5. Change password: passwd root
  6. Reboot: exec /sbin/init

Accidental File Deletion

  1. Stop writes to filesystem immediately
  2. Use extundelete for ext3/4: extundelete /dev/sdX1 --restore-file /path
  3. Use testdisk for other filesystems
  4. Restore from backup if available
  5. Check /proc/pid/fd/ for open deleted files

Disk Full / No Space

  1. Find large files: du -ahx / | sort -rh | head -20
  2. Clear package cache: apt clean or yum clean all
  3. Remove old logs: journalctl --vacuum-time=7d
  4. Check for large deleted files still open: lsof | grep deleted
  5. Extend filesystem if using LVM

✅ Production Readiness Checklist

✅ Monitoring configured (metrics, logs, alerts)
✅ Backup system tested and working
✅ Security hardening applied
✅ Disaster recovery plan documented
✅ Incident response procedure in place
✅ Regular patching schedule established
✅ Performance baselines recorded
✅ Documentation updated and accessible
✅ Team trained on emergency procedures
✅ Postmortem process defined
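Several of these items can be spot-checked automatically before sign-off. A hedged sketch: the three probes and the /backup path below are illustrative placeholders, to be adapted to your environment and extended per checklist item:

```shell
# Report PASS/FAIL for a description and a probe command:
check() {
    desc=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $desc"
    else
        echo "FAIL  $desc"
    fi
}

check "no failed systemd units" sh -c '[ "$(systemctl --failed --no-legend | wc -l)" -eq 0 ]'
check "root fs under 90% used"  sh -c '[ "$(df --output=pcent / | tail -n1 | tr -dc 0-9)" -lt 90 ]'
check "backup newer than 24h"   sh -c 'find /backup -name "*.tar.gz" -mtime -1 2>/dev/null | grep -q .'
```

Running it on a schedule and alerting on any FAIL line turns the checklist from a one-time gate into continuous verification.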

🔄 Continuous Improvement Cycle

Monitor → Analyze → Fix → Improve → (repeat)

DevOps is a continuous cycle of monitoring, analyzing, fixing, and improving. Embrace failures as learning opportunities and always strive for better automation, reliability, and security.
