Linux Troubleshooting & DevOps Best Practices
Mastering System Diagnostics, Recovery, and Production Operations
⚠️ Emergency Notice
This guide contains critical troubleshooting information. In production emergencies, always follow your organization's incident response procedures and maintain clear communication with your team.
1. Common Boot Issues & Recovery
Linux Boot Process Flow
GRUB Boot Issues
GRUB Not Loading
GRUB Rescue Mode
Kernel Panic at Boot
Filesystem & Mount Issues
Filesystem Corruption
Mount Issues & fstab Problems
⚠️ Critical Boot Parameters for Recovery
2. Log Analysis Techniques
Log Analysis Philosophy: "Logs don't lie. Learn to ask them the right questions." Start with the most recent logs and work backwards. Correlate timestamps across different log sources.
System Log Locations
Systemd Journal
Traditional Log Files
Advanced Log Analysis Commands
Real-time Monitoring
Log Filtering & Analysis
Common Log Patterns & Meanings
Critical Errors
Common Errors
Warnings to Monitor
Automated Log Analysis Script
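A minimal sketch of such a script, assuming plain-text log lines arrive on standard input. The function name summarize_log and the keyword lists are illustrative assumptions, not a standard; adjust them to the messages your systems actually emit.

```shell
# summarize_log: read log lines from stdin and count them by rough severity.
# Each line is counted once, in the most severe bucket it matches.
# The keyword lists are assumptions chosen for illustration.
summarize_log() {
    awk '
        {
            line = tolower($0)
            if (line ~ /panic|segfault|out of memory|oom-killer/) critical++
            else if (line ~ /error|failed|failure/)               errors++
            else if (line ~ /warn/)                               warnings++
        }
        END {
            printf "critical=%d errors=%d warnings=%d\n", critical, errors, warnings
        }
    '
}

# Example usage (current boot, warnings and worse):
#   journalctl -b -p warning --no-pager | summarize_log
# Or against a traditional log file:
#   summarize_log < /var/log/syslog
```

Counting by bucket gives a quick first impression of a log. It does not replace reading the matching lines, which is where the timestamps needed for correlation live.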
3. Backup & Restore Strategies
3-2-1 Backup Rule
Backup Strategies & Tools
Full Backup
Complete system copy
Incremental Backup
Changes since last backup
Differential Backup
Changes since full backup
Snapshot Backup
Point-in-time copies
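To make the difference between incremental and differential backups concrete, the sketch below selects files by modification time relative to a marker file. The function name changed_since, the marker-file convention, and the example paths are assumptions for illustration; real tools such as rsync or BorgBackup track changes in their own ways.

```shell
# changed_since MARKER DIR
# List regular files under DIR that were modified after MARKER was last touched.
changed_since() {
    find "$2" -type f -newer "$1" | sort
}

# Differential: compare against a marker touched only after each FULL backup,
# so the list keeps growing until the next full run.
#   changed_since /backup/last-full.marker /srv/data
#
# Incremental: compare against a marker touched after EVERY backup run,
# so the list holds only what changed since the previous run.
#   changed_since /backup/last-run.marker /srv/data
```

The trade-off follows from the marker choice: incremental runs are smaller but a restore needs the full backup plus every increment, while a differential restore needs only the full backup plus the latest differential.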
Backup Implementation Examples
rsync - The Swiss Army Knife
BorgBackup - Deduplication
System Recovery Procedures
Full System Restore
Partial File Recovery
Database Recovery
Backup Validation Checklist
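One item that belongs on any validation checklist is proving that restored data matches the source. The sketch below compares SHA-256 manifests of two directory trees. It assumes GNU coreutils and findutils (sha256sum, sort -z, xargs -r); the function names make_manifest and verify_backup are hypothetical.

```shell
# make_manifest DIR
# Print "hash  ./relative/path" for every regular file in DIR, in stable order.
make_manifest() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum )
}

# verify_backup SOURCE_DIR RESTORED_DIR
# Succeed only if the source is non-empty and both trees hash identically.
verify_backup() {
    src_manifest="$(make_manifest "$1")"
    [ -n "$src_manifest" ] && [ "$src_manifest" = "$(make_manifest "$2")" ]
}

# Example usage (paths are placeholders):
#   verify_backup /srv/data /mnt/restore-test/data && echo "backup verified"
```

Comparing against a test restore, rather than against the backup archive itself, also exercises the restore path, which is the part that matters during an incident.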
4. Kernel Panics & Recovery
⚠️ Kernel Panic Immediate Actions
- Document the panic message (take photo if possible)
- Note any recent system changes (updates, hardware)
- Attempt safe reboot (if automatic reboot doesn't work)
- Boot into recovery/single-user mode
- Collect crash dumps and logs
Understanding Kernel Panic Messages
Sample Kernel Panic Analysis
Kernel Debugging Tools
kdump & crash
Magic SysRq Keys
ftrace & perf
Common Kernel Panic Causes & Solutions
| Symptom | Likely Cause | Immediate Action | Long-term Fix |
|---|---|---|---|
| NULL pointer dereference | Buggy kernel module | Boot with nomodule | Update or blacklist module |
| Out of memory | Memory leak or insufficient RAM | Kill process, increase swap | Fix memory leak, add RAM |
| Filesystem corruption | Unclean shutdown, hardware fault | fsck filesystem | Replace failing hardware |
| CPU soft lockup | Kernel bug, hardware issue | Reboot, check cooling | Update kernel, check hardware |
| Kernel stack overflow | Deep recursion, buggy driver | Reboot with older kernel | Update driver, report bug |
| IRQ handler issues | Hardware conflict | Boot with noapic | Update BIOS, check hardware |
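The table above can be turned into a rough first-pass lookup. The sketch below reads a panic message on standard input and prints the matching row's actions. The function name suggest_panic_action is hypothetical, and the match strings are assumptions about typical kernel wording; real messages vary between kernel versions.

```shell
# suggest_panic_action: read a panic/oops message from stdin and print the
# immediate action and long-term fix from the table. Matching is
# case-insensitive and uses simple substrings (assumed, not exhaustive).
suggest_panic_action() {
    case "$(tr '[:upper:]' '[:lower:]')" in
        *"null pointer dereference"*)
            echo "Disable module loading at boot; update or blacklist the module" ;;
        *"out of memory"*)
            echo "Kill the offending process or increase swap; fix the leak or add RAM" ;;
        *"ext4-fs error"*|*"corruption"*)
            echo "Run fsck on the filesystem; replace failing hardware" ;;
        *"soft lockup"*)
            echo "Reboot and check cooling; update the kernel and check hardware" ;;
        *"stack overflow"*)
            echo "Reboot with an older kernel; update the driver and report the bug" ;;
        *"nobody cared"*)
            echo "Boot with noapic; update BIOS and check hardware" ;;
        *)
            echo "No match; collect logs and crash dumps for manual analysis" ;;
    esac
}

# Example usage:
#   dmesg | suggest_panic_action
```

A lookup like this only narrows the search. The call trace in the panic output still has to be read to find the responsible module or driver.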
Emergency Kernel Recovery Procedure
Immediate Response
Document panic message, note recent changes, attempt safe reboot
Boot to Recovery
Use GRUB to boot older kernel or recovery mode with systemd.unit=rescue.target
Collect Evidence
Gather logs: journalctl -xb, dmesg, check /var/crash/
Identify Culprit
Check recent updates, new hardware, modified configurations
Apply Fix
Remove problematic kernel/modules, restore from backup, update system
Test & Monitor
Reboot, monitor stability, implement preventive measures
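The "Collect Evidence" step can be scripted so that nothing is forgotten under pressure. The sketch below gathers the sources named in that step into one directory. The function name collect_crash_evidence and the output file names are assumptions; each tool is skipped quietly if it is missing or not permitted.

```shell
# collect_crash_evidence OUTDIR
# Save whatever evidence is readable into OUTDIR; never abort on a single failure.
collect_crash_evidence() {
    local out="$1"
    mkdir -p "$out" || return 1

    # Journal for the current boot, with explanatory text.
    if command -v journalctl >/dev/null 2>&1; then
        journalctl -xb --no-pager > "$out/journal.txt" 2>&1 || true
    fi

    # Kernel ring buffer; may require root when kernel.dmesg_restrict=1.
    if command -v dmesg >/dev/null 2>&1; then
        dmesg > "$out/dmesg.txt" 2>&1 || true
    fi

    # Crash dumps can be large, so record a listing instead of copying them.
    if [ -d /var/crash ]; then
        ls -l /var/crash > "$out/var-crash-listing.txt" 2>&1 || true
    fi

    # Timestamp the collection for the incident timeline.
    date -u '+%Y-%m-%dT%H:%M:%SZ' > "$out/collected-at.txt"
}

# Example usage (directory name is a placeholder):
#   collect_crash_evidence /root/panic-evidence
```

After a panic the current boot is usually the recovery boot. With persistent journaling enabled, journalctl -b -1 reads the previous boot, which is where the lead-up to the panic is recorded.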
5. DevOps Best Practices for Linux Environments
DevOps Philosophy: Infrastructure as Code, Continuous Monitoring, Automated Recovery, and Blameless Postmortems. Focus on reproducibility, observability, and resilience.
Infrastructure as Code (IaC)
📁 Configuration Management
🐳 Container Orchestration
🏗️ Infrastructure Provisioning
Monitoring & Observability
Metrics Collection
Logging & Tracing
Security Hardening
Linux Server Hardening Checklist
Automation & CI/CD
GitLab CI/CD Pipeline
Infrastructure Testing
⚠️ Incident Response & Postmortem
Detection & Assessment
Monitor alerts, assess impact, classify severity (SEV-1, SEV-2, etc.)
Containment
Isolate affected systems, implement workarounds, prevent spread
Eradication
Identify root cause, apply fixes, verify resolution
Recovery
Restore services, validate functionality, monitor stability
Postmortem
Blameless analysis, document lessons, implement improvements
📊 Key Performance Indicators (KPIs)
| Metric | Target | Measurement | Importance |
|---|---|---|---|
| MTTR (Mean Time to Repair) | < 1 hour | Incident duration | High |
| MTBF (Mean Time Between Failures) | > 30 days | Uptime between incidents | High |
| Deployment Frequency | Daily | Deployments per day | Medium |
| Change Failure Rate | < 5% | Failed deployments / total | High |
| Recovery Time Objective | < 4 hours | Maximum acceptable downtime | Critical |
| Recovery Point Objective | < 15 minutes | Maximum data loss | Critical |
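Two of these metrics are simple arithmetic and can be computed from incident and deployment records. The sketch below assumes a hypothetical input format of one incident per line as start_epoch,end_epoch in seconds; the function names mttr_minutes and change_failure_rate are illustrative.

```shell
# mttr_minutes: read "start_epoch,end_epoch" lines from stdin and print the
# mean incident duration in minutes (one decimal place).
mttr_minutes() {
    awk -F, '
        NF == 2 { total += $2 - $1; n++ }
        END { if (n) printf "%.1f\n", total / n / 60; else print "n/a" }
    '
}

# change_failure_rate FAILED TOTAL
# Print failed deployments as a percentage of all deployments.
change_failure_rate() {
    awk -v failed="$1" -v total="$2" 'BEGIN {
        if (total > 0) printf "%.1f\n", 100 * failed / total; else print "n/a"
    }'
}

# Example usage (incidents.csv is a placeholder file name):
#   mttr_minutes < incidents.csv
#   change_failure_rate 3 100
```

With two incidents lasting 30 and 90 minutes, the mean is 60 minutes, which sits exactly at the edge of the "< 1 hour" MTTR target in the table.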
Troubleshooting Tools Reference
htop, atop, nmon
tcpdump, netstat, ss
iostat, df, du, ncdu
strace, lsof, pstree
free, vmstat, pmap
fail2ban, auditd, lynis
🚨 Emergency Recovery Quick Reference
System Won't Boot
- Boot from live USB/DVD
- Mount root partition: mount /dev/sdX1 /mnt
- Chroot: chroot /mnt
- Check logs: journalctl -xb
- Reinstall GRUB: grub-install /dev/sdX
- Update GRUB: update-grub
Forgotten Root Password
- Reboot and edit GRUB (press 'e')
- Add init=/bin/bash to kernel line
- Press Ctrl+X to boot
- Remount root: mount -o remount,rw /
- Change password: passwd root
- Reboot: exec /sbin/init
Accidental File Deletion
- Stop writes to filesystem immediately
- Use extundelete for ext3/4: extundelete /dev/sdX1 --restore-file /path
- Use testdisk for other filesystems
- Restore from backup if available
- Check /proc/<pid>/fd/ for open deleted files
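The last item works because Linux keeps a deleted file's data alive while any process still holds it open, and exposes it through that process's fd directory under /proc. The sketch below lists such files for one process. The function name list_deleted_open is hypothetical; the PID and descriptor number in the usage lines are placeholders.

```shell
# list_deleted_open PID
# Print "fd-path<TAB>original-path (deleted)" for every descriptor of PID
# that points at a file which has been unlinked.
list_deleted_open() {
    local fd target
    for fd in /proc/"$1"/fd/*; do
        # readlink fails for descriptors we may not inspect; skip those.
        target="$(readlink "$fd" 2>/dev/null)" || continue
        case "$target" in
            *" (deleted)") printf '%s\t%s\n' "$fd" "$target" ;;
        esac
    done
}

# Example usage:
#   list_deleted_open 1234
#   cp /proc/1234/fd/5 /root/recovered.log
```

The copy has to happen while the process is still running. Once it closes the descriptor or exits, the kernel frees the data blocks and this route is gone.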
Disk Full / No Space
- Find large files: du -ahx / | sort -rh | head -20
- Clear package cache: apt clean or yum clean all
- Remove old logs: journalctl --vacuum-time=7d
- Check for large deleted files still open: lsof | grep deleted
- Extend filesystem if using LVM
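Catching a filling disk before it is full is cheaper than the cleanup above. The sketch below filters df -P output for filesystems at or above a usage threshold. The function name filesystems_over is hypothetical, and the parsing assumes mount points without spaces.

```shell
# filesystems_over PERCENT
# Read `df -P` output on stdin; print "mountpoint usage%" for every
# filesystem whose usage is at or above PERCENT.
filesystems_over() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "95%" -> "95"
            if (use + 0 >= limit + 0) print $6, use "%"
        }
    '
}

# Example usage:
#   df -P | filesystems_over 90
```

Reading from standard input keeps the function easy to test with canned df output and easy to wire into a cron job or monitoring check.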
✅ Production Readiness Checklist
🔄 Continuous Improvement Cycle
DevOps is a continuous cycle of monitoring, analyzing, fixing, and improving. Embrace failures as learning opportunities and always strive for better automation, reliability, and security.