File Operations & Text Processing: The DevOps Power Tools
Master the essential text manipulation and file processing skills that every DevOps engineer needs daily.
📅 Published: Feb 2026
⏱️ Estimated Reading Time: 20 minutes
🏷️ Tags: Linux Text Processing, grep, sed, awk, Redirection, Log Analysis, DevOps Tools
🔄 Redirection & Piping: Controlling Input and Output
Understanding the Three Streams
Before we dive into redirection, you need to understand that Linux handles three types of data streams:
Standard Input (stdin) - Where programs read data from (usually your keyboard)
Standard Output (stdout) - Where programs write normal output (usually your screen)
Standard Error (stderr) - Where programs write error messages (also usually your screen)
These streams have numbers:
0 = stdin
1 = stdout
2 = stderr
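You can see all three streams in action with a quick experiment (run in a scratch directory; the filenames are just placeholders):

```shell
# Work in a throwaway directory so nothing real is touched
cd "$(mktemp -d)"
touch exists.txt

# fd 1 (stdout) goes to one file, fd 2 (stderr) to another;
# ls exits non-zero because one file is missing, hence the || true
ls exists.txt missing.txt 1> out.log 2> err.log || true

cat out.log   # exists.txt  -- the normal output went through fd 1
cat err.log   # "ls: cannot access 'missing.txt': ..." -- the error went through fd 2
```

Notice that `1>` is just the explicit spelling of `>`: redirecting stdout is the default.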
Output Redirection: Saving Results to Files
The > symbol redirects output to a file instead of showing it on screen. Think of it as saying "take the output and put it in this file."
```bash
# Save directory listing to a file
ls -la > files.txt

# Append to a file instead of overwriting
echo "New entry" >> log.txt

# Save only errors to a file
some_command 2> errors.log

# Save both output and errors
command > output.log 2>&1

# Or the modern way:
command &> combined.log
```
Real-world example: You're running a backup script and want to save both the progress and any errors:
```bash
backup_script.sh > backup.log 2>&1
# This saves everything to backup.log
```
Common mistake: Using > instead of >>. The single > overwrites the file, while >> appends to it. This is especially important for logs!
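A 30-second experiment makes the difference stick (safe to run anywhere; it only touches a scratch file):

```shell
cd "$(mktemp -d)"

echo "first"  >  notes.log   # > creates (or truncates!) the file
echo "second" >> notes.log   # >> appends, preserving what's there
wc -l < notes.log            # 2

echo "oops" > notes.log      # a single > just wiped both lines
wc -l < notes.log            # 1
```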
Input Redirection: Feeding Files to Programs
The < symbol takes input from a file instead of the keyboard.
```bash
# Count words in a file
wc < document.txt

# Sort lines from a file
sort < unsorted.txt > sorted.txt

# Send email with file content
mail user@example.com < message.txt
```
When to use: Most commands can take files as arguments directly (wc document.txt), but < is useful in pipelines or with commands that only read from stdin.
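tr (covered later in this article) is a good example of a stdin-only command, so it's a natural fit for `<`. A minimal sketch with a throwaway file:

```shell
cd "$(mktemp -d)"
printf 'hello world\n' > greeting.txt

# tr takes no filename argument -- it must read from stdin
tr 'a-z' 'A-Z' < greeting.txt   # HELLO WORLD

# tr greeting.txt   # would fail: tr treats arguments as character sets, not files
```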
Piping: Connecting Commands Together
The | symbol (pipe) takes the output of one command and feeds it as input to another. This is where Linux becomes incredibly powerful!
```bash
# Simple pipe: count files in a directory
ls | wc -l

# Multiple pipes: find large files and sort them by size
# (find prints paths, not sizes, so we ask du for a size first)
find /home -type f -size +100M -exec du -h {} + | sort -rh | head -10

# Process logs: find errors, count them, sort by frequency
grep "ERROR" app.log | cut -d' ' -f3 | sort | uniq -c | sort -rn
```
Visual analogy: Think of pipes as an assembly line in a factory. Raw data goes in one end, gets processed by multiple machines (commands), and finished information comes out the other end.
Real DevOps scenario: Monitoring website access logs:
```bash
tail -1000 access.log | grep ' 404 ' | cut -d' ' -f7 | sort | uniq -c | sort -rn | head -20
# This shows the top 20 "404 Not Found" pages on your website
# (the spaces around 404 keep it from matching IPs or byte counts)
```
🔍 Filters & Text Utilities: Your Swiss Army Knife
grep: The Search Master
We touched on grep earlier, but let's go deeper. The name comes from the old ed editor command g/re/p - globally search for a regular expression and print the matching lines.
```bash
# Basic search
grep "error" logfile.txt

# Case-insensitive search (very common)
grep -i "error\|warning\|fail" logfile.txt

# Show line numbers (great for debugging)
grep -n "function" script.py

# Show context (lines before and after matches)
grep -B2 -A2 "exception" app.log
# Shows 2 lines before and 2 lines after each match

# Inverse search (lines NOT containing the pattern)
grep -v "DEBUG" logfile.txt
# Shows all lines except DEBUG messages

# Count matching lines
grep -c "GET" access.log
# Counts lines containing GET requests (not individual matches)

# Search recursively through multiple files
grep -r "TODO" /home/user/projects/
```
Regular expressions power: grep supports patterns, not just exact text:
- `grep "^2024" log.txt` - lines starting with 2024
- `grep "fail$" log.txt` - lines ending with fail
- `grep "error.*database" log.txt` - "error" followed by "database" somewhere on the line
Real use: Finding all database connection strings in configuration files:
```bash
grep -r "mysql://\|postgresql://" /etc/
```
sed: The Stream Editor
sed stands for Stream EDitor. It edits text on-the-fly without opening files. Think of it as "search and replace" for the command line.
```bash
# Simple replacement
sed 's/old/new/g' file.txt
# s = substitute, g = global (all occurrences on the line)

# Replace and save to a new file
sed 's/error/ERROR/g' log.txt > fixed_log.txt

# Replace only on specific lines
sed '5,10s/foo/bar/g' file.txt   # Only lines 5 through 10

# Delete lines containing a pattern
sed '/debug/d' log.txt   # d = delete

# Print specific lines
sed -n '10,20p' largefile.txt   # Shows only lines 10-20
```
Common DevOps uses:
```bash
# Remove comments from a config file
sed '/^#/d' /etc/nginx/nginx.conf

# Change a port in a configuration file
sed -i 's/port=8080/port=9000/g' config.properties
# -i = edit the file in place (be careful!)

# Extract a specific column from a CSV
sed 's/,/ /g' data.csv | awk '{print $3}'
```
Pro tip: Use sed 's/pattern/replacement/' without -i first to see what changes would be made, then add -i to actually modify the file.
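Here's that workflow on a throwaway config file (GNU sed shown; on macOS/BSD sed the in-place flag is `-i ''`):

```shell
cd "$(mktemp -d)"
printf 'port=8080\nhost=localhost\n' > config.properties

# Step 1: dry run -- print the transformed text, the file stays untouched
sed 's/port=8080/port=9000/' config.properties

# Step 2: the output looked right, so repeat with -i to edit in place
sed -i 's/port=8080/port=9000/' config.properties
grep port config.properties   # port=9000
```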
awk: The Programming Powerhouse
awk is a complete programming language for text processing. It's named after its creators: Aho, Weinberger, and Kernighan.
Basic concept: awk processes text line by line, splitting each line into fields.
```bash
# Print first column (default separator is whitespace)
awk '{print $1}' file.txt

# Print specific columns
awk '{print $1, $3, $5}' data.txt

# Use a different separator (CSV files)
awk -F',' '{print $2, $4}' data.csv
# -F sets the field separator

# Add a header
awk 'BEGIN {print "Name\tAge"} {print $1, $2}' people.txt

# Conditional printing
awk '$3 > 100 {print $1, $3}' sales.txt
# Only print if the third field > 100

# Calculate totals
awk '{sum += $3} END {print "Total:", sum}' sales.txt
```
Real-world DevOps examples:
```bash
# Parse Apache/nginx access logs
awk '{print $1, $7, $9}' access.log | head -20
# Shows IP, page, status code

# Find average response time
awk '{sum += $NF} END {print "Avg:", sum/NR}' response_times.log
# $NF = last field on the line; NR = Number of Records (lines)

# Monitor disk usage
df -h | awk '/\/$/ {print "Root fs is", $5, "full"}'
# Checks root filesystem usage percentage
```
Why awk is worth learning: Once you understand basic awk, you can solve complex text processing problems in one line that would take 20 lines in Python.
cut: Extracting Columns
cut is simpler than awk for basic column extraction.
```bash
# Extract first column (tab-separated by default)
cut -f1 data.txt

# Extract multiple columns
cut -f1,3,5 data.txt

# Use a different delimiter
cut -d',' -f2,4 data.csv
# -d sets the delimiter, -f specifies fields

# Extract by character position
cut -c1-10 file.txt   # First 10 characters of each line

# Extract by byte position (for binary safety)
cut -b1-20 file.txt
```
When to use cut vs awk:
- Use `cut` for simple column extraction with consistent delimiters
- Use `awk` when you need calculations, conditions, or complex patterns
sort: Ordering Data
sort organizes lines of text. It's more powerful than you might think!
```bash
# Basic alphabetical sort
sort names.txt

# Reverse sort
sort -r numbers.txt

# Numeric sort (not alphabetical!)
sort -n numbers.txt

# Sort by specific column
sort -k2,2 data.txt   # Sorts by the second column

# Remove duplicates while sorting
sort -u duplicates.txt

# Case-insensitive sort
sort -f mixedcase.txt

# Human-readable number sort (K, M, G suffixes)
sort -h sizes.txt
```
Common use with other commands:
```bash
# Find most common errors
grep "ERROR" app.log | cut -d' ' -f3 | sort | uniq -c | sort -rn

# Sort files by size (5th column of ls -l output)
ls -l | sort -k5 -n -r
```
uniq: Finding Unique Lines
uniq removes duplicate consecutive lines. Important: It only removes adjacent duplicates, so you usually need sort first.
```bash
# Remove consecutive duplicates
uniq repeated.txt

# Count occurrences
uniq -c data.txt

# Show only duplicated lines
uniq -d data.txt

# Show only lines that appear once
uniq -u data.txt
```
The classic pattern: sort | uniq -c | sort -rn
- First `sort` groups duplicates together
- `uniq -c` counts them
- Final `sort -rn` sorts by count (descending)
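A tiny worked example of the pattern, using made-up HTTP methods as the data:

```shell
cd "$(mktemp -d)"
printf '%s\n' GET POST GET GET DELETE POST > methods.txt

sort methods.txt | uniq -c | sort -rn
# Prints (counts are left-padded):
#   3 GET
#   2 POST
#   1 DELETE
```

Swap methods.txt for a column cut out of a real log and you have an instant frequency report.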
tr: Character Translation
tr (translate) changes or deletes characters. It works on individual characters, not words.
```bash
# Convert lowercase to uppercase
echo "hello" | tr 'a-z' 'A-Z'
# Output: HELLO

# Replace spaces with commas
echo "a b c d" | tr ' ' ','
# Output: a,b,c,d

# Delete specific characters
echo "hello123" | tr -d '0-9'
# Output: hello (removes digits)

# Squeeze repeated characters
echo "hellooo world" | tr -s ' o'
# Output: hello world (squeezes runs of spaces and 'o's)

# Delete non-printable characters (keep newlines, which [:print:] excludes)
tr -cd '[:print:]\n' < binaryfile.txt
```
Useful for cleaning data:
```bash
# Clean a CSV export: strip Windows carriage returns, replace spaces with underscores
tr -d '\r' < dirty.csv | tr ' ' '_' > clean.csv
```
wc: Word Count (and More)
wc counts lines, words, and characters.
```bash
# Count lines, words, characters
wc document.txt

# Count only lines
wc -l access.log
# Great for: "How many requests today?"

# Count only words
wc -w essay.txt

# Count bytes (use -m for characters; they differ in UTF-8)
wc -c data.txt

# Count files in directory
ls | wc -l
```
Real use: Monitoring log growth:
```bash
# Error count per hour
# (assumes each line starts with a timestamp like "14:05:33 ERROR ...")
grep "ERROR" app.log | cut -d: -f1 | uniq -c

# Or simpler, the total count:
grep -c "ERROR" app.log
```
🔎 Searching Files with find and locate
find: The Comprehensive File Searcher
find is incredibly powerful for locating files based on various criteria.
```bash
# Basic: find files by name
find /home -name "*.txt"

# Case-insensitive name search
find /var -iname "*.log"

# Find by file type
find . -type f   # Regular files
find . -type d   # Directories
find . -type l   # Symbolic links

# Find by size
find / -size +100M 2>/dev/null   # Larger than 100MB
find /home -size -1k             # Smaller than 1KB

# Find by modification time
find /var/log -mtime -7    # Modified in last 7 days
find /backup -mtime +30    # Older than 30 days

# Find by permissions
find /home -perm 644             # Files with exactly 644 permissions
find / -perm -4000 2>/dev/null   # SUID files (security check!)

# Find and execute actions
find /tmp -name "*.tmp" -delete            # Find and delete
find . -name "*.sh" -exec chmod +x {} \;   # Find and make executable
```
The -exec magic: {} is replaced with each matched filename, and \; ends the command (the backslash keeps the shell from interpreting the semicolon itself).
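A safe sandbox to watch -exec work (nothing outside the temp directory is touched):

```shell
cd "$(mktemp -d)"
touch a.sh b.sh notes.txt

# {} expands to each matched path; \; runs chmod once per file
find . -name "*.sh" -exec chmod +x {} \;

ls -l   # a.sh and b.sh are now executable, notes.txt is not
```

Design note: ending with `+` instead of `\;` (as in `-exec chmod +x {} +`) batches many filenames into a single command, which is much faster on large trees.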
Security use: Finding world-writable files (potential security issue):
```bash
find / -type f -perm -0002 2>/dev/null
```
locate: The Fast Search
locate uses a database for lightning-fast searches, but it might not find recently created files.
```bash
# Fast search
locate nginx.conf

# Case-insensitive
locate -i "README"

# Limit number of results
locate "*.log" | head -20

# Update the search database
sudo updatedb
# Run this if locate doesn't find new files
```
find vs locate:
- Use `find` when you need precise criteria (size, time, permissions)
- Use `locate` when you just need to find a file quickly by name
📊 Managing Large Logs: The Essential Toolkit
The Right Tool for Each Job
When working with logs (which DevOps engineers do constantly), you need different tools for different tasks:
cat: For Small Files Only
```bash
# Good for small config files
cat /etc/hosts

# Bad for large logs!
cat /var/log/syslog   # Don't do this for 1GB files!
```
less: Your Go-To Log Viewer
```bash
# Open log file
less /var/log/nginx/access.log

# Within less:
#   Space  = next page
#   b      = previous page
#   /error = search for "error"
#   n      = next match
#   N      = previous match
#   g      = go to beginning
#   G      = go to end
#   q      = quit

# Open with line numbers
less -N app.log

# Follow mode (like tail -f)
less +F app.log
# Ctrl+C to stop following, then F to resume
```
tail: Watching Live Logs
```bash
# Last 10 lines (default)
tail app.log

# Last 100 lines
tail -100 app.log

# Watch log in real-time (CRITICAL for debugging!)
tail -f /var/log/nginx/access.log

# Watch multiple logs
tail -f /var/log/nginx/*.log

# Show lines from a specific point
tail -n +1000 app.log   # From line 1000 to the end

# Stop following automatically when a process exits (GNU tail)
tail -f --pid=$(pgrep -o nginx) /var/log/nginx/error.log
```
Real scenario: Debugging a web application issue:
```bash
# In one terminal: watch the error log
tail -f /var/log/nginx/error.log

# In another: watch the application log
tail -f /var/log/app/app.log

# Reproduce the issue and watch both logs simultaneously
```
head: Checking Beginnings
```bash
# First 10 lines (default)
head config.yaml

# First 50 lines
head -50 largefile.csv

# Check file structure
head -20 database_dump.sql
# Shows what type of SQL dump it is
```
tac: cat Backwards!
Yes, tac is cat spelled backwards, and it prints a file's lines in reverse order!
```bash
# View log from newest to oldest
tac /var/log/syslog | head -100
# Useful for: "What just happened?"
# Most recent entries are shown first
```
🎯 Real-World DevOps Scenarios
Scenario 1: Analyzing Web Server Logs
Problem: Website is slow. Need to find the slowest pages.
```bash
# Parse nginx access log for slow requests
# (assumes the request time, e.g. $request_time, is logged as the last field)
awk '$NF > 5 {print $7, $NF}' /var/log/nginx/access.log | sort -k2 -rn | head -20
# Shows URL and time for requests taking more than 5 seconds

# Alternative with more detail
grep " 5\." /var/log/nginx/access.log | awk '{print $7, $NF}' | sort | uniq -c | sort -rn
```
Scenario 2: Finding Large Files for Cleanup
Problem: Disk is 95% full. Need to find largest files.
```bash
# Find files larger than 100MB, sorted by size
find / -type f -size +100M -exec du -h {} + 2>/dev/null | sort -rh | head -20

# Find the largest directories and files
du -ah /var/log 2>/dev/null | sort -rh | head -20

# Find and delete log files older than 30 days
find /var/log -name "*.log" -type f -mtime +30 -delete
```
Scenario 3: Monitoring Application Health
Problem: Need to create a daily error report.
```bash
# Create error summary report
REPORT="/var/log/error-report-$(date +%Y%m%d).txt"

echo "=== Daily Error Report ===" > "$REPORT"
echo "Generated: $(date)" >> "$REPORT"
echo >> "$REPORT"

echo "Top 10 errors:" >> "$REPORT"
grep "ERROR" /var/log/app/app.log | cut -d' ' -f4- | sort | uniq -c | sort -rn | head -10 >> "$REPORT"
echo >> "$REPORT"

echo "Error count by hour:" >> "$REPORT"
grep "ERROR" /var/log/app/app.log | cut -d: -f2 | sort | uniq -c >> "$REPORT"

echo "Report saved to: $REPORT"
```
Scenario 4: Processing CSV Data
Problem: Need to extract specific data from a large CSV.
```bash
# Extract columns 2 and 4 from a CSV
awk -F',' '{print $2, $4}' largefile.csv > extracted.txt

# Filter rows where column 3 > 100
awk -F',' '$3 > 100 {print $0}' data.csv

# Calculate the average of column 5
awk -F',' '{sum += $5; count++} END {print "Average:", sum/count}' data.csv

# Convert CSV to TSV (tab-separated)
sed 's/,/\t/g' data.csv > data.tsv
```
💡 Pro Tips and Best Practices
1. Chain Commands Wisely
```bash
# Good: clear chain of operations
cat file.txt | grep "error" | sort | uniq

# Better: avoid the unnecessary cat (Useless Use of Cat award!)
grep "error" file.txt | sort | uniq

# Best: use shell features when possible
< file.txt grep "error" | sort | uniq
```
2. Handle Errors Gracefully
```bash
# Ignore "permission denied" errors
find / -name "*.conf" 2>/dev/null

# Save errors separately
some_command > output.txt 2> errors.txt

# Send errors to /dev/null (discard them)
noisy_command 2>/dev/null
```
3. Use Temporary Files for Complex Operations
```bash
# For multi-step processing
grep "ERROR" app.log > errors.tmp
awk '{print $2, $5}' errors.tmp > processed.tmp
sort processed.tmp > final.txt
rm -f errors.tmp processed.tmp

# Or use process substitution
diff <(sort file1.txt) <(sort file2.txt)
```
4. Learn Regular Expressions
Even basic regex makes you 10x more powerful:
- `.` = any single character
- `*` = zero or more of the previous item
- `^` = start of line
- `$` = end of line
- `[abc]` = any of a, b, or c
- `[0-9]` = any digit
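Each of those can be tried in seconds against a three-line scratch file:

```shell
cd "$(mktemp -d)"
printf '2024-01-01 ok\njob fail\nretry 2024\n' > log.txt

grep '^2024' log.txt    # 2024-01-01 ok   (starts with 2024)
grep 'fail$' log.txt    # job fail        (ends with fail)
grep '[0-9]' log.txt    # all three lines (each contains a digit)
grep 'r.*y' log.txt     # retry 2024      (an r, anything, then a y)
```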
📋 Quick Reference Cheat Sheet
| Task | Command | Example |
|---|---|---|
| Save output | `>` | `ls > files.txt` |
| Append output | `>>` | `echo "new" >> log.txt` |
| Redirect errors | `2>` | `cmd 2> errors.log` |
| Pipe output | `\|` | `ls \| wc -l` |
| Search text | `grep` | `grep "error" log.txt` |
| Search/replace | `sed` | `sed 's/old/new/g' file.txt` |
| Column extraction | `awk` | `awk '{print $1}' file.txt` |
| Simple columns | `cut` | `cut -d',' -f2 file.csv` |
| Sort data | `sort` | `sort -n numbers.txt` |
| Unique lines | `uniq` | `sort file.txt \| uniq` |
| Translate chars | `tr` | `echo "hi" \| tr 'a-z' 'A-Z'` |
| Count elements | `wc` | `wc -l file.txt` |
| Find files | `find` | `find / -name "*.log"` |
| Fast find | `locate` | `locate nginx.conf` |
| View file | `less` | `less largefile.log` |
| Watch end | `tail -f` | `tail -f app.log` |
| View beginning | `head` | `head -20 file.txt` |
| Reverse view | `tac` | `tac log.txt \| head -50` |
🚀 Practice Exercises
Exercise 1: Create a Log Analysis Script
```bash
# Create a script that analyzes an Apache/nginx access log
cat > analyze_log.sh << 'EOF'
#!/bin/bash
LOG_FILE="${1:-/var/log/nginx/access.log}"

echo "=== Log Analysis: $LOG_FILE ==="
echo

echo "1. Total requests:"
wc -l "$LOG_FILE"
echo

echo "2. Top 10 IP addresses:"
awk '{print $1}' "$LOG_FILE" | sort | uniq -c | sort -rn | head -10
echo

echo "3. Top 10 requested pages:"
awk '{print $7}' "$LOG_FILE" | sort | uniq -c | sort -rn | head -10
echo

echo "4. Response code summary:"
awk '{print $9}' "$LOG_FILE" | sort | uniq -c | sort -rn
echo

echo "5. Requests per hour:"
awk '{print $4}' "$LOG_FILE" | cut -d: -f2 | sort | uniq -c
EOF

chmod +x analyze_log.sh
./analyze_log.sh
```
Exercise 2: Clean and Process Data
```bash
# Create sample data
cat > sales.csv << 'EOF'
Date,Product,Quantity,Price
2024-01-01,Widget,10,25.50
2024-01-02,Gadget,5,99.99
2024-01-01,Widget,15,25.50
2024-01-03,Thingy,8,45.00
EOF

# Process it
echo "Total revenue per product:"
awk -F',' 'NR>1 {revenue[$2] += $3 * $4} END {for (p in revenue) print p, revenue[p]}' sales.csv

echo
echo "Daily sales:"
awk -F',' 'NR>1 {print $1, $3, $4}' sales.csv | sort
```
Exercise 3: Monitor System in Real-time
```bash
# Create a monitoring dashboard (refreshes every 5 seconds)
watch -n 5 '
echo "=== System Status ==="
echo "Time: $(date)"
echo
echo "CPU Load: $(uptime)"
echo
echo "Memory:"
free -h | grep -E "^Mem:"
echo
echo "Disk:"
df -h / | tail -1
echo
echo "Top 5 processes by CPU:"
ps aux --sort=-%cpu | head -6
'
# Press Ctrl+C to stop
```
🔗 Master Text Processing with Hands-on Labs
Text processing skills are what separate junior from senior DevOps engineers. The best way to learn is through guided practice with real-world scenarios.
👉 Practice file operations and text processing with real datasets at:
https://devops.trainwithsky.com/
Our interactive labs give you:
- Real log files to analyze
- Production-like datasets
- Step-by-step challenges
- Immediate feedback
- Progress from beginner to advanced
Common Questions Answered
Q: When should I use awk vs sed?
A: Use sed for simple text transformations (search/replace, deletion). Use awk when you need to work with columns, do calculations, or have conditional logic.
Q: Why does uniq not remove all duplicates?
A: uniq only removes adjacent duplicates. Always use sort | uniq to remove all duplicates.
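You can prove the adjacency rule to yourself in one minute:

```shell
cd "$(mktemp -d)"
printf 'a\nb\na\na\n' > letters.txt

uniq letters.txt          # prints: a, b, a  -- the first 'a' is not adjacent to the others
sort letters.txt | uniq   # prints: a, b     -- sorting made the duplicates adjacent first
```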
Q: Is it bad to use cat with pipes?
A: Sometimes yes (UUOC - Useless Use of Cat). Use < file command or command file when possible. But for beginners, cat file | command is fine and more readable.
Q: How do I handle files with spaces in names?
A: Use find -print0 | xargs -0 or use find -exec. For example: find . -name "*.txt" -print0 | xargs -0 ls -l
Q: What's the best way to learn regular expressions?
A: Start with simple patterns (^, $, ., *), practice on regex testing websites, and gradually add more complexity.
Stuck on a text processing problem? Share your challenge in the comments and we'll help you solve it! 💬