Linux File Operations & Text Processing
Published: December 2023 | Topic: Data Manipulation & Automation for DevOps
Mastering Linux text processing and file operations is essential for DevOps efficiency. These skills enable you to analyze logs, transform data, automate tasks, and process information at scale. Whether you're debugging applications, analyzing metrics, or building automation scripts, these tools form the backbone of your daily workflow.
The Unix Philosophy: Do One Thing Well
Linux tools follow the Unix philosophy: small, focused programs that do one thing well, combined together through pipes to solve complex problems. This approach gives you:
- Composability: Combine simple tools to create complex solutions
- Reusability: The same tools work across different scenarios
- Maintainability: Simple tools are easier to understand and debug
- Flexibility: Endless combinations for different tasks
1. Redirection & Piping: Data Flow Control
What are Redirection and Piping?
Redirection controls where command input comes from and where output goes. Piping connects the output of one command to the input of another. Together, they create powerful data processing pipelines.
Command 1 (Generates Data) → Command 2 (Filters/Transforms) → Command 3 (Final Result)
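For example, a minimal three-command pipeline following this pattern, here counting running Nginx processes (the process name is illustrative):
# Generate (ps) → filter (grep) → final result (wc)
$ ps aux | grep "[n]ginx" | wc -l
The "[n]ginx" bracket trick keeps grep from matching its own entry in the process list.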
Standard Streams in Linux
Standard Input (stdin)
File descriptor 0. Default input source (usually keyboard).
$ cat < input.txt
# Here document
$ cat << EOF
Line 1
Line 2
EOF
Standard Output (stdout)
File descriptor 1. Default output destination (usually terminal).
$ ls > output.txt
# Append to file
$ echo "new line" >> log.txt
Standard Error (stderr)
File descriptor 2. Error messages and diagnostics.
$ command 2> errors.txt
# Redirect stderr to stdout
$ command 2>&1
Combined Redirections
Advanced redirection patterns.
$ command &> all_output.txt # Both stdout and stderr (bash shorthand)
# POSIX-portable equivalent: redirect stdout first, then point stderr at it
$ command > output.txt 2>&1
# Discard output completely
$ command > /dev/null 2>&1
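Note that a plain pipe carries only stdout; to filter error messages as well, merge stderr into stdout before the pipe:
# Send both streams through the pipe
$ command 2>&1 | grep -i "error"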
Practical Redirection Examples
Logging with Append
$ echo "$(date): Backup started" >> /var/log/backup.log
# Log command output
$ deploy.sh &>> /var/log/deploy.log
Error Handling
$ script.sh > output.log 2> error.log
# Count errors
$ grep "ERROR" app.log 2>/dev/null | wc -l
Input Redirection
$ sort < unsorted.txt
# Multiple inputs
$ cat file1 file2 | grep "pattern"
The Power of Pipes
Pipes (|) are among the most powerful features of the shell. They connect commands, creating data transformation pipelines:
$ grep "$(date +%d/%b/%Y)" /var/log/nginx/access.log |
awk '{print $1}' |
sort |
uniq -c |
sort -rn |
head -20
Each command does one thing: filter → extract → sort → count duplicates → sort → limit
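If you run a pipeline like this regularly, it can be wrapped in a small shell function. A minimal sketch, assuming the standard Nginx access-log format (the function name, default path, and default count are all illustrative):
# Hypothetical helper: top N client IPs for today's requests
top_clients() {
    local log="${1:-/var/log/nginx/access.log}" count="${2:-20}"
    grep "$(date +%d/%b/%Y)" "$log" |
        awk '{print $1}' |
        sort | uniq -c | sort -rn | head -n "$count"
}
$ top_clients /var/log/nginx/access.log 10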
2. Filters & Text Utilities: Data Transformation
What are Text Processing Filters?
Filters are commands that read input, transform it, and write output. They're designed to work with pipes, making them perfect for data processing pipelines.
Essential Text Processing Tools
grep - Pattern Search
Search for patterns using regular expressions.
$ grep -i "warning" app.log # Case insensitive
$ grep -v "DEBUG" app.log # Invert match
$ grep -r "TODO" src/ # Recursive
$ grep -E "[0-9]{3}-[0-9]{3}" # Extended regex
sed - Stream Editor
Filter and transform text streams.
$ sed -i 's/foo/bar/' config.conf # In-place
$ sed '/^#/d' file.conf # Delete comment lines
$ sed -n '10,20p' file.log # Print lines 10-20
awk - Text Processing Language
Powerful programming language for text processing.
$ awk -F: '{print $1}' /etc/passwd # Custom delimiter
$ awk '$3 > 100 {print $0}' # Filter rows
$ awk '{sum+=$3} END {print sum}' # Sum column
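Because awk has variables and associative arrays, a single awk program can often replace several piped commands. A small sketch that totals resident memory per user from ps output (this assumes the standard ps aux column layout, where $1 is the user and $6 is RSS in kilobytes):
# Total resident memory (KB) per user, largest first
$ ps aux | awk 'NR > 1 {rss[$1] += $6} END {for (u in rss) print rss[u], u}' | sort -rn | head -5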
cut - Extract Columns
Remove sections from lines of files.
$ cut -d, -f1,3 data.csv # Comma delimiter, fields 1 & 3
$ cut -c1-10 file.txt # Characters 1-10
sort - Sort Lines
Sort lines of text files.
$ sort -r file.txt # Reverse
$ sort -n numbers.txt # Numeric
$ sort -u file.txt # Unique
uniq - Report/Filter Duplicates
Report or omit repeated adjacent lines (input is usually sorted first).
$ uniq -c file.txt # Count occurrences
$ uniq -d file.txt # Only show duplicates
$ uniq -u file.txt # Only show uniques
tr - Translate Characters
Translate or delete characters.
$ tr 'a-z' 'A-Z' < file.txt # Convert to uppercase
$ tr -d '\r' < dosfile.txt > unixfile.txt # Remove carriage returns
wc - Word Count
Count lines, words, and bytes.
$ wc file.txt # Lines, words, bytes
$ wc -l file.txt # Only lines
$ wc -w file.txt # Only words
$ wc -c file.txt # Only bytes
Advanced Text Processing Examples
Real-World DevOps Text Processing
# Top 10 client IPs by request count
$ awk '{print $1}' /var/log/nginx/access.log |
sort | uniq -c | sort -rn | head -10
# Find all PHP files containing "mysql_connect" (deprecated function)
$ grep -r "mysql_connect" /var/www --include="*.php"
# Convert CSV to TSV (tab-separated)
$ sed 's/,/\t/g' data.csv > data.tsv
# Sum memory usage (%MEM) of all Java processes ([j]ava keeps grep from matching itself)
$ ps aux | grep "[j]ava" | awk '{sum+=$4} END {print sum "%"}'
# Remove duplicate lines while preserving order (a line prints only the first time it is seen)
$ awk '!seen[$0]++' file.txt
⚠️ Common Text Processing Pitfalls
- Special characters: Always quote regex patterns properly
- Large files: Some tools load entire file into memory
- Encoding issues: Use file -i to check encoding
- Performance: Chain operations efficiently (filter early)
- Locale settings: Sort order can vary by locale
3. Searching Files with find and locate
What are File Search Tools?
find searches for files in real-time, while locate uses a database for faster searches. Each has its use cases in DevOps workflows.
find - Real-time File Search
Searches filesystem in real-time. Powerful but slower.
$ find . -type f -size +100M
$ find /var -mtime -7 # Modified last 7 days
$ find . -user root
$ find /etc -type f -exec grep -l "pattern" {} \;
Best for: Precise searches, complex criteria, real-time results
locate - Database File Search
Searches pre-built database. Very fast but not real-time.
$ locate "*.log"
$ locate -i "Document" # Case insensitive
$ locate -l 20 "*.conf" # Limit results
$ sudo updatedb # Update database
Best for: Quick searches, known file names, when speed matters
Advanced find Usage
Finding by Time
$ find /tmp -atime +30 # Accessed >30 days ago
$ find . -newer reference.txt
Finding by Size
$ find . -size +10k -size -1M
$ find /var -empty # Empty files/dirs
Finding by Permissions
$ find /etc -perm /u=s # SUID files
$ find . -type f -perm /o+w # World-writable
Executing Commands
$ find /var/log -name "*.log" -exec gzip {} \;
$ find . -type f -print0 | xargs -0 rm
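When the command accepts multiple file arguments, ending -exec with + instead of \; batches files into far fewer invocations, much like the xargs form above:
# One gzip invocation per batch of files instead of one per file
$ find /var/log -name "*.log" -exec gzip {} +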
Practical Search Examples for DevOps
DevOps File Search Scenarios
# Find large log files (over 100MB)
$ find /var/log -type f -name "*.log" -size +100M
# Find configuration files modified recently
$ find /etc -type f -mtime -1
# Find PHP files with specific function (security audit)
$ find /var/www -type f -name "*.php" -exec grep -l "eval(" {} \;
# Find and archive old backup files
$ find /backup -type f -name "*.tar.gz" -mtime +30 -print0 | tar --null -czf old_backups.tar.gz -T -
# Count total lines in all source files
$ find src/ -type f -name "*.py" -exec cat {} + | wc -l
4. Managing Large Logs: Viewing & Monitoring
What is Log Management?
Log management involves viewing, monitoring, and analyzing log files efficiently. In DevOps, you work with large log files daily for debugging and monitoring.
Log File Navigation Techniques
Essential Log Viewing Commands
cat - Concatenate Files
Display entire file content.
$ cat file1 file2 file3
$ cat > newfile.txt # Create file
$ cat file1 file2 > combined.txt
Use for: Small files, combining files
tac - Reverse cat
Display file in reverse order (last line first).
$ tac file.txt | head -20 # Last 20 lines, newest first
Use for: Viewing recent entries first
less - File Pager
View file one page at a time.
# Navigation:
Space - Next page
b - Previous page
/pattern - Search
g - Go to start
G - Go to end
q - Quit
Use for: Large files, interactive viewing
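less can also follow a growing file, similar to tail -f, while keeping search and scrolling available (the log path is an example):
# Follow a log as it grows; Ctrl-C stops following, F resumes, q quits
$ less +F /var/log/syslog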
more - Basic Pager
Simpler pager (less feature-rich).
# Basic navigation only
Space - Next page
Enter - Next line
q - Quit
Use for: Simple pagination
head - View Beginning
Display first lines of file.
$ head -20 app.log # First 20 lines
$ head -c 100 app.log # First 100 bytes
$ head -n 50 app.log # Lines 1-50
Use for: File headers, beginning of logs
tail - View End
Display last lines of file.
$ tail -100 app.log # Last 100 lines
$ tail -f app.log # Follow (watch) file
$ tail -n 50 app.log # Last 50 lines
Use for: Recent entries, log monitoring
Advanced Log Monitoring Techniques
Real-time Log Monitoring
$ tail -f /var/log/nginx/access.log
# Follow multiple log files
$ tail -f /var/log/nginx/*.log
# Follow with line numbers
$ tail -f app.log | cat -n
# Follow and grep for errors
$ tail -f app.log | grep -E "(ERROR|FAILED|CRITICAL)"
Log Analysis & Statistics
# HTTP status code distribution
$ awk '{print $9}' access.log | sort | uniq -c
# Top 10 requested URLs
$ awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -10
# Response time statistics
$ awk '{print $(NF-1)}' access.log | sort -n | awk '
NR == 1 {min=$1}
{sum+=$1; array[NR]=$1}
END {
avg=sum/NR;
median=(NR%2==0) ? (array[NR/2]+array[NR/2+1])/2 : array[(NR+1)/2];
print "Min:", min, "Avg:", avg, "Median:", median, "Max:", array[NR]
}'
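For latency work, a high percentile is often more telling than the median. A sketch in the same style (the response-time field position is an assumption about the log format):
# Approximate 95th percentile of the response-time field
$ awk '{print $(NF-1)}' access.log | sort -n | awk '
{array[NR]=$1}
END {
idx = int(NR * 0.95); if (idx < 1) idx = 1
print "p95:", array[idx]
}'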
Complete Log Analysis Pipeline
$ tail -F /var/log/application/*.log |
grep --line-buffered -E "(ERROR|CRITICAL|FATAL)" |
while read -r line; do
echo "$(date): $line" >> /var/log/application/errors.log
echo "$line" | mail -s "Application Error Alert" admin@example.com
done
Combining Tools: Power Pipelines
The Art of Pipeline Construction
Mastering Linux means knowing how to combine tools effectively. Here are key principles:
- Filter early: Remove unwanted data as soon as possible
- Use appropriate tools: Each tool has strengths
- Keep pipelines readable: Use line breaks for complex pipelines
- Test incrementally: Build pipelines step by step
- Consider performance: Some operations are expensive
$ find /var/log -type f -name "*.log" -size +50M |
xargs ls -lh |
awk '{print $5, $9}' |
sort -hr |
head -10 |
while read -r size file; do
lines=$(wc -l < "$file")
errors=$(grep -c "ERROR" "$file")
echo "$file: $size, $lines lines, $errors errors"
done > /tmp/log_analysis_report.txt
Performance Optimization Tips
⚠️ Performance Considerations
- Use xargs with -P: Parallel processing for large file sets
- Avoid reading large files multiple times: Chain operations in a single pass
- Use grep -F for fixed strings: Faster than regex when possible
- Consider awk over multiple commands: A single process is faster
- Use sort -u instead of sort | uniq: More efficient
- Limit find scope: Narrow search paths when possible
- Use LC_ALL=C for ASCII sorting: Faster than locale-aware sort
$ find . -name "*.log" -print0 | xargs -0 -P 4 grep "ERROR"
# 4 parallel grep processes
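Two of the tips above in practice (the file names are illustrative):
# Fixed-string search avoids regex overhead when no pattern is needed
$ grep -F "connection refused" /var/log/syslog
# Byte-order sort plus deduplication in a single pass
$ LC_ALL=C sort -u access_ips.txt > unique_ips.txt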
Practical DevOps Scenarios
Real-World DevOps Tasks
Log Rotation Monitoring
$ ls -lh /var/log/nginx/access.log* | tail -5
$ du -sh /var/log/nginx/*.log
$ tail -n 1 /var/log/nginx/access.log.1
Error Rate Monitoring
$ start=$(date -d '1 hour ago' +'%H:%M:%S')
$ end=$(date +'%H:%M:%S')
$ total=$(sed -n "/$start/,/$end/p" app.log | wc -l)
$ errors=$(sed -n "/$start/,/$end/p" app.log | grep -c "ERROR")
$ echo "scale=2; $errors * 100 / $total" | bc
Configuration Audit
$ grep -r "password" /etc --include="*.conf" | grep -v "^#"
# Check for world-writable configs
$ find /etc -type f -perm /o+w -name "*.conf"
Cleanup Old Files
$ find /tmp -type f -mtime +7 -delete
# Clean Docker unused resources
$ docker system prune -af
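To make the cleanup recurring, the same find command can be scheduled. A hypothetical cron entry (the /etc/cron.d format shown here requires the user field):
# Purge /tmp files older than 7 days every night at 03:00
0 3 * * * root find /tmp -type f -mtime +7 -delete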
Essential Commands Cheat Sheet
Redirection & Piping
$ cmd > file # Stdout to file
$ cmd >> file # Append stdout
$ cmd &> file # Stdout & stderr
$ cmd 2> file # Stderr to file
$ cmd1 | cmd2 # Pipe output
Text Processing
$ sed 's/old/new/' file # Replace text
$ awk '{print $1}' file # Print column
$ cut -d, -f1 file # Extract field
$ sort file # Sort lines
$ uniq file # Remove adjacent duplicates
File Viewing
$ less file # Page through file
$ head -20 file # First 20 lines
$ tail -f file # Follow file
$ tac file # Reverse view
File Search
$ find . -name filename # Real-time search
$ locate filename # Fast search
$ which cmd # Find command
$ whereis cmd # Find binary, source, man
Practice Exercises for DevOps Engineers
- Monitor a growing log file in real-time and extract only lines containing "ERROR" or "WARNING".
- Find all files modified in the last 24 hours that are larger than 100MB and list them with sizes.
- Parse Nginx access logs to show the top 10 IP addresses by request count in the last hour.
- Convert a CSV file to JSON using only command-line tools (no Python/Ruby).
- Create a script that watches a directory for new files and processes them automatically.
- Extract all email addresses from a collection of text files and sort them uniquely.
- Find all configuration files in /etc that contain uncommented passwords.
- Calculate the average response time from web server logs for the last 5 minutes.
Key Takeaways
- Pipes are powerful: Combine simple tools to solve complex problems
- Learn the core tools well: grep, sed, awk, find, sort, uniq, cut, tr
- Redirect wisely: Control where input comes from and output goes
- Choose the right tool: Each tool has strengths and weaknesses
- Monitor logs effectively: tail -f for real-time, less for exploration
- Optimize pipelines: Filter early, use appropriate tools
- Practice pattern matching: Regular expressions are essential
- Automate repetitive tasks: Script common text processing workflows
Mastering Linux text processing and file operations will dramatically increase your productivity as a DevOps engineer. These skills enable you to quickly analyze data, automate tasks, and troubleshoot systems efficiently.