

File Operations & Text Processing: The DevOps Power Tools

Master the essential text manipulation and file processing skills that every DevOps engineer needs daily.

📅 Published: Feb 2026
⏱️ Estimated Reading Time: 20 minutes
🏷️ Tags: Linux Text Processing, grep, sed, awk, Redirection, Log Analysis, DevOps Tools


🔄 Redirection & Piping: Controlling Input and Output

Understanding the Three Streams

Before we dive into redirection, you need to understand that Linux handles three types of data streams:

  1. Standard Input (stdin) - Where programs read data from (usually your keyboard)

  2. Standard Output (stdout) - Where programs write normal output (usually your screen)

  3. Standard Error (stderr) - Where programs write error messages (also usually your screen)

These streams have numbers:

  • 0 = stdin

  • 1 = stdout

  • 2 = stderr
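These numbers can be used directly in redirections (covered next). A minimal sketch that splits a command's two output streams into separate files — the brace group just produces one line on each stream for the demo:

```shell
# Write a line to stdout and a line to stderr from the same command group,
# then redirect each stream to its own file by number.
{ echo "normal output"; echo "an error" >&2; } 1>out.txt 2>err.txt

cat out.txt   # contains: normal output
cat err.txt   # contains: an error
rm -f out.txt err.txt
```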

Output Redirection: Saving Results to Files

The > symbol redirects output to a file instead of showing it on screen. Think of it as saying "take the output and put it in this file."

bash
# Save directory listing to a file
ls -la > files.txt

# Append to a file instead of overwriting
echo "New entry" >> log.txt

# Save only errors to a file
some_command 2> errors.log

# Save both output and errors
command > output.log 2>&1
# Or the modern way:
command &> combined.log

Real-world example: You're running a backup script and want to save both the progress and any errors:

bash
backup_script.sh > backup.log 2>&1
# This saves everything to backup.log

Common mistake: Using > instead of >>. The single > overwrites the file, while >> appends to it. This is especially important for logs!
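A quick way to see the difference for yourself, using a throwaway file:

```shell
echo "first"  > demo.txt   # creates (or truncates) demo.txt
echo "second" > demo.txt   # OVERWRITES: file now contains only "second"
cat demo.txt               # -> second

echo "third" >> demo.txt   # APPENDS: file now has two lines
cat demo.txt               # -> second, then third
rm -f demo.txt
```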

Input Redirection: Feeding Files to Programs

The < symbol takes input from a file instead of the keyboard.

bash
# Count words in a file
wc < document.txt

# Sort lines from a file
sort < unsorted.txt > sorted.txt

# Send email with file content
mail user@example.com < message.txt

When to use: Most commands can take files as arguments directly (wc document.txt), but < is useful in pipelines or with commands that only read from stdin.
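`tr` (covered later) is a good example of a command that only reads stdin — it accepts no filename operands, so you must use `<` or a pipe:

```shell
printf 'hello\n' > sample.txt

# Feed the file to tr via input redirection
tr 'a-z' 'A-Z' < sample.txt    # -> HELLO

# "tr 'a-z' 'A-Z' sample.txt" would fail: tr does not accept file arguments
rm -f sample.txt
```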

Piping: Connecting Commands Together

The | symbol (pipe) takes the output of one command and feeds it as input to another. This is where Linux becomes incredibly powerful!

bash
# Simple pipe: count files in directory
ls | wc -l

# Multiple pipes: find large files and sort them
find /home -type f -size +100M | sort -rh | head -10

# Process logs: find errors, count them, sort by frequency
grep "ERROR" app.log | cut -d' ' -f3 | sort | uniq -c | sort -rn

Visual analogy: Think of pipes as an assembly line in a factory. Raw data goes in one end, gets processed by multiple machines (commands), and finished information comes out the other end.

Real DevOps scenario: Monitoring website access logs:

bash
tail -1000 access.log | grep "404" | cut -d' ' -f7 | sort | uniq -c | sort -rn | head -20
# This shows the top 20 "404 not found" pages on your website

🔍 Filters & Text Utilities: Your Swiss Army Knife

grep: The Search Master

We touched on grep earlier, but let's go deeper. The name comes from the ed editor command g/re/p - "globally search for a regular expression and print" - and that's exactly what it does: it finds patterns in text.

bash
# Basic search
grep "error" logfile.txt

# Case-insensitive search for any of several patterns (very common)
grep -i "error\|warning\|fail" logfile.txt

# Show line numbers (great for debugging)
grep -n "function" script.py

# Show context (lines before and after matches)
grep -B2 -A2 "exception" app.log
# Shows 2 lines before and 2 lines after each exception

# Inverse search (lines NOT containing pattern)
grep -v "DEBUG" logfile.txt
# Shows all lines except DEBUG messages

# Count matches
grep -c "GET" access.log
# Counts how many GET requests

# Search in multiple files
grep -r "TODO" /home/user/projects/

Regular expressions power: grep supports patterns, not just exact text:

  • grep "^2024" log.txt - Lines starting with 2024

  • grep "fail$" log.txt - Lines ending with fail

  • grep "error.*database" log.txt - Error followed by database somewhere on line
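The anchors are easy to check on throwaway data (the sample lines below are made up):

```shell
printf '2024-01-01 start\nrestart at 2024\njob did fail\n' > demo.log

grep '^2024' demo.log    # -> 2024-01-01 start   (line STARTS with 2024)
grep 'fail$' demo.log    # -> job did fail       (line ENDS with fail)
# "restart at 2024" matches neither: 2024 is not at the start of the line
rm -f demo.log
```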

Real use: Finding all database connection strings in configuration files:

bash
grep -r "mysql://\|postgresql://" /etc/

sed: The Stream Editor

sed stands for Stream EDitor. It edits text on-the-fly without opening files. Think of it as "search and replace" for the command line.

bash
# Simple replacement
sed 's/old/new/g' file.txt
# s = substitute, g = global (all occurrences)

# Replace and save to new file
sed 's/error/ERROR/g' log.txt > fixed_log.txt

# Replace only on specific lines
sed '5,10s/foo/bar/g' file.txt
# Only lines 5 through 10

# Delete lines containing pattern
sed '/debug/d' log.txt
# d = delete

# Print specific lines
sed -n '10,20p' largefile.txt
# Shows only lines 10-20

Common DevOps uses:

bash
# Remove comments from config file
sed '/^#/d' /etc/nginx/nginx.conf

# Change port in configuration
sed -i 's/port=8080/port=9000/g' config.properties
# -i = edit file in-place (be careful!)

# Extract specific column from CSV
sed 's/,/ /g' data.csv | awk '{print $3}'

Pro tip: Use sed 's/pattern/replacement/' without -i first to see what changes would be made, then add -i to actually modify the file.
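The preview-then-apply workflow from the tip, sketched on a throwaway file (note: this uses GNU sed's `-i`; BSD/macOS sed needs `-i ''`):

```shell
printf 'port=8080\nhost=localhost\n' > config.tmp

# 1. Preview: output goes to the screen, the file is untouched
sed 's/port=8080/port=9000/' config.tmp

# 2. Looks right? Apply in place with -i
sed -i 's/port=8080/port=9000/' config.tmp
grep port config.tmp    # -> port=9000
rm -f config.tmp
```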

awk: The Programming Powerhouse

awk is a complete programming language for text processing. It's named after its creators: Aho, Weinberger, and Kernighan.

Basic concept: awk processes text line by line, splitting each line into fields.

bash
# Print first column (default separator is whitespace)
awk '{print $1}' file.txt

# Print specific columns
awk '{print $1, $3, $5}' data.txt

# Use different separator (CSV files)
awk -F',' '{print $2, $4}' data.csv
# -F sets field separator

# Add header
awk 'BEGIN {print "Name\tAge"} {print $1, $2}' people.txt

# Conditional printing
awk '$3 > 100 {print $1, $3}' sales.txt
# Only print if third field > 100

# Calculate totals
awk '{sum += $3} END {print "Total:", sum}' sales.txt

Real-world DevOps examples:

bash
# Parse Apache/nginx access logs
awk '{print $1, $7, $9}' access.log | head -20
# Shows IP, page, status code

# Find average response time
awk '{sum += $NF} END {print "Avg:", sum/NR}' response_times.log
# NR = Number of Records (lines)

# Monitor disk usage
df -h | awk '/\/$/ {print "Root fs is", $5, "full"}'
# Checks root filesystem usage percentage

Why awk is worth learning: Once you understand basic awk, you can solve complex text processing problems in one line that would take 20 lines in Python.

cut: Extracting Columns

cut is simpler than awk for basic column extraction.

bash
# Extract first column (tab-separated by default)
cut -f1 data.txt

# Extract multiple columns
cut -f1,3,5 data.txt

# Use different delimiter
cut -d',' -f2,4 data.csv
# -d sets delimiter, -f specifies fields

# Extract by character position
cut -c1-10 file.txt
# First 10 characters of each line

# Extract by byte position (for binary safety)
cut -b1-20 file.txt

When to use cut vs awk:

  • Use cut for simple column extraction with consistent delimiters

  • Use awk when you need calculations, conditions, or complex patterns
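The difference shows up immediately with inconsistent whitespace: cut treats every single delimiter character as a field boundary, while awk (by default) splits on runs of whitespace:

```shell
printf 'alice   42\n' > demo.txt      # multiple spaces between fields

cut -d' ' -f2 demo.txt    # -> (empty) - the 2nd space-delimited field is ""
awk '{print $2}' demo.txt # -> 42 - awk collapses the run of spaces
rm -f demo.txt
```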

sort: Ordering Data

sort organizes lines of text. It's more powerful than you might think!

bash
# Basic alphabetical sort
sort names.txt

# Reverse sort
sort -r numbers.txt

# Numeric sort (not alphabetical!)
sort -n numbers.txt

# Sort by specific column
sort -k2,2 data.txt
# Sorts by second column

# Remove duplicates while sorting
sort -u duplicates.txt

# Case-insensitive sort
sort -f mixedcase.txt

# Human-readable number sort (K, M, G suffixes)
sort -h sizes.txt

Common use with other commands:

bash
# Find most common errors
grep "ERROR" app.log | cut -d' ' -f3 | sort | uniq -c | sort -rn

# Sort files by size
ls -l | sort -k5 -n -r

uniq: Finding Unique Lines

uniq removes duplicate consecutive lines. Important: It only removes adjacent duplicates, so you usually need sort first.

bash
# Remove consecutive duplicates
uniq repeated.txt

# Count occurrences
uniq -c data.txt

# Show only duplicates
uniq -d data.txt

# Show only unique lines
uniq -u data.txt

The classic pattern: sort | uniq -c | sort -rn

  • First sort groups duplicates together

  • uniq -c counts them

  • Final sort -rn sorts by count (descending)
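Here is the pattern end to end on tiny sample data:

```shell
printf 'db\nweb\ndb\ncache\ndb\nweb\n' | sort | uniq -c | sort -rn
#   3 db
#   2 web
#   1 cache
# (uniq -c right-pads the counts)
```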

tr: Character Translation

tr (translate) changes or deletes characters. It works on individual characters, not words.

bash
# Convert lowercase to uppercase
echo "hello" | tr 'a-z' 'A-Z'
# Output: HELLO

# Replace spaces with commas
echo "a b c d" | tr ' ' ','
# Output: a,b,c,d

# Delete specific characters
echo "hello123" | tr -d '0-9'
# Output: hello (removes digits)

# Squeeze repeated characters
echo "hellooo    world" | tr -s ' o'
# Output: hello world (squeezes spaces and 'o's)

# Delete non-printable characters (keep \n, or every line gets joined)
cat binaryfile.txt | tr -cd '[:print:]\n'

Useful for cleaning data:

bash
# Strip Windows carriage returns, then replace spaces with underscores
cat dirty.csv | tr -d '\r' | tr ' ' '_' > clean.csv

wc: Word Count (and More)

wc counts lines, words, and characters.

bash
# Count lines, words, characters
wc document.txt

# Count only lines
wc -l access.log
# Great for: "How many requests today?"

# Count only words
wc -w essay.txt

# Count only characters
wc -c data.txt

# Count files in directory
ls | wc -l

Real use: Monitoring log growth:

bash
# Error count per hour (works when the hour precedes the first colon,
# e.g. timestamps like "2024-01-01 14:03:22")
grep "ERROR" app.log | cut -d: -f1 | uniq -c
# Or, for the total error count:
grep -c "ERROR" app.log

🔎 Searching Files with find and locate

find: The Comprehensive File Searcher

find is incredibly powerful for locating files based on various criteria.

bash
# Basic: Find files by name
find /home -name "*.txt"

# Case-insensitive name search
find /var -iname "*.log"

# Find by file type
find . -type f           # Regular files
find . -type d           # Directories
find . -type l           # Symbolic links

# Find by size
find / -size +100M 2>/dev/null      # Larger than 100MB
find /home -size -1k                # Smaller than 1KB

# Find by modification time
find /var/log -mtime -7             # Modified in last 7 days
find /backup -mtime +30             # Older than 30 days

# Find by permissions
find /home -perm 644                # Files with 644 permissions
find / -perm -4000 2>/dev/null      # SUID files (security check!)

# Find and execute actions
find /tmp -name "*.tmp" -delete     # Find and delete
find . -name "*.sh" -exec chmod +x {} \;  # Find and make executable

The -exec magic: {} is replaced with the filename, and \; ends the command.
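One detail worth knowing: `\;` runs the command once per matched file, while `+` batches many filenames into a single invocation, which is much faster on large trees. A sketch on a throwaway directory:

```shell
mkdir -p demo && touch demo/a.sh demo/b.sh

# One chmod process PER FILE:
find demo -name "*.sh" -exec chmod +x {} \;

# One chmod process for ALL files (usually preferred):
find demo -name "*.sh" -exec chmod +x {} +

rm -rf demo
```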

Security use: Finding world-writable files (potential security issue):

bash
find / -type f -perm -0002 2>/dev/null

locate: The Fast Search

locate uses a database for lightning-fast searches, but it might not find recently created files.

bash
# Fast search
locate nginx.conf

# Case-insensitive
locate -i "README"

# Limit number of results
locate "*.log" | head -20

# Update the search database
sudo updatedb
# Run this if locate doesn't find new files

find vs locate:

  • Use find when you need precise criteria (size, time, permissions)

  • Use locate when you just need to find a file quickly by name


📊 Managing Large Logs: The Essential Toolkit

The Right Tool for Each Job

When working with logs (which DevOps engineers do constantly), you need different tools for different tasks:

cat: For Small Files Only

bash
# Good for small config files
cat /etc/hosts

# Bad for large logs!
cat /var/log/syslog  # Don't do this for 1GB files!

less: Your Go-To Log Viewer

bash
# Open log file
less /var/log/nginx/access.log

# Within less:
# Space = Next page
# b = Previous page
# /error = Search for "error"
# n = Next match
# N = Previous match
# g = Go to beginning
# G = Go to end
# q = Quit

# Open with line numbers
less -N app.log

# Follow mode (like tail -f)
less +F app.log
# Ctrl+C to stop following, then F to resume

tail: Watching Live Logs

bash
# Last 10 lines (default)
tail app.log

# Last 100 lines
tail -100 app.log

# Watch log in real-time (CRITICAL for debugging!)
tail -f /var/log/nginx/access.log

# Watch multiple logs
tail -f /var/log/nginx/*.log

# Show lines from specific point
tail -n +1000 app.log  # From line 1000 to end

# Stop following automatically when a process exits (GNU tail)
tail -f --pid=$(pgrep -o nginx) /var/log/nginx/error.log
# pgrep -o = oldest matching process, so --pid gets exactly one PID

Real scenario: Debugging a web application issue:

bash
# In one terminal: watch error log
tail -f /var/log/nginx/error.log

# In another: watch application log
tail -f /var/log/app/app.log

# Reproduce the issue and watch both logs simultaneously

head: Checking Beginnings

bash
# First 10 lines (default)
head config.yaml

# First 50 lines
head -50 largefile.csv

# Check file structure
head -20 database_dump.sql
# Shows what type of SQL dump it is

tac: cat Backwards!

Yes, tac is cat spelled backwards, and it shows files in reverse order!

bash
# View log from newest to oldest
tac /var/log/syslog | head -100

# Useful for: "What just happened?"
# Most recent entries are shown first

🎯 Real-World DevOps Scenarios

Scenario 1: Analyzing Web Server Logs

Problem: Website is slow. Need to find the slowest pages.

bash
# Parse nginx access log for slow requests
awk '$NF > 5 {print $7, $NF}' /var/log/nginx/access.log | sort -k2 -rn | head -20
# $NF = last field; this assumes your log format logs the request time last
# Shows URL and time for requests taking >5 seconds

# Rough alternative: filter for response times that look like " 5.xxx"
grep " 5\." /var/log/nginx/access.log | awk '{print $7, $NF}' | sort | uniq -c | sort -rn

Scenario 2: Finding Large Files for Cleanup

Problem: Disk is 95% full. Need to find largest files.

bash
# Find files larger than 100MB
find / -type f -size +100M -print0 2>/dev/null | xargs -0 ls -lh | sort -k5 -hr | head -20
# -print0 / xargs -0 keep filenames with spaces intact

# Find by directory
du -ah /var/log 2>/dev/null | sort -rh | head -20

# Find and delete old log files
find /var/log -name "*.log" -type f -mtime +30 -delete

Scenario 3: Monitoring Application Health

Problem: Need to create a daily error report.

bash
# Create error summary report
REPORT="/var/log/error-report-$(date +%Y%m%d).txt"

echo "=== Daily Error Report ===" > "$REPORT"
echo "Generated: $(date)" >> "$REPORT"
echo >> "$REPORT"

echo "Top 10 errors:" >> "$REPORT"
grep "ERROR" /var/log/app/app.log | cut -d' ' -f4- | sort | uniq -c | sort -rn | head -10 >> "$REPORT"

echo >> "$REPORT"
echo "Error count by hour:" >> "$REPORT"
grep "ERROR" /var/log/app/app.log | cut -d: -f2 | sort | uniq -c >> "$REPORT"

echo "Report saved to: $REPORT"
# Quoting "$REPORT" keeps the script safe if the path ever contains spaces

Scenario 4: Processing CSV Data

Problem: Need to extract specific data from a large CSV.

bash
# Extract column 2 and 4 from CSV
awk -F',' '{print $2, $4}' largefile.csv > extracted.txt

# Filter rows where column 3 > 100
awk -F',' '$3 > 100 {print $0}' data.csv

# Calculate average of column 5
awk -F',' '{sum += $5; count++} END {print "Average:", sum/count}' data.csv

# Convert CSV to TSV (tab-separated)
sed 's/,/\t/g' data.csv > data.tsv

💡 Pro Tips and Best Practices

1. Chain Commands Wisely

bash
# Good: Clear chain of operations
cat file.txt | grep "error" | sort | uniq

# Better: Avoid unnecessary cat (Useless Use of Cat award!)
grep "error" file.txt | sort | uniq

# Best: Use shell features when possible
< file.txt grep "error" | sort | uniq

2. Handle Errors Gracefully

bash
# Ignore "permission denied" errors
find / -name "*.conf" 2>/dev/null

# Save errors separately
some_command > output.txt 2> errors.txt

# Send errors to /dev/null (discard them)
noisy_command 2>/dev/null

3. Use Temporary Files for Complex Operations

bash
# For multi-step processing
grep "ERROR" app.log > errors.tmp
awk '{print $2, $5}' errors.tmp > processed.tmp
sort processed.tmp > final.txt
rm -f errors.tmp processed.tmp

# Or use process substitution
diff <(sort file1.txt) <(sort file2.txt)

4. Learn Regular Expressions

Even basic regex makes you 10x more powerful:

  • . = any single character

  • * = zero or more of previous

  • ^ = start of line

  • $ = end of line

  • [abc] = any of a, b, or c

  • [0-9] = any digit
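A few of these in action on made-up sample lines:

```shell
printf 'cat\ncaat\ncart\nc1t\n' > demo.txt

grep 'ca*t' demo.txt      # 'a*' = zero or more a's  -> cat, caat
grep 'c.t' demo.txt       # '.' = any one character  -> cat, c1t
grep '[0-9]' demo.txt     # any digit                -> c1t
rm -f demo.txt
```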


📋 Quick Reference Cheat Sheet

Task                Command    Example
Save output         >          ls > files.txt
Append output       >>         echo "new" >> log.txt
Redirect errors     2>         cmd 2> errors.log
Pipe output         |          ls | wc -l
Search text         grep       grep "error" log.txt
Search/replace      sed        sed 's/old/new/g' file.txt
Column extraction   awk        awk '{print $1}' file.txt
Simple columns      cut        cut -d',' -f2 file.csv
Sort data           sort       sort -n numbers.txt
Unique lines        uniq       sort file.txt | uniq
Translate chars     tr         echo "hi" | tr 'a-z' 'A-Z'
Count elements      wc         wc -l file.txt
Find files          find       find / -name "*.log"
Fast find           locate     locate nginx.conf
View file           less       less largefile.log
Watch end           tail -f    tail -f app.log
View beginning      head       head -20 file.txt
Reverse view        tac        tac log.txt | head -50

🚀 Practice Exercises

Exercise 1: Create a Log Analysis Script

bash
# Create a script that analyzes an Apache/nginx log
cat > analyze_log.sh << 'EOF'
#!/bin/bash
LOG_FILE="${1:-/var/log/nginx/access.log}"

echo "=== Log Analysis: $LOG_FILE ==="
echo

echo "1. Total requests:"
wc -l "$LOG_FILE"

echo
echo "2. Top 10 IP addresses:"
awk '{print $1}' "$LOG_FILE" | sort | uniq -c | sort -rn | head -10

echo
echo "3. Top 10 requested pages:"
awk '{print $7}' "$LOG_FILE" | sort | uniq -c | sort -rn | head -10

echo
echo "4. Response code summary:"
awk '{print $9}' "$LOG_FILE" | sort | uniq -c | sort -rn

echo
echo "5. Requests per hour:"
awk '{print $4}' "$LOG_FILE" | cut -d: -f2 | sort | uniq -c
EOF

chmod +x analyze_log.sh
./analyze_log.sh

Exercise 2: Clean and Process Data

bash
# Create sample data
cat > sales.csv << 'EOF'
Date,Product,Quantity,Price
2024-01-01,Widget,10,25.50
2024-01-02,Gadget,5,99.99
2024-01-01,Widget,15,25.50
2024-01-03,Thingy,8,45.00
EOF

# Process it
echo "Total revenue per product:"
awk -F',' 'NR>1 {revenue[$2] += $3 * $4} END {for (p in revenue) print p, revenue[p]}' sales.csv

echo
echo "Daily sales:"
awk -F',' 'NR>1 {print $1, $3, $4}' sales.csv | sort

Exercise 3: Monitor System in Real-time

bash
# Create a monitoring dashboard
watch -n 5 '
echo "=== System Status ==="
echo "Time: $(date)"
echo
echo "CPU Load: $(uptime)"
echo
echo "Memory:"
free -h | grep -E "^Mem:"
echo
echo "Disk:"
df -h / | tail -1
echo
echo "Top 5 processes by CPU:"
ps aux --sort=-%cpu | head -6
'
# Press Ctrl+C to stop

🔗 Master Text Processing with Hands-on Labs

Text processing skills are what separate junior from senior DevOps engineers. The best way to learn is through guided practice with real-world scenarios.

👉 Practice file operations and text processing with real datasets at:
https://devops.trainwithsky.com/

Our interactive labs give you:

  • Real log files to analyze

  • Production-like datasets

  • Step-by-step challenges

  • Immediate feedback

  • Progress from beginner to advanced


Common Questions Answered

Q: When should I use awk vs sed?
A: Use sed for simple text transformations (search/replace, deletion). Use awk when you need to work with columns, do calculations, or have conditional logic.

Q: Why does uniq not remove all duplicates?
A: uniq only removes adjacent duplicates. Always use sort | uniq to remove all duplicates.

Q: Is it bad to use cat with pipes?
A: Sometimes yes (UUOC - Useless Use of Cat). Use < file command or command file when possible. But for beginners, cat file | command is fine and more readable.

Q: How do I handle files with spaces in names?
A: Use find -print0 | xargs -0 or use find -exec. For example: find . -name "*.txt" -print0 | xargs -0 ls -l

Q: What's the best way to learn regular expressions?
A: Start with simple patterns (^$.*), practice on regex testing websites, and gradually add more complexity.

Stuck on a text processing problem? Share your challenge in the comments and we'll help you solve it! 💬
