Linux File Operations & Text Processing

Published: December 2023 | Topic: Data Manipulation & Automation for DevOps

Mastering Linux text processing and file operations is essential for DevOps efficiency. These skills enable you to analyze logs, transform data, automate tasks, and process information at scale. Whether you're debugging applications, analyzing metrics, or building automation scripts, these tools form the backbone of your daily workflow.

The Unix Philosophy: Do One Thing Well

Linux tools follow the Unix philosophy: small, focused programs that do one thing well, combined together through pipes to solve complex problems. This approach gives you:

  • Composability: Combine simple tools to create complex solutions
  • Reusability: The same tools work across different scenarios
  • Maintainability: Simple tools are easier to understand and debug
  • Flexibility: Endless combinations for different tasks
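
As a small illustration of this composition, the sketch below chains four single-purpose tools to count which login shells are in use (the file and field are standard; the pipeline itself is just an illustrative example):

# Count how many accounts use each login shell
$ cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn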

1. Redirection & Piping: Data Flow Control

What are Redirection and Piping?

Redirection controls where command input comes from and where output goes. Piping connects the output of one command to the input of another. Together, they create powerful data processing pipelines.

Input → Command 1 (generates data) | Command 2 (filters/transforms) | Command 3 (final result) → Output
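
A minimal sketch of this flow (file names are illustrative):

# Generate a listing, filter it, and redirect the result to a file
$ ls -l /etc 2>/dev/null | grep "\.conf" > conf_listing.txt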

Standard Streams in Linux

Standard Input (stdin)

File descriptor 0. Default input source (usually keyboard).

# Read from stdin
$ cat < input.txt
# Here document
$ cat << EOF
Line 1
Line 2
EOF

Standard Output (stdout)

File descriptor 1. Default output destination (usually terminal).

# Redirect stdout to file
$ ls > output.txt
# Append to file
$ echo "new line" >> log.txt

Standard Error (stderr)

File descriptor 2. Error messages and diagnostics.

# Redirect stderr to file
$ command 2> errors.txt
# Redirect stderr to stdout
$ command 2>&1

Combined Redirections

Advanced redirection patterns.

# Redirect both stdout and stderr
$ command &> all_output.txt
# Same as above, POSIX-portable (order matters: 2>&1 after >)
$ command > output.txt 2>&1
# Discard output completely
$ command > /dev/null 2>&1

Practical Redirection Examples

Logging with Append

# Add timestamp to log
$ echo "$(date): Backup started" >> /var/log/backup.log
# Log command output
$ ./deploy.sh &>> /var/log/deploy.log

Error Handling

# Separate logs and errors
$ script.sh > output.log 2> error.log
# Count errors
$ grep "ERROR" app.log 2>/dev/null | wc -l

Input Redirection

# Feed file as input
$ sort < unsorted.txt
# Multiple inputs
$ cat file1 file2 | grep "pattern"

The Power of Pipes

Pipes (|) are among the most powerful features in Linux. They connect commands, creating data transformation pipelines:

# Count unique IPs accessing Nginx today
$ grep "$(date +%d/%b/%Y)" /var/log/nginx/access.log |
    awk '{print $1}' |
    sort |
    uniq -c |
    sort -rn |
    head -20

Each command does one thing: filter → extract → sort → count duplicates → rank → limit

2. Filters & Text Utilities: Data Transformation

What are Text Processing Filters?

Filters are commands that read input, transform it, and write output. They're designed to work with pipes, making them perfect for data processing pipelines.

Essential Text Processing Tools

grep - Pattern Search

Search for patterns using regular expressions.

$ grep "error" app.log
$ grep -i "warning" app.log # Case insensitive
$ grep -v "DEBUG" app.log # Invert match
$ grep -r "TODO" src/ # Recursive
$ grep -E "[0-9]{3}-[0-9]{3}" app.log # Extended regex

sed - Stream Editor

Filter and transform text streams.

$ sed 's/old/new/g' file.txt
$ sed -i 's/foo/bar/' config.conf # In-place
$ sed '/^#/d' file.conf # Delete comment lines
$ sed -n '10,20p' file.log # Print lines 10-20

awk - Text Processing Language

Powerful programming language for text processing.

$ awk '{print $1}' file.log # Print first column
$ awk -F: '{print $1}' /etc/passwd # Custom delimiter
$ awk '$3 > 100 {print $0}' # Filter rows
$ awk '{sum+=$3} END {print sum}' # Sum column

cut - Extract Columns

Remove sections from lines of files.

$ cut -d, -f1,3 data.csv
# Comma delimiter, fields 1 & 3
$ cut -c1-10 file.txt
# Characters 1-10

sort - Sort Lines

Sort lines of text files.

$ sort file.txt
$ sort -r file.txt # Reverse
$ sort -n numbers.txt # Numeric
$ sort -u file.txt # Unique

uniq - Report/Filter Duplicates

Report or omit repeated lines. Note that uniq only compares adjacent lines, so input usually needs to be sorted first.

$ sort file.txt | uniq
$ sort file.txt | uniq -c # Count occurrences
$ sort file.txt | uniq -d # Only show duplicates
$ sort file.txt | uniq -u # Only show uniques

tr - Translate Characters

Translate or delete characters.

$ tr 'a-z' 'A-Z' < file.txt
# Convert to uppercase
$ tr -d '\r' < dosfile.txt > unixfile.txt
# Remove carriage returns

wc - Word Count

Count lines, words, and bytes.

$ wc file.txt
# Lines Words Bytes
$ wc -l file.txt # Only lines
$ wc -w file.txt # Only words
$ wc -c file.txt # Only bytes

Advanced Text Processing Examples

Real-World DevOps Text Processing

# Extract top 10 IP addresses from web logs
$ awk '{print $1}' /var/log/nginx/access.log |
    sort | uniq -c | sort -rn | head -10

# Find all PHP files containing "mysql_connect" (deprecated function)
$ grep -r "mysql_connect" /var/www --include="*.php"

# Convert CSV to TSV (tab-separated)
$ sed 's/,/\t/g' data.csv > data.tsv

# Sum memory usage of all Java processes
$ ps aux | grep "[j]ava" | awk '{sum+=$4} END {print sum "%"}'
# The [j]ava pattern keeps grep from matching its own process

# Remove duplicate lines while preserving order
$ awk '!seen[$0]++' file.txt

⚠️ Common Text Processing Pitfalls

  • Special characters: Always quote regex patterns properly
  • Large files: Some tools load entire file into memory
  • Encoding issues: Use file -i to check encoding
  • Performance: Chain operations efficiently (filter early)
  • Locale settings: Sort order can vary by locale (see the sketch below)
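
Two of these pitfalls are easy to demonstrate; a minimal sketch (file names are illustrative):

# Quote regex patterns so the shell can't expand characters like * or ?
$ grep 'ERROR.*timeout' app.log
# Check a file's encoding before processing it
$ file -i data.csv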

3. Managing Large Logs: Viewing & Monitoring

What is Log Management?

Log management involves viewing, monitoring, and analyzing log files efficiently. In DevOps, you work with large log files daily for debugging and monitoring.

Log File Navigation Techniques

  • Beginning: head - view the start
  • Middle: less - page through interactively
  • End: tail - view the end

Essential Log Viewing Commands

cat - Concatenate Files

Display entire file content.

$ cat app.log
$ cat file1 file2 file3
$ cat > newfile.txt # Create file (Ctrl-D to finish)
$ cat file1 file2 > combined.txt

Use for: Small files, combining files

tac - Reverse cat

Display file in reverse order (last line first).

$ tac app.log
$ tac file.txt | head -20 # Last 20 lines, newest first

Use for: Viewing recent entries first

less - File Pager

View file one page at a time.

$ less app.log
# Navigation:
Space - Next page
b - Previous page
/pattern - Search
g - Go to start
G - Go to end
q - Quit

Use for: Large files, interactive viewing

more - Basic Pager

A simpler pager with fewer features than less.

$ more app.log
# Basic navigation only
Space - Next page
Enter - Next line
q - Quit

Use for: Simple pagination

head - View Beginning

Display first lines of file.

$ head app.log
$ head -20 app.log # First 20 lines
$ head -c 100 app.log # First 100 bytes
$ head -n -5 app.log # All but the last 5 lines (GNU)

Use for: File headers, beginning of logs

tail - View End

Display last lines of file.

$ tail app.log
$ tail -100 app.log # Last 100 lines
$ tail -f app.log # Follow (watch) file
$ tail -n 50 app.log # Last 50 lines

Use for: Recent entries, log monitoring

Advanced Log Monitoring Techniques

Real-time Log Monitoring

# Follow single log file
$ tail -f /var/log/nginx/access.log

# Follow multiple log files
$ tail -f /var/log/nginx/*.log

# Follow with line numbers
$ tail -f app.log | cat -n

# Follow and grep for errors
$ tail -f app.log | grep -E "(ERROR|FAILED|CRITICAL)"

Log Analysis & Statistics

# Count HTTP response codes
$ awk '{print $9}' access.log | sort | uniq -c

# Top 10 requested URLs
$ awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -10

# Response time statistics
$ awk '{print $(NF-1)}' access.log | sort -n | awk '
  NR == 1 {min=$1}
  {sum+=$1; array[NR]=$1}
  END {
    avg=sum/NR;
    median=(NR%2==0) ? (array[NR/2]+array[NR/2+1])/2 : array[(NR+1)/2];
    print "Min:", min, "Avg:", avg, "Median:", median, "Max:", array[NR]
  }'

Complete Log Analysis Pipeline

# Monitor logs in real-time, extract errors, and send alerts
$ tail -F /var/log/application/*.log |
    grep --line-buffered -E "(ERROR|CRITICAL|FATAL)" |
    while read -r line; do
      echo "$(date): $line" >> /var/log/application/errors.log
      echo "$line" | mail -s "Application Error Alert" admin@example.com
    done

Combining Tools: Power Pipelines

The Art of Pipeline Construction

Mastering Linux means knowing how to combine tools effectively. Here are key principles:

  1. Filter early: Remove unwanted data as soon as possible
  2. Use appropriate tools: Each tool has strengths
  3. Keep pipelines readable: Use line breaks for complex pipelines
  4. Test incrementally: Build pipelines step by step
  5. Consider performance: Some operations are expensive
# Complete DevOps pipeline: Find, analyze, and report on large log files
$ find /var/log -type f -name "*.log" -size +50M -print0 |
    xargs -0 ls -lh |
    awk '{print $5, $9}' |
    sort -hr |
    head -10 |
    while read -r size file; do
      lines=$(wc -l < "$file")
      errors=$(grep -c "ERROR" "$file")
      echo "$file: $size, $lines lines, $errors errors"
    done > /tmp/log_analysis_report.txt

Performance Optimization Tips

⚠️ Performance Considerations

  • Use xargs with -P: Parallel processing for large file sets
  • Avoid reading large files multiple times: Chain operations in single pass
  • Use grep -F for fixed strings: Faster than regex when possible
  • Consider awk over multiple commands: Single process is faster
  • Use sort -u instead of sort | uniq: More efficient
  • Limit find scope: Narrow search paths when possible
  • Use LC_ALL=C for ASCII sorting: Faster than locale-aware sort
# Parallel processing example
$ find . -name "*.log" -print0 | xargs -0 -P 4 grep "ERROR"
# 4 parallel grep processes
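
The fixed-string and single-pass tips are just as easy to apply; a minimal sketch (file names are illustrative):

# Fixed strings skip the regex engine entirely
$ grep -F "connection refused" app.log
# sort -u replaces sort | uniq; LC_ALL=C forces fast byte-order sorting
$ LC_ALL=C sort -u ips.txt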

Practical DevOps Scenarios

Real-World DevOps Tasks

Log Rotation Monitoring

# Check if logs are rotating properly
$ ls -lh /var/log/nginx/access.log* | tail -5
$ du -sh /var/log/nginx/*.log
$ tail -n 1 /var/log/nginx/access.log.1

Error Rate Monitoring

# Calculate error rate (%) in the last hour
# Assumes log lines contain HH:MM:SS timestamps matching these boundaries
$ start=$(date -d '1 hour ago' +'%H:%M:%S')
$ end=$(date +'%H:%M:%S')
$ total=$(sed -n "/$start/,/$end/p" app.log | wc -l)
$ errors=$(sed -n "/$start/,/$end/p" app.log | grep -c "ERROR")
$ echo "scale=2; $errors * 100 / $total" | bc

Configuration Audit

# Find all config files with passwords (grep -r prefixes each line with the
# file name, so filter comments after the colon, not at line start)
$ grep -r "password" /etc --include="*.conf" | grep -vE ':[[:space:]]*#'
# Check for world-writable configs
$ find /etc -type f -perm /o+w -name "*.conf"

Cleanup Old Files

# Remove temp files older than 7 days
$ find /tmp -type f -mtime +7 -delete
# Clean Docker unused resources
$ docker system prune -af

Essential Commands Cheat Sheet

Redirection & Piping

$ cmd > file # Stdout to file
$ cmd >> file # Append stdout
$ cmd &> file # Stdout & stderr
$ cmd 2> file # Stderr to file
$ cmd1 | cmd2 # Pipe output

Text Processing

$ grep pattern file # Search text
$ sed 's/old/new/' file # Replace text
$ awk '{print $1}' file # Print column
$ cut -d, -f1 file # Extract field
$ sort file # Sort lines
$ sort file | uniq # Remove duplicates

File Viewing

$ cat file # View entire file
$ less file # Page through file
$ head -20 file # First 20 lines
$ tail -f file # Follow file
$ tac file # Reverse view

File Search

$ find . -name "*.log" # Find files
$ locate filename # Fast search
$ which cmd # Find command
$ whereis cmd # Find binary, source, man

Practice Exercises for DevOps Engineers

  1. Monitor a growing log file in real-time and extract only lines containing "ERROR" or "WARNING".
  2. Find all files modified in the last 24 hours that are larger than 100MB and list them with sizes.
  3. Parse Nginx access logs to show the top 10 IP addresses by request count in the last hour.
  4. Convert a CSV file to JSON using only command-line tools (no Python/Ruby).
  5. Create a script that watches a directory for new files and processes them automatically.
  6. Extract all email addresses from a collection of text files and sort them uniquely (a hint follows this list).
  7. Find all configuration files in /etc that contain uncommented passwords.
  8. Calculate the average response time from web server logs for the last 5 minutes.
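
As a hint for exercise 6, a minimal sketch (the directory name is illustrative):

# Pull unique email addresses out of every .txt file under notes/
$ grep -rEoh "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}" --include="*.txt" notes/ | sort -u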

Key Takeaways

  • Pipes are powerful: Combine simple tools to solve complex problems
  • Learn the core tools well: grep, sed, awk, find, sort, uniq, cut, tr
  • Redirect wisely: Control where input comes from and output goes
  • Choose the right tool: Each tool has strengths and weaknesses
  • Monitor logs effectively: tail -f for real-time, less for exploration
  • Optimize pipelines: Filter early, use appropriate tools
  • Practice pattern matching: Regular expressions are essential
  • Automate repetitive tasks: Script common text processing workflows

Mastering Linux text processing and file operations will dramatically increase your productivity as a DevOps engineer. These skills enable you to quickly analyze data, automate tasks, and troubleshoot systems efficiently.
