Linux File Operations & Text Processing - DevOps Data Manipulation Mastery

Linux File Operations & Text Processing

Published: December 2023 | Topic: Data Manipulation & Automation for DevOps

Mastering Linux text processing and file operations is essential for DevOps efficiency. These skills enable you to analyze logs, transform data, automate tasks, and process information at scale. Whether you're debugging applications, analyzing metrics, or building automation scripts, these tools form the backbone of your daily workflow.

The Unix Philosophy: Do One Thing Well

Linux tools follow the Unix philosophy: small, focused programs that do one thing well, combined together through pipes to solve complex problems. This approach gives you:

Composability: Combine simple tools to create complex solutions
Reusability: The same tools work across different scenarios
Maintainability: Simple tools are easier to understand and debug
Flexibility: Endless combinations for different tasks

1. Redirection & Piping: Data Flow Control

What are Redirection and Piping?

Redirection controls where command input comes from and where output goes. Piping connects the output of one command to the input of another. Together, they create powerful data processing pipelines.

Input
Command 1
(Generates Data)

Process
Command 2
(Filters/Transforms)

Output
Command 3
(Final Result)

Standard Streams in Linux

Standard Input (stdin)

File descriptor 0. Default input source (usually keyboard).

                        # Read from stdin

                        $ cat < input.txt

                        # Here document

                        $ cat << EOF

                        Line 1

                        Line 2

                        EOF

Standard Output (stdout)

File descriptor 1. Default output destination (usually terminal).

                        # Redirect stdout to file

                        $ ls > output.txt

                        # Append to file

                        $ echo "new line" >> log.txt

Standard Error (stderr)

File descriptor 2. Error messages and diagnostics.

                        # Redirect stderr to file

                        $ command 2> errors.txt

                        # Redirect stderr to stdout

                        $ command 2>&1

Combined Redirections

Advanced redirection patterns.

                        # Redirect both stdout and stderr

                        $ command &> all_output.txt

                        # Send stdout to file, stderr to terminal

                        $ command > output.txt 2>&1

                        # Discard output completely

                        $ command > /dev/null 2>&1

Practical Redirection Examples

Logging with Append

                        # Add timestamp to log

                        $ echo "$(date): Backup started" >> /var/log/backup.log

                        # Log command output

                        $ deploy.sh &>> /var/log/deploy.log

Error Handling

                        # Separate logs and errors

                        $ script.sh > output.log 2> error.log

                        # Count errors

                        $ grep "ERROR" app.log 2>/dev/null | wc -l

Input Redirection

                        # Feed file as input

                        $ sort < unsorted.txt

                        # Multiple inputs

                        $ cat file1 file2 | grep "pattern"

The Power of Pipes

Pipes (|) are the most powerful feature in Linux. They connect commands, creating data transformation pipelines:

# Count unique IPs accessing Nginx today
$ grep "$(date +%d/%b/%Y)" /var/log/nginx/access.log |
    awk '{print $1}' |
    sort |
    uniq -c |
    sort -rn |
    head -20

Each command does one thing: filter → extract → sort → deduplicate → sort → limit

2. Filters & Text Utilities: Data Transformation

What are Text Processing Filters?

Filters are commands that read input, transform it, and write output. They're designed to work with pipes, making them perfect for data processing pipelines.

Essential Text Processing Tools

ggrep - Pattern Search

Search for patterns using regular expressions.

                        $ grep "error" app.log

                        $ grep -i "warning" app.log  # Case insensitive

                        $ grep -v "DEBUG" app.log   # Invert match

                        $ grep -r "TODO" src/       # Recursive

                        $ grep -E "[0-9]{3}-[0-9]{3}" # Extended regex

ssed - Stream Editor

Filter and transform text streams.

                        $ sed 's/old/new/g' file.txt

                        $ sed -i 's/foo/bar/' config.conf # In-place

                        $ sed '/^#/d' file.conf        # Delete comment lines

                        $ sed -n '10,20p' file.log   # Print lines 10-20

aawk - Text Processing Language

Powerful programming language for text processing.

                        $ awk '{print $1}' file.log   # Print first column

                        $ awk -F: '{print $1}' /etc/passwd # Custom delimiter

                        $ awk '$3 > 100 {print $0}'   # Filter rows

                        $ awk '{sum+=$3} END {print sum}' # Sum column

ccut - Extract Columns

Remove sections from lines of files.

                        $ cut -d, -f1,3 data.csv

                        # Comma delimiter, fields 1 & 3

                        $ cut -c1-10 file.txt

                        # Characters 1-10

ssort - Sort Lines

Sort lines of text files.

                        $ sort file.txt

                        $ sort -r file.txt      # Reverse

                        $ sort -n numbers.txt   # Numeric

                        $ sort -u file.txt      # Unique

uuniq - Report/Filter Duplicates

Report or omit repeated lines.

                        $ sort file.txt | uniq

                        $ uniq -c file.txt      # Count occurrences

                        $ uniq -d file.txt      # Only show duplicates

                        $ uniq -u file.txt      # Only show uniques

ttr - Translate Characters

Translate or delete characters.

                        $ tr 'a-z' 'A-Z' < file.txt

                        # Convert to uppercase

                        $ tr -d '\r' < dosfile.txt > unixfile.txt

                        # Remove carriage returns

wwc - Word Count

Count lines, words, and bytes.

                        $ wc file.txt

                        # Lines Words Bytes

                        $ wc -l file.txt       # Only lines

                        $ wc -w file.txt       # Only words

                        $ wc -c file.txt       # Only bytes

Advanced Text Processing Examples

Real-World DevOps Text Processing

                    # Extract top 10 IP addresses from web logs

                    $ awk '{print $1}' /var/log/nginx/access.log |

                        sort | uniq -c | sort -rn | head -10

                    # Find all PHP files containing "mysql_connect" (deprecated function)

                    $ grep -r "mysql_connect" /var/www --include="*.php"

                    # Convert CSV to TSV (tab-separated)

                    $ sed 's/,/\t/g' data.csv > data.tsv

                    # Sum memory usage of all Java processes

                    $ ps aux | grep java | awk '{sum+=$4} END {print sum "%"}'

                    # Remove duplicate lines while preserving order

                    $ awk '!seen[$0]++' file.txt

⚠️ Common Text Processing Pitfalls

Special characters: Always quote regex patterns properly
Large files: Some tools load entire file into memory
Encoding issues: Use file -i to check encoding
Performance: Chain operations efficiently (filter early)
Locale settings: Sort order can vary by locale

3. Searching Files with find and locate

What are File Search Tools?

find searches for files in real-time, while locate uses a database for faster searches. Each has its use cases in DevOps workflows.

ffind - Real-time File Search

Searches filesystem in real-time. Powerful but slower.

                        $ find /path -name "*.log"

                        $ find . -type f -size +100M

                        $ find /var -mtime -7       # Modified last 7 days

                        $ find . -user root

                        $ find /etc -type f -exec grep -l "pattern" {} \;

Best for: Precise searches, complex criteria, real-time results

llocate - Database File Search

Searches pre-built database. Very fast but not real-time.

                        $ locate nginx.conf

                        $ locate "*.log"

                        $ locate -i "Document"     # Case insensitive

                        $ locate -l 20 "*.conf"    # Limit results

                        $ sudo updatedb           # Update database

Best for: Quick searches, known file names, when speed matters

Advanced find Usage

Finding by Time

                        $ find /var/log -mtime -1   # Modified today

                        $ find /tmp -atime +30     # Accessed >30 days ago

                        $ find . -newer reference.txt

Finding by Size

                        $ find / -type f -size +500M

                        $ find . -size +10k -size -1M

                        $ find /var -empty          # Empty files/dirs

Finding by Permissions

                        $ find /home -perm 777     # Exactly 777

                        $ find /etc -perm /u=s     # SUID files

                        $ find . -type f -perm /o+w # World-writable

Executing Commands

                        $ find . -name "*.tmp" -delete

                        $ find /var/log -name "*.log" -exec gzip {} \;

                        $ find . -type f -print0 | xargs -0 rm

Practical Search Examples for DevOps

DevOps File Search Scenarios

                    # Find large log files to rotate

                    $ find /var/log -type f -name "*.log" -size +100M

                    # Find configuration files modified recently

                    $ find /etc -type f -mtime -1

                    # Find PHP files with specific function (security audit)

                    $ find /var/www -type f -name "*.php" -exec grep -l "eval(" {} \;

                    # Find and archive old backup files

                    $ find /backup -type f -name "*.tar.gz" -mtime +30 -exec tar -czf old_backups.tar.gz {} +

                    # Count total lines in all source files

                    $ find src/ -type f -name "*.py" -exec wc -l {} + | tail -1

4. Managing Large Logs: Viewing & Monitoring

What is Log Management?

Log management involves viewing, monitoring, and analyzing log files efficiently. In DevOps, you work with large log files daily for debugging and monitoring.

Log File Navigation Techniques

Beginning

↓

head - View start

Middle

↓

less - Page through

End

↓

tail - View end

Essential Log Viewing Commands

ccat - Concatenate Files

Display entire file content.

                        $ cat app.log

                        $ cat file1 file2 file3

                        $ cat > newfile.txt     # Create file

                        $ cat file1 file2 > combined.txt

Use for: Small files, combining files

ttac - Reverse cat

Display file in reverse order (last line first).

                        $ tac app.log

                        $ tac file.txt | head -20 # Last 20 lines in order

Use for: Viewing recent entries first

lless - File Pager

View file one page at a time.

                        $ less app.log

                        # Navigation:

                        Space - Next page

                        b - Previous page

                        /pattern - Search

                        g - Go to start

                        G - Go to end

                        q - Quit

Use for: Large files, interactive viewing

mmore - Basic Pager

Simpler pager (less feature-rich).

                        $ more app.log

                        # Basic navigation only

                        Space - Next page

                        Enter - Next line

                        q - Quit

Use for: Simple pagination

hhead - View Beginning

Display first lines of file.

                        $ head app.log

                        $ head -20 app.log     # First 20 lines

                        $ head -c 100 app.log  # First 100 bytes

                        $ head -n +50 app.log  # Lines 1-50

Use for: File headers, beginning of logs

ttail - View End

Display last lines of file.

                        $ tail app.log

                        $ tail -100 app.log     # Last 100 lines

                        $ tail -f app.log      # Follow (watch) file

                        $ tail -n 50 app.log    # Last 50 lines

Use for: Recent entries, log monitoring

Advanced Log Monitoring Techniques

Real-time Log Monitoring

                        # Follow single log file

                        $ tail -f /var/log/nginx/access.log

                        # Follow multiple log files

                        $ tail -f /var/log/nginx/*.log

                        # Follow with line numbers

                        $ tail -f app.log | cat -n

                        # Follow and grep for errors

                        $ tail -f app.log | grep -E "(ERROR|FAILED|CRITICAL)"

Log Analysis & Statistics

                        # Count HTTP response codes

                        $ awk '{print $9}' access.log | sort | uniq -c

                        # Top 10 requested URLs

                        $ awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -10

                        # Response time statistics

                        $ awk '{print $(NF-1)}' access.log | sort -n | awk '

                          NR == 1 {min=$1}

                          {sum+=$1; array[NR]=$1}

                          END {

                            avg=sum/NR;

                            median=(NR%2==0) ? (array[NR/2]+array[NR/2+1])/2 : array[(NR+1)/2];

                            print "Min:", min, "Avg:", avg, "Median:", median, "Max:", array[NR]

                          }'

Complete Log Analysis Pipeline

# Monitor logs in real-time, extract errors, and send alerts
$ tail -F /var/log/application/*.log |
    grep --line-buffered -E "(ERROR|CRITICAL|FATAL)" |
    while read line; do
      echo "$(date): $line" >> /var/log/application/errors.log
      echo "$line" | mail -s "Application Error Alert" admin@example.com
    done

Combining Tools: Power Pipelines

The Art of Pipeline Construction

Mastering Linux means knowing how to combine tools effectively. Here are key principles:

Filter early: Remove unwanted data as soon as possible
Use appropriate tools: Each tool has strengths
Keep pipelines readable: Use line breaks for complex pipelines
Test incrementally: Build pipelines step by step
Consider performance: Some operations are expensive

# Complete DevOps pipeline: Find, analyze, and report on large log files
$ find /var/log -type f -name "*.log" -size +50M |
    xargs ls -lh |
    awk '{print $5, $9}' |
    sort -hr |
    head -10 |
    while read size file; do
      lines=$(wc -l < "$file")
      errors=$(grep -c "ERROR" "$file")
      echo "$file: $size, $lines lines, $errors errors"
    done > /tmp/log_analysis_report.txt

Performance Optimization Tips

⚠️ Performance Considerations

Use xargs with -P: Parallel processing for large file sets
Avoid reading large files multiple times: Chain operations in single pass
Use grep -F for fixed strings: Faster than regex when possible
Consider awk over multiple commands: Single process is faster
Use sort -u instead of sort | uniq: More efficient
Limit find scope: Narrow search paths when possible
Use LC_ALL=C for ASCII sorting: Faster than locale-aware sort

                # Parallel processing example

                $ find . -name "*.log" -print0 | xargs -0 -P 4 grep "ERROR"

                # 4 parallel grep processes

Practical DevOps Scenarios

Real-World DevOps Tasks

Log Rotation Monitoring

                        # Check if logs are rotating properly

                        $ ls -lh /var/log/nginx/access.log* | tail -5

                        $ du -sh /var/log/nginx/*.log

                        $ tail -n 1 /var/log/nginx/access.log.1

Error Rate Monitoring

                        # Calculate error rate in last hour

                        $ start=$(date -d '1 hour ago' +'%H:%M:%S')

                        $ end=$(date +'%H:%M:%S')

                        $ total=$(sed -n "/$start/,/$end/p" app.log | wc -l)

                        $ errors=$(sed -n "/$start/,/$end/p" app.log | grep -c "ERROR")

                        $ echo "scale=2; $errors * 100 / $total" | bc

Configuration Audit

                        # Find all config files with passwords

                        $ grep -r "password" /etc --include="*.conf" | grep -v "^#"

                        # Check for world-writable configs

                        $ find /etc -type f -perm /o+w -name "*.conf"

Cleanup Old Files

                        # Remove temp files older than 7 days

                        $ find /tmp -type f -mtime +7 -delete

                        # Clean Docker unused resources

                        $ docker system prune -af

Essential Commands Cheat Sheet

Redirection & Piping

                    $ cmd > file      # Stdout to file

                    $ cmd >> file     # Append stdout

                    $ cmd &> file     # Stdout & stderr

                    $ cmd 2> file     # Stderr to file

                    $ cmd1 | cmd2     # Pipe output

Text Processing

                    $ grep pattern file    # Search text

                    $ sed 's/old/new/' file # Replace text

                    $ awk '{print $1}' file # Print column

                    $ cut -d, -f1 file     # Extract field

                    $ sort file           # Sort lines

                    $ uniq file            # Remove duplicates

File Viewing

                    $ cat file            # View entire file

                    $ less file           # Page through file

                    $ head -20 file       # First 20 lines

                    $ tail -f file        # Follow file

                    $ tac file            # Reverse view

File Search

                    $ find . -name "*.log" # Find files

                    $ locate filename     # Fast search

                    $ which cmd           # Find command

                    $ whereis cmd         # Find binary, source, man

Practice Exercises for DevOps Engineers

Monitor a growing log file in real-time and extract only lines containing "ERROR" or "WARNING".
Find all files modified in the last 24 hours that are larger than 100MB and list them with sizes.
Parse Nginx access logs to show the top 10 IP addresses by request count in the last hour.
Convert a CSV file to JSON using only command-line tools (no Python/Ruby).
Create a script that watches a directory for new files and processes them automatically.
Extract all email addresses from a collection of text files and sort them uniquely.
Find all configuration files in /etc that contain uncommented passwords.
Calculate the average response time from web server logs for the last 5 minutes.

Key Takeaways

Pipes are powerful: Combine simple tools to solve complex problems
Learn the core tools well: grep, sed, awk, find, sort, uniq, cut, tr
Redirect wisely: Control where input comes from and output goes
Choose the right tool: Each tool has strengths and weaknesses
Monitor logs effectively: tail -f for real-time, less for exploration
Optimize pipelines: Filter early, use appropriate tools
Practice pattern matching: Regular expressions are essential
Automate repetitive tasks: Script common text processing workflows

Mastering Linux text processing and file operations will dramatically increase your productivity as a DevOps engineer. These skills enable you to quickly analyze data, automate tasks, and troubleshoot systems efficiently.

Saturday, December 13, 2025