# Linux awk Command: The Swiss Army Knife for Text Processing
awk is one of the most powerful text processing tools in Linux. The name comes from the initials of its three creators: Aho, Weinberger, and Kernighan. Many people only use it for simple column extraction, but awk’s capabilities go far beyond that.
## The Core Model
awk’s workflow can be summarized as:
```bash
awk 'pattern { action }' file
```
- pattern: the matching condition (a regex, expression, or range)
- action: the operation to perform (printing, calculation, variable assignment)
For each line, awk will:
- Automatically split fields by the delimiter (default: whitespace)
- Store the fields in $1, $2, $3, ..., with the whole line available as $0
- Check the pattern and execute the action if it matches
```bash
# Extract first and third columns
awk '{ print $1, $3 }' data.txt

# Only process lines containing "error"
awk '/error/ { print $0 }' app.log

# Count total lines in file
awk 'END { print NR }' data.txt
```
NR is a built-in variable representing the current line number (Number of Records). END is a special pattern that executes after all lines are processed.
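A pattern and END also combine naturally. A minimal sketch that counts matching lines in the same app.log:

```bash
# Count lines containing "error"; n + 0 prints a numeric 0 when nothing matched
awk '/error/ { n++ } END { print n + 0 }' app.log
```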
## Field Separators: Beyond Whitespace
The -F parameter specifies the field separator:
```bash
# CSV file with comma separator
awk -F',' '{ print $1, $3 }' data.csv

# Regex separator: one or more spaces
awk -F'[ ]+' '{ print $1 }' data.txt

# Multi-character separator (a multi-character FS is treated as a regex)
awk -F'::' '{ print $1 }' data.txt
```
You can also set FS (Field Separator) inside the script:
```bash
awk 'BEGIN { FS = "," } { print $1, $3 }' data.csv
```
BEGIN executes before any line is processed, commonly used for variable initialization.
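BEGIN is not limited to setting FS. A small sketch that pairs BEGIN and END to wrap output with a header and footer (assuming the same two-column data.txt as above):

```bash
# BEGIN runs before any input is read, END after the last line
awk 'BEGIN { print "name,value" } { print $1 "," $2 } END { print "-- end --" }' data.txt
```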
## Built-in Variables
awk provides several built-in variables:
| Variable | Meaning |
|---|---|
| $0 | Entire line content |
| $1 ... $n | The nth field |
| NF | Number of fields in the current line |
| NR | Current line number (global across all input) |
| FNR | Current line number within the current file |
| FS | Input field separator |
| OFS | Output field separator |
| RS | Record separator |
| ORS | Output record separator |
The power of NF: reference the last field
```bash
# Print the last field of each line
awk '{ print $NF }' data.txt

# Print the second-to-last field
awk '{ print $(NF-1) }' data.txt
```
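NF also works as a pattern on its own, which is handy for spotting malformed rows. A sketch (the expected count of 5 fields is an arbitrary example):

```bash
# Report lines that don't have exactly 5 fields
awk 'NF != 5 { print "line", NR, "has", NF, "fields" }' data.txt
```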
## Conditionals and Loops
awk supports if-else and for/while loops:
```bash
# Filter and label by condition
awk '{
    if ($3 > 100) {
        print $1, "HIGH"
    } else {
        print $1, "NORMAL"
    }
}' data.txt
```

```bash
# Calculate sum of fields per line
awk '{
    sum = 0
    for (i = 1; i <= NF; i++) {
        sum += $i
    }
    print sum
}' numbers.txt
```
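while is mentioned above but not demonstrated. A minimal sketch that prints each line's fields in reverse order:

```bash
# Walk the fields from last to first with a while loop
awk '{
    i = NF
    while (i >= 1) {
        printf "%s%s", $i, (i > 1 ? " " : "\n")
        i--
    }
}' data.txt
```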
## Arrays and Statistics
awk arrays are associative, so keys can be any string:
```bash
# Count word occurrences
awk '{
    for (i = 1; i <= NF; i++) {
        count[$i]++
    }
}
END {
    for (word in count) {
        print word, count[word]
    }
}' text.txt
```
```bash
# Count requests per HTTP status code
awk '{ count[$9]++ } END { for (code in count) print code, count[code] }' access.log
```
Here $9 is the status code field in Nginx logs (assuming standard format).
## Real-World Example: Analyzing Nginx Access Logs
Assume log format:
```
192.168.1.1 - - [10/May/2026:10:30:45 +0800] "GET /api/users HTTP/1.1" 200 1234 "-" "Mozilla/5.0"
```
### 1. Top 10 Visiting IPs
```bash
awk '{ print $1 }' access.log | sort | uniq -c | sort -rn | head -10
```
Doing the counting in awk itself (sorting is still delegated to sort):
```bash
awk '{
    ip[$1]++
}
END {
    for (i in ip) print ip[i], i
}' access.log | sort -rn | head -10
```
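On GNU awk specifically, the array traversal order can be sorted directly, dropping the external sort. A sketch (gawk-only, not POSIX awk):

```bash
gawk '{ ip[$1]++ }
END {
    PROCINFO["sorted_in"] = "@val_num_desc"   # iterate by value, descending
    for (i in ip) {
        print ip[i], i
        if (++n == 10) break
    }
}' access.log
```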
### 2. Calculate Average Response Time
Assume the log format includes the response time as its last field:
```bash
awk '{
    total += $NF
    count++
}
END {
    if (count) print "Average:", total/count, "ms"   # guard against empty input
}' access.log
```
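The same one-pass pattern extends to other aggregates. A sketch that adds min and max, still assuming the response time is the last field:

```bash
awk 'NR == 1 { min = max = $NF }   # seed min/max from the first line
{
    total += $NF
    if ($NF < min) min = $NF
    if ($NF > max) max = $NF
}
END { if (NR) print "avg:", total/NR, "min:", min, "max:", max }' access.log
```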
### 3. Extract 4xx and 5xx Errors
```bash
# Extract all 4xx and 5xx status requests
awk '$9 ~ /^[45][0-9][0-9]$/ { print $0 }' access.log

# Count error type distribution
awk '$9 ~ /^[45][0-9][0-9]$/ {
    errors[$9]++
}
END {
    for (code in errors) print code, errors[code]
}' access.log
```
~ is the regex match operator: $9 ~ /^.../ tests whether the 9th field matches the regex.
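There is also a negated form, !~. A small sketch that keeps everything except 2xx responses ($7 is the request path in the log format above):

```bash
# Print status code and path for every non-2xx request
awk '$9 !~ /^2/ { print $9, $7 }' access.log
```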
## Performance Optimization Tips
### 1. Skip Invalid Lines
Use next to skip lines that don’t need processing:
```bash
awk '/^#/ { next } { print $1 }' config.conf
```
This skips comment lines (those starting with #).
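The same pattern extends to blank lines. A tiny sketch:

```bash
# Skip comments and blank lines in one pattern
awk '/^#/ || NF == 0 { next } { print $1 }' config.conf
```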
### 2. Process Only First N Lines
```bash
awk 'NR > 100 { exit } { print $1 }' data.txt
```
This exits after the first 100 lines instead of reading the entire file. Note that exit still runs any END block before terminating.
### 3. FNR for Multi-File Processing
When processing multiple files, NR is the global line number, while FNR restarts at 1 for each file:
```bash
# Separate statistics per file
awk 'FNR == 1 { print "File:", FILENAME } { print NR, FNR, $0 }' file1.txt file2.txt
```
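This distinction powers the classic FNR == NR idiom for joining two files. A sketch with hypothetical file names:

```bash
# While the first file is being read, FNR == NR holds: remember its keys,
# then print only lines of the second file whose first field was seen
awk 'FNR == NR { seen[$1] = 1; next } $1 in seen' ids.txt data.txt
```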
## Advanced Example: Calculating a Moving Average

Given a temperature data file with one reading per line, calculate a 3-point moving average:
```bash
awk '{
    values[NR] = $1
    if (NR >= 3) {
        sum = values[NR] + values[NR-1] + values[NR-2]
        print (NR-2), sum/3
    }
}' temperature.txt
```
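The version above keeps every value in memory. A sketch of a constant-memory variant using a ring buffer, with the window size passed in via -v (w=3 reproduces the output above):

```bash
awk -v w=3 '{
    buf[NR % w] = $1                # overwrite the slot leaving the window
    if (NR >= w) {
        sum = 0
        for (i = 0; i < w; i++) sum += buf[i]
        print NR - w + 1, sum / w   # index of the window's first point
    }
}' temperature.txt
```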
## awk vs sed vs grep
Many people confuse these three tools:
| Tool | Core Capability | Typical Use Case |
|---|---|---|
| grep | Line filtering | Quickly search matching lines |
| sed | Stream editing | Replace, delete, insert |
| awk | Field processing + calculation | Statistics, reports, formatting |
They’re often used together:
```bash
# Combined example: extract error lines, reduce each to its hour, count per hour
# (assumes bracketed timestamps like [10/May/2026:10:30:45 +0800])
grep "ERROR" app.log | \
sed -E 's/.*\[[^:]*:([0-9]{2}).*/\1/' | \
awk '{ count[$1]++ } END { for (h in count) print h, count[h] }'
```
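For contrast, awk alone can handle all three steps in one pass. A sketch under the same timestamp assumption:

```bash
# Filter (pattern), transform (match/substr), and count, all in awk
awk '/ERROR/ {
    if (match($0, /:[0-9][0-9]:/)) {          # first :HH: inside the timestamp
        count[substr($0, RSTART + 1, 2)]++
    }
}
END { for (h in count) print h, count[h] }' app.log
```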
## Summary
awk’s power lies in:
- Automatic field splitting, no manual split needed
- A complete programming language (variables, arrays, loops, and user-defined functions, sketched below)
- Built-in pattern matching mechanism
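User-defined functions have not appeared yet. A minimal sketch reusing the temperature.txt file from earlier (the Celsius-to-Fahrenheit conversion is just for illustration):

```bash
# Functions are declared at the top level, outside pattern-action rules
awk 'function c2f(c) { return c * 9 / 5 + 32 }
{ printf "%.1f\n", c2f($1) }' temperature.txt
```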
Mastering awk makes text file processing as efficient as querying a database with SQL. Complex statistics, formatting, and transformation tasks can be done in a single awk command.
Related: Linux sed Command | Text Deduplicate Tool | Grep Command Guide