Linux awk Command: The Swiss Army Knife for Text Processing#

awk is one of the most powerful text processing tools in Linux. The name comes from the initials of its three creators: Aho, Weinberger, and Kernighan. Many people only use it for simple column extraction, but awk’s capabilities go far beyond that.

The Core Model#

awk’s workflow can be summarized as:

awk 'pattern { action }' file
  • pattern: Matching condition (regex, expression, range)
  • action: Operation to perform (print, calculate, variable assignment)

For each line, awk will:

  1. Automatically split fields by delimiter (default: whitespace)
  2. Store fields in $1, $2, $3..., with the whole line as $0
  3. Check the pattern and execute the action if it matches

# Extract first and third columns
awk '{ print $1, $3 }' data.txt

# Only process lines containing "error"
awk '/error/ { print $0 }' app.log

# Count total lines in file
awk 'END { print NR }' data.txt

NR is a built-in variable representing the current line number (Number of Records). END is a special pattern that executes after all lines are processed.
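
Since NR updates as each record is read, it also works for numbering lines, much like cat -n:

# Prefix each line with its line number
awk '{ print NR, $0 }' data.txt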

Field Separators: Beyond Whitespace#

The -F parameter specifies the field separator:

# CSV file with comma separator
awk -F',' '{ print $1, $3 }' data.csv

# Regex separator: one or more spaces
awk -F'[ ]+' '{ print $1 }' data.txt

# Pipe separator (a single character is matched literally)
awk -F'|' '{ print $1 }' data.txt

# Multi-character separator (treated as a regular expression)
awk -F'::' '{ print $1 }' data.txt

You can also set FS (Field Separator) inside the script:

awk 'BEGIN { FS = "," } { print $1, $3 }' data.csv

BEGIN executes before any line is processed, commonly used for variable initialization.
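
For example, BEGIN can emit a header row before any data is read; a small sketch, assuming data.txt has name and score columns:

# Print a header, then the first and third column of every row
awk 'BEGIN { print "NAME SCORE" } { print $1, $3 }' data.txt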

Built-in Variable Secrets#

awk provides several built-in variables:

Variable   Meaning
$0         Entire line content
$1..$n     The nth field
NF         Number of fields in the current line
NR         Current line number (global across all input)
FNR        Current line number (within the current file)
FS         Input field separator
OFS        Output field separator
RS         Input record separator
ORS        Output record separator
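
OFS only takes effect when awk rebuilds the record, and assigning a field to itself ($1 = $1) is the usual trick to force that rebuild. A small sketch that re-delimits a CSV (data.csv is an assumed input file):

# Read comma-separated fields, write them back tab-separated
awk 'BEGIN { FS = ","; OFS = "\t" } { $1 = $1; print }' data.csv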

The power of NF: reference the last field

# Print the last field of each line
awk '{ print $NF }' data.txt

# Print the second-to-last field
awk '{ print $(NF-1) }' data.txt
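
NF is also useful as a pattern on its own, since a pattern with no action defaults to printing the line:

# Keep only rows that have exactly 5 fields (skip malformed rows)
awk 'NF == 5' data.txt

# Print only non-empty lines (NF is 0 for a blank line)
awk 'NF' data.txt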

Conditionals and Loops#

awk supports if-else and for/while loops:

# Filter and label by condition
awk '{
  if ($3 > 100) {
    print $1, "HIGH"
  } else {
    print $1, "NORMAL"
  }
}' data.txt

# Calculate sum of fields per line
awk '{
  sum = 0
  for (i = 1; i <= NF; i++) {
    sum += $i
  }
  print sum
}' numbers.txt
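
For comparison, here is the same per-line sum written with a while loop:

awk '{
  sum = 0
  i = 1
  while (i <= NF) {
    sum += $i
    i++
  }
  print sum
}' numbers.txt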

Arrays and Statistics#

awk arrays are associative, so keys can be any string:

# Count word occurrences
awk '{
  for (i = 1; i <= NF; i++) {
    count[$i]++
  }
}
END {
  for (word in count) {
    print word, count[word]
  }
}' text.txt

# Count requests per HTTP status code
awk '{ count[$9]++ } END { for (code in count) print code, count[code] }' access.log

Here $9 is the status code field in Nginx logs (assuming standard format).

Real-World Example: Analyzing Nginx Access Logs#

Assume log format:

192.168.1.1 - - [10/May/2026:10:30:45 +0800] "GET /api/users HTTP/1.1" 200 1234 "-" "Mozilla/5.0"
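
With default whitespace splitting, that line breaks into fields like this (the quoted request spreads across $6-$8):

# $1 = client IP, $4/$5 = timestamp, $7 = request path,
# $9 = status code, $10 = response size in bytes
awk '{ print $1, $7, $9 }' access.log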

1. Top 10 Visiting IPs#

awk '{ print $1 }' access.log | sort | uniq -c | sort -rn | head -10

An awk-centric version that aggregates counts in a single pass (sort and head still handle the ranking):

awk '{
  ip[$1]++
}
END {
  for (i in ip) print ip[i], i
}' access.log | sort -rn | head -10

2. Calculate Average Response Time#

Assuming the log format appends the response time as the last field:

awk '{
  total += $NF
  count++
}
END {
  if (count > 0) print "Average:", total / count, "ms"
}' access.log

3. Extract 4xx and 5xx Errors#

# Extract all 4xx and 5xx status requests
awk '$9 ~ /^[45][0-9][0-9]$/ { print $0 }' access.log

# Count error type distribution
awk '$9 ~ /^[45][0-9][0-9]$/ {
  errors[$9]++
}
END {
  for (code in errors) print code, errors[code]
}' access.log

~ is the regex match operator; $9 ~ /^.../ means the 9th field matches the regex.
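
The negated operator !~ selects lines where the field does not match; for example, every request that did not return a 2xx status:

# Print path and status for all non-2xx requests
awk '$9 !~ /^2/ { print $7, $9 }' access.log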

Performance Optimization Tips#

1. Skip Invalid Lines#

Use next to skip lines that don’t need processing:

awk '/^#/ { next } { print $1 }' config.conf

Skip comment lines (starting with #).
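
The same idea extends to blank lines:

# Skip both comment lines and blank lines
awk '/^#/ || /^$/ { next } { print $1 }' config.conf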

2. Process Only First N Lines#

awk 'NR > 100 { exit } { print $1 }' data.txt

This exits after the first 100 lines are processed, avoiding a full read of a large file.

3. FNR for Multi-File Processing#

When processing multiple files, NR is the global line number across all input, while FNR restarts at 1 for each file:

# Separate statistics per file
awk 'FNR == 1 { print "File:", FILENAME } { print NR, FNR, $0 }' file1.txt file2.txt
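
FNR also enables the classic two-file idiom: FNR == NR is true only while the first file is being read, so you can load a lookup table from it before processing the second file. A sketch, with allow.txt and data.txt as hypothetical inputs:

# Load IDs from allow.txt, then print only the data.txt lines
# whose first field appears in that set
awk 'FNR == NR { allow[$1]; next } $1 in allow' allow.txt data.txt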

Advanced Example: Calculating Moving Average#

Given a temperature data file with one temperature per line, calculate 3-point moving average:

awk '{
  values[NR] = $1
  if (NR >= 3) {
    sum = values[NR] + values[NR-1] + values[NR-2]
    print (NR-2), sum/3
  }
}' temperature.txt
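
Note that values[NR] keeps every reading in memory. A ring buffer indexed by NR % 3 bounds memory to the window size, a small variation on the same idea:

awk '{
  buf[NR % 3] = $1   # only the last 3 readings are kept
  if (NR >= 3) {
    sum = buf[0] + buf[1] + buf[2]
    print (NR-2), sum/3
  }
}' temperature.txt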

awk vs sed vs grep#

Many people confuse these three tools:

Tool   Core Capability                  Typical Use Case
grep   Line filtering                   Quickly find matching lines
sed    Stream editing                   Replace, delete, insert text
awk    Field processing + calculation   Statistics, reports, formatting

They’re often used together:

# Combined example: extract ERROR lines, reduce the bracketed timestamp
# to its hour, then count errors per hour
# (assumes lines like "[2026-05-10 10:30:45] ERROR ...")
grep "ERROR" app.log | \
  sed -E 's/^\[([0-9-]+) ([0-9]+):.*/\2/' | \
  awk '{ count[$1]++ } END { for (h in count) print h, count[h] }'

Summary#

awk’s power lies in:

  1. Automatic field splitting, no manual split needed
  2. Complete programming language (variables, arrays, functions, loops)
  3. Built-in pattern matching mechanism

Mastering awk makes text file processing as efficient as querying a database with SQL. Complex statistics, formatting, and transformation tasks can be done in a single awk command.

