AWK tutorial
last modified October 18, 2023
This is an AWK tutorial. It covers the basics of the AWK tool.
AWK
AWK is a pattern-directed scanning and processing language. An AWK program consists of a set of actions to be taken against streams of textual data. AWK extensively uses regular expressions. It is a standard feature of most Unix-like operating systems.
AWK was created at Bell Labs in 1977. Its name is derived from the family names of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan.
There are two major implementations of AWK: the traditional Unix AWK and the newer GAWK. GAWK is the GNU Project's implementation of the AWK programming language. GAWK has several extensions to the original AWK.
AWK program
An AWK program consists of a sequence of pattern-action statements and optional function definitions. It processes text files. AWK is a line oriented language. It divides a file into lines called records. Each line is broken up into a sequence of fields. The fields are accessed by special variables: $1 reads the first field, $2 the second and so on. The $0 variable refers to the whole record.
The structure of an AWK program has the following form:
pattern { action }
The pattern is a test that is performed on each of the records. If the condition is met then the action is performed. Either pattern or action can be omitted, but not both. The default pattern matches each line and the default action is to print the record.
awk -f program-file [file-list]
awk program [file-list]
An AWK program can be run in two basic ways: a) the program is read from a separate file; the name of the program file follows the -f option, b) the program is specified on the command line enclosed by quote characters.
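Both ways of running a program, together with the default pattern and the default action, can be sketched as follows. The sample input and the temporary program file are made up for this illustration.

```shell
# Hypothetical sample input: two records with three fields each.
input='alpha beta gamma
delta epsilon zeta'

# b) Program given on the command line. The pattern is omitted,
#    so the action runs for every record.
inline=$(printf '%s\n' "$input" | awk '{ print $2 }')
echo "$inline"

# The action is omitted: the default action prints the matching record.
matched=$(printf '%s\n' "$input" | awk '$1 == "delta"')
echo "$matched"

# a) The same program read from a file with the -f option.
prog=$(mktemp)
echo '{ print $2 }' > "$prog"
fromfile=$(printf '%s\n' "$input" | awk -f "$prog")
rm -f "$prog"
echo "$fromfile"
```

The first and the last command print the second field of each record; the middle one prints the whole record whose first field is delta.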
AWK one-liners
AWK one-liners are simple one-shot programs run from the command line. Let us have the following text file:
storeroom
tree
cup
store
book
cloud
existence
ministerial
falcon
town
sky
top
bookworm
bookcase
war
We want to print all words in the words.txt file that are longer than five characters.
$ awk 'length($1) > 5 {print $0}' words.txt
storeroom
existence
ministerial
falcon
bookworm
bookcase
The AWK program is placed between two single quote characters. The first part is the pattern; we specify that the length of the first field is greater than five. The length function returns the length of the string. The $1 variable refers to the first field of the record; in our case there is only one field per record. The action is placed between curly brackets.
$ awk 'length($1) > 5' words.txt
storeroom
existence
ministerial
falcon
bookworm
bookcase
As we have specified earlier, the action can be omitted. In such a case the default action is performed: the whole record is printed.
$ awk 'length($1) == 3' words.txt
cup
sky
top
war
We print all words that have three characters.
$ awk '!(length($1) == 3)' words.txt
storeroom
tree
store
book
cloud
existence
ministerial
falcon
town
bookworm
bookcase
With the ! operator, we can negate the condition; we print all words that do not have exactly three characters.
$ awk '(length($1) == 3) || (length($1) == 4)' words.txt
tree
cup
book
town
sky
top
war
We combined two conditions with the || operator.
$ awk 'length($1) > 0 {print $1, "has", length($1), "chars"}' words.txt
storeroom has 9 chars
tree has 4 chars
cup has 3 chars
store has 5 chars
book has 4 chars
cloud has 5 chars
existence has 9 chars
ministerial has 11 chars
falcon has 6 chars
town has 4 chars
sky has 3 chars
top has 3 chars
bookworm has 8 chars
bookcase has 8 chars
war has 3 chars
This AWK command prints the length of each of the words. If we separate print arguments with a comma, AWK puts a space character between them in the output.
$ grep book words.txt
book
bookworm
bookcase
$ grep book words.txt -n
5:book
13:bookworm
14:bookcase
The grep command is used for locating text patterns inside files.
$ awk '/book/ {print}' words.txt
book
bookworm
bookcase
$ awk '/book/ {print NR ":" $0}' words.txt
5:book
13:bookworm
14:bookcase
These are AWK equivalents of the above grep commands. The NR variable holds the number of the record (line) that is currently being processed.
Next we apply conditions on numbers.
Peter 89
Lucia 95
Thomas 76
Marta 67
Joe 92
Alex 78
Sophia 90
Alfred 65
Kate 46
We have a file with scores of students.
$ awk '$2 >= 90 { print $0 }' scores.txt
Lucia 95
Joe 92
Sophia 90
We print all students with scores of 90 or more.
$ awk '$2 >= 90 { print }' scores.txt
Lucia 95
Joe 92
Sophia 90
If we omit the argument of the print statement, $0 is assumed.
$ awk '$2 >= 90' scores.txt
Lucia 95
Joe 92
Sophia 90
A missing { action } means print the matching line.
$ awk '{ if ($2 >= 90) print }' scores.txt
Lucia 95
Joe 92
Sophia 90
Instead of a pattern, we can also use an if condition in the action.
$ awk '{sum += $2} END { printf("The average score is %.2f\n", sum/NR) }' scores.txt
The average score is 77.56
This command calculates the average score. In the action block, we calculate the sum of scores. In the END block, we print the average score.
We format the output with the built-in printf function. The %.2f is a format specifier; each specifier begins with the % character. The .2 is the precision: the number of digits after the decimal point. The f expects a floating point value. The \n is not a part of the specifier; it is a newline character. It prints a newline after the string is shown on the terminal.
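A few more specifiers can be tried in the same way. The following is a small sketch with made-up input; it combines a left-aligned string, a right-aligned integer, and a floating point value with one digit of precision.

```shell
# %-8s  left-aligns a string in a field that is 8 characters wide
# %5d   right-aligns an integer in a field that is 5 characters wide
# %.1f  prints a floating point value with one digit after the decimal point
formatted=$(printf 'Lucia 95\nKate 46\n' |
    awk '{ printf("%-8s|%5d|%.1f\n", $1, $2, $2 / 10) }')
echo "$formatted"
```

Unlike print, printf does not add a newline on its own, which is why the format string ends with \n.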
AWK working with pipes
AWK can receive input and send output to other commands via the pipe.
$ echo -e "1 2 3 5\n2 2 3 8" | awk '{print $(NF)}'
5
8
In this case, AWK receives the output of the echo command. It prints the values of the last column.
$ awk -F: '$7 ~ /bash/ {print $1}' /etc/passwd | wc -l
3
Here, the AWK program sends data to the wc command via the pipe. In the AWK program, we find those users who use bash. Their names are passed to the wc command, which counts them. In our case, there are three users using bash.
AWK fields
AWK reads files line by line. Each line or record can be separated into fields.
The FS variable stores the field separator, which is a single space by default; with this default, fields are separated by runs of spaces and tabs.
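The effect of the separator can be checked by counting fields with NF. This is a small sketch on a made-up line that contains three spaces, a comma, and a tab.

```shell
# Default FS: runs of spaces and tabs separate the fields -> a, "b,c", d
default_nf=$(printf 'a   b,c\td\n' | awk '{ print NF }')

# FS set to a comma: only the comma separates fields -> "a   b", "c<TAB>d"
comma_nf=$(printf 'a   b,c\td\n' | awk -F, '{ print NF }')

echo "$default_nf $comma_nf"
```

With the default separator the line has three fields; with the comma it has two.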
$ ls -l
total 132
drwxr-xr-x 2 jano7 jano7 512 Feb 11 16:02 data
-rw-r--r-- 1 jano7 jano7 110211 Oct 12 2019 sid.jpg
-rw-r--r-- 1 jano7 jano7 5 Jul 22 20:21 some.txt
-rw-r--r-- 1 jano7 jano7 226 Apr 23 16:56 thermopylae.txt
-rw-r--r-- 1 jano7 jano7 365 Aug 4 10:22 users.txt
-rw-r--r-- 1 jano7 jano7 24 Jul 21 21:03 words.txt
-rw-r--r-- 1 jano7 jano7 30 Jul 22 21:20 words2.txt
We have these files in the current working directory.
$ ls -l | awk '{print $6 " " $9}'
Feb data
Oct sid.jpg
Jul some.txt
Apr thermopylae.txt
Aug users.txt
Jul words.txt
Jul words2.txt
We pipe the output of the ls command to AWK. We print the sixth and ninth columns of the output.
John Doe, gardener, London, M, 11/23/1982
Jane Doe, teacher, London, F, 10/12/1988
Peter Smith, programmer, New York, M, 9/18/2000
Joe Brown, driver, Portland, M, 1/1/1976
Jack Smith, physician, Manchester, M, 2/27/1983
Lucy Black, accountant, Birmingham, F, 5/5/1998
Martin Porto, actor, Los Angeles, M, 4/30/1967
Sofia Harris, interpreter, Budapest, F, 8/18/1993
In the users.txt file, we have a few users. The fields are now separated with a comma character.
$ awk -F, '{print $1 " is a(n)" $2}' users.txt
John Doe is a(n) gardener
Jane Doe is a(n) teacher
Peter Smith is a(n) programmer
Joe Brown is a(n) driver
Jack Smith is a(n) physician
Lucy Black is a(n) accountant
Martin Porto is a(n) actor
Sofia Harris is a(n) interpreter
We print the first and the second column of the file. We specify the field separator with the -F option.
$ awk 'BEGIN {FS=","} {print $3}' users.txt
London
London
New York
Portland
Manchester
Birmingham
Los Angeles
Budapest
The field separator can be also set in the program. We set the FS variable to comma inside the BEGIN block, which is executed once at the beginning of the program execution.
$ awk 'BEGIN {FS=","} {print $3}' users.txt | uniq
London
New York
Portland
Manchester
Birmingham
Los Angeles
Budapest
We pass the output to the uniq command to get distinct values.
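Note that uniq only collapses duplicate lines that follow each other, which is the case for the two London lines above. When duplicates may be scattered through the input, a common AWK idiom is to remember the values already seen in an array. The following is a sketch on made-up input; the array name seen is an arbitrary choice.

```shell
# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
# is true only for the first occurrence, and the default action prints it.
distinct=$(printf 'London\nParis\nLondon\n' | awk '!seen[$0]++')
echo "$distinct"

# For comparison: uniq keeps the second London, because it is not adjacent.
adjacent=$(printf 'London\nParis\nLondon\n' | uniq)
echo "$adjacent"
```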
$ awk -F, '$4 ~ "F" {print $1}' users.txt
Jane Doe
Lucy Black
Sofia Harris
We print all females. We use the ~ operator to match against a pattern.
$ awk '{print "The", NR". record has", length($0), "characters"}' users.txt
The 1. record has 41 characters
The 2. record has 40 characters
The 3. record has 47 characters
The 4. record has 40 characters
The 5. record has 47 characters
The 6. record has 47 characters
The 7. record has 46 characters
The 8. record has 49 characters
The command prints the number of characters for each record. The $0 stands for the whole line.
$ awk -F, '{print $NF, $(NF-1)}' users.txt
11/23/1982 M
10/12/1988 F
9/18/2000 M
1/1/1976 M
2/27/1983 M
5/5/1998 F
4/30/1967 M
8/18/1993 F
The $NF is the last field, the $(NF-1) is the second last field.
$ awk -F, '{ if ($4 ~ "M") {m++} else {f++} } END {printf "users: %d\nmales: %d\nfemales: %d\n", m+f, m, f}' users.txt
users: 8
males: 5
females: 3
The command prints the number of users, males, and females. When a command becomes too complex, it is better to put it inside a file.
{
    if ($4 ~ "M") {
        m++
    } else {
        f++
    }
}

END {
    printf "users: %d\nmales: %d\nfemales: %d\n", m+f, m, f
}
The first block delimited by {} is executed for each line of the file. We count all records that have and don't have M in the 4th field. The numbers are stored in the m and f variables. The END block is executed once at the end of the program. There we print the number of users, males, and females. The printf function allows us to create formatted strings.
$ awk -F, -f males_females.awk users.txt
users: 8
males: 5
females: 3
AWK reads the program from the file that follows the -f option.
AWK regular expressions
Regular expressions are often applied on AWK fields. The ~ is the regular expression match operator. It checks if a string matches the provided regular expression.
$ awk '$1 ~ /^[bc]/ {print $1}' words.txt
cup
book
cloud
bookworm
bookcase
In this command we print all the words that begin with the b or c character. The regular expression is placed between two slash characters. The characters inside the square brackets are listed without separators; a comma placed there would be matched literally.
$ awk '$1 ~ /[en]$/ {print $1}' words.txt
tree
store
existence
falcon
town
bookcase
This command prints all words that end with e or n.
$ awk '$1 ~ /\<...\>/ {print $1}' words.txt
cup
sky
top
war
The command prints all words that have three characters. The dot (.) stands for any character and the \< \> characters are word boundaries; these boundary operators are a GAWK extension.
$ awk '$1 ~ /\<...\>/ || $1 ~ /\<....\>/ {print $1}' words.txt
tree
cup
book
town
sky
top
war
We combine two conditions with the or (||) operator. The AWK command prints all words that have either three or four characters.
$ awk '$1 ~ /store|room|book/' words.txt
storeroom
store
book
bookworm
bookcase
With the alternation operator (|), we print fields that contain any of the specified words.
$ awk '$1 ~ /^book(worm|case)?$/' words.txt
book
bookworm
bookcase
Applying a subpattern with (), we print fields that are exactly the words book, bookworm, or bookcase. The ? tells that the subpattern may or may not be there.
The match is a built-in string manipulation function. It tests if the given string contains a regular expression pattern. The first parameter is the string, the second is the regex pattern. It is similar to the ~ operator.
$ awk 'match($0, /^[cb]/)' words.txt
brown
craftsmanship
book
beautiful
computer
The program prints those lines that begin with c or b. The regular expression is placed between two slash characters.
The match function sets the RSTART variable; it is the index of the start of the matching pattern.
$ awk 'match($0, /i/) {print $0 " has i character at " RSTART}' words.txt
craftsmanship has i character at 12
beautiful has i character at 6
existence has i character at 3
ministerial has i character at 2
The program prints those words that contain the i character. In addition, it prints the index of the first occurrence of the character.
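Besides RSTART, the match function also sets RLENGTH, the length of the matched text. Together they can be passed to substr to extract the match. This is a small sketch on made-up input; it also shows the values that are set when nothing matches.

```shell
# RSTART is the position of the match, RLENGTH its length.
found=$(echo 'craftsmanship' |
    awk 'match($0, /man/) { print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }')
echo "$found"

# Without a match, match() returns 0, RSTART is 0 and RLENGTH is -1.
missing=$(awk 'BEGIN { match("tree", /x/); print RSTART, RLENGTH }')
echo "$missing"
```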
AWK built-in variables
AWK provides important built-in variables.
Variable name | Description |
---|---|
FS | field separator (default space) |
NF | # of fields in the current record |
NR | current record/line number |
$0 | whole line |
$n | n-th field |
FNR | current record number in the current file |
RS | input record separator (default newline) |
OFS | output field separator (default blank) |
ORS | output record separator (default newline) |
OFMT | output format for numbers (default %.6g) |
SUBSEP | separates multiple subscripts (default 034) |
ARGC | argument count |
ARGV | array of arguments |
FILENAME | the name of the current input file |
RSTART | index of the start of the matching pattern |
RLENGTH | the length of the string matched by the match function |
CONVFMT | conversion format used when converting numbers (default %.6g) |
The table lists common AWK variables.
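The output separators from the table can be tried with a short sketch on made-up input. OFS is placed between the comma-separated arguments of print, and ORS is written after each print statement.

```shell
# OFS replaces the default space between print arguments,
# ORS replaces the default newline after each record.
joined=$(printf 'a b c\nd e f\n' |
    awk 'BEGIN { OFS = "-"; ORS = ";" } { print $1, $3 }')
echo "$joined"
```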
$ awk 'NR % 2 == 0 {print}' words.txt
tree
store
cloud
ministerial
town
top
bookcase
The above program prints every second record of the words.txt file. The condition is true when NR divided by two leaves no remainder, that is, for even line numbers.
Say we want to print the line numbers of the file.
$ awk '{print NR, $0}' words.txt
1 storeroom
2 tree
3 cup
4 store
5 book
6 cloud
7 existence
8 ministerial
9 falcon
10 town
11 sky
12 top
13 bookworm
14 bookcase
15 war
Again, we use the NR variable. We skip the pattern; therefore, the action is performed on each line. The $0 variable refers to the whole record.
$ echo -e "cup\nbill\ncoin" > words1.txt
$ echo -e "cloud\nbreath\nrank" > words2.txt
We create two text files with some words.
$ awk '{ print $1, "is at line", FNR, "in", FILENAME }' words1.txt words2.txt
cup is at line 1 in words1.txt
bill is at line 2 in words1.txt
coin is at line 3 in words1.txt
cloud is at line 1 in words2.txt
breath is at line 2 in words2.txt
rank is at line 3 in words2.txt
We print where each word is located; we include the line number and the filename. The difference between the NR and FNR variables is that the former counts lines across all files while the latter always counts lines in the current file.
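The two counters can be compared side by side. The following sketch creates two small files in a temporary directory (the file names a.txt and b.txt are made up) and prints FILENAME, NR, and FNR for each record.

```shell
dir=$(mktemp -d)
printf 'cup\nbill\n' > "$dir/a.txt"
printf 'cloud\nrank\n' > "$dir/b.txt"

# NR keeps growing across the files, FNR starts again at 1 in b.txt.
counters=$(cd "$dir" && awk '{ print FILENAME, NR, FNR }' a.txt b.txt)
rm -rf "$dir"
echo "$counters"
```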
For the following example, we have this C source file.
 1 #include <stdio.h>
 2
 3 int main(void) {
 4
 5     char *countries[5] = { "Germany", "Slovakia", "Poland",
 6             "China", "Hungary" };
 7
 8     size_t len = sizeof(countries) / sizeof(*countries);
 9
10     for (size_t i=0; i < len; i++) {
11
12         printf("%s\n", countries[i]);
13     }
14 }
We have a source file with line numbers. Our task is to remove the numbers from the text.
$ awk '{print substr($0, 4)}' source.c
#include <stdio.h>

int main(void) {

    char *countries[5] = { "Germany", "Slovakia", "Poland",
            "China", "Hungary" };

    size_t len = sizeof(countries) / sizeof(*countries);

    for (size_t i=0; i < len; i++) {

        printf("%s\n", countries[i]);
    }
}
We use the substr function. It returns a substring of the given string. We apply the function on each line, skipping the first three characters. In other words, we print each record from the fourth character till its end.
$ awk '{print substr($0, 4) >> "source2.c"}' source.c
We redirect the output to a new file. The >> operator appends to the file if it already exists.
The NF is the number of fields in the current record.
2 3 1 34 21 12
43 21 11 2 11 33 12
43 72 91 90 32 14
34 87 22 12 75 2 42 13
75 23 1 42 41 94 4 32
2 1 6 2 1 3 1 4
53 13 52 84 14 14 63
3 2 5 76 31 45
We have a file of values.
$ awk 'NF == 6' values.txt
2 3 1 34 21 12
43 72 91 90 32 14
3 2 5 76 31 45
We print records that have six fields.
$ awk '{print "line", NR, "has", NF, "values"}' values.txt
line 1 has 6 values
line 2 has 7 values
line 3 has 6 values
line 4 has 8 values
line 5 has 8 values
line 6 has 8 values
line 7 has 7 values
line 8 has 6 values
This command prints the number of values for each line.
{
    for (i = 1; i<=NF; i++) {
        sum += $i
    }

    print "line", NR, "sum:", sum
    sum = 0
}
The program calculates the sum of values for each line.
for (i = 1; i<=NF; i++) {
    sum += $i
}
This is a classic for loop. We go through each of the fields in the record and add the value to the sum variable. The += is a compound addition operator.
$ awk -f calc_sum.awk values.txt
line 1 sum: 73
line 2 sum: 133
line 3 sum: 342
line 4 sum: 287
line 5 sum: 312
line 6 sum: 20
line 7 sum: 293
line 8 sum: 162
The following is an alternative solution using the split function.
{
    split($0, vals)

    for (idx in vals) {
        sum += vals[idx]
    }

    print "line", NR, "sum:", sum
    sum = 0
}
The program calculates the sum of values for each line.
split($0, vals)
The split function splits the given string into an array; the default separator for elements of the record is FS.
for (idx in vals) {
    sum += vals[idx]
}
We go through the array and calculate the sum. In each of the loop cycles, the idx variable is set to the current index of the array.
$ awk -f calc_sum2.awk values.txt
line 1 sum: 73
line 2 sum: 133
line 3 sum: 342
line 4 sum: 287
line 5 sum: 312
line 6 sum: 20
line 7 sum: 293
line 8 sum: 162
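For a sum, the order in which the for (idx in vals) loop visits the elements does not matter, and AWK does not guarantee any particular order. When the order is important, the return value of split, which is the number of elements, can drive a counting loop instead. This is a sketch on made-up input; the variable names n and out are arbitrary.

```shell
ordered=$(echo '30 10 20' | awk '{
    n = split($0, vals)                  # n is the number of elements
    out = ""
    for (i = 1; i <= n; i++) {           # indexes 1..n, in input order
        out = out (i > 1 ? "," : "") i ":" vals[i]
    }
    print out
}')
echo "$ordered"
```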
BEGIN and END blocks
BEGIN and END are blocks that are executed before and after all records have been read. These two keywords are followed by curly brackets where we specify statements to be executed.
$ awk 'BEGIN { print "Unix time: ", systime()}'
Unix time: 1628156179
The BEGIN block is executed before the first input line is processed. We print the Unix time, utilizing the systime function. The function is a GAWK extension.
$ awk 'BEGIN { print "Today is", strftime("%Y-%m-%d") }'
Today is 2021-08-05
The program prints the current date. The strftime function is a GAWK extension.
$ echo "1,2,3,4,5" | awk '{ split($0,a,",");for (idx in a) {sum+=a[idx]} } END {print sum}'
15
The program splits the line into an array of numbers with the split function. We go over the array elements and calculate their sum. In the END block, we print the sum.
The Battle of Thermopylae was fought between an alliance of Greek city-states, led by King Leonidas of Sparta, and the Persian Empire of Xerxes I over the course of three days, during the second Persian invasion of Greece.
We want to count the number of lines, words, and characters in the file.
$ wc thermopylae.txt
4 38 226 thermopylae.txt
To count the number of lines, words, and characters in a file, we have the wc command.
{
    words += NF
    chars += length + 1 # include newline character
}

END { print NR, words, chars }
The first part of the program is executed for each line of the file. The END block is run at the end of the program.
$ awk -f count_words.awk thermopylae.txt
4 38 226
$ cat words.txt
brown
tree
craftsmanship
book
beautiful
existence
ministerial
computer
town
$ cat words2.txt
pleasant
curly
storm
hering
immune
We want to know the number of lines in those two files.
$ awk 'END {print NR}' words.txt words2.txt
14
We pass two files to the AWK program. AWK sequentially processes the file names received on the command line. The block following the END keyword is executed at the end of the program; we print the NR variable, which holds the line number of the last processed line.
$ awk 'BEGIN {srand()} {lines[NR] = $0} END { r=int(rand()*NR + 1); print lines[r]}' words.txt
tree
The above program prints a random line from the words.txt file. The srand function seeds the random number generator. The function has to be executed only once. In the main part of the program, we store the current record into the lines array. In the end, we compute a random number between 1 and NR and print the randomly chosen line from the array structure.
AWK playing with words dictionary
In the following examples, we create a couple of AWK programs that work with an English dictionary. On Unix systems, a dictionary is located in the /usr/share/dict/words file.
$ awk 'length($1) == 10 { n++ } END {print n}' /usr/share/dict/words
30882
This command prints the number of words in the given dictionary that have ten characters. In the action block, we increase the n variable for each match. In the END block, we print the final number.
$ awk '$1 ~ /^w/ && length($1) == 4 { n++; if (n<15) {print} else {exit} }' /usr/share/dict/words
waag
waar
wabe
wace
wack
wade
wadi
waeg
waer
waff
waft
wage
waif
waik
This command prints the first fourteen words that begin with 'w' and have four letters. The exit statement terminates the AWK program.
$ awk '$1 ~ /^w.*r$/ { n++; if (n<15) {print} } END {print n}' /usr/share/dict/words
waar
wabber
wabster
wacker
wadder
waddler
wader
wadmaker
wadsetter
waer
wafer
waferer
wafermaker
wafter
417
The command prints the first fourteen words that begin with 'w' and end in 'r'. In the end, it prints the total number of such words in the file.
A palindrome is a word, number, phrase, or other sequence of characters which reads the same backward as forward, such as madam or racecar.
{
    for (i=length($0); i!=0; i--) {
        r = r substr($0, i, 1)
    }

    if (length($0) > 1 && $0 == r) {
        print
        n++
    }

    r = ""
}

END {
    printf "There are %d palindromes\n", n
}
The program finds all palindromes. The algorithm is that the original word must be equal to the reversed word.
for (i=length($0); i!=0; i--) {
    r = r substr($0, i, 1)
}
Using a for loop, we reverse the given string. The substr function returns a substring; the first parameter is the string, the second is the beginning position, and the last is the length of the substring. To concatenate strings in AWK, we simply separate them by a space character.
if (length($0) > 1 && $0 == r) {
    print
    n++
}
The length of the word must be greater than 1; we don't count single letters as palindromes. If the reversed word is equal to the original word, we print it and increase the n variable.
r = ""
We reset the r variable.
END {
    printf "There are %d palindromes\n", n
}
In the end, we print the number of palindromes in the file.
{
    if (length($0) == 1) {next}

    rev = reverse($0)

    if ($0 == rev) {
        print
        n++
    }
}

END {
    printf "There are %d palindromes\n", n
}

function reverse(word) {
    r = ""
    for (i=length(word); i!=0; i--) {
        r = r substr(word, i, 1)
    }
    return r
}
To improve the readability of the program, we create a custom reverse function.
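Variables used inside an AWK function are global unless they appear in the parameter list. A common convention is to declare extra parameters, which the caller never passes, to serve as local variables. The following is a sketch of that convention applied to a reverse function; the test word and the value 42 are made up.

```shell
result=$(echo 'stressed' | awk '
# The extra parameters r and i are never passed by the caller;
# they serve as local variables of the function.
function reverse(word,    r, i) {
    for (i = length(word); i > 0; i--) {
        r = r substr(word, i, 1)
    }
    return r
}
{
    i = 42                     # a global variable with the same name
    print reverse($0), i       # the call leaves the global i untouched
}')
echo "$result"
```

Since r is local and starts out empty on each call, it does not have to be reset explicitly.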
AWK ARGC and ARGV variables
Next, we work with the ARGC and ARGV variables.
$ awk 'BEGIN { print ARGC, ARGV[0], ARGV[1]}' words.txt
2 awk words.txt
The program prints the number of arguments of the AWK program and the first two arguments. ARGC is the number of command line arguments; in our case there are two arguments including AWK itself. ARGV is an array of command line arguments. The array is indexed from 0 to ARGC - 1.
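All arguments can be listed with a loop over ARGV. In this sketch the file names one.txt and two.txt are made up and do not have to exist, because a program that consists only of a BEGIN block does not read any input. The loop starts at 1, since the value of ARGV[0] depends on the name under which AWK was started.

```shell
args=$(awk 'BEGIN {
    print "ARGC:", ARGC
    for (i = 1; i < ARGC; i++) {
        print i, ARGV[i]
    }
}' one.txt two.txt)
echo "$args"
```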
FS is an input field separator, a space by default. NF is the number of fields in the current input record.
For the following program, we use this file:
$ cat values
2, 53, 4, 16, 4, 23, 2, 7, 88
4, 5, 16, 42, 3, 7, 8, 39, 21
23, 43, 67, 12, 11, 33, 3, 6
We have three lines of comma-separated values.
BEGIN {
    FS=","
    max = 0
    min = 10**10
    sum = 0
    count = 0
    avg = 0
}

{
    for (i=1; i<=NF; i++) {
        sum += $i
        count++
        if (max < $i) {
            max = $i
        }
        if (min > $i) {
            min = $i
        }
        printf("%d ", $i)
    }
}

END {
    avg = sum / count
    printf("\n")
    printf("Min: %d, Max: %d, Sum: %d, Average: %d\n", min, max, sum, avg)
}
The program computes basic statistics from the provided values.
FS=","
The values in the file are separated by the comma character; therefore, we set the FS variable to the comma character.
max = 0
min = 10**10
sum = 0
count = 0
avg = 0
We define default values for the maximum, minimum, sum, count, and average. AWK variables are dynamic; their values are either floating-point numbers or strings, or both, depending upon how they are used.
{
    for (i=1; i<=NF; i++) {
        sum += $i
        count++
        if (max < $i) {
            max = $i
        }
        if (min > $i) {
            min = $i
        }
        printf("%d ", $i)
    }
}
In the main part of the script, we go through each line and calculate the maximum, minimum, and the sum of the values. We also count the values. The NF is used to determine the number of values per line.
END {
    avg = sum / count
    printf("\n")
    printf("Min: %d, Max: %d, Sum: %d, Average: %d\n", min, max, sum, avg)
}
In the end part of the script, we calculate the average and print the calculations to the console. The sum is divided by count, the total number of values; NF cannot be used here, because in the END block it refers only to the last record.
$ awk -f stats.awk values
2 53 4 16 4 23 2 7 88 4 5 16 42 3 7 8 39 21 23 43 67 12 11 33 3 6
Min: 2, Max: 88, Sum: 542, Average: 20
The FS variable can be specified as a command line option with the -F flag.
$ awk -F: '{print $1, $7}' /etc/passwd | head -7
root /bin/bash
daemon /usr/sbin/nologin
bin /usr/sbin/nologin
sys /usr/sbin/nologin
sync /bin/sync
games /usr/sbin/nologin
man /usr/sbin/nologin
The example prints the first (the user name) and the seventh field (user's shell) from the system /etc/passwd file. The head command is used to print only the first seven lines. The data in the /etc/passwd file is separated by a colon. So the colon is given to the -F option.
The RS is the input record separator, by default a newline.
$ echo "Jane 17#Tom 23#Mark 34" | awk 'BEGIN {RS="#"} {print $1, "is", $2, "years old"}'
Jane is 17 years old
Tom is 23 years old
Mark is 34 years old
In the example, we have relevant data separated by the # character. We set RS to # so that each name and age pair becomes a separate record. AWK can receive input from other commands like echo.
AWK GET request
GAWK can make HTTP requests. We use the getline function and the special /inet/tcp/0/ file name. Both the special file name and the |& operator are GAWK extensions.
BEGIN {

    site = "webcode.me"
    server = "/inet/tcp/0/" site "/80"

    print "GET / HTTP/1.0" |& server
    print "Host: " site |& server
    print "\r\n\r\n" |& server

    while ((server |& getline line) > 0 ) {
        content = content line "\n"
    }

    close(server)

    print content
}
The program makes a GET request to the webcode.me page and reads its response.
print "GET / HTTP/1.0" |& server
The |& operator starts a coprocess, which allows two-way communication.
while ((server |& getline line) > 0 ) {
    content = content line "\n"
}
With the getline function, we read the response from the server.
Passing variables to AWK
AWK has the -v option which is used to assign values to variables.
$ awk -v today=$(date +%Y-%m-%d) 'BEGIN { print "Today is", today }'
Today is 2021-08-05
We pass the output of the date command to the today variable, which can then be accessed in the AWK program.
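The -v option is also handy for parameters such as thresholds, so that the program text does not have to be edited. This is a sketch on made-up data; the variable names min and msg are arbitrary. Quoting the value keeps it in one piece when it contains spaces.

```shell
# The threshold is passed from the shell instead of being hard-coded.
top=$(printf 'Peter 89\nLucia 95\nKate 46\n' |
    awk -v min=90 '$2 >= min { print $1 }')
echo "$top"

# A value with a space must be quoted in the shell.
greeting=$(awk -v msg="hello world" 'BEGIN { print msg }')
echo "$greeting"
```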
The Battle of Thermopylae was fought between an alliance of Greek city-states, led by King Leonidas of Sparta, and the Persian Empire of Xerxes I over the course of three days, during the second Persian invasion of Greece.
{
    for (i=1; i<=NF; i++) {

        field = $i

        if (field ~ word) {
            c = index($0, field)
            print NR "," c, $0
            next
        }
    }
}
The example simulates the grep utility. It finds the provided word and prints its line and its starting index. (The program finds only the first occurrence of the word in a line.) The word variable is passed to the program using the -v option.
$ awk -v word=the -f mygrep.awk thermopylae.txt
2,37 led by King Leonidas of Sparta, and the Persian Empire of Xerxes I over the
3,30 course of three days, during the second Persian invasion of Greece.
We have looked for the "the" word in the thermopylae.txt file.
Word frequency
Next, we calculate the word frequency of the Bible.
$ wget https://raw.githubusercontent.com/janbodnar/data/main/the-king-james-bible.txt
We download the King James Bible.
$ file the-king-james-bible.txt
the-king-james-bible.txt: UTF-8 Unicode (with BOM) text
When we examine the text, we can see that it is UTF-8 Unicode text with a byte order mark. The BOM must be taken into account in the AWK program.
{
    if (NR == 1) {
        sub(/^\xef\xbb\xbf/,"")
    }

    gsub(/[,;!()*:?.]*/, "")

    for (i = 1; i <= NF; i++) {

        if ($i ~ /^[0-9]/) {
            continue
        }

        w = $i
        words[w]++
    }
}

END {
    for (idx in words) {
        print idx, words[idx]
    }
}
We calculate how many times a word is present in the book.
if (NR == 1) {
    sub(/^\xef\xbb\xbf/,"")
}
From the first line, we remove the BOM character. If we did not remove the BOM, the very first word (The in our case) would contain it and would be thus recognized as a unique word.
gsub(/[,;!()*:?.]*/, "")
From the current record, we remove punctuation characters such as comma and colon. Otherwise, text such as the and the, would be considered as two distinct words.
The gsub function globally replaces the given regular expression with the specified string; since the string is empty, the matched characters are deleted. If the string where the substitutions should take place is not specified, $0 is assumed.
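Two details of gsub can be shown with a short sketch on made-up input: the function returns the number of substitutions it made, and an optional third argument names the target, for instance a single field.

```shell
# Without a third argument, gsub works on $0; n holds the replacement count.
cleaned=$(echo 'the, the; the.' | awk '{ n = gsub(/[,;.]/, ""); print n, $0 }')
echo "$cleaned"

# With a third argument, only that field is changed.
second=$(echo 'foo boo' | awk '{ gsub(/o/, "0", $2); print }')
echo "$second"
```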
for (i = 1; i <= NF; i++) {

    if ($i ~ /^[0-9]/) {
        continue
    }

    w = $i
    words[w]++
}
In the for loop, we go over the fields of the current line. Bible text is preceded with verse numbers; these we do not want to include. So if a field begins with a digit, we skip the current cycle with the continue keyword. The words is an array of words. Each index is a word from the text. The values corresponding to the indexes are frequencies. Each time a word is encountered, its value is incremented.
END {
    for (idx in words) {
        print idx, words[idx]
    }
}
In the end, we go through the words and print their indexes (words) and values (frequencies).
$ awk -f word_freq.awk the-king-james-bible.txt > bible_words.txt
We run the program and redirect the output to the file.
$ sort -nr -k 2 bible_words.txt | head
the 62103
and 38848
of 34478
to 13400
And 12846
that 12576
in 12331
shall 9760
he 9665
unto 8942
We sort the data and print the ten most frequent words.
PROCINFO is a special, built-in array which can influence the AWK program. For instance, it can determine the way the array is traversed. It is a GAWK extension.
{
    if (NR == 1) {
        sub(/^\xef\xbb\xbf/,"")
    }

    gsub(/[,;!()*:?.]*/, "")

    for (i = 1; i <= NF; i++) {

        if ($i ~ /[0-9]/) {
            continue
        }

        w = $i
        words[w]++
    }
}

END {
    PROCINFO["sorted_in"] = "@val_num_desc"

    for (idx in words) {
        print idx, words[idx]
    }
}
With PROCINFO["sorted_in"] = "@val_num_desc", we traverse the array by comparing the values in descending order.
$ awk -f freq_top.awk the-king-james-bible.txt | head
the 62103
and 38848
of 34478
to 13400
And 12846
that 12576
in 12331
shall 9760
he 9665
unto 8942
Spell checking
We create an AWK program for spell checking.
BEGIN {
    count = 0

    i = 0
    while ((getline myword <"/usr/share/dict/words") > 0) {
        dict[i] = myword
        i++
    }
}

{
    for (i=1; i<=NF; i++) {

        field = $i

        if (match(field, /[[:punct:]]$/)) {
            field = substr(field, 1, RSTART-1)
        }

        mywords[count] = field
        count++
    }
}

END {
    for (w_i in mywords) {
        for (w_j in dict) {
            if (mywords[w_i] == dict[w_j] || tolower(mywords[w_i]) == dict[w_j]) {
                delete mywords[w_i]
            }
        }
    }

    for (w_i in mywords) {
        if (mywords[w_i] != "") {
            print mywords[w_i]
        }
    }
}
The script compares the words of the provided text file against a dictionary. Under the standard /usr/share/dict/words path we can find an English dictionary; each word is on a separate line.
BEGIN {
    count = 0

    i = 0
    while ((getline myword <"/usr/share/dict/words") > 0) {
        dict[i] = myword
        i++
    }
}
Inside the BEGIN block, we read the words from the dictionary into the dict array. The getline command reads a record from the given file name; the record is stored in the myword variable. The command returns 1 when a record was read, 0 at the end of the file, and -1 on error, so the loop runs only while the result is greater than zero.
{
    for (i=1; i<=NF; i++) {

        field = $i

        if (match(field, /[[:punct:]]$/)) {
            field = substr(field, 1, RSTART-1)
        }

        mywords[count] = field
        count++
    }
}
In the main part of the program, we place the words of the file that we are spell checking into the mywords array. We remove any punctuation marks (like commas or dots) from the endings of the words. String positions start at 1, so substr(field, 1, RSTART-1) keeps everything before the punctuation mark.
END {
    for (w_i in mywords) {
        for (w_j in dict) {
            if (mywords[w_i] == dict[w_j] || tolower(mywords[w_i]) == dict[w_j]) {
                delete mywords[w_i]
            }
        }
    }
    ...
}
We compare the words from the mywords array against the dictionary array. If the word is in the dictionary, it is removed with the delete command. Words that begin a sentence start with an uppercase letter; therefore, we also check for a lowercase alternative utilizing the tolower function.
for (w_i in mywords) {
    if (mywords[w_i] != "") {
        print mywords[w_i]
    }
}
Remaining words have not been found in the dictionary; they are printed to the console.
$ awk -f spellcheck.awk text
consciosness
finaly
We have run the program on a text file; we have found two misspelled words. Note that the program takes some time to finish.
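Most of the running time is spent in the nested loops, which compare every word of the text with every word of the dictionary. A possible faster variant, sketched below under a few assumptions, stores the dictionary words as array keys and tests membership with the in operator. The three-word dictionary, the sample sentence, and the dictfile variable are made up for the illustration; the punctuation set is also reduced to a few explicit characters.

```shell
dictfile=$(mktemp)
printf 'the\nquick\nfox\n' > "$dictfile"      # tiny stand-in dictionary

misspelled=$(echo 'The quik fox.' | awk -v dictfile="$dictfile" '
BEGIN {
    # Each dictionary word becomes a key of the dict array.
    while ((getline word < dictfile) > 0) {
        dict[word] = 1
    }
    close(dictfile)
}
{
    for (i = 1; i <= NF; i++) {
        field = $i
        sub(/[.,;:!?]+$/, "", field)           # strip trailing punctuation
        # One array lookup per word replaces the inner loop.
        if (!(field in dict) && !(tolower(field) in dict)) {
            print field
        }
    }
}')
rm -f "$dictfile"
echo "$misspelled"
```

Here only quik is reported; The is accepted through its lowercase form.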
Rock-paper-scissors
Rock-paper-scissors is a popular hand game in which each player simultaneously forms one of three shapes with an outstretched hand. We create this game in AWK.
# This program creates a rock-paper-scissors game. BEGIN { srand() opts[1] = "rock" opts[2] = "paper" opts[3] = "scissors" do { print "1 - rock" print "2 - paper" print "3 - scissors" print "9 - end game" ret = getline < "-" if (ret == 0 || ret == -1) { exit } val = $0 if (val == 9) { exit } else if (val != 1 && val != 2 && val != 3) { print "Invalid option" continue } else { play_game(val) } } while (1) } function play_game(val) { r = int(rand()*3) + 1 print "I have " opts[r] " you have " opts[val] if (val == r) { print "Tie, next throw" return } if (val == 1 && r == 2) { print "Paper covers rock, you loose" } else if (val == 2 && r == 1) { print "Paper covers rock, you win" } else if (val == 2 && r == 3) { print "Scissors cut paper, you loose" } else if (val == 3 && r == 2) { print "Scissors cut paper, you win" } else if (val == 3 && r == 1) { print "Rock blunts scissors, you loose" } else if (val == 1 && r == 3) { print "Rock blunts scissors, you win" } }
We play the game against the computer, which chooses its options randomly.
srand()
We seed the random number generator with the srand function. Called without an argument, it uses the current time as the seed.
opts[1] = "rock"
opts[2] = "paper"
opts[3] = "scissors"
The three options are stored in the opts array.
do {
    print "1 - rock"
    print "2 - paper"
    print "3 - scissors"
    print "9 - end game"
    ...
The cycle of the game is controlled by the do-while loop. First, the options are printed to the terminal.
ret = getline < "-"

if (ret == 0 || ret == -1) {
    exit
}

val = $0
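Our choice is read from the standard input with the getline command; the "-" file name stands for the standard input. The command returns 0 at the end of input and -1 on error; in both cases we exit. Otherwise, the value is stored in the val variable.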
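The return values of getline can be observed in isolation. The following is a minimal sketch that assumes an AWK implementation which treats "-" as the standard input, as the game itself does; the single input line is made-up sample data.

```shell
# Sketch: getline returns 1 for a successful read and 0 at end of input.
result=$(printf '2\n' | awk 'BEGIN {
    # first read succeeds and stores the line in val
    ret = getline val < "-"
    print "ret=" ret, "val=" val

    # second read hits the end of input
    ret = getline val < "-"
    print "ret=" ret
}')
printf '%s\n' "$result"
```

This form reads into a named variable and leaves $0 untouched, whereas the game uses the bare getline form, which sets $0.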
if (val == 9) {
    exit
} else if (val != 1 && val != 2 && val != 3) {
    print "Invalid option"
    continue
} else {
    play_game(val)
}
We exit the program if we choose option 9. If the value is outside the printed menu options, we print an error message and start a new iteration with the continue statement. If we have chosen one of the three valid options, we call the play_game function.
r = int(rand()*3) + 1
A random integer from 1 to 3 is computed with the rand function, which returns a value in the range [0, 1). This is the choice of the computer.
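The range of the expression can be checked empirically. The following is a minimal sketch; the draw count of 1000 is an arbitrary choice for illustration.

```shell
# Sketch: verify that int(rand()*3) + 1 always yields 1, 2 or 3.
result=$(awk 'BEGIN {
    srand()
    ok = 1
    for (i = 0; i < 1000; i++) {
        # rand() is in [0, 1), so rand()*3 is in [0, 3) and int() gives 0, 1 or 2
        r = int(rand() * 3) + 1
        if (r < 1 || r > 3) {
            ok = 0
        }
    }
    print (ok ? "all in range" : "out of range")
}')
printf '%s\n' "$result"
```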
if (val == r) {
    print "Tie, next throw"
    return
}
If both players choose the same option, there is a tie. We return from the function and a new iteration of the loop starts.
if (val == 1 && r == 2) {
    print "Paper covers rock, you lose"
} else if (val == 2 && r == 1) {
    ...
We compare the chosen values of the players and print the result to the console.
$ awk -f rock_scissors_paper.awk
1 - rock
2 - paper
3 - scissors
9 - end game
1
I have scissors you have rock
Rock blunts scissors, you win
1 - rock
2 - paper
3 - scissors
9 - end game
3
I have paper you have scissors
Scissors cut paper, you win
1 - rock
2 - paper
3 - scissors
9 - end game
A sample run of the game.
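The chain of comparisons can also be expressed arithmetically. The following is a minimal sketch of an alternative technique, not the tutorial's code: with the numbering 1 = rock, 2 = paper, 3 = scissors, each option beats the one numbered just below it (wrapping around), so the difference modulo 3 decides the round. The outcome function and its messages are hypothetical names introduced for illustration.

```shell
# Sketch: decide the round with modular arithmetic instead of an if-else chain.
result=$(awk '
# d is a local variable, declared as an extra parameter
function outcome(val, r,    d) {
    d = (val - r + 3) % 3
    if (d == 0) {
        return "tie"
    }
    return (d == 1) ? "you win" : "you lose"
}
BEGIN {
    print outcome(1, 3)   # rock against scissors
    print outcome(1, 2)   # rock against paper
    print outcome(2, 2)   # paper against paper
}')
printf '%s\n' "$result"
```

The explicit chain in the tutorial has the advantage that each branch can print its own message, such as which shape defeats which.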
Marking keywords
In the following example, we mark Java keywords in a source file.
# the program adds tags around Java keywords
# it works on keywords that are separate words

BEGIN {

    # load java keywords
    i = 0
    # the > 0 test stops the loop at end of file and on a read error
    while ((getline kwd < "javakeywords2") > 0) {
        keywords[i] = kwd
        i++
    }
}

{
    mtch = 0
    ln = ""
    space = ""

    # calculate the beginning space
    # RSTART is 1-based, so the indentation is RSTART - 1 characters wide
    if (match($0, /[^[:space:]]/)) {
        if (RSTART > 1) {
            space = sprintf("%*s", RSTART - 1, "")
        }
    }

    # add the space to the line
    ln = ln space

    for (i=1; i <= NF; i++) {

        field = $i

        # go through keywords
        for (w_i in keywords) {

            kwd = keywords[w_i]

            # check if a field is a keyword
            if (field == kwd) {
                mtch = 1
            }
        }

        # add tags to the line
        if (mtch == 1) {
            ln = ln "<kwd>" field "</kwd> "
        } else {
            ln = ln field " "
        }

        mtch = 0
    }

    print ln
}
The program adds <kwd> and </kwd> tags around each of the keywords that it recognizes. This is a basic example; it works on keywords that are separate words. It does not address the more complicated structures.
# load java keywords
i = 0
# the > 0 test stops the loop at end of file and on a read error
while ((getline kwd < "javakeywords2") > 0) {
    keywords[i] = kwd
    i++
}
We load Java keywords from a file; each keyword is on a separate line.
The keywords are stored in the keywords array.
# calculate the beginning space
# RSTART is 1-based, so the indentation is RSTART - 1 characters wide
if (match($0, /[^[:space:]]/)) {
    if (RSTART > 1) {
        space = sprintf("%*s", RSTART - 1, "")
    }
}
Using a regular expression, we look for the first non-whitespace character of the line. The match function stores its position in the RSTART variable. If the line is indented, the space variable becomes a string of blanks as wide as the leading whitespace of the current line. The space is calculated in order to keep the indentation of the program.
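The relation between RSTART and the indentation width can be tried on two sample lines. The following is a minimal sketch under two assumptions made for portability: the bracket expression [^ \t] stands in for the character class, and the format string is built by concatenation instead of using a star width. The input lines are made-up sample data.

```shell
# Sketch: measure the leading whitespace of each line and rebuild it as padding.
result=$(printf '    int x = 1;\nclass Test {\n' | awk '{
    indent = 0
    # RSTART is the 1-based position of the first non-blank character
    if (match($0, /[^ \t]/)) {
        indent = RSTART - 1
    }
    # build a format such as "%4s" and apply it to an empty string
    pad = (indent > 0) ? sprintf("%" indent "s", "") : ""
    print indent ":" pad "|"
}')
printf '%s\n' "$result"
```

The first line is indented by four blanks, so the sketch prints 4 followed by four blanks of padding; the second line has no indentation.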
# add the space to the line
ln = ln space
The space is added to the ln variable. In AWK, strings are concatenated by placing them next to each other, separated by a space.
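Concatenation has no operator symbol of its own, which is easy to overlook. The following is a minimal sketch with made-up variable names; it also shows that arithmetic binds more tightly than concatenation.

```shell
# Sketch: string concatenation by juxtaposition.
result=$(awk 'BEGIN {
    first = "book"
    second = "worm"

    word = first second      # join two variables
    word = word "s"          # append a literal
    print word

    # the addition is evaluated before the concatenation
    print "sum: " 1 + 2
}')
printf '%s\n' "$result"
```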
for (i=1; i <= NF; i++) {

    field = $i
    ...
}
We go through the fields of the current line; the field in question is stored in the field variable.
# go through keywords
for (w_i in keywords) {

    kwd = keywords[w_i]

    # check if a field is a keyword
    if (field == kwd) {
        mtch = 1
    }
}
In a for loop, we go through the Java keywords and check if a field is a Java keyword.
# add tags to the line
if (mtch == 1) {
    ln = ln "<kwd>" field "</kwd> "
} else {
    ln = ln field " "
}
If there is a keyword, we attach the tags around the keyword; otherwise we just append the field to the line.
print ln
The constructed line is printed to the console.
$ awk -f markkeywords2.awk program.java
<kwd>package</kwd> com.zetcode;
<kwd>class</kwd> Test {
    <kwd>int</kwd> x = 1;
    <kwd>public</kwd> <kwd>void</kwd> exec1() {
        System.out.println(this.x);
        System.out.println(x);
    }
    <kwd>public</kwd> <kwd>void</kwd> exec2() {
        <kwd>int</kwd> z = 5;
        System.out.println(x);
        System.out.println(z);
    }
}
<kwd>public</kwd> <kwd>class</kwd> MethodScope {
    <kwd>public</kwd> <kwd>static</kwd> <kwd>void</kwd> main(String[] args) {
        Test ts = <kwd>new</kwd> Test();
        ts.exec1();
        ts.exec2();
    }
}
A sample run on a small Java program.
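The tagging step can be tried without the keyword file. The following is a minimal sketch that follows the same field-by-field loop; the inline keyword list built with split is a hypothetical stand-in for the javakeywords2 file, and the input line is made-up sample data.

```shell
# Sketch: tag keywords in one line, using an inline keyword list.
result=$(printf 'public int x;\n' | awk '
BEGIN {
    # hypothetical keyword list; the tutorial loads it from a file
    n = split("public int void class", keywords, " ")
}
{
    ln = ""
    for (i = 1; i <= NF; i++) {
        field = $i
        mtch = 0

        # check if the field is a keyword
        for (k = 1; k <= n; k++) {
            if (field == keywords[k]) {
                mtch = 1
            }
        }

        if (mtch) {
            field = "<kwd>" field "</kwd>"
        }
        # every field is followed by a single space
        ln = ln field " "
    }
    print ln
}')
printf '%s\n' "$result"
```

As in the tutorial's program, each output line ends with a trailing space, and x; is left untagged because it is not a keyword.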
This was the AWK tutorial.