AWK tutorial

This is an AWK tutorial. It covers the basics of the AWK tool.

AWK

AWK is a pattern scanning and processing language. An AWK program consists of a set of actions to be taken against streams of textual data. AWK extensively uses regular expressions. It is a standard feature of most Unix-like operating systems.

AWK was created at Bell Labs in 1977. Its name is derived from the family names of its authors – Alfred Aho, Peter Weinberger, and Brian Kernighan.

AWK program

An AWK program consists of a sequence of pattern-action statements and optional function definitions. It processes text files. AWK is a line-oriented language. It divides its input into records, which by default are lines. Each record is broken up into a sequence of fields. The fields are accessed by special variables: $1 reads the first field, $2 the second and so on. The $0 variable refers to the whole record.
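As a minimal sketch of field access, the following snippet feeds one made-up record to AWK through printf. The show_fields helper name and the sample words are assumptions for illustration only. NF, the built-in count of fields in the current record, is printed as well.

```shell
# Print the first field, the third field, the field count, and the whole record.
show_fields() {
    printf 'brown tree book\n' | awk '{ print $1; print $3; print NF; print $0 }'
}

show_fields
```

The snippet prints brown, book, 3, and finally the whole record brown tree book.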

The structure of an AWK program has the following form:

pattern { action }

The pattern is a test that is performed on each of the records. If the condition is met then the action is performed. Either pattern or action can be omitted, but not both. The default pattern matches each line and the default action is to print the record.

awk -f program-file [file-list]
awk program [file-list]

An AWK program can be run in two basic ways: a) the program is read from a separate file; the name of the program follows the -f option, b) the program is specified on the command line enclosed by quote characters.
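The two invocation styles can be tried side by side. This is only a sketch: the program file is a temporary file created with mktemp, and the run_from_file and run_inline helper names are invented for the example.

```shell
# Store a one-rule AWK program in a temporary file (hypothetical, for the demo only).
prog=$(mktemp)
trap 'rm -f "$prog"' EXIT
printf '%s\n' 'length($0) > 4 { print $0 }' > "$prog"

# a) the program is read from a file given after -f
run_from_file() {
    printf 'tree\nbrown\n' | awk -f "$prog"
}

# b) the same program is passed on the command line in single quotes
run_inline() {
    printf 'tree\nbrown\n' | awk 'length($0) > 4 { print $0 }'
}

run_from_file
run_inline
```

Both calls print brown, the only input line longer than four characters.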

AWK one-liners

AWK one-liners are simple one-shot programs run from the command line. Let us have the following text file:

$ cat mywords 
brown
tree
craftsmanship
book
beautiful
existence
ministerial
computer
town

We want to print all words included in the mywords file that are longer than five characters.

$ awk 'length($1) > 5 {print}' mywords
craftsmanship
beautiful
existence
ministerial
computer

The AWK program is placed between two single quote characters. The first part is the pattern; we specify that the length of the first field is greater than five. The length() function returns the length of the string. The $1 variable refers to the first field of the record; in our case there is only one field per record. The action is placed between curly brackets.

$ awk 'length($1) > 5' mywords
craftsmanship
beautiful
existence
ministerial
computer

As we have specified earlier, the action can be omitted. In such a case a default action is performed — printing of the whole record.

Regular expressions are often applied to AWK fields. The ~ is the regular expression match operator. It checks if a string matches the provided regular expression.

$ awk '$1 ~ /^[bc]/ {print $1}' mywords
brown
craftsmanship
book
beautiful
computer

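In this program, we print all the words that begin with the character b or c. The regular expression is placed between two slash characters. Inside the square brackets the characters are listed without separators; a comma there would be matched literally.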
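The match operator has a negated counterpart, !~, which is true when the string does not match. The sketch below uses both on a few sample words; the classify helper name and the printed messages are made up for the example.

```shell
# Classify words with the match (~) and no-match (!~) operators.
classify() {
    printf 'brown\ntree\ncomputer\n' | awk '
        $1 ~  /^[bc]/ { print $1, "starts with b or c" }
        $1 !~ /^[bc]/ { print $1, "does not" }'
}

classify
```

Each record is tested against both rules in order, so every word produces exactly one line of output.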

AWK provides important built-in variables. For instance, NR is a built-in variable that holds the number of the record (line) currently being processed.

$ awk 'NR % 2 == 0 {print}' mywords
tree
book
existence
computer

The above program prints every second record of the mywords file. The NR % 2 == 0 condition is true when the record number is even.
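NR can be used in other conditions in the same way. The following sketch, with invented helper names and printf standing in for the mywords file, selects the odd-numbered records and the first two records.

```shell
# Records with an odd record number.
odd_lines() {
    printf 'brown\ntree\ncraftsmanship\nbook\n' | awk 'NR % 2 == 1'
}

# Only the first two records, prefixed with their number.
first_two() {
    printf 'brown\ntree\ncraftsmanship\nbook\n' | awk 'NR <= 2 { print NR ": " $0 }'
}

odd_lines
first_two
```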

Say we want to print the line numbers of the file.

$ awk '{print NR, $0}' mywords
1 brown
2 tree
3 craftsmanship
4 book
5 beautiful
6 existence
7 ministerial
8 computer
9 town

Again, we use the NR variable. We skip the pattern; therefore, the action is performed on each line. The $0 variable refers to the whole record.

For the following example, we have this C source file.

$ cat source.c 
1  #include <stdio.h>
2
3  int main(void) {
4
5      char *countries[5] = { "Germany", "Slovakia", "Poland", 
6              "China", "Hungary" };
7    
8      size_t len = sizeof(countries) / sizeof(*countries);
9    
10     for (size_t i=0; i < len; i++) {
11        
12         printf("%s\n", countries[i]);
13     }
14 }

Sometimes we copy source code together with its line numbers. Our task is to remove the numbers from the text.

$ awk '{print substr($0, 4)}' source.c
#include <stdio.h>

int main(void) {

    char *countries[5] = { "Germany", "Slovakia", "Poland", 
            "China", "Hungary" };
  
    size_t len = sizeof(countries) / sizeof(*countries);
  
    for (size_t i=0; i < len; i++) {
       
        printf("%s\n", countries[i]);
    }
}

We use the substr() function. It returns a substring of the given string. We apply the function on each line, skipping the first three characters. In other words, we print each record from the fourth character till its end.
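The substr() function also accepts an optional third argument, the length of the substring. As a small sketch, the snippet below splits two made-up numbered lines into the three-character number column and the rest; the split_columns name and the bracket markers are assumptions for the example.

```shell
# substr(s, 1, 3) returns the first three characters, substr(s, 4) the remainder.
split_columns() {
    printf '9  int a;\n10 int b;\n' |
        awk '{ num = substr($0, 1, 3); code = substr($0, 4); print "[" num "] [" code "]" }'
}

split_columns
```

The brackets only make the padding of the number column visible in the output.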

BEGIN and END patterns

BEGIN and END are special patterns. The BEGIN block is executed before the first record is read and the END block after the last record has been processed. These two keywords are followed by curly brackets where we specify statements to be executed.

We have the following two files:

$ cat mywords
brown
tree
craftsmanship
book
beautiful
existence
ministerial
computer
town
$ cat mywords2
pleasant
curly
storm
hering
immune

We want to know the number of lines in those two files.

$ awk 'END {print NR}' mywords mywords2
14

We pass two files to the AWK program. AWK processes the files in the order in which they are given on the command line. The block following the END keyword is executed at the end of the program; NR is not reset between files, so it holds the total number of records read: 9 + 5 = 14.
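Besides NR, AWK provides FNR, the record number within the current file, and FILENAME, the name of the file being read. The sketch below contrasts them on two small temporary files; the file names a.txt and b.txt and the per_file helper are made up for the example.

```shell
# Create two small input files in a temporary directory (hypothetical names).
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf 'brown\ntree\n' > "$dir/a.txt"
printf 'storm\n' > "$dir/b.txt"

# NR keeps counting across files; FNR restarts at 1 for each file.
per_file() {
    (cd "$dir" && awk '{ print FILENAME, NR, FNR }' a.txt b.txt)
}

per_file
```

The last line of the output is b.txt 3 1: the third record overall, but the first record of the second file.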

$ awk 'BEGIN {srand()} {lines[NR] = $0} END { r=int(rand()*NR + 1); print lines[r]}' mywords
tree

The above program prints a random line from the mywords file. The srand() function seeds the random number generator. The function has to be executed only once. In the main part of the program, we store the current record into the lines array. In the end, we compute a random number between 1 and NR and print the randomly chosen line from the array structure.
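A different technique, reservoir sampling, picks a random line without keeping the whole file in an array. The following is only a sketch of that alternative, not the program shown above; the random_line helper and the sample words are assumptions.

```shell
# Keep the current line with probability 1/NR; after the last record,
# every line has had the same chance of being the one kept.
random_line() {
    printf 'brown\ntree\nbook\n' | awk '
        BEGIN { srand() }
        rand() * NR < 1 { pick = $0 }
        END { print pick }'
}

random_line
```

For the first record the condition is always true, because rand() returns a value that is lower than 1.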

The match function

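The match() function is a built-in string manipulation function. It tests whether the given string contains a match for a regular expression. The first parameter is the string, the second is the regex pattern. It returns the position where the match starts, or 0 if there is no match, so it can be used as a pattern in a similar way to the ~ operator.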

$ awk 'match($0, /^[cb]/)' mywords
brown
craftsmanship
book
beautiful
computer

The program prints those lines that begin with c or b. The regular expression is placed between two slash characters.

The match() function sets the RSTART variable; it is the index of the start of the matching pattern.

$ awk 'match($0, /i/) {print $0 " has i character at " RSTART}' mywords
craftsmanship has i character at 12
beautiful has i character at 6
existence has i character at 3
ministerial has i character at 2

The program prints those words that contain the i character. In addition, it prints the index of the first occurrence of the character.
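Together with RSTART, the match() function sets RLENGTH, the length of the matched text. Combined with substr(), the two variables extract the match itself. A minimal sketch with made-up input lines and a hypothetical first_digits helper:

```shell
# Print the first run of digits on each line, with its position and length.
first_digits() {
    printf 'room 4217 east\nno digits here\n' |
        awk 'match($0, /[0-9]+/) { print substr($0, RSTART, RLENGTH), RSTART, RLENGTH }'
}

first_digits
```

Only the first line contains digits, so the snippet prints 4217 6 4.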

AWK built-in variables

AWK has several built-in variables. They are set by AWK when the program is run. We have already seen the NR, $0, and RSTART variables.

$ awk 'BEGIN { print ARGC, ARGV[0], ARGV[1]}' mywords
2 awk mywords

The program prints the number of arguments of the AWK program and the first two arguments. ARGC is the number of command line arguments; in our case there are two arguments, including the awk command itself. ARGV is an array of command line arguments. The array is indexed from 0 to ARGC - 1.
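The whole ARGV array can be walked with a for loop. In the sketch below the program consists of a BEGIN block only, so AWK does not read the named files and they do not need to exist; the list_args helper name is invented.

```shell
# List the command line arguments that follow the program text.
list_args() {
    awk 'BEGIN { for (i = 1; i < ARGC; i++) print i, ARGV[i] }' mywords mywords2
}

list_args
```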

FS is the input field separator; by default it is a single space, which makes AWK split fields on runs of spaces and tabs. NF is the number of fields in the current input record.

For the following program, we use this file:

$ cat values 
2, 53, 4, 16, 4, 23, 2, 7, 88
4, 5, 16, 42, 3, 7, 8, 39, 21
23, 43, 67, 12, 11, 33, 3, 6

We have three lines of comma-separated values.

stats.awk
BEGIN {

    FS=","
    max = 0
    min = 10^10
    sum = 0
    count = 0
    avg = 0
}

{
    for (i=1; i<=NF; i++) {

        sum += $i
        count++

        if (max < $i) {
            max = $i
        }

        if (min > $i) {
            min = $i
        }

        printf("%d ",  $i)
    }
}

END {

    # divide by the number of all values, not by NF of the last record
    avg = sum / count
    printf("\n")
    printf("Min: %d, Max: %d, Sum: %d, Average: %d\n", min, max, sum, avg)
}

The program computes basic statistics from the provided values.

FS=","

The values in the file are separated by the comma character; therefore, we set the FS variable to the comma character.

max = 0
min = 10^10
sum = 0
count = 0
avg = 0

We define default values for the maximum, minimum, sum, count, and average. The ^ operator is exponentiation. AWK variables are dynamic; their values are either floating-point numbers or strings, or both, depending upon how they are used.

{
    for (i=1; i<=NF; i++) {

        sum += $i
        count++

        if (max < $i) {
            max = $i
        }

        if (min > $i) {
            min = $i
        }

        printf("%d ",  $i)
    }
}

In the main part of the script, we go through each line and calculate the maximum, minimum, and the sum of the values, and we count them. The NF variable is used to determine the number of values per line.

END {

    # divide by the number of all values, not by NF of the last record
    avg = sum / count
    printf("\n")
    printf("Min: %d, Max: %d, Sum: %d, Average: %d\n", min, max, sum, avg)
}

In the end part of the script, we calculate the average by dividing the sum by the number of values and print the calculations to the console. Inside the END block, NF holds only the field count of the last record, so it cannot be used as the divisor.

$ awk -f stats.awk values
2 53 4 16 4 23 2 7 88 4 5 16 42 3 7 8 39 21 23 43 67 12 11 33 3 6 
Min: 2, Max: 88, Sum: 542, Average: 20

This is the output of the stats.awk program.
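NF is set anew for every record, so lines with a different number of values need no special handling. As a small sketch, with shortened sample data and a hypothetical row_sums helper, the snippet below prints the record number, the field count, and the sum for each line.

```shell
# -F, sets the field separator to a comma; s is reset for every record.
row_sums() {
    printf '2, 53, 4\n4, 5\n' |
        awk -F, '{ s = 0; for (i = 1; i <= NF; i++) s += $i; print NR, NF, s }'
}

row_sums
```

The first line has three fields that add up to 59, the second has two that add up to 9.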

The FS variable can be specified as a command line option with the -F flag.

$ awk -F: '{print $1, $7}' /etc/passwd | head -7
root /bin/bash
daemon /usr/sbin/nologin
bin /usr/sbin/nologin
sys /usr/sbin/nologin
sync /bin/sync
games /usr/sbin/nologin
man /usr/sbin/nologin

The example prints the first (the user name) and the seventh field (user's shell) from the system /etc/passwd file. The head command is used to print only the first seven lines. The data in the /etc/passwd file is separated by a colon. So the colon is given to the -F option.
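The -F option and an assignment to FS in a BEGIN block are interchangeable. The sketch below shows both forms on one made-up passwd-style line and also sets OFS, the output field separator that print places between its comma-separated arguments. The helper names and the sample line are assumptions.

```shell
# Field separator given with -F, output separator with -v.
with_option() {
    printf 'root:x:0:0:root:/root:/bin/bash\n' |
        awk -F: -v OFS=' -> ' '{ print $1, $7 }'
}

# The same separators assigned inside a BEGIN block.
with_begin() {
    printf 'root:x:0:0:root:/root:/bin/bash\n' |
        awk 'BEGIN { FS = ":"; OFS = " -> " } { print $1, $7 }'
}

with_option
with_begin
```

Both functions print root -> /bin/bash.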

The RS is the input record separator, by default a newline.

$ echo "Jane 17#Tom 23#Mark 34" | awk 'BEGIN {RS="#"} {print $1, "is", $2, "years old"}'
Jane is 17 years old
Tom is 23 years old
Mark is 34 years old

In the example, we have relevant data separated by the # character. The RS variable is set to # so that the input is split into records at that character. AWK can receive input from other commands like echo.
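RS and FS can be changed together. The following sketch parses made-up key=value pairs separated by semicolons; the pairs helper and the data are assumptions. The input is written without a trailing newline so that the last value stays clean.

```shell
# Records end at ";" and fields are split at "=".
pairs() {
    printf 'name=Jane;age=17;city=Oslo' |
        awk 'BEGIN { RS = ";"; FS = "=" } { print $1 " -> " $2 }'
}

pairs
```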

Passing variables to AWK

AWK has the -v option which is used to assign values to variables. For the next program, we have the text file:

$ cat text
The French nation, oppressed, degraded during many centuries
by the most insolent despotism, has finally awakened to a 
consciousness of its rights and of the power to which its 
destinies summon it.

mygrep.awk
{
    for (i=1; i<=NF; i++) {

        field = $i
        
        if (field ~ word) {
            c = index($0, field)
            print NR "," c, $0
            next
        }
    }
}

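The example simulates the grep utility. It finds the provided word and prints the line number, the starting index of the word, and the line itself. (The program reports only the first occurrence on each line.) Because the ~ operator is used, the word is treated as a regular expression and also matches inside longer words. The word variable is passed to the program using the -v option.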

$ awk -f mygrep.awk -v word=the text
2,4 by the most insolent despotism, has finally awakened to a 
3,36 consciousness of its rights and of the power to which its 

We have searched for the word "the" in the text file. The match is case-sensitive, so the capitalized "The" on the first line is not reported.
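When the search term should be taken literally rather than as a regular expression, the index() function alone is enough. This is a simplified sketch, not the mygrep.awk program itself: it searches the whole record instead of individual fields, and the find_word helper and sample lines are made up.

```shell
# index() returns the position of a literal substring, or 0 if it is absent.
find_word() {
    printf 'by the most insolent despotism\nits rights\n' |
        awk -v word=the '{ c = index($0, word); if (c > 0) print NR "," c, $0 }'
}

find_word
```

Only the first line contains the text, starting at the fourth character.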

Pipes

AWK can receive input from other commands and send output to them via the pipe.

$ echo -e "1 2 3 5\n2 2 3 8" | awk '{print $(NF)}'
5
8

In this case, AWK receives output from the echo command. It prints the values of the last column.

$ awk -F: '$7 ~ /bash/ {print $1}' /etc/passwd | wc -l
3

Here, the AWK program sends data to the wc command via the pipe. In the AWK program, we select the users whose shell is bash. Their names are passed to the wc command, which counts them. In our case, there are three users using bash.
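A pipe can also be opened from inside an AWK program: print followed by | and a command string sends the output to that command. The sketch below sorts the selected names with the sort utility; the sample data and the sorted_names helper are assumptions.

```shell
# Send the names of bash users to sort; close() ends the pipe and flushes it.
sorted_names() {
    printf 'tom:/bin/bash\namy:/bin/sh\nbob:/bin/bash\n' |
        awk -F: '$2 ~ /bash/ { print $1 | "sort" } END { close("sort") }'
}

sorted_names
```

The command string passed to close() has to be the same string that was used to open the pipe.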

Spell checking

We create an AWK program for spell checking.

spellcheck.awk
BEGIN {
    count = 0
    
    i = 0
    while ((getline myword < "/usr/share/dict/words") > 0) {
        dict[i] = myword
        i++
    }
}

{
    for (i=1; i<=NF; i++) {

        field = $i

        if (match(field, /[[:punct:]]$/)) {
            field = substr(field, 1, RSTART-1)
        }

        mywords[count] = field
        count++
    }
}

END {

    for (w_i in mywords) { 
        for (w_j in dict) { 
            if (mywords[w_i] == dict[w_j] || 
                        tolower(mywords[w_i]) == dict[w_j]) {
                delete mywords[w_i]
            }
        }
    }

    for (w_i in mywords) { 
        if (mywords[w_i] != "") {
            print mywords[w_i]        
        }
    }
}

The script compares the words of the provided text file against a dictionary. Under the standard /usr/share/dict/words path we can find an English dictionary; each word is on a separate line.

BEGIN {
    count = 0

    i = 0
    while ((getline myword < "/usr/share/dict/words") > 0) {
        dict[i] = myword
        i++
    }
}

Inside the BEGIN block, we read the words from the dictionary into the dict array. The getline command reads a record from the given file; in this form the record is stored in the myword variable. Comparing the return value with 0 ends the loop at the end of the file and also when the file cannot be read, in which case getline returns -1.

{
    for (i=1; i<=NF; i++) {

        field = $i

        if (match(field, /[[:punct:]]$/)) {
            field = substr(field, 1, RSTART-1)
        }
    
        mywords[count] = field
        count++
    }
}

In the main part of the program, we place the words of the file that we are spell checking into the mywords array. We remove any punctuation marks (like commas or dots) from the endings of the words.

END {

    for (w_i in mywords) { 
        for (w_j in dict) { 
            if (mywords[w_i] == dict[w_j] || 
                        tolower(mywords[w_i]) == dict[w_j]) {
                delete mywords[w_i]
            }
        }
    }
...
}    

We compare the words from the mywords array against the dictionary array. If the word is in the dictionary, it is removed with the delete command. Words that begin a sentence start with an uppercase letter; therefore, we also check for a lowercase alternative utilizing the tolower() function.

for (w_i in mywords) { 
    if (mywords[w_i] != "") {
        print mywords[w_i]        
    }
}

Remaining words have not been found in the dictionary; they are printed to the console.

$ awk -f spellcheck.awk text
consciosness
finaly

We have run the program on a text file; we have found two misspelled words. Note that the program takes some time to finish, because every word of the text is compared against every entry of the dictionary.
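A faster variant can use the dictionary words as array indexes and test membership with the in operator, which avoids the nested loops. The following is only a sketch of that idea, not the spellcheck.awk program above: the three-word dictionary is a temporary file created for the demo, and the dictfile variable and spellcheck helper are invented names.

```shell
# Build a tiny stand-in dictionary (hypothetical; replace with a real word list).
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf 'has\nfinally\nawakened\n' > "$dir/dict"

spellcheck() {
    printf 'Has finaly awakened.\n' | awk -v dictfile="$dir/dict" '
    BEGIN {
        # use each dictionary word as an array index
        while ((getline w < dictfile) > 0) dict[w] = 1
        close(dictfile)
    }
    {
        for (i = 1; i <= NF; i++) {
            word = $i
            sub(/[[:punct:]]+$/, "", word)      # strip trailing punctuation
            if (!(word in dict) && !(tolower(word) in dict)) print word
        }
    }'
}

spellcheck
```

With the stand-in dictionary the snippet reports finaly as the only unknown word.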

Rock-paper-scissors

Rock-paper-scissors is a popular hand game in which each player simultaneously forms one of three shapes with an outstretched hand. We create this game in AWK.

rock_scissors_paper.awk
# This program creates a rock-paper-scissors game.

BEGIN {

    srand()
    
    opts[1] = "rock"
    opts[2] = "paper"
    opts[3] = "scissors"

    do {
    
        print "1 - rock"
        print "2 - paper"
        print "3 - scissors"
        print "9 - end game"
        
        ret = getline < "-"

        if (ret == 0 || ret == -1) {
            exit
        }
        
        val = $0
        
        if (val == 9) {
            exit
        } else if (val != 1 && val != 2 && val != 3) {
            print "Invalid option"
            continue
        } else {
            play_game(val)
        }
    
    } while (1)
}

function play_game(val) {

    r = int(rand()*3) + 1

    print "I have " opts[r] " you have "  opts[val]
    
    if (val == r) {
        print "Tie, next throw"
        return
    }
    
    if (val == 1 && r == 2) {
    
        print "Paper covers rock, you lose"
    } else if (val == 2 && r == 1) {
    
        print "Paper covers rock, you win"
    } else if (val == 2 && r == 3) {
    
        print "Scissors cut paper, you lose"
    } else if (val == 3 && r == 2) {
    
        print "Scissors cut paper, you win"
    } else if (val == 3 && r == 1) {
    
        print "Rock blunts scissors, you lose"
    } else if (val == 1 && r == 3) {
    
        print "Rock blunts scissors, you win"
    } 
}

We play the game against the computer, which chooses its options randomly.

srand()

We seed the random number generator with the srand() function.

opts[1] = "rock"
opts[2] = "paper"
opts[3] = "scissors"

The three options are stored in the opts array.

do {

    print "1 - rock"
    print "2 - paper"
    print "3 - scissors"
    print "9 - end game"
...    

The cycle of the game is controlled by the do-while loop. First, the options are printed to the terminal.

ret = getline < "-"

if (ret == 0 || ret == -1) {
    exit
}

val = $0

A value, our choice, is read from the standard input using the getline command; the value is stored in the val variable. If getline reports the end of input or an error, the program exits.

if (val == 9) {
    exit
} else if (val != 1 && val != 2 && val != 3) {
    print "Invalid option"
    continue
} else {
    play_game(val)
}

We exit the program if we choose option 9. If the value is outside the printed menu options, we print an error message and start a new loop with the continue command. If we have chosen one of the three options correctly, we call the play_game() function.

r = int(rand()*3) + 1

A random value from 1..3 is chosen with the rand() function. This is the choice of the computer.

if (val == r) {
    print "Tie, next throw"
    return
}

In case both players choose the same option there is a tie. We return from the function and a new loop is started.

if (val == 1 && r == 2) {

    print "Paper covers rock, you lose"
} else if (val == 2 && r == 1) {
...

We compare the chosen values of the players and print the result to the console.

$ awk -f rock_scissors_paper.awk 
1 - rock
2 - paper
3 - scissors
9 - end game
1
I have scissors you have rock
Rock blunts scissors, you win
1 - rock
2 - paper
3 - scissors
9 - end game
3
I have paper you have scissors
Scissors cut paper, you win
1 - rock
2 - paper
3 - scissors
9 - end game

A sample run of the game.
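With the options numbered 1 (rock), 2 (paper), and 3 (scissors), each option beats the one numbered just before it, with rock beating scissors. The winner can therefore also be decided with modular arithmetic. This is an alternative sketch, not the play_game() function above; the outcome helper is a made-up name that takes the player's and the computer's choice.

```shell
# (val - r + 3) % 3 is 0 for a tie, 1 when the player wins, 2 when the player loses.
outcome() {
    awk -v val="$1" -v r="$2" 'BEGIN {
        d = (val - r + 3) % 3
        if (d == 0) print "tie"
        else if (d == 1) print "you win"
        else print "you lose"
    }'
}

outcome 1 3   # rock against scissors
outcome 2 3   # paper against scissors
outcome 2 2   # paper against paper
```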

Marking keywords

In the following example, we mark Java keywords in a source file.

mark_keywords.awk
# the program adds tags around Java keywords
# it works on keywords that are separate words

BEGIN {

    # load java keywords
    i = 0
    while (getline kwd <"javakeywords2") {
        keywords[i] = kwd
        i++
    }
}

{
    mtch = 0
    ln = ""
    space = ""
    
    # calculate the beginning space
    if (match($0, /[^[:space:]]/)) {
        if (RSTART > 1) {
            space = sprintf("%*s", RSTART, "") 
        }
    }     
    
    # add the space to the line
    ln = ln space
    
    for (i=1; i <= NF; i++) {
    
        field = $i
         
        # go through keywords   
        for (w_i in keywords) { 
        
            kwd = keywords[w_i]
            
            # check if a field is a keyword
            if (field == kwd) {
                mtch = 1     
            } 
        }
        
        # add tags to the line        
        if (mtch == 1) {
            ln = ln  "<kwd>" field  "</kwd> "   
        } else {
            ln = ln field " " 
        }
        
        mtch = 0
            
    }
    
    print ln
}

The program adds <kwd> and </kwd> tags around each of the keywords that it recognizes. This is a basic example; it works on keywords that are separate words. It does not address the more complicated structures.

# load java keywords
i = 0
while (getline kwd <"javakeywords2") {
    keywords[i] = kwd
    i++
}

We load Java keywords from a file; each keyword is on a separate line. The keywords are stored in the keywords array.

# calculate the beginning space
if (match($0, /[^[:space:]]/)) {
    if (RSTART > 1) {
        space = sprintf("%*s", RSTART, "") 
    }
}        

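Using a regular expression, we find the position of the first non-blank character and build the leading space for the line, if any. The space variable is a string of blanks that preserves the indentation of the program. Note that RSTART is the index of the first non-blank character, so the padding is one character wider than the original indentation; a width of RSTART - 1 would reproduce it exactly.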

# add the space to the line
ln = ln space   

The space is added to the ln variable. In AWK, strings are concatenated by writing them next to each other, separated by a space.

for (i=1; i <= NF; i++) {

    field = $i
    ...
}

We go through the fields of the current line; the field in question is stored in the field variable.

# go through keywords   
for (w_i in keywords) { 

    kwd = keywords[w_i]
    
    # check if a field is a keyword
    if (field == kwd) {
        mtch = 1     
    } 
}

In a for loop, we go through the Java keywords and check if a field is a Java keyword.

# add tags to the line        
if (mtch == 1) {
    ln = ln  "<kwd>" field  "</kwd> "   
} else {
    ln = ln field " " 
}

If there is a keyword, we attach the tags around the keyword; otherwise we just append the field to the line.

print ln

The constructed line is printed to the console.

$ awk -f mark_keywords.awk program.java 
<kwd>package</kwd> com.zetcode; 

<kwd>class</kwd> Test { 

     <kwd>int</kwd> x = 1; 

     <kwd>public</kwd> <kwd>void</kwd> exec1() { 

         System.out.println(this.x); 
         System.out.println(x); 
     } 

     <kwd>public</kwd> <kwd>void</kwd> exec2() { 

         <kwd>int</kwd> z = 5; 

         System.out.println(x); 
         System.out.println(z); 
     } 
} 

<kwd>public</kwd> <kwd>class</kwd> MethodScope { 

     <kwd>public</kwd> <kwd>static</kwd> <kwd>void</kwd> main(String[] args) { 

         Test ts = <kwd>new</kwd> Test(); 
         ts.exec1(); 
         ts.exec2(); 
     } 
} 

A sample run on a small Java program.
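As a compact variation on the same idea, the keyword list can be turned into a set of array indexes with split(), so that each field is checked with a single in test. This is a hedged sketch and not the mark_keywords.awk program: the short inline keyword list, the mark_line helper, and the sample line are assumptions, and the indentation is copied with substr() up to RSTART - 1.

```shell
mark_line() {
    printf '    public int x = 1;\n' | awk '
    BEGIN {
        # build a lookup set from a short, illustrative keyword list
        n = split("public int class void static new package", list, " ")
        for (k = 1; k <= n; k++) kw[list[k]] = 1
    }
    {
        # copy the leading whitespace, then rebuild the line field by field
        match($0, /[^[:space:]]/)
        out = substr($0, 1, RSTART - 1)
        for (i = 1; i <= NF; i++) {
            word = ($i in kw) ? "<kwd>" $i "</kwd>" : $i
            out = out word (i < NF ? " " : "")
        }
        print out
    }'
}

mark_line
```

As in the original program, only keywords that stand as separate whitespace-delimited fields are recognized.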

This was an AWK tutorial.