ZetCode

Go word frequency

last modified April 11, 2024

In this article we show how to calculate word frequency in Golang.

$ go version
go version go1.22.2 linux/amd64

We use Go version 1.22.2.

Our examples work only with words made of unaccented Latin letters (a-z, A-Z) and are specifically targeted at analyzing the Bible.

$ wget https://raw.githubusercontent.com/janbodnar/data/main/the-king-james-bible.txt

We use the King James Bible.

To cut the text into words, we use Go's strings.FieldsFunc and regular expressions.

Go word frequency example I

The FieldsFunc function splits the string at each run of Unicode code points satisfying the provided function and returns a slice of the substrings between those runs.

word_freq.go
package main

import (
    "fmt"
    "log"
    "os"
    "sort"
    "strings"
)

func main() {

    fileName := "the-king-james-bible.txt"

    // os.ReadFile replaces ioutil.ReadFile, which is deprecated since Go 1.16
    bs, err := os.ReadFile(fileName)

    if err != nil {

        log.Fatal(err)
    }

    text := string(bs)

    fields := strings.FieldsFunc(text, func(r rune) bool {

        return !('a' <= r && r <= 'z' || 'A' <= r && r <= 'Z' || r == '\'')
    })

    wordsCount := make(map[string]int)

    for _, field := range fields {

        wordsCount[field]++
    }

    keys := make([]string, 0, len(wordsCount))

    for key := range wordsCount {

        keys = append(keys, key)
    }

    sort.Slice(keys, func(i, j int) bool {

        return wordsCount[keys[i]] > wordsCount[keys[j]]
    })

    for idx, key := range keys {

        fmt.Printf("%s %d\n", key, wordsCount[key])

        if idx == 10 {
            break
        }
    }
}

We count the frequency of the words from the King James Bible.

fields := strings.FieldsFunc(text, func(r rune) bool {

    return !('a' <= r && r <= 'z' || 'A' <= r && r <= 'Z' || r == '\'')
})

The FieldsFunc function splits the text at every character that is neither an ASCII letter nor an apostrophe. This also discards all the verse numbers.
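As a quick check of that predicate, the following sketch runs the same function over a single short line. The sample string and the helper name splitWords are assumptions made for illustration; they are not part of the original program.

```go
package main

import (
    "fmt"
    "strings"
)

// splitWords is a hypothetical helper wrapping the predicate from the example.
func splitWords(text string) []string {
    return strings.FieldsFunc(text, func(r rune) bool {
        // Split on every rune that is neither an ASCII letter nor an apostrophe.
        return !('a' <= r && r <= 'z' || 'A' <= r && r <= 'Z' || r == '\'')
    })
}

func main() {
    // Sample line made up for this sketch; digits, colon, spaces and the
    // period all act as separators.
    fmt.Println(splitWords("1:1 In the beginning God's word."))
}
```

The program prints [In the beginning God's word]: the leading verse number and the punctuation are gone, while the apostrophe stays inside the word.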

wordsCount := make(map[string]int)

for _, field := range fields {

    wordsCount[field]++
}

Each word and its frequency are stored in the wordsCount map.

keys := make([]string, 0, len(wordsCount))

for key := range wordsCount {

    keys = append(keys, key)
}

sort.Slice(keys, func(i, j int) bool {

    return wordsCount[keys[i]] > wordsCount[keys[j]]
})

In order to sort the words by frequency, we create a new keys slice. We put all the words there and sort them by their frequency values.
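Go does not specify the iteration order of a map, and sort.Slice is not guaranteed to be stable, so words with equal counts may come out in a different order from run to run. The following is a minimal sketch of one way to make the order reproducible by breaking ties alphabetically. The helper name sortedKeys and the sample counts are assumptions for illustration.

```go
package main

import (
    "fmt"
    "sort"
)

// sortedKeys is a hypothetical helper: it returns the words ordered by
// descending count, with equal counts ordered alphabetically.
func sortedKeys(counts map[string]int) []string {
    keys := make([]string, 0, len(counts))
    for key := range counts {
        keys = append(keys, key)
    }

    sort.Slice(keys, func(i, j int) bool {
        ci, cj := counts[keys[i]], counts[keys[j]]
        if ci != cj {
            return ci > cj // higher count first
        }
        return keys[i] < keys[j] // tie-break on the word itself
    })

    return keys
}

func main() {
    // Made-up counts; "of" and "and" tie on purpose.
    counts := map[string]int{"the": 3, "of": 2, "and": 2, "unto": 1}
    fmt.Println(sortedKeys(counts))
}
```

With the tie-break in place, the sketch always prints [the and of unto].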

for idx, key := range keys {

    fmt.Printf("%s %d\n", key, wordsCount[key])

    if idx == 10 {
        break
    }
}

We print the most frequent words from the Bible. Because the loop breaks only after printing the entry at index 10, eleven words are printed.

$ go run word_freq.go
the 62103
and 38848
of 34478
to 13400
And 12846
that 12576
in 12331
shall 9760
he 9665
unto 8942
I 8854
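The output lists and and And separately because the map keys are case-sensitive. If case-insensitive counts are preferred, one possible approach is to lowercase each field before counting. This is a sketch only; the helper name countFolded and the sample fields are assumptions, and the original example does not do this.

```go
package main

import (
    "fmt"
    "strings"
)

// countFolded is a hypothetical helper that counts words case-insensitively
// by lowercasing each field before using it as a map key.
func countFolded(fields []string) map[string]int {
    counts := make(map[string]int)
    for _, field := range fields {
        counts[strings.ToLower(field)]++
    }
    return counts
}

func main() {
    // Made-up sample fields, as FieldsFunc might return them.
    fields := []string{"And", "the", "and", "The", "and"}
    counts := countFolded(fields)
    fmt.Println(counts["and"], counts["the"])
}
```

The sketch prints 3 2. Note that folding also turns the pronoun I into i, which may or may not be wanted.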

Go word frequency example II

In the second example, we use a regular expression to divide text into words.

word_freq2.go
package main

import (
    "fmt"
    "log"
    "os"
    "regexp"
    "sort"
)

type WordFreq struct {
    word string
    freq int
}

func (p WordFreq) String() string {
    return fmt.Sprintf("%s %d", p.word, p.freq)
}

func main() {

    fileName := "the-king-james-bible.txt"

    reg := regexp.MustCompile("[a-zA-Z']+")

    // os.ReadFile replaces ioutil.ReadFile, which is deprecated since Go 1.16
    bs, err := os.ReadFile(fileName)

    if err != nil {
        log.Fatal(err)
    }

    text := string(bs)
    matches := reg.FindAllString(text, -1)

    words := make(map[string]int)

    for _, match := range matches {
        words[match]++
    }

    var wordFreqs []WordFreq
    for k, v := range words {
        wordFreqs = append(wordFreqs, WordFreq{k, v})
    }

    sort.Slice(wordFreqs, func(i, j int) bool {

        return wordFreqs[i].freq > wordFreqs[j].freq
    })

    for i := 0; i < 10; i++ {

        fmt.Println(wordFreqs[i])
    }
}

We store the words and their frequencies in the WordFreq structure.

reg := regexp.MustCompile("[a-zA-Z']+")

In our regular expression, a word is a run of one or more ASCII letters or apostrophes.

matches := reg.FindAllString(text, -1)

The FindAllString function returns a slice of all successive matches of the expression.
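To illustrate the second argument, here is a small sketch that calls FindAllString on one short line, first with -1 (no limit) and then with 2. The sample string and the package-level variable wordRe are assumptions for illustration.

```go
package main

import (
    "fmt"
    "regexp"
)

// wordRe uses the same pattern as the example; the variable name is made up.
var wordRe = regexp.MustCompile("[a-zA-Z']+")

func main() {
    line := "3:16 For God so loved the world,"

    // n = -1 returns every match.
    all := wordRe.FindAllString(line, -1)
    fmt.Println(len(all), all)

    // A non-negative n caps the number of matches returned.
    fmt.Println(wordRe.FindAllString(line, 2))
}
```

The first call yields six words, starting with For and ending with world; the second returns only the first two matches.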

words := make(map[string]int)

for _, match := range matches {
    words[match]++
}

We go over the matches and calculate their frequencies in the file. The words and the number of their occurrences are stored in the words map.

var wordFreqs []WordFreq
for k, v := range words {
    wordFreqs = append(wordFreqs, WordFreq{k, v})
}

We build a slice of WordFreq structures out of the words map.

sort.Slice(wordFreqs, func(i, j int) bool {

    return wordFreqs[i].freq > wordFreqs[j].freq
})

We sort the wordFreqs slice by the freq field.

for i := 0; i < 10; i++ {

    fmt.Println(wordFreqs[i])
}

We print the ten most common words.
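The fixed loop bound works for the Bible, but indexing wordFreqs[i] would panic on a text with fewer than ten distinct words. A minimal sketch of a guard follows; the helper name topN and the sample data are assumptions for illustration.

```go
package main

import "fmt"

type WordFreq struct {
    word string
    freq int
}

// topN is a hypothetical helper that returns at most n leading entries
// of an already sorted slice, without going out of range.
func topN(wordFreqs []WordFreq, n int) []WordFreq {
    if n < 0 {
        n = 0
    }
    if n > len(wordFreqs) {
        n = len(wordFreqs)
    }
    return wordFreqs[:n]
}

func main() {
    // Made-up, already sorted data with only two entries.
    sorted := []WordFreq{{"the", 3}, {"and", 2}}

    for _, wf := range topN(sorted, 10) {
        fmt.Println(wf.word, wf.freq)
    }
}
```

Here the loop prints the two available entries and stops, instead of failing at index 2.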

Go word frequency example III

In the next example, we also use a regular expression.

word_freq3.go
package main

import (
    "fmt"
    "log"
    "os"
    "regexp"
    "sort"
)

type WordFreq struct {
    word string
    freq int
}

func (p WordFreq) String() string {
    return fmt.Sprintf("%s %d", p.word, p.freq)
}

type byFreq []WordFreq

func (a byFreq) Len() int           { return len(a) }
func (a byFreq) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
func (a byFreq) Less(i, j int) bool { return a[i].freq < a[j].freq }

func main() {

    fileName := "the-king-james-bible.txt"

    // os.ReadFile replaces ioutil.ReadFile, which is deprecated since Go 1.16
    bs, err := os.ReadFile(fileName)

    if err != nil {
        log.Fatal(err)
    }

    text := string(bs)

    re := regexp.MustCompile("[a-zA-Z']+")
    matches := re.FindAllString(text, -1)

    words := make(map[string]int)

    for _, match := range matches {
        words[match]++
    }

    var wordFreqs []WordFreq
    for k, v := range words {
        wordFreqs = append(wordFreqs, WordFreq{k, v})
    }

    sort.Sort(sort.Reverse(byFreq(wordFreqs)))

    for i := 0; i < 10; i++ {
        fmt.Printf("%v\n", wordFreqs[i])
    }
}

This example sorts with a custom type that implements sort.Interface.

type byFreq []WordFreq

func (a byFreq) Len() int           { return len(a) }
func (a byFreq) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
func (a byFreq) Less(i, j int) bool { return a[i].freq < a[j].freq }

We implement the sort.Interface for []WordFreq based on the freq field.

sort.Sort(sort.Reverse(byFreq(wordFreqs)))

To sort the WordFreq structures in descending order, we use the sort.Reverse function.

Source

Go regexp package - reference

In this article we have counted word frequencies in the King James Bible.

Author

My name is Jan Bodnar and I am a passionate programmer with many years of programming experience. I have been writing programming articles since 2007. So far, I have written over 1400 articles and 8 e-books. I have over eight years of experience in teaching programming.
