Go word frequency
last modified April 11, 2024
In this article we show how to calculate word frequency in Golang.
$ go version
go version go1.22.2 linux/amd64
We use Go version 1.22.2.
Our examples work only with words written in the basic Latin alphabet (ASCII letters) and are specifically targeted at analyzing the Bible.
$ wget https://raw.githubusercontent.com/janbodnar/data/main/the-king-james-bible.txt
We use the King James Bible.
To cut the text into words, we use Go's strings.FieldsFunc and regular expressions.
Go word frequency example I
The FieldsFunc function splits the string at each run of Unicode code points satisfying the provided function and returns a slice of the substrings between those runs.
package main

import (
    "fmt"
    "log"
    "os"
    "sort"
    "strings"
)

func main() {

    fileName := "the-king-james-bible.txt"

    // os.ReadFile replaces the deprecated ioutil.ReadFile
    bs, err := os.ReadFile(fileName)

    if err != nil {
        log.Fatal(err)
    }

    text := string(bs)

    // split at every character that is not an ASCII letter or an apostrophe
    fields := strings.FieldsFunc(text, func(r rune) bool {
        return !('a' <= r && r <= 'z' || 'A' <= r && r <= 'Z' || r == '\'')
    })

    wordsCount := make(map[string]int)

    for _, field := range fields {
        wordsCount[field]++
    }

    keys := make([]string, 0, len(wordsCount))

    for key := range wordsCount {
        keys = append(keys, key)
    }

    // order the words by descending frequency
    sort.Slice(keys, func(i, j int) bool {
        return wordsCount[keys[i]] > wordsCount[keys[j]]
    })

    for idx, key := range keys {

        fmt.Printf("%s %d\n", key, wordsCount[key])

        if idx == 10 {
            break
        }
    }
}
We count the frequency of the words from the King James Bible.
fields := strings.FieldsFunc(text, func(r rune) bool {
    return !('a' <= r && r <= 'z' || 'A' <= r && r <= 'Z' || r == '\'')
})
The FieldsFunc function splits the text at every character that is neither an ASCII letter nor an apostrophe. This also discards all the verse numbers.
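To see the effect of the predicate in isolation, here is a minimal sketch that applies it to a single line. The sample line and its verse-number prefix are an assumed format for illustration, and splitWords and isSeparator are hypothetical helper names that do not appear in the program above.

```go
package main

import (
    "fmt"
    "strings"
)

// isSeparator reports whether r is neither an ASCII letter nor an
// apostrophe. It mirrors the predicate used in the example.
func isSeparator(r rune) bool {
    return !('a' <= r && r <= 'z' || 'A' <= r && r <= 'Z' || r == '\'')
}

// splitWords is a hypothetical helper name chosen for this sketch.
func splitWords(s string) []string {
    return strings.FieldsFunc(s, isSeparator)
}

func main() {

    // assumed sample line; the digits and punctuation are dropped
    line := "1:3 And God said, Let there be light: and there was light."

    for _, w := range splitWords(line) {
        fmt.Println(w)
    }
}
```

The digits, the colon, the comma, and the spaces all count as separators, so only the eleven words remain.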
wordsCount := make(map[string]int)

for _, field := range fields {
    wordsCount[field]++
}
Each word and its frequency are stored in the wordsCount map.
keys := make([]string, 0, len(wordsCount))

for key := range wordsCount {
    keys = append(keys, key)
}

sort.Slice(keys, func(i, j int) bool {
    return wordsCount[keys[i]] > wordsCount[keys[j]]
})
In order to sort the words by frequency, we create a new keys slice. We put all the words there and sort them by their frequency values in descending order.
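Go randomizes map iteration order and sort.Slice is not guaranteed to be stable, so words with equal counts may come out in a different order on each run. The following sketch shows one possible way to make the order reproducible by breaking ties alphabetically. The sortedKeys helper and the sample counts are assumptions made for illustration.

```go
package main

import (
    "fmt"
    "sort"
)

// sortedKeys is a hypothetical helper. It returns the words ordered by
// descending count; words with equal counts are ordered alphabetically.
func sortedKeys(counts map[string]int) []string {

    keys := make([]string, 0, len(counts))

    for k := range counts {
        keys = append(keys, k)
    }

    sort.Slice(keys, func(i, j int) bool {
        ci, cj := counts[keys[i]], counts[keys[j]]
        if ci != cj {
            return ci > cj
        }
        return keys[i] < keys[j]
    })

    return keys
}

func main() {

    // made-up counts for the sketch
    counts := map[string]int{"there": 2, "God": 1, "light": 2}

    for _, k := range sortedKeys(counts) {
        fmt.Println(k, counts[k])
    }
}
```

With the tie-breaker in place, light is always listed before there, and God comes last.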
for idx, key := range keys {

    fmt.Printf("%s %d\n", key, wordsCount[key])

    if idx == 10 {
        break
    }
}
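We print the most frequent words from the Bible. Because the loop breaks only after the entry at index 10 has been printed, eleven words are printed rather than ten.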
$ go run word_freq.go
the 62103
and 38848
of 34478
to 13400
And 12846
that 12576
in 12331
shall 9760
he 9665
unto 8942
I 8854
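The output lists and and And as separate words because the map keys are case-sensitive. If the two forms should be counted together, one option is to lowercase each field before counting. This is a sketch under that assumption, not part of the original example; countFolded is a hypothetical helper name, and note that lowercasing also turns the pronoun I into i.

```go
package main

import (
    "fmt"
    "strings"
)

// countFolded is a hypothetical helper. It splits the text with the same
// predicate as the example and counts the lowercased words.
func countFolded(text string) map[string]int {

    fields := strings.FieldsFunc(text, func(r rune) bool {
        return !('a' <= r && r <= 'z' || 'A' <= r && r <= 'Z' || r == '\'')
    })

    counts := make(map[string]int)

    for _, field := range fields {
        counts[strings.ToLower(field)]++
    }

    return counts
}

func main() {

    counts := countFolded("And God said, Let there be light: and there was light.")

    fmt.Println(counts["and"], counts["light"])
}
```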
Go word frequency example II
In the second example, we use a regular expression to divide text into words.
package main

import (
    "fmt"
    "log"
    "os"
    "regexp"
    "sort"
)

type WordFreq struct {
    word string
    freq int
}

func (p WordFreq) String() string {
    return fmt.Sprintf("%s %d", p.word, p.freq)
}

func main() {

    fileName := "the-king-james-bible.txt"

    reg := regexp.MustCompile("[a-zA-Z']+")

    // os.ReadFile replaces the deprecated ioutil.ReadFile
    bs, err := os.ReadFile(fileName)

    if err != nil {
        log.Fatal(err)
    }

    text := string(bs)
    matches := reg.FindAllString(text, -1)

    words := make(map[string]int)

    for _, match := range matches {
        words[match]++
    }

    var wordFreqs []WordFreq

    for k, v := range words {
        wordFreqs = append(wordFreqs, WordFreq{k, v})
    }

    // order the entries by descending frequency
    sort.Slice(wordFreqs, func(i, j int) bool {
        return wordFreqs[i].freq > wordFreqs[j].freq
    })

    for i := 0; i < 10; i++ {
        fmt.Println(wordFreqs[i])
    }
}
We store the words and their frequencies in the WordFreq structure.
reg := regexp.MustCompile("[a-zA-Z']+")
In our regular expression, a word is a run of one or more ASCII letters or apostrophes.
matches := reg.FindAllString(text, -1)
The FindAllString function returns a slice of all successive matches of the expression. The second argument limits the number of matches; -1 means no limit.
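The behavior of the second argument is easier to see on a short string. The following is a small sketch; findWords is a hypothetical wrapper and the sample sentence is chosen only for illustration.

```go
package main

import (
    "fmt"
    "regexp"
)

// the same pattern as in the example
var wordRe = regexp.MustCompile("[a-zA-Z']+")

// findWords is a hypothetical wrapper around FindAllString. A negative n
// returns every match; a non-negative n returns at most n matches.
func findWords(s string, n int) []string {
    return wordRe.FindAllString(s, n)
}

func main() {

    s := "In the beginning God created the heaven and the earth."

    fmt.Println(findWords(s, -1)) // all ten words
    fmt.Println(findWords(s, 3))  // only the first three words
}
```

When nothing matches, FindAllString returns nil, so ranging over the result is safe without an extra check.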
words := make(map[string]int)

for _, match := range matches {
    words[match]++
}
We go over the matches and calculate their frequencies in the file. The words and the number of their occurrences are stored in the words map.
var wordFreqs []WordFreq

for k, v := range words {
    wordFreqs = append(wordFreqs, WordFreq{k, v})
}
We build a slice of WordFreq structures out of the words map.
sort.Slice(wordFreqs, func(i, j int) bool {
    return wordFreqs[i].freq > wordFreqs[j].freq
})
We sort the wordFreqs slice by the freq field in descending order.
for i := 0; i < 10; i++ {
    fmt.Println(wordFreqs[i])
}
We print the ten most common words.
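The fixed loop bound assumes that the text contains at least ten distinct words; with a smaller input, indexing wordFreqs[i] would panic. A possible guard, sketched here with a hypothetical topN helper and made-up counts, is to clamp the limit to the length of the slice.

```go
package main

import (
    "fmt"
    "sort"
)

type WordFreq struct {
    word string
    freq int
}

// topN is a hypothetical helper. It returns at most n entries ordered by
// descending frequency and never indexes past the end of the slice.
func topN(words map[string]int, n int) []WordFreq {

    wordFreqs := make([]WordFreq, 0, len(words))

    for k, v := range words {
        wordFreqs = append(wordFreqs, WordFreq{k, v})
    }

    sort.Slice(wordFreqs, func(i, j int) bool {
        return wordFreqs[i].freq > wordFreqs[j].freq
    })

    // clamp n to the valid range [0, len(wordFreqs)]
    if n < 0 {
        n = 0
    }
    if n > len(wordFreqs) {
        n = len(wordFreqs)
    }

    return wordFreqs[:n]
}

func main() {

    // made-up counts; fewer than ten distinct words
    words := map[string]int{"the": 3, "and": 2, "of": 1}

    for _, wf := range topN(words, 10) {
        fmt.Println(wf.word, wf.freq)
    }
}
```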
Go word frequency example III
In the next example, we also use a regular expression.
package main

import (
    "fmt"
    "log"
    "os"
    "regexp"
    "sort"
)

type WordFreq struct {
    word string
    freq int
}

func (p WordFreq) String() string {
    return fmt.Sprintf("%s %d", p.word, p.freq)
}

// byFreq implements sort.Interface for []WordFreq
type byFreq []WordFreq

func (a byFreq) Len() int           { return len(a) }
func (a byFreq) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
func (a byFreq) Less(i, j int) bool { return a[i].freq < a[j].freq }

func main() {

    fileName := "the-king-james-bible.txt"

    // os.ReadFile replaces the deprecated ioutil.ReadFile
    bs, err := os.ReadFile(fileName)

    if err != nil {
        log.Fatal(err)
    }

    text := string(bs)

    re := regexp.MustCompile("[a-zA-Z']+")
    matches := re.FindAllString(text, -1)

    words := make(map[string]int)

    for _, match := range matches {
        words[match]++
    }

    var wordFreqs []WordFreq

    for k, v := range words {
        wordFreqs = append(wordFreqs, WordFreq{k, v})
    }

    // Less orders ascending; sort.Reverse flips it to descending
    sort.Sort(sort.Reverse(byFreq(wordFreqs)))

    for i := 0; i < 10; i++ {
        fmt.Printf("%v\n", wordFreqs[i])
    }
}
This example also implements a custom sorting interface.
type byFreq []WordFreq

func (a byFreq) Len() int           { return len(a) }
func (a byFreq) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
func (a byFreq) Less(i, j int) bool { return a[i].freq < a[j].freq }
We implement the sort.Interface for []WordFreq based on the freq field.
sort.Sort(sort.Reverse(byFreq(wordFreqs)))
The Less method orders the WordFreq structures by ascending frequency. To sort them in descending order, we wrap the slice with the sort.Reverse function.
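For comparison, the same descending order can be obtained without a custom sort.Interface type. The sketch below swaps in a different technique: slices.SortFunc together with cmp.Compare, both of which are in the standard library since Go 1.21. The sortByFreqDesc helper and the sample data are assumptions made for illustration.

```go
package main

import (
    "cmp"
    "fmt"
    "slices"
)

type WordFreq struct {
    word string
    freq int
}

// sortByFreqDesc is a hypothetical helper. Comparing b to a (rather than
// a to b) yields descending order.
func sortByFreqDesc(wfs []WordFreq) {
    slices.SortFunc(wfs, func(a, b WordFreq) int {
        return cmp.Compare(b.freq, a.freq)
    })
}

func main() {

    // made-up entries for the sketch
    wfs := []WordFreq{{"of", 1}, {"the", 3}, {"and", 2}}

    sortByFreqDesc(wfs)

    for _, wf := range wfs {
        fmt.Println(wf.word, wf.freq)
    }
}
```

The sort.Interface approach still works on older Go versions, while the slices variant needs less code when only one ordering is required.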
In this article we have counted word frequencies in the King James Bible.