Go goquery
last modified April 11, 2024
In this article we show how to do web scraping/HTML parsing in Golang with goquery. The goquery API is similar to jQuery.
goquery is based on the net/html package and the CSS selector library cascadia.
$ go get github.com/PuerkitoBio/goquery
We get the goquery package for our project.
$ go version
go version go1.22.2 linux/amd64
We use Go version 1.22.2.
Go goquery get title
In the following example, we get the title of a webpage.
package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    webPage := "http://webcode.me"
    resp, err := http.Get(webPage)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != 200 {
        log.Fatalf("failed to fetch data: %d %s", resp.StatusCode, resp.Status)
    }

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    title := doc.Find("title").Text()
    fmt.Println(title)
}
We generate a GET request to the specified webpage and retrieve its contents. From the body of the response, we generate a goquery document. From this document, we retrieve the title.
title := doc.Find("title").Text()
The Find method returns a set of matched elements. In our case, it is one title tag. With Text, we get the text content of the tag.
$ go run get_title.go
My html page
Go goquery read local file
The following example reads a local HTML file.
<!DOCTYPE html>
<html lang="en">
<body>
<main>
    <h1>My website</h1>
    <p>
        I am a Go programmer.
    </p>
    <p>
        My hobbies are:
    </p>
    <ul>
        <li>Swimming</li>
        <li>Tai Chi</li>
        <li>Running</li>
        <li>Web development</li>
        <li>Reading</li>
        <li>Music</li>
    </ul>
</main>
</body>
</html>
This is a simple HTML file.
package main

import (
    "fmt"
    "log"
    "os"
    "regexp"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Read the whole file into memory.
    data, err := os.ReadFile("index.html")
    if err != nil {
        log.Fatal(err)
    }

    doc, err := goquery.NewDocumentFromReader(strings.NewReader(string(data)))
    if err != nil {
        log.Fatal(err)
    }

    text := doc.Find("h1,p").Text()

    re := regexp.MustCompile("\\s{2,}")
    fmt.Println(re.ReplaceAllString(text, "\n"))
}
We get the text contents of two tags.
data, err := os.ReadFile("index.html")
We read the file.
doc, err := goquery.NewDocumentFromReader(strings.NewReader(string(data)))
We generate a new goquery document with NewDocumentFromReader.
text := doc.Find("h1,p").Text()
We get the text contents of two tags: h1 and p.
re := regexp.MustCompile("\\s{2,}")
fmt.Println(re.ReplaceAllString(text, "\n"))
Using a regular expression, we replace each run of two or more whitespace characters with a single newline.
$ go run read_local.go
My website
I am a Go programmer.
My hobbies are:
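The whitespace clean-up can be tried in isolation. The sketch below applies the same pattern to a hand-written string that mimics the indented text returned by Text; the helper names collapseRuns and normalize and the sample string are assumptions for illustration. The second helper shows an alternative built on strings.Fields that puts everything on one line.

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

// multiSpace matches runs of two or more whitespace characters.
var multiSpace = regexp.MustCompile(`\s{2,}`)

// collapseRuns replaces each such run with a single newline, as in the
// example above. Single spaces between words are left untouched.
func collapseRuns(s string) string {
    return multiSpace.ReplaceAllString(s, "\n")
}

// normalize is an alternative: it splits on any whitespace and joins
// the pieces with single spaces.
func normalize(s string) string {
    return strings.Join(strings.Fields(s), " ")
}

func main() {
    // raw imitates text extracted from indented HTML.
    raw := "My website\n    I am a Go programmer.\n  \n    My hobbies are:\n  "

    fmt.Printf("%q\n", collapseRuns(raw))
    // "My website\nI am a Go programmer.\nMy hobbies are:\n"

    fmt.Printf("%q\n", normalize(raw))
    // "My website I am a Go programmer. My hobbies are:"
}
```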
Go goquery read from HTML string
In the next example, we process a built-in HTML string.
package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    data := `
<html lang="en">
<body>
<p>List of words</p>
<ul>
    <li>dark</li>
    <li>smart</li>
    <li>war</li>
    <li>cloud</li>
    <li>park</li>
    <li>cup</li>
    <li>worm</li>
    <li>water</li>
    <li>rock</li>
    <li>warm</li>
</ul>
<footer>footer for words</footer>
</body>
</html>
`

    doc, err := goquery.NewDocumentFromReader(strings.NewReader(data))
    if err != nil {
        log.Fatal(err)
    }

    words := doc.Find("li").Map(func(i int, sel *goquery.Selection) string {
        return fmt.Sprintf("%d: %s", i+1, sel.Text())
    })

    fmt.Println(words)
}
We get the words from the HTML list.
words := doc.Find("li").Map(func(i int, sel *goquery.Selection) string {
    return fmt.Sprintf("%d: %s", i+1, sel.Text())
})
With Find, we get all the li elements. The Map method calls the function for each element and collects the returned strings into a slice; each string contains the word and its position in the list.
$ go run get_words.go
[1: dark 2: smart 3: war 4: cloud 5: park 6: cup 7: worm 8: water 9: rock 10: warm]
Go goquery filter words
The following example filters words.
package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    data := `
<html lang="en">
<body>
<p>List of words</p>
<ul>
    <li>dark</li>
    <li>smart</li>
    <li>war</li>
    <li>cloud</li>
    <li>park</li>
    <li>cup</li>
    <li>worm</li>
    <li>water</li>
    <li>rock</li>
    <li>warm</li>
</ul>
<footer>footer for words</footer>
</body>
</html>
`

    doc, err := goquery.NewDocumentFromReader(strings.NewReader(data))
    if err != nil {
        log.Fatal(err)
    }

    f := func(i int, sel *goquery.Selection) bool {
        return strings.HasPrefix(sel.Text(), "w")
    }

    var words []string

    doc.Find("li").FilterFunction(f).Each(func(_ int, sel *goquery.Selection) {
        words = append(words, sel.Text())
    })

    fmt.Println(words)
}
We retrieve all words starting with 'w'.
f := func(i int, sel *goquery.Selection) bool {
    return strings.HasPrefix(sel.Text(), "w")
}
This is a predicate function that returns a boolean true for all words that begin with 'w'.
doc.Find("li").FilterFunction(f).Each(func(_ int, sel *goquery.Selection) {
    words = append(words, sel.Text())
})
We locate the set of matching tags with Find. We filter the set with FilterFunction and go over the filtered results with Each. We add each filtered word to the words slice.
fmt.Println(words)
Finally, we print the slice.
$ go run filter_words.go
[war worm water warm]
Go goquery union words
With Union, we can combine selections.
package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    data := `
<html lang="en">
<body>
<p>List of words</p>
<ul>
    <li>dark</li>
    <li>smart</li>
    <li>war</li>
    <li>cloud</li>
    <li>park</li>
    <li>cup</li>
    <li>worm</li>
    <li>water</li>
    <li>rock</li>
    <li>warm</li>
</ul>
<footer>footer for words</footer>
</body>
</html>
`

    doc, err := goquery.NewDocumentFromReader(strings.NewReader(data))
    if err != nil {
        log.Fatal(err)
    }

    var words []string

    sel1 := doc.Find("li:first-child, li:last-child")
    sel2 := doc.Find("li:nth-child(3), li:nth-child(7)")

    sel1.Union(sel2).Each(func(_ int, sel *goquery.Selection) {
        words = append(words, sel.Text())
    })

    fmt.Println(words)
}
The example combines two selections.
sel1 := doc.Find("li:first-child, li:last-child")
The first selection contains the first and the last element.
sel2 := doc.Find("li:nth-child(3), li:nth-child(7)")
The second selection contains the third and the seventh element.
sel1.Union(sel2).Each(func(_ int, sel *goquery.Selection) {
    words = append(words, sel.Text())
})
We combine the two selections with Union.
$ go run union_words.go
[dark warm war worm]
Go goquery get links
The following example retrieves links from a webpage.
package main

import (
    "fmt"
    "log"
    "net/http"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func getLinks() {
    webPage := "https://golang.org"
    resp, err := http.Get(webPage)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != 200 {
        log.Fatalf("status code error: %d %s", resp.StatusCode, resp.Status)
    }

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    f := func(i int, s *goquery.Selection) bool {
        link, _ := s.Attr("href")
        return strings.HasPrefix(link, "https")
    }

    doc.Find("body a").FilterFunction(f).Each(func(_ int, tag *goquery.Selection) {
        link, _ := tag.Attr("href")
        linkText := tag.Text()
        fmt.Printf("%s %s\n", linkText, link)
    })
}

func main() {
    getLinks()
}
The example retrieves external links to secured web pages.
f := func(i int, s *goquery.Selection) bool {
    link, _ := s.Attr("href")
    return strings.HasPrefix(link, "https")
}
In the predicate function we ensure that the link has the https prefix.
doc.Find("body a").FilterFunction(f).Each(func(_ int, tag *goquery.Selection) {
    link, _ := tag.Attr("href")
    linkText := tag.Text()
    fmt.Printf("%s %s\n", linkText, link)
})
We find all the anchor tags, filter them, and then print the filtered links to the console.
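The predicate keeps only href values that literally begin with https, so relative links such as /doc/ are skipped. If those should be included as well, one possible approach is to resolve every href against the page URL with the standard net/url package and then check the scheme. This is a sketch under that assumption; the helper name resolve and the sample hrefs are made up for illustration.

```go
package main

import (
    "fmt"
    "net/url"
)

// resolve (hypothetical helper) turns href into an absolute URL relative
// to base and reports whether the result uses the https scheme.
func resolve(base, href string) (string, bool) {
    b, err := url.Parse(base)
    if err != nil {
        return "", false
    }
    ref, err := url.Parse(href)
    if err != nil {
        return "", false
    }
    abs := b.ResolveReference(ref)
    return abs.String(), abs.Scheme == "https"
}

func main() {
    // Sample href values, chosen for illustration only.
    hrefs := []string{"/doc/", "https://go.dev/blog", "http://example.com/"}

    for _, h := range hrefs {
        abs, secure := resolve("https://golang.org", h)
        fmt.Println(abs, secure)
    }
    // https://golang.org/doc/ true
    // https://go.dev/blog true
    // http://example.com/ false
}
```

Inside the goquery callback, the value returned by Attr("href") would be passed as href.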
Go goquery StackOverflow questions
We are going to get the latest StackOverflow questions for the Raku tag.
package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    webPage := "https://stackoverflow.com/questions/tagged/raku"
    resp, err := http.Get(webPage)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != 200 {
        log.Fatalf("failed to fetch data: %d %s", resp.StatusCode, resp.Status)
    }

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    doc.Find(".question-summary .summary").Each(func(i int, s *goquery.Selection) {
        title := s.Find("h3").Text()
        fmt.Println(i+1, title)
    })
}
In the code example, we print the last fifty titles of the StackOverflow questions on the Raku programming language.
doc.Find(".question-summary .summary").Each(func(i int, s *goquery.Selection) {
    title := s.Find("h3").Text()
    fmt.Println(i+1, title)
})
We locate the questions and print their titles; the title is in the h3 tag.
$ go run get_qs.go
1 Raku pop() order of execution
2 Does the `do` keyword run a block or treat it as an expression?
3 Junction ~~ Junction behavior
4 Is there a way to detect whether something is immutable?
5 Optimize without sacrificing usual workflow: arguments, POD etc
6 Find out external command runability
...
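The examples use http.Get, which relies on the default client and sets no timeout. When scraping real sites, it may be useful to build the request explicitly so that a timeout and a User-Agent header can be set. The following is a sketch under that assumption, not a requirement of goquery; the helper names fetch and demo and the User-Agent string are made up. A local test server from net/http/httptest stands in for a real site so that the program runs offline.

```go
package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "net/http/httptest"
    "time"
)

// fetch (hypothetical helper) performs a GET request with a custom
// User-Agent and returns the response body as a string.
func fetch(client *http.Client, url string) (string, error) {
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return "", err
    }
    // The User-Agent value is an arbitrary example.
    req.Header.Set("User-Agent", "goquery-tutorial/0.1")

    resp, err := client.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("unexpected status: %s", resp.Status)
    }
    body, err := io.ReadAll(resp.Body)
    return string(body), err
}

// demo starts a local test server that echoes the User-Agent it
// received and fetches its page with a ten-second timeout.
func demo() (string, error) {
    handler := func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "<title>%s</title>", r.UserAgent())
    }
    srv := httptest.NewServer(http.HandlerFunc(handler))
    defer srv.Close()

    client := &http.Client{Timeout: 10 * time.Second}
    return fetch(client, srv.URL)
}

func main() {
    body, err := demo()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(body) // <title>goquery-tutorial/0.1</title>
}
```

In a scraper, resp.Body could be handed to goquery.NewDocumentFromReader directly instead of being read into a string.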
Go goquery get earthquakes
In the next example, we fetch data about earthquakes.
$ go get github.com/olekukonko/tablewriter
We use the tablewriter package to display data in tabular format.
package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
    "strings"

    "github.com/PuerkitoBio/goquery"
    "github.com/olekukonko/tablewriter"
)

type Earthquake struct {
    Date      string
    Latitude  string
    Longitude string
    Magnitude string
    Depth     string
    Location  string
    IrisId    string
}

var quakes []Earthquake

func fetchQuakes() {
    webPage := "http://ds.iris.edu/seismon/eventlist/index.phtml"
    resp, err := http.Get(webPage)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != 200 {
        log.Fatalf("failed to fetch data: %d %s", resp.StatusCode, resp.Status)
    }

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    doc.Find("tbody tr").Each(func(j int, tr *goquery.Selection) {
        if j >= 10 {
            return
        }

        e := Earthquake{}

        tr.Find("td").Each(func(ix int, td *goquery.Selection) {
            switch ix {
            case 0:
                e.Date = td.Text()
            case 1:
                e.Latitude = td.Text()
            case 2:
                e.Longitude = td.Text()
            case 3:
                e.Magnitude = td.Text()
            case 4:
                e.Depth = td.Text()
            case 5:
                e.Location = strings.TrimSpace(td.Text())
            case 6:
                e.IrisId = td.Text()
            }
        })

        quakes = append(quakes, e)
    })

    table := tablewriter.NewWriter(os.Stdout)
    table.SetHeader([]string{"Date", "Location", "Magnitude", "Longitude",
        "Latitude", "Depth", "IrisId"})
    table.SetCaption(true, "Last ten earthquakes")

    for _, quake := range quakes {
        s := []string{
            quake.Date,
            quake.Location,
            quake.Magnitude,
            quake.Longitude,
            quake.Latitude,
            quake.Depth,
            quake.IrisId,
        }
        table.Append(s)
    }

    table.Render()
}

func main() {
    fetchQuakes()
}
The example retrieves the ten latest earthquakes from the Iris database and prints them in a tabular format.
type Earthquake struct {
    Date      string
    Latitude  string
    Longitude string
    Magnitude string
    Depth     string
    Location  string
    IrisId    string
}
The data is grouped in the Earthquake structure.
var quakes []Earthquake
The structures will be stored in the quakes slice.
doc.Find("tbody tr").Each(func(j int, tr *goquery.Selection) {
Locating data is simple; we go for the tr tags inside the tbody tag.
e := Earthquake{}

tr.Find("td").Each(func(ix int, td *goquery.Selection) {
    switch ix {
    case 0:
        e.Date = td.Text()
    case 1:
        e.Latitude = td.Text()
    case 2:
        e.Longitude = td.Text()
    case 3:
        e.Magnitude = td.Text()
    case 4:
        e.Depth = td.Text()
    case 5:
        e.Location = strings.TrimSpace(td.Text())
    case 6:
        e.IrisId = td.Text()
    }
})

quakes = append(quakes, e)
We create a new Earthquake structure, fill it with table row data and put the structure into the quakes slice.
table := tablewriter.NewWriter(os.Stdout)
table.SetHeader([]string{"Date", "Location", "Magnitude", "Longitude",
    "Latitude", "Depth", "IrisId"})
table.SetCaption(true, "Last ten earthquakes")
We create a new table for displaying our data. The data will be shown in the standard output (console). We create a header and a caption for the table.
for _, quake := range quakes {
    s := []string{
        quake.Date,
        quake.Location,
        quake.Magnitude,
        quake.Longitude,
        quake.Latitude,
        quake.Depth,
        quake.IrisId,
    }
    table.Append(s)
}
The table takes a string slice as a parameter; therefore, we transform the structure into a slice and append the slice to the table with Append.
table.Render()
In the end, we render the table.
$ go run earthquakes.go
+------------------------+--------------------------------+-----------+-----------+----------+-------+------------+
|          DATE          |            LOCATION            | MAGNITUDE | LONGITUDE | LATITUDE | DEPTH |   IRISID   |
+------------------------+--------------------------------+-----------+-----------+----------+-------+------------+
| 17-AUG-2021 07:54:31   | TONGA ISLANDS                  |       4.9 |   -174.01 |   -17.44 |    45 |   11457319 |
| 17-AUG-2021 03:10:50   | SOUTH SANDWICH ISLANDS REGION  |       5.7 |    -24.02 |   -58.04 |    10 |   11457233 |
| 17-AUG-2021 02:22:46   | LEYTE, PHILIPPINES             |       4.4 |    125.44 |    10.37 |   228 |   11457202 |
| 17-AUG-2021 02:19:28   | CHILE-ARGENTINA BORDER REGION  |       4.5 |    -67.28 |   -24.27 |   183 |   11457198 |
| 17-AUG-2021 01:30:26   | WEST CHILE RISE                |       4.9 |    -81.25 |   -44.38 |    10 |   11457192 |
| 17-AUG-2021 00:38:38   | AFGHANISTAN-TAJIKISTAN BORD    |       4.4 |     71.13 |    36.72 |   240 |   11457214 |
|                        | REG.                           |           |           |          |       |            |
| 16-AUG-2021 23:58:56   | NORTHWESTERN BALKAN REGION     |       4.6 |     16.28 |    45.44 |    10 |   11457177 |
| 16-AUG-2021 23:37:25   | SOUTH SANDWICH ISLANDS REGION  |       5.5 |    -26.23 |   -59.56 |    52 |   11457169 |
| 16-AUG-2021 20:50:34   | SOUTH SANDWICH ISLANDS REGION  |       5.5 |    -24.90 |   -60.25 |    10 |   11457139 |
| 16-AUG-2021 19:17:09   | SOUTH SANDWICH ISLANDS REGION  |       5.1 |    -26.77 |   -60.22 |    35 |   11457054 |
+------------------------+--------------------------------+-----------+-----------+----------+-------+------------+
Last ten earthquakes
In this article we have scraped web/parsed HTML in Go with the goquery package.