Go parse HTML with net/html
last modified April 11, 2024
In this article we show how to parse HTML in Go with the net/html package. The package is part of golang.org/x/net, the supplementary Go networking libraries.
$ go version
go version go1.22.2 linux/amd64
We use Go version 1.22.2.
The Go net/html library has two basic sets of APIs to parse HTML: the tokenizer API and the tree-based node parsing API.
In the tokenizer API, a Token consists of a TokenType and some Data (tag name for start and end tags, content for text, comments and doctypes). A tag Token may also contain a slice of attributes. Tokenization is done by creating a Tokenizer for an io.Reader.
Parsing is done by calling Parse with an io.Reader, which returns the root of the parse tree (the document element) as a *Node. A node consists of a NodeType and some Data (tag name for element nodes, content for text) and is part of a tree of Nodes.
$ go get -u golang.org/x/net
We need to install the libraries.
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Colour</title>
</head>
<body>

<p>
    A list of colours
</p>

<ul>
    <li>red</li>
    <li>green</li>
    <li>blue</li>
    <li>yellow</li>
    <li>orange</li>
    <li>brown</li>
    <li>pink</li>
</ul>

<footer>
    A footer
</footer>

</body>
</html>
Some of the examples use this HTML file.
Go parse HTML list
In the next example, we parse an HTML list using the tokenizer API.
package main

import (
	"fmt"
	"golang.org/x/net/html"
	"log"
	"os"
	"strings"
)

// readHtmlFromFile returns the contents of the file as a string.
func readHtmlFromFile(fileName string) (string, error) {

	// os.ReadFile replaces ioutil.ReadFile, deprecated since Go 1.16
	bs, err := os.ReadFile(fileName)

	if err != nil {
		return "", err
	}

	return string(bs), nil
}

// parse returns the text found directly inside li tags.
func parse(text string) (data []string) {

	tkn := html.NewTokenizer(strings.NewReader(text))

	var vals []string

	var isLi bool

	for {

		tt := tkn.Next()

		switch {

		case tt == html.ErrorToken:
			return vals

		case tt == html.StartTagToken:

			t := tkn.Token()
			isLi = t.Data == "li"

		case tt == html.TextToken:

			t := tkn.Token()

			if isLi {
				vals = append(vals, t.Data)
			}

			isLi = false
		}
	}
}

func main() {

	fileName := "index.html"
	text, err := readHtmlFromFile(fileName)

	if err != nil {
		log.Fatal(err)
	}

	data := parse(text)
	fmt.Println(data)
}
The example prints the names of the colours from the list.
tkn := html.NewTokenizer(strings.NewReader(text))
A tokenizer is created with html.NewTokenizer.
for {
	tt := tkn.Next()
	...
We go through the tokens in a for loop. The Next method scans the next token and returns its type.
case tt == html.ErrorToken:
	return vals
The tokenizer returns html.ErrorToken when it reaches the end of the input or hits a read error. In either case we terminate the for loop and return the data.
case tt == html.StartTagToken:
	t := tkn.Token()
	isLi = t.Data == "li"
If the token is a start tag, we get the current token with the Token method. We set the isLi variable to true if we encounter the li tag.
case tt == html.TextToken:
	t := tkn.Token()
	if isLi {
		vals = append(vals, t.Data)
	}
	isLi = false
When a token is text data, we add its content to the vals slice provided the isLi variable is set; i.e. we are parsing text inside the li tag.
$ go run parse_list.go
[red green blue yellow orange brown pink]
Go parse HTML table
In the next example, we parse an HTML table using the tokenizer API.
package main

import (
	"fmt"
	"golang.org/x/net/html"
	"io"
	"log"
	"net/http"
	"strings"
)

// getHtmlPage downloads the page and returns its body as a string.
func getHtmlPage(webPage string) (string, error) {

	resp, err := http.Get(webPage)

	if err != nil {
		return "", err
	}

	defer resp.Body.Close()

	// io.ReadAll replaces ioutil.ReadAll, deprecated since Go 1.16
	body, err := io.ReadAll(resp.Body)

	if err != nil {
		return "", err
	}

	return string(body), nil
}

// parseAndShow prints the text of the td cells, three per line.
func parseAndShow(text string) {

	tkn := html.NewTokenizer(strings.NewReader(text))

	var isTd bool
	var n int

	for {

		tt := tkn.Next()

		switch {

		case tt == html.ErrorToken:
			return

		case tt == html.StartTagToken:

			t := tkn.Token()
			isTd = t.Data == "td"

		case tt == html.TextToken:

			t := tkn.Token()

			if isTd {
				fmt.Printf("%s ", t.Data)
				n++
			}

			// the table has three columns
			if isTd && n%3 == 0 {
				fmt.Println()
			}

			isTd = false
		}
	}
}

func main() {

	webPage := "http://webcode.me/countries.html"
	data, err := getHtmlPage(webPage)

	if err != nil {
		log.Fatal(err)
	}

	parseAndShow(data)
}
We retrieve a webpage and parse its HTML table. We get the data from the td tags.
$ go run parse_table.go
Id Name Population
1 China 1382050000
2 India 1313210000
3 USA 324666000
4 Indonesia 260581000
5 Brazil 207221000
6 Pakistan 196626000
...
Go parse HTML list II
In the next example, we parse an HTML list using the parsing API.
package main

import (
	"fmt"
	"golang.org/x/net/html"
	"log"
	"os"
	"strings"
)

func main() {

	fileName := "index.html"

	// os.ReadFile replaces ioutil.ReadFile, deprecated since Go 1.16
	bs, err := os.ReadFile(fileName)

	if err != nil {
		log.Fatal(err)
	}

	text := string(bs)

	doc, err := html.Parse(strings.NewReader(text))

	if err != nil {
		log.Fatal(err)
	}

	var data []string

	doTraverse(doc, &data, "li")
	fmt.Println(data)
}

// doTraverse appends to data the text nodes whose parent is the given tag.
func doTraverse(doc *html.Node, data *[]string, tag string) {

	var traverse func(n *html.Node, tag string) *html.Node

	traverse = func(n *html.Node, tag string) *html.Node {

		for c := n.FirstChild; c != nil; c = c.NextSibling {

			if c.Type == html.TextNode && c.Parent.Data == tag {

				*data = append(*data, c.Data)
			}

			res := traverse(c, tag)

			if res != nil {
				return res
			}
		}

		return nil
	}

	traverse(doc, tag)
}
We recursively traverse the document to locate all li tags.
doc, err := html.Parse(strings.NewReader(text))
We get the document as a tree from the string with html.Parse.
traverse = func(n *html.Node, tag string) *html.Node {

	for c := n.FirstChild; c != nil; c = c.NextSibling {

		if c.Type == html.TextNode && c.Parent.Data == tag {

			*data = append(*data, c.Data)
		}

		res := traverse(c, tag)

		if res != nil {
			return res
		}
	}

	return nil
}
We go over the tags of the document via a recursive algorithm. If we deal with a text node of an li tag, we append its contents to the data slice.
$ go run parsing.go
[red green blue yellow orange brown pink]
Go find tag by id
In the following example, we find a tag by its id. There should be only one tag inside an HTML document with a specific id. The id is an attribute of a tag. We can get the attributes of a tag through the Attr field.
package main

import (
	"bytes"
	"fmt"
	"golang.org/x/net/html"
	"io"
	"log"
	"strings"
)

func getAttribute(n *html.Node, key string) (string, bool) {

	for _, attr := range n.Attr {

		if attr.Key == key {
			return attr.Val, true
		}
	}

	return "", false
}

func renderNode(n *html.Node) string {

	var buf bytes.Buffer
	w := io.Writer(&buf)

	err := html.Render(w, n)

	if err != nil {
		return ""
	}

	return buf.String()
}

func checkId(n *html.Node, id string) bool {

	if n.Type == html.ElementNode {

		s, ok := getAttribute(n, "id")

		if ok && s == id {
			return true
		}
	}

	return false
}

func traverse(n *html.Node, id string) *html.Node {

	if checkId(n, id) {
		return n
	}

	for c := n.FirstChild; c != nil; c = c.NextSibling {

		res := traverse(c, id)

		if res != nil {
			return res
		}
	}

	return nil
}

func getElementById(n *html.Node, id string) *html.Node {

	return traverse(n, id)
}

func main() {

	doc, err := html.Parse(strings.NewReader(data))

	if err != nil {
		log.Fatal(err)
	}

	tag := getElementById(doc, "yellow")
	output := renderNode(tag)

	fmt.Println(output)
}

var data = `<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Colour</title>
</head>
<body>

<p>
    A list of colours:
</p>

<ul>
    <li>red</li>
    <li>green</li>
    <li>blue</li>
    <li id="yellow">yellow</li>
    <li>orange</li>
    <li>brown</li>
    <li>pink</li>
</ul>

</body>
</html>`
We locate a specific tag and render its HTML. We load HTML data from a multiline string.
func getAttribute(n *html.Node, key string) (string, bool) {

	for _, attr := range n.Attr {

		if attr.Key == key {
			return attr.Val, true
		}
	}

	return "", false
}
We get the attributes from the Attr field of the node.
func renderNode(n *html.Node) string {

	var buf bytes.Buffer
	w := io.Writer(&buf)

	err := html.Render(w, n)

	if err != nil {
		return ""
	}

	return buf.String()
}
The html.Render function renders the tag.
$ go run find_by_id.go
<li id="yellow">yellow</li>
Go parse titles concurrently
In the next example, we parse HTML titles from various websites concurrently. The example uses the tokenizer API.
package main

import (
	"fmt"
	"golang.org/x/net/html"
	"net/http"
	"sync"
)

var wg sync.WaitGroup

func main() {

	urls := []string{
		"http://webcode.me",
		"https://example.com",
		"http://httpbin.org",
		"https://www.perl.org",
		"https://www.php.net",
		"https://www.python.org",
		"https://code.visualstudio.com",
		"https://clojure.org",
	}

	showTitles(urls)
}

func showTitles(urls []string) {

	c := getTitleTags(urls)

	for msg := range c {
		fmt.Println(msg)
	}
}

func getTitleTags(urls []string) chan string {

	c := make(chan string)

	for _, url := range urls {

		wg.Add(1)
		go getTitle(url, c)
	}

	// close the channel once all goroutines have finished
	go func() {

		wg.Wait()
		close(c)
	}()

	return c
}

func getTitle(url string, c chan string) {

	defer wg.Done()

	resp, err := http.Get(url)

	if err != nil {
		c <- "failed to fetch data"
		return
	}

	defer resp.Body.Close()

	tkn := html.NewTokenizer(resp.Body)

	var isTitle bool

	for {

		tt := tkn.Next()

		switch {

		case tt == html.ErrorToken:
			return

		case tt == html.StartTagToken:

			t := tkn.Token()
			isTitle = t.Data == "title"

		case tt == html.TextToken:

			t := tkn.Token()

			if isTitle {
				c <- t.Data
				isTitle = false
			}
		}
	}
}
We use goroutines to launch our tasks concurrently. The parsed titles are sent to the caller via a channel. The sync.WaitGroup is used to close the channel, and thereby finish the program, once all tasks have finished.
$ go run parse_titles.go
My html page
Welcome to Python.org
The Perl Programming Language - www.perl.org
Clojure
PHP: Hypertext Preprocessor
Visual Studio Code - Code Editing. Redefined
httpbin.org
Example Domain
Source
Go net/http package - reference
In this article we have parsed HTML with Go's net/html library.