ZetCode

Go colly

last modified September 26, 2022

Go Colly tutorial shows how to do web scraping and crawling in Golang.

Colly is a fast web scraping and crawling framework for Golang. It can be used for tasks such as data mining, data processing or archiving.

Colly has automatic cookie and session handling. It supports synchronous, asynchronous and parallel scraping. It supports caching, respects robots.txt file, and enables distributed scraping.

$ go version
go version go1.18.1 linux/amd64

We use Go version 1.18.

Colly Collector

Collector is the main interface to Colly. It manages the network communication and responsible for the execution of the attached callbacks while a collector job is running. The collector is started with the Visit function.

Colly is an event-based framework. We put code in the various event handlers.

Colly has the following event handlers:

Go Colly simple example

We start with a simple example.

title.go
package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {

    c := colly.NewCollector()

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })

    c.Visit("http://webcode.me")
}

The program retrieves a website's title.

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

First, we import the library.

c := colly.NewCollector()

A collector is created.

c.OnHTML("title", func(e *colly.HTMLElement) {
     fmt.Println(e.Text)
})

In the OnHTML handler, we register an anonymous function which is called for each title tag. We print the title's text.

c.Visit("http://webcode.me")

The Visit starts collector's collecting job by creating a request to the specified URL.

$ go run title.go
My html page

Go Colly event handlers

We can react to different event handlers.

callbacks.go
package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {

    c := colly.NewCollector()
    c.UserAgent = "Go program"

    c.OnRequest(func(r *colly.Request) {

        for key, value := range *r.Headers {
            fmt.Printf("%s: %s\n", key, value)
        }

        fmt.Println(r.Method)
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {

        fmt.Println("-----------------------------")

        fmt.Println(e.Text)
    })

    c.OnResponse(func(r *colly.Response) {

        fmt.Println("-----------------------------")

        fmt.Println(r.StatusCode)

        for key, value := range *r.Headers {
            fmt.Printf("%s: %s\n", key, value)
        }
    })

    c.Visit("http://webcode.me")
}

In the code example, we have event handlers for OnRequest, OnHTML, and OnResponse.

c.OnRequest(func(r *colly.Request) {

    fmt.Println("-----------------------------")

    for key, value := range *r.Headers {
        fmt.Printf("%s: %s\n", key, value)
    }

    fmt.Println(r.Method)
})

In the OnRequest handler, we print the request headers and request method.

c.OnHTML("title", func(e *colly.HTMLElement) {

    fmt.Println("-----------------------------")

    fmt.Println(e.Text)
})

We process the title tag in the OnHTML handler.

c.OnResponse(func(r *colly.Response) {

    fmt.Println("-----------------------------")

    fmt.Println(r.StatusCode)

    for key, value := range *r.Headers {
        fmt.Printf("%s: %s\n", key, value)
    }
})

Finally, we print the status code of the response and its header values in the OnResponse handler.

$ go run callbacks.go
User-Agent: [Go program]
GET
-----------------------------
200
Connection: [keep-alive]
Access-Control-Allow-Origin: [*]
Server: [nginx/1.6.2]
Date: [Sun, 23 Jan 2022 14:13:04 GMT]
Content-Type: [text/html]
Last-Modified: [Sun, 23 Jan 2022 10:39:25 GMT]
-----------------------------
My html page

Go Colly scrape local files

We can scrape files on a local disk.

words.html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document title</title>
</head>
<body>
<p>List of words</p>
<ul>
    <li>dark</li>
    <li>smart</li>
    <li>war</li>
    <li>cloud</li>
    <li>park</li>
    <li>cup</li>
    <li>worm</li>
    <li>water</li>
    <li>rock</li>
    <li>warm</li>
</ul>
<footer>footer for words</footer>
</body>
</html>

We have this HTML file.

local.go
package main

import (
    "fmt"
    "net/http"

    "github.com/gocolly/colly/v2"
)

func main() {

    t := &http.Transport{}
    t.RegisterProtocol("file", http.NewFileTransport(http.Dir(".")))

    c := colly.NewCollector()
    c.WithTransport(t)

    words := []string{}

    c.OnHTML("li", func(e *colly.HTMLElement) {
        words = append(words, e.Text)
    })

    c.Visit("file://./words.html")

    for _, p := range words {
        fmt.Printf("%s\n", p)
    }
}

To scrape local files, we must register a file protocol. We scrape all the words from the list.

$ go run local.go
dark
smart
war
cloud
park
cup
worm
water
rock
warm

Top retailers

In the following example, we are going to get the top retailers in the US.

retail.go
package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {

    c := colly.NewCollector()

    i := 0
    scan := true

    c.OnHTML("#stores-list--section-16266 td.data-cell-0,td.data-cell-1,td.data-cell-2,td.data-cell-3",
        func(e *colly.HTMLElement) {

            if scan {

                fmt.Printf("%s ", e.Text)
            }
            i++

            if i%4 == 0 && i < 40 {
                fmt.Println()
            }

            if i == 40 {
                scan = false
                fmt.Println()
            }
        })

    c.Visit("https://nrf.com/resources/top-retailers/top-100-retailers/top-100-retailers-2019")
}

The example prints top 10 US retailer for year 2019.

c.OnHTML("#stores-list--section-16266 td.data-cell-0,td.data-cell-1,td.data-cell-2,td.data-cell-3",
    func(e *colly.HTMLElement) {

We have to look at the HTML source and determine which ids or classes to look for. In our case, we take four columns from the table.

$ go run retail.go
1 Walmart $387.66  Bentonville, AR
2 Amazon.com $120.93  Seattle, WA
3 The Kroger Co. $119.70  Cincinnati, OH
4 Costco $101.43  Issaquah, WA
5 Walgreens Boots Alliance $98.39  Deerfield, IL
6 The Home Depot $97.27  Atlanta, GA
7 CVS Health Corporation $83.79  Woonsocket, RI
8 Target $74.48  Minneapolis, MN
9 Lowe's Companies $64.09  Mooresville, NC
10 Albertsons Companies $59.71  Boise, ID

Go Colly crawl links

In more complex tasks, we need to crawl links that we find in our documents.

links.go
package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {

    c := colly.NewCollector()

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {

        link := e.Attr("href")
        c.Visit(e.Request.AbsoluteURL(link))
    })

    c.Visit("http://webcode.me/small.html")
}

For our example, we have created a small HTML document that contains three links. Those three links do not contain additional links. We print the title of each of the documents.

c.OnHTML("a[href]", func(e *colly.HTMLElement) {

    link := e.Attr("href")
    c.Visit(e.Request.AbsoluteURL(link))
})

In the OnHTML handler, we find all the links, get their href attributes and visit them.

$ go run links.go
Small
My html page
Something.

Stackoverflow questions

In the next example, we scrape Stackoverflow questions.

stack.go
package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

type question struct {
    title   string
    excerpt string
}

func main() {

    c := colly.NewCollector()
    qs := []question{}

    c.OnHTML("div.summary", func(e *colly.HTMLElement) {

        q := question{}
        q.title = e.ChildText("a.question-hyperlink")
        q.excerpt = e.ChildText(".excerpt")

        qs = append(qs, q)

    })

    c.OnScraped(func(r *colly.Response) {
        for idx, q := range qs {

            fmt.Println("---------------------------------")
            fmt.Println(idx + 1)
            fmt.Printf("Q: %s \n\n", q.title)
            fmt.Println(q.excerpt)
        }
    })

    c.Visit("https://stackoverflow.com/questions/tagged/perl")
}

The program prints the current Perl questions from the main page.

type question struct {
    title   string
    excerpt string
}

We create a structure that will hold the question title and its excerpt.

qs := []question{}

A slice is created for the questions.

c.OnHTML("div.summary", func(e *colly.HTMLElement) {

    q := question{}
    q.title = e.ChildText("a.question-hyperlink")
    q.excerpt = e.ChildText(".excerpt")

    qs = append(qs, q)

})

We get the data into the structures.

c.OnScraped(func(r *colly.Response) {
    for idx, q := range qs {

        fmt.Println("---------------------------------")
        fmt.Println(idx + 1)
        fmt.Printf("Q: %s \n\n", q.title)
        fmt.Println(q.excerpt)
    }
})

The OnScraped is called after all works is done. At this time we can go over the slice and print all the scraped data.

Go Colly async mode

The default mode in which Colly works is synchronous. We enable the asynchronous mode with the Async function. In asynchronous mode, we need to call Wait to wait until the collector jobs are finished.

async.go
package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {

    urls := []string{
        "http://webcode.me",
        "https://example.com",
        "http://httpbin.org",
        "https://www.perl.org",
        "https://www.php.net",
        "https://www.python.org",
        "https://code.visualstudio.com",
        "https://clojure.org",
    }

    c := colly.NewCollector(
        colly.Async(),
    )

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })

    for _, url := range urls {

        c.Visit(url)
    }

    c.Wait()

}

Notice that each time we run the program we get a different order of the returned titles.

In this tutorial, we have performed web scraping and crawling in Golang with Colly.

List all Go tutorials.