
Cheerio tutorial

This Cheerio tutorial shows how to do web scraping in JavaScript with the Cheerio module. Cheerio implements the core of jQuery and is designed for the server.

Cheerio

Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server.

In this tutorial, we scrape HTML from a local web server, which we run with the local-web-server package.

index.html
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Home page</title>
    <style>
        .fpar {
            font-family: Georgia;
        }
    </style>
</head>

<body>
    <main>
        <h1>My website</h1>

        <p class="fpar">
            I am a JavaScript programmer.
        </p>

        <p>
            My hobbies are:
        </p>

        <ul>
            <li>Swimming</li>
            <li>Tai Chi</li>
            <li>Running</li>
            <li>Web development</li>
            <li>Reading</li>
            <li>Music</li>
        </ul>
    </main>
</body>

</html>

We will be working with this HTML file.

Cheerio selectors

In Cheerio, we use selectors to select tags of an HTML document. The selector syntax was borrowed from jQuery.

The following is a partial list of available selectors:

Installing Cheerio and other modules

We install the cheerio module and two additional modules.

$ nodejs -v
v9.11.2

We use Node version 9.11.2.

$ sudo npm i cheerio
$ sudo npm i request
$ sudo npm i -g local-web-server

We install cheerio, request, and local-web-server.

$ ws
Serving at http://t400:8000, http://127.0.0.1:8000, http://192.168.0.3:8000

Inside the project directory, where we have the index.html file, we start the local web server. It automatically serves the index.html file at three different addresses.

Cheerio title

In the first example, we get the title of the document.

get_title.js
const cheerio = require('cheerio');
const request = require('request');

request({
    method: 'GET',
    url: 'http://localhost:8000'
}, (err, res, body) => {

    if (err) return console.error(err);

    let $ = cheerio.load(body);

    let title = $('title');

    console.log(title.text());
});

The example prints the title of the HTML document.

const cheerio = require('cheerio');
const request = require('request');

We include the cheerio and request modules. With cheerio, we do web scraping. With request, we create GET requests.

request({
    method: 'GET',
    url: 'http://localhost:8000'
}, (err, res, body) => {

We create a GET request to localhost, which is served by our local web server. The HTML of the response is available in the body parameter.

let $ = cheerio.load(body);

First, we load the HTML document. To mimic jQuery, we use the $ variable.

let title = $('title');

The selector returns the title tag.

console.log(title.text());

With the text() method, we get the text of the title tag.

$ node get_title.js 
Home page

The example prints the title of the document.

Cheerio get parent element

The parent element is retrieved with parent().

get_parent.js
const cheerio = require('cheerio');
const request = require('request');

request({
    method: 'GET',
    url: 'http://localhost:8000'
}, (err, res, body) => {

    if (err) return console.error(err);

    let $ = cheerio.load(body);

    let h1El = $('h1');

    let parentEl = h1El.parent();

    console.log(parentEl.get(0).tagName);
});

We get the parent of the h1 element.

$ node get_parent.js 
main

The parent element of h1 is main.

Cheerio first & last element

The first element of a cheerio object can be found with first(), the last element with last().

first_last.js
const cheerio = require('cheerio');
const request = require('request');

request({
    method: 'GET',
    url: 'http://localhost:8000'
}, (err, res, body) => {

    if (err) return console.error(err);

    let $ = cheerio.load(body);

    let main = $('main');

    let fel = main.children().first();
    let lel = main.children().last();

    console.log(fel.get(0).tagName);
    console.log(lel.get(0).tagName);
});

The example prints the tag names of the first and last child elements of the main tag.

let main = $('main');

We select the main tag.

let fel = main.children().first();
let lel = main.children().last();

We get the first and the last element from the main children.

console.log(fel.get(0).tagName);
console.log(lel.get(0).tagName);

We find out the tag names.

$ node first_last.js 
h1
ul

The first tag of the main is h1, the last one is ul.

Cheerio add element

The append() method adds a new element at the end of the specified tag.

add_element.js
const cheerio = require('cheerio');
const request = require('request');

request({
    method: 'GET',
    url: 'http://localhost:8000'
}, (err, res, body) => {

    if (err) return console.error(err);

    let $ = cheerio.load(body);

    let ulEl = $('ul');

    ulEl.append('<li>Travel</li>');

    let lis = $('ul').html();
    let items = lis.split('\n');

    items.forEach((e) => {
        if (e) {
            console.log(e.replace(/(\s+)/g, ''));
        }
    });
});

In the example, we add a new list item to the ul element and print the list items to the console.

ulEl.append('<li>Travel</li>');

We append a new hobby.

let lis = $('ul').html();

We get the HTML of the ul tag.

let items = lis.split('\n');

items.forEach((e) => {
    if (e) {
        console.log(e.replace(/(\s+)/g, ''));
    }
});

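We split the HTML into lines and strip all whitespace from each line. The HTML of the element contains a lot of indentation; note that the regular expression also removes the spaces inside the text, so Tai Chi is printed as TaiChi.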

$ node add_element.js 
<li>Swimming</li>
<li>TaiChi</li>
<li>Running</li>
<li>Webdevelopment</li>
<li>Reading</li>
<li>Music</li>
<li>Travel</li>

A new travel hobby was appended at the end of the list.

Cheerio insert after element

With after(), we can insert an element after a tag.

insert_after.js
const cheerio = require('cheerio');
const request = require('request');

request({
    method: 'GET',
    url: 'http://localhost:8000'
}, (err, res, body) => {

    if (err) return console.error(err);

    let $ = cheerio.load(body);

    $('main').after('<footer>This is a footer</footer>');

    console.log($.html());
});

In the example, we insert a footer element after the main element.

Cheerio loop over elements

With each(), we can loop over elements.

loop_elements.js
const cheerio = require('cheerio');
const request = require('request');

request({
    method: 'GET',
    url: 'http://localhost:8000'
}, (err, res, body) => {

    if (err) return console.error(err);

    let $ = cheerio.load(body);

    let hobbies = [];

    $('li').each(function (i, e) {
        hobbies[i] = $(this).text();
    });

    console.log(hobbies);
});

The example loops over the li tags of the ul, collects the text of the elements into an array, and prints the array.

$ node loop_elements.js 
[ 'Swimming',
  'Tai Chi',
  'Running',
  'Web development',
  'Reading',
  'Music' ]

This is the output.

Cheerio get element attributes

Attributes can be retrieved with the attr() function.

attributes.js
const cheerio = require('cheerio');
const request = require('request');

request({
    method: 'GET',
    url: 'http://localhost:8000'
}, (err, res, body) => {

    if (err) return console.error(err);

    let $ = cheerio.load(body);

    let fpEl = $('h1 + p');
    let attrs = fpEl.attr();

    console.log(attrs);
});

In the example, we get the attributes of the paragraph that immediately follows h1.

$ node attributes.js 
{ class: 'fpar' }

The paragraph contains the fpar class.

Cheerio filter elements

We can use filter() to apply a filter on the elements.

filtering.js
const cheerio = require('cheerio');
const request = require('request');

request({
    method: 'GET',
    url: 'http://localhost:8000'
}, (err, res, body) => {

    if (err) return console.error(err);

    let $ = cheerio.load(body);

    let allEls = $('*');

    let filteredEls = allEls.filter(function (i, el) {
        // this === el
        return $(this).children().length > 3;
    });

    let items = filteredEls.get();

    items.forEach(e => {
        console.log(e.name);
    });

});

In the example, we find all elements of the document that have more than three children.

let allEls = $('*');

The * selector selects all elements.

let filteredEls = allEls.filter(function (i, el) {
    // this === el
    return $(this).children().length > 3;
});

On the retrieved elements, we apply a filter. An element is included in the filtered list only if it contains more than three children.

let items = filteredEls.get();

items.forEach(e => {
    console.log(e.name);
});

We go through the filtered list and print the names of the elements.

$ node filtering.js 
head
main
ul

The head, main, and ul elements contain more than three children. The body is not included because it contains only one immediate child.

In this tutorial, we have done web scraping in JavaScript with the Cheerio library.
