Pyquery tutorial

Pyquery tutorial shows how to make jquery queries on XML documents in Python.

jQuery

jQuery is a JavaScript library which is used to manipulate DOM. With jQuery, we can find, select, traverse, and manipulate parts of an HTML document.

Pyquery

Pyquery is a Python library which has similar API to jQuery. It uses lxml module for fast XML and HTML manipulation. The API is as much as possible similar to jQuery.

Installing pyquery

Pyquery is installed with the following command:

$ sudo pip3 install pyquery

We use the pip3 command to install pyquery module.

The HTML file

In the examples, we will use the following HTML file:

index.html
<!DOCTYPE html>
<html>
    <head>
        <title>Header</title>
        <meta charset="utf-8">                   
    </head>
        
    <body>
        <h2>Operating systems</h2>
        
        <ul id="mylist" style="width:150px">
            <li>Solaris</li>
            <li>FreeBSD</li>
            <li>Debian</li>                      
            <li>NetBSD</li>           
            <li>Windows</li>         
        </ul>
    </body>    
</html>

Simple example

In the first example, we use pyquery module to get the text of a header.

header.py
#!/usr/bin/python3

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    
    contents = f.read()
    
    doc = pq(contents)
    text = doc("h2").text()
    
    print(text)

The code example prints the text of the h2 element.

from pyquery import PyQuery as pq

We import the PyQuery class from the pyquery module. The PyQuery is the main class for doing work.

with open("index.html", "r") as f:
    
    contents = f.read()

We open the index.html file and read its contents with the read() method.

doc = pq(contents)

A PyQuery object is created; the HTML data is passed to the constructor.

text = doc("h2").text()

We select the h2 tag and get its text with the text() method.

$ ./header.py 
Operating systems

This is the output.

The text and html methods

The text() method retrieves the text of an element while the html() method retrieves the HTML data of the element.

get_list.py
#!/usr/bin/python3

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    
    contents = f.read()
    
    doc = pq(contents)
    
    text = doc("ul").text()
    print("\n".join(text.split()))
    
    text = doc("ul").html()
    print("\n".join(text.split()))    

We get the text data and the HTML data of the ul element.

$ ./get_list.py 
Solaris
FreeBSD
Debian
NetBSD
Windows
<li>Solaris</li>
<li>FreeBSD</li>
<li>Debian</li>
<li>NetBSD</li>
<li>Windows</li>

This is the output.

Attributes

Element attributes can be retrieved with the attr() method.

attributes.py
#!/usr/bin/python3

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    
    contents = f.read()
    
    doc = pq(contents)
    
    tag = doc("ul")
    
    print(tag.attr("id"))
    print(tag.attr("style")) 

In the code example, we retrieve and print two attributes of the ul element: id and style.

$ ./attributes.py 
mylist
width:150px

This is the output.

Web scraping

Requests is a simple Python HTTP library. It provides methods for accessing Web resources via HTTP.

scraping.py
#!/usr/bin/python3

from pyquery import PyQuery as pq
import requests as req

resp = req.get("http://www.something.com")
doc = pq(resp.text)

title = doc('title').text()
print(title)

The example retrieves the title of a simple web page.

resp = req.get("http://www.something.com")
doc = pq(resp.text)

We get the HTML data of the page.

title = doc('title').text()
print(title)

We retrieve its title.

$ ./scraping.py 
Something.

This is the output.

Selecting tags

The selectors are used to select elements in an HTML document that meet certain criteria. The criteria can be their name, id, class name, attributes or a combination of them.

selecting.py
#!/usr/bin/python3

from pyquery import PyQuery as pq

def print_item(self, item):
    
    print("Tag: {0}, Text: {1}".format(item.tag, item.text))

with open("index.html", "r") as f:
    
    contents = f.read()
    
    doc = pq(contents)
    
    first_li = doc("li:first")
    print(first_li.text())
    
    last_li = doc("li:last")
    print(last_li.text())
    
    odd_lis = doc("li:odd")    
    odd_lis.each(print_item)

The example selects various li tags from the HTML document.

def print_item(self, item):
    
    print("Tag: {0}, Text: {1}".format(item.tag, item.text))

In this function, we print the tag name and its text.

first_li = doc("li:first")
print(first_li.text())

We select the first li tag and print its content with the text() method.

last_li = doc("li:last")
print(last_li.text())

Here we get the last li tag.

odd_lis = doc("li:odd")    
odd_lis.each(print_item)

With the help of the each() method, we print the tag and its content of the every odd li element.

$ ./selecting.py 
Solaris
Windows
Tag: li, Text: FreeBSD
Tag: li, Text: NetBSD

This is the output.

Removing elements

The remove() method deletes a tag.

removing.py
#!/usr/bin/python3

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    
    contents = f.read()
    
    doc = pq(contents)

    removed_item = doc('li:last').remove()
    
    print(removed_item)
    print(doc)

In the example, we remove the last li tag.

removed_item = doc('li:last').remove()

We select the last li tag and remove it with remove(). The removed element is returned.

print(removed_item)
print(doc)

We print the deleted item and the document, which has the element removed.

$ ./removing.py 
<li>Windows</li>         
        
<html>
    <head>
        <title>Header</title>
        <meta charset="utf-8"/>                   
    </head>
        
    <body>
        <h2>Operating systems</h2>
        
        <ul id="mylist" style="width:150px">
            <li>Solaris</li>
            <li>FreeBSD</li>
            <li>Debian</li>                      
            <li>NetBSD</li>           
                      
        </ul>
    </body>    
</html>

This is the output.

The items method

The items() method allows to iterate over elements.

iterate.py
#!/usr/bin/python3

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    
    contents = f.read()
    
    doc = pq(contents)

    items = [item.text() for item in doc.items('li')]
    print(items)

The example iterates over the li elements of the document.

items = [item.text() for item in doc.items('li')]

The items() method is used to create a Python list of li elements in a list comprehension.

$ ./iterate.py 
['Solaris', 'FreeBSD', 'Debian', 'NetBSD', 'Windows']

This is the output.

Appending and prepending elements

The append() method adds an element at the end of a node and the prepend() method inserts the element at the beginning of a node.

append_prepend.py
#!/usr/bin/python3

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    
    contents = f.read()
    
    doc = pq(contents)
    mylist = doc("#mylist")
    
    mylist.prepend("<li>DragonFly</li>")
    mylist.append("<li>OpenBSD</li>")
    
    print(mylist)

The code example inserts two li elements with the prepend() and append() methods.

The filter method

The filter() method is used to filter elements.

filtering.py
#!/usr/bin/python3

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    
    contents = f.read()
    
    doc = pq(contents)

    filtered = doc('li').filter(lambda i: pq(this).text().startswith(('F', 'D', 'N')))
    print(filtered.text())

The example displays operating systems that start with F, D, or N. We use a filter() method and an anonymous function.

 ./filtering.py 
FreeBSD Debian NetBSD

This is the output.

In this tutorial, we have worked with the Python pyquery library.

You might also be interested in the following related tutorials: jQuery tutorial, Beautifulsoup tutorial, Python tutorial, Python list comprehensions, Openpyxl tutorial, Python requests tutorial, and Python CSV tutorial.