Pyquery
last modified January 29, 2024
This Pyquery tutorial shows how to run jQuery-style queries on XML and HTML documents in Python.
jQuery is a JavaScript library used to manipulate the DOM. With jQuery, we can find, select, traverse, and manipulate parts of an HTML document.
Pyquery
Pyquery is a Python library with an API modeled on jQuery's. It uses the lxml module for fast XML and HTML manipulation, and it keeps its API as close to jQuery's as possible.
Installing pyquery
Pyquery is installed with the following command:
$ sudo pip3 install pyquery
We use the pip3 command to install the pyquery module.
The HTML file
In the examples, we will use the following HTML file:
<!DOCTYPE html>
<html>
<head>
    <title>Header</title>
    <meta charset="utf-8">
</head>
<body>
    <h2>Operating systems</h2>
    <ul id="mylist" style="width:150px">
        <li>Solaris</li>
        <li>FreeBSD</li>
        <li>Debian</li>
        <li>NetBSD</li>
        <li>Windows</li>
    </ul>
</body>
</html>
Simple example
In the first example, we use the pyquery module to get the text of a header.
#!/usr/bin/python

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    contents = f.read()

doc = pq(contents)
text = doc("h2").text()
print(text)

The code example prints the text of the h2 element.
from pyquery import PyQuery as pq
We import the PyQuery class from the pyquery module. PyQuery is the main class of the library.
with open("index.html", "r") as f:
    contents = f.read()

We open the index.html file and read its contents with the read method.
doc = pq(contents)
A PyQuery object is created; the HTML data is passed to the constructor.
text = doc("h2").text()
We select the h2 tag and get its text with the text method.
$ ./header.py
Operating systems
The text and html methods
The text method retrieves the text of an element, while the html method retrieves the element's inner HTML markup.
#!/usr/bin/python

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    contents = f.read()

doc = pq(contents)

text = doc("ul").text()
print("\n".join(text.split()))

text = doc("ul").html()
print("\n".join(text.split()))

We get the text data and the HTML data of the ul element.
$ ./get_list.py
Solaris
FreeBSD
Debian
NetBSD
Windows
<li>Solaris</li>
<li>FreeBSD</li>
<li>Debian</li>
<li>NetBSD</li>
<li>Windows</li>
Attributes
Element attributes can be retrieved with the attr method.
#!/usr/bin/python

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    contents = f.read()

doc = pq(contents)

tag = doc("ul")

print(tag.attr("id"))
print(tag.attr("style"))

In the code example, we retrieve and print two attributes of the ul element: id and style.
$ ./attributes.py
mylist
width:150px
Web scraping
Requests is a simple Python HTTP library. It provides methods for accessing Web resources via HTTP.
#!/usr/bin/python

from pyquery import PyQuery as pq
import requests as req

resp = req.get("http://www.webcode.me")
doc = pq(resp.text)

title = doc('title').text()
print(title)
The example retrieves the title of a simple web page.
resp = req.get("http://www.webcode.me")
doc = pq(resp.text)
We get the HTML data of the page.
title = doc('title').text()
print(title)
We retrieve its title.
$ ./scraping.py
My html page
Selecting tags
Selectors pick out the elements of an HTML document that meet certain criteria. The criteria can be the tag name, id, class name, attributes, or a combination of them.
#!/usr/bin/python

from pyquery import PyQuery as pq

def print_item(self, item):
    print("Tag: {0}, Text: {1}".format(item.tag, item.text))

with open("index.html", "r") as f:
    contents = f.read()

doc = pq(contents)

first_li = doc("li:first")
print(first_li.text())

last_li = doc("li:last")
print(last_li.text())

odd_lis = doc("li:odd")
odd_lis.each(print_item)

The example selects various li tags from the HTML document.
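def print_item(self, item):
    print("Tag: {0}, Text: {1}".format(item.tag, item.text))

In this function, we print the tag name and its text. The each method calls it with the element's index and the underlying lxml element, so the first parameter (named self here) receives the index, and item.tag and item.text are lxml attributes rather than Pyquery methods.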
first_li = doc("li:first")
print(first_li.text())

We select the first li tag and print its content with the text method.
last_li = doc("li:last")
print(last_li.text())

Here we get the last li tag.
odd_lis = doc("li:odd")
odd_lis.each(print_item)

With the help of the each method, we print the tag and content of every li element at an odd index. Indexing is zero-based, so these are the second and fourth items.
$ ./selecting.py
Solaris
Windows
Tag: li, Text: FreeBSD
Tag: li, Text: NetBSD
Removing elements
The remove method deletes the selected elements from the document.
#!/usr/bin/python

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    contents = f.read()

doc = pq(contents)

removed_item = doc('li:last').remove()

print(removed_item)
print(doc)

In the example, we remove the last li tag.
removed_item = doc('li:last').remove()
We select the last li tag and remove it with remove. The removed element is returned.
print(removed_item)
print(doc)
We print the deleted item and the document, which has the element removed.
$ ./removing.py
<li>Windows</li>
<html>
<head>
    <title>Header</title>
    <meta charset="utf-8"/>
</head>
<body>
    <h2>Operating systems</h2>
    <ul id="mylist" style="width:150px">
        <li>Solaris</li>
        <li>FreeBSD</li>
        <li>Debian</li>
        <li>NetBSD</li>
    </ul>
</body>
</html>
The items method
The items method lets us iterate over the matched elements; each one is yielded as a PyQuery object.
#!/usr/bin/python

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    contents = f.read()

doc = pq(contents)
items = [item.text() for item in doc.items('li')]
print(items)

The example iterates over the li elements of the document.
items = [item.text() for item in doc.items('li')]
The items method is used in a list comprehension to build a Python list holding the text of each li element.
$ ./iterate.py
['Solaris', 'FreeBSD', 'Debian', 'NetBSD', 'Windows']
Appending and prepending elements
The append method adds an element at the end of a node, and the prepend method inserts an element at the beginning of a node.
#!/usr/bin/python

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    contents = f.read()

doc = pq(contents)

mylist = doc("#mylist")

mylist.prepend("<li>DragonFly</li>")
mylist.append("<li>OpenBSD</li>")

print(mylist)

The code example inserts two li elements with the prepend and append methods.
The filter method
The filter method narrows a selection to the elements that pass a test.
#!/usr/bin/python

from pyquery import PyQuery as pq

with open("index.html", "r") as f:
    contents = f.read()

doc = pq(contents)

filtered = doc('li').filter(lambda i: pq(this).text().startswith(('F', 'D', 'N')))
print(filtered.text())

The example displays operating systems that start with F, D, or N. We use the filter method and an anonymous function. Inside the function, Pyquery makes the current element available under the name this, mirroring jQuery.
$ ./filtering.py
FreeBSD Debian NetBSD
In this article we have worked with the Python pyquery library.