Python web scrape
last modified January 29, 2024
In this article we show how to do web scraping in Python. We use multiple Python libraries.
Web scraping is fetching and extracting data from web pages. It is used to collect and process data for marketing or research. The data may include job listings, price comparisons, or social media posts.
Python is a popular choice for data science. It contains many libraries for web scraping. To fetch data, we can utilize the requests or urllib3 libraries. The httpx library can be used if we want to create asynchronous clients. To process the data, we can use the lxml, pyquery, or BeautifulSoup libraries. These libraries are suited for static data. If the data is hidden behind a JavaScript wall, we can use the Selenium or Playwright libraries.
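As a quick illustration of the asynchronous approach, here is a minimal sketch of an httpx client that fetches a page; it assumes the httpx package is installed.

#!/usr/bin/python

import asyncio
import httpx

async def main():

    # the AsyncClient manages a connection pool
    # and is used with the await keyword
    async with httpx.AsyncClient() as client:
        resp = await client.get('http://webcode.me')
        print(resp.status_code)

asyncio.run(main())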
Web scrape with urllib3 & lxml
In the first example, we fetch data with urllib3 and process it with lxml.
#!/usr/bin/python

import urllib3
from lxml import html

http = urllib3.PoolManager()

url = 'http://webcode.me'
resp = http.request('GET', url)

content = resp.data.decode('utf-8')
root = html.fromstring(content)

print('------------------------')
print(root.head.find(".//title").text)
print('------------------------')

for e in root:
    print(e.tag)

print('------------------------')
print(root.body.text_content().strip())
The program retrieves the HTML title, the tags, and the text content of the HTML body.
http = urllib3.PoolManager()
A PoolManager is created. It handles all of the details of connection pooling and thread safety.
url = 'http://webcode.me'
resp = http.request('GET', url)
We generate a GET request to the specified URL.
content = resp.data.decode('utf-8')
root = html.fromstring(content)
We get and decode the content. We parse the string to create an HTML document for lxml.
print(root.head.find(".//title").text)
We print the title of the document.
for e in root:
    print(e.tag)
Here we print all the tags from the first level of the document.
print(root.body.text_content().strip())
We print the text data of the HTML body.
$ ./main.py
------------------------
My html page
------------------------
head
body
------------------------
Today is a beautiful day. We go swimming and fishing.
Hello there. How are you?
Web scrape with requests and pyquery
In the second example, we use the requests library to fetch data and pyquery to process it.
#!/usr/bin/python

from pyquery import PyQuery as pq
import requests as req

resp = req.get("http://www.webcode.me")
doc = pq(resp.text)

title = doc('title').text()
print(title)

pars = doc('p').text()
print(pars)
In the example, we get the title and the text data from all p tags.
resp = req.get("http://www.webcode.me") doc = pq(resp.text)
We generate a GET request and create a parsable document object from the response.
title = doc('title').text()
print(title)
We get the title tag from the document and print its text.
$ ./main.py
My html page
Today is a beautiful day. We go swimming and fishing. Hello there. How are you?
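pyquery can also walk the matched elements one by one. The following is a small sketch using the items method, which wraps each matched element in its own PyQuery object:

#!/usr/bin/python

from pyquery import PyQuery as pq
import requests as req

resp = req.get("http://www.webcode.me")
doc = pq(resp.text)

# items yields each matched p tag separately
for par in doc('p').items():
    print(par.text())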
Python scrape dictionary definitions
In the next example, we scrape definitions of a word from dictionary.com.
We use the requests and lxml libraries.
#!/usr/bin/python

import requests as req
from lxml import html
import textwrap

term = "dog"

resp = req.get("http://www.dictionary.com/browse/" + term)
root = html.fromstring(resp.content)

for sel in root.xpath("//span[contains(@class, 'one-click-content')]"):

    if sel.text:
        s = sel.text.strip()
        if len(s) > 3:
            print(textwrap.fill(s, width=50))
The program fetches the definitions of the term dog.
import textwrap
The textwrap module is used to wrap text to a certain width.
resp = req.get("http://www.dictionary.com/browse/" + term)
To perform a search, we append the term at the end of the URL.
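A term that contains spaces or other special characters would have to be percent-encoded first. A small sketch with the standard urllib.parse.quote function; the example term is hypothetical:

#!/usr/bin/python

from urllib.parse import quote

# percent-encode the term before appending it to the URL
term = "guinea pig"
url = "http://www.dictionary.com/browse/" + quote(term)
print(url)  # http://www.dictionary.com/browse/guinea%20pig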
root = html.fromstring(resp.content)
We need to use resp.content rather than resp.text because html.fromstring implicitly expects bytes as input. (resp.content returns the content as bytes, whereas resp.text returns it as Unicode text.)
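We can verify the difference by checking the types of the two attributes; this snippet continues the example above:

print(type(resp.content))  # <class 'bytes'>
print(type(resp.text))     # <class 'str'>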
for sel in root.xpath("//span[contains(@class, 'one-click-content')]"):

    if sel.text:
        s = sel.text.strip()
        if len(s) > 3:
            print(textwrap.fill(s, width=50))
We parse the content. The main definitions are located inside span tags whose class attribute contains one-click-content.
We improve the formatting by removing excessive white space and stray characters. The text is wrapped to a maximum width of 50 characters. Note that such parsing is subject to change.
$ ./get_term.py
a domesticated canid, any carnivore of the dog
family Canidae, having prominent canine teeth and,
in the wild state, a long and slender muzzle, a
deep-chested muscular body, a bushy tail, and
large, erect ears.
...
Python web scrape with BeautifulSoup
BeautifulSoup is a Python library for parsing HTML and XML documents. It is one of the most powerful web scraping solutions.
BeautifulSoup transforms a complex HTML document into a complex tree of Python objects, such as tag, navigable string, or comment.
#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title)
print(soup.title.text)
print(soup.title.parent)
In the example, we get the title tag, title text and the parent of the title tag. To fetch the web page, we utilize the requests library.
soup = BeautifulSoup(resp.text, 'lxml')
A BeautifulSoup object is created; the HTML data is passed to its constructor. The second argument specifies the internal parser.
print(soup.title)
print(soup.title.text)
print(soup.title.parent)
We get the data using built-in attributes.
$ ./main.py
<title>My html page</title>
My html page
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="format.css" rel="stylesheet"/>
<title>My html page</title>
</head>
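To see the tree of Python objects mentioned earlier, we can inspect the types of the parsed elements. A short sketch that continues with the same soup object:

# the title element is a Tag; its inner text is a NavigableString
print(type(soup.title))         # <class 'bs4.element.Tag'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>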
Python scrape top 5 countries
In the next example, we extract the top five most populated countries.
#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

resp = req.get('http://webcode.me/countries.html')
soup = BeautifulSoup(resp.text, 'lxml')

data = soup.select('tbody tr:nth-child(-n+5)')

for row in data:
    print(row.text.strip().replace('\n', ' '))
To extract the data, we use the select method, which performs a CSS selection operation.
$ ./top_countries.py
1 China 1382050000
2 India 1313210000
3 USA 324666000
4 Indonesia 260581000
5 Brazil 207221000
Python scrape dynamic content
We can scrape dynamic content with Playwright or Selenium. In our example, we use the Playwright library.
$ pip install --upgrade pip
$ pip install playwright
$ playwright install
We install Playwright and the browser drivers.
#!/usr/bin/python

from playwright.sync_api import sync_playwright

with sync_playwright() as p:

    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://webcode.me/click.html")

    page.click('button', button='left')

    print(page.query_selector('#output').text_content())

    browser.close()
There is a single button on the web page. When we click on the button, a text message appears in the output div tag.
with sync_playwright() as p:
We work in synchronous mode.
browser = p.chromium.launch()
We use the Chromium browser. By default, the browser is launched in headless mode.
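If we want to watch the automation in a visible browser window, we can turn headless mode off:

browser = p.chromium.launch(headless=False)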
page = browser.new_page()
page.goto("http://webcode.me/click.html")
We navigate to the page.
page.click('button', button='left')
We click on the button.
print(page.query_selector('#output').text_content())
We retrieve the message.
$ ./main.py
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/109.0.5414.46 Safari/537.36
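For comparison, the same task can be done with Selenium. The following is a minimal sketch using Selenium 4 and a headless Chrome driver; it assumes the selenium package and a matching Chrome driver are installed.

#!/usr/bin/python

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)

try:
    driver.get('http://webcode.me/click.html')

    # click the button and read the message from the output div
    driver.find_element(By.TAG_NAME, 'button').click()
    print(driver.find_element(By.ID, 'output').text)
finally:
    driver.quit()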
Source
Python Playwright documentation
In this article we have shown how to do web scraping in Python.