Python web scraping with BeautifulSoup
last modified January 29, 2024
In this article we show how to do web scraping in Python using the BeautifulSoup library.
Web scraping is the fetching and extracting of data from web pages. It is used to collect and process data for marketing or research; the data may include job listings, price comparisons, or social media posts.
BeautifulSoup
BeautifulSoup is a popular Python library for parsing HTML and XML documents. It transforms a complex HTML document into a tree of Python objects, such as Tag, NavigableString, or Comment.
To fetch data from a web page, we use the requests library.
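As a minimal offline sketch of this object tree, we can parse an HTML string directly and inspect the resulting types; the snippet below uses Python's built-in html.parser instead of lxml, so no extra parser dependency is needed:

```python
from bs4 import BeautifulSoup

# parse a small in-memory document; html.parser is Python's
# built-in parser, so no extra dependency is required
doc = '<html><head><title>Test</title></head><body><p>Hello</p></body></html>'
soup = BeautifulSoup(doc, 'html.parser')

print(type(soup.title).__name__)         # Tag
print(type(soup.title.string).__name__)  # NavigableString
print(soup.p.text)                       # Hello
```

The same tree navigation works identically on a document fetched over HTTP.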
Scraping title
In the first example, we scrape the title of a web page.
#!/usr/bin/python

import bs4
import requests

url = 'http://webcode.me'
resp = requests.get(url)

soup = bs4.BeautifulSoup(resp.text, 'lxml')

print(soup.title)
print(soup.title.text)
print(soup.title.parent)
The program fetches an HTML document from webcode.me and uses the BeautifulSoup library to get the title of the document.
import bs4
We import the BeautifulSoup library.
import requests
We import the Requests library; it is a popular HTTP library.
url = 'http://webcode.me'
resp = requests.get(url)
We generate a GET request to the given URL.
soup = bs4.BeautifulSoup(resp.text, 'lxml')
We create the BeautifulSoup object from the HTML document.
print(soup.title)
print(soup.title.text)
print(soup.title.parent)
We print the title, its contents, and the title's parent.
$ ./title.py
<title>My html page</title>
My html page
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="format.css" rel="stylesheet"/>
<title>My html page</title>
</head>
Scraping attributes
HTML attributes are special words used inside the opening tag to control the element's behaviour.
#!/usr/bin/python

import bs4
import requests

url = 'http://webcode.me/'
resp = requests.get(url)

soup = bs4.BeautifulSoup(resp.text, 'lxml')

link = soup.link

print(link.attrs)
print(link.attrs['href'])
In the example, we print the attributes of the link tag.
link = soup.link
We get the link tag.
print(link.attrs)
We retrieve all attributes of the link tag.
print(link.attrs['href'])
Here we get the href attribute.
$ ./attrs.py
{'rel': ['stylesheet'], 'href': 'format.css'}
format.css
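A tag may lack a given attribute, in which case dict-style indexing raises a KeyError. Since attrs behaves like a plain dictionary, a small offline sketch (inline markup, built-in html.parser) shows how get returns None instead:

```python
from bs4 import BeautifulSoup

doc = '<link rel="stylesheet" href="format.css">'
link = BeautifulSoup(doc, 'html.parser').link

# attrs behaves like a dict; get returns None for missing keys
print(link.attrs['href'])        # format.css
print(link.attrs.get('media'))   # None
print(link.get('href'))          # tags also support get directly
```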
Scraping descendants and parents
All the descendants and parents of a tag can be accessed via the descendants and parents attributes.
#!/usr/bin/python

import bs4
import requests

url = 'http://webcode.me/os.html'
resp = requests.get(url)

soup = bs4.BeautifulSoup(resp.text, 'lxml')

ul = soup.ul

for e in ul.descendants:
    if isinstance(e, bs4.element.Tag):
        print(e.name)

print('-----------------------')

for e in ul.parents:
    print(e.name)
In the program, we find the ul tag and print its children and ancestors.
for e in ul.descendants:
    if isinstance(e, bs4.element.Tag):
        print(e.name)
The descendants also include text elements, while we want only the tags. Therefore, we filter the elements with the isinstance function.
We get the names of the tags with the name attribute.
$ ./des_par.py
li
li
li
li
li
-----------------------
body
html
[document]
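The contrast between children and descendants can be seen in a small offline sketch (inline markup, built-in html.parser): children yields only the direct children, while descendants walks the whole subtree, including nested tags.

```python
import bs4

doc = '<ul><li><b>one</b></li><li>two</li></ul>'
ul = bs4.BeautifulSoup(doc, 'html.parser').ul

# children: only the direct li tags
kids = [e.name for e in ul.children if isinstance(e, bs4.element.Tag)]
print(kids)   # ['li', 'li']

# descendants: the whole subtree, including the nested b tag
deep = [e.name for e in ul.descendants if isinstance(e, bs4.element.Tag)]
print(deep)   # ['li', 'b', 'li']
```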
Finding tags by name
With the find_all method, we can find all elements by name.
#!/usr/bin/python

import bs4
import requests

url = 'http://webcode.me/os.html'
resp = requests.get(url)

soup = bs4.BeautifulSoup(resp.text, 'lxml')

els = soup.find_all('li')

for e in els:
    print(e.string)
In the program, we find all li tags.
els = soup.find_all('li')

for e in els:
    print(e.string)
We find all li tags and print their text contents.
$ ./find_all.py
Solaris
FreeBSD
Debian
NetBSD
Windows
Finding tags by id
With find, we can find an element by its id attribute.
#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

url = 'http://webcode.me/os.html'
resp = req.get(url)

soup = BeautifulSoup(resp.text, 'lxml')

e = soup.find(id='netbsd')

print(e.name)
print(e.string)
print(e.prettify())
In the example, we find the tag that has id equal to 'netbsd'.
print(e.name)
print(e.string)
print(e.prettify())
We print the tag's name, string content, and prettified output.
$ ./find_by_id.py
p
NetBSD is a free, fast, secure, and highly portable Unix-like Open Source operating system.
<p id="netbsd">
 NetBSD is a free, fast, secure, and highly portable Unix-like Open Source operating system.
</p>
Scraping multiple tags
We can pass the find_all function a list of tags that we want to find.
#!/usr/bin/python

import bs4
import requests

url = 'http://webcode.me/os.html'
resp = requests.get(url)

soup = bs4.BeautifulSoup(resp.text, 'lxml')

els = soup.find_all(['title', 'h1', 'footer'])

for e in els:
    print(e.prettify())
We scrape the following three tags: title, h1, and footer.
$ ./multiple_tags.py
<title>
 Header
</title>
<h1>
 Operating systems
</h1>
<footer>
 @ webcode.me
</footer>
Finding by function
We can pass a function to the name parameter of find_all; the function returns True for each element that matches.
#!/usr/bin/python

import bs4
import requests

def find_tags(tag):
    return tag.name in ['title', 'h1', 'footer']

url = 'http://webcode.me/os.html'
resp = requests.get(url)

soup = bs4.BeautifulSoup(resp.text, 'lxml')

els = soup.find_all(name=find_tags)

for e in els:
    print(e.prettify())
The program finds three tags in the document.
def find_tags(tag):
    return tag.name in ['title', 'h1', 'footer']
The find_tags function returns True for tags whose name is in the ['title', 'h1', 'footer'] list.
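Since any callable works as the filter, the same check can also be written inline with a lambda; a minimal offline sketch (inline markup, built-in html.parser):

```python
from bs4 import BeautifulSoup

doc = '<title>T</title><h1>H</h1><p>P</p><footer>F</footer>'
soup = BeautifulSoup(doc, 'html.parser')

# any callable taking a tag works as the name filter
els = soup.find_all(lambda tag: tag.name in ['title', 'h1', 'footer'])
print([e.name for e in els])   # ['title', 'h1', 'footer']
```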
Finding elements by string
With the string parameter of find_all, we can find elements that contain the specified string.
#!/usr/bin/python

import bs4
import re
import requests

url = 'http://webcode.me/os.html'
resp = requests.get(url)

soup = bs4.BeautifulSoup(resp.text, 'lxml')

term = re.compile('BSD')
els = soup.find_all('p', string=term)

for e in els:
    print(e.string.strip())
In the example, we find all paragraphs that contain the BSD string.
$ ./find_string.py
FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.
NetBSD is a free, fast, secure, and highly portable Unix-like Open Source operating system.
Finding by CSS selectors
CSS selectors allow us to search for elements with complex queries.
#!/usr/bin/python

import bs4
import requests

url = 'http://webcode.me/countries.html'
resp = requests.get(url)

soup = bs4.BeautifulSoup(resp.text, 'lxml')

data = soup.select('tbody tr:nth-child(-n+5)')

for row in data:
    print(row.text.strip().replace('\n', ' '))
We pass CSS selectors to the select method.
data = soup.select('tbody tr:nth-child(-n+5)')
With the given CSS selector, we find the five most populated countries in the given table.
$ ./top_countries.py
1 China 1382050000
2 India 1313210000
3 USA 324666000
4 Indonesia 260581000
5 Brazil 207221000
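When only the first match is needed, select_one returns a single element instead of a list. A small offline sketch against a hypothetical inline table (built-in html.parser):

```python
from bs4 import BeautifulSoup

# a tiny stand-in for the countries table
doc = '''<table><tbody>
<tr><td>1</td><td>China</td></tr>
<tr><td>2</td><td>India</td></tr>
</tbody></table>'''
soup = BeautifulSoup(doc, 'html.parser')

# select returns a list; select_one returns the first match or None
first = soup.select_one('tbody tr')
print([td.text for td in first.select('td')])   # ['1', 'China']
```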
Source
Python BeautifulSoup documentation
In this article we have shown how to do web scraping in Python with the BeautifulSoup library.