Java JSoup
last modified July 4, 2024
The JSoup tutorial is an introductory guide to the JSoup HTML parser. In the tutorial, we parse HTML data from an HTML string, a local HTML file, and a web page. We also sanitize HTML data and modify documents.
JSoup is a Java library for extracting and manipulating HTML data. It implements the HTML5 specification, and parses HTML to the same DOM as modern browsers.
With JSoup we are able to:
- scrape and parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safelist, to prevent XSS attacks
- output tidy HTML
Dependency
In the examples of this tutorial, we use the following Maven dependency.
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
The Jsoup class provides the core public access point to the jsoup functionality via its static methods. For instance, the clean methods sanitize HTML code, the connect method creates a connection to a URL, and the parse methods parse HTML content.
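As a rough illustration of these static entry points, the following sketch calls Jsoup.parse and Jsoup.clean on small in-memory strings. The class name EntryPoints and its helper methods are made up for this illustration; only the Jsoup and Safelist calls come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

// Hypothetical helper class, used only for this sketch.
class EntryPoints {

    // Jsoup.parse turns an HTML string into a Document; we read its title.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Jsoup.clean keeps only what the given safelist allows.
    // Safelist.none() permits no tags, so only text nodes survive.
    static String textOnly(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    public static void main(String[] args) {
        System.out.println(titleOf("<html><head><title>My title</title></head></html>"));
        System.out.println(textOnly("<p>Hello <b>there</b></p>"));
    }
}
```

The connect method is left out here because it needs network access; it is covered by the web page examples.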
HTML file
In some of the examples, we use the following HTML file:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document title</title>
</head>
<body>

<p>List of words</p>

<ul>
    <li>dark</li>
    <li>smart</li>
    <li>war</li>
    <li>cloud</li>
    <li>park</li>
    <li>cup</li>
    <li>worm</li>
    <li>water</li>
    <li>rock</li>
    <li>warm</li>
</ul>

<footer>footer for words</footer>

</body>
</html>
Parse HTML string
The Jsoup.parse method parses an HTML string into a document.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

void main() {

    String htmlString = """
            <html><head><title>My title</title></head>
            <body>Body content</body></html>""";

    Document doc = Jsoup.parse(htmlString);
    String title = doc.title();
    String body = doc.body().text();

    System.out.printf("Title: %s%n", title);
    System.out.printf("Body: %s", body);
}
The example parses an HTML string and outputs its title and body content.
String htmlString = """
        <html><head><title>My title</title></head>
        <body>Body content</body></html>""";
This string contains simple HTML data.
Document doc = Jsoup.parse(htmlString);
With Jsoup's parse method, we parse the HTML string. The method returns an HTML document.
String title = doc.title();
The document's title method gets the string contents of the document's title element.
String body = doc.body().text();
The document's body method returns the body element; its text method gets the text of the element.
JSoup parse local HTML file
In the second example, we are going to parse a local HTML file. We use the overloaded Jsoup.parse method that takes a File object as its first parameter.
<!DOCTYPE html>
<html>
<head>
    <title>My title</title>
    <meta charset="UTF-8">
</head>
<body>
    <div id="mydiv">Contents of a div element</div>
</body>
</html>
For the example, we use the above HTML file.
import java.io.File;
import java.io.IOException;
import java.util.Optional;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

void main() throws IOException {

    String fileName = "src/main/resources/index.html";

    Document doc = Jsoup.parse(new File(fileName), "utf-8");
    Optional<Element> divTag = Optional.ofNullable(doc.getElementById("mydiv"));

    divTag.ifPresent(e -> System.out.println(e.text()));
}
The example parses the index.html file, which is located in the src/main/resources/ directory.
Document doc = Jsoup.parse(new File(fileName), "utf-8");
We parse the HTML file with the Jsoup.parse method.
Optional<Element> divTag = Optional.ofNullable(doc.getElementById("mydiv"));
With the document's getElementById method, we get the element by its ID.
divTag.ifPresent(e -> System.out.println(e.text()));
The text of the tag is retrieved with the element's text method.
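To try the file-based overload without preparing a project directory, one option is to write the HTML to a temporary file first. The sketch below does that and uses an explicit null check instead of Optional, since getElementById returns null when no element has the given ID. The class ParseTempFile and the method textById are names invented for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only for this sketch.
class ParseTempFile {

    // Writes the HTML to a temporary file, parses that file, and returns the
    // text of the element with the given id, or null if there is no such element.
    static String textById(String html, String id) {
        try {
            Path tmp = Files.createTempFile("jsoup-demo", ".html");
            try {
                Files.writeString(tmp, html, StandardCharsets.UTF_8);
                Document doc = Jsoup.parse(tmp.toFile(), "UTF-8");
                Element el = doc.getElementById(id);
                return el == null ? null : el.text();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><div id=\"mydiv\">Contents of a div element</div></body></html>";
        System.out.println(textById(html, "mydiv"));
    }
}
```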
Read web site's title
In the following example, we scrape and parse a web page and retrieve the content of the title element.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

void main() throws IOException {

    String url = "https://webcode.me";

    Document doc = Jsoup.connect(url).get();
    String title = doc.title();
    System.out.println(title);
}
In the code example, we read the title of a specified web page.
Document doc = Jsoup.connect(url).get();
Jsoup's connect method creates a connection to the given URL. The get method executes a GET request and parses the result; it returns an HTML document.
String title = doc.title();
With the document's title method, we get the title of the HTML document.
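The object returned by connect can be configured before the request is sent. The following sketch sets a user agent and a timeout; the user agent string, the ten-second timeout, and the class name ConfiguredFetch are arbitrary choices made for this illustration.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

// Hypothetical helper class, used only for this sketch.
class ConfiguredFetch {

    // Builds a connection but does not send any request yet.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (jsoup tutorial sketch)") // arbitrary value
                .timeout(10_000); // milliseconds
    }

    // Executes the GET request and returns the page title.
    static String fetchTitle(String url) throws IOException {
        return configure(url).get().title();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetchTitle("https://webcode.me"));
    }
}
```

Keeping the configuration in its own method makes it possible to inspect the connection settings without touching the network.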
Read web page
The next example retrieves the HTML source of a web page.
import java.io.IOException;
import org.jsoup.Jsoup;

void main() throws IOException {

    String webPage = "https://webcode.me";

    String html = Jsoup.connect(webPage).get().html();

    System.out.println(html);
}
The example prints the HTML of a web page.
String html = Jsoup.connect(webPage).get().html();
The html method returns the HTML of an element; in our case, the HTML of the whole document, as serialized by jsoup after parsing.
Metadata information
The meta tags of an HTML document provide structured metadata about a web page, such as its description and keywords.
import java.io.IOException;
import java.util.Optional;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

void main() throws IOException {

    String url = "https://jsoup.org";

    Document doc = Jsoup.connect(url).get();

    Optional<Element> el1 = Optional.ofNullable(doc.select("meta[name=description]").first());
    el1.ifPresent(e -> System.out.println(e.attr("content")));

    Optional<Element> el2 = Optional.ofNullable(doc.select("meta[name=keywords]").first());
    el2.ifPresent(e -> System.out.println(e.attr("content")));
}
The code example retrieves meta information about a specified web page.
Optional<Element> el2 = Optional.ofNullable(doc.select("meta[name=keywords]").first());
el2.ifPresent(e -> System.out.println(e.attr("content")));
The document's select method finds elements that match the given query. The first method returns the first matched element, or null if nothing matched. With the attr method, we get the value of the content attribute. We wrap the result in an Optional to avoid a possible NullPointerException.
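The same lookup can also be written with selectFirst, which combines select and first into one call and likewise returns null when nothing matches. The sketch below works on an in-memory string so that it does not depend on the contents of a live page; the class MetaLookup and the sample HTML are invented for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only for this sketch.
class MetaLookup {

    // Returns the content attribute of <meta name="..."> or an empty
    // string when the document has no such meta tag.
    static String metaContent(String html, String name) {
        Document doc = Jsoup.parse(html);
        Element meta = doc.selectFirst("meta[name=" + name + "]");
        return meta == null ? "" : meta.attr("content");
    }

    public static void main(String[] args) {
        String html = """
                <html><head>
                <meta name="description" content="A demo page">
                </head><body></body></html>""";
        System.out.println(metaContent(html, "description"));
        System.out.println(metaContent(html, "keywords").isEmpty());
    }
}
```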
Get all tags
To get all tags, we pass the * character to the select method.
import org.jsoup.Jsoup;

import java.io.File;
import java.io.IOException;

void main() throws IOException {

    var fileName = "src/main/resources/words.html";
    var myFile = new File(fileName);
    var doc = Jsoup.parse(myFile, "UTF-8");

    var all = doc.body().select("*");
    all.forEach(e -> System.out.println(e.tagName()));
}
We get all the tags from the words.html document.
var all = doc.body().select("*");
We get all elements.
all.forEach(e -> System.out.println(e.tagName()));
We go over all the elements and print their tag names with tagName.
The text method
The text method gets the combined text of this element and all its children. The whitespace is normalized and trimmed.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;
import java.util.Optional;

void main() throws IOException {

    var fileName = "src/main/resources/words.html";
    var myFile = new File(fileName);
    var doc = Jsoup.parse(myFile, "UTF-8");

    System.out.println(doc.text());

    System.out.println("---------------------------");

    System.out.println(doc.body().text());

    System.out.println("---------------------------");

    Optional<Element> e1 = Optional.ofNullable(doc.select("body>p").first());
    e1.ifPresent(e -> System.out.println(e.text()));

    System.out.println("---------------------------");

    Optional<Element> e2 = Optional.ofNullable(doc.select("body>ul").first());
    e2.ifPresent(e -> System.out.println(e.text()));

    System.out.println("---------------------------");

    e2.ifPresent(e -> {

        Elements lis = e.children();

        Optional<Element> ch1 = Optional.ofNullable(lis.first());
        ch1.ifPresent(ce -> System.out.println(ce.text()));

        Optional<Element> ch2 = Optional.ofNullable(lis.last());
        ch2.ifPresent(ce -> System.out.println(ce.text()));
    });
}
In the example, we get the text data from the whole document, body, paragraph, unordered list, and first and last list item.
Document title List of words dark smart war cloud park cup worm water rock warm footer for words
---------------------------
List of words dark smart war cloud park cup worm water rock warm footer for words
---------------------------
List of words
---------------------------
dark smart war cloud park cup worm water rock warm
---------------------------
dark
warm
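The whitespace normalization is easier to see on a tiny fragment. The following sketch, with a made-up class name TextNormalization, compares text with ownText, which only considers the text nodes that are direct children of the element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only for this sketch.
class TextNormalization {

    // Combined text of the first <p> and its descendants; whitespace runs
    // are collapsed and the result is trimmed.
    static String normalized(String html) {
        Element p = Jsoup.parse(html).selectFirst("p");
        return p == null ? "" : p.text();
    }

    // Text owned by the first <p> itself, without the text of child elements.
    static String own(String html) {
        Element p = Jsoup.parse(html).selectFirst("p");
        return p == null ? "" : p.ownText();
    }

    public static void main(String[] args) {
        String html = "<p>Hello  <b>there</b> now! </p>";
        System.out.println(normalized(html)); // Hello there now!
        System.out.println(own(html));
    }
}
```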
Modify text
The overloaded text method sets the text of the specified element.
import org.jsoup.Jsoup;

void main() {

    String htmlString = """
            <html><head><title>My title</title></head>
            <body>Body content</body></html>""";

    var doc = Jsoup.parse(htmlString);
    doc.body().text("Lorem ipsum dolor sit amet");

    System.out.println(doc);
}
In the example, we change the text inside the body tag.
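One detail worth knowing is that the text setter treats its argument as plain text, not as markup. The sketch below passes a string that looks like HTML and checks that no child element is created; the class name SetText is invented for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only for this sketch.
class SetText {

    // Replaces the body content with the given text and returns the body.
    static Element bodyWithText(String text) {
        Document doc = Jsoup.parse("<html><body>Body content</body></html>");
        return doc.body().text(text);
    }

    public static void main(String[] args) {
        Element body = bodyWithText("<b>bold</b>");
        System.out.println(body.children().size()); // 0, no <b> element was created
        System.out.println(body.text());            // <b>bold</b>
        System.out.println(body.html());            // the angle brackets are escaped here
    }
}
```

To insert real markup, the append, prepend, or html methods are the appropriate tools.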
Modify document
There are multiple methods for modifying the HTML document. For instance, the append method appends a tag and the prepend method prepends a tag to an element.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.Optional;

void main() {

    String htmlString = """
            <html><head><title>My title</title></head>
            <body></body></html>""";

    var doc = Jsoup.parse(htmlString);

    Optional<Element> bodyEl = Optional.ofNullable(doc.select("body").first());
    bodyEl.ifPresent(e -> {
        e.append("<p>hello there!</p>");
        e.prepend("<h1>Heading</h1>");
    });

    System.out.println(doc);
}
In the example, we add h1 and p tags to the document.
<html>
 <head>
  <title>My title</title>
 </head>
 <body>
  <h1>Heading</h1>
  <p>hello there!</p>
 </body>
</html>
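Besides appending markup strings, elements can be built step by step. The following sketch uses appendElement together with the attr and text setters to add a link; the class BuildLink and the sample values are assumptions made for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, used only for this sketch.
class BuildLink {

    // Creates an <a> element as the last child of body, then sets its
    // href attribute and its text.
    static Document withLink(String href, String label) {
        Document doc = Jsoup.parse("<html><body></body></html>");
        doc.body().appendElement("a").attr("href", href).text(label);
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(withLink("https://jsoup.org", "jsoup").body().html());
    }
}
```

Building elements this way avoids assembling HTML strings by hand, because attribute values and text are escaped by the library.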
Parse links
The next example parses links from an HTML page.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

void main() throws IOException {

    String url = "https://jsoup.org";

    Document document = Jsoup.connect(url).get();
    Elements links = document.select("a[href]");

    for (Element link : links) {

        System.out.println("link : " + link.attr("href"));
        System.out.println("text : " + link.text());
    }
}
In the example, we connect to a web page and parse all its link elements.
Elements links = document.select("a[href]");
To get a list of links, we use the document's select method.
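The href values are printed exactly as they appear in the page, so they may be relative. When absolute URLs are needed, the absUrl method resolves an attribute against the document's base URI. The sketch below parses an in-memory string with an explicit base URI; the class AbsoluteLinks, the sample HTML, and the example.com address are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only for this sketch.
class AbsoluteLinks {

    // Parses the HTML with the given base URI and returns every link
    // target as an absolute URL.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href='/download'>Download</a> <a href='https://jsoup.org/news'>News</a>";
        absoluteLinks(html, "https://example.com/").forEach(System.out::println);
    }
}
```

For documents fetched with connect, the base URI is set from the request URL, so absUrl can be used on them directly.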
Sanitize HTML data
JSoup provides methods for sanitizing HTML data.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

void main() {

    String htmlString = """
            <html><head><title>My title</title></head>
            <body><center>Body content</center></body></html>
            """;

    boolean valid = Jsoup.isValid(htmlString, Safelist.basic());

    if (valid) {

        System.out.println("The document is valid");
    } else {

        System.out.println("The document is not valid.");
        System.out.println("Cleaned document");

        Document dirtyDoc = Jsoup.parse(htmlString);
        Document cleanDoc = new Cleaner(Safelist.basic()).clean(dirtyDoc);

        System.out.println(cleanDoc.html());
    }
}
In the example, we sanitize and clean HTML data.
String htmlString = """
        <html><head><title>My title</title></head>
        <body><center>Body content</center></body></html>
        """;
The HTML string contains the center element, which is deprecated and is not part of the basic safelist.
boolean valid = Jsoup.isValid(htmlString, Safelist.basic());
The isValid method tests whether the HTML contains only the tags and attributes allowed by the given safelist. A safelist is a list of HTML elements and attributes that can pass through the cleaner. Safelist.basic defines a set of basic clean HTML tags.
Document dirtyDoc = Jsoup.parse(htmlString);
Document cleanDoc = new Cleaner(Safelist.basic()).clean(dirtyDoc);
With the help of the Cleaner, we clean the dirty HTML document.
The document is not valid.
Cleaned document
<html>
 <head></head>
 <body>
  Body content
 </body>
</html>
We can see that the center element was removed.
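For user-submitted fragments, the static Jsoup.clean method is a shorter route to the same result, since it takes a string and returns the cleaned body HTML as a string. The sketch below is a minimal illustration; the class CleanFragment and the sample input are invented here.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical helper class, used only for this sketch.
class CleanFragment {

    // Removes everything the basic safelist does not allow, such as
    // script elements and event handler attributes.
    static String clean(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        String untrusted = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"
                + "<script>alert(1)</script>";
        System.out.println(clean(untrusted));
    }
}
```

The link text and the href attribute are kept, while the onclick attribute and the script element are dropped.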
This tutorial was dedicated to the JSoup HTML parser.