Java Regular Expressions tutorial

Java Regular Expressions tutorial shows how to parse text in Java using regular expressions.

Regular expressions

Regular expressions are used for text searching and more advanced text manipulation. Regular expressions are built into tools such as grep and sed, text editors such as vi and emacs, and programming languages such as Perl, Java, and C#.

Java has a built-in API for working with regular expressions; it is located in the java.util.regex package.

A regular expression defines a search pattern for strings. Pattern is a compiled representation of a regular expression. Matcher is an engine that interprets the pattern and performs match operations against an input string. Matcher has methods such as find(), matches(), end() to perform matching operations. When there is an exception parsing a regular expression, Java throws a PatternSyntaxException.
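To illustrate the last point, here is a minimal sketch of catching a PatternSyntaxException. The class name RegexSyntaxCheck and the helper isValid() are made up for this illustration; Pattern.compile(), getDescription(), and getIndex() are standard API.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexSyntaxCheck {

    // Hypothetical helper: returns true if the regex compiles
    static boolean isValid(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() explains the error, getIndex() gives its position
            System.out.printf("Invalid regex '%s': %s (index %d)%n",
                    regex, e.getDescription(), e.getIndex());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(".even"));
        System.out.println(isValid("(book")); // the group is never closed
    }
}
```

PatternSyntaxException is an unchecked exception, so catching it is optional; it is mostly useful when the pattern comes from user input.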

Regex examples

The following table shows a couple of regular expression strings.

Regex Meaning
. Matches any single character.
? Matches the preceding element once or not at all.
+ Matches the preceding element once or more times.
* Matches the preceding element zero or more times.
^ Matches the starting position within the string.
$ Matches the ending position within the string.
| Alternation operator.
[abc] Matches a, b, or c.
[a-c] Range; matches a, b, or c.
[^abc] Negation; matches any character except a, b, or c.
\s Matches a whitespace character.
\w Matches a word character; equivalent to [a-zA-Z_0-9]
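A few of the table entries can be tried out with the static Pattern.matches() method, which compiles a regex and tests it against the whole input in one call. The class RegexTableDemo, the helper fullMatch(), and the sample strings are only illustrative.

```java
import java.util.regex.Pattern;

class RegexTableDemo {

    // Hypothetical helper: true only if the whole input matches the regex
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("colou?r", "color"));        // ? makes the 'u' optional
        System.out.println(fullMatch("a+", ""));                  // + needs at least one 'a'
        System.out.println(fullMatch("a*", ""));                  // * also accepts zero 'a's
        System.out.println(fullMatch("[^abc]", "d"));             // negated character class
        System.out.println(fullMatch("\\w+\\s\\w+", "old book")); // word, whitespace, word
    }
}
```

Note that a backslash in a regex must be doubled inside a Java string literal, so \s is written as "\\s".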

Java simple regular expression

In the first example, we match each word in a list against a regular expression.

JavaRegexEx.java
package com.zetcode;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaRegexEx {

    public static void main(String[] args) {

        List<String> words = Arrays.asList("Seven", "even",
                "Maven", "Amen", "eleven");

        Pattern p = Pattern.compile(".even");
        
        for (String word: words) {
            
            Matcher m = p.matcher(word);  
            
            if (m.matches()) {
                System.out.printf("%s matches%n", word);
            } else {
                System.out.printf("%s does not match%n", word);
            }
        }
    }
}

In the example, we have five words in a list. We check which words match the .even regular expression.

Pattern p = Pattern.compile(".even");

We compile the pattern. The dot (.) metacharacter stands for any single character in the text.

for (String word: words) {
    
    Matcher m = p.matcher(word);  
    
    if (m.matches()) {
        System.out.printf("%s matches%n", word);
    } else {
        System.out.printf("%s does not match%n", word);
    }
}

We go through the list of words. A matcher is created with the matcher() method. The matches() method returns true if the word matches the regular expression.

Seven matches
even does not match
Maven does not match
Amen does not match
eleven does not match

This is the output.
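Only "Seven" matches because matches() requires the whole input to match the pattern. The find() method, in contrast, looks for the pattern anywhere in the input. The following sketch contrasts the two; the class MatchesVsFind and its helper methods are hypothetical names.

```java
import java.util.regex.Pattern;

class MatchesVsFind {

    static final Pattern P = Pattern.compile(".even");

    // true only if the entire word matches the pattern
    static boolean wholeMatch(String word) {
        return P.matcher(word).matches();
    }

    // true if the pattern occurs anywhere inside the word
    static boolean partialMatch(String word) {
        return P.matcher(word).find();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Seven", "even", "eleven"}) {
            System.out.printf("%s: matches()=%b, find()=%b%n",
                    word, wholeMatch(word), partialMatch(word));
        }
    }
}
```

With find(), "eleven" is accepted as well, since it contains the substring "leven".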

Java Regex anchors

Anchors match positions of characters inside a given text. In the next example, we check whether a string is located at the beginning of a sentence.

JavaRegexAnchor.java
package com.zetcode;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaRegexAnchor {

    public static void main(String[] args) {

        List<String> sentences = Arrays.asList("I am looking for Jane.",
                "Jane was walking along the river.",
                "Kate and Jane are close friends.");

        Pattern p = Pattern.compile("^Jane");

        for (String word : sentences) {

            Matcher m = p.matcher(word);

            if (m.find()) {
                System.out.printf("%s matches%n", word);
            } else {
                System.out.printf("%s does not match%n", word);
            }
        }
    }
}

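We have three sentences. The search pattern is ^Jane. The pattern checks if the "Jane" string is located at the beginning of the text. We use find() rather than matches(), because matches() would require the whole sentence to match. The Jane$ pattern would look for "Jane" at the end of the text.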
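The end-of-text anchor can be sketched in the same way. Because the sample sentences end with a full stop, this illustration uses the pattern Jane\.?$ (Jane, an optional full stop, then the end of the input); the class EndAnchorDemo and the helper endsWithJane() are assumptions made for the example.

```java
import java.util.regex.Pattern;

class EndAnchorDemo {

    // "Jane", an optional full stop, then the end of the input
    static final Pattern ENDS_WITH_JANE = Pattern.compile("Jane\\.?$");

    // Hypothetical helper: true if the sentence ends with "Jane" or "Jane."
    static boolean endsWithJane(String sentence) {
        return ENDS_WITH_JANE.matcher(sentence).find();
    }

    public static void main(String[] args) {
        String[] sentences = {
            "I am looking for Jane.",
            "Jane was walking along the river.",
            "Kate and Jane are close friends."
        };

        for (String sentence : sentences) {
            System.out.printf("%s -> %b%n", sentence, endsWithJane(sentence));
        }
    }
}
```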

Java Regex alternations

The alternation operator | lets us create a regular expression with several choices.

JavaRegexAlternation.java
package com.zetcode;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaRegexAlternation {

    public static void main(String[] args) {

        List<String> users = Arrays.asList("Jane", "Thomas", "Robert",
                "Lucy", "Beky", "John", "Peter", "Andy");

        Pattern p = Pattern.compile("Jane|Beky|Robert");

        for (String user : users) {

            Matcher m = p.matcher(user);

            if (m.matches()) {
                System.out.printf("%s matches%n", user);
            } else {
                System.out.printf("%s does not match%n", user);
            }
        }
    }
}

We have eight names in the list.

Pattern p = Pattern.compile("Jane|Beky|Robert");

This regular expression looks for "Jane", "Beky", or "Robert" strings.
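The same alternation can also be used with find() to pick the names out of running text. This is a small sketch; the class AlternationFind, the helper findNames(), and the sample sentence are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationFind {

    static final Pattern NAMES = Pattern.compile("Jane|Beky|Robert");

    // Hypothetical helper: collects every occurrence of one of the names
    static List<String> findNames(String text) {
        Matcher m = NAMES.matcher(text);
        List<String> found = new ArrayList<>();

        while (m.find()) {
            found.add(m.group()); // group() returns the text of the current match
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findNames("Jane and Robert met Lucy and Beky."));
    }
}
```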

Java Regex capturing groups

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing characters inside a set of round brackets. For instance, (book) is a single group containing the 'b', 'o', 'o', 'k' characters.

Capturing groups allow us to find out which parts of the string matched the regular expression. The matcher's group() method returns the input subsequence captured by the given group during the previous match operation.

JavaRegexGroups.java
package com.zetcode;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaRegexGroups {

    public static void main(String[] args) {

        String content = "<p>The <code>Pattern</code> is a compiled "
                + "representation of a regular expression.</p>";

        Pattern p = Pattern.compile("(</?[a-z]*>)");

        Matcher matcher = p.matcher(content);

        while (matcher.find()) {

            System.out.println(matcher.group(1));
        }
    }
}

This example prints all HTML tags from the supplied string by capturing a group of characters.

<p>
<code>
</code>
</p>

This is the output.
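In the example above the group spans the whole pattern, so group(1) equals the entire match. A group becomes more useful when it covers only a part of the pattern. The following sketch, with the hypothetical class TagNames and helper tagNames(), captures just the tag name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagNames {

    // The round brackets capture only the letters between < (or </) and >
    static final Pattern TAG = Pattern.compile("</?([a-z]+)>");

    // Hypothetical helper: returns the names of all tags found in the text
    static List<String> tagNames(String html) {
        Matcher m = TAG.matcher(html);
        List<String> names = new ArrayList<>();

        while (m.find()) {
            // group(0) is the whole tag, group(1) is only the captured name
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String content = "<p>The <code>Pattern</code> is a compiled "
                + "representation of a regular expression.</p>";

        System.out.println(tagNames(content));
    }
}
```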

Java Regex replacing strings

It is possible to replace strings with the replaceAll() and replaceFirst() methods of the Matcher class. The methods return modified strings; the original string is left unchanged.

JavaRegexReplacingStrings.java
package com.zetcode;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class JavaRegexReplacingStrings {

    public static void main(String[] args) throws MalformedURLException, IOException {

        URL url = new URL("http://www.something.com");

        try (InputStreamReader isr = new InputStreamReader(url.openStream(),
                StandardCharsets.UTF_8);
                BufferedReader br = new BufferedReader(isr)) {
            
            String content = br.lines().collect(
                Collectors.joining(System.lineSeparator()));

            Pattern p = Pattern.compile("<[^>]*>");

            Matcher matcher = p.matcher(content);
            String stripped = matcher.replaceAll("");
            
            System.out.println(stripped);
        }
    }
}

The example reads HTML data of a web page and strips its HTML tags using a regular expression.

Pattern p = Pattern.compile("<[^>]*>");

This pattern defines a regular expression that matches HTML tags.

String stripped = matcher.replaceAll("");

We remove all the tags with the replaceAll() method.
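The difference between replaceAll() and replaceFirst() can be tried without a network connection. The sketch below works on an in-memory string; the class StripTags, its helper methods, and the sample markup are assumptions made for the illustration.

```java
import java.util.regex.Pattern;

class StripTags {

    static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes every tag from the text
    static String stripAll(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    // Removes only the first tag from the text
    static String stripFirst(String html) {
        return TAG.matcher(html).replaceFirst("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";

        System.out.println(stripAll(html));
        System.out.println(stripFirst(html));
    }
}
```

A simple pattern like this is good enough for a demonstration, but it is not a substitute for a real HTML parser.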

Java Regex splitting text

Text can be split with Pattern's split() method.

data.csv
22, 1, 3, 4, 5, 17, 18
2, 13, 4, 1, 8, 4
3, 21, 4, 5, 1, 48, 9, 42

We read from the data.csv file.

JavaRegexSplitText.java
package com.zetcode;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

public class JavaRegexSplitText {
    
    static int sum = 0;
        
    public static void main(String[] args) throws IOException {
        
        Path myPath = Paths.get("src/main/resources/data.csv");
        
        List<String> lines = Files.readAllLines(myPath);
       
        String regex = ",";
        
        Pattern p = Pattern.compile(regex);
        
        lines.forEach((line) -> {
            
            String[] parts = p.split(line);
            
            for (String part : parts) {
                
                String val = part.trim();
                
                sum += Integer.valueOf(val);
            }
            
        });
        
        System.out.printf("Sum of values: %d", sum);
    }
}

The example reads values from a CSV file and computes their sum. It uses a regular expression to split the data.

List<String> lines = Files.readAllLines(myPath);

In one shot, we read all data into the list of strings with Files.readAllLines().

String regex = ",";

The regular expression is a comma character.

lines.forEach((line) -> {
    
    String[] parts = p.split(line);
    
    for (String part : parts) {
        
        String val = part.trim();
        
        sum += Integer.valueOf(val);
    }
    
});

We go through the lines and split them into arrays of strings with split(). We cut off spaces with trim() and compute the sum value.
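As a variation, the whitespace around the commas can be made part of the delimiter, so that the parts do not need to be trimmed one by one. This is only a sketch: the class SplitSum and the helper sumLine() are hypothetical, and the input is assumed to contain nothing but integers separated by commas.

```java
import java.util.regex.Pattern;

class SplitSum {

    // A comma surrounded by optional whitespace
    static final Pattern SEPARATOR = Pattern.compile("\\s*,\\s*");

    // Hypothetical helper: sums the integers on one CSV line
    static int sumLine(String line) {
        int sum = 0;

        // trim() removes whitespace at the start and end of the line
        for (String part : SEPARATOR.split(line.trim())) {
            sum += Integer.parseInt(part);
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] lines = {
            "22, 1, 3, 4, 5, 17, 18",
            "2, 13, 4, 1, 8, 4",
            "3, 21, 4, 5, 1, 48, 9, 42"
        };

        int total = 0;
        for (String line : lines) {
            total += sumLine(line);
        }

        System.out.printf("Sum of values: %d%n", total);
    }
}
```

Keeping the running total in a local variable also avoids the static field that the lambda version needs.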

Java case-insensitive regular expression

By setting the Pattern.CASE_INSENSITIVE flag, we can have case-insensitive matching.

JavaRegexCaseInsensitive.java
package com.zetcode;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaRegexCaseInsensitive {

    public static void main(String[] args) {

        List<String> users = Arrays.asList("dog", "Dog", "DOG", "Doggy");

        Pattern p = Pattern.compile("dog", Pattern.CASE_INSENSITIVE);

        users.forEach((user) -> {
            
            Matcher m = p.matcher(user);

            if (m.matches()) {
                System.out.printf("%s matches%n", user);
            } else {
                System.out.printf("%s does not match%n", user);
            }
        });
    }
}

The example performs case-insensitive matching of the regular expression.

Pattern p = Pattern.compile("dog", Pattern.CASE_INSENSITIVE);

Case-insensitive matching is set by setting Pattern.CASE_INSENSITIVE as the second parameter to Pattern.compile().
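The flag can also be embedded in the pattern itself with the (?i) flag expression, which is handy when the regex comes from a configuration file. The following sketch uses the hypothetical class InlineFlagDemo; it also shows that "Doggy" fails with matches() only because of the extra characters, not because of the letter case.

```java
import java.util.regex.Pattern;

class InlineFlagDemo {

    // (?i) switches on case-insensitive matching inside the pattern
    static final Pattern DOG = Pattern.compile("(?i)dog");

    // true if the whole input is "dog" in any letter case
    static boolean isDog(String input) {
        return DOG.matcher(input).matches();
    }

    // true if "dog" occurs anywhere in the input, in any letter case
    static boolean containsDog(String input) {
        return DOG.matcher(input).find();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"dog", "Dog", "DOG", "Doggy"}) {
            System.out.printf("%s: matches()=%b, find()=%b%n",
                    word, isDog(word), containsDog(word));
        }
    }
}
```

By default, case-insensitive matching covers only US-ASCII characters; for other alphabets the flag is combined with Pattern.UNICODE_CASE.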

Java Regex subpatterns

Subpatterns are patterns within patterns. Subpatterns are created with () characters.

JavaRegexSubpatterns.java
package com.zetcode;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaRegexSubpatterns {

    public static void main(String[] args) {
        
        List<String> words = Arrays.asList("book", "bookshelf", "bookworm",
                "bookcase", "bookish", "bookkeeper", "booklet", "bookmark");

        Pattern p = Pattern.compile("book(worm|mark|keeper)?");

        for (String word : words) {

            Matcher m = p.matcher(word);

            if (m.matches()) {
                System.out.printf("%s matches%n", word);
            } else {
                System.out.printf("%s does not match%n", word);
            }
        }        
    }
}

The example creates a subpattern.

Pattern p = Pattern.compile("book(worm|mark|keeper)?");

The regular expression uses a subpattern. It matches bookworm, bookmark, bookkeeper, and book words.
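Since the subpattern is written with round brackets, it is also a capturing group, and we can ask which alternative was used. In this sketch the class SubpatternGroup and the helper describe() are made-up names; the relevant detail is that group(1) returns null when the optional group did not take part in the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubpatternGroup {

    static final Pattern BOOK = Pattern.compile("book(worm|mark|keeper)?");

    // Hypothetical helper: reports which suffix, if any, was matched
    static String describe(String word) {
        Matcher m = BOOK.matcher(word);

        if (!m.matches()) {
            return "no match";
        }

        // group(1) is null when the optional subpattern matched nothing
        String suffix = m.group(1);
        return suffix == null ? "book" : "book + " + suffix;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"book", "bookworm", "bookshelf"}) {
            System.out.printf("%s: %s%n", word, describe(word));
        }
    }
}
```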

Java Regex email example

In the following example, we create a regex pattern for checking email addresses.

JavaRegexEmail.java
package com.zetcode;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaRegexEmail {

    public static void main(String[] args) {
        
        List<String> emails = Arrays.asList("luke@gmail.com", 
                "andy@yahoocom", "34234sdfa#2345", "f344@gmail.com");

        String regex = "^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\\.[a-zA-Z.]{2,18}$";
        
        Pattern p = Pattern.compile(regex);

        for (String email : emails) {

            Matcher m = p.matcher(email);

            if (m.matches()) {
                System.out.printf("%s matches%n", email);
            } else {
                System.out.printf("%s does not match%n", email);
            }
        }
    }
}

This example provides only one possible solution.

String regex = "^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\\.[a-zA-Z.]{2,18}$";

The first ^ and the last $ characters provide an exact pattern match. No characters before and after the pattern are allowed. The email is divided into five parts. The first part is the local part. This is usually the name of a company, an individual, or a nickname. The [a-zA-Z0-9._-]+ character set lists all characters that we can use in the local part. They can be used one or more times.

The second part consists of the literal @ character. The third part is the domain part. It is usually the domain name of the email provider, like yahoo or gmail. The [a-zA-Z0-9-]+ is a character set providing all characters that can be used in the domain name. The + quantifier requires one or more of these characters.

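The fourth part is the dot character. It is preceded by the escape character (\). This is because the dot character is a metacharacter and has a special meaning. By escaping it, we get a literal dot. Inside a Java string literal the backslash itself must be escaped, which is why the source code contains \\. rather than \.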

The final part is the top level domain: [a-zA-Z.]{2,18}. Top level domains can have from 2 to 18 characters, such as sk, net, info, travel, cleaning, travelinsurance. The maximum length is 63 characters, but most top level domains are shorter than 18 characters today. The character set also contains a dot character. This is because some addresses end in a two-part suffix; for example co.uk.

In this tutorial, we have worked with regular expressions in Java. You might also be interested in the related tutorials: Java tutorial or Reading text files in Java.
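To see how the last part behaves, we can try the same regular expression on a few more inputs. The class EmailCheck, the helper isEmail(), and the sample addresses are invented for this sketch. Because the dot is allowed anywhere in the final part, the pattern accepts a two-part suffix such as co.uk, but it also lets through an input that ends in several dots.

```java
import java.util.regex.Pattern;

class EmailCheck {

    // The same regular expression as in the example above
    static final Pattern EMAIL = Pattern.compile(
            "^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\\.[a-zA-Z.]{2,18}$");

    // Hypothetical helper: true if the whole input matches the pattern
    static boolean isEmail(String input) {
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "luke@gmail.com",     // plain address
            "kate@example.co.uk", // two-part suffix
            "andy@yahoocom",      // the dot is missing
            "andy@yahoo..."       // only dots after the domain name
        };

        for (String input : inputs) {
            System.out.printf("%s -> %b%n", input, isEmail(input));
        }
    }
}
```

The last input is accepted, which shows that the pattern is a rough filter rather than a full validation of email addresses.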