Kotlin regular expressions

Kotlin regular expressions tutorial shows how to use regular expressions in Kotlin.

Regular expressions are used for text searching and more advanced text manipulation. They are built into tools such as grep and sed, text editors such as vi and emacs, and programming languages including Kotlin, JavaScript, Perl, and Python.

Kotlin regular expression

In Kotlin, we build regular expressions with the Regex class.

Regex("book")
"book".toRegex()
Regex.fromLiteral("book")

A pattern is a regular expression that defines the text we are searching for or manipulating. It consists of text literals and metacharacters. Metacharacters are special characters that control how the regular expression is going to be evaluated. For instance, with \s we search for white spaces.

Special characters must be double escaped in regular string literals (for instance, "\\s"), or we can use Kotlin raw strings, which need only a single backslash.
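To illustrate the two ways of writing a backslash, here is a minimal sketch. The function names escapedPattern and rawPattern are hypothetical helpers introduced only for this illustration; both are meant to build the same pattern, a white space character followed by digits.

```kotlin
// Hypothetical helpers: the same pattern written in two ways.
// Regular string literal: every backslash must be doubled.
fun escapedPattern(): Regex = "\\s\\d+".toRegex()

// Raw string literal: backslashes are written once.
fun rawPattern(): Regex = """\s\d+""".toRegex()

fun main() {
    // Both regexes hold the same pattern text, \s\d+
    println(escapedPattern().pattern == rawPattern().pattern) // prints true

    // Brackets make the leading white space of the match visible.
    println("[${rawPattern().find("room 101")?.value}]")
}
```

Raw strings tend to be easier to read once a pattern contains several backslashes.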

After we have created a pattern, we can use one of the functions to apply the pattern on a text string. The functions include matches(), containsMatchIn(), find(), findAll(), replace(), and split().
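Of the functions listed above, replace() is not demonstrated elsewhere in this tutorial, so here is a short sketch. The helper names replaceFox and collapseSpaces are assumptions made for this example; they are not part of the Kotlin library.

```kotlin
// Hypothetical helper: replaces every match of 'fox' with 'wolf'.
fun replaceFox(text: String): String = "fox".toRegex().replace(text, "wolf")

// Hypothetical helper: collapses each run of white space into one space.
fun collapseSpaces(text: String): String = "\\s+".toRegex().replace(text, " ")

fun main() {
    println(replaceFox("I saw a fox in the wood. The fox had red fur."))
    println(collapseSpaces("too    many   spaces"))
}
```

The replace() method replaces all matches, not only the first one.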

The following table shows some commonly used regular expressions:

Regex Meaning
. Matches any single character.
? Matches the preceding element once or not at all.
+ Matches the preceding element once or more times.
* Matches the preceding element zero or more times.
^ Matches the starting position within the string.
$ Matches the ending position within the string.
| Alternation operator.
[abc] Matches a, b, or c.
[a-c] Range; matches a, b, or c.
[^abc] Negation; matches any character except a, b, or c.
\s Matches a white space character.
\w Matches a word character; equivalent to [a-zA-Z_0-9]

Kotlin matches and containsMatchIn methods

The matches() method returns true if the regular expression matches the entire input string. The containsMatchIn() method indicates whether the regular expression can find at least one match in the specified input.

KotlinRegexSimple.kt
package com.zetcode

fun main(args : Array<String>) {

    val words = listOf("book", "bookworm", "Bible",
            "bookish","cookbook", "bookstore", "pocketbook")

    val pattern = "book".toRegex()

    println("*********************")
    println("containsMatchIn function")

    words.forEach { word ->
        if (pattern.containsMatchIn(word)) {
            println("$word matches")
        }
    }

    println("*********************")
    println("matches function")

    words.forEach { word ->
        if (pattern.matches(word)) {
            println("$word matches")
        }
    }
}

In the example, we use the matches() and containsMatchIn() methods. We have a list of words. The pattern will look for a 'book' string in each of the words using both methods.

val pattern = "book".toRegex()

A regular expression pattern is created with the toRegex() method. The regular expression consists of four ordinary characters.

words.forEach { word ->
    if (pattern.containsMatchIn(word)) {
        println("$word matches")
    }
}

We iterate over the list and apply containsMatchIn() on each of the words.

words.forEach { word ->
    if (pattern.matches(word)) {
        println("$word matches")
    }
}

We iterate over the list again and apply matches() on each of the words.

*********************
containsMatchIn function
book matches
bookworm matches
bookish matches
cookbook matches
bookstore matches
pocketbook matches
*********************
matches function
book matches

For the containsMatchIn() method, the pattern matches if the 'book' word is somewhere in the word; for the matches(), the input string must entirely match the pattern.

Kotlin find method

The find() method returns the first match of a regular expression in the input, beginning at the specified start index. The start index is 0 by default.

KotlinRegexFind.kt
package com.zetcode

fun main(args : Array<String>) {

    val text = "I saw a fox in the wood. The fox had red fur."

    val pattern = "fox".toRegex()

    val found = pattern.find(text)

    val m = found?.value
    val idx = found?.range

    println("$m found at indexes: $idx")

    val found2 = pattern.find(text, 11)

    val m2 = found2?.value
    val idx2 = found2?.range

    println("$m2 found at indexes: $idx2")
}

In the example, we find out the indexes of the match of the 'fox' term.

val found = pattern.find(text)

val m = found?.value
val idx = found?.range

We find the first match of the 'fox' term. We get its value and indexes.

val found2 = pattern.find(text, 11)

val m2 = found2?.value
val idx2 = found2?.range

In the second case, we start the search from index 11 and thus find the next occurrence.

fox found at indexes: 8..10
fox found at indexes: 29..31

This is the output.
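The find() method returns null when there is no match, which is why the example uses the ?. operator. The following sketch handles the missing match explicitly; describeMatch is a hypothetical helper written for this illustration.

```kotlin
// Hypothetical helper: describes the first match at or after startIndex,
// or reports that nothing was found.
fun describeMatch(pattern: Regex, text: String, startIndex: Int = 0): String {
    // find() returns null when there is no match from startIndex onwards.
    val found = pattern.find(text, startIndex) ?: return "no match"
    return "${found.value} found at indexes: ${found.range}"
}

fun main() {
    val text = "I saw a fox in the wood. The fox had red fur."
    val pattern = "fox".toRegex()

    println(describeMatch(pattern, text))
    println(describeMatch(pattern, text, 11))
    println(describeMatch(pattern, text, 32)) // past the last 'fox'
}
```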

Kotlin findAll method

The findAll() method returns a sequence of all occurrences of a regular expression within the input string.

KotlinFindAll.kt
package com.zetcode

fun main(args : Array<String>) {

    val text = "I saw a fox in the wood. The fox had red fur."

    val pattern = "fox".toRegex()

    val found = pattern.findAll(text)

    found.forEach { f ->
        val m = f.value
        val idx = f.range
        println("$m found at indexes: $idx")
    }
}

In the example, we find all occurrences of the 'fox' term with findAll().
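Because findAll() returns a lazy sequence, we may want to collect the results into a list. The sketch below is one possible way to do that; matchPositions is a hypothetical helper, not a library function.

```kotlin
// Hypothetical helper: collects each matched value with its index range.
fun matchPositions(pattern: Regex, text: String): List<Pair<String, IntRange>> =
    pattern.findAll(text)
        .map { it.value to it.range } // still lazy at this point
        .toList()                     // runs the search and stores the results

fun main() {
    val text = "I saw a fox in the wood. The fox had red fur."
    val positions = matchPositions("fox".toRegex(), text)

    println("Number of matches: ${positions.size}")
    positions.forEach { (m, idx) -> println("$m found at indexes: $idx") }
}
```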

Kotlin split function

The split() method splits the input string around matches of the regular expression.

KotlinRegexSplitting.kt
package com.zetcode

fun main(args: Array<String>) {

    val text = "I saw a fox in the wood. The fox had red fur."

    val pattern = "\\W+".toRegex()

    val words = pattern.split(text).filter { it.isNotBlank() }

    println(words)
}

In the example, we split the text into a list of words.

val pattern = "\\W+".toRegex()

The pattern contains the \W named character class, which stands for non-word character. In conjunction with the + quantifier, the pattern looks for non-word character(s) such as space, comma, or dot, which are often used to separate words in text. Note that the character class is double escaped.

val words = pattern.split(text).filter { it.isNotBlank() }

With the split() method, we split the input string into a list of words. In addition, we remove the blank trailing word, which was created because our text ended in a non-word character.

[I, saw, a, fox, in, the, wood, The, fox, had, red, fur]

This is the output.
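To see the blank trailing element that the filter removes, we can compare the result with and without filtering. This is a small sketch; splitRaw and splitWords are hypothetical helper names.

```kotlin
// Hypothetical helper: splits on runs of non-word characters, keeping every piece.
fun splitRaw(text: String): List<String> = "\\W+".toRegex().split(text)

// Hypothetical helper: the same split with blank pieces removed.
fun splitWords(text: String): List<String> = splitRaw(text).filter { it.isNotBlank() }

fun main() {
    val text = "The fox had red fur."

    // The final dot is a separator, so an empty string follows it.
    println(splitRaw(text).size)   // prints 6
    println(splitWords(text).size) // prints 5
}
```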

Case insensitive match

To enable case insensitive search, we pass the RegexOption.IGNORE_CASE to the toRegex() method.

KotlinRegexCaseInsensitive.kt
package com.zetcode

fun main(args: Array<String>) {

    val words = listOf("dog", "Dog", "DOG", "Doggy")

    val pattern = "dog".toRegex(RegexOption.IGNORE_CASE)

    words.forEach { word ->
        if (pattern.matches(word)) {
            println("$word matches")

        }
    }
}

In the example, we apply the pattern on words regardless of the case.

val pattern = "dog".toRegex(RegexOption.IGNORE_CASE)

We use the RegexOption.IGNORE_CASE to ignore the case of the input string.

dog matches
Dog matches
DOG matches

This is the output.

The dot metacharacter

The dot (.) metacharacter stands for any single character in the text.

KotlinRegexDotMeta.kt
package com.zetcode

fun main(args : Array<String>) {

    val words = listOf("seven", "even", "prevent", "revenge", "maven",
            "eleven", "amen", "event")

    val pattern = "..even".toRegex()

    words.forEach { word ->
        if (pattern.containsMatchIn(word)) {
            println("$word matches")

        }
    }
}

In the example, we have eight words in a list. We apply a pattern containing two dot metacharacters on each of the words.

prevent matches
eleven matches

There are two words that match the pattern.

Question mark meta character

The question mark (?) meta character is a quantifier that matches the previous element zero or one time.

KotlinRegexQMarkMeta.kt
package com.zetcode

fun main(args : Array<String>) {

    val words = listOf("seven", "even", "prevent", "revenge", "maven",
            "eleven", "amen", "event")

    val pattern = ".?even".toRegex()

    words.forEach { word ->
        if (pattern.matches(word)) {
            println("$word matches")

        }
    }
}

In the example, we add a question mark after the dot character. This means that in the pattern we can have one arbitrary character or we can have no character there.

seven matches
even matches

This is the output.

The {n,m} quantifier

The {n,m} quantifier matches at least n and at most m occurrences of the preceding expression.

KotlinRegexMnQuantifier.kt
package com.zetcode

fun main(args: Array<String>) {

    val words = listOf("pen", "book", "cool", "pencil", "forest", "car",
            "list", "rest", "ask", "point", "eyes")

    val pattern = "\\w{3,4}".toRegex()

    words.forEach { word ->
        if (pattern.matches(word)) {

            println("$word matches")
        } else {
            println("$word does not match")
        }
    }
}

In the example, we search for words that have either three or four characters.

val pattern = "\\w{3,4}".toRegex()

In the pattern, we have a word character repeated three or four times. Note that there must not be a space between the numbers.

pen matches
book matches
cool matches
pencil does not match
forest does not match
car matches
list matches
rest matches
ask matches
point does not match
eyes matches

This is the output.

Anchors

Anchors match positions of characters inside a given text. When using the ^ anchor the match must occur at the beginning of the string and when using the $ anchor the match must occur at the end of the string.

KotlinRegexAnchors.kt
package com.zetcode

fun main(args : Array<String>) {

    val sentences = listOf("I am looking for Jane.",
        "Jane was walking along the river.",
        "Kate and Jane are close friends.")

    val pattern = "^Jane".toRegex()

    sentences.forEach { sentence ->
        if (pattern.containsMatchIn(sentence)) {
            println("$sentence")

        }
    }
}

In the example, we have three sentences. The search pattern is ^Jane. The pattern checks if the "Jane" string is located at the beginning of the text. The Jane\.$ pattern would look for "Jane." at the end of the sentence.
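The two anchors can be compared side by side in a short sketch. The helpers startsWithJane and endsWithJane are hypothetical names used only here; the second one assumes the sentence ends with a dot.

```kotlin
// Hypothetical helper: true when the sentence begins with "Jane".
fun startsWithJane(sentence: String): Boolean =
    "^Jane".toRegex().containsMatchIn(sentence)

// Hypothetical helper: true when the sentence ends with "Jane.".
// In a regular string, \\. is a literal dot and \$ is the $ anchor.
fun endsWithJane(sentence: String): Boolean =
    "Jane\\.\$".toRegex().containsMatchIn(sentence)

fun main() {
    val sentences = listOf("I am looking for Jane.",
        "Jane was walking along the river.",
        "Kate and Jane are close friends.")

    sentences.forEach { sentence ->
        println("$sentence start: ${startsWithJane(sentence)}, end: ${endsWithJane(sentence)}")
    }
}
```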

Alternations

The alternation operator | creates a regular expression with several choices.

KotlinRegexAlternations.kt
package com.zetcode

fun main(args: Array<String>) {

    val words = listOf("Jane", "Thomas", "Robert",
            "Lucy", "Beky", "John", "Peter", "Andy")

    val pattern = "Jane|Beky|Robert".toRegex()

    words.forEach { word ->

        if (pattern.matches(word)) {

            println("$word")
        }
    }
}

We have eight names in the list.

val pattern = "Jane|Beky|Robert".toRegex()

This regular expression looks for "Jane", "Beky", or "Robert" strings.

Subpatterns

Subpatterns are patterns within patterns. Subpatterns are created with () characters.

KotlinRegexSubpatterns.kt
package com.zetcode

fun main(args: Array<String>) {

    val words = listOf("book", "bookshelf", "bookworm",
            "bookcase", "bookish", "bookkeeper", "booklet", "bookmark")

    val pattern = "book(worm|mark|keeper)?".toRegex()

    words.forEach { word ->

        if (pattern.matches(word)) {

            println("$word matches")
        } else {

            println("$word does not match")
        }
    }
}

The example creates a subpattern.

val pattern = "book(worm|mark|keeper)?".toRegex()

The regular expression uses a subpattern. It matches bookworm, bookmark, bookkeeper, and book words.

book matches
bookshelf does not match
bookworm matches
bookcase does not match
bookish does not match
bookkeeper matches
booklet does not match
bookmark matches

This is the output.

Character classes

A character class defines a set of characters, any one of which can occur in an input string for a match to succeed.

KotlinRegexChClass.kt
package com.zetcode

fun main(args: Array<String>) {

    val words = listOf("a gray bird", "grey hair", "great look")

    val pattern = "gr[ea]y".toRegex()

    words.forEach { word ->

        if (pattern.containsMatchIn(word)) {

            println("$word")
        }
    }
}

In the example, we use a character class to include both gray and grey words.

val pattern = "gr[ea]y".toRegex()

The [ea] class allows either the 'e' or the 'a' character at that position in the pattern.

Named character classes

There are some predefined character classes. The \s matches a whitespace character [ \t\n\x0B\f\r], the \d a digit [0-9], and the \w a word character [a-zA-Z0-9_].

KotlinRegexNamedClass.kt
package com.zetcode

fun main(args: Array<String>) {
    
    val text = "We met in 2013. She must be now about 27 years old."

    val pattern = "\\d+".toRegex()
    val found = pattern.findAll(text)

    found.forEach { f ->
        val m = f.value
        println("$m")
    }
}

In the example, we search for numbers in the text.

val pattern = "\\d+".toRegex()

The \d+ pattern matches one or more consecutive digits in the text.

val found = pattern.findAll(text)

We find all the matches with findAll().

2013
27

This is the output.

Capturing groups

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing characters inside a set of round brackets. For instance, (book) is a single group containing the 'b', 'o', 'o', 'k' characters.

The capturing groups technique allows us to find out those parts of a string that match the regular expression pattern.

KotlinRegexCapturingGroups.kt
package com.zetcode

fun main(args: Array<String>) {

    val content = """<p>The <code>Pattern</code> is a compiled
representation of a regular expression.</p>"""

    val pattern = "(<\\/?[a-z]*>)".toRegex()

    val found = pattern.findAll(content)

    found.forEach { f ->
        val m = f.value
        println("$m")
    }
}

The code example prints all HTML tags from the supplied string by capturing a group of characters.

val found = pattern.findAll(content)

In order to find all tags, we use the findAll() method.

<p>
<code>
</code>
</p>

We have found four HTML tags.
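The example above prints the whole match. To read only the captured part, we can use the groupValues list of a match result, where index 0 is the whole match and index 1 is the first group. The following is a sketch with a slightly different pattern; tagNames is a hypothetical helper, and the pattern assumes simple lowercase tags without attributes.

```kotlin
// Hypothetical helper: returns the names of simple HTML tags found in the text.
fun tagNames(html: String): List<String> {
    // The group ([a-z]+) captures the tag name without <, >, or /.
    val pattern = """</?([a-z]+)>""".toRegex()

    // groupValues[0] is the whole match; groupValues[1] is the first group.
    return pattern.findAll(html).map { it.groupValues[1] }.toList()
}

fun main() {
    val content = """<p>The <code>Pattern</code> is a compiled
representation of a regular expression.</p>"""

    println(tagNames(content)) // prints [p, code, code, p]
}
```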

Kotlin regex email example

In the following example, we create a regex pattern for checking email addresses.

KotlinRegexEmails.kt
package com.zetcode

fun main(args: Array<String>) {

    val emails = listOf("luke@gmail.com", "andy@yahoocom",
            "34234sdfa#2345", "f344@gmail.com", "dandy!@yahoo.com")

    val pattern = "[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\\.[a-zA-Z.]{2,18}".toRegex()

    emails.forEach { email ->

        if (pattern.matches(email)) {

            println("$email matches")
        } else {

            println("$email does not match")
        }
    }
}

This example provides one possible solution.

val pattern = "[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\\.[a-zA-Z.]{2,18}".toRegex()

The email is divided into five parts. The first part is the local part. Usually it is the name of a company, an individual, or a nickname. The [a-zA-Z0-9._-]+ lists all possible characters that we can use in the local part. They can be used one or more times.

The second part consists of the literal @ character. The third part is the domain part. It is usually the domain name of the email provider, such as yahoo or gmail. The [a-zA-Z0-9-]+ is a character class providing all characters that can be used in the domain name. The + quantifier allows one or more of these characters.

The fourth part is the dot character; it is preceded by a double backslash (\\) to get a literal dot.

The final part is the top level domain name: [a-zA-Z.]{2,18}. Top level domains can have from 2 to 18 characters, such as sk, net, info, travel, cleaning, travelinsurance. The maximum length can be 63 characters, but most domains are shorter than 18 characters today. The class also contains a dot character. This is because some top level domains have two parts; for instance co.uk.

luke@gmail.com matches
andy@yahoocom does not match
34234sdfa#2345 does not match
f344@gmail.com matches
dandy!@yahoo.com does not match

This is the output.

In this chapter, we have covered regular expressions in Kotlin.