ZetCode

Java grapheme

last modified May 24, 2025

In this article we will work with graphemes in Java. A grapheme is the smallest unit of a writing system of any given language.

Grapheme

A grapheme is the smallest fundamental unit in a writing system of any given language. It may consist of a single character or multiple combined elements, such as accents, diacritics, or ligatures, forming a visually distinct symbol. While a grapheme may not always convey meaning on its own, it plays a crucial role in representing language structure.

Historically, the term character was used to refer to individual symbols in the original ASCII table, which provided support for only a limited set of characters. As digital text representation evolved, the Unicode standard was introduced to accommodate the immense diversity of written languages, enabling representation for complex scripts, emojis, and multi-character graphemes.

Unicode

Unicode is a globally recognized standard that ensures the consistent encoding, representation, and handling of text across different writing systems. It was developed to unify character representation, allowing seamless text processing in various languages, scripts, and specialized symbols such as mathematical notation and emojis.

In Java, a string is a sequence of Unicode characters, functioning as a structured data type that stores text as a series of values, typically represented in bytes. Java internally uses UTF-16 encoding, which supports direct representation for most characters while employing surrogate pairs to extend support for Unicode characters beyond the Basic Multilingual Plane (BMP).

Beyond the ASCII character set, it is more appropriate to use the term grapheme rather than "character." Java provides the BreakIterator class, which allows iteration over text elements (graphemes) within a string. A text element may consist of a base character, a surrogate pair, or a combining character sequence, together forming a single displayed symbol.

A code point represents the numerical identifier assigned to a Unicode character. For example, the Latin letter Q corresponds to the U+0051 code point, while the Cyrillic small letter ж is represented by U+0436. Unicode assigns a unique code point to each character, ensuring universal compatibility in text processing across various systems.

A grapheme cluster consists of one or more code points that visually combine to form a single graphical unit. For example, the Hindi syllable ते is composed of two code points: U+0924 for and U+0947 for . Proper handling of grapheme clusters is essential for accurately rendering complex scripts.

The actual stored representation of a string consists of bytes, which encode each code point. Depending on the Unicode encoding scheme—such as UTF-8, UTF-16, or UTF-32—each code point may require a different number of bytes for full representation in memory or storage.

Unicode table

Below is a small portion of the Unicode table, showing code points, characters, and their names:

Code pointCharacterName
U+0041ALATIN CAPITAL LETTER A
U+0061aLATIN SMALL LETTER A
U+03B1αGREEK SMALL LETTER ALPHA
U+0416ЖCYRILLIC CAPITAL LETTER ZHE
U+05D0אHEBREW LETTER ALEF
U+0924DEVANAGARI LETTER TA
U+1F600😀GRINNING FACE

Surrogate Pairs

Java employs the UTF-16 encoding scheme for Unicode text storage, using 16-bit (two-byte) code units to represent each character. This scheme supports all characters in the Basic Multilingual Plane (BMP), ranging from U+0000 to U+FFFF, which includes widely used scripts like Latin, Greek, Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, as well as symbols and punctuation.

For Unicode characters beyond the BMP, spanning U+10000 to U+10FFFF, Java uses surrogate pairs to extend UTF-16 encoding capabilities, enabling representation of these additional characters.

A surrogate pair is a pair of 16-bit code units that together represent a single Unicode character outside the BMP. The first unit, called the high surrogate, is followed by the low surrogate. This mechanism enables Java to handle extended characters, such as emojis, historical scripts, and specialized symbols used across various domains.

Unicode also supports combining character sequences, where a base character is modified by one or more combining characters, such as diacritics. A surrogate pair can represent a standalone character or be part of a larger grapheme cluster when combined with other code points, forming a single visual unit.

Code pointCharacterHigh surrogateLow surrogateName
U+1F600😀D83DDE00GRINNING FACE
U+1F680🚀D83DDE80ROCKET
U+1F4A9💩D83DDCA9PILE OF POO
U+1F60D😍D83DDE0DSMILING FACE WITH HEART-EYES
U+1F47B👻D83DDC7BGHOST

In the table above, the code points U+1F600, U+1F680, U+1F4A9, U+1F60D, and U+1F47B are represented as surrogate pairs. The high surrogate is the first code unit (D83D), and the low surrogate is the second code unit (DE00, DE80, DCA9, DE0D, DC7B). These pairs allow the representation of characters outside the Basic Multilingual Plane (BMP), which includes many emojis and other special characters.

Java grapheme example

In the following example, we work with graphemes.

Main.java
void main() {

    System.out.println("The Hindi word Namaste");

    String word = "नमस्ते";
    System.out.println(word);
    System.out.println();

    // code points
    System.out.println("Code points:");
    for (int i = 0; i < word.length(); i += Character.charCount(word.codePointAt(i))) {
        int cp = word.codePointAt(i);
        System.out.printf("U+%04X %s%n", cp, new String(Character.toChars(cp)));
    }
    System.out.println();

    // bytes
    System.out.println("Bytes:");
    byte[] bytes = word.getBytes(StandardCharsets.UTF_8);

    for (byte b : bytes) {
        System.out.printf("%d ", b & 0xFF); // Convert signed byte to unsigned
    }

    System.out.println("\n");

    // graphemes
    System.out.println("Graphemes:");
    BreakIterator graphemeIter = BreakIterator.getCharacterInstance();
    graphemeIter.setText(word);

    int count = 0;
    int start = graphemeIter.first();

    while (start != BreakIterator.DONE) {

        int end = graphemeIter.next();
        if (end == BreakIterator.DONE) {
            break;
        }

        String grapheme = word.substring(start, end);
        System.out.println(grapheme);
        count++;
        start = end;
    }

    System.out.printf("the word has %d graphemes%n", count);
}

The example defines a variable which contains a Hindi word namaste. We print the word, print its code points, bytes, and print and count the number of graphemes.

String word = "नमस्ते";
System.out.println(word);

We have the Hindi namaste word; we print it to the console.

System.out.println("Code points:");

for (int i = 0; i < word.length(); i += Character.charCount(word.codePointAt(i))) {
    int cp = word.codePointAt(i);
    System.out.printf("U+%04X %s%n", cp, new String(Character.toChars(cp)));
}

We print the code points of the word. The length method determines the number of UTF-16 chars in the string. The Character.charCount method is used to determine the number of char values needed to represent the code point. If it is a surrogate pair, we need more bytes to represent a grapheme.

With the word.codePointAt method, we get the Unicode code point at the specified index. The Character.toChars method converts the code point into a char array, which we convert to a string to print the grapheme.

System.out.println("Bytes:");
byte[] bytes = word.getBytes(StandardCharsets.UTF_8);

for (byte b : bytes) {
    System.out.printf("%d ", b & 0xFF); // Convert signed byte to unsigned
}

We print the actual bytes of the word that are stored on a disk. We use the getBytes method with StandardCharsets.UTF_8 to get the array of underlying bytes.

System.out.println("Graphemes:");
BreakIterator graphemeIter = BreakIterator.getCharacterInstance();
graphemeIter.setText(word);

int count = 0;
int start = graphemeIter.first();

while (start != BreakIterator.DONE) {

    int end = graphemeIter.next();

    if (end == BreakIterator.DONE)  {
        break;
    }

    String grapheme = word.substring(start, end);
    System.out.println(grapheme);
    count++;
    start = end;
}

We print the graphemes of the word and count them. The BreakIterator.getCharacterInstance is used to enumerate graphemes of the word. The substring method extracts the grapheme between the start and end indices.

$ java Main.java
The Hindi word Namaste
नमस्ते

Code points:
U+0928 न
U+092E म
U+0938 स
U+094D ्
U+0924 त
U+0947 े

Bytes:
224 164 168 224 164 174 224 164 184 224 165 141 224 164 164 224 165 135 

Graphemes:
न
म
स्ते
the word has 3 graphemes

The output shows the word, its code points, bytes, and graphemes. The graphemes are printed as separate lines, and the total number of graphemes is displayed at the end. In this case, the word नमस्ते consists of four graphemes: न, म, स, and ते. The grapheme ते is a combination of the base character त and the combining character े, which together form a single grapheme.

However, Java incorrectly counts the graphemes in this case.

Source

Java BreakIterator

Helpful sites for working with Unicode: www.fileformat.info and www.utf8-chartable.de.

In this article we have worked with graphemes, code points, bytes, and surrogate pairs of a Unicode string.

Author

My name is Jan Bodnar, and I am a passionate programmer with extensive programming experience. I have been writing programming articles since 2007. To date, I have authored over 1,400 articles and 8 e-books. I possess more than ten years of experience in teaching programming.

List all Java tutorials.