Java grapheme
last modified May 24, 2025
In this article we will work with graphemes in Java. A grapheme is the smallest unit of a writing system of any given language.
Grapheme
A grapheme is the smallest fundamental unit in a writing system of any given language. It may consist of a single character or multiple combined elements, such as accents, diacritics, or ligatures, forming a visually distinct symbol. While a grapheme may not always convey meaning on its own, it plays a crucial role in representing language structure.
Historically, the term character was used to refer to individual symbols in the original ASCII table, which provided support for only a limited set of characters. As digital text representation evolved, the Unicode standard was introduced to accommodate the immense diversity of written languages, enabling representation for complex scripts, emojis, and multi-character graphemes.
Unicode
Unicode is a globally recognized standard that ensures the consistent encoding, representation, and handling of text across different writing systems. It was developed to unify character representation, allowing seamless text processing in various languages, scripts, and specialized symbols such as mathematical notation and emojis.
In Java, a string is a sequence of Unicode characters, functioning as a structured data type that stores text as a series of values, typically represented in bytes. Java internally uses UTF-16 encoding, which supports direct representation for most characters while employing surrogate pairs to extend support for Unicode characters beyond the Basic Multilingual Plane (BMP).
Beyond the ASCII character set, it is more appropriate to use the term
grapheme rather than "character." Java provides the
BreakIterator
class, which allows iteration over text elements
(graphemes) within a string. A text element may consist of a base
character, a surrogate pair, or a combining character sequence,
together forming a single displayed symbol.
A code point represents the numerical identifier assigned to a
Unicode character. For example, the Latin letter Q
corresponds to
the U+0051 code point, while the Cyrillic small letter ж
is
represented by U+0436. Unicode assigns a unique code point to each
character, ensuring universal compatibility in text processing across various
systems.
A grapheme cluster consists of one or more code points that
visually combine to form a single graphical unit. For example, the Hindi
syllable ते
is composed of two code points: U+0924
for
त
and U+0947
for े
. Proper handling of
grapheme clusters is essential for accurately rendering complex scripts.
The actual stored representation of a string consists of bytes, which encode each code point. Depending on the Unicode encoding scheme—such as UTF-8, UTF-16, or UTF-32—each code point may require a different number of bytes for full representation in memory or storage.
Unicode table
Below is a small portion of the Unicode table, showing code points, characters, and their names:
Code point | Character | Name |
---|---|---|
U+0041 | A | LATIN CAPITAL LETTER A |
U+0061 | a | LATIN SMALL LETTER A |
U+03B1 | α | GREEK SMALL LETTER ALPHA |
U+0416 | Ж | CYRILLIC CAPITAL LETTER ZHE |
U+05D0 | א | HEBREW LETTER ALEF |
U+0924 | त | DEVANAGARI LETTER TA |
U+1F600 | 😀 | GRINNING FACE |
Surrogate Pairs
Java employs the UTF-16 encoding scheme for Unicode text
storage, using 16-bit (two-byte) code units to represent each character.
This scheme supports all characters in the Basic Multilingual Plane
(BMP), ranging from U+0000
to U+FFFF
,
which includes widely used scripts like Latin, Greek, Cyrillic, Arabic,
Hebrew, Chinese, Japanese, Korean, as well as symbols and punctuation.
For Unicode characters beyond the BMP, spanning U+10000
to
U+10FFFF
, Java uses surrogate pairs to extend
UTF-16 encoding capabilities, enabling representation of these additional
characters.
A surrogate pair is a pair of 16-bit code units that together represent a single Unicode character outside the BMP. The first unit, called the high surrogate, is followed by the low surrogate. This mechanism enables Java to handle extended characters, such as emojis, historical scripts, and specialized symbols used across various domains.
Unicode also supports combining character sequences, where a base character is modified by one or more combining characters, such as diacritics. A surrogate pair can represent a standalone character or be part of a larger grapheme cluster when combined with other code points, forming a single visual unit.
Code point | Character | High surrogate | Low surrogate | Name |
---|---|---|---|---|
U+1F600 | 😀 | D83D | DE00 | GRINNING FACE |
U+1F680 | 🚀 | D83D | DE80 | ROCKET |
U+1F4A9 | 💩 | D83D | DCA9 | PILE OF POO |
U+1F60D | 😍 | D83D | DE0D | SMILING FACE WITH HEART-EYES |
U+1F47B | 👻 | D83D | DC7B | GHOST |
In the table above, the code points U+1F600, U+1F680, U+1F4A9, U+1F60D, and U+1F47B are represented as surrogate pairs. The high surrogate is the first code unit (D83D), and the low surrogate is the second code unit (DE00, DE80, DCA9, DE0D, DC7B). These pairs allow the representation of characters outside the Basic Multilingual Plane (BMP), which includes many emojis and other special characters.
Java grapheme example
In the following example, we work with graphemes.
void main() { System.out.println("The Hindi word Namaste"); String word = "नमस्ते"; System.out.println(word); System.out.println(); // code points System.out.println("Code points:"); for (int i = 0; i < word.length(); i += Character.charCount(word.codePointAt(i))) { int cp = word.codePointAt(i); System.out.printf("U+%04X %s%n", cp, new String(Character.toChars(cp))); } System.out.println(); // bytes System.out.println("Bytes:"); byte[] bytes = word.getBytes(StandardCharsets.UTF_8); for (byte b : bytes) { System.out.printf("%d ", b & 0xFF); // Convert signed byte to unsigned } System.out.println("\n"); // graphemes System.out.println("Graphemes:"); BreakIterator graphemeIter = BreakIterator.getCharacterInstance(); graphemeIter.setText(word); int count = 0; int start = graphemeIter.first(); while (start != BreakIterator.DONE) { int end = graphemeIter.next(); if (end == BreakIterator.DONE) { break; } String grapheme = word.substring(start, end); System.out.println(grapheme); count++; start = end; } System.out.printf("the word has %d graphemes%n", count); }
The example defines a variable which contains a Hindi word namaste. We print the word, print its code points, bytes, and print and count the number of graphemes.
String word = "नमस्ते"; System.out.println(word);
We have the Hindi namaste word; we print it to the console.
System.out.println("Code points:"); for (int i = 0; i < word.length(); i += Character.charCount(word.codePointAt(i))) { int cp = word.codePointAt(i); System.out.printf("U+%04X %s%n", cp, new String(Character.toChars(cp))); }
We print the code points of the word. The length
method determines
the number of UTF-16 chars in the string. The Character.charCount
method is used to determine the number of char values needed to represent the
code point. If it is a surrogate pair, we need more bytes to represent a grapheme.
With the word.codePointAt
method, we get the Unicode code point at
the specified index. The Character.toChars
method converts the code
point into a char array, which we convert to a string to print the grapheme.
System.out.println("Bytes:"); byte[] bytes = word.getBytes(StandardCharsets.UTF_8); for (byte b : bytes) { System.out.printf("%d ", b & 0xFF); // Convert signed byte to unsigned }
We print the actual bytes of the word that are stored on a disk. We use the
getBytes
method with StandardCharsets.UTF_8
to get the
array of underlying bytes.
System.out.println("Graphemes:"); BreakIterator graphemeIter = BreakIterator.getCharacterInstance(); graphemeIter.setText(word); int count = 0; int start = graphemeIter.first(); while (start != BreakIterator.DONE) { int end = graphemeIter.next(); if (end == BreakIterator.DONE) { break; } String grapheme = word.substring(start, end); System.out.println(grapheme); count++; start = end; }
We print the graphemes of the word and count them. The
BreakIterator.getCharacterInstance
is used to enumerate graphemes
of the word. The substring
method extracts the grapheme between the
start and end indices.
$ java Main.java The Hindi word Namaste नमस्ते Code points: U+0928 न U+092E म U+0938 स U+094D ् U+0924 त U+0947 े Bytes: 224 164 168 224 164 174 224 164 184 224 165 141 224 164 164 224 165 135 Graphemes: न म स्ते the word has 3 graphemes
The output shows the word, its code points, bytes, and graphemes. The graphemes are printed as separate lines, and the total number of graphemes is displayed at the end. In this case, the word नमस्ते consists of four graphemes: न, म, स, and ते. The grapheme ते is a combination of the base character त and the combining character े, which together form a single grapheme.
However, Java incorrectly counts the graphemes in this case.
Source
Helpful sites for working with Unicode: www.fileformat.info and www.utf8-chartable.de.
In this article we have worked with graphemes, code points, bytes, and surrogate pairs of a Unicode string.
Author
List all Java tutorials.