C# grapheme
last modified July 5, 2023
C# grapheme tutorial shows how to work with graphemes in C#.
Grapheme
A grapheme is the smallest unit of a writing system of any given language. An individual grapheme may or may not carry meaning by itself, and may or may not correspond to a single phoneme of the spoken language.
The term character was originally used for a single entry in the ASCII table. This table, however, can represent only a limited set of characters. The Unicode standard was created to represent text beyond that set.
Unicode
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.
In C#, a string is a sequence of char values, each of which is a 16-bit UTF-16 code unit. More generally, a string is a data type which stores a sequence of data values, usually bytes, in which elements stand for characters according to a character encoding. C# internally uses the UTF-16 encoding.
Outside of the ASCII table it is better to use the term grapheme instead of the term character. The .NET platform defines a text element as a unit of text that is displayed as a grapheme. The TextElementEnumerator enumerates the text elements of a string. A text element can be a base character, a surrogate pair, or a combining character sequence.
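The three kinds of text element can be compared in a short sketch. The sample strings and variable names below are illustrative choices, not part of the original example; the sketch assumes a recent .NET SDK with top-level statements.

```csharp
using System;
using System.Globalization;

// Three strings, each displayed as a single grapheme but built differently.
string baseChar = "Q";                // a single base character
string surrogatePair = "\U0001F600";  // an emoji stored as two UTF-16 code units
string combining = "e\u0301";         // e followed by COMBINING ACUTE ACCENT

foreach (string s in new[] { baseChar, surrogatePair, combining })
{
    // Length counts UTF-16 code units; LengthInTextElements counts text elements.
    int elements = new StringInfo(s).LengthInTextElements;
    Console.WriteLine($"chars: {s.Length}, text elements: {elements}");
}
```

Each string contains one text element, although two of them occupy two char values.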
A code point is a numerical offset in a character set. Each code point is a number whose meaning is given by the Unicode standard. For example, the Latin letter Q is the U+0051 code point and the Cyrillic small letter zhe ж is the U+0436 code point.
A grapheme cluster is a sequence of one or more code points which are displayed as a single graphical unit. For instance, the Hindi syllable ते consists of two code points: U+0924 for the letter त and U+0947 for the vowel sign े.
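The difference between code points and grapheme clusters can be checked with a minimal sketch. It assumes .NET Core 3.0 or later, where string.EnumerateRunes is available; the variable names are illustrative.

```csharp
using System;
using System.Globalization;
using System.Text;

// U+0924 (letter ta) followed by U+0947 (vowel sign e).
string syllable = "\u0924\u0947";

int codePoints = 0;
foreach (Rune rune in syllable.EnumerateRunes())
{
    // Each Rune is one Unicode scalar value (code point).
    Console.WriteLine($"U+{rune.Value:X4}");
    codePoints++;
}

// The two code points are rendered as one graphical unit.
int clusters = new StringInfo(syllable).LengthInTextElements;
Console.WriteLine($"{codePoints} code points, {clusters} grapheme cluster(s)");
```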
The bytes are the actual information stored for the string contents. Each code point can require one or more bytes of storage depending on the encoding being used (UTF-8, UTF-16, etc.).
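As a rough illustration of how storage size depends on the encoding, the following sketch compares the UTF-8 and UTF-16 byte counts of a few single code points. The sample characters are arbitrary choices for the demonstration.

```csharp
using System;
using System.Text;

// Q, Cyrillic zhe, Devanagari ta, and an emoji outside the 16-bit range.
string[] samples = { "Q", "\u0436", "\u0924", "\U0001F600" };

foreach (string s in samples)
{
    int cp = char.ConvertToUtf32(s, 0);
    int utf8 = Encoding.UTF8.GetByteCount(s);
    int utf16 = Encoding.Unicode.GetByteCount(s); // Encoding.Unicode is UTF-16 LE
    Console.WriteLine($"U+{cp:X4}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes");
}
```

The same code point takes between one and four bytes in UTF-8, and either two or four bytes in UTF-16.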
Surrogate pairs
C# uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only hold values from 0x0 to 0xFFFF, a second code unit is needed to store code points above this range (0x10000 to 0x10FFFF). This is done using surrogate pairs.
A surrogate pair is a coded character representation for a single abstract character that consists of a sequence of two code units. The first unit of the pair is a high surrogate and the second is a low surrogate. A combining character sequence is a combination of a base character and one or more combining characters. A surrogate pair can represent a base character or a combining character.
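The way a code point above 0xFFFF is split into a high and a low surrogate can be sketched as follows. The manual arithmetic follows the UTF-16 definition and is shown only for illustration; in practice the Char.ConvertFromUtf32 and Char.ConvertToUtf32 methods do this work. The code point U+1F600 is an arbitrary sample.

```csharp
using System;

int codePoint = 0x1F600; // an emoji outside the 16-bit range

// Let the framework encode the code point as UTF-16.
string s = char.ConvertFromUtf32(codePoint);
Console.WriteLine($"{(int)s[0]:X4} {(int)s[1]:X4}");

// The same split done by hand: subtract 0x10000, then distribute
// the upper 10 bits and the lower 10 bits over the two surrogates.
int offset = codePoint - 0x10000;
char high = (char)(0xD800 + (offset >> 10));
char low = (char)(0xDC00 + (offset & 0x3FF));

Console.WriteLine(char.IsSurrogatePair(high, low));
Console.WriteLine(char.ConvertToUtf32(high, low) == codePoint);
```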
C# grapheme example
In the following example, we work with graphemes.
using System.Text;
using System.Globalization;

Console.OutputEncoding = System.Text.Encoding.UTF8;

Console.WriteLine("The Hindi word Namaste");

string word = "नमस्ते";
Console.WriteLine(word);
Console.WriteLine();

// code points
Console.WriteLine("Code points:");

for (int i = 0; i < word.Length; i += Char.IsSurrogatePair(word, i) ? 2 : 1)
{
    int x = Char.ConvertToUtf32(word, i);
    Console.WriteLine("U+{0:X4} {1}", x, Char.ConvertFromUtf32(x));
}

Console.WriteLine();

// bytes
Console.WriteLine("Bytes: ");
byte[] bytes = Encoding.UTF8.GetBytes(word);

foreach (byte c in bytes)
{
    Console.Write($"{c} ");
}

Console.WriteLine("\n");

// graphemes
Console.WriteLine("Graphemes: ");
int count = 0;

TextElementEnumerator graphemeEnum = StringInfo.GetTextElementEnumerator(word);

while (graphemeEnum.MoveNext())
{
    string grapheme = graphemeEnum.GetTextElement();
    Console.WriteLine(grapheme);
    count++;
}

Console.WriteLine($"the word has {count} graphemes");
The example defines a variable which contains the Hindi word namaste. We print the word, its code points, and its UTF-8 bytes, and then print and count its graphemes.
Console.OutputEncoding = System.Text.Encoding.UTF8;
To output Unicode characters to the terminal, we set the console output encoding to UTF-8.
string word = "नमस्ते";
Console.WriteLine(word);
We have the Hindi namaste word; we print it to the console.
// code points
Console.WriteLine("Code points:");

for (int i = 0; i < word.Length; i += Char.IsSurrogatePair(word, i) ? 2 : 1)
{
    int x = Char.ConvertToUtf32(word, i);
    Console.WriteLine("U+{0:X4} {1}", x, Char.ConvertFromUtf32(x));
}
We print the code points of the word. The Length property determines the number of UTF-16 chars in the string. The Char.IsSurrogatePair method is used to determine whether two adjacent Char objects at a specified position in a string form a surrogate pair. If so, the code point occupies two code units, and we advance the index by two instead of one.
With the Char.ConvertToUtf32 method we get the Unicode code point; the method converts the value of a UTF-16 encoded character or surrogate pair at a specified position in a string into a Unicode code point. Finally, the Char.ConvertFromUtf32 method converts the given Unicode code point into a UTF-16 encoded string; we get the character for that code point.
Console.WriteLine("Bytes: ");
byte[] bytes = Encoding.UTF8.GetBytes(word);

foreach (byte c in bytes)
{
    Console.Write($"{c} ");
}

Console.WriteLine("\n");
We print the bytes of the word in the UTF-8 encoding, as they would be stored in a UTF-8 encoded file. We use the Encoding.UTF8.GetBytes method to get the array of encoded bytes.
Console.WriteLine("Graphemes: ");
int count = 0;

TextElementEnumerator graphemeEnum = StringInfo.GetTextElementEnumerator(word);

while (graphemeEnum.MoveNext())
{
    string grapheme = graphemeEnum.GetTextElement();
    Console.WriteLine(grapheme);
    count++;
}

Console.WriteLine($"the word has {count} graphemes");
We print the graphemes of the word and count them. The TextElementEnumerator is used to enumerate the graphemes of the word. The GetTextElement method is used to get the current text element (grapheme).
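When only the number of graphemes is needed, the StringInfo.LengthInTextElements property may be a shorter alternative to the enumerator loop. The following is a minimal sketch; the CountGraphemes helper and the sample string are hypothetical names introduced here for illustration.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: counts text elements without enumerating them by hand.
static int CountGraphemes(string text) => new StringInfo(text).LengthInTextElements;

// "cafe" followed by COMBINING ACUTE ACCENT: five chars, four graphemes.
string sample = "cafe\u0301";

Console.WriteLine($"chars: {sample.Length}");
Console.WriteLine($"graphemes: {CountGraphemes(sample)}");
```

The text element boundaries are determined by the runtime, so counts for complex scripts may differ between .NET versions.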
$ dotnet run
The Hindi word Namaste
नमस्ते

Code points:
U+0928 न
U+092E म
U+0938 स
U+094D ्
U+0924 त
U+0947 े

Bytes:
224 164 168 224 164 174 224 164 184 224 165 141 224 164 164 224 165 135

Graphemes:
न
म
स्
ते
the word has 4 graphemes
Source
Helpful sites for working with Unicode: www.fileformat.info and www.utf8-chartable.de.
In this article we have worked with graphemes, code points, bytes, and surrogate pairs of a Unicode string.