C# grapheme tutorial

This C# grapheme tutorial shows how to work with graphemes in C#. C# tutorial is a comprehensive tutorial on the C# language.

Grapheme

A grapheme is the smallest unit of a writing system of any given language. An individual grapheme may or may not carry meaning by itself, and may or may not correspond to a single phoneme of the spoken language.

The term character originally referred to a single entry in the ASCII table. That table, however, can represent only a limited set of characters. The Unicode standard was created to cover text in other writing systems.

Unicode

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

In C#, a string is a sequence of Char values, each of which is one UTF-16 code unit; together they represent Unicode text. More generally, a string is a data type that stores a sequence of data values, usually bytes, whose elements stand for characters according to a character encoding. C# internally uses the UTF-16 encoding.

Outside of the ASCII table it is better to use the term grapheme instead of the term character. The .NET platform defines a text element as a unit of text that is displayed as a grapheme. The TextElementEnumerator enumerates the text elements of a string. A text element can be a base character, a surrogate pair, or a combining character sequence.

A code point is an atomic unit of information. Each code point is a number whose meaning is given by the Unicode standard. For example, the Latin capital letter Q is the U+0051 code point and the Cyrillic small letter zhe ж is the U+0436 code point.

A grapheme cluster is a sequence of one or more code points which are displayed as a single graphical unit. For instance, the Hindi syllable ते consists of two code points: U+0924 for त and U+0947 for the vowel sign े.
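The split of a grapheme cluster into code points can be checked with a few lines of code. The following is a minimal sketch, assuming .NET Core 3.0 or later (for the EnumerateRunes() method) and C# 9 top-level statements; the CodePoints helper is a hypothetical name introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of .NET): lists the code points
// of a string in the U+XXXX notation.
static List<string> CodePoints(string text)
{
    var result = new List<string>();

    // EnumerateRunes walks the string one Unicode scalar value at a time,
    // so a surrogate pair is reported as a single value.
    foreach (Rune rune in text.EnumerateRunes())
    {
        result.Add($"U+{rune.Value:X4}");
    }

    return result;
}

// The syllable from the text: the letter ta followed by the vowel sign e.
string syllable = "\u0924\u0947";

// Prints: U+0924 U+0947
Console.WriteLine(string.Join(" ", CodePoints(syllable)));
```

The syllable is written with escape sequences so that the two code points remain visible in the source.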

The bytes are the actual information stored for the string contents. Each code point can require one or more bytes of storage, depending on the Unicode encoding being used (UTF-8, UTF-16, etc.).
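To make the dependence on the encoding concrete, the sketch below counts the bytes that three single code points need in UTF-8, UTF-16, and UTF-32. It assumes C# 9 top-level statements; ByteCount is a hypothetical helper name, while Encoding.GetByteCount() is the standard .NET method.

```csharp
using System;
using System.Text;

// Hypothetical helper: number of bytes a string needs in the given encoding.
static int ByteCount(string text, Encoding encoding) => encoding.GetByteCount(text);

// Latin capital letter Q, Cyrillic small letter zhe, Devanagari letter ta.
string[] samples = { "Q", "\u0436", "\u0924" };

foreach (string sample in samples)
{
    int codePoint = char.ConvertToUtf32(sample, 0);
    int utf8 = ByteCount(sample, Encoding.UTF8);
    int utf16 = ByteCount(sample, Encoding.Unicode);   // Encoding.Unicode is UTF-16 little endian
    int utf32 = ByteCount(sample, Encoding.UTF32);

    Console.WriteLine($"U+{codePoint:X4}: UTF-8 {utf8}, UTF-16 {utf16}, UTF-32 {utf32}");
}
```

For these three code points, UTF-8 needs one, two, and three bytes respectively, while UTF-16 needs two bytes and UTF-32 needs four bytes for each of them.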

Surrogate pairs

C# uses the UTF-16 encoding scheme to store strings of Unicode text. UTF-16 uses 16-bit (two-byte) code units. Since 16 bits can only cover the range from 0x0 to 0xFFFF, code points above this range (0x10000 to 0x10FFFF) are stored in two code units. Such a pair of code units is called a surrogate pair.

A surrogate pair is a coded character representation for a single abstract character that consists of a sequence of two code units. The first unit of the pair is a high surrogate and the second is a low surrogate. A combining character sequence is a combination of a base character and one or more combining characters. A surrogate pair can represent a base character or a combining character.
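A surrogate pair can be inspected directly. The sketch below uses U+1F600 as a sample code point above 0xFFFF; both the sample and the ToSurrogates helper are assumptions made for this illustration, and the code is written with C# 9 top-level statements.

```csharp
using System;

// Hypothetical helper: returns the two UTF-16 code units that encode
// a code point from the range 0x10000 to 0x10FFFF.
static (char High, char Low) ToSurrogates(int codePoint)
{
    if (codePoint < 0x10000)
    {
        throw new ArgumentOutOfRangeException(nameof(codePoint),
            "Code points up to 0xFFFF fit into a single code unit.");
    }

    // ConvertFromUtf32 produces a two-char string for such code points.
    string encoded = char.ConvertFromUtf32(codePoint);
    return (encoded[0], encoded[1]);
}

var (high, low) = ToSurrogates(0x1F600);
int highValue = high;
int lowValue = low;

// Prints: D83D DE00
Console.WriteLine($"{highValue:X4} {lowValue:X4}");

// Both checks print True.
Console.WriteLine(char.IsHighSurrogate(high));
Console.WriteLine(char.IsLowSurrogate(low));

// Combining the pair yields the original code point; prints: U+1F600
Console.WriteLine($"U+{char.ConvertToUtf32(high, low):X4}");
```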

C# grapheme example

In the following example, we work with graphemes.

Program.cs
using System;
using System.Text;
using System.Globalization;

namespace StringLength
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.OutputEncoding = System.Text.Encoding.UTF8;

            Console.WriteLine("The Hindi word Namaste");

            string word = "नमस्ते";
            Console.WriteLine(word);
            Console.WriteLine();

            // code points

            Console.WriteLine("Code points:");

            for (int i = 0; i < word.Length; i += Char.IsSurrogatePair(word, i) ? 2 : 1)
            {
                int x = Char.ConvertToUtf32(word, i);

                Console.WriteLine("U+{0:X4} {1}", x, Char.ConvertFromUtf32(x));
            }

            Console.WriteLine();

            // bytes

            Console.WriteLine("Bytes: ");
            byte[] bytes = Encoding.UTF8.GetBytes(word);

            foreach (byte c in bytes)
            {
                Console.Write($"{c} ");
            }

            Console.WriteLine("\n");

            // graphemes

            Console.WriteLine("Graphemes: ");

            int count = 0;

            TextElementEnumerator graphemeEnum = StringInfo.GetTextElementEnumerator(word);
            while (graphemeEnum.MoveNext())
            {
                string grapheme = graphemeEnum.GetTextElement();

                Console.WriteLine(grapheme);

                count++;
            }

            Console.WriteLine($"the word has {count} graphemes");
        }
    }
}

The example defines a variable which contains the Hindi word namaste. We print the word, its code points, and its bytes; then we print and count its graphemes.

Console.OutputEncoding = System.Text.Encoding.UTF8;

To output Unicode characters to the terminal, we set the console output encoding to UTF-8.

string word = "नमस्ते";
Console.WriteLine(word);

We have the Hindi word namaste; we print it to the console.

// code points

Console.WriteLine("Code points:");

for (int i = 0; i < word.Length; i += Char.IsSurrogatePair(word, i) ? 2 : 1)
{
    int x = Char.ConvertToUtf32(word, i);

    Console.WriteLine("U+{0:X4} {1}", x, Char.ConvertFromUtf32(x));
}

We print the code points of the word. The Length property returns the number of UTF-16 Char values in the string. The Char.IsSurrogatePair() method determines whether two adjacent Char objects at a specified position in a string form a surrogate pair. If so, the code point occupies two Char values and we advance the index by two. The Char.ConvertToUtf32() method converts the UTF-16 encoded character or surrogate pair at a specified position in a string into a Unicode code point, which we print. Finally, the Char.ConvertFromUtf32() method converts the given code point back into a UTF-16 encoded string, so that we can print the character next to its number.

Console.WriteLine("Bytes: ");
byte[] bytes = Encoding.UTF8.GetBytes(word);

foreach (byte c in bytes)
{
    Console.Write($"{c} ");
}

Console.WriteLine("\n");

We print the bytes of the word in the UTF-8 encoding, as they would be stored in a UTF-8 encoded file. We use the Encoding.UTF8.GetBytes() method to get the array of encoded bytes.

Console.WriteLine("Graphemes: ");

int count = 0;

TextElementEnumerator graphemeEnum = StringInfo.GetTextElementEnumerator(word);
while (graphemeEnum.MoveNext())
{
    string grapheme = graphemeEnum.GetTextElement();

    Console.WriteLine(grapheme);

    count++;
}

Console.WriteLine($"the word has {count} graphemes");

We print the graphemes of the word and count them. The TextElementEnumerator is used to enumerate the graphemes of the word. The GetTextElement() method returns the current text element (grapheme).

$ dotnet run
The Hindi word Namaste
नमस्ते

Code points:
U+0928 न
U+092E म
U+0938 स
U+094D ्
U+0924 त
U+0947 े

Bytes:
224 164 168 224 164 174 224 164 184 224 165 141 224 164 164 224 165 135

Graphemes:
न
म
स्
ते
the word has 4 graphemes

This is the output.
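When only the number of graphemes is needed, the enumerator loop can be replaced with the StringInfo.LengthInTextElements property. The following is a minimal sketch with C# 9 top-level statements; GraphemeCount is a hypothetical helper name, and the two sample strings are assumptions chosen because they contain a combining character sequence and a surrogate pair.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: counts the text elements of a string
// without enumerating them by hand.
static int GraphemeCount(string text) => new StringInfo(text).LengthInTextElements;

string accented = "e\u0301";                    // e followed by a combining acute accent
string emoji = char.ConvertFromUtf32(0x1F600);  // one code point stored as a surrogate pair

// Each string holds two Char values but forms a single text element.
Console.WriteLine($"{accented.Length} chars, {GraphemeCount(accented)} text element");
Console.WriteLine($"{emoji.Length} chars, {GraphemeCount(emoji)} text element");
```

Note that .NET 5 changed StringInfo and TextElementEnumerator to follow the Unicode extended grapheme cluster rules, so the counts for some scripts and emoji sequences may differ between runtime versions.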

In this tutorial, we have worked with graphemes, code points, bytes, and surrogate pairs of a Unicode string.

Helpful sites for working with Unicode: www.fileformat.info and www.utf8-chartable.de.
