ZetCode

C# Encoding

last modified July 5, 2023

In this article we show how to encode and decode data in C#.

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

In C#, a string is a sequence of UTF-16 code units, each stored in a 16-bit char value. Most characters occupy a single code unit; characters outside the Basic Multilingual Plane, such as emoji, occupy two code units (a surrogate pair).
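To make the difference between code units and characters concrete, the following minimal sketch compares the two counts for a sample string. It assumes .NET Core 3.0 or later, where EnumerateRunes is available; the variable names are chosen for illustration only.

```csharp
using System;
using System.Linq;
using System.Text;

string text = "one 🐘 and three 🐋";

// Length counts UTF-16 code units (char values), not Unicode characters.
int codeUnits = text.Length;

// EnumerateRunes walks the string one Unicode scalar value at a time.
int codePoints = text.EnumerateRunes().Count();

Console.WriteLine($"UTF-16 code units: {codeUnits}");    // 19
Console.WriteLine($"Unicode code points: {codePoints}"); // 17
```

Each emoji contributes two code units but only one code point, which explains the difference of two.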

Encoding is the process of transforming a sequence of Unicode characters into a sequence of bytes. Decoding is the opposite process: transforming a sequence of encoded bytes back into Unicode characters.

The following standard character encodings are available in .NET: ASCII, UTF-7 (deprecated), UTF-8, UTF-16, and UTF-32.

The System.Text.Encoding class handles encoding and decoding in .NET. Because strings are stored as UTF-16 internally, the little-endian UTF-16 encoding is exposed through the Encoding.Unicode property; the big-endian variant is Encoding.BigEndianUnicode.
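As a quick orientation, the sketch below lists the static Encoding properties used in this article together with their registered names and code pages. It assumes .NET 5 or later, because Encoding.Latin1 was added in that release; the array name is ours.

```csharp
using System;
using System.Text;

// The built-in encodings exposed as static properties of the Encoding class.
Encoding[] encodings =
{
    Encoding.ASCII,
    Encoding.UTF8,
    Encoding.Unicode,          // UTF-16, little-endian
    Encoding.BigEndianUnicode, // UTF-16, big-endian
    Encoding.UTF32,
    Encoding.Latin1
};

foreach (Encoding e in encodings)
{
    // WebName is the IANA-registered name; CodePage is the numeric identifier.
    Console.WriteLine($"{e.WebName,-12} code page {e.CodePage}");
}
```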

C# Encoding GetByteCount

The GetByteCount method returns the number of bytes produced by encoding the specified characters.

Program.cs
using System.Text;

string text = "one 🐘 and three 🐋";

int n = Encoding.UTF8.GetByteCount(text);
Console.WriteLine($"UTF-8: {n}");

n = Encoding.UTF32.GetByteCount(text);
Console.WriteLine($"UTF-32: {n}");

n = Encoding.Unicode.GetByteCount(text);
Console.WriteLine($"UTF-16: {n}");

n = Encoding.BigEndianUnicode.GetByteCount(text);
Console.WriteLine($"UTF-16BE: {n}");

n = Encoding.Latin1.GetByteCount(text);
Console.WriteLine($"Latin1: {n}");

n = Encoding.ASCII.GetByteCount(text);
Console.WriteLine($"ASCII: {n}");

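The example prints the number of bytes produced when the given string is encoded in each of the specified encodings. The string holds 15 ASCII characters and two emoji. Latin1 and ASCII cannot represent the emoji and count one replacement byte for each of their UTF-16 code units.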

$ dotnet run
UTF-8: 23
UTF-32: 68
UTF-16: 38
UTF-16BE: 38
Latin1: 19
ASCII: 19
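The UTF-8 and UTF-16 totals can be reproduced by hand. The following sketch, which assumes .NET Core 3.0 or later for the Rune type, adds up the encoded length of every code point and compares the sums with GetByteCount. The variable names are illustrative.

```csharp
using System;
using System.Text;

string text = "one 🐘 and three 🐋";

int utf8Total = 0;
int utf16Total = 0;

foreach (Rune rune in text.EnumerateRunes())
{
    utf8Total += rune.Utf8SequenceLength;       // 1 to 4 bytes per code point
    utf16Total += rune.Utf16SequenceLength * 2; // 1 or 2 code units, 2 bytes each
}

Console.WriteLine($"UTF-8: {utf8Total}");   // 23
Console.WriteLine($"UTF-16: {utf16Total}"); // 38

// Both sums agree with what the encoders report.
Console.WriteLine(utf8Total == Encoding.UTF8.GetByteCount(text));     // True
Console.WriteLine(utf16Total == Encoding.Unicode.GetByteCount(text)); // True
```

The 15 ASCII characters need one byte each in UTF-8 and the two emoji need four bytes each, which gives 15 + 8 = 23.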

C# Encoding GetBytes

The GetBytes method returns a byte array containing the results of encoding the specified set of characters.

Program.cs
using System.Text;

string text = "one 🐘 and three 🐋";

Console.WriteLine("UTF-8 bytes");
byte[] utf8Data = Encoding.UTF8.GetBytes(text);
showBytes(utf8Data);

Console.WriteLine("UTF-16 bytes");
byte[] utf16Data = Encoding.Unicode.GetBytes(text);
showBytes(utf16Data);

Console.WriteLine("UTF-16BE bytes");
byte[] utf16BEData = Encoding.BigEndianUnicode.GetBytes(text);
showBytes(utf16BEData);

Console.WriteLine("Latin1 bytes");
byte[] latin1Data = Encoding.Latin1.GetBytes(text);
showBytes(latin1Data);

// Prints the bytes as two-digit hexadecimal values, ten per line.
void showBytes(byte[] data)
{
    int i = 0;

    foreach (var e in data)
    {
        Console.Write($"{e:X2} ");
        i++;

        if (i % 10 == 0)
        {
            Console.WriteLine();
        }
    }

    Console.WriteLine();
}

The example encodes the given string into bytes using the UTF-8, UTF-16, UTF-16BE, and Latin1 encodings. Latin1 substitutes the byte 3F (a question mark) for every code unit it cannot represent.

$ dotnet run
UTF-8 bytes
6F 6E 65 20 F0 9F 90 98 20 61
6E 64 20 74 68 72 65 65 20 F0
9F 90 8B
UTF-16 bytes
6F 00 6E 00 65 00 20 00 3D D8
18 DC 20 00 61 00 6E 00 64 00
20 00 74 00 68 00 72 00 65 00
65 00 20 00 3D D8 0B DC
UTF-16BE bytes
00 6F 00 6E 00 65 00 20 D8 3D
DC 18 00 20 00 61 00 6E 00 64
00 20 00 74 00 68 00 72 00 65
00 65 00 20 D8 3D DC 0B
Latin1 bytes
6F 6E 65 20 3F 3F 20 61 6E 64
20 74 68 72 65 65 20 3F 3F
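For a quick look at encoded bytes without a hand-written helper, the base class library offers two ready-made formatters. The sketch below is one possible shortcut, assuming .NET 5 or later for Convert.ToHexString; it encodes a single emoji to keep the output short.

```csharp
using System;
using System.Text;

byte[] data = Encoding.UTF8.GetBytes("🐘");

// Convert.ToHexString produces uppercase hex digits with no separators.
string compact = Convert.ToHexString(data);

// BitConverter.ToString separates the bytes with hyphens.
string dashed = BitConverter.ToString(data);

Console.WriteLine(compact); // F09F9098
Console.WriteLine(dashed);  // F0-9F-90-98
```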

C# Encoding GetString

The GetString method builds a string that contains the results of decoding the specified sequence of bytes.

Program.cs
using System.Text;

string text = "one 🐘 and three 🐋";

Console.WriteLine("UTF-8 bytes");
byte[] utf8Data = Encoding.UTF8.GetBytes(text);
string output = Encoding.UTF8.GetString(utf8Data);
Console.WriteLine(output);

Console.WriteLine("UTF-16 bytes");
byte[] utf16Data = Encoding.Unicode.GetBytes(text);
output = Encoding.Unicode.GetString(utf16Data);
Console.WriteLine(output);

Console.WriteLine("UTF-16BE bytes");
byte[] utf16BEData = Encoding.BigEndianUnicode.GetBytes(text);
output = Encoding.BigEndianUnicode.GetString(utf16BEData);
Console.WriteLine(output);

Console.WriteLine("Latin1 bytes");
byte[] latin1Data = Encoding.Latin1.GetBytes(text);
output = Encoding.Latin1.GetString(latin1Data);
Console.WriteLine(output);

In the example, we first encode the given string into an array of bytes with GetBytes and then decode the bytes back into a string with GetString. We repeat this round trip with four different encodings.

$ dotnet run
UTF-8 bytes
one 🐘 and three 🐋
UTF-16 bytes
one 🐘 and three 🐋
UTF-16BE bytes
one 🐘 and three 🐋
Latin1 bytes
one ?? and three ??

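The Latin1 encoding cannot represent emoji. During encoding, each of the two UTF-16 code units of an emoji is replaced with a question mark, so the original characters are lost and cannot be recovered by decoding.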
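When silent data loss is not acceptable, an encoding can be configured to throw instead of substituting question marks. The following is a minimal sketch of that idea; it requests Latin1 by its name iso-8859-1 and attaches the built-in exception fallbacks. The names strictLatin1 and lossless are ours.

```csharp
using System;
using System.Text;

string text = "one 🐘 and three 🐋";

// Latin1 again, but unrepresentable characters now raise an exception.
Encoding strictLatin1 = Encoding.GetEncoding(
    "iso-8859-1",
    EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);

bool lossless = true;

try
{
    strictLatin1.GetBytes(text);
}
catch (EncoderFallbackException e)
{
    lossless = false;
    Console.WriteLine($"Cannot encode the character at index {e.Index}");
}

Console.WriteLine($"Lossless: {lossless}"); // Lossless: False
```

A check like this can run before data is written, so that the program can fall back to UTF-8 or report the problem.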

C# Encoding.Convert

The Encoding.Convert method converts an entire byte array from one encoding to another.

Program.cs
using System.Text;

string text = "one 🐘 and three 🐋";

byte[] utf16Data = Encoding.Unicode.GetBytes(text);
byte[] utf8Data = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Data);

Console.WriteLine("UTF-16 bytes");
showBytes(utf16Data);

Console.WriteLine();

Console.WriteLine("UTF-8 bytes");
showBytes(utf8Data);

Console.WriteLine();
string output = Encoding.UTF8.GetString(utf8Data);
Console.WriteLine(output);


// Prints the bytes as two-digit hexadecimal values, ten per line.
void showBytes(byte[] data)
{
    int i = 0;

    foreach (var e in data)
    {
        Console.Write($"{e:X2} ");
        i++;

        if (i % 10 == 0)
        {
            Console.WriteLine();
        }
    }

    Console.WriteLine();
}

In the example, we convert UTF-16 bytes into UTF-8 bytes and then decode the UTF-8 bytes back into a string.

$ dotnet run
UTF-16 bytes
6F 00 6E 00 65 00 20 00 3D D8
18 DC 20 00 61 00 6E 00 64 00
20 00 74 00 68 00 72 00 65 00
65 00 20 00 3D D8 0B DC

UTF-8 bytes
6F 6E 65 20 F0 9F 90 98 20 61
6E 64 20 74 68 72 65 65 20 F0
9F 90 8B

one 🐘 and three 🐋
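Conceptually, Encoding.Convert decodes the bytes with the source encoding and encodes the result with the target encoding. The sketch below spells out both steps and checks that they produce the same bytes; the variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text;

string text = "one 🐘 and three 🐋";
byte[] utf16Data = Encoding.Unicode.GetBytes(text);

// One step: let Encoding.Convert do the work.
byte[] converted = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Data);

// Two steps: decode with the source encoding, then encode with the target.
string decoded = Encoding.Unicode.GetString(utf16Data);
byte[] reencoded = Encoding.UTF8.GetBytes(decoded);

Console.WriteLine(converted.SequenceEqual(reencoded)); // True
```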

C# read/write data with Encoding

Next, we write data to a file and read it back using a specified encoding.

Program.cs
using System.Text;

string text = "one 🐘 and three 🐋";

// FileMode.Create truncates an existing file, so no stale bytes are left behind
using var fs = new FileStream("data.txt", FileMode.Create);
using var sw = new StreamWriter(fs, Encoding.UTF8);
sw.Write(text);

In the example, we write text into a file using Encoding.UTF8.

using var sw = new StreamWriter(fs, Encoding.UTF8);

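The second parameter of StreamWriter is the character encoding to use. Encoding.UTF8 writes a byte order mark (BOM) at the start of the file, which is why the file command reports UTF-8 (with BOM).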

$ dotnet run
$ file data.txt 
data.txt: Unicode text, UTF-8 (with BOM) text, with no line terminators
$ cat data.txt 
one 🐘 and three 🐋
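The file command reports a BOM because Encoding.UTF8 emits the three-byte UTF-8 preamble. If the BOM is not wanted, a UTF8Encoding instance created with false can be passed instead. The following sketch compares both variants; it writes into a MemoryStream rather than a file so that the bytes can be inspected directly, and WriteWith is a hypothetical helper introduced only for this illustration.

```csharp
using System;
using System.IO;
using System.Text;

string text = "one 🐘 and three 🐋";

// Hypothetical helper: writes the text through a StreamWriter
// and returns the raw bytes that reached the stream.
byte[] WriteWith(Encoding encoding)
{
    var ms = new MemoryStream();

    using (var sw = new StreamWriter(ms, encoding))
    {
        sw.Write(text);
    } // disposing the writer flushes it

    return ms.ToArray(); // ToArray still works after the stream is closed
}

byte[] withBom = WriteWith(Encoding.UTF8);
byte[] withoutBom = WriteWith(new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));

Console.WriteLine($"With BOM: {withBom.Length} bytes");       // With BOM: 26 bytes
Console.WriteLine($"Without BOM: {withoutBom.Length} bytes"); // Without BOM: 23 bytes
Console.WriteLine(Convert.ToHexString(withBom, 0, 3));        // EFBBBF
```

The three extra bytes EF BB BF are the UTF-8 byte order mark.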

Next, we read the data from the file.

Program.cs
using System.Text;

using var fs = new FileStream("data.txt", FileMode.Open);
using var sr = new StreamReader(fs, Encoding.UTF8);

string? text = sr.ReadLine();
Console.WriteLine(text);

We use StreamReader to read the data; we specify the character encoding in the second parameter.

$ dotnet run
one 🐘 and three 🐋
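The encoding passed to StreamReader acts as a default: with this constructor, the reader first looks for a byte order mark and prefers the encoding it indicates. The sketch below illustrates this under the assumption that the data starts with a UTF-16 BOM; a MemoryStream stands in for the file, and the variable names are ours.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

string text = "one 🐘 and three 🐋";

// Simulate a UTF-16 file: the byte order mark followed by the encoded text.
byte[] fileBytes = Encoding.Unicode.GetPreamble()
    .Concat(Encoding.Unicode.GetBytes(text))
    .ToArray();

// UTF-8 is requested, but the BOM tells the reader the data is UTF-16.
using var sr = new StreamReader(new MemoryStream(fileBytes), Encoding.UTF8);
string content = sr.ReadToEnd();

Console.WriteLine(content == text); // True
Console.WriteLine(sr.CurrentEncoding.WebName);
```

Without a BOM, the reader has nothing to detect and uses the encoding given in the second parameter.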

Source

Encoding class - language reference

In this article we have encoded and decoded data in C#.

Author

My name is Jan Bodnar and I am a passionate programmer with many years of programming experience. I have been writing programming articles since 2007. So far, I have written over 1400 articles and 8 e-books. I have over eight years of experience in teaching programming.
