Unicode vs. UTF-8

astorm


This entry is part 2 of 4 in the series Text Encoding and Unicode. Earlier posts include Inspecting Bytes with Node.js Buffer Objects. Later posts include When Good Unicode Encoding Goes Bad, and PHP and Unicode.

In my last quick tips post I mentioned examining the bytes of a text file that contained the text Hyvä, and getting back the following six bytes.

01001000   72
01111001   121
01110110   118
11000011   195
10100100   164
00001010   10

The first three bytes (72, 121, and 118) are an ASCII-encoded H, y, and v.

The last byte, 10, is an ASCII-encoded newline character that ends the file (my text editor is configured to always add a newline character to the end of files).

What’s a little more mysterious are the following bytes

11000011   195
10100100   164

These represent the character ä. That’s because this file is UTF-8 encoded. In UTF-8 encoding, characters in the original US-centric ASCII encoding (i.e. in the range 0 – 127) are encoded as single bytes. Characters outside that original ASCII encoding range need to be encoded with multiple bytes.
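You can confirm these bytes yourself with the same Node.js Buffer technique from the earlier post in this series; a quick sketch:

```javascript
// Encode the string "Hyvä" as UTF-8 and inspect the raw bytes.
// (No trailing newline here, since that came from the text editor,
// not the string itself.)
const bytes = Buffer.from('Hyvä', 'utf8');

console.log([...bytes]); // [72, 121, 118, 195, 164]
```

The first three values are the ASCII bytes, and 195 and 164 are the two-byte UTF-8 sequence for ä.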

How a program encodes those characters, and how many bytes it should use, gets tricky. To understand that, we need to understand the difference between Unicode and Unicode encodings.

What is Unicode?

Unicode is an attempt to define every possible human character and assign it a number. This number is called a codepoint. So that ä character? Its Unicode codepoint is U+00E4. The 00E4 portion of that is a hexadecimal number. In decimal that number is 228. In binary that number is

11100100

You’ll notice that it’s possible to store the number 228 in a single byte (11100100). However, we don’t see this byte in our file. That’s because different Unicode encodings use different algorithms to turn codepoints into bytes.

Some examples of Unicode encodings include

UTF-8
UTF-16
UTF-32

This is a subtle but important distinction to make when discussing Unicode. Unicode is the standard that defines the codepoints, while a Unicode encoding defines the rules that determine how those codepoints are represented as bytes in your file.

The Unicode encoding that's become the de facto standard these days is UTF-8. In UTF-8 a character might be encoded with one, two, three, or four bytes.

UTF-8 Encoding

Per the Wikipedia page, codepoints in the following ranges are encoded with the following number of bytes

U+0000  - U+007F:   one byte (our blessed ASCII text)
U+0080  - U+07FF:   two bytes
U+0800  - U+FFFF:   three bytes
U+10000 - U+10FFFF: four bytes (the encoding that makes all emoji possible)
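We can see these ranges in action with `Buffer.byteLength`, which reports the encoded size of a string (UTF-8 is its default encoding); one sample character from each range:

```javascript
// How many UTF-8 bytes does each character need?
console.log(Buffer.byteLength('H'));  // 1 -- U+0048, ASCII range
console.log(Buffer.byteLength('ä'));  // 2 -- U+00E4
console.log(Buffer.byteLength('€'));  // 3 -- U+20AC
console.log(Buffer.byteLength('💩')); // 4 -- U+1F4A9, an emoji
```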

When compared with other Unicode encodings, UTF-8 has two things going for it. First, it doesn’t force a multi-byte encoding on characters that don’t need all those bytes. Second, and likely more importantly, it’s a flexible encoding that was built to allow a different number of bytes depending on the needs of the character.

In UTF-32, every character is encoded with four bytes regardless of whether the character needs that space or not, and UTF-16 needs at least two bytes even for plain ASCII characters. UTF-8 allows a character like ä to be encoded with only two bytes, but is also flexible enough that we can encode a character like U+1F4A9 using four bytes.

When a programmer is writing a program to read through a UTF-8 file, if a byte starts with a 0

01001000
01111001
01110110

the programmer knows this is a single byte whose value represents a codepoint.

If, however, the byte begins with a 1, they know they’re at the start of a multi-byte sequence. A byte starting with 110 indicates this byte and the next make up a character. A byte starting with 1110 indicates this byte and the next two bytes make up a character. A byte starting with 11110 indicates this byte and the next three bytes make up a character.

In addition to these prefixes for identifying the number of bytes being used, the second, third, and fourth bytes in the sequence are prefixed with 10 to indicate they’re part of a multi-byte sequence.

So, if we consider our rules so far

two bytes:   1 1 0 _ _ _ _ _    1 0 _ _ _ _ _ _
three bytes: 1 1 1 0 _ _ _ _    1 0 _ _ _ _ _ _    1 0 _ _ _ _ _ _
four bytes:  1 1 1 1 0 _ _ _    1 0 _ _ _ _ _ _    1 0 _ _ _ _ _ _    1 0 _ _ _ _ _ _

These prefix bits serve as flags. The rest of the bits in the bytes will represent the code point’s value. So for our ä

1 1 0 0 0 0 1 1    1 0 1 0 0 1 0 0

We see it begins with 110, which means it’s a two byte character. This leaves the following bits representing the actual codepoint value

Full Bytes:        1 1 0 0 0 0 1 1    1 0 1 0 0 1 0 0

Codepoint Portion: _ _ _ 0 0 0 1 1    _ _ 1 0 0 1 0 0

These bits will be combined into a single binary number

1 1 1 0 0 1 0 0

Which is 228 decimal or 00E4 in hex. There’s our codepoint.


Copyright © Alan Storm 1975 – 2021 All Rights Reserved

Originally Posted: 10th February 2021