When Good Unicode Encoding Goes Bad

astorm

This entry is part 3 of 4 in the series Text Encoding and Unicode. Earlier posts include Inspecting Bytes with Node.js Buffer Objects, and Unicode vs. UTF-8. Later posts include PHP and Unicode.

So why am I posting so much about unicode and the word Hyvä? That’s a long story.

Whenever I work on a longer piece of writing, part of my copy editing process is to read the contents out loud to myself. I also have my computer read the contents back to me.

I’ve been doing the latter for a long time via the MacOS command line program say. The say program will read back any text passed to it either directly or via a file. It can also save that audio as an aiff file.

% say "hello world"
% say -f /path/to/input.txt -o /tmp/out.aiff

One tricky part of this is that I write in markdown, and if I pass a markdown file to say directly, the computer speaks back URLs, code samples, etc. This is not ideal.

So, lo those many years ago when I first had this idea, I wrote a small bit of code to strip this information from my markdown file before passing it on to say. This code is written in pidgin PHP and is not good, but it mostly does the job.

use Michelf\Markdown;

$contents = file_get_contents('/path/to/markdown/file.md');

// convert the markdown to HTML
$html     = Markdown::defaultTransform($contents);

// with a limited, well structured set of HTML out of the markdown library
// do some string replacement to pull out the parts we don't want spoken out
// loud
$html     = preg_replace(
    '%<pre><code>.+?</code></pre>%six',
    '<p>[CODE SNIPPED].</p>',
    $html
);
$html = str_replace('</p>','</p><br>',$html);

// save the HTML file
$tmp = tempnam('/tmp', 'md_to_say') . '.html';
file_put_contents($tmp, $html);

// use the MacOS textutil program to convert the HTML file into a text
// file, effectively removing any link or image URLs we don't want spoken
// out loud
$cmd = 'textutil -convert txt ' . $tmp;
`$cmd`;

// generate the invocation of the say command
$tmp_txt    = swapExtension($tmp, 'html','txt');
$tmp_aiff   = swapExtension($tmp, 'html','aiff');
$cmd = "say -f $tmp_txt -o $tmp_aiff";

Sins of my youth include parsing HTML with regular expressions and shelling out to another program (textutil) to finish my cleanup. It’s ugly code, but personal workflows are built on the ugly hacks of a programmer who has better things to do.
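The swapExtension helper called near the end of that script isn't shown. A minimal sketch of what such a helper might look like follows, assuming it only needs to swap a trailing file extension. The body here is hypothetical and is not the original implementation.

```php
<?php
// Hypothetical sketch: the original swapExtension() is not shown in the post.
// Assumes the path ends in ".$from" and swaps only that trailing extension.
function swapExtension(string $path, string $from, string $to): string
{
    $suffix = '.' . $from;
    // Leave the path untouched if it does not end in the expected extension.
    if (substr($path, -strlen($suffix)) !== $suffix) {
        return $path;
    }
    return substr($path, 0, -strlen($suffix)) . '.' . $to;
}

echo swapExtension('/tmp/md_to_sayAbC123.html', 'html', 'txt'), "\n";
// prints /tmp/md_to_sayAbC123.txt
```

Matching only the trailing extension avoids mangling a path that happens to contain "html" somewhere else in its name.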

I said this mostly works. One problem it had is that some special characters wouldn’t survive all the round trips and my computer would end up saying something silly. This was especially true when I started copy editing my two posts on the new Magento theme named Hyvä. Whenever I wrote Hyvä, my computer would say

Hyve tilde currency sign

Something was corrupting the unicode of the ä in Hyvä.

The Bytes

Here’s what I saw when I took a look at the bytes in the final text file I was feeding into the say command

01001000   72
01111001   121
01110110   118

11000011   195
10000011   131

11000010   194
10100100   164

00100000   32

The first three characters are the ASCII H, y, and v. Nothing fishy there. The last byte, 32, is an ASCII encoded space, so nothing too fishy there. It may be my best practice to end a file with a newline, but not every program will do this.
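The post doesn't say which tool produced that listing. One way to get a similar dump in PHP is sketched below. The dumpBytes name is made up for this example, and the input bytes are written as escapes so the result doesn't depend on how the source file itself is encoded.

```php
<?php
// Return one line per byte: 8 binary digits, then the decimal value.
function dumpBytes(string $bytes): array
{
    $lines = [];
    foreach (str_split($bytes) as $byte) {
        $value = ord($byte);
        $lines[] = sprintf('%08b   %d', $value, $value);
    }
    return $lines;
}

// "Hyv", the four suspicious bytes, then a trailing space.
echo implode("\n", dumpBytes("Hyv\xC3\x83\xC2\xA4 ")), "\n";
// first line printed: 01001000   72
```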

What was fishy was there were four additional bytes representing two additional unicode characters. Keeping my UTF-8 encoding in mind, the first byte sequence looked like this

Bytes:             1 1 0 0 0 0 1 1    1 0 0 0 0 0 1 1
Codepoint Bits:    _ _ _ 0 0 0 1 1    _ _ 0 0 0 0 1 1

Binary Codepoint:  11000011
Decimal Codepoint: 195
Hex Codepoint:     0x00C3

In other words, the Unicode codepoint U+00C3, LATIN CAPITAL LETTER A WITH TILDE: Ã.

The second byte sequence looked like this

Bytes:             1 1 0 0 0 0 1 0    1 0 1 0 0 1 0 0
Codepoint Bits:    _ _ _ 0 0 0 1 0    _ _ 1 0 0 1 0 0

Binary Codepoint:  10100100
Decimal Codepoint: 164
Hex Codepoint:     0x00A4

In other words, the Unicode codepoint U+00A4, CURRENCY SIGN: ¤.
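The bit-masking done by hand above can be sketched in PHP. This is an illustration only: decodeTwoByteUtf8 is a made-up name, it handles just the two-byte 110xxxxx 10xxxxxx shape, and it does not reject overlong forms or handle longer sequences.

```php
<?php
// Decode one two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a codepoint.
function decodeTwoByteUtf8(string $bytes): int
{
    if (strlen($bytes) !== 2) {
        throw new InvalidArgumentException('expected exactly two bytes');
    }
    $first  = ord($bytes[0]);
    $second = ord($bytes[1]);
    // Check the marker bits: 110 on the first byte, 10 on the second.
    if (($first & 0b11100000) !== 0b11000000
        || ($second & 0b11000000) !== 0b10000000) {
        throw new InvalidArgumentException('not a two-byte UTF-8 sequence');
    }
    // Keep the low 5 bits of the first byte and the low 6 bits of the second.
    return (($first & 0b00011111) << 6) | ($second & 0b00111111);
}

printf("U+%04X\n", decodeTwoByteUtf8("\xC3\x83")); // prints U+00C3
printf("U+%04X\n", decodeTwoByteUtf8("\xC2\xA4")); // prints U+00A4
```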

So my computer wasn’t saying

Hyve tilde currency sign

It was saying

Hyv “A tilde”, currency sign

But still, pronunciation aside, what had happened to my ä?

The Culprit

The culprit turned out to be the textutil command. When given an input like this

<!-- File: input.html -->
<b>Hyvä</b>

and invoked like this

% textutil -convert txt input.html

It produces an output file like this

HyvÃ¤

The textutil command wasn’t parsing input.html as a UTF-8 file. Instead, it was parsing it as an ISO/IEC 8859-1 encoded file. In most text encodings that predate unicode, individual characters are encoded in a single byte. This means each of these character sets can only represent 256 different characters. The mapping of these byte values to characters is often called a codepage.

So — when we encode our unicode ä as UTF-8 it looks like this

11000011  10100100

If a program is reading through our file and parsing our file as UTF-8, it knows to treat these two bytes as a single character. However — if a program is reading through our file and parsing it as ISO/IEC 8859-1 — then it will see this as two separate characters.

If we refer to the codepage chart on Wikipedia, the binary number 11000011 (decimal 195, hex 0xc3) maps to the character Ã. Similarly, the binary number 10100100 (decimal 164, hex 0xa4) maps to the character ¤.

So textutil reads these bytes in as ISO/IEC 8859-1 encoded text. Then, when writing the text file back out, it encodes the resulting characters as UTF-8.
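The same misread can be reproduced in PHP without textutil. The sketch below assumes the mbstring extension is available and uses mb_convert_encoding to stand in for what textutil appears to be doing. It is a simulation, not a look inside textutil itself.

```php
<?php
// Simulate the misread: treat UTF-8 bytes as ISO-8859-1, then re-encode
// the resulting characters as UTF-8. Requires the mbstring extension.
$original = "Hyv\xC3\xA4"; // "Hyvä" as UTF-8 bytes
$garbled  = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo bin2hex($original), "\n"; // prints 487976c3a4
echo bin2hex($garbled), "\n";  // prints 487976c383c2a4

// Running the conversion in the other direction recovers the original bytes.
$repaired = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
echo bin2hex($repaired), "\n"; // prints 487976c3a4
```

The garbled output ends in the same four bytes (c3 83 c2 a4) seen in the byte dump earlier.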

My initial thought was a knee jerk reaction of “what the heck textutil, you clearly know about UTF-8 — why u change my bytes?” Then I thought about it, sighed a computer sigh, and moved on to finding a solution.

No Such Thing as Plain Text Encoding

If you’re looking for a “who did the wrong thing” moral to this tale, there’s not a great one. So-called “plain text” files have a major design flaw/feature that makes this sort of slip-up inevitable. Namely, there’s nothing in a text file that tells a program what that text file’s encoding is.

The best any program can do is take an educated guess based on the sort of characters it sees in the file, or force users to be explicit about their encoding. This is why a lot of text editors have the ability to open a file using multiple encodings.
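As a rough illustration of that kind of educated guess, the sketch below checks whether a byte string is valid UTF-8 using mb_check_encoding (mbstring extension assumed). The looksLikeUtf8 name is made up for this example, and the check is only a heuristic.

```php
<?php
// Guess whether a byte string is plausibly UTF-8. Requires mbstring.
// Validity is not proof: pure ASCII is valid in both encodings, and
// already-garbled text like "HyvÃ¤" is perfectly valid UTF-8 too.
function looksLikeUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

var_dump(looksLikeUtf8("Hyv\xC3\xA4")); // true: "Hyvä" encoded as UTF-8
var_dump(looksLikeUtf8("Hyv\xE4"));     // false: a lone 0xE4 (ISO-8859-1 "ä") is not valid UTF-8
```

A check like this can rule UTF-8 out, but it can never confirm which single-byte codepage a file was written in.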

Interestingly, HTML files have an optional tag where you can specify an encoding. My best guess as to what happened is that when textutil is fed an HTML document without a charset set, it defaults to ISO/IEC 8859-1 (or its close cousin, Windows-1252).

This may seem like a poor choice now, but when textutil was written ISO/IEC 8859-1 was considered a reasonable default assumption if no character encoding was set. That this is still the default assumption points more towards a conservative philosophy from Apple in updating these old command line utilities than any lapse in judgment.

As for me, I had a choice. Had the time for me to clean up this old utility script finally come, or would I slap on another layer of spackling paste and move on?

The quick hack won out. I made sure to generate a <meta charset="UTF-8" /> tag in my saved HTML

$html = '<html><head><meta charset="UTF-8" /></head><body>' .
    $html .'</body></html>';

file_put_contents($tmp, $html);

and textutil started saving my files with the expected encoding. Another victory for “bad but it works and is just for me” code in the wild.

Copyright © Alan Storm 1975 – 2021 All Rights Reserved

Originally Posted: 11th February 2021