So why am I posting so much about unicode and the word Hyvä? That’s a long story.
Whenever I work on a longer piece of writing, part of my copy editing process is to read the contents out loud to myself. I also have my computer read the contents back to me.
I’ve been doing the latter for a long time via the macOS command line program say. The
say program will read back any text passed to it, either directly or via a file. It can also save that audio as an AIFF file.
% say "hello world"
% say -f /path/to/input.txt -o /tmp/out.aiff
One tricky part of this is that I write in markdown, and if I pass a markdown file to
say directly, the computer ends up speaking back URLs, code samples, etc. This is not ideal.
So, lo those many years ago when I first had this idea, I wrote a small bit of code to strip this information from my markdown file before passing it on to
say. This code is written in pidgin PHP and is not good, but it mostly does the job.
use Michelf\Markdown;

$contents = file_get_contents('/path/to/markdown/file.md');

// convert the markdown to HTML
$html = Markdown::defaultTransform($contents);

// with a limited, well structured set of HTML out of the markdown library,
// do some string replacement to pull out the parts we don't want spoken
// out loud
$html = preg_replace(
    '%<pre><code>.+?</code></pre>%six',
    '<p>[CODE SNIPPED].</p>',
    $html
);
$html = str_replace('</p>', '</p><br>', $html);

// save the HTML file
$tmp = tempnam('/tmp', 'md_to_say') . '.html';
file_put_contents($tmp, $html);

// use the macOS textutil program to convert the HTML file into a text
// file, effectively removing any link or image URLs we don't want spoken
// out loud
$cmd = 'textutil -convert txt ' . $tmp;
`$cmd`;

// generate the invocation of the say command
$tmp_txt  = swapExtension($tmp, 'html', 'txt');
$tmp_aiff = swapExtension($tmp, 'html', 'aiff');
$cmd = "say -f $tmp_txt -o $tmp_aiff";
Sins of my youth include parsing HTML with regular expressions and shelling out to another program (
textutil) to finish my cleanup. It’s ugly code, but personal workflows are built on the ugly hacks of a programmer who has better things to do.
I said this mostly works. One problem it had is that some special characters wouldn’t survive all the round trips, and my computer would end up saying something silly. This was especially true when I started copy editing my two posts on the new Magento theme named Hyvä. Whenever I wrote Hyvä, my computer would say
Hyve tilde currency sign
Something was corrupting the unicode of the ä in Hyvä.
Here’s what I saw when I took a look at the bytes in the final text file I was feeding into
say.
01001000  72
01111001 121
01110110 118
11000011 195
10000011 131
11000010 194
10100100 164
00100000  32
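A dump like this is easy to reproduce. Here’s a small Python sketch, with the byte values copied straight from the dump above, that prints each byte in binary and decimal:

```python
# The bytes found in the final text file (the trailing 32 is a space)
data = bytes([0b01001000, 0b01111001, 0b01110110,
              0b11000011, 0b10000011, 0b11000010, 0b10100100,
              0b00100000])

# print each byte as binary and decimal, mirroring the dump above
for byte in data:
    print(f"{byte:08b} {byte:3d}")
```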
The first three characters are the ASCII encoded H, y, and
v. Nothing fishy there. The last byte, 32, is an ASCII encoded space, so nothing too fishy there either. It may be best practice to end a file with a newline, but not every program will do this.
What was fishy was that there were four additional bytes, representing two additional Unicode characters. Keeping my UTF-8 encoding in mind, the first byte sequence looked like this
Bytes:              1 1 0 0 0 0 1 1   1 0 0 0 0 0 1 1
Codepoint Bits:     _ _ _ 0 0 0 1 1   _ _ 0 0 0 0 1 1
Binary Codepoint:   11000011
Decimal Codepoint:  195
Hex Codepoint:      0x00C3
In other words, the Unicode codepoint U+00C3, LATIN CAPITAL LETTER A WITH TILDE: Ã.
The second byte sequence looked like this
Bytes:              1 1 0 0 0 0 1 0   1 0 1 0 0 1 0 0
Codepoint Bits:     _ _ _ 0 0 0 1 0   _ _ 1 0 0 1 0 0
Binary Codepoint:   10100100
Decimal Codepoint:  164
Hex Codepoint:      0x00A4
In other words, the Unicode codepoint U+00A4, CURRENCY SIGN: ¤.
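If you want to double-check the bit shuffling in those two tables, a few lines of Python will do it. The decode_two_byte function here is a throwaway helper written for this post, not a library function:

```python
def decode_two_byte(b1: int, b2: int) -> int:
    """Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a codepoint."""
    assert b1 >> 5 == 0b110, "first byte must start with 110"
    assert b2 >> 6 == 0b10, "second byte must start with 10"
    # take the low five bits of the first byte and the low six of the second
    return ((b1 & 0b00011111) << 6) | (b2 & 0b00111111)

print(chr(decode_two_byte(0b11000011, 0b10000011)))  # Ã (U+00C3)
print(chr(decode_two_byte(0b11000010, 0b10100100)))  # ¤ (U+00A4)
```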
So my computer wasn’t saying
Hyve tilde currency sign
It was saying
Hyv “A tilde”, currency sign
But still, pronunciation aside, what had happened to my ä?
The culprit turned out to be the
textutil command. When given an input like this
<!-- File: input.html -->
<b>Hyvä</b>
and invoked like this
% textutil -convert txt input.html
It produces an output file like this

HyvÃ¤

This is because the
textutil command wasn’t parsing
input.html as a UTF-8 file. Instead, it was parsing it as an ISO/IEC 8859-1 encoded file. In most text encodings that predate Unicode, individual characters are encoded in a single byte. This means each of these character sets can only represent 256 different characters. The mapping of these byte values to characters is often called a codepage.
So, when we encode our Unicode ä as UTF-8, it looks like this

11000011 10100100
If a program is reading through our file and parsing it as UTF-8, it knows to treat these two bytes as a single character. However, if a program is reading through our file and parsing it as ISO/IEC 8859-1, then it will see them as two separate characters.
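Python’s bytes type makes this easy to see for yourself. The same two bytes come out as one character or two, depending on which decoder you ask:

```python
two_bytes = bytes([0b11000011, 0b10100100])  # the UTF-8 encoding of ä

# parsed as UTF-8: one character
print(two_bytes.decode("utf-8"))       # ä

# parsed as ISO/IEC 8859-1: two characters
print(two_bytes.decode("iso-8859-1"))  # Ã¤
```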
If we refer to the codepage chart on Wikipedia, the binary number
11000011 (decimal 195, hex 0xc3) maps to the character Ã. Similarly, the binary number
10100100 (decimal 164, hex 0xa4) maps to the character ¤.
textutil read these characters in as ISO/IEC 8859-1 encoded text. Then, when writing the text file back out, it encoded those characters as UTF-8.
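The whole round trip can be simulated in a few lines of Python. The decode/encode pair below stands in for what textutil was doing with my file:

```python
# Simulate the round trip: UTF-8 bytes read as ISO/IEC 8859-1,
# then written back out as UTF-8
original = "Hyvä".encode("utf-8")         # 72 121 118 195 164
misread  = original.decode("iso-8859-1")  # 'HyvÃ¤'
written  = misread.encode("utf-8")        # 72 121 118 195 131 194 164

print(list(written))
```

Those final seven bytes are exactly what I found in my text file.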
My initial thought was a knee-jerk reaction of “what the heck
textutil, you clearly know about UTF-8 — why u change my bytes?” Then I thought about it, sighed a computer sigh, and moved on to finding a solution.
No Such Thing as Plain Text Encoding
If you’re looking for a “who did the wrong thing” moral to this tale, there’s not a great one. So-called “plain text” files have a major design flaw/feature that makes this sort of slip-up inevitable. Namely, there’s nothing in a text file that tells a programmer what that text file’s encoding is.
The best any program can do is take an educated guess based on the sorts of characters it sees in the file, or force its users to be explicit about their encoding. This is why a lot of text editors have the ability to open a file using multiple encodings.
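A quick Python sketch shows why guessing is the best anyone can do. The same two bytes decode without error under several encodings, and nothing in the bytes themselves says which reading is right:

```python
data = bytes([0xC3, 0xA4])

# every one of these decodes succeeds; the bytes can't tell us which is right
print(data.decode("utf-8"))       # ä
print(data.decode("iso-8859-1"))  # Ã¤
print(data.decode("cp1252"))      # Ã¤
```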
Interestingly, HTML files have an optional tag where you can specify an encoding. My best guess as to what happened: when
textutil is fed an HTML document without a
charset set, it defaults to
ISO/IEC 8859-1 (or its close cousin, Windows-1252).
This may seem like a poor choice now, but when
textutil was written,
ISO/IEC 8859-1 was considered a reasonable default assumption if no character encoding was set. That this is still the default assumption points more towards a conservative philosophy from Apple in updating these old command line utilities than towards any lapse in judgment.
As for me, I had a choice. Had the time finally come to clean up this old utility script, or would I slap on another layer of spackling paste and move on?
The quick hack won out. I made sure to generate a
<meta charset="UTF-8" /> tag in my saved HTML
$html = '<html><head><meta charset="UTF-8" /></head><body>'
    . $html
    . '</body></html>';
file_put_contents($tmp, $html);
and textutil started saving my files with the expected encoding. Another victory for “bad but it works and is just for me” code in the wild.