Text Encodings

I have been doing a lot of work recently on implementing a Z-Machine disassembler. Half the fun of the project is dealing with the antiquated tech stack of 1979. With memory at that time expressed in kilobytes, a lot of interesting techniques were used to cut down the size of programs. Like, for example, how text is encoded.

Let us take a modern standard, like Unicode. In the UTF-16 encoding, most Unicode characters take 2 bytes, or 16 bits. For example, the letter 'a' would be 00 61, or 0000 0000 0110 0001. With 2 bytes we can represent 65,536 characters (Unicode also has "planes" beyond this basic set, but we can ignore those here). The catch is that the word "apple" now requires 10 whole bytes, or 160 bits! That may not seem like a lot - because it isn't these days - but if you're trying to make a text-adventure game on 1980s technology, that simply won't do. If each paragraph contains 100 words of 5 letters each (a typical amount for descriptions in a text-adventure game), we require 1,000 bytes per paragraph. We've almost reached a kilobyte on one description alone, and we still need to add in logic and other screens. For context, Space Invaders on a classic arcade cabinet was 8KB.

So we don't really need Unicode support - how about ASCII? ASCII is technically a 7-bit code (128 characters), but in practice each character is stored in 1 byte, or 8 bits, which leaves room for 256 values in extended variants. The letter 'a' would be 61, or 0110 0001. Our word "apple" now requires only 5 bytes, or 80 bits. We are down to 500 bytes per paragraph, and can squeeze in another screen.

The real question I wanted to get at here is how low can we go? Is there an encoding better than ASCII for our purposes? We need something that can represent all lower and uppercase English characters, as well as some punctuation. We also want something that takes less than 8 bits per letter.

Initially one might think Morse code is a solution to our problems. Morse code, as most know, is made up of dots and dashes - short and long transmission pulses. By converting dots to 1 and dashes to 0, this could work. For example, 'a' (dot-dash) becomes 10. 2 bits? That's amazing!

But that's just for 'a', sadly. For 'j' (dot-dash-dash-dash) we would have 1000. Here comes the problem with Morse code: it's variable length. Each letter can be 1 to 4 bits. We could solve this by encoding the length in a fixed 3-bit header before each letter. So 'a' would translate to 010 10 and 'j' to 100 1000. Now we are up to 4-7 bits per letter. What about punctuation? We could expand the codeset with 5-bit combinations that encode punctuation and spaces, and even swap out some of the lesser-used letters, like 'z', for a 'space' code to cut down on size. So now we are at 4-8 bits per letter - roughly 6 bits per letter on average, so definitely better!
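The header-plus-Morse scheme above can be sketched in a few lines of Python. The `MORSE` table here is a hypothetical excerpt for the example letters, not a real library:

```python
# Morse-based encoding sketch: dots become 1, dashes become 0, and a
# fixed 3-bit header stores the length of the code that follows.
MORSE = {"a": ".-", "e": ".", "j": ".---", "l": ".-..", "p": ".--."}

def encode_letter(ch: str) -> str:
    body = MORSE[ch].replace(".", "1").replace("-", "0")
    return format(len(body), "03b") + body

print(encode_letter("a"))  # 01010   (010 + 10)
print(encode_letter("j"))  # 1001000 (100 + 1000)
```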

"But what about uppercase!" says the aggravated reader. Well, darn, looks like we need to throw another bit onto our header to represent whether the character is lower or upper case. Now 'A' would be 1 010 10 and 'a' would be 0 010 10. We would like to skip this bit for punctuation codes, but then the decoder would have no way to parse the stream unambiguously. So now we are up to 5-9 bits per letter, or 6.5 bits on average. That puts a paragraph at 3,250 bits, or roughly 400 bytes.
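Adding the case bit is a one-line extension of the same sketch (again with a hypothetical `MORSE` excerpt):

```python
# Case bit (1 = upper, 0 = lower) in front of the 3-bit length header.
MORSE = {"a": ".-"}

def encode_cased(ch: str) -> str:
    case = "1" if ch.isupper() else "0"
    body = MORSE[ch.lower()].replace(".", "1").replace("-", "0")
    return case + format(len(body), "03b") + body

print(encode_cased("A"))  # 101010 (1 + 010 + 10)
print(encode_cased("a"))  # 001010 (0 + 010 + 10)
```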

It would be convenient if we could eliminate the header and have a fixed-width bit-space for each letter. To fit the whole alphabet, our max value has to be at least 25 (0 can represent 'a'), so we need at least 5 bits. With 5 bits we can have up to 32 values, so we have room for punctuation as well. This 5-bit-per-letter scheme is itself a well-known system, typically called a Baudot code. For uppercase, we could reuse our case-bit trick, getting us to a solid 6 bits per character, or 3,000 bits per paragraph.
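A fixed 5-bit code is even simpler to sketch - map 'a' to 0 and take 5 bits per letter (this is our toy scheme here, not the actual Baudot table):

```python
# Fixed-width 5-bit codes: 'a' = 0 through 'z' = 25, leaving values
# 26-31 free for punctuation.
def five_bit(ch: str) -> str:
    return format(ord(ch) - ord("a"), "05b")

print(" ".join(five_bit(c) for c in "apple"))
# 00000 01111 01111 01011 00100  -> 25 bits for the whole word
```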

So is this the system the Z-Machine uses? Well, no, because we can actually do better. The Z-Machine packs three 5-bit characters into each 16-bit word (2 bytes), with the leftover bit at the front marking whether this is the last word of the string. By using the spare values in our 5 bits, we can implement "shift" codes. These shift codes (values 4 for uppercase and 5 for punctuation) switch the next character into a different alphabet, giving us access to roughly a hundred characters while still spending only 5 bits per character! The only downside of this system is that the number of characters in a string must be divisible by 3, but we can pad out shorter strings with shift codes.

Confused? I sure was when I first tried to reverse-engineer this system. So here is a breakdown. Let's go back to our example word "apple". The Z-Machine places the first character 'a' at 6, 'b' at 7, and so on. So for our word:

a = 0x6
p = 0x15
p = 0x15
l = 0x11
e = 0xA

Their binary representation in 5 bit segments:

a = 00110
p = 10101
p = 10101
l = 10001
e = 01010

Next we split the word into 3 letter segments:

a p p = 00110 10101 10101
l e ? = 10001 01010 ?

Like I stated above, we pad out words with shift characters, typically 5 in most Z-Machine games. 5 in bits is 00101, so with that:

a p p = 00110 10101 10101
l e 5 = 10001 01010 00101

Now adding on our front bit to represent whether we are at the end of our string:

a p p = 0 00110 10101 10101
l e 5 = 1 10001 01010 00101

Combining these bits together into our two words, we get our final byte-stream:

0x1AB5 0xC545
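The whole walkthrough can be condensed into a short Python sketch (lowercase letters only, as in our example; real Z-Machine encoders also handle the other alphabets and abbreviations):

```python
# Pack a lowercase string into Z-Machine words: 'a' = 6 onward, three
# 5-bit characters per 16-bit word, shift code 5 as padding, and the top
# bit of the final word set to mark the end of the string.
def encode_zstring(text: str) -> bytes:
    zchars = [ord(c) - ord("a") + 6 for c in text]
    while len(zchars) % 3:
        zchars.append(5)                 # pad with shift code 5
    out = bytearray()
    for i in range(0, len(zchars), 3):
        word = (zchars[i] << 10) | (zchars[i + 1] << 5) | zchars[i + 2]
        if i + 3 == len(zchars):
            word |= 0x8000               # end-of-string marker
        out += word.to_bytes(2, "big")
    return bytes(out)

print(encode_zstring("apple").hex())  # 1ab5c545
```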

A bit confusing, but our end result is 5 bits per character! That's 2,500 bits, or roughly 312 bytes! From our Unicode starting point we have eliminated nearly 700 bytes, which means we can now fit about three descriptions in the space our original one required.

So why did we move away from this system to unicode? The answer should be obvious: there is no way to map wingding characters.