Will We Run Out of Unicodes?

billpg2 pts0 comments

Will We Run Out of Unicodes? – billpg industries™

Skip to content

Before Unicode, digital text lived in a fragmented world of 8-bit encodings. ASCII had settled in as the good-enough-for-English core, taking up the first half of codes, but the other half was a mish-mash of regional code pages that mapped characters differently depending on locale. One set for accented Latin letters, another set for Cyrillic.

Each system carried its own assumptions, collisions, and blind spots. Unicode emerged as a unifying vision. a single character set for all human languages, built on a 16-bit foundation. All developers had to do was swap their 8-bit loops for 16-bit loops. Some bristled that half the bytes were all zeros, but this was for the greater good.

16-bits made 65,536 code points. It was a bold expansion from the cramped quarters of ASCII, a ceremonial leap into linguistic universality. This was enough, it was thought, to encode the entirety of written expression. After all, how many characters could the world possibly need?

"Remember this girls. None of you can be first, but all of you can be next."

🐹 I absolutely UTF-8 those zero bytes.

It was in this world of 16-bit Unicode that UTF-8 emerged. This had the notable benefit of being compatible with 7-bit ASCII, using the second half of ASCII to encode the non-ASCII side of Unicode as multiple byte sequences.

If your code knew how to work with ASCII it would probably work with UTF-8 without any changes needed. So long as it passed over those multi-byte sequences without attempting to interpret them, you’d be fine. The trade-off was that while ASCII characters only took up one byte, most of Unicode took three bytes, with the letters-with-accents occupying the two-bytes-per-character range.

This wasn’t the hard limit of UTF-8. The initial design allowed for up to 31-bit character codes. Plenty of room for expansion!

🔨 Knocking on UTF-16’s door.

As linguistic diversity, historical scripts, emoji, and symbolic notations clamoured for representation, the Unicode Consortium realised their neat two-byte packages would not be enough and needed to be extended. The world could have moved over to the UTF-8 where there was plenty of room, but too many systems had 16-bit Unicode baked in.

The community that doggedly stuck with ASCII and its 8-bits per character design must have felt a bit smug seeing the rest of the world move to 16-bit Unicode. They stuck with their good-enough-for-English encoding and were rewarded with UTF-8 with its ASCII compatibility and plenty of room for expansion. Meanwhile, those early adopters who made the effort to move to the purity of their fixed size 16-bit encoding were told that their characters weren’t going to be fixed size any more.

This would be the plan to move beyond the 65,536 limit. Two unused blocks of 1024 codes were set aside. If you wanted a character in the original range of 16-bit values, you’d use the 16-bit code as normal, but if you wanted a character from the new extended space, you had to put two 16-bit codes from these blocks together. The first 16-bit code gave you 10 bits (1024=210) and the second 16-bit code you 10 more bits, making 20 bits in total.

(Incidentally, we need two separate blocks to allow for self-synchronization. If we only had one block of 1024 codes, we could not drop into the middle of a stream of 16-bit codes and simply start reading. It is only by having two blocks you know that if the first 16-bit code you read is from the second block, you know to discard that one and continue afresh from the next one.)

The original Unicode was rechristened the "Basic Multilingual Plane" or plane zero, while the 20-bit codes allowed by this new encoding were split into 16 separate "planes" of 65,536 codes each, numbered from 1 to (hexadecimal) 10. UTF-16 with its one million possible codes was born.

UTF-8 was standardized to match UTF-16 limits. Plane zero characters were represented by one, two or three byte sequences as before, but the new extended planes required four byte sequences. The longer byte sequences were still there but cordoned off with a "Here be dragons" sign, their byte patterns declared meaningless.

"Don’t need quarters, don’t need dimes, to call a friend of mine. Don’t need computer or TV to have a real good time."

🧩 What If We Run Out Again?

Unicode’s architects once believed 64K code points would suffice. Then they expanded to a little over a million. But what if we run out again?

It’s not as far-fetched as it sounds. Scripts evolve. Emoji proliferate. Symbolic domains—mathematical, musical, magical—keep expanding. And if humanity ever starts encoding dreams, gestures, or interspecies diplomacy, we might need more.

Fortunately, UTF-8 is quietly prepared. Recall that its original design allowed for up to 31-bit code points, using up to 7 bytes per character. The technical definition of UTF-8 restricts itself to 21 bits, but the scaffolding for expansion is still...

unicode ascii codes code byte character

Related Articles