As you might know, computers can only store 0s and 1s, which means letters have to be represented as numbers before a computer can work with text. This mapping from characters to numbers is called a character encoding system. It also means that all computers have to agree on which number represents which character.
Why is that? Imagine that you and I are sending secret messages using numbers, but we follow different rules. How are we going to decipher them? Let’s say A=1, B=2 for me, but for you, A=0 and B=1. So, if I send APPLE, the numbers would be 1, 16, 16, 12, 5, and you’d decipher them as BQQMF, which makes no sense.
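To make that concrete, here is a minimal Python sketch of the same mix-up. The `encode` and `decode` helpers are purely illustrative inventions for this example, not part of any real standard:

```python
# Two parties using different letter-to-number rules read
# the same numbers as completely different words.
def encode(text, base):
    """Map each letter to a number, with 'A' starting at `base`."""
    return [ord(ch) - ord('A') + base for ch in text]

def decode(numbers, base):
    """Map each number back to a letter under the same rule."""
    return ''.join(chr(n - base + ord('A')) for n in numbers)

numbers = encode("APPLE", base=1)   # my rule: A=1, B=2, ...
print(numbers)                      # [1, 16, 16, 12, 5]
print(decode(numbers, base=0))      # your rule: A=0, B=1 -> 'BQQMF'
```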
So, to communicate efficiently and effectively in the professional world, there has to be a standardized numerical representation of characters. But that is not as simple as it sounds.
To explain how the Unicode encoding system emerged as an essential part of today’s digital communication, we need to take a quick stroll down memory lane.
Welcoming the ASCII system
The history of Unicode dates all the way back to the 1960s, when people were still using teletypes (teleprinters, teletypewriters) to communicate with each other.
ASCII was designed for use with these teletypes. Before ASCII, teleprinters used a 5-bit encoding system, which could only give 2⁵ = 32 possible characters. People were unsatisfied with such a limited character set, which led to the need for a more capable encoding system.
In 1963, the American Standards Association developed ‘ASCII,’ an acronym for American Standard Code for Information Interchange. It is a 7-bit encoding system that can represent 2⁷ = 128 different characters, numbered 0 to 127.
So for the English language, with its 26 letters, ASCII has enough slots for both uppercase and lowercase letters, the digits 0 to 9, punctuation marks, and non-printable control codes for teleprinters.
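If you have Python at hand, you can peek at these ASCII values yourself with the built-in `ord()` and `chr()` functions (a quick illustrative snippet, nothing more):

```python
# Every ASCII character maps to a number between 0 and 127.
print(ord('A'))   # 65  -- uppercase letters start at 65
print(ord('a'))   # 97  -- lowercase letters start at 97
print(ord('0'))   # 48  -- digits start at 48
print(chr(65))    # 'A' -- and chr() goes the other way
# All of "Hello, World!" fits comfortably within 7 bits:
print(max(ord(c) for c in "Hello, World!"))  # 114
```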
Later, in March 1968, U.S. President Lyndon B. Johnson mandated that all computers purchased by the U.S. federal government support ASCII, to minimize conflicts between computer and telecommunications data systems. Here is a snippet of his memorandum approving the adoption of the ASCII system:
“I have today approved a recommendation by the Secretary of Commerce, submitted under provisions of Public Law 89-306, that the United States of America Standard Code for Information Interchange be adopted as a Federal standard… I stressed the need for achieving, with industry cooperation, greater compatibility among computers.”
Yes, ASCII became a standard encoding system that solved the communication fuss of the time, but it had one big issue: it was designed with the English language in mind. So for European languages that use accented letters, like German ä, ö, ü or Polish ź, ł, ę, ASCII wasn’t a favorable option.
One more ‘bit’ added to the history of Unicode
A little over a decade later, 8-bit IBM PCs became widely used in households, thanks to the invention of the world’s first 8-bit microprocessor, the Intel 8008. These PCs used an 8-bit system that provides 2⁸ = 256 possible values for graphic characters, ranging from 0 to 255.
Since the ASCII system only uses numbers up to 127, values 128 to 255 in the 8-bit system became extra slots that anyone could use. However, the characters assigned to values 128 to 255 were never standardized, which means that different software companies such as IBM, Microsoft, HP, Apple, and Adobe started using those slots for their own encoding systems. IBM, for instance, created Code Page 437 for its PCs, which used the 128 spare slots (128 to 255) for accented letters, symbols, shapes, and Greek letters.
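You can still observe this divergence today, because modern languages ship these legacy code pages as codecs. As a rough illustration, here is how one and the same byte reads differently under IBM’s Code Page 437 and the Western European ISO 8859-1 (Latin-1):

```python
# One byte, two interpretations of the "spare" slots above 127.
raw = bytes([0xE1])               # byte value 225
print(raw.decode('cp437'))        # 'ß' under IBM's Code Page 437
print(raw.decode('latin-1'))      # 'á' under ISO 8859-1
```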
This proliferation of character encodings was chaotic enough, and to make things more complicated, there was also a need for a system that could cater to Asian languages (especially the CJK languages: Chinese, Japanese, and Korean) that don’t use the Roman alphabet.
Another hopeful attempt at standardization
Since no one could agree on one particular encoding at the time, another standardization effort began in the late 1980s. Fifteen sets of 8-bit characters were created to cover scripts such as Cyrillic, Arabic, Hebrew, and Thai. They are known as ISO 8859-1 to ISO 8859-16 (ISO 8859-12 was abandoned).
So, if a client from, let’s say, Estonia sends you a document, you need to know which code page it uses. If the file is viewed using the wrong code page, it will be nothing but mumbo-jumbo. For example, character 188 could be Œ or ¼ depending on the ISO code page.
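The Œ-versus-¼ clash above is easy to reproduce, since Python ships codecs for the ISO 8859 family (a small sketch):

```python
# Byte 188 (0xBC) means different things in two ISO 8859 code pages.
raw = bytes([188])
print(raw.decode('iso8859-1'))    # '¼' in ISO 8859-1 (Latin-1)
print(raw.decode('iso8859-15'))   # 'Œ' in ISO 8859-15 (Latin-9)
```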
However, one disadvantage of these ISO standards is that they could not cover East Asian (CJK) scripts, which require many thousands of code points. And since an 8-bit system only has 256 values, these languages had to adopt entirely different encoding systems.
Call for rescue
As you can see, every step toward solving this character encoding problem resulted in more complicated issues. At the same time, globalization and internationalization became part of the modern-day marketing industry, which meant there was a need for a universal standardized system that could put all this chaos to rest.
In the late 1980s, a new standard was introduced – a system in which one unique number (a code point) is assigned to every character of every language in the world. It is called Unicode, a universal encoding system that allows “everyone in the world to use their own language on phones and computers.”
First proposed by Joe Becker in 1988, Unicode is currently at version 13.0 and now defines more than 140,000 code points. For those who are curious, you can check out the latest set of Unicode character code charts.
Generally, the first 128 codes are the same as ASCII, and slots 128 to 255 contain currency symbols, punctuation marks, and other accented characters. Greek, Cyrillic, Hebrew, Arabic, Indic scripts, and Thai start after number 880, whereas Chinese, Japanese, and Korean start from slot 11904 and beyond.
Since each character has its own unique number, there are no more conflicts between code points. That means the Cyrillic Я will always be 1071, and the Greek α will always be 945.
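You can verify those two code points in any language that speaks Unicode; in Python, for example:

```python
# Code points are fixed: the same character always gets the same number.
print(ord('Я'))             # 1071 -- Cyrillic capital letter Ya
print(ord('α'))             # 945  -- Greek small letter alpha
print(chr(1071), chr(945))  # 'Я α' -- and back again
```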
A few encoding forms – UTF-8, UTF-16, and UTF-32 (Unicode Transformation Formats that use 8-bit, 16-bit, and 32-bit code units, respectively) – were later introduced as well. It is not wrong to say that UTF-8 is the savior of the Internet, as around 90% of all websites use it as their standard encoding, one that can represent over a million possible code points.
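The difference between the three formats is easiest to see by encoding the same characters and counting the code units each one needs (an illustrative snippet; the big-endian variants are used here just to skip the byte-order mark):

```python
# How many code units each Unicode Transformation Format needs.
for ch in ['A', 'α', '€', '🍔']:
    utf8_bytes  = len(ch.encode('utf-8'))            # 8-bit units
    utf16_units = len(ch.encode('utf-16-be')) // 2   # 16-bit units
    utf32_units = len(ch.encode('utf-32-be')) // 4   # 32-bit units
    print(ch, utf8_bytes, utf16_units, utf32_units)
# 'A' takes 1 UTF-8 byte, '🍔' takes 4 -- while UTF-32 always uses one unit.
```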
Adapting to the dot-com era
IT in the 21st century has advanced drastically compared to the 1970s. Instead of 8-bit microprocessors, computers these days use 64-bit processors. The Internet has become the core of our digitized lifestyle, and communicating in more than one language is now the norm. This also means that apps and web browsers are used more than ever before.
But as technology evolves, new challenges arise too. Despite continuous updates to the standard, we still have problems viewing text in different languages. This character display problem can occur if the application or browser you are using does not have the right font to display all of the characters, or because there is a mismatch between the encoding your device assumes and the encoding the text was actually saved in.
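The second cause, an encoding mismatch, is easy to simulate: decode UTF-8 bytes as if they were Latin-1 and you get the classic garbled output (a minimal sketch):

```python
# Text saved as UTF-8 but read as Latin-1 produces familiar mojibake.
text = "café"
garbled = text.encode('utf-8').decode('latin-1')
print(garbled)   # 'cafÃ©' -- the two UTF-8 bytes of 'é' shown separately
```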
Either way, it is not uncommon for internet users to occasionally see odd blank spaces, question marks, and boxes. But this character encoding problem is just one part of the challenge in this era of digital communication.
Open our article “The Culprit Behind the Unicode Character Display Problem and How to Solve It” in a new tab to read more about it later.
Evolution of modern-day communication: Emoji
The birth of the Unicode encoding system was the result of several attempts to reduce the conflicts between different encoding systems. But we are not using teletypes and bulky IBM PCs to send messages anymore, are we? We now live in an era where everything is communicated instantly – through instant messaging, for example.
Gone are the days when we wrote lengthy emails and long-winded letters to our loved ones. As our daily communication has shifted to digital platforms, emojis have become our go-to symbols for expressing our thoughts and moods. You know what they say: a picture is worth a thousand words, but in this case, it is emojis that do the job.
This comic strip amusingly shows how different Unicode’s role was when it was first invented compared to the present day. We have to admit that the sarcasm here is pretty on point.
The Unicode Consortium (the organization that publishes the Unicode Standard) decides which emojis are included in the standard, but their appearance is up to each vendor. This explains why emojis look different on every device. If you haven’t noticed the differences, check this out:
Here is a small quiz for you. Take a really good look at all the hamburger emojis and tell us which burger is the odd one out (hint: check the ingredients).
The correct answer: Google’s burger. This IT-related topic sparked controversy not only among internet users but also among foodies. Their main concern has nothing to do with the Unicode encoding system anymore; it is about whether the meat patty should sit above or below the sliced cheese.
This debate caused such a stir that even Google CEO Sundar Pichai weighed in on Twitter, and the legendary McDonald’s also took a side in this cheese-on-patty battle.
No need to change a thing. Stick to what you know guys. Keep doing what you’re doing, and we’ll make burger icons. #BurgerGate #BurgerEmoji pic.twitter.com/xTe7SmaflQ
— McDonald's Sverige (@mcdse) October 30, 2017
This hamburger emoji debate between Google and Apple is a perfect example of how our communication has evolved from plain text messages to digital pictographs that represent us and our emotions.
Currently, there are a total of 3,304 emojis in the Unicode Standard version 13.0, and you can check out the latest emoji additions for 2020 in this video:
Moving forward
Unicode came into play after several failed attempts to fix unsustainable, broken character encoding systems. It was a change that everyone needed – linguists, marketers, software developers, you name it.
Thanks to Unicode and its continuous updates, someone from Germany can now read an email sent by a Japanese client, or a mobile internet user from Thailand can read a Facebook comment written in Russian.
From standardizing texts to emojis, the history of Unicode is proof of a drastic transformation in our lifestyle and communication over the last thirty years.
It is not yet a perfect solution to the character display problem, but it is by far the best and most advanced one available today. If you want to solve the Unicode display problem on Windows and Mac, “The Culprit Behind the Unicode Character Display Problem and How to Solve It” is the right article for you.
How did you like May Thawdar Oo’s blog post “Unicode: The Journey From Standardizing Texts to Emojis”? Let us know in the comments if you have anything to add, have another content idea for iGaming blog posts, or just want to say “hello.” 🙂