Why a Tweet in Japanese is Worth 2.2 in English
Why a Tweet in Japanese carries the information of 2.2 tweets in English
By Cesar A. Hidalgo | The MIT Media Lab
Everyone who has been sucked into the rabbit hole of Twitter has learned to love and hate its highly constrained format. We have only 140 characters to make a point, add a reference, and acknowledge others through HT citations and @mentions. According to information theory, however, the amount of information that can be conveyed with 140 characters depends strongly on the size of the alphabet used. This makes Twitter the perfect medium to explain one of the most basic concepts in information theory: the quantity of information embedded in a message.
To keep things simple we will begin with English. The English alphabet has 26 letters, but to get a more meaningful number I’ll consider six extra characters: the space ( ), the period (.), the comma (,), the slash (/), the at sign (@) and the colon (:). This gives us a perfectly round number of 32 (you will realize why I call 32 a round number in a sec).
If we limit ourselves to this 32-character alphabet, and note that we are free to choose any of its characters at each of the 140 positions, we see that there are 32^140 possible tweets. To put this number in a slightly more familiar form, that is 2^700, or approximately 5.3×10^210, possible tweets. These numbers are so big that cliché comparisons make little sense. The GDP of the world in cents of U.S. dollars is a mere 7×10^15, and the age of the universe in seconds is on the order of 10^17. So there are a lot of possible tweets that can be composed by choosing 140 characters from a 32-character alphabet. But how much information is contained in each of them?
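If you would like to check these numbers yourself, here is a minimal sketch in Python. It relies on Python's arbitrary-precision integers to count the tweets exactly; the variable names are mine and are only there for illustration.

```python
# Count the tweets that can be written with a 32-symbol alphabet
# and 140 character positions.
ALPHABET_SIZE = 32   # 26 letters plus the six extra characters
TWEET_LENGTH = 140

# Python integers are unbounded, so this is the exact count.
possible_tweets = ALPHABET_SIZE ** TWEET_LENGTH

# 32 = 2^5, so 32^140 = 2^(5 * 140) = 2^700.
same_as_power_of_two = possible_tweets == 2 ** 700
print(same_as_power_of_two)          # True

# Number of decimal digits in the count.
digits = len(str(possible_tweets))
print(digits)                        # 211

# Scientific notation, rounded to two decimals.
print(f"{possible_tweets:.2e}")      # about 5.26e+210
```

A number with 211 digits is larger than 10^210, which is why comparisons with GDP or the age of the universe do not get us very far.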
According to Claude Shannon, the information embedded in a tweet is its ability to inform a receiver which of the 2^700 possible tweets was transmitted. It is the reduction of uncertainty we obtain in the set of possible messages. To understand what this means, consider a simple game in which you have a book containing each of the 2^700 possible tweets and you pick one tweet at random. Now, my goal is to guess which tweet you chose using only yes or no questions. How many questions do I need to ask to guess your tweet? Well, that number of questions is the information contained in a tweet, measured in bits.
An easy, although boring, way to guess a tweet is to guess each character in turn. For simplicity, and with no loss of generality, assume that we map each character to a number between 1 and 32 (a=1, b=2, etc.). Now, I can guess your tweet by asking you the following set of questions: Is the first letter larger than 16? If you say no, I will ask: Is it larger than 8? If you say no, I will ask: Is it larger than 4? If you say yes, I will ask: Is it larger than 6? If you say yes, I will ask: Is it larger than 7? And if you say no, I will know that the first letter is letter number 7.
As you can see from this example, every time I ask a question I cut the search space in half. So how many questions do I need to guess your tweet? Well, to identify one character out of an alphabet of 32, or 2^5, I need 5 yes or no questions, and since there are 140 characters for me to guess I will need to ask only 140*5=700 yes or no questions. As you see, the little number that was sitting on top of the 2 is the information content of the message, and it is what will help us compare tweets in different languages.
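The guessing game is just a binary search, and it can be sketched in a few lines of Python. This is an illustration under my own assumptions: the function names `guess_symbol` and `guess_tweet`, the ordering of the six extra characters, and the sample tweet are all made up for the example.

```python
# The 32-symbol alphabet: 26 letters plus space, period, comma,
# slash, at sign and colon (this ordering is an arbitrary choice).
ALPHABET = "abcdefghijklmnopqrstuvwxyz .,/@:"


def guess_symbol(secret, alphabet_size=32):
    """Find `secret` (a number from 1 to alphabet_size) by asking only
    'is it larger than x?' questions. Returns (guess, questions_asked)."""
    low, high = 1, alphabet_size
    questions = 0
    while low < high:
        mid = (low + high) // 2
        questions += 1
        if secret > mid:   # the answer is "yes"
            low = mid + 1
        else:              # the answer is "no"
            high = mid
    return low, questions


def guess_tweet(tweet):
    """Guess every character of `tweet` and count all the questions."""
    guessed = []
    total_questions = 0
    for ch in tweet:
        secret = ALPHABET.index(ch) + 1          # a=1, b=2, ...
        number, asked = guess_symbol(secret, len(ALPHABET))
        guessed.append(ALPHABET[number - 1])
        total_questions += asked
    return "".join(guessed), total_questions


print(guess_symbol(7))                 # (7, 5)

# A hypothetical tweet, padded with periods to exactly 140 characters.
tweet = "hello twitter".ljust(140, ".")
recovered, questions = guess_tweet(tweet)
print(recovered == tweet, questions)   # True 700
```

Because 32 is a power of two, every character takes exactly 5 questions no matter which one it is, so any 140-character tweet costs 700 questions.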
This example illustrates a very well known formula that was introduced by Ralph Hartley in the 1920s and generalized about two decades later by Claude Shannon, who turned it into a probabilistic statement. The formula states that the information content of a message is equal to the number of characters (140 in the case of Twitter) times the logarithm in base two of the size of the alphabet (which is 5 in this case, since 2^5=32). This example also illustrates that even though there are gazillions of possible tweets (technically more than a googol squared), identifying each tweet requires a very small number of yes or no questions. In fact, fewer than one thousand!
So, how much information is there in a Japanese tweet? If we consider only the Jōyō kanji, the official list of 2,136 kanji published by the Japanese Ministry of Education, and note that 2,136≈2^11.06, then we can conclude that guessing each Japanese character requires about 11.06 yes or no questions. So the information content of a Japanese tweet is 140*11.06≈1548.5, which is 2.2 times the information content of an English tweet.
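The comparison between the two languages is a one-line formula, so it is easy to reproduce. Here is a small sketch; the function name `hartley_bits` is my own label for the formula described above, not a standard library call.

```python
import math


def hartley_bits(length, alphabet_size):
    """Information content, in bits, of a message of `length` characters
    drawn from `alphabet_size` equally likely symbols."""
    return length * math.log2(alphabet_size)


english = hartley_bits(140, 32)       # 26 letters + 6 extra characters
japanese = hartley_bits(140, 2136)    # Joyo kanji only

print(english)                        # 700.0
print(round(japanese, 1))             # 1548.5
print(round(japanese / english, 2))   # 2.21
```

Note that the Japanese figure counts only the kanji. Adding kana or punctuation to the alphabet would make the logarithm a little larger, so the ratio depends on which characters you decide to include.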
So the next time you are editing the heck out of your tweet to hammer it into 140 characters, just remember: if you knew Japanese, your tweet could be a bit more like a short essay.
*The formula used here assumes all characters have the same probability. More refined estimates would include the probability of each character, the probability of observing character pairs, etc. Such revisions will reduce our estimate for the quantity of information, so the numbers presented here are upper bounds for the information contained in a tweet.
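To see why the equal-probability estimate is an upper bound, here is a rough sketch that measures the per-character Shannon entropy of a piece of text from its character frequencies and compares it with the logarithm of the number of distinct characters. The sample sentence is arbitrary and far too short to represent real English statistics; it is only there to show the direction of the effect. The function name is my own.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(text):
    """Shannon entropy, in bits per character, estimated from the
    single-character frequencies observed in `text`."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# An arbitrary sample; not a serious model of English.
sample = "the quick brown fox jumps over the lazy dog. " * 3

observed = entropy_bits_per_symbol(sample)
# What we would get if every distinct character were equally likely.
uniform_bound = math.log2(len(set(sample)))

print(observed <= uniform_bound)   # True
print(round(uniform_bound - observed, 2), "bits per character below the bound")
```

The two numbers coincide only when every character appears equally often. Any unevenness in the frequencies, and any dependence between neighboring characters, pushes the real information content below the estimate.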