How big is your language?

This blog post first appeared, as written by me, on The Copernican science blog on December 20, 2012.


It all starts with Zipf’s law. Ever heard of it? It’s a devious little thing, especially when you apply it to languages.

Zipf’s law states that the chance of finding a word of a language in all the texts written in that language is inversely proportional to the word’s rank in the frequency table. In other words, the most frequent word is twice as likely to turn up as the second most frequent word, thrice as likely as the third most frequent word, and so on.
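To make the arithmetic concrete, here is a minimal Python sketch of an idealised Zipf distribution. The function name `zipf_probabilities`, the exponent of exactly 1, and the 10,000-word vocabulary are assumptions chosen for illustration; real languages only approximate this.

```python
def zipf_probabilities(vocabulary_size, exponent=1.0):
    """Ideal Zipf probability for each rank 1..vocabulary_size (illustrative only)."""
    # Weight of a word is 1 / rank**exponent; normalise so the weights sum to 1.
    weights = [1.0 / rank ** exponent for rank in range(1, vocabulary_size + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


probs = zipf_probabilities(10_000)  # hypothetical 10,000-word vocabulary

# How many times likelier is the top word than the word at each rank?
for rank in (2, 3, 10):
    ratio = probs[0] / probs[rank - 1]
    print(f"rank 1 vs rank {rank}: about {ratio:.2f}x")
```

Under this idealisation the ratio between the top word and the word at rank r is simply r, which is the "twice, thrice, and so on" pattern described above.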

Unfortunately (only because I like how “Zipf” sounds), the law holds only until about the 1,000th most common word; after this point, a log-log plot of frequency against rank stops being linear and starts to curve.
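One way to look for that bend is to fit a straight line to log(frequency) against log(rank) and see whether the slope changes once rarer words are included. The sketch below is only an illustration: `loglog_slope` is a hypothetical helper, the least-squares fit is my choice of method, and the toy tokens are synthetic rather than drawn from any real corpus.

```python
import math
from collections import Counter


def loglog_slope(tokens, max_rank=None):
    """Least-squares slope of log(count) against log(rank) for the top max_rank words."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    if max_rank is not None:
        counts = counts[:max_rank]
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic text in which the word at rank r appears exactly 2520 / r times.
tokens = [f"w{rank}" for rank in range(1, 11) for _ in range(2520 // rank)]
slope = loglog_slope(tokens)
print(slope)  # close to -1 for a perfectly Zipfian toy text
```

On a real corpus, comparing the slope for the first 1,000 ranks with the slope over a much larger range would show whether, and in which direction, the line bends.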

The importance of this break is that if Zipf’s law fails to hold for a large corpus of words, then the language, at some point, must be making some sort of distinction between common and exotic words, and its need for new words must either be increasing or decreasing. This is because, if the need remained constant, then the distinction would be impossible to define except empirically and never conclusively – going against the behaviour of Zipf’s law.

Consequently, the chances of finding the 10,000th most frequent word won’t be 10,000 times lower than the chances of finding the most frequently used word, but either much lower or much higher than that.

A language’s diktat

Analysing each possibility, i.e., if the chances of finding the 10,000th-most-used word are NOT 10,000 times less than the chances of finding the most-used word but…

  • Greater (i.e., The Asymptote): The language must have a long tail, also called an asymptote. Think about it. If the rarer words are all used almost as frequently as each other, then they can all be bunched up into one set, and when plotted, they’d form a straight line almost parallel to the x-axis (rank), a sort of tail attached to the rest of the plot.
  • Lesser (i.e., The Cliff): After expanding to include a sufficiently large vocabulary, the language could be thought to “drop off” the edge of a statistical cliff. That is, at some point, there will be words that exist and mean something, but will almost never be used because syntactically simpler synonyms exist. In other words, in comparison to the usage of the first 1,000 words of the language, the (hypothetical) 10,000th word would be used negligibly.

The former possibility is more likely – that the chances of finding the 10,000th-most-used word would not be as low as 10,000-times less than the chances of encountering the most-used word.

As a language expands to include more words, it is likely that it issues a diktat to those words: “either be meaningful or go away”. And as the language’s tail grows longer, as more exotic and infrequently used words accumulate, the need for the words that lie farther from Zipf’s domain drops off faster over time.

Another way to quantify this phenomenon is through semantics (and this is a far shorter route of argument): As the underlying correlations between different words become more networked – for instance, attain greater betweenness – the need for new words is reduced.

Of course, the counterargument here is that there is no evidence to establish if people are likelier to use existing syntax to encapsulate new meaning than they are to use new syntax. This apparent barrier can be resolved by what is called the principle of least effort.

Proof and consequence

While all of this has been theoretically laid out, there had to have been many proofs over the years because the object under observation is a language – a veritable projection of the right to expression as well as a living, changing entity. And in the pursuit of some proof, on December 12, I spotted a paper on arXiv that claims to have used an “unprecedented” corpus (Nature scientific report here).

Titled “Languages cool as they expand: Allometric scaling and the decreasing need for new words”, it was hard to miss amid papers with titles such as “Trivial symmetries in a 3D topological torsion model of gravity”.

The abstract of the paper, by Alexander Petersen from the IMT Lucca Institute for Advanced Studies, et al, has this line: “Using corpora of unprecedented size, we test the allometric scaling of growing languages to demonstrate a decreasing marginal need for new words…” This is what caught my eye.

While it’s clear that Petersen’s results have been established only empirically, the fact that their corpus includes all the words in books written in English between 1800 and 2008 indicates that the set of observables is almost as large as it can get.

Second: When speaking of corpuses, or corpora, the study has also factored in Heaps’ law (apart from Zipf’s law), and found that there are some words that obey neither Zipf nor Heaps but are distinct enough to constitute a class of their own. This is also why I underlined the word common earlier in this post. (How Petersen, et al, came to identify this is interesting: They observed deviations in the lexicon of individuals diagnosed with schizophrenia!)

Heaps’ law, also called the Heaps-Herdan law, states that the chances of discovering a new word in one large instance-text, like one article or one book, shrink as the size of the instance-text grows. It’s like a combination of the sunk-cost fallacy and Zipf’s law.

It’s a really simple law, too, and makes a lot of sense even intuitively, but the ease with which it’s been captured statistically is what makes the Heaps-Herdan law so wondrous.
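The effect is easy to reproduce on a toy text. In the sketch below, `vocabulary_growth` is a hypothetical helper that counts distinct words seen so far, and the "text" is 20,000 tokens drawn at random from an idealised Zipf distribution over 5,000 made-up words; both numbers and the fixed seed are assumptions for illustration, not anything taken from the paper.

```python
import random


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token of the text."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


rng = random.Random(0)  # fixed seed so the toy text is reproducible
vocabulary = list(range(1, 5001))  # made-up words, identified by rank
weights = [1.0 / rank for rank in vocabulary]  # idealised Zipf weights
tokens = rng.choices(vocabulary, weights=weights, k=20_000)

growth = vocabulary_growth(tokens)
first_half_new = growth[9_999]                # new words in the first 10,000 tokens
second_half_new = growth[-1] - growth[9_999]  # new words in the next 10,000 tokens
print(first_half_new, second_half_new)
```

The second half of the text contributes far fewer new words than the first, which is the sub-linear growth that Heaps’ law describes.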

The sub-linear Heaps' law plot: Instance-text size on x-axis; Number of individual words on y-axis.

Falling empires

And Petersen and his team establish in the paper that, extending the consequences of Zipf’s and Heaps’ laws to massive corpora, the larger a language is in terms of the number of individual words it contains, the slower it will grow and the less cultural evolution it will engender. In the words of the authors: “… We find a scaling relation that indicates a decreasing ‘marginal need’ for new words which are the manifestations of cultural evolution and the seeds for language growth.”

However, for the class of “distinguished” words, there seems to exist a power law – one that results in a non-linear graph unlike Zipf’s and Heaps’ laws. This means that as new exotic words are added to a language, the need for them, as such, is unpredictable and changes over time for as long as they are away from the Zipf’s law’s domain.

All in all, languages eventually seem an uncanny mirror of empires: The larger they get, the slower they grow, the more intricate the exchanges become within them, the fewer reasons there are to change, until some fluctuations are injected from the world outside (in the form of new words).

In fact, the mirroring is not so uncanny considering both empires and languages are strongly associated with cultural evolution. Ironically enough, it is the possibility of cultural evolution that very meaningfully justifies the creation and employment of languages, which means that at some point, languages become so bloated that they stop germinating new ideas and instead start to suffocate such initiatives.

Does this mean the extent to which a culture centered on a language has developed and will develop depends on how much the language itself has developed and will develop? Not conclusively – as there are a host of other factors left to be integrated – but it seems a strong correlation exists between the two.

So… how big is your language?