On Linguistic Fingerprinting

Can an author’s writing style be defined by the frequency of unique words in their writings? According to physicist Sebastian Bernhardsson, the answer is yes. He found a couple of interesting facts: 1) the more we write, the more we repeat words and 2) the rate of repetition (or rate of change) seems to be unique to individual authors (creating a “linguistic fingerprint”… literally his words, not mine). Let me walk through his claims and findings, just a bit.

Bernhardsson undertook a linguistics study that compared the rate of unique words across short- and long-form writing (short stories vs. novels vs. corpora).

The idea of linguistically fingerprinting authors has been around for a while. In some ways it acted as a loss leader decades ago, piquing interest in the use of corpora and statistical methods to study language. Now there is even a whole journal called Literary and Linguistic Computing. Plus, there is an established practice of forensic linguistics, where linguistic methods are used to establish authorship of critical legal documents.

However, Bernhardsson makes a bold claim. He claims that the process of writing (a cognitively complex process) can be described as the process of pulling chunks out of a large meta book, which shows the same statistical regularities as an author's real work (he hedges on this a bit, of course).

The meta book and size-dependent properties of written language. Authors: Sebastian Bernhardsson, Luis Enrique Correa da Rocha, Petter Minnhagen. New Journal of Physics (2009):

“When the length of a text is increased, the number of different words is also increased. However, the average usage of a specific word is not constant, but increases as well. That is, we tend to repeat the words more when writing a longer text. One might argue that this is because we have a limited vocabulary and when writing more words the probability to repeat an old word increases. But, at the same time, a contradictory argument could be that the scenery and plot, described for example in a novel, are often broader in a longer text, leading to a wider use of one's vocabulary. There is probably some truth in both statements but the empirical data seem to suggest that the dependence of N (types) on M (tokens) reflects a more general property of an author's language.”
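To make the N (types) and M (tokens) bookkeeping from the quote concrete, here is a minimal Python sketch. The tokenizer, the function names, and the toy sentence are my own assumptions for illustration, not anything from the paper; real studies make more careful choices about what counts as a word.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes.
    This is a deliberately crude tokenizer (an assumption for illustration)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_curve(tokens):
    """Entry i holds N, the number of distinct words among the first i + 1 tokens (M = i + 1)."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve

# Toy text, made up for this example.
sample = "the cat sat on the mat and the dog sat on the cat"
tokens = tokenize(sample)
curve = type_token_curve(tokens)

M = len(tokens)   # total words written (tokens)
N = curve[-1]     # distinct words used (types)
print(f"M = {M}, N = {N}, average usage per word M/N = {M / N:.2f}")
```

Even in this tiny sentence, N grows more slowly than M, so the average number of times each word is used (M/N) creeps up as the text gets longer. That is the repetition effect the quote describes, just at toy scale.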

These findings lead us towards the meta book concept: The writing of a text can be described by a process where the author pulls a piece of text out of a large mother book (the meta book) and puts it down on paper. This meta book is an imaginary infinite book which gives a representation of the word frequency characteristics of everything that a certain author could ever think of writing. This has nothing to do with semantics and the actual meaning of what is written, but rather to the extent of the vocabulary, the level and type of education and the personal preferences of an author. The fact that people have such different backgrounds, together with the seemingly different behavior of the function N(M) for the different authors, opens up for the speculation that every person has its own and unique meta book, in which case it can be seen as a fingerprint of an author.
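The "pull a piece out of a big book" picture can be played with in a few lines of Python. This is only a rough sketch under loud assumptions: the stand-in meta book below is a finite list of made-up words drawn with Zipf-like (1/rank) weights, which is my simplification and not the model the authors actually fit. All names (`build_toy_meta_book`, `mean_types`, the vocabulary size, the chunk lengths) are hypothetical.

```python
import random

def build_toy_meta_book(vocab_size=500, length=20000, seed=1):
    """Stand-in for an author's meta book: words w1..wN drawn with 1/rank weights.
    The Zipf-like weighting is an assumption made purely for illustration."""
    rng = random.Random(seed)
    vocab = [f"w{rank}" for rank in range(1, vocab_size + 1)]
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    return rng.choices(vocab, weights=weights, k=length)

def mean_types(meta_book, m, trials=200, seed=0):
    """Average number of distinct words N in a contiguous chunk of m tokens
    pulled from a random position in the meta book."""
    rng = random.Random(seed)
    last_start = len(meta_book) - m
    total = 0
    for _ in range(trials):
        start = rng.randint(0, last_start)
        total += len(set(meta_book[start:start + m]))
    return total / trials

meta_book = build_toy_meta_book()
n_small = mean_types(meta_book, 100)     # "short story" sized chunk
n_mid = mean_types(meta_book, 1000)      # "novel" sized chunk
n_large = mean_types(meta_book, 10000)   # "collected works" sized chunk

for m, n in [(100, n_small), (1000, n_mid), (10000, n_large)]:
    print(f"M = {m:>5}: N(M) = {n:7.1f}, average usage M/N = {m / n:5.2f}")
```

In this toy setup, longer chunks contain more distinct words but also reuse each word more often on average. Swapping in a different weighting or vocabulary size changes the shape of N(M), which is a loose analogue of the idea that each author's curve could look different.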



