Dictionary.com

How Can Algorithms Help Us Understand Books?


Recently the Sunday Times outed J.K. Rowling as the author of the detective novel The Cuckoo’s Calling, published under her nom de plume Robert Galbraith. While devotees of Rowling quickly procured and binge-read her latest work, linguists and language lovers worldwide celebrated the computational analysis of the two scholars who helped reveal the true author of the book in question.

Patrick Juola (Duquesne University) and Peter Millican (Oxford University) were both approached by a Times reporter to compare The Cuckoo’s Calling with the novels of J.K. Rowling and three other possible authors. In a guest post on Language Log, Juola describes his process. He first explains the theory of “forensic stylometry”: “language is a set of choices, and speakers and writers tend to fall into habitual, or at least common, choices.” By running tests on variables (such as distribution of word length, percentage of the 100 most common words, and frequency of pairs of adjacent words), Juola found that though the results were “mixed,” they suggested Rowling as the most likely author. Millican ran computational tests to arrive at the same conclusion, discovering along the way that Rowling is less likely to use the phrase “as soon as” than the three other writers examined.
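
To make those three variables concrete, here is a minimal sketch of how they might be computed for a single text. It is not Juola's or Millican's actual code: the function names, the simple tokenizer, and the reading of "100 most common words" as a relative-frequency profile are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (letters, with inner apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def word_length_distribution(tokens):
    """Share of all tokens that have each word length."""
    counts = Counter(len(t) for t in tokens)
    total = len(tokens)
    return {length: n / total for length, n in sorted(counts.items())}


def common_word_profile(tokens, top_n=100):
    """Relative frequency of each of the top_n most frequent words (assumed reading)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.most_common(top_n)}


def bigram_counts(tokens):
    """Counts of each pair of adjacent words."""
    return Counter(zip(tokens, tokens[1:]))


# Toy sentence, invented for this example
tokens = tokenize("The cat sat on the mat as soon as the dog sat")
print(word_length_distribution(tokens))
print(bigram_counts(tokens)[("as", "soon")])
```

A real comparison would compute these profiles for the disputed book and for each candidate author's known work, then measure how close the profiles are. That comparison step is left out of this sketch.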

Rowling is not the first mystery writer to have her text subjected to the exacting analysis of computational linguistics and its complex algorithms. One episode of the WNYC show Radiolab features Ian Lancashire, a professor at the University of Toronto, who made a startling discovery about Agatha Christie upon running computational word-frequency and vocabulary tests on her novels. In her 73rd detective novel, her vocabulary decreased by a shocking 20% from that of her previous 72 novels. Additionally, Christie’s use of words such as “thing,” “anything,” “something,” and “nothing” increased sixfold. Lancashire concluded that Christie’s 73rd novel, appropriately titled Elephants Can Remember, marked the onset of Alzheimer’s for this cherished author, who was never diagnosed in her lifetime. Lancashire told Radiolab, “I was seeing the author in the text in a way that people haven’t seen the author in the text before.”
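
The two measurements described here, vocabulary size and the rate of indefinite words, can be sketched in a few lines. This is only an illustration under stated assumptions: Lancashire's actual procedure is not given in this article, and the function names and the fixed-size sample (used so that books of different lengths can be compared fairly) are hypothetical choices.

```python
import re

# The four indefinite words named in the article
INDEFINITE_WORDS = {"thing", "anything", "something", "nothing"}


def tokenize(text):
    """Lowercase the text and return its words (letters, with inner apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def vocabulary_size(tokens, sample_size=None):
    """Number of distinct words in the first sample_size tokens (all tokens if None)."""
    sample = tokens if sample_size is None else tokens[:sample_size]
    return len(set(sample))


def indefinite_rate(tokens):
    """Fraction of tokens that are one of the indefinite words."""
    return sum(t in INDEFINITE_WORDS for t in tokens) / len(tokens)


def relative_change(old, new):
    """Change from old to new as a fraction of old (-0.2 means a 20% drop)."""
    return (new - old) / old


# Toy sentences, invented for this example (not quotations from Christie)
early = tokenize("The detective examined the peculiar footprints near the conservatory")
late = tokenize("Something about the thing was nothing like anything")
print(vocabulary_size(early), indefinite_rate(late))
```

Comparing the same number of words from each novel matters because a longer book will nearly always contain more distinct words than a shorter one.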

This kind of textual analysis enabled by computers can give readers a richer understanding of books and the authors behind those works. One paper by researchers at the Federal Technological University of Paraná (Brazil) and the University of Aberdeen (UK) explores the social network in the Odyssey, comparing it with modern social networks to suggest that Homer’s epic is based, in part, on actual events. A visualization of character co-occurrences in Les Misérables created by Mike Bostock helps readers instantly understand the interrelationships of characters in a way that is much more subtle when reading the book.
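
The data behind a co-occurrence visualization is a count of how often each pair of characters appears together. A minimal sketch, assuming each chapter or scene has already been reduced to a list of the characters in it; the function name and the toy scene lists below are made up for illustration and are not taken from the novel or from Bostock's dataset.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(scenes):
    """Count, for every pair of characters, the number of scenes they share."""
    counts = Counter()
    for characters in scenes:
        # Sort so each pair has one canonical order; set() ignores repeats in a scene
        for pair in combinations(sorted(set(characters)), 2):
            counts[pair] += 1
    return counts


# Hypothetical scene lists, invented for this example
scenes = [
    ["Valjean", "Javert"],
    ["Valjean", "Cosette", "Marius"],
    ["Javert", "Valjean"],
]
counts = cooccurrence_counts(scenes)
print(counts[("Javert", "Valjean")])
```

Each pair with a nonzero count becomes a link in the network, and the count can set how strongly that link is drawn.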

The Google Ngram Viewer is an excellent resource for language lovers, historians, or sociologists who wish to look at more than just one book; it allows users to search the various Google Books’ corpora (collections of words and texts) to understand trends of word usage over time, often providing insight into social and cultural implications of these trends. Recently the term Popemobile was added to Dictionary.com. As part of the research for that new entry, lexicographers used the Google Ngram Viewer to generate a visualization of when this word first started appearing in English-language books—the mid-1970s. We can also learn from this graph that Popemobile appears more frequently with an initial capital letter than in all lowercase type. This data helps Dictionary.com provide the most accurate and high-quality definitions for our users.
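
The two questions asked of the graph, when a word first appears and which capitalization is more common, can be illustrated with a small sketch. It does not use the Google Ngram Viewer or its data: the function names are hypothetical, and the dated snippets are invented toy input standing in for a corpus of books with publication years.

```python
import re
from collections import Counter, defaultdict


def case_variant_counts_by_year(dated_texts, word):
    """Map each year to a Counter of the spellings of `word` found (case preserved)."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    by_year = defaultdict(Counter)
    for year, text in dated_texts:
        by_year[year].update(pattern.findall(text))
    return dict(by_year)


def first_year_seen(by_year):
    """Earliest year with at least one occurrence, or None if the word never appears."""
    years = [year for year, counts in by_year.items() if sum(counts.values()) > 0]
    return min(years) if years else None


# Invented snippets and years, for illustration only (not real corpus data)
dated_texts = [
    (1970, "The pope waved to the crowd."),
    (1976, "The Popemobile arrived."),
    (1980, "A popemobile and another Popemobile."),
]
by_year = case_variant_counts_by_year(dated_texts, "popemobile")
print(first_year_seen(by_year), by_year[1980])
```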

From revealing the true author of mystery books to helping lexicographers write better definitions, technology quickly illuminates books in ways that might have taken a lifetime of research without the aid of computers. Writers who wish to stay anonymous can attempt to outsmart stylometry experts—there’s even a program being developed for this very purpose called Anonymouth. Perhaps J.K. Rowling will use a tool like this to disguise her writing the next time she decides to clandestinely break into a new genre.

Do you use technology to help aid your understanding of literature? Let us know in the comments.

17 Comments

  1. Katzz -  August 27, 2013 - 2:36 am

    Very fascinating article. The part about how computational analysis detected Alzheimer's was especially interesting. Well, I use Google to search for something I don't understand in a piece of literature (this is part of technology too).

  2. Holy -  August 6, 2013 - 9:20 pm

    NPB,

    The subject of the sentence

    “This kind of data helps Dictionary.com provide the most accurate and high-quality definitions for our users.”

    is “kind,” which is singular. Thus, the third-person singular verb “helps” is correct. “Data” in this sentence is the object of the preposition “of” and thus cannot be the subject.

    Your proposed sentence “These kind of data help. . .” has a multiplicity of errors! Since “kind” is singular, it takes both the singular demonstrative adjective “this” and not the plural “these” and also the verb “helps.” Your last sentence is correct, but you just got lucky there, in that both the subject “kind” and prepositional object, which you incorrectly thought to be the subject, are singular, and thus take the singular verb “helps.”

    Your assertion that “data” must always be plural flies in the face of long-running usage that has converted it into a mass noun, used as a singular (and still sometimes as a plural).

    You might want to refresh your basic grammar knowledge before “correcting” others. And you also might want to look up Muphry’s Law: http://en.wikipedia.org/wiki/Muphry%27s_law.

  3. Bruhaha -  August 6, 2013 - 2:09 pm

    In “correcting” the grammar of Dictionary.com, NPB makes his/her own mistake. While it’s true the word “data” is plural, the word “kind” is singular. “These kind of data” is therefore an abomination.

  4. Olaf Singursson -  August 6, 2013 - 1:54 pm

    @NPB I’m sure you meant to write would ‘have’ gotten.

    These comments are disappointing and also, at times, really weird. Which is a shame since the article is so fascinating and so engagingly written. I really enjoyed it. Thank you.

  5. Anand -  August 6, 2013 - 6:40 am

    Very interesting. Wonder whether Mr Ian Lancashire did the same with Christie’s last book, ‘Postern of Fate’. I suppose the conclusion would be similar. There were fifty-three long years between her first (mystery) book and the last one, and the finding is not very startling, considering that she was thirty years old versus 83, and faculties do diminish with age. It might be a little rash, however, to conclude from there that she had Alzheimer’s!

    Again, knowing how she functioned, the so called last books could have been written years ago, kept away, and then published when they were.

    Coming to NPB’s objection regarding singular usage of the word ‘data’, the Information Technology community is largely responsible for this anomaly. There are other words too, like ‘headquarters’ and ‘media’, which are used in singular, and are so accepted, though wrong grammatically. Change is the rule of life, and so of language, though puritans feel the loss.

  6. Joy -  August 6, 2013 - 3:36 am

    Great article! Yes, I do use technology to help my understanding of literature. I wonder where I would be without it.

  7. Arturo Pérez Rodríguez -  August 6, 2013 - 1:43 am

    I am interested in this “technology”, because I am an engineer.
    I now work very hard on designing and developing a new technology for electrical generators. Had I spare time, I would like to work in this field.
    The most attractive aspect of this computational analysis of writing is the discovery through it of “markers” of mental disorders or illnesses, such as Alzheimer’s, or the exploration of the operational structures of persons in order to select them for jobs of great responsibility, such as train drivers, political leaders or school teachers.

  8. Marshall Gass -  August 5, 2013 - 12:36 pm

    If Computational Analysis can detect diseases hidden in Linguistics then we surely can extend that discovery into other regions of thought, perhaps to detect AIDS, SARS, genetic disorders, common colds and flu, etc.? The word, after all, is derived from complexities in the brain (and probably from different regions?) and expressed through the mouth. These patterns, and the regions from which they were derived, can indicate normal or abnormal functioning of the linguistic brain, and perhaps they can be used to interpret other hidden symptoms not usually visible to the common eye? Good luck.

  9. NPB -  August 5, 2013 - 6:52 am

    Overall, this is a very interesting article. However, I take exception with the grammar usage of one sentence.

    “This kind of data helps Dictionary.com provide the most accurate and high-quality definitions for our users.”

    “Data” is the plural form of the word. Although it sounds wrong due to the ubiquity of incorrect usage, the sentence should begin with either “These kind of data help Dictionary.com …” or “This kind of datum helps Dictionary.com …” . I would have thought a dictionary website would gotten the grammar usage correct.

  10. Jon -  August 5, 2013 - 2:50 am

    Very interesting article. The usage of computers to find out the author was truly fascinating.

  11. Cyberquill -  August 4, 2013 - 1:43 am

    Do I use technology to help aid my understanding of literature? Yes, I use electric light when reading at night. By enabling me to see, this technology indirectly aids my understanding of the material.

  12. Erik Nelson -  August 3, 2013 - 8:20 pm

    This technology will put to rest for all time the notion that anyone but Shakespeare wrote Shakespeare.

  13. wavewalker -  August 3, 2013 - 4:04 am

    This is entirely fascinating! I wonder…what would the findings be if the Bible were subjected to such analysis? I expect we’d find something such as what we (at least the majority of us?) already seem to know: that God has not changed and never will, as this is something depicted clearly in the Scriptures.

  14. N.M. -  August 3, 2013 - 3:06 am

    Interesting…

    I must admit I have been most baffled by this issue. Perhaps in the near future, I might as well make use of this relatively novel methodology of writing.
    After all, the human faculty is far more circumscribed than I had ever imagined.

  15. Eniale -  August 2, 2013 - 9:59 am

    No I haven’t but will…very interesting.
