Encoding Shakespeare into DNA

February 21, 2013

It’s time to look at the language of life itself: DNA. As you might remember from 7th-grade science, DNA stands for deoxyribonucleic acid, the molecule that stores the genetic code for all life forms.

Scientists continue to wonder if this living blueprint is all that DNA can hold. Researcher Nick Goldman of the European Bioinformatics Institute (EBI) has recently stored all of Shakespeare’s 154 sonnets in DNA, and his synthetic double helix didn’t miss a line.

Because the alphabet of DNA is only four letters long (A, T, G, and C, representing the nucleobases adenine, thymine, guanine, and cytosine), English was not the best fit for the translation. Instead, researchers focused on binary code (the two-character mathematical system of 1s and 0s used by computers). With this technique Goldman’s team encoded text as well as audio files and images in the DNA macromolecule, including a 26-second audio clip of Dr. Martin Luther King Jr.’s “I Have a Dream” speech and a photo of the EBI facility.

Molecular geneticist George M. Church first attempted a translation of binary code into DNA at Harvard Medical School by encoding an HTML draft of his recent book Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves. But Church’s cipher was too simple (bases G and T represented 1s while A and C represented 0s), and it resulted in glitches when translating the DNA back into binary code.
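To make the one-bit-per-base idea concrete, here is a minimal Python sketch of the mapping described above (G or T for 1, A or C for 0). The function names are hypothetical, and picking between the two candidate bases with a seeded random generator is an assumption made for illustration; the article does not say how Church’s team chose between them.

```python
import random

# Mapping taken from the description above: 0 -> A or C, 1 -> G or T.
ZERO_BASES = "AC"
ONE_BASES = "GT"


def text_to_bits(text):
    """Return the UTF-8 bytes of `text` as a string of 0s and 1s."""
    return "".join(format(byte, "08b") for byte in text.encode("utf-8"))


def bits_to_dna(bits, rng=None):
    """Write one base per bit. The choice between the two allowed bases
    is random here, which is an assumption of this sketch."""
    rng = rng or random.Random(0)
    return "".join(
        rng.choice(ONE_BASES if bit == "1" else ZERO_BASES) for bit in bits
    )


def dna_to_bits(dna):
    """Read the bases back: G or T is a 1, A or C is a 0."""
    return "".join("1" if base in ONE_BASES else "0" for base in dna)


def bits_to_text(bits):
    """Regroup the bits into bytes and decode them as UTF-8."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


line = "Shall I compare thee to a summer's day?"
dna = bits_to_dna(text_to_bits(line))
restored = bits_to_text(dna_to_bits(dna))
```

Each character of plain English text costs eight bases in this scheme, and a single misread base flips a bit with nothing to catch it, which is the kind of weakness the paragraph above describes.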

Learning from Church’s mistake, Goldman developed a more complex code adapted to DNA’s natural tendency toward genetic variation. Goldman’s code makes every byte (or eight-character binary unit) represent a five-letter word out of As, Cs, Gs, and Ts. These combine to form strings of 117 letters. The DNA “sentences” overlap so that decoders can check against other strings if inconsistencies arise. Thus far the method has resulted in 100% accuracy.
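The overlap idea can be sketched in a few lines of Python. This is only an illustration of the principle, not Goldman’s actual scheme: the window of 8 letters and step of 2 are made-up values (the real strings are 117 letters long), the function names are hypothetical, and the majority vote used to settle disagreements is an assumption about how a decoder might check one string against the others.

```python
from collections import Counter


def split_overlapping(seq, window=8, step=2):
    """Cut `seq` into (offset, fragment) pairs. Neighbouring fragments
    share window - step letters, so most positions appear several times."""
    last_start = max(len(seq) - window, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail is covered
    return [(start, seq[start:start + window]) for start in starts]


def reassemble(fragments, length):
    """Rebuild the sequence by letting every fragment vote on each
    position it covers and keeping the most common letter."""
    votes = [Counter() for _ in range(length)]
    for offset, fragment in fragments:
        for i, base in enumerate(fragment):
            votes[offset + i][base] += 1
    return "".join(v.most_common(1)[0][0] for v in votes)


original = "ACGTTGCAAGCTTACGGATC"
fragments = split_overlapping(original)

# Simulate one fragment being misread entirely.
damaged = list(fragments)
damaged[3] = (damaged[3][0], "N" * len(damaged[3][1]))

recovered = reassemble(damaged, len(original))
```

With these toy settings the damaged fragment is outvoted three to one at every position it covers, so `recovered` matches `original`. Positions near the ends are covered by fewer fragments and are correspondingly less protected in this sketch.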

So why bother with all this? What’s the advantage of converting data into DNA when it already exists in books and microchips? The answer is clear when you think about the massive amount of data our society is quickly amassing. If 46 microscopic chromosomes can carry all the information necessary to make a human, think of how many libraries could fit in that space. Goldman’s sequencing of Shakespeare’s Sonnets was microscopically minuscule, and his team believes that they could store all the data from the CERN Particle Physics Laboratory, nearly 90 petabytes (a petabyte is one quadrillion bytes), in just 41 grams of DNA.
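As a quick back-of-the-envelope check, the two figures quoted above imply a storage density of a little over two petabytes per gram. The short Python calculation below uses only those two numbers and the decimal definition of a petabyte; the function name is hypothetical.

```python
PETABYTE = 10**15  # one quadrillion bytes


def storage_density(total_bytes, grams):
    """Return the implied storage density in bytes per gram."""
    return total_bytes / grams


cern_bytes = 90 * PETABYTE  # "nearly 90 petabytes"
dna_grams = 41              # "just 41 grams of DNA"

bytes_per_gram = storage_density(cern_bytes, dna_grams)
petabytes_per_gram = bytes_per_gram / PETABYTE

print(f"{petabytes_per_gram:.2f} PB per gram")  # prints "2.20 PB per gram"
```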

But the best case for DNA data storage is just how long it lasts. “The experiment was done 60,000 years ago,” Goldman told Nature, “when a mammoth died and lay there in the ice.” With this in mind, the extreme longevity of DNA has spectacular implications for the “apocalypse-proof” preservation of data: our culture and literature could be not only archived, but fossilized. And though sequencing and decoding methods will undoubtedly change, the code in which all life on earth is written probably won’t go out of style in the way cassette tapes gave way to CDs, which in turn gave way to MP3s.

Most importantly, thanks to DNA data storage, Shakespeare may actually be able to keep the promise he made to his mysterious mistress centuries ago in Sonnet 60:

And yet to times in hope, my verse shall stand
Praising thy worth, despite his cruel hand.