New AI Model GROVER Decodes DNA's Hidden Language

06 August 2024

Zaker Adham

Summary

DNA holds the essential information needed to sustain life, and understanding its organization has been a major scientific challenge. Now, with GROVER, a new AI model trained on human DNA, researchers are making strides in decoding this complex information.

Developed by the Biotechnology Center (BIOTEC) at Dresden University of Technology, GROVER treats human DNA as text, learning its rules and context to extract functional information. This innovative tool, featured in Nature Machine Intelligence, has the potential to revolutionize genomics and accelerate personalized medicine.

Since the discovery of the double helix, scientists have been trying to understand the information encoded in DNA. Seventy years later, it's clear that DNA's information is multilayered, with only 1-2% of the genome consisting of genes that code for proteins.

"DNA has many functions beyond coding for proteins. Some sequences regulate genes, others serve structural purposes, and many sequences have multiple functions. We currently don't understand the meaning of most DNA. This is where AI and large language models can help," says Dr. Anna Poetsch, research group leader at BIOTEC.

DNA as a Language

Large language models like GPT have transformed our understanding of language. Trained exclusively on text, these models can use language in various contexts.

"DNA is the code of life. Why not treat it like a language?" says Dr. Poetsch. The team trained a large language model on a reference human genome, resulting in GROVER, or "Genome Rules Obtained via Extracted Representations," which can extract biological meaning from DNA.

"GROVER learned the rules of DNA. In terms of language, we're talking about grammar, syntax, and semantics. For DNA, this means learning the rules governing sequences, the order of nucleotides, and the meaning of sequences. Like GPT models learning human languages, GROVER has learned how to 'speak' DNA," explains Dr. Melissa Sanabria, the researcher behind the project.

The team demonstrated that GROVER can accurately predict DNA sequences and extract contextual information with biological meaning, such as identifying gene promoters or protein binding sites. GROVER also learns processes generally considered "epigenetic," meaning regulatory processes that act on top of the DNA rather than being encoded in the sequence itself.

"It's fascinating that by training GROVER with only the DNA sequence, without any function annotations, we can extract information on biological function. This shows that function, including some epigenetic information, is encoded in the sequence," says Dr. Sanabria.

The DNA Dictionary

"DNA resembles language. It has four letters that build sequences, and these sequences carry meaning. However, unlike a language, DNA has no defined words," says Dr. Poetsch. DNA consists of four letters (A, T, G, and C) and genes, but there are no predefined sequences of different lengths that combine to build genes or other meaningful sequences.

To train GROVER, the team first created a DNA dictionary using a trick from compression algorithms. "This step is crucial and sets our DNA language model apart from previous attempts," says Dr. Poetsch.

"We analyzed the whole genome and looked for combinations of letters that occur most often. We started with two letters and went over the DNA repeatedly to build up the most common multi-letter combinations. In about 600 cycles, we fragmented the DNA into 'words' that let GROVER perform best in predicting the next sequence," explains Dr. Sanabria.

The Promise of AI in Genomics

GROVER promises to unlock the different layers of genetic code. DNA holds key information about what makes us human, our disease predispositions, and our responses to treatments.

"We believe that understanding the rules of DNA through a language model will help us uncover the depths of biological meaning hidden in DNA, advancing both genomics and personalized medicine," says Dr. Poetsch.