2025-01-10
Artificial intelligence is applied in many fields and is now reaching the complex domain of DNA comprehension. By leveraging advanced machine learning models, researchers can analyze vast volumes of genomic data with unprecedented precision and speed. AI makes it possible to identify patterns and relationships in DNA sequences that were previously inaccessible, driving significant advances in our understanding of gene expression, mutations, and regulatory elements. These new methods pave the way for innovations that have the potential to transform biological research and healthcare. Below is a brief (non-exhaustive) overview of these advances.
Large Language Models (LLMs) are deep learning models trained to process and understand vast amounts of sequential data. They are well suited to natural language processing (NLP) because they can "guess" the interactions between words and infer the context and meaning of a sentence or a group of sentences.
Application to the DNA context: DNA sequences can be treated as a "language" in which the four nucleotides (A, T, C, G) form the "alphabet," enabling LLMs to analyze complex patterns and structures in genomic data.
Fig: DNA strand representation
Functional genomic elements: Identify promoters, enhancers, and other regulatory elements.
Gene outcome prediction: Predict the functional outcomes of genes, such as expression levels, protein production, or phenotype.
Variant Impact Assessment: Analyze the effects of genetic mutations on gene expression and fitness.
Synthetic Biology: Design novel genetic elements and entire genomes.
Multi-Species Genome Analysis
And more …
Definition: Transforming a sequence into a list of smaller components called “tokens.” Example:
“The King eats an apple” -> “The”, “King”, “eats”, “an”, “apple”
“ATCGTAGC” -> “ATC”, “GTA”, “GC”
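The DNA example above can be reproduced with a minimal sketch of non-overlapping k-mer tokenization. This is an illustrative toy (the function name `kmer_tokenize` is ours); real genomic LLMs use more elaborate schemes, such as byte-pair encoding or overlapping k-mers.

```python
def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mer tokens.

    The final token may be shorter than k, matching the slide's
    example ("ATCGTAGC" -> "ATC", "GTA", "GC").
    """
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]

print(kmer_tokenize("ATCGTAGC"))  # ['ATC', 'GTA', 'GC']
```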
Definition: Each token is transformed into a vector that captures its relationships with the others. To do this, you define a vocabulary (dictionary), which lists all the tokens/words in your corpus.
Training: Requires pre-training on data to infer patterns and relationships.
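The vocabulary-and-embedding idea can be sketched as follows. This is a toy illustration under our own assumptions: the vocabulary is every 3-mer over {A, T, C, G}, and the vectors are random, standing in for values that pre-training would learn.

```python
from itertools import product

import numpy as np

# Toy vocabulary: every 3-mer over the DNA alphabet, mapped to an integer id.
vocab = {"".join(kmer): idx for idx, kmer in enumerate(product("ATCG", repeat=3))}

embedding_dim = 8
rng = np.random.default_rng(0)
# Embedding matrix: one vector per token. Random here; in a real model
# these values are learned during pre-training.
embeddings = rng.normal(size=(len(vocab), embedding_dim))

def embed(tokens: list[str]) -> np.ndarray:
    """Look up each token's id in the vocabulary and return its vector."""
    return np.stack([embeddings[vocab[t]] for t in tokens])

vectors = embed(["ATC", "GTA"])
print(vectors.shape)  # (2, 8)
```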
More information: Medium article on tokens, vectors, and embeddings
More information: Medium article on transformer encoder/decoders
Definition: A training objective where some tokens in the sequence are masked and the model learns to predict them based on context.
Application in DNA: Enables models to infer missing or obscured nucleotide sequences and learn patterns in genomic data.
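A minimal sketch of how masked-language-modeling training pairs are built (the function `mask_tokens` and the 15% mask rate are illustrative assumptions, in the spirit of BERT-style objectives, not a specific model's implementation):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens; the model learns to predict the originals.

    Returns (inputs, labels): inputs feed the model, labels hold the
    original token at masked positions and None elsewhere (not scored).
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            labels.append(tok)   # target to recover from surrounding context
        else:
            inputs.append(tok)
            labels.append(None)  # this position is not scored
    return inputs, labels
```

Because the model sees context on both sides of each mask, this objective suits tasks like filling in missing or low-quality nucleotide stretches.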
Definition: A training objective where the model predicts the next token in a sequence based only on the preceding tokens, ensuring a unidirectional flow of information.
Application in DNA: Helps in simulating nucleotide sequences by predicting the next nucleotide based on the sequence context, useful in genome assembly or generating synthetic DNA sequences.
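The unidirectional objective can be sketched as (context, next-token) training pairs; `causal_pairs` is an illustrative helper of ours, not any model's API:

```python
def causal_pairs(tokens):
    """Each training example: preceding context -> next token.

    Information flows one way only: the target at position i is
    predicted from tokens[0..i-1], never from later tokens.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in causal_pairs(["ATC", "GTA", "GCC"]):
    print(context, "->", target)
# ['ATC'] -> GTA
# ['ATC', 'GTA'] -> GCC
```

At generation time, the model samples the next token, appends it to the context, and repeats, which is how synthetic DNA sequences are produced.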
Focus: Understanding DNA grammar and sequence structure.
Application: Predicting genome elements like promoters and enhancers.
Data: Exclusively human genome.
Performance:
Link: GROVER
Focus: Long-range context for gene expression and variant predictions.
Application: Multi-species genomic tasks and molecular phenotype prediction.
Data: Genomes from humans and 850 species.
Performance:
Link: NT
Focus: Multimodal genome-scale modeling (DNA, RNA, and proteins).
Application: Zero-shot predictions, mutation impact, and synthetic genome generation.
Data: 2.7M prokaryotic and phage genomes.
Performance:
Link: GitHub evo2
Focus: Protein function prediction using sequence and structure integration.
Application: Functional annotation and discovery of viral proteins.
Data: Protein sequences and structural information.
Performance:
Link: GitHub LucaProt
Please feel free to suggest any new models for inclusion in these slides, or contact us for further inquiries.
https://github.com/Romumrn/slides_AI_DNA