Gene-LLMs: a comprehensive survey of transformer-based genomic language models for regulatory and clinical genomics
Summary
This summary is machine-generated. Genome large language models (Gene-LLMs) apply NLP techniques to the interpretation of genomic data for bioinformatics. These models analyze nucleotide sequences and gene expression, with applications in functional genomics and clinical diagnostics.
Area Of Science
- Bioinformatics and Computational Biology
- Genomics and Molecular Biology
- Artificial Intelligence in Life Sciences
Background
- Natural Language Processing (NLP) and genomics are converging.
- Transformer-based models, termed Genome Large Language Models (Gene-LLMs), are emerging.
- Gene-LLMs interpret genomic data using nucleotide sequences, gene expression, and multi-omic annotations via self-supervised pretraining.
Purpose Of The Study
- To provide a comprehensive overview of the Gene-LLM lifecycle.
- To detail the applications and impact of Gene-LLMs in various biological fields.
- To outline future directions for Gene-LLM development and application.
Main Methods
- Review of Gene-LLM lifecycle stages: data ingestion, tokenization (k-mer, gene-level), and pretraining tasks (masked nucleotide prediction, sequence alignment).
- Analysis of Gene-LLM applications: enhancer/promoter identification, chromatin state modeling, RNA-protein interaction prediction, synthetic sequence generation.
- Evaluation of Gene-LLM impact using benchmarks (CAGI5, GenBench, NT-Bench, BEACON) in functional genomics, clinical diagnostics, and evolutionary inference.
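Two of the lifecycle stages listed above, k-mer tokenization and masked nucleotide prediction, can be illustrated with a minimal sketch. The function names (`kmer_tokenize`, `mask_tokens`), the `[MASK]` symbol, and the 15% default masking rate are assumptions borrowed from common masked-language-model practice, not details taken from the survey.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def kmer_tokenize(seq, k=3, stride=1):
    """Split a nucleotide string into k-mers (overlapping when stride < k)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns the masked token list and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels[i] = tok
        else:
            masked.append(tok)
    return masked, labels


tokens = kmer_tokenize("ACGTAC", k=3)
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=0)
print(tokens)   # ['ACG', 'CGT', 'GTA', 'TAC']
print(masked, labels)
```

With overlapping k-mers, the neighbours of a masked token still reveal most of its bases, so a practical setup would likely mask contiguous spans or use non-overlapping tokens; this sketch masks positions independently for simplicity.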
Main Results
- Gene-LLMs demonstrate significant impact across functional genomics, clinical diagnostics, and evolutionary inference.
- Recent advances include encoder-decoder modifications and positional embeddings for enhanced interpretability and translational potential.
- Results on the benchmarks listed under Main Methods (CAGI5, GenBench, NT-Bench, BEACON) indicate that Gene-LLMs are effective across diverse genomic tasks.
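The summary does not say which positional embeddings the surveyed models use. As one possible illustration, the sketch below computes the fixed sinusoidal encoding from the original Transformer architecture, which gives each token position in a sequence a distinct vector. The function name `sinusoidal_positions` and the dimensions are hypothetical, and individual Gene-LLMs may use learned, rotary, or other schemes instead.

```python
import math


def sinusoidal_positions(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings.

    Even columns hold sin(pos / 10000^(i / d_model)) and odd columns the
    matching cosine, where i is the even column index.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


# One encoding vector per token position, e.g. per k-mer in a tokenized sequence.
pe = sinusoidal_positions(seq_len=4, d_model=8)
print(len(pe), len(pe[0]))  # 4 8
```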
Conclusions
- Gene-LLMs are a cornerstone technology for the future of biomedicine.
- Future pathways include federated genomic learning, multimodal sequence modeling, and low-resource adaptation for rare variant discovery.
- Gene-LLMs offer a proactive approach to responsible biomedical innovation.