Enhancing gene set overrepresentation analysis with large language models

Area of Science:

Bioinformatics
Computational Biology
Genomics

Background:

Traditional gene set overrepresentation analysis (ORA) relies on static, human-curated databases, limiting flexibility in interpreting high-throughput transcriptomics and proteomics data.
Because these databases are fixed in advance, existing methods cannot readily adapt to a specific biological context or to dynamically generated gene lists.

Purpose of the Study:

To develop a flexible framework, llm2geneset, that utilizes large language models (LLMs) to dynamically generate gene set databases.
To enable context-aware functional interpretation of biological data by integrating LLM-generated gene sets with analysis methods like ORA.
To benchmark the performance of LLM-generated gene sets against human-curated databases.

Main Methods:

Development of the llm2geneset framework, leveraging LLMs to create gene sets based on input genes and natural language biological context.
Integration of dynamically generated gene sets with established analysis methods, such as ORA, for functional annotation.
Comparative analysis of LLM-generated gene sets against human-curated databases using benchmarking studies.
Application of the framework to RNA-sequencing data from iPSC-derived microglia treated with a TREM2 agonist.
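
To make the ORA step concrete, here is a minimal sketch of scoring gene sets against a query gene list. This summary does not state which statistical test llm2geneset uses, so the sketch assumes the commonly used one-sided hypergeometric test. The function names `ora_p_value` and `run_ora`, the gene set names, and the gene identifiers are hypothetical placeholders; they are not part of the llm2geneset API, and the gene sets are written by hand here rather than generated by an LLM.

```python
from math import comb


def ora_p_value(overlap, query_size, set_size, background_size):
    """One-sided hypergeometric p-value: P(X >= overlap).

    X is the number of gene set members found when drawing
    `query_size` genes at random from the background.
    """
    largest_overlap = min(query_size, set_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, query_size - k)
        for k in range(overlap, largest_overlap + 1)
    )
    return tail / total


def run_ora(query_genes, gene_sets, background):
    """Score each gene set against the query, smallest p-value first."""
    background = set(background)
    # Genes outside the background cannot be tested, so drop them.
    query = set(query_genes) & background
    rows = []
    for name, members in gene_sets.items():
        members = set(members) & background
        overlap = len(query & members)
        p = ora_p_value(overlap, len(query), len(members), len(background))
        rows.append((name, overlap, p))
    return sorted(rows, key=lambda row: row[2])


# Placeholder identifiers only: not real genes and not real results.
background = [f"G{i}" for i in range(1, 21)]
gene_sets = {
    "set_a": ["G1", "G2", "G3", "G4", "G5"],
    "set_b": ["G6", "G7", "G8", "G9", "G10"],
}
query_genes = ["G1", "G2", "G3", "G4", "G11"]

results = run_ora(query_genes, gene_sets, background)
for name, overlap, p in results:
    print(f"{name}\toverlap={overlap}\tp={p:.4g}")
```

In a workflow like the one described above, `gene_sets` would hold the dynamically generated sets in place of a static database. When many gene sets are tested at once, the resulting p-values are normally adjusted for multiple testing before interpretation.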

Main Results:

In benchmarking, LLM-generated gene sets were of comparable quality to human-curated gene sets.
The llm2geneset framework successfully identified biological processes within input gene sets, outperforming traditional ORA and direct LLM prompting.
The framework facilitated flexible, context-aware gene set generation and improved the interpretation of high-throughput biological data, as shown in the TREM2 agonist study.

Conclusions:

llm2geneset offers a flexible alternative to enrichment analysis against static, human-curated databases by using LLMs to generate gene set databases dynamically.
The framework enhances the interpretation of complex biological datasets by offering context-specific functional annotations.
By outperforming both traditional ORA and direct LLM prompting in benchmarks, llm2geneset advances the bioinformatics tooling available for biological data analysis and discovery.