High-quality data selection-driven instruction tuning for biomedical large language models

Area of Science:

Biomedical Natural Language Processing (NLP)
Machine Learning
Artificial Intelligence

Background:

Training large language models (LLMs) for biomedical NLP tasks is computationally intensive and requires high-quality data.
Existing data selection methods may not optimally adapt to the diverse challenges within biomedical NLP, such as named entity recognition (NER), relation extraction (RE), event extraction (EE), and text classification (TXTCLASS).
Efficient training strategies are crucial for advancing clinical and research applications of LLMs in the biomedical domain.

Purpose of the Study:

To present a novel data selection framework that enhances the training efficiency of LLMs for critical biomedical NLP tasks.
To introduce and validate the Data Selection (DS) score, a metric that quantifies the impact of instructional context on response generation.
To develop and evaluate a fine-tuned LLM, BiomedicalLLM, using the proposed data selection methodology.

Main Methods:

Developed a Data Selection (DS) score to measure the influence of instructional context on model response losses.
Employed the DS method to filter high-quality data from biomedical datasets for specific NLP tasks (NER, RE, EE, TXTCLASS).
Fine-tuned a base LLM on the selected dataset, resulting in the BiomedicalLLM model, and conducted experiments and ablation studies.
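
The summary does not give the DS score formula, so the following is only a minimal sketch of one plausible reading of the methods above: the score compares the model's loss on a response with and without the instruction in context. The ratio form, the names `Sample`, `ds_score`, and `select_top`, and the choice to keep the highest-scoring samples are all assumptions made for illustration, not the authors' definition.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    """One instruction-response pair with precomputed per-token log-probabilities."""
    task: str                                  # e.g. "NER", "RE", "EE", "TXTCLASS"
    logp_with_instruction: Sequence[float]     # response tokens scored with the instruction in context
    logp_without_instruction: Sequence[float]  # same response tokens scored without the instruction


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood over the response tokens."""
    if len(token_logprobs) == 0:
        raise ValueError("response must contain at least one token")
    return -sum(token_logprobs) / len(token_logprobs)


def ds_score(sample: Sample) -> float:
    """ASSUMED form: conditioned response loss divided by unconditioned response loss.

    A value below 1 means the instruction lowers the loss on the response.
    """
    unconditioned = mean_nll(sample.logp_without_instruction)
    if unconditioned == 0:
        raise ValueError("unconditioned loss is zero; the ratio is undefined")
    return mean_nll(sample.logp_with_instruction) / unconditioned


def select_top(samples: List[Sample], k: int) -> List[Sample]:
    """Keep the k samples with the highest score (the ranking direction is an assumption)."""
    return sorted(samples, key=ds_score, reverse=True)[:k]
```

Because this sketch ranks all samples together rather than per task, the number kept for each task depends on how the scores are distributed. Whether the actual framework selects globally or per task is not stated in this summary.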

Main Results:

The BiomedicalLLM model, trained with the DS framework, achieved an average F1-score improvement of 3.3% across tasks compared to baseline methods.
Ablation studies confirmed the overall effectiveness of the proposed data selection framework.
Analysis showed that the DS method dynamically adjusts sample selection based on task characteristics, optimizing resource allocation for improved diversity and representation.
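
For readers unfamiliar with the metric, the sketch below shows how an F1-score is computed from prediction counts and how a per-task gain could be averaged across tasks. The summary does not say whether the reported 3.3% is an absolute or a relative difference; this sketch assumes an absolute difference averaged with equal weight per task, and the function names are hypothetical.

```python
from typing import Dict


def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 as the harmonic mean of precision and recall, written in count form."""
    denominator = 2 * true_pos + false_pos + false_neg
    if denominator == 0:
        return 0.0  # no positives predicted or present; treated as 0 by convention here
    return 2 * true_pos / denominator


def mean_f1_gain(baseline: Dict[str, float], candidate: Dict[str, float]) -> float:
    """Average absolute F1 difference (candidate minus baseline) over shared tasks."""
    tasks = sorted(set(baseline) & set(candidate))
    if not tasks:
        raise ValueError("no tasks in common")
    return sum(candidate[t] - baseline[t] for t in tasks) / len(tasks)
```

Calling `mean_f1_gain` with per-task F1 dictionaries keyed by task name (for example "NER", "RE", "EE", "TXTCLASS") returns a single averaged gain. Any values passed in would have to come from the paper itself.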

Conclusions:

The novel data selection framework significantly enhances LLM training efficiency and performance in biomedical NLP.
The DS score provides a valuable metric for data quality assessment and selection in instruction-based LLM training.
This approach offers a transformative strategy for developing advanced LLMs for clinical practice and biomedical research, and the resulting model is available as open source.