
Updated: Jun 26, 2026


Enhancing Text Datasets With Scaling and Targeting Data Augmentation to Improve BERT-Based Machine Learners.

Chancellor Woolsey¹, Gondy Leroy¹, Nell Maltman²

¹Department of Management Information Systems, University of Arizona, Tucson, AZ, United States.

Expert Systems with Applications

June 25, 2026

Summary

This summary is machine-generated.


Synthetic data generated with large language models improved classifier performance on behavioral descriptions of autism spectrum disorder (ASD). Recall increased while precision decreased, a trade-off to weigh when choosing an augmentation strategy for medical applications.

Area of Science:

Machine Learning
Natural Language Processing
Medical Informatics

Background:

Acquiring sufficient data for machine learning is challenging, especially for text data.
Large language models (LLMs) offer solutions for synthetic text data generation.
Autism spectrum disorder (ASD) diagnosis can benefit from improved machine learning models.

Purpose of the Study:

To analyze the impact of descriptively selected synthetic data on downstream classifier performance for ASD.
To compare different synthetic data augmentation schemes (Data Targeting vs. Data Scaling).
To evaluate the effectiveness of white-box metrics in guiding data selection.

Keywords:

BERT, Cost analysis, Data augmentation, LLM, Large language models, Stability analysis, Synthetic data, Text data, explainable AI

Main Methods:

A fine-tuned multilabel bidirectional encoder model was used to label 10,892 behavioral descriptions with seven ASD diagnostic criteria.
Synthetic data augmentation was applied at 50% and 100% of the baseline dataset size.
Data points were selected using type-token ratio, cosine similarity, and perplexity metrics.
Performance was evaluated using precision, recall, and F1 scores per label.
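
Two of the selection metrics named in the methods, type-token ratio and cosine similarity, can be illustrated with a minimal Python sketch. The summary does not say how the study tokenized text or represented it as vectors, so the regex tokenizer and bag-of-words count vectors here are assumptions, and the function names are hypothetical. Perplexity is omitted because it requires a language model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumption; the study's tokenizer is not stated)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(text):
    """Number of unique tokens divided by total tokens; 0.0 for empty text."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors (an assumed representation)."""
    a = Counter(tokenize(text_a))
    b = Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up example strings, not data from the study.
sample = "the cat sat on the mat"
ttr = type_token_ratio(sample)            # 5 unique tokens out of 6
same = cosine_similarity(sample, sample)  # identical texts
none = cosine_similarity("red apple", "blue sky")  # no shared tokens
```

A higher type-token ratio indicates more lexical variety within a text, while cosine similarity indicates how close a synthetic text is to a reference text under the chosen representation.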

Main Results:

Synthetic data augmentation consistently increased recall by approximately 8% but decreased precision by approximately 10%.
White-box metrics and stability analysis did not show a clear relationship with the observed performance changes.
Data Targeting augmentation demonstrated potential cost reduction for the BioBERT model.
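
The precision and recall changes reported above are per-label quantities for a multilabel classifier. As a minimal sketch of how such scores are computed, the following Python function takes binary label matrices and returns precision, recall, and F1 for each label. The function name and the toy labels are hypothetical and are not drawn from the study.

```python
def per_label_scores(y_true, y_pred):
    """Compute precision, recall, and F1 for each label column.

    y_true, y_pred: lists of equal-length 0/1 lists, one row per sample
    and one column per label.
    """
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append({"precision": precision, "recall": recall, "f1": f1})
    return scores


# Toy example with two labels and four samples (made-up values).
y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_pred = [[1, 1], [0, 1], [0, 1], [0, 0]]
scores = per_label_scores(y_true, y_pred)
```

In the toy example, the second label shows the pattern described in the results: every true instance is found (high recall), but one extra positive prediction lowers precision.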

Conclusions:

Synthetic data augmentation impacts classifier performance, with varying effects on precision and recall.
The choice of augmentation scheme should align with the specific application, such as medical screening or diagnosis.
Further research is needed to refine synthetic data selection methods for optimal performance in clinical settings.