Updated: Jun 26, 2026

Enhancing Text Datasets With Scaling and Targeting Data Augmentation to Improve BERT-Based Machine Learners.

Chancellor Woolsey¹, Gondy Leroy¹, Nell Maltman²

  • ¹ Department of Management Information Systems, University of Arizona, Tucson, AZ, United States.

Expert Systems with Applications | June 25, 2026
Summary
This summary is machine-generated.

Augmenting training data with synthetic text generated by large language models changed classifier performance on autism spectrum disorder (ASD) behavioral descriptions: recall rose by approximately 8% while precision fell by approximately 10%. This trade-off should guide the choice of augmentation strategy in medical applications.

Area of Science:

  • Machine Learning
  • Natural Language Processing
  • Medical Informatics

Background:

  • Acquiring sufficient data for machine learning is challenging, especially for text data.
  • Large language models (LLMs) offer solutions for synthetic text data generation.
  • Autism spectrum disorder (ASD) diagnosis can benefit from improved machine learning models.

Purpose of the Study:

  • To analyze the impact of descriptively selected synthetic data on downstream classifier performance for ASD.
  • To compare different synthetic data augmentation schemes (Data Targeting vs. Data Scaling).
  • To evaluate the effectiveness of white-box metrics in guiding data selection.

Keywords: BERT; Cost analysis; Data augmentation; LLM; Large language models; Stability analysis; Synthetic data; Text data; explainable AI

Main Methods:

  • A fine-tuned, multilabel bidirectional encoder model was used to label 10,892 behavioral descriptions with seven ASD diagnostic criteria.
  • Synthetic data augmentation was applied at 50% and 100% of the baseline dataset size.
  • Data points were selected using type-token ratio, cosine similarity, and perplexity metrics.
  • Performance was evaluated using precision, recall, and F1 scores per label.
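
The selection step described in the methods above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the whitespace tokenizer, the bag-of-words vectors, the pooled baseline profile, the helper names (`type_token_ratio`, `cosine_similarity`, `select_synthetic`), and the toy sentences are all assumptions made for the example. Perplexity is left out because it requires a language model to score each text.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption, not the paper's tokenizer)."""
    return text.lower().split()


def type_token_ratio(text):
    """Unique tokens divided by total tokens; 0.0 for empty text."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_synthetic(baseline, synthetic, fraction, score):
    """Keep the top-scoring synthetic texts, up to fraction * len(baseline)."""
    k = min(len(synthetic), int(fraction * len(baseline)))
    return sorted(synthetic, key=score, reverse=True)[:k]


# Toy strings invented for illustration; they are not data from the study.
baseline = ["avoids eye contact during play", "repeats the same phrase often"]
synthetic = [
    "rarely makes eye contact with peers",
    "repeats repeats repeats the phrase",
    "lines up toys in the same order",
]

# Pooled bag-of-words profile of the baseline set (hypothetical reference point).
baseline_profile = Counter(t for text in baseline for t in tokenize(text))


def similarity_to_baseline(text):
    """Score a synthetic text by its cosine similarity to the baseline profile."""
    return cosine_similarity(Counter(tokenize(text)), baseline_profile)


# Augment at 50% of the baseline size, ranked by similarity to the baseline.
augment_50 = select_synthetic(baseline, synthetic, 0.5, similarity_to_baseline)
# Augment at 100% of the baseline size, ranked by lexical diversity.
augment_100 = select_synthetic(baseline, synthetic, 1.0, type_token_ratio)
```

Passing the scoring function as a parameter keeps the size rule (50% or 100% of the baseline) separate from the metric, so a perplexity-based scorer could be swapped in without changing the selection logic.
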
Main Results:

  • Synthetic data augmentation consistently increased recall by approximately 8% but decreased precision by approximately 10%.
  • White-box metrics and stability analysis did not show a clear relationship with the observed performance changes.
  • Data Targeting augmentation demonstrated potential cost reduction for the BioBERT model.
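
The recall and precision trade-off reported above is measured per label. A minimal sketch of per-label precision, recall, and F1 for a multilabel task is shown below; the function name `per_label_scores`, the label names, and the example predictions are hypothetical and are not taken from the study.

```python
def per_label_scores(y_true, y_pred, labels):
    """Precision, recall, and F1 for each label of a multilabel task.

    y_true and y_pred are lists of sets holding the labels of each example.
    """
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical labels and predictions, invented for illustration only.
labels = ["label_a", "label_b"]
y_true = [{"label_a"}, {"label_a", "label_b"}, set(), {"label_b"}]
y_pred = [{"label_a"}, {"label_a"}, {"label_a"}, {"label_b"}]
scores = per_label_scores(y_true, y_pred, labels)
```

In this toy case `label_a` is over-predicted (full recall, lower precision) and `label_b` is under-predicted (full precision, lower recall), which is the kind of per-label shift the reported results describe.
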

Conclusions:

  • Synthetic data augmentation impacts classifier performance, with varying effects on precision and recall.
  • The choice of augmentation scheme should align with the specific application, such as medical screening or diagnosis.
  • Further research is needed to refine synthetic data selection methods for optimal performance in clinical settings.