Evaluating large language models for abstract evaluation tasks: an empirical study.

Yinuo Liu¹, Emre Sezgin²,³, Eric A Youngstrom¹,⁴

¹Institute for Mental and Behavioral Health Research, Nationwide Children's Hospital, Columbus, OH, United States.

Frontiers in Research Metrics and Analytics

May 20, 2026

Summary

This summary is machine-generated.

Large language models (LLMs) show promise in scientific peer review, agreeing moderately with human experts on objective criteria. However, LLMs struggle with subjective assessments, highlighting the need for continued human oversight in academic evaluations.

Area of Science:

Artificial Intelligence
Scientific Publishing
Academic Evaluation

Background:

Large language models (LLMs) offer potential for assisting in scientific peer review.
Investigating the agreement between LLMs and human experts in quantitative assessment is crucial.

Purpose of the Study:

To evaluate the consistency and reliability of ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5 in assessing conference abstracts.
To compare LLM evaluations against human reviewers using a standardized rubric.

Main Methods:

Three LLMs independently graded 160 conference abstracts using an 8-criterion rubric (1-5 scale).
Fourteen human reviewers assessed subsets of the abstracts using the same rubric.
Inter-rater reliability was assessed using intraclass correlation coefficients (ICCs) and Bland-Altman plots.
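
For readers unfamiliar with the intraclass correlation coefficient, a minimal sketch of one common variant follows. The summary does not say which ICC form the study used, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single rater) is an assumption, as are the function name `icc_2_1` and the example ratings, which are illustrative and not data from the study.

```python
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` holds one row per rated item (e.g. an abstract) and one
    column per rater. The ICC form is an assumption; the source does not
    state which variant was used.
    """
    n = len(scores)        # number of rated items
    k = len(scores[0])     # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with >= 2 items and >= 2 raters")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares (no replication).
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                  # between-item mean square
    ms_cols = ss_cols / (k - 1)                  # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))    # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative ratings only: six items scored by four raters.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(example), 2))  # 0.29
```

In this form, systematic differences between raters (the between-rater mean square) lower the coefficient, which is why raters who rank items similarly but score on shifted scales can still show low absolute agreement. The function does not handle the degenerate case where every score is identical.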

Keywords:

abstract evaluation; artificial intelligence; inter-rater reliability; large language models; peer-review

Main Results:

LLMs demonstrated high internal consistency; ChatGPT and Claude showed moderate agreement with human reviewers on objective criteria (ICCs 0.45-0.60).
Agreement was fair for subjective criteria (ICCs 0.23-0.38), with Gemini performing worse.
Bland-Altman analysis indicated acceptable systematic bias for ChatGPT and Claude.
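
The systematic bias reported by a Bland-Altman analysis is the mean difference between two sets of paired scores, conventionally shown with 95% limits of agreement at the bias plus or minus 1.96 standard deviations of the differences. The sketch below illustrates that calculation; the function name `bland_altman` and the scores are hypothetical, and the study's actual data and any acceptability thresholds are not given in this summary.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def bland_altman(llm: Sequence[float], human: Sequence[float]) -> Tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) for paired scores.

    Bias is the mean of (llm - human); the limits of agreement are
    bias +/- 1.96 * sample standard deviation of the differences.
    """
    if len(llm) != len(human) or len(llm) < 2:
        raise ValueError("need two equally long score lists with >= 2 pairs")
    diffs = [a - b for a, b in zip(llm, human)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


# Hypothetical rubric scores (1-5 scale) for four abstracts.
llm_scores = [4, 3, 5, 2]
human_scores = [3, 3, 4, 2]
bias, lower, upper = bland_altman(llm_scores, human_scores)
print(f"bias={bias:.2f}, limits=({lower:.2f}, {upper:.2f})")
```

A positive bias here would mean the LLM scores higher than the human reviewer on average; wide limits indicate that individual score pairs can disagree substantially even when the average bias is small.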

Conclusions:

LLMs can moderately agree with human experts on objective abstract criteria, aiding pre-screening or reviewer support.
LLMs offer efficiency and standardization advantages for large-scale abstract review.
Human judgment remains essential for comprehensive assessment due to LLMs' limitations in evaluating subjective dimensions.