Updated: May 21, 2026

Evaluating large language models for abstract evaluation tasks: an empirical study.

Yinuo Liu (1), Emre Sezgin (2, 3), Eric A Youngstrom (1, 4)

  • (1) Institute for Mental and Behavioral Health Research, Nationwide Children's Hospital, Columbus, OH, United States.

Frontiers in Research Metrics and Analytics | May 20, 2026
Summary

This summary is machine-generated.

Large language models (LLMs) show promise in scientific peer review, agreeing moderately with human experts on objective criteria. However, LLMs struggle with subjective assessments, highlighting the need for continued human oversight in academic evaluations.

Area of Science:

  • Artificial Intelligence
  • Scientific Publishing
  • Academic Evaluation

Background:

  • Large language models (LLMs) offer potential for assisting in scientific peer review.
  • Investigating the agreement between LLMs and human experts in quantitative assessment is crucial.

Purpose of the Study:

  • To evaluate the consistency and reliability of ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5 in assessing conference abstracts.
  • To compare LLM evaluations against human reviewers using a standardized rubric.

Main Methods:

  • Three LLMs independently graded 160 conference abstracts using an 8-criterion rubric (1-5 scale).
  • Fourteen human reviewers assessed subsets of the abstracts using the same rubric.
  • Inter-rater reliability was assessed using intraclass correlation coefficients (ICCs) and Bland-Altman plots.
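
The two reliability statistics named in the methods can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code. The summary does not say which ICC form was used, so the sketch assumes ICC(2,1) (two-way random effects, absolute agreement, single rater). The function names `icc2_1` and `bland_altman` and the toy scores are hypothetical.

```python
from statistics import mean, stdev


def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: list of rows, one row per abstract, one column per rater.
    Assumption: the study's exact ICC form is not stated in the summary.
    """
    n = len(scores)        # number of abstracts (targets)
    k = len(scores[0])     # number of raters
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-abstract mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired scores a and b.

    Bias is the mean difference a - b; the limits of agreement are
    bias +/- 1.96 standard deviations of the differences.
    """
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical 1-5 rubric scores: one row per abstract, columns = (human, LLM)
toy = [[4, 4], [3, 4], [5, 4], [2, 3], [4, 5], [3, 3]]
print("ICC(2,1):", round(icc2_1(toy), 3))

human = [row[0] for row in toy]
llm = [row[1] for row in toy]
bias, lower, upper = bland_altman(llm, human)
print("bias:", round(bias, 3), "limits:", round(lower, 3), round(upper, 3))
```

A positive bias here would mean the LLM scores higher than the human reviewer on average. The limits of agreement show how far individual scores can be expected to diverge.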

Keywords:

  • abstract evaluation
  • artificial intelligence
  • inter-rater reliability
  • large language models
  • peer-review

Main Results:

  • LLMs demonstrated high internal consistency; ChatGPT and Claude showed moderate agreement with human reviewers on objective criteria (ICCs 0.45-0.60).
  • Agreement was only fair for subjective criteria (ICCs 0.23-0.38), and Gemini performed worse than ChatGPT and Claude.
  • Bland-Altman analysis indicated acceptable systematic bias for ChatGPT and Claude.
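
The verbal labels attached to the ICC ranges above ("moderate", "fair") can be reproduced with a simple banding function. The sketch below assumes the Landis and Koch cut-offs, which were originally proposed for kappa statistics. The summary does not say which interpretive scale the authors applied, so the choice of scale and the function name `agreement_label` are assumptions made for illustration.

```python
# Upper bounds (inclusive) of the Landis and Koch bands.
# Assumption: the summary does not state which scale the study used.
BANDS = [
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
]


def agreement_label(icc):
    """Map a reliability coefficient to a verbal agreement label."""
    if icc < 0:
        return "poor"
    for upper, label in BANDS:
        if icc <= upper:
            return label
    raise ValueError("coefficient must not exceed 1.0")


# End points of the ranges reported in the results
for value in (0.23, 0.38, 0.45, 0.60):
    print(value, agreement_label(value))
# prints:
# 0.23 fair
# 0.38 fair
# 0.45 moderate
# 0.6 moderate
```

Under these cut-offs the reported ranges land in the same bands as the labels used in the results. Other published scales draw the boundaries differently and would label the same values less favorably.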

Conclusions:

  • LLMs agree moderately with human experts on objective abstract criteria, which makes them candidates for pre-screening or reviewer support.
  • LLMs offer efficiency and standardization advantages for large-scale abstract review.
  • Human judgment remains essential for comprehensive assessment due to LLMs' limitations in evaluating subjective dimensions.