The Two Word Test as a semantic benchmark for large language models.

Nicholas Riccardi¹, Xuan Yang², Rutvik H Desai³

¹Department of Communication Sciences and Disorders, University of South Carolina, Columbia, 29208, USA.

Scientific Reports | September 16, 2024

Summary

This summary is machine-generated.

The Two Word Test (TWT) reveals that large language models (LLMs) struggle with basic semantic understanding: they perform poorly on meaningfulness judgments compared to humans. The benchmark highlights the limits of LLM language comprehension.

Area of Science:

Natural Language Processing
Artificial Intelligence
Cognitive Science

Background:

Large language models (LLMs) demonstrate advanced capabilities, prompting discussions about their potential for human-like understanding and artificial general intelligence (AGI).
Current benchmarks often focus on reasoning or domain expertise, potentially overlooking fundamental semantic processing abilities.
Human language relies heavily on combining words to form meaningful concepts, a core linguistic operation.

Purpose of the Study:

Introduce the open-source Two Word Test (TWT) as a novel benchmark to assess the semantic abilities of LLMs.
Evaluate LLMs' capacity for meaningfulness judgments on two-word phrases, a task easily performed by humans.
Provide a tool to identify and address limitations in LLM language understanding.

Main Methods:

Developed the Two Word Test (TWT), comprising 1768 noun-noun combinations rated for meaningfulness by human participants.
Administered TWT in two versions: a 0-4 scale for nuanced ratings and a binary judgment task.
Tested leading LLMs, including GPT-4-turbo, GPT-3.5-turbo, Claude-3-Opus, and Gemini-1.0-Pro-001, on the TWT.
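
As a rough illustration of the binary version of the task, the sketch below scores a model's yes/no meaningfulness judgments against labels derived from human ratings on the 0-4 scale. The phrases, ratings, model answers, and the cutoff of 2.0 are all hypothetical values chosen for the example; the summary does not state how the study binarized ratings or computed scores.

```python
from statistics import mean

# Hypothetical data: phrase -> human ratings on the 0-4 meaningfulness scale.
human_ratings = {
    "baseball bat": [4, 4, 3, 4],
    "goat sky": [0, 1, 0, 0],
    "cake apple": [1, 0, 2, 1],
    "dog sled": [4, 3, 4, 4],
}

def binarize(ratings, threshold=2.0):
    """Label a phrase meaningful if its mean rating exceeds the threshold.

    The threshold is an assumption made for this sketch, not a value
    taken from the study.
    """
    return mean(ratings) > threshold

def binary_accuracy(model_judgments, ratings_by_phrase, threshold=2.0):
    """Fraction of phrases where the model agrees with the human-derived label."""
    agreements = sum(
        model_judgments[phrase] == binarize(ratings, threshold)
        for phrase, ratings in ratings_by_phrase.items()
    )
    return agreements / len(ratings_by_phrase)

# Hypothetical model output: True means "judged meaningful".
model_judgments = {
    "baseball bat": True,
    "goat sky": True,   # disagrees with the human-derived label
    "cake apple": False,
    "dog sled": True,
}

accuracy = binary_accuracy(model_judgments, human_ratings)
print(f"Agreement with human labels: {accuracy:.2f}")
```

With these made-up values the model agrees with the human-derived label on three of the four phrases.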

Main Results:

All tested LLMs performed significantly worse than humans in judging the meaningfulness of two-word phrases.
GPT-3.5-turbo, Gemini-1.0-Pro-001, and GPT-4-turbo failed to reliably distinguish between meaningful and nonsensical phrases.
Claude-3-Opus discriminated meaningful from nonsensical phrases better in the binary task but still lagged behind human performance.
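
One common way to quantify how well a rater separates meaningful from nonsensical phrases is the sensitivity index d′ from signal detection theory. The sketch below is a generic illustration under that assumption; the summary does not say which statistic the study used, and the counts and the `d_prime` helper are hypothetical.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    hits: meaningful phrases judged meaningful
    misses: meaningful phrases judged nonsensical
    false_alarms: nonsensical phrases judged meaningful
    correct_rejections: nonsensical phrases judged nonsensical

    A log-linear correction (+0.5 / +1) keeps both rates strictly
    between 0 and 1 so the inverse normal CDF is always defined.
    """
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return z(hit_rate) - z(fa_rate)

# Hypothetical rater that calls every phrase meaningful: no discrimination.
always_yes = d_prime(hits=100, misses=0, false_alarms=100, correct_rejections=0)

# Hypothetical rater that separates the two classes well.
discriminating = d_prime(hits=90, misses=10, false_alarms=10, correct_rejections=90)

print(f"always-yes rater d': {always_yes:.2f}")
print(f"discriminating rater d': {discriminating:.2f}")
```

A rater that labels nearly everything meaningful can still have a high hit rate, but its d′ stays near zero because its false-alarm rate is just as high. That is why a discrimination measure is more informative here than raw agreement on meaningful phrases alone.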

Conclusions:

The TWT effectively highlights the limitations of current LLMs in fundamental semantic understanding.
Results suggest caution is needed when attributing human-level or "true" understanding to LLMs based on existing benchmarks.
The TWT offers a valuable tool for assessing and potentially enhancing the semantic capabilities of future LLMs.