The Two Word Test as a semantic benchmark for large language models.

Nicholas Riccardi¹, Xuan Yang², Rutvik H Desai³

  • ¹Department of Communication Sciences and Disorders, University of South Carolina, Columbia, 29208, USA.

Scientific Reports | September 16, 2024
Summary
This summary is machine-generated.

The Two Word Test (TWT) shows that large language models (LLMs) struggle with basic semantic understanding: they perform poorly compared to humans when judging whether two-word phrases are meaningful. The benchmark highlights the limits of LLMs in true language comprehension.


Area of Science:

  • Natural Language Processing
  • Artificial Intelligence
  • Cognitive Science

Background:

  • Large language models (LLMs) demonstrate advanced capabilities, prompting discussions about their potential for human-like understanding and artificial general intelligence (AGI).
  • Current benchmarks often focus on reasoning or domain expertise, potentially overlooking fundamental semantic processing abilities.
  • Human language relies heavily on combining words to form meaningful concepts, a core linguistic operation.

Purpose of the Study:

  • Introduce the open-source Two Word Test (TWT) as a novel benchmark to assess the semantic abilities of LLMs.
  • Evaluate LLMs' capacity for meaningfulness judgments on two-word phrases, a task easily performed by humans.
  • Provide a tool to identify and address limitations in LLM language understanding.

Main Methods:

  • Developed the Two Word Test (TWT), comprising 1768 noun-noun combinations rated for meaningfulness by human participants.
  • Administered the TWT in two versions: a graded rating on a 0-4 scale and a binary judgment task.
  • Tested leading LLMs, including GPT-4-turbo, GPT-3.5-turbo, Claude-3-Opus, and Gemini-1.0-Pro-001, on the TWT.
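
The binary version of the task can be illustrated with a minimal Python sketch. It is not the authors' scoring code: the phrases and ratings are invented for illustration (they are not TWT items), and the cut-off of 2.0 for turning a mean 0-4 rating into a meaningful/nonsensical label is an assumption, as are the names human_label and binary_accuracy.

```python
from statistics import mean

# Invented example items (NOT taken from the TWT dataset):
# phrase -> human ratings on the 0-4 meaningfulness scale.
HUMAN_RATINGS = {
    "baseball bat": [4, 4, 3, 4],
    "mountain stream": [4, 3, 4, 3],
    "goat sky": [0, 1, 0, 0],
    "fence cloud": [1, 0, 0, 1],
}


def human_label(ratings, threshold=2.0):
    """Return True when the mean rating reaches the threshold.

    The threshold (scale midpoint) is an assumption of this sketch.
    """
    return mean(ratings) >= threshold


def binary_accuracy(model_judgments, human_ratings, threshold=2.0):
    """Fraction of phrases where the model's yes/no judgment matches the human-derived label."""
    matches = [
        model_judgments[phrase] == human_label(ratings, threshold)
        for phrase, ratings in human_ratings.items()
    ]
    return sum(matches) / len(matches)


# Hypothetical model output: True means "judged meaningful".
model_judgments = {
    "baseball bat": True,
    "mountain stream": True,
    "goat sky": True,
    "fence cloud": False,
}
print(binary_accuracy(model_judgments, HUMAN_RATINGS))  # 0.75
```

Here the hypothetical model wrongly accepts one nonsensical phrase, so it agrees with the human-derived labels on three of four items.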

Main Results:

  • All tested LLMs performed significantly worse than humans in judging the meaningfulness of two-word phrases.
  • GPT-3.5-turbo, Gemini-1.0-Pro-001, and GPT-4-turbo failed to reliably distinguish between meaningful and nonsensical phrases.
  • Claude-3-Opus showed improvement in binary discrimination but still lagged behind human performance.
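
One common way to quantify how well a rater separates meaningful from nonsensical phrases is the sensitivity index d' from signal detection theory. The sketch below is illustrative only: this summary does not state which statistic the study used, the function name d_prime is hypothetical, and the log-linear correction (adding 0.5 to each count) is a standard choice assumed here to avoid infinite values at rates of 0 or 1.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    hits / misses: meaningful phrases judged meaningful / nonsensical.
    false_alarms / correct_rejections: nonsensical phrases judged
    meaningful / nonsensical.
    A log-linear correction keeps both rates strictly between 0 and 1.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# A rater answering at chance has d' near 0; a rater who separates
# the two phrase types well has a clearly positive d'.
chance = d_prime(hits=50, misses=50, false_alarms=50, correct_rejections=50)
good = d_prime(hits=90, misses=10, false_alarms=10, correct_rejections=90)
print(round(chance, 3), good > 2.0)
```

A d' near zero corresponds to a rater that cannot reliably tell the two phrase types apart, regardless of any overall bias toward answering "meaningful".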

Conclusions:

  • The TWT effectively highlights the limitations of current LLMs in fundamental semantic understanding.
  • Results suggest caution is needed when attributing human-level or "true" understanding to LLMs based on existing benchmarks.
  • The TWT offers a valuable tool for assessing and potentially enhancing the semantic capabilities of future LLMs.