Larger and more instructable language models become less reliable.

Lexin Zhou¹,², Wout Schellaert¹,³, Fernando Martínez-Plumed¹,⁴

¹Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Valencia, Spain.

September 25, 2024

Summary

This summary is machine-generated.

Scaling up large language models (LLMs) may decrease reliability. While larger models answer more questions, they often provide incorrect answers that are hard for humans to detect, necessitating new AI development approaches.

Area of Science:

Artificial Intelligence
Natural Language Processing
Machine Learning

Background:

Current large language model (LLM) development focuses on scaling (increasing size, data, computation) and shaping (fine-tuning, human feedback).
Despite advancements, larger and more "instructable" LLMs may exhibit reduced reliability and unpredictable error patterns.

Purpose of the Study:

To investigate the relationship between task difficulty, model avoidance, and prompting stability in various LLM families.
To assess how scaling and shaping impact LLM reliability and error predictability, particularly in high-stakes applications.

Main Methods:

Analysis of difficulty concordance between human participants and LLMs, that is, whether the tasks humans find difficult are also the ones the models get wrong.
Evaluation of task avoidance and prompting stability across different LLM families.
Comparison of error types and detectability between early and scaled-up/shaped-up LLMs.
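The evaluation described above can be illustrated with a minimal sketch. It assumes each model response has already been graded as correct, avoidant, or incorrect and assigned a difficulty bin; the data, the bin names, and the function `outcome_rates` are hypothetical and are not taken from the study.

```python
from collections import Counter, defaultdict

# Hypothetical graded responses: (difficulty_bin, outcome).
# The labels and bins are assumptions made for illustration only.
RESPONSES = [
    ("easy", "correct"),
    ("easy", "correct"),
    ("easy", "incorrect"),
    ("easy", "avoidant"),
    ("hard", "correct"),
    ("hard", "incorrect"),
    ("hard", "incorrect"),
    ("hard", "avoidant"),
]

OUTCOMES = ("correct", "avoidant", "incorrect")


def outcome_rates(responses):
    """Return {difficulty_bin: {outcome: share of responses in that bin}}."""
    counts = defaultdict(Counter)
    for difficulty, outcome in responses:
        counts[difficulty][outcome] += 1

    rates = {}
    for difficulty, counter in counts.items():
        total = sum(counter.values())
        rates[difficulty] = {o: counter[o] / total for o in OUTCOMES}
    return rates


rates = outcome_rates(RESPONSES)
for difficulty, shares in rates.items():
    print(difficulty, shares)
```

Separating avoidant from incorrect responses matters here because a model that declines to answer and a model that answers wrongly fail in different ways, and only the second kind of failure has to be caught by a human supervisor.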

Main Results:

Tasks that are easy for humans are also easy for LLMs, but even scaled-up, shaped-up models offer no low-difficulty zone in which the model never errs or human supervision reliably catches its errors.
Whereas earlier models often avoid answering, scaled-up and shaped-up LLMs far more often give plausible but incorrect answers, including errors on difficult questions that human supervisors frequently overlook.
Scaling and shaping improve stability across different natural phrasings of the same question, but pockets of variability persist across difficulty levels.
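One simple way to quantify stability to phrasing is sketched below. It assumes the same question is asked with several phrasings and each response is graded; the metric (the share of phrasings that agree with the most common outcome), the data, and the function names are assumptions made for illustration, not the measure used in the study.

```python
# Hypothetical graded outcomes for four questions, each asked with three phrasings.
OUTCOMES_BY_QUESTION = {
    "q1": ["correct", "correct", "correct"],
    "q2": ["correct", "incorrect", "correct"],
    "q3": ["incorrect", "incorrect", "incorrect"],
    "q4": ["avoidant", "incorrect", "correct"],
}


def stability(outcomes):
    """Share of phrasings that yield the most common outcome for one question."""
    most_common_count = max(outcomes.count(label) for label in set(outcomes))
    return most_common_count / len(outcomes)


def mean_stability(outcomes_by_question):
    """Average per-question stability over all questions."""
    scores = [stability(o) for o in outcomes_by_question.values()]
    return sum(scores) / len(scores)


per_question = {q: stability(o) for q, o in OUTCOMES_BY_QUESTION.items()}
overall = mean_stability(OUTCOMES_BY_QUESTION)
print(per_question)
print(overall)
```

In this toy data, q3 is perfectly stable yet always wrong, which shows why stability and correctness have to be reported separately: a model can respond consistently to rephrasing and still be unreliable.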

Conclusions:

Scaling and shaping LLMs do not inherently improve reliability or predictability of errors.
A paradigm shift in AI design is needed, focusing on predictable error distributions for critical applications.
Further research is required to ensure AI safety and trustworthiness, especially in high-stakes domains.