



Larger and more instructable language models become less reliable.

Lexin Zhou1,2, Wout Schellaert1,3, Fernando Martínez-Plumed1,4

  • 1Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Valencia, Spain.

Nature
|September 25, 2024
Summary
This summary is machine-generated.

Scaling up large language models (LLMs) may decrease reliability. While larger models answer more questions, they often provide incorrect answers that are hard for humans to detect, necessitating new AI development approaches.


Area of Science:

  • Artificial Intelligence
  • Natural Language Processing
  • Machine Learning

Background:

  • Current large language model (LLM) development focuses on scaling (increasing size, data, computation) and shaping (fine-tuning, human feedback).
  • Despite advancements, larger and more "instructable" LLMs may exhibit reduced reliability and unpredictable error patterns.

Purpose of the Study:

  • To investigate the relationship between task difficulty, model avoidance, and prompting stability in various LLM families.
  • To assess how scaling and shaping impact LLM reliability and error predictability, particularly in high-stakes applications.

Main Methods:

  • Analysis of difficulty concordance between human participants and LLMs.
  • Evaluation of task avoidance and prompting stability across different LLM families.
  • Comparison of error types and detectability between early and scaled-up/shaped-up LLMs.
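As a rough illustration of the quantities named in these methods, the sketch below tallies correct, avoidant, and incorrect responses per difficulty bin and measures how often an item's outcome stays the same across prompt phrasings. It is not the authors' code: the record layout, the `RECORDS` sample, and the function names `outcome_rates` and `prompt_stability` are all hypothetical, and the data are invented purely to make the example runnable.

```python
from collections import Counter, defaultdict

# The three response categories discussed in the summary.
OUTCOMES = ("correct", "avoidant", "incorrect")

# Hypothetical graded responses: (item_id, difficulty_bin, prompt_variant, outcome).
# These values are made up for illustration; they are not results from the study.
RECORDS = [
    ("q1", "easy", "p1", "correct"),
    ("q1", "easy", "p2", "correct"),
    ("q2", "easy", "p1", "correct"),
    ("q2", "easy", "p2", "incorrect"),
    ("q3", "hard", "p1", "avoidant"),
    ("q3", "hard", "p2", "avoidant"),
    ("q4", "hard", "p1", "incorrect"),
    ("q4", "hard", "p2", "incorrect"),
]


def outcome_rates(records):
    """Return, for each difficulty bin, the share of each outcome category."""
    counts = defaultdict(Counter)
    for _item, difficulty, _variant, outcome in records:
        counts[difficulty][outcome] += 1
    rates = {}
    for difficulty, counter in counts.items():
        total = sum(counter.values())
        # Counter returns 0 for categories that never occurred in this bin.
        rates[difficulty] = {o: counter[o] / total for o in OUTCOMES}
    return rates


def prompt_stability(records):
    """Return the fraction of items whose outcome is identical across all prompt variants."""
    by_item = defaultdict(set)
    for item, _difficulty, _variant, outcome in records:
        by_item[item].add(outcome)
    if not by_item:
        raise ValueError("no records supplied")
    stable = sum(1 for outcomes in by_item.values() if len(outcomes) == 1)
    return stable / len(by_item)


if __name__ == "__main__":
    for difficulty, rates in outcome_rates(RECORDS).items():
        print(difficulty, rates)
    print("prompt stability:", prompt_stability(RECORDS))
```

Keeping avoidant and incorrect responses as separate categories matters for the comparison described above: a model that answers more questions can raise its correct rate and its incorrect rate at the same time, which a single accuracy figure would hide.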

Main Results:

  • Tasks that are easy for humans are also easy for LLMs, but scaled-up models still do not offer low-difficulty zones that are error-free or in which human supervision reliably catches the errors.
  • Whereas earlier models often avoided questions, scaled-up and shaped-up LLMs more often give plausible but incorrect answers, including errors on difficult questions that human supervisors frequently overlook.
  • Scaling and shaping improve the stability of responses to different phrasings of the same question, but pockets of variability persist across difficulty levels.

Conclusions:

  • Scaling and shaping LLMs do not inherently improve reliability or predictability of errors.
  • A paradigm shift in AI design is needed, focusing on predictable error distributions for critical applications.
  • Further research is required to ensure AI safety and trustworthiness, especially in high-stakes domains.