Evaluating language models for mathematics through interactions.

Katherine M Collins (1), Albert Q Jiang (1), Simon Frieder (2)

  • (1) University of Cambridge, Cambridge CB2 1TN, United Kingdom.

Proceedings of the National Academy of Sciences of the United States of America | June 3, 2024
Summary
This summary is machine-generated.

Evaluating large language models (LLMs) as interactive problem-solving assistants requires more than static input-output tests. The study shows that although models such as GPT-4 perform well in mathematics, evaluation through human interaction reveals cases where the correctness of an output diverges from its perceived helpfulness.

Keywords:
AI, human–computer interaction, language models, theorem proving

Area of Science:

  • Artificial Intelligence
  • Human-Computer Interaction
  • Mathematics Education

Background:

  • Large language models (LLMs) show promise as problem-solving assistants.
  • Current LLM evaluation methods using static input-output pairs are inadequate for interactive settings.
  • Understanding LLM capabilities in dynamic, real-world applications is crucial.

Purpose of the Study:

  • To introduce CheckMate, a platform for interactive LLM evaluation.
  • To assess InstructGPT, ChatGPT, and GPT-4 as mathematical problem-solving assistants.
  • To analyze human interaction patterns and LLM performance in a mathematical context.

Main Methods:

  • Developed and utilized the CheckMate platform for human-LLM interaction.
  • Conducted a study involving undergraduate mathematics students and professors.
  • Collected interaction data and ratings to form the MathConverse dataset.
  • Performed case studies on GPT-4's mathematical problem-solving capabilities.

Main Results:

  • Derived a taxonomy of human query behaviors during LLM interaction.
  • Observed divergence between LLM output correctness and perceived helpfulness.
  • Identified specific strengths and weaknesses of GPT-4 in mathematical proofs.
  • Released the MathConverse dataset for further research.
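
The divergence between correctness and perceived helpfulness noted above could be examined along the following lines. This is a minimal sketch, not the study's analysis: the `Rating` record, the 0-6 rating scale, the `gap` threshold, the model names, and the sample values are all hypothetical and do not come from the MathConverse dataset.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Rating:
    """One rated model response (hypothetical schema, not MathConverse's)."""
    model: str
    correctness: int   # assumed 0-6 scale
    helpfulness: int   # assumed 0-6 scale


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def divergent(ratings, gap=2):
    """Return ratings where correctness and helpfulness differ by at least `gap`."""
    return [r for r in ratings if abs(r.correctness - r.helpfulness) >= gap]


# Made-up sample values, for illustration only.
ratings = [
    Rating("model_a", correctness=6, helpfulness=6),
    Rating("model_a", correctness=2, helpfulness=5),  # rated helpful despite errors
    Rating("model_b", correctness=5, helpfulness=2),  # rated correct but unhelpful
    Rating("model_b", correctness=3, helpfulness=3),
]

r = pearson([x.correctness for x in ratings], [x.helpfulness for x in ratings])
flagged = divergent(ratings)
print(f"correlation: {r:.2f}, divergent responses: {len(flagged)}")
```

A low correlation or a large share of flagged responses would suggest that correctness alone is a poor proxy for how useful participants found a response, which is the kind of pattern a static benchmark would miss.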

Conclusions:

  • Interactive evaluation is essential for understanding LLM utility.
  • LLMs that communicate uncertainty and accept corrections are better assistants.
  • Mathematicians and ML practitioners should be aware of LLM limitations and potential fallibility.