VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment.

Yogesh Kulkarni¹, Pooyan Fazli¹

  • ¹Arizona State University

Proceedings of the Conference on Empirical Methods in Natural Language Processing | May 25, 2026
Summary
This summary is machine-generated.

VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries) improves video-language models by training them to distinguish correct video representations from adversarial examples that violate spatial, temporal, or cross-frame relationships. The approach improves understanding of spatial and temporal detail without human annotation.

Area of Science:

  • Artificial Intelligence
  • Computer Vision
  • Natural Language Processing

Background:

  • Video-language models (Video-LLMs) demonstrate proficiency in video comprehension but exhibit weaknesses in spatial reasoning, temporal sequencing, and cross-frame consistency.
  • Existing models often require extensive pretraining or architectural changes to improve performance.

Purpose of the Study:

  • To introduce VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries), a novel framework designed to enhance Video-LLMs.
  • To improve the ability of Video-LLMs to understand complex spatial and temporal dynamics within videos.

Main Methods:

  • VideoPASTA employs targeted preference optimization, training Video-LLMs to differentiate correct video representations from adversarial examples that violate spatial, temporal, or cross-frame relationships.
  • The framework uses Direct Preference Optimization (DPO) on a small dataset of 7,020 preference pairs, with 32-frame sampling.
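The preference objective named above can be illustrated with a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair. This is not the authors' implementation: the `PreferencePair` record layout, the `failure_mode` labels, the example log-probabilities, and the `beta` value of 0.1 are all assumptions made for illustration, since the summary does not state them.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical record layout for one preference pair (not from the paper)."""
    question: str
    preferred: str      # response consistent with the video
    adversarial: str    # response violating a spatial, temporal, or cross-frame relation
    failure_mode: str   # assumed labels: "spatial", "temporal", "cross_frame"


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_pref: float, policy_logp_adv: float,
             ref_logp_pref: float, ref_logp_adv: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair.

    Inputs are sequence log-probabilities of the preferred and adversarial
    responses under the trainable policy and a frozen reference model.
    beta=0.1 is an assumed value, not one reported in the summary.
    """
    margin = beta * ((policy_logp_pref - ref_logp_pref)
                     - (policy_logp_adv - ref_logp_adv))
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


# Made-up log-probabilities: the policy has shifted toward the preferred response.
pair = PreferencePair(
    question="What happens after the person opens the door?",
    preferred="They walk into the room.",
    adversarial="They close the window.",
    failure_mode="temporal",
)
loss = dpo_loss(policy_logp_pref=-4.0, policy_logp_adv=-6.0,
                ref_logp_pref=-5.0, ref_logp_adv=-5.0)
print(pair.failure_mode, round(loss, 3))
```

The loss equals log 2 when the policy and reference agree, and falls below that value as the policy assigns relatively more probability to the preferred response than to the adversarial one.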

Main Results:

  • VideoPASTA significantly enhances the performance of various state-of-the-art Video-LLMs across multiple benchmarks, including LongVideoBench (+3.8%), VideoMME (+4.1%), and MVBench (+4.0%).
  • The approach is model-agnostic and achieves substantial improvements without requiring human annotation or captioning.
  • Models trained with VideoPASTA demonstrate improved capture of fine-grained spatial details and long-range temporal dynamics.

Conclusions:

  • Targeted preference alignment, as implemented in VideoPASTA, is an effective strategy for addressing core challenges in video-language understanding.
  • VideoPASTA offers a scalable, plug-and-play method that integrates with existing Video-LLMs and improves their capabilities without altering their architecture.