VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment.

Yogesh Kulkarni¹, Pooyan Fazli¹

¹Arizona State University.

Proceedings of the Conference on Empirical Methods in Natural Language Processing

May 25, 2026

Summary

This summary is machine-generated.

VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries) improves video-language models by training them to identify flawed video representations. This approach enhances understanding of spatial and temporal details without human annotation.

Area of Science:

Artificial Intelligence
Computer Vision
Natural Language Processing

Background:

Video-language models (Video-LLMs) demonstrate proficiency in video comprehension but exhibit weaknesses in spatial reasoning, temporal sequencing, and cross-frame consistency.
Existing models often require extensive pretraining or architectural changes to improve performance.

Purpose of the Study:

To introduce VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries), a novel framework designed to enhance Video-LLMs.
To improve the ability of Video-LLMs to understand complex spatial and temporal dynamics within videos.

Main Methods:

VideoPASTA employs targeted preference optimization, training Video-LLMs to differentiate correct video representations from adversarial examples that violate spatial, temporal, or cross-frame relationships.
The framework utilizes Direct Preference Optimization with a limited dataset of 7,020 preference pairs and 32-frame sampling.
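The two ingredients named above, preference pairs and Direct Preference Optimization, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PreferencePair` fields, the `failure_mode` labels, the `beta` value of 0.1, and the choice of evenly spaced frames are all hypothetical, since the summary gives only the pair count (7,020) and the frame count (32). The loss is the standard DPO objective, computed here on scalar sequence log-probabilities.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """One training example (hypothetical field names)."""
    prompt: str
    chosen: str        # response consistent with the video
    rejected: str      # adversarial response that breaks a video relationship
    failure_mode: str  # assumed labels: "spatial", "temporal", "cross_frame"


def uniform_frame_indices(total_frames: int, num_samples: int = 32) -> List[int]:
    """Pick frame indices spread evenly over the video.

    Uniform spacing is an assumption; the summary only says 32 frames are sampled.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if total_frames <= num_samples:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the middle frame of each of the num_samples equal-length segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the source
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    The margin measures how much more the policy prefers the chosen response
    over the rejected one, relative to a frozen reference model.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    pair = PreferencePair(
        prompt="What happens after the person opens the door?",
        chosen="They walk into the room.",
        rejected="They close the door before opening it.",
        failure_mode="temporal",
    )
    print(pair.failure_mode, len(uniform_frame_indices(900)))
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

When the policy and reference agree on both responses the margin is zero and the loss equals log 2; the loss drops below that as the policy shifts probability toward the chosen response and away from the adversarial one.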

Main Results:

VideoPASTA significantly enhances the performance of various state-of-the-art Video-LLMs across multiple benchmarks, including LongVideoBench (+3.8%), VideoMME (+4.1%), and MVBench (+4.0%).
The approach is model-agnostic and achieves substantial improvements without requiring human annotation or captioning.
Models trained with VideoPASTA demonstrate improved capture of fine-grained spatial details and long-range temporal dynamics.

Conclusions:

Targeted preference alignment, as implemented in VideoPASTA, is an effective strategy for addressing core challenges in video-language understanding.
VideoPASTA offers a scalable, plug-and-play solution that integrates with existing Video-LLMs and enhances their capabilities without altering their architecture.