TransXNet: Learning Both Global and Local Dynamics With a Dual Dynamic Token Mixer for Visual Recognition.

Meng Lou, Shu Zhang, Hong-Yu Zhou

IEEE Transactions on Neural Networks and Learning Systems

April 3, 2025

Summary

This summary is machine-generated.

This study introduces a dual dynamic token mixer (D-Mixer) for vision networks whose global and local operations both adapt to the input. TransXNet, the backbone built from D-Mixer, achieves higher accuracy at lower computational cost than comparable backbones in image classification and dense prediction tasks.

Area of Science:

Computer Vision
Deep Learning
Artificial Intelligence

Background:

Hybrid networks integrate convolutions into transformers to add inductive bias and improve generalization.
Convolution kernels in these hybrids are static, so they cannot adapt to the input or fuse features produced by self-attention, which computes its weights dynamically.
This mismatch leaves current CNN-transformer architectures with suboptimal representation capacity.

Purpose of the Study:

To address the limitations of static convolutions in hybrid vision networks.
To propose a novel, lightweight dual dynamic token mixer (D-Mixer) for enhanced feature representation.
To develop a new hybrid CNN-transformer backbone, TransXNet, for improved performance and efficiency.

Main Methods:

Introduced a dual dynamic token mixer (D-Mixer) that learns global and local dynamics in an input-dependent manner.
D-Mixer applies an efficient global attention module and an input-dependent depthwise convolution to separate halves of the split feature map.
Constructed TransXNet, a hybrid CNN-transformer vision backbone, using D-Mixer as the fundamental building block.
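
The split-and-mix idea described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on a 1D token sequence rather than a 2D feature map, it substitutes ordinary parameter-free softmax attention for the efficient global attention module, and the names global_attention, dynamic_depthwise_conv, d_mixer, kernel_bank, and gate are hypothetical. The input-dependent kernel is formed here by mixing a small bank of candidate kernels with weights computed from the pooled input, which is one common way to make a convolution dynamic.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(tokens):
    """Parameter-free single-head self-attention over all tokens.

    Stand-in (assumption) for the summary's "efficient global attention".
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for query in tokens:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out


def dynamic_depthwise_conv(tokens, kernel_bank, gate):
    """Depthwise 1D convolution whose per-channel kernel depends on the input.

    kernel_bank[g][c] is candidate kernel g for channel c (hypothetical layout).
    gate[g][c] scales the pooled channel value into a mixing score (assumption).
    """
    n, dim = len(tokens), len(tokens[0])
    groups, ksize = len(kernel_bank), len(kernel_bank[0][0])
    pad = ksize // 2
    # Global average pooling gives one context value per channel.
    context = [sum(t[c] for t in tokens) / n for c in range(dim)]
    out = [[0.0] * dim for _ in range(n)]
    for c in range(dim):
        # Input-dependent weights over the candidate kernels for this channel.
        mix = softmax([gate[g][c] * context[c] for g in range(groups)])
        kernel = [sum(mix[g] * kernel_bank[g][c][j] for g in range(groups))
                  for j in range(ksize)]
        for i in range(n):
            acc = 0.0
            for j in range(ksize):
                idx = i + j - pad  # zero padding at the sequence borders
                if 0 <= idx < n:
                    acc += kernel[j] * tokens[idx][c]
            out[i][c] = acc
    return out


def d_mixer(tokens, kernel_bank, gate):
    """Split channels in half, mix globally and locally, then concatenate."""
    half = len(tokens[0]) // 2
    global_part = global_attention([t[:half] for t in tokens])
    local_part = dynamic_depthwise_conv([t[half:] for t in tokens], kernel_bank, gate)
    return [g + l for g, l in zip(global_part, local_part)]


random.seed(0)
n_tokens, dim, groups, ksize = 6, 4, 2, 3
half = dim // 2
tokens = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]
kernel_bank = [[[random.uniform(-1, 1) for _ in range(ksize)] for _ in range(half)]
               for _ in range(groups)]
gate = [[random.uniform(-1, 1) for _ in range(half)] for _ in range(groups)]

mixed = d_mixer(tokens, kernel_bank, gate)
print(len(mixed), len(mixed[0]))  # prints: 6 4
```

The output keeps the input shape, so blocks of this kind can be stacked. Because the convolution kernel is recomputed from each input, both halves of the mixer respond to the data rather than applying fixed weights, which is the property the summary attributes to D-Mixer.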

Main Results:

TransXNet-T achieved 0.3 percentage points higher top-1 accuracy than Swin-T on ImageNet-1K at less than half the computational cost.
TransXNet-S and TransXNet-B scaled well, reaching 83.8% and 84.6% top-1 accuracy on ImageNet-1K, respectively.
The architecture showed superior generalization on dense prediction tasks compared to state-of-the-art methods at lower computational costs.

Conclusions:

The proposed D-Mixer effectively overcomes the limitations of static convolutions in hybrid networks.
TransXNet offers a compelling balance of high accuracy, efficiency, and strong generalization capabilities.
The D-Mixer approach presents a promising direction for designing efficient and effective vision backbone networks.