Multi-modal Language models in bioacoustics with zero-shot transfer: a case study

Area of Science:

Bioacoustics
Ecoacoustics
Soundscape Ecology
Artificial Intelligence (AI)
Machine Learning

Background:

Traditional AI methods for wildlife monitoring rely on supervised learning, requiring extensive manual annotation of bioacoustic data.
Manual annotation is labor-intensive, costly, and demands significant domain expertise, limiting AI deployment in real-world conservation.
Supervised learning is restricted to predefined categories, hindering adaptability to novel or diverse acoustic environments.

Purpose of the Study:

To explore the potential and limitations of Multi-Modal Language Models (MMLMs) in bioacoustic applications.
To showcase how MMLMs can overcome challenges associated with traditional supervised learning in wildlife sound detection.
To evaluate the zero-shot transfer capabilities of an Audio-Language Model for bioacoustic monitoring.

Main Methods:

Applied the Contrastive Language-Audio Pretraining (CLAP) model, an Audio-Language Model, to eight diverse bioacoustic benchmarks.
Used simple prompt engineering, in the form of manually written text prompts, to tell the CLAP model which sound categories to recognize.
Evaluated CLAP's performance on recognizing group-level sound categories without model fine-tuning or additional training.
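
The zero-shot procedure summarized above can be illustrated with a minimal sketch. It assumes the usual contrastive setup, in which an audio clip and each text prompt are mapped to embedding vectors and the clip is assigned to the prompt with the highest cosine similarity. The prompt template, the temperature value, and the toy three-dimensional embeddings are hypothetical placeholders, not values from the study; in practice the embeddings would come from CLAP's audio and text encoders, which are not called here.

```python
import math

# Hypothetical prompt template; the study's exact prompts are not given here.
PROMPT_TEMPLATE = "This is a sound of {}"


def build_prompts(labels):
    """Turn category names into natural-language text prompts."""
    return [PROMPT_TEMPLATE.format(label) for label in labels]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(audio_embedding, text_embeddings, labels, temperature=0.1):
    """Pick the label whose prompt embedding is closest to the audio embedding.

    No training or fine-tuning happens here: the decision depends only on
    similarities between precomputed embeddings.
    """
    sims = [cosine_similarity(audio_embedding, t) for t in text_embeddings]
    probs = softmax(sims, temperature)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


# Toy stand-ins for encoder outputs (assumed values, for illustration only).
labels = ["a bird", "a frog", "a whale"]
prompts = build_prompts(labels)
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
audio_embedding = [0.1, 0.9, 0.2]

predicted, probabilities = zero_shot_classify(audio_embedding, text_embeddings, labels)
print(prompts)
print(predicted, probabilities)
```

Because the categories enter only through the text prompts, changing what the model listens for amounts to editing the `labels` list rather than annotating new audio. The same design also makes the result sensitive to how the prompts are worded.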

Main Results:

CLAP effectively recognized broad categories like birds, frogs, and whales across benchmarks with zero-shot transfer, achieving performance comparable to supervised baselines.
Demonstrated CLAP's potential for novel tasks, including estimating relative sound distances and discovering unknown species.
Identified limitations, including an inability to distinguish fine-grained, species-level categories and a dependence on manually crafted text prompts.

Conclusions:

Multi-Modal Language Models, specifically Audio-Language Models like CLAP, offer a versatile and efficient alternative to supervised learning for bioacoustic monitoring.
CLAP demonstrates significant potential for zero-shot sound event detection in diverse ecological contexts, reducing reliance on manual annotation.
Further research is needed to address limitations in fine-grained recognition and prompt engineering for practical, real-world bioacoustic applications.