Human-interpretable clustering of short text using large language models.

Justin K Miller¹, Tristram J Alexander¹

¹School of Physics, The University of Sydney, Sydney, Australia.

Royal Society Open Science

January 23, 2025

Summary

This summary is machine-generated.

Large language models (LLMs) effectively cluster short texts by generating nuanced embeddings. This approach surpasses traditional methods, offering more interpretable and distinctive clusters validated by both humans and generative LLMs.

Keywords:

large language models clustering validation text clustering

Area of Science:

Natural Language Processing
Machine Learning
Data Mining

Background:

Clustering short text is challenging due to low word co-occurrence.
Traditional methods like doc2vec and latent Dirichlet allocation have limitations in capturing semantic meaning.
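The co-occurrence problem can be illustrated with a small sketch. The three texts below are hypothetical and not taken from the study, and Jaccard word overlap is used here only as a simple stand-in for word-based similarity. The first two texts express the same idea, yet every pair shares no words, so a purely word-based measure cannot tell the related pair from the unrelated ones.

```python
from itertools import combinations

def tokens(text):
    """Lowercase a text and split it into a set of whitespace-delimited words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Share of words two texts have in common (0 = none, 1 = identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical short texts: the first two say the same thing in different words.
texts = [
    "petrol prices keep climbing",
    "fuel is getting more expensive",
    "my cat sleeps all day",
]

# Print the word overlap for every pair of texts.
for (i, a), (j, b) in combinations(enumerate(texts), 2):
    print(i, j, jaccard(tokens(a), tokens(b)))
```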

Purpose of the Study:

To demonstrate the efficacy of large language models (LLMs) in clustering short texts.
To compare LLM-based clustering with traditional methods.
To explore LLMs for cluster validation.

Main Methods:

Generating text embeddings using large language models (LLMs).
Applying Gaussian mixture modeling to cluster embeddings.
Comparing LLM-generated clusters with doc2vec and latent Dirichlet allocation outputs.
Quantifying cluster quality using human reviewers and a generative LLM.
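The clustering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 2-D vectors are hand-made stand-ins for LLM embeddings, and the diagonal covariance, the farthest-point initialization, and the names `fit_gmm` and `init_means` are all assumptions made for this example. Real embeddings have hundreds or thousands of dimensions and would normally be clustered with a library implementation of Gaussian mixture modeling.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_means(points, k):
    """Deterministic start (assumption): first point, then farthest points."""
    means = [list(points[0])]
    while len(means) < k:
        far = max(points, key=lambda p: min(sq_dist(p, m) for m in means))
        means.append(list(far))
    return means

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def fit_gmm(points, k, iters=50, min_var=1e-6):
    """Fit a k-component diagonal Gaussian mixture with EM; return hard labels."""
    dim = len(points[0])
    means = init_means(points, k)
    variances = [[1.0] * dim for _ in range(k)]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in points:
            logs = [
                math.log(weights[c]) + log_gauss(x, means[c], variances[c])
                for c in range(k)
            ]
            top = max(logs)
            probs = [math.exp(value - top) for value in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weights, means, and variances.
        for c in range(k):
            n_c = sum(r[c] for r in resp) + 1e-12
            weights[c] = n_c / len(points)
            means[c] = [
                sum(r[c] * x[d] for r, x in zip(resp, points)) / n_c
                for d in range(dim)
            ]
            variances[c] = [
                max(
                    sum(r[c] * (x[d] - means[c][d]) ** 2
                        for r, x in zip(resp, points)) / n_c,
                    min_var,  # floor keeps a component from collapsing
                )
                for d in range(dim)
            ]
    # Assign each point to its most responsible component.
    return [max(range(k), key=lambda c: r[c]) for r in resp]

# Toy "embeddings" (hypothetical): two well-separated groups of three texts.
embeddings = [(0.9, 0.1), (1.0, 0.0), (0.8, 0.2),
              (0.1, 0.9), (0.0, 1.0), (0.2, 0.8)]
labels = fit_gmm(embeddings, k=2)
print(labels)
```

The number of components `k` is fixed by hand here; how the study chose the number of clusters is not stated in this summary.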

Main Results:

LLM-based embeddings capture semantic nuances, overcoming limitations of traditional methods.
Clusters generated using LLMs and Gaussian mixture modeling are more distinctive and human-interpretable.
A generative LLM shows strong agreement with human reviewers in cluster validation.
The comparison reveals biases in both LLM and human coding, calling into question the use of human coding as the sole validation standard.
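One common way to quantify agreement between two coders is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is illustrative only: this summary does not say which agreement statistic the study used, and the `human` and `llm` label lists are hypothetical data, not results from the paper.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater labelled at random with their own rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical cluster assignments for eight texts by a human and an LLM.
human = ["A", "A", "B", "B", "A", "C", "C", "B"]
llm = ["A", "A", "B", "A", "A", "C", "C", "B"]
print(round(cohens_kappa(human, llm), 3))
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance.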

Conclusions:

Large language models offer a powerful solution for short text clustering.
LLMs can help bridge the validation gap in clustering by interpreting clusters in a way that agrees closely with human reviewers.
The study challenges the conventional reliance on human coding as the only standard for cluster validation and highlights the potential of LLMs as validators.