



Human-interpretable clustering of short text using large language models.

Justin K Miller¹, Tristram J Alexander¹

  • ¹ School of Physics, The University of Sydney, Sydney, Australia.

Royal Society Open Science | January 23, 2025
Summary
This summary is machine-generated.

Embeddings generated by large language models (LLMs) capture enough semantic nuance to cluster short texts effectively. Compared with doc2vec and latent Dirichlet allocation, this approach yields more distinctive and interpretable clusters, as judged by both human reviewers and a generative LLM.

Keywords:
large language models, clustering validation, text clustering


Area of Science:

  • Natural Language Processing
  • Machine Learning
  • Data Mining

Background:

  • Clustering short text is challenging due to low word co-occurrence.
  • Traditional methods like doc2vec and latent Dirichlet allocation have limitations in capturing semantic meaning.

Purpose of the Study:

  • To demonstrate the efficacy of large language models (LLMs) in clustering short texts.
  • To compare LLM-based clustering with traditional methods.
  • To explore LLMs for cluster validation.

Main Methods:

  • Generating text embeddings using large language models (LLMs).
  • Applying Gaussian mixture modeling to cluster embeddings.
  • Comparing LLM-generated clusters with doc2vec and latent Dirichlet allocation outputs.
  • Quantifying cluster quality using human reviewers and a generative LLM.
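
The clustering step listed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the LLM embedding step is replaced by synthetic vectors from a hypothetical helper `make_fake_embeddings`, the mixture uses spherical covariances for brevity (the covariance structure used in the study is not stated here), and all function names are assumptions made for this example.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Pick k starting means: the first point, then repeatedly the point
    farthest from all means chosen so far (a simple deterministic init)."""
    chosen = [points[0]]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(sq_dist(p, c) for c in chosen)))
    return [list(c) for c in chosen]


def fit_spherical_gmm(points, k, n_iter=30):
    """Fit a k-component Gaussian mixture with spherical covariances by EM.
    Returns (weights, means, variances, hard_labels)."""
    n, dim = len(points), len(points[0])
    means = farthest_point_init(points, k)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibilities via log-densities and a log-sum-exp shift.
        resp = []
        for p in points:
            log_p = [
                math.log(weights[j])
                - 0.5 * dim * math.log(2 * math.pi * variances[j])
                - sq_dist(p, means[j]) / (2 * variances[j])
                for j in range(k)
            ]
            top = max(log_p)
            unnorm = [math.exp(v - top) for v in log_p]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)  # guard against empty components
            weights[j] = nj / n
            means[j] = [sum(r[j] * p[d] for r, p in zip(resp, points)) / nj
                        for d in range(dim)]
            spread = sum(r[j] * sq_dist(p, means[j]) for r, p in zip(resp, points))
            variances[j] = max(spread / (dim * nj), 1e-6)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


def make_fake_embeddings(seed=0):
    """Hypothetical stand-in for LLM embeddings: three well-separated blobs."""
    rng = random.Random(seed)
    centres = [(0.0, 0.0, 0.0), (5.0, 5.0, 0.0), (0.0, 5.0, 5.0)]
    points, truth = [], []
    for label, centre in enumerate(centres):
        for _ in range(20):
            points.append([x + rng.gauss(0.0, 0.3) for x in centre])
            truth.append(label)
    return points, truth


if __name__ == "__main__":
    points, truth = make_fake_embeddings()
    weights, means, variances, labels = fit_spherical_gmm(points, k=3)
    print("mixture weights:", [round(w, 3) for w in weights])
```

In practice the synthetic vectors would be replaced by one embedding per short text, and a library implementation of Gaussian mixture modeling with full covariances would usually be preferable to this hand-rolled EM loop.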

Main Results:

  • LLM-based embeddings capture semantic nuances, overcoming limitations of traditional methods.
  • Clusters generated using LLMs and Gaussian mixture modeling are more distinctive and human-interpretable.
  • A generative LLM shows strong agreement with human reviewers in cluster validation.
  • Comparing the two reveals biases in both LLM and human coding, which calls into question the use of human coding as the sole validation standard.
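
Agreement between a generative LLM and human reviewers, as described in the results above, can be quantified with a chance-corrected statistic. The sketch below uses Cohen's kappa as one common choice; the summary does not say which agreement measure the study used, so the metric, the function name `cohen_kappa`, and the example labels are all assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the raters labelled independently at their own rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical cluster names assigned to the same four texts.
    human = ["sports", "sports", "politics", "politics"]
    llm = ["sports", "sports", "politics", "sports"]
    print("kappa:", cohen_kappa(human, llm))
```

A kappa near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance. Because both raters can be biased, a high kappa shows consistency between them rather than correctness of either.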

Conclusions:

  • Large language models offer a powerful solution for short text clustering.
  • A generative LLM can help bridge the validation gap in clustering by interpreting clusters in strong agreement with human reviewers.
  • The study challenges the conventional reliance on human coding as the sole standard for cluster validation and highlights the potential of LLMs for this task.