Distributed K-Means algorithm based on a Spark optimization sample.

Yongan Feng¹, Jiapeng Zou¹, Wanjun Liu¹

¹Liaoning Technical University, Huludao, China.

December 23, 2024

Summary

This summary is machine-generated.

We developed SOSK-Means, an optimized K-Means algorithm for big data. It significantly boosts computational speed and accuracy for large-scale clustering tasks.

Area of Science:

Data Science
Machine Learning
Big Data Analytics

Background:

The classical K-Means algorithm is unstable and performs poorly on massive datasets.
Efficient clustering of large-scale data is crucial for many data mining applications.
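
For reference, the classical K-Means (Lloyd's) iteration that the Background refers to can be sketched as follows. This is a generic textbook version in plain Python, not code from the study; the function names and the `initial_centers` parameter are illustrative assumptions. Because the result depends on which starting centers are drawn, different seeds can yield different clusterings, which is one common source of the instability mentioned above.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]

def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centers.append(old)  # an empty cluster keeps its old center
            continue
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centers

def kmeans(points, k, seed=0, max_iter=100, initial_centers=None):
    """Classical K-Means; starting centers are random unless supplied."""
    if initial_centers is None:
        centers = random.Random(seed).sample(points, k)
    else:
        centers = list(initial_centers)
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
    return centers, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2, initial_centers=[(0, 0), (10, 10)])
```

Every iteration scans all points against all centers, so the cost per pass grows with the product of dataset size and k; this is the part that becomes expensive on massive datasets.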

Purpose of the Study:

To introduce SOSK-Means, an enhanced K-Means algorithm optimized for Spark to address the limitations of classical K-Means on massive datasets.
To improve the computational speed and accuracy of K-Means clustering for large-scale data.

Main Methods:

Implemented a weighted jump-bank approach for efficient random sampling and pre-clustering, which improves the selection of initial centers.
Used a weighted max-min distance that accounts for data weight and variance in the distance calculation.
Employed a novel distance comparison method and a Directed Acyclic Graph (DAG) to optimize computation and distributed processing on Spark.
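
The general idea of choosing initial centers from a sample with a max-min distance rule can be sketched as below. This is a simplified stand-in, not the SOSK-Means procedure: the summary does not specify the weighted jump-bank sampling or the weight and variance terms, so the sketch assumes uniform random sampling and the plain, unweighted max-min (farthest-first) rule. The function name `max_min_centers` and its parameters are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def max_min_centers(points, k, sample_size, seed=0):
    """Pick k initial centers from a random sample by the max-min rule.

    Assumption: uniform sampling and unweighted distances stand in for the
    weighted variants described in the summary.
    """
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    if k > len(sample):
        raise ValueError("k must not exceed the sample size")

    # Start from an arbitrary sampled point.
    centers = [sample[0]]
    # min_d[i] is the distance from sample[i] to its nearest chosen center.
    min_d = [squared_distance(p, centers[0]) for p in sample]

    while len(centers) < k:
        # The next center is the sampled point farthest from all chosen centers.
        idx = max(range(len(sample)), key=lambda i: min_d[i])
        centers.append(sample[idx])
        min_d = [
            min(d, squared_distance(p, sample[idx]))
            for d, p in zip(min_d, sample)
        ]
    return centers

points = [
    (0, 0), (1, 0), (0, 1),
    (100, 0), (101, 0), (100, 1),
    (0, 100), (1, 100), (0, 101),
]
initial = max_min_centers(points, k=3, sample_size=len(points))
```

Working on a sample keeps the seeding cost independent of the full dataset size, and spreading the centers apart reduces the chance that two of them start in the same cluster. The returned centers could then be passed as starting points to an ordinary K-Means iteration.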

Main Results:

SOSK-Means demonstrates significant improvements in computational speed compared to classical K-Means.
The algorithm maintains high computational accuracy, effectively handling massive datasets.
Enhanced initial center selection and distance calculation contribute to improved clustering performance.

Conclusions:

SOSK-Means offers a robust and efficient solution for large-scale data clustering using Spark optimization.
The proposed modifications effectively address the instability and performance bottlenecks of traditional K-Means.
This optimized algorithm is well-suited for big data analytics requiring fast and accurate clustering.