Efficient n-gram analysis in R with cmscu.

David W Vinson¹, Jason K Davis², Suzanne S Sindi²

¹University of California, Merced, 5200 N. Lake Rd., Merced, CA, USA. dvinson@ucmerced.edu.

Behavior Research Methods

August 7, 2016

Summary

This summary is machine-generated.

A new R package, cmscu, enables efficient n-gram analysis of large text datasets. Applied to a Yelp dataset of 2.2 million reviews, it shows that higher information density in a review correlates with better reader ratings.

Keywords:

Big data; Information theory; Interdisciplinary collaboration; Sketch algorithms; n-grams

Area of Science:

Computational linguistics
Data science
Psycholinguistics

Background:

Analyzing large text corpora presents computational challenges.
Existing R libraries struggle with high-throughput processing of massive datasets.
Count-Min-Sketch with conservative updating offers a memory-efficient solution.
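
To make the memory argument concrete, here is a minimal Python sketch of a Count-Min Sketch with conservative updating. It is an illustration only, not the cmscu implementation or its API: the class name `CountMinSketchCU`, the `width` and `depth` parameters, and the choice of salted BLAKE2b hashes are all assumptions made for this example. The table has a fixed size regardless of how many distinct n-grams are seen, and conservative updating raises only the cells that sit below the new estimate, which reduces overcounting from hash collisions.

```python
import hashlib


class CountMinSketchCU:
    """Count-Min Sketch with conservative updating (illustrative only)."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        # Fixed-size table: memory does not grow with the number of n-grams.
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One cell per row, chosen by a hash salted with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(2, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def query(self, item):
        # The minimum over rows is the least-contaminated estimate.
        return min(self.table[r][c] for r, c in self._cells(item))

    def add(self, item, count=1):
        cells = list(self._cells(item))
        target = min(self.table[r][c] for r, c in cells) + count
        # Conservative update: only raise cells that are below the new estimate.
        for r, c in cells:
            if self.table[r][c] < target:
                self.table[r][c] = target


sketch = CountMinSketchCU()
tokens = "the cat sat on the mat and the cat slept".split()
for first, second in zip(tokens, tokens[1:]):
    sketch.add(first + " " + second)

print(sketch.query("the cat"))  # at least 2; the sketch never undercounts
```

Estimates can exceed the true count when n-grams collide, but they never fall below it, so a wider or deeper table trades memory for accuracy.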

Purpose of the Study:

Introduce the cmscu R package for efficient n-gram analysis.
Apply the package to investigate information density in online reviews.
Explore the relationship between information density and review ratings.

Main Methods:

Developed the cmscu R package using C++ and Rcpp for performance.
Implemented the modified Kneser-Ney n-gram smoothing algorithm.
Analyzed n-gram frequencies from a Yelp dataset of 2.2 million reviews.
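
The smoothing step can be illustrated with a small Python sketch. Note the simplification: this is plain interpolated Kneser-Ney for bigrams with a single fixed discount, not the modified variant (which uses count-dependent discounts) that the package implements, and it uses exact counts rather than a sketch. The function name `train_kneser_ney_bigram` and the discount value 0.75 are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def train_kneser_ney_bigram(tokens, discount=0.75):
    """Return prob(word, prev) under interpolated Kneser-Ney (single discount)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter()      # how often each word occurs as a context
    followers = defaultdict(set)    # distinct words seen after each context
    preceders = defaultdict(set)    # distinct contexts seen before each word
    for (prev, word), count in bigrams.items():
        context_totals[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    distinct_bigrams = len(bigrams)

    def prob(word, prev):
        # Continuation probability: in how many distinct contexts does word appear?
        continuation = len(preceders.get(word, ())) / distinct_bigrams
        total = context_totals[prev]
        if total == 0:
            return continuation  # unseen context: fall back entirely
        discounted = max(bigrams[(prev, word)] - discount, 0.0) / total
        backoff_weight = discount * len(followers.get(prev, ())) / total
        return discounted + backoff_weight * continuation

    return prob


tokens = "the cat sat on the mat and the cat slept".split()
prob = train_kneser_ney_bigram(tokens)
print(prob("cat", "the"), prob("mat", "the"))
```

The discount removes a little probability mass from every observed bigram and redistributes it according to how many different contexts each word follows, so the probabilities for a given context still sum to one.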

Main Results:

The cmscu package handles large-scale text data beyond the capacity of standard libraries.
A positive correlation was found between review information density and reader ratings.
Demonstrated the utility of efficient tools for behavioral research in large datasets.
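
As a rough illustration of what an information-density score can look like, the Python sketch below computes the average per-word surprisal of a review in bits. This is an assumption about the measure, since the summary does not define it, and the add-one smoothed unigram model is a simple stand-in rather than the n-gram model used in the study. The function name `mean_surprisal` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def mean_surprisal(review_tokens, corpus_counts):
    """Average information per word, in bits, under an add-one unigram model."""
    if not review_tokens:
        raise ValueError("review must contain at least one token")
    total = sum(corpus_counts.values())
    vocab_size = len(corpus_counts) + 1  # one extra slot for unseen words
    bits = [
        -math.log2((corpus_counts.get(tok, 0) + 1) / (total + vocab_size))
        for tok in review_tokens
    ]
    return sum(bits) / len(bits)


# Toy corpus, invented for this example only.
corpus = Counter("the food was good the service was slow the food was great".split())
print(mean_surprisal(["the", "was"], corpus))     # frequent words: lower score
print(mean_surprisal(["slow", "great"], corpus))  # rare words: higher score
```

Reviews built from rarer words score higher, and a per-review score of this kind is what would then be compared against reader ratings.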

Conclusions:

The cmscu package provides a powerful and efficient tool for text analysis.
Information density is associated with how readers rate reviews.
Efficient computational tools are crucial for advancing the study of human behavior in big data.