



Efficient n-gram analysis in R with cmscu.

David W Vinson1, Jason K Davis2, Suzanne S Sindi2

  • 1University of California, Merced, 5200 N. Lake Rd., Merced, CA, USA. dvinson@ucmerced.edu.

Behavior Research Methods | August 7, 2016
Summary
This summary is machine-generated.

A new R package, cmscu, enables efficient analysis of large text datasets. Higher information density in reviews correlates with better reader ratings, revealing insights into behavioral phenomena.

Keywords:
Big data, Information theory, Interdisciplinary collaboration, Sketch algorithms, n-grams


Area of Science:

  • Computational linguistics
  • Data science
  • Psycholinguistics

Background:

  • Analyzing large text corpora presents computational challenges.
  • Existing R libraries struggle with high-throughput processing of massive datasets.
  • Count-Min-Sketch with conservative updating offers a memory-efficient solution.
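
To illustrate the idea behind Count-Min-Sketch with conservative updating, here is a minimal Python sketch. It is not the cmscu implementation (which the summary says is written in C++ with Rcpp); the class name `CountMinSketchCU`, the `width` and `depth` parameters, and the choice of salted BLAKE2b hashes are all assumptions made for this example. The structure stores counts in a fixed-size table, so memory does not grow with the number of distinct n-grams, and a conservative update raises only the cells that are below the new estimate.

```python
import hashlib


class CountMinSketchCU:
    """Illustrative Count-Min Sketch with conservative updating (hypothetical names)."""

    def __init__(self, width=1024, depth=4):
        self.width = width    # number of counters per row
        self.depth = depth    # number of rows, one hash function each
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One salted hash per row; BLAKE2b is an arbitrary choice for this sketch.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def query(self, key):
        # The estimate is the smallest counter among the key's cells.
        return min(self.table[r][c] for r, c in self._cells(key))

    def add(self, key, count=1):
        cells = list(self._cells(key))
        # Conservative update: lift each cell only as far as the new estimate.
        target = min(self.table[r][c] for r, c in cells) + count
        for r, c in cells:
            if self.table[r][c] < target:
                self.table[r][c] = target


if __name__ == "__main__":
    sketch = CountMinSketchCU(width=64, depth=3)
    for token in "the cat sat on the mat the end".split():
        sketch.add(token)
    print(sketch.query("the"))  # never below the true count of 3
```

The estimate can overcount when keys collide in every row, but it never undercounts; conservative updating reduces the overcounting compared with incrementing every cell.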

Purpose of the Study:

  • Introduce the cmscu R package for efficient n-gram analysis.
  • Apply the package to investigate information density in online reviews.
  • Explore the relationship between information density and review ratings.
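
The summary does not define information density, so the following Python sketch rests on an assumption: that it is measured as the mean surprisal per word, i.e. the average of -log2 p(word) over a review. For brevity the example uses a unigram model with add-one smoothing rather than a smoothed n-gram model; the function names `train_unigram` and `information_density` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Return a function giving add-one-smoothed unigram probabilities."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen words

    def prob(word):
        return (counts[word] + 1) / (total + vocab)

    return prob


def information_density(tokens, prob):
    """Mean surprisal in bits per word (assumed definition, see lead-in)."""
    if not tokens:
        raise ValueError("cannot score an empty text")
    return sum(-math.log2(prob(t)) for t in tokens) / len(tokens)


if __name__ == "__main__":
    prob = train_unigram("a a a b".split())
    # Rarer words carry more bits, so the second text scores higher.
    print(information_density(["a", "a"], prob))
    print(information_density(["a", "b"], prob))
```

Under this assumed definition, a review built from less predictable words has a higher information density than one built from very common words.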

Main Methods:

  • Developed the cmscu R package using C++ and Rcpp for performance.
  • Implemented the modified Kneser-Ney n-gram smoothing algorithm.
  • Analyzed n-gram frequencies in a Yelp dataset of 2.2 million reviews.
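
As a rough illustration of Kneser-Ney smoothing, the Python sketch below implements the basic interpolated variant for bigrams with a single discount. This is a deliberate simplification: the modified algorithm named above uses separate discounts for n-grams seen once, twice, and three or more times, and the package applies it to counts held in a sketch rather than exact counters. The function name `kneser_ney_bigram` and the default `discount=0.75` are assumptions for this example.

```python
from collections import Counter, defaultdict


def kneser_ney_bigram(tokens, discount=0.75):
    """Return prob(word, context) under interpolated Kneser-Ney (single discount)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    if not bigrams:
        raise ValueError("need at least two tokens")

    context_total = Counter()     # how often each word occurs as a left context
    followers = defaultdict(set)  # distinct words seen after a context
    preceders = defaultdict(set)  # distinct words seen before a word
    for (context, word), count in bigrams.items():
        context_total[context] += count
        followers[context].add(word)
        preceders[word].add(context)
    n_types = len(bigrams)        # number of distinct bigram types

    def prob(word, context):
        # Continuation probability: in how many distinct contexts does word appear?
        p_cont = len(preceders.get(word, ())) / n_types
        total = context_total.get(context, 0)
        if total == 0:
            return p_cont  # unseen context: fall back to the continuation model
        discounted = max(bigrams.get((context, word), 0) - discount, 0.0) / total
        backoff_weight = discount * len(followers[context]) / total
        return discounted + backoff_weight * p_cont

    return prob


if __name__ == "__main__":
    prob = kneser_ney_bigram("a b a c a b".split())
    print(prob("b", "a"), prob("c", "a"), prob("a", "a"))
```

The mass removed from observed bigrams by the discount is redistributed through the continuation probability, so the probabilities over the observed vocabulary still sum to one for any seen context. Words never seen in training receive probability zero in this simplified version.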

Main Results:

  • The cmscu package handles large-scale text data beyond the capacity of standard libraries.
  • A positive correlation was found between review information density and reader ratings.
  • Demonstrated the utility of efficient tools for behavioral research in large datasets.

Conclusions:

  • The cmscu package provides a powerful and efficient tool for text analysis.
  • Information density is a significant factor influencing reader perception of reviews.
  • Efficient computational tools are crucial for advancing the study of human behavior in big data.