Arabic punctuation dataset.

Sane Yagi¹, Ashraf Elnagar², Esra Yaghi³

¹Department of Foreign Languages, University of Sharjah, the United Arab Emirates.

February 13, 2024

Summary

This summary is machine-generated.

Arabic punctuation inconsistency hinders NLP. The Arabic Punctuation Dataset (APD) offers annotated Modern Standard Arabic texts to train models for sentence boundary identification and punctuation prediction, improving Arabic NLP tasks.

Keywords:

Automatic punctuation; Punctuation corpus; Sentence boundary identification; Theme-rheme; Topic and comment

Area of Science:

Computational Linguistics
Natural Language Processing

Background:

Punctuation in written Arabic is applied inconsistently, which creates challenges for Natural Language Processing (NLP) applications.
Developing robust NLP tools for Arabic requires addressing this punctuation variability.

Purpose of the Study:

To introduce the Arabic Punctuation Dataset (APD), a novel resource for improving Arabic NLP.
To facilitate machine learning model training for sentence boundary identification and punctuation prediction in Modern Standard Arabic.
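
As a rough illustration of the two tasks named above, punctuated text can be converted into word-level training pairs in which each word is labelled with the mark that follows it. The sketch below is an assumption for illustration only: the mark inventory, the "O" (no punctuation) label, and the function names `to_training_pairs` and `sentence_boundaries` are hypothetical and do not reflect the actual APD annotation format.

```python
# Minimal sketch, NOT the APD format: the label set and helper names are
# assumptions chosen for illustration.
PUNCT = {"،", "؛", "؟", ".", "!", ":"}   # Arabic comma, semicolon, question mark, etc.
SENTENCE_FINAL = {".", "؟", "!"}          # marks treated here as sentence boundaries


def to_training_pairs(text):
    """Return (word, label) pairs; label is the mark after the word, or "O"."""
    # Pad every mark with spaces so it becomes a separate token.
    for mark in PUNCT:
        text = text.replace(mark, f" {mark} ")
    pairs = []
    for token in text.split():
        if token in PUNCT:
            # Attach the mark to the preceding word if it has no label yet;
            # leading or repeated marks are dropped in this sketch.
            if pairs and pairs[-1][1] == "O":
                pairs[-1] = (pairs[-1][0], token)
        else:
            pairs.append((token, "O"))
    return pairs


def sentence_boundaries(pairs):
    """Indices of words followed by a sentence-final mark."""
    return [i for i, (_, label) in enumerate(pairs) if label in SENTENCE_FINAL]


# "The boy went to school, then came back. Did you see him?"
example = "ذهب الولد إلى المدرسة، ثم عاد. هل رأيته؟"
pairs = to_training_pairs(example)
boundaries = sentence_boundaries(pairs)
```

In this framing, punctuation prediction is a sequence-labelling problem over the unpunctuated words, and sentence boundary identification is the special case that only asks whether a label is sentence-final.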

Main Methods:

The Arabic Punctuation Dataset (APD) was created using the "theme-rheme completion" principle, linking grammar to punctuation.
APD comprises 312 million words across 12 million sentences, including manually annotated book chapters (ABC), parallel translations (CBT), and scrambled sentences (SSAC-UNPC).
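
For a sense of scale, the two totals above imply an average sentence length. This is a back-of-the-envelope figure that assumes the reported totals are rounded and pools all three components together, so it should not be read as a statistic reported for the dataset.

```latex
\bar{n} \approx \frac{312 \times 10^{6}\ \text{words}}{12 \times 10^{6}\ \text{sentences}} = 26\ \text{words per sentence}
```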

Main Results:

APD provides a large-scale, annotated corpus for training NLP models specific to Arabic punctuation.
The dataset's diverse components cater to various NLP tasks, from basic boundary identification to complex punctuation restoration.

Conclusions:

The Arabic Punctuation Dataset (APD) is a foundational resource for advancing Arabic NLP.
APD's grammar-based approach improves the clarity of machine-generated text, benefiting applications such as machine translation and speech recognition.