BATCH POLICY LEARNING IN AVERAGE REWARD MARKOV DECISION PROCESSES.

Peng Liao¹, Zhengling Qi², Runzhe Wan³

¹Harvard University.

Annals of Statistics | April 6, 2023

Summary

This summary is machine-generated.

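We developed a method for learning policies from batch data, built on a doubly robust estimator of the long-term average reward. The learned policy maximizes the estimated average reward within a parameterized policy class and comes with a finite-sample regret guarantee. The method is motivated by mobile health applications.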

Keywords:

Average Reward; Doubly Robust Estimator; Markov Decision Process; Policy Optimization

Area of Science:

Machine Learning
Reinforcement Learning
Mobile Health

Background:

Batch policy learning is crucial for optimizing sequential decision-making.
Infinite horizon Markov Decision Processes (MDPs) model long-term reward maximization.
Mobile health applications require efficient policy learning from observational data.

Purpose of the Study:

To develop a robust method for batch policy learning in infinite horizon MDPs.
To maximize the long-term average reward for mobile health interventions.
To provide theoretical guarantees on the performance of the learned policy.
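
For readers who want the objective in symbols, the following is a sketch of the standard average reward criterion and the associated regret. The notation ($\eta^{\pi}$, $\Pi$, $\hat{\pi}$, $R_t$) is assumed here for illustration and is not taken from this page; the limit is assumed to exist and to be independent of the starting state, as is usual under ergodicity-type conditions.

```latex
% Long-term average reward of a policy \pi (assumed notation)
\eta^{\pi} \;=\; \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}_{\pi}\!\left[ \sum_{t=1}^{T} R_t \right]

% Regret of a learned policy \hat{\pi} relative to the best policy in the class \Pi
\mathrm{Regret}(\hat{\pi}) \;=\; \sup_{\pi \in \Pi} \eta^{\pi} \;-\; \eta^{\hat{\pi}}
```

A finite-sample regret guarantee, in this reading, bounds the second quantity for a policy learned from a fixed batch of data.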

Main Methods:

Proposed a doubly robust estimator for average reward, achieving semiparametric efficiency.
Developed an optimization algorithm for computing optimal policies within a parameterized stochastic policy class.
Established a finite-sample regret guarantee to measure policy performance.
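
The doubly robust idea can be illustrated with a minimal tabular sketch. It uses a generic doubly robust form from the off-policy evaluation literature and may differ from the exact estimator in the paper. The toy MDP, the function names, and the nuisance inputs `q` (a relative value function) and `omega` (a ratio between the target policy's stationary distribution and the data distribution) are all assumptions made for illustration; their values were worked out by hand for this toy example.

```python
# Hypothetical 2-state, 2-action MDP: action 0 keeps the current state with
# probability 0.9, action 1 switches state with probability 0.9.
# Reward is 1 in state 1 and 0 in state 0.

def build_dataset():
    """Batch of (s, a, r, s_next) whose empirical frequencies match the model:
    each (s, a) pair appears 10 times, 9 of them with the likely next state."""
    data = []
    for s in (0, 1):
        for a in (0, 1):
            likely = s if a == 0 else 1 - s
            r = 1.0 if s == 1 else 0.0
            data += [(s, a, r, likely)] * 9 + [(s, a, r, 1 - likely)]
    return data


def dr_average_reward(transitions, pi, q, omega):
    """Weighted temporal-difference estimate of the average reward of pi.

    pi[s][a]    target policy probabilities
    q[s][a]     estimate of the relative value function (nuisance 1)
    omega[s][a] estimate of the density ratio (nuisance 2)
    """
    num, den = 0.0, 0.0
    for s, a, r, s_next in transitions:
        # Value of the next state under the target policy.
        v_next = sum(pi[s_next][b] * q[s_next][b] for b in range(len(pi[s_next])))
        num += omega[s][a] * (r + v_next - q[s][a])
        den += omega[s][a]
    return num / den


# Target policy: switch in state 0, stay in state 1 (true average reward 0.9).
PI = [[0.0, 1.0], [1.0, 0.0]]

# Correct nuisances for this toy problem (computed by hand).
Q_TRUE = [[-0.8, 0.0], [1.0, 0.2]]
OMEGA_TRUE = [[0.0, 0.4], [3.6, 0.0]]

# Deliberately wrong nuisances.
Q_WRONG = [[5.0, -2.0], [0.5, 3.0]]
OMEGA_ONE = [[1.0, 1.0], [1.0, 1.0]]
```

On this dataset the estimate recovers the true average reward when either nuisance is correct, for example `dr_average_reward(build_dataset(), PI, Q_WRONG, OMEGA_TRUE)` or `dr_average_reward(build_dataset(), PI, Q_TRUE, OMEGA_ONE)`, and is biased when both are wrong. That is the sense of "doubly robust" in this sketch.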

Main Results:

The doubly robust estimator achieves semiparametric efficiency.
The optimization algorithm computes the optimal policy within the parameterized stochastic policy class.
Finite-sample regret bounds were established for the learned policy.
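
To make "optimal policy within a parameterized stochastic policy class" and "regret" concrete, here is a small sketch under strong simplifying assumptions. The toy MDP, the one-parameter policy class, and the grid search are hypothetical stand-ins; the average reward is computed exactly from a known model rather than estimated from batch data, and the grid search replaces whatever optimization algorithm the paper uses.

```python
import math

# Hypothetical 2-state, 2-action MDP: action 0 keeps the current state with
# probability 0.9, action 1 switches state with probability 0.9.
P = [
    [[0.9, 0.1], [0.1, 0.9]],  # from state 0: action 0, action 1
    [[0.1, 0.9], [0.9, 0.1]],  # from state 1: action 0, action 1
]
R = [[0.0, 0.0], [1.0, 1.0]]   # reward 1 in state 1, 0 in state 0


def policy(theta):
    """One-parameter stochastic policy; pi[s][a] is the probability of a in s.
    With probability sigmoid(theta) it switches in state 0 and stays in state 1."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return [[1.0 - p, p], [p, 1.0 - p]]


def average_reward(pi, iters=1000):
    """Exact long-run average reward of pi via the stationary state distribution."""
    n = len(P)
    # State-to-state transition matrix induced by pi.
    T = [[sum(pi[s][a] * P[s][a][s2] for a in range(2)) for s2 in range(n)]
         for s in range(n)]
    d = [1.0 / n] * n
    for _ in range(iters):  # power iteration toward the stationary distribution
        d = [sum(d[s] * T[s][s2] for s in range(n)) for s2 in range(n)]
    return sum(d[s] * pi[s][a] * R[s][a] for s in range(n) for a in range(2))


def best_in_class(thetas):
    """Parameter in the candidate set with the highest average reward."""
    return max(thetas, key=lambda th: average_reward(policy(th)))


def regret(theta_hat, thetas):
    """Gap between the best average reward in the class and that of theta_hat."""
    best = average_reward(policy(best_in_class(thetas)))
    return best - average_reward(policy(theta_hat))


GRID = [x / 2.0 for x in range(-6, 7)]  # theta from -3.0 to 3.0 in steps of 0.5
```

In this toy problem the average reward increases with `theta`, so `best_in_class(GRID)` picks the largest grid value and `regret(theta_hat, GRID)` shrinks as `theta_hat` approaches it. In the batch setting the exact `average_reward` would be replaced by an estimate from data, which is where the estimator's accuracy feeds into the regret.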

Conclusions:

The proposed doubly robust method is effective for batch policy learning in infinite horizon, average reward MDPs.
The approach is suited to mobile health applications, where the goal is to maximize the long-term average reward.
The theoretical guarantees and simulations support the method's practical utility.