



Batch Policy Learning in Average Reward Markov Decision Processes

Peng Liao¹, Zhengling Qi², Runzhe Wan³

  • ¹Harvard University.

Annals of Statistics | April 6, 2023
PubMed
Summary
This summary is machine-generated.

We developed a method for learning policies from batch data in infinite horizon Markov Decision Processes, built on a doubly robust estimator of the long-term average reward. The learned policy comes with a finite-sample regret guarantee, and the approach is motivated by mobile health applications.

Keywords:
Average Reward, Doubly Robust Estimator, Markov Decision Process, Policy Optimization


Area of Science:

  • Machine Learning
  • Reinforcement Learning
  • Mobile Health

Background:

  • Batch policy learning optimizes sequential decision-making from previously collected data, without further interaction with the environment.
  • Infinite horizon Markov Decision Processes (MDPs) model long-term reward maximization.
  • Mobile health applications require efficient policy learning from observational data.
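
To make the long-term average reward concrete, here is a minimal Python sketch for a small tabular MDP with known dynamics. It is an illustration only: the function names, the two-state MDP, and all numbers are assumptions made for this example and do not come from the paper. The policy induces a Markov chain over states, and the average reward is the expected per-step reward under that chain's stationary distribution (assuming the chain is ergodic).

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic row-stochastic matrix P."""
    n = P.shape[0]
    # Solve d @ P = d together with sum(d) = 1 as one least-squares system.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def average_reward(P, R, pi):
    """Long-run average reward of a stochastic policy.

    P:  transitions, shape (S, A, S), P[s, a, t] = Pr(next state t | s, a)
    R:  expected rewards, shape (S, A)
    pi: policy, shape (S, A), pi[s, a] = Pr(a | s)
    """
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state chain under pi
    r_pi = (pi * R).sum(axis=1)            # expected reward per state under pi
    d = stationary_distribution(P_pi)
    return float(d @ r_pi)

# Hypothetical two-state, two-action MDP (numbers are made up).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
uniform = np.full((2, 2), 0.5)  # picks each action with probability 1/2
eta_uniform = average_reward(P, R, uniform)
```

In batch policy learning the dynamics `P` and rewards `R` are not known, so this quantity has to be estimated from logged transitions instead of computed directly.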

Purpose of the Study:

  • To develop a robust method for batch policy learning in infinite horizon MDPs.
  • To maximize the long-term average reward for mobile health interventions.
  • To provide theoretical guarantees on the performance of the learned policy.

Main Methods:

  • Proposed a doubly robust estimator for average reward, achieving semiparametric efficiency.
  • Developed an optimization algorithm for computing optimal policies within a parameterized stochastic policy class.
  • Established a finite-sample regret guarantee to measure policy performance.
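
This page does not give the estimator's formula, so the following Python sketch uses a generic doubly robust form for the average reward as an assumption, not the authors' exact construction. It combines two nuisance estimates: density-ratio weights `omega` (stationary distribution under the target policy relative to the data distribution) and a relative value function `Q`. The estimate stays correct if either one is right. All names and the toy data are hypothetical.

```python
import numpy as np

def dr_average_reward(rewards, omega, q_sa, v_next):
    """Doubly robust style estimate of a target policy's average reward.

    rewards: observed rewards R_i, shape (n,)
    omega:   density-ratio weights for (S_i, A_i), shape (n,)
    q_sa:    relative value estimates Q(S_i, A_i), shape (n,)
    v_next:  sum_a pi(a | S'_i) * Q(S'_i, a) under the target policy, shape (n,)
    """
    rewards, omega, q_sa, v_next = map(np.asarray, (rewards, omega, q_sa, v_next))
    # Solve sum_i omega_i * (R_i + v_next_i - q_sa_i - eta) = 0 for eta.
    return float(np.sum(omega * (rewards + v_next - q_sa)) / np.sum(omega))

# Toy deterministic cycle with one action: state 0 -> 1 (reward 1),
# state 1 -> 0 (reward 3). The true average reward is 2.
# The batch over-samples state 0: three transitions from 0, one from 1.
rewards = np.array([1.0, 1.0, 1.0, 3.0])

# Case 1: correct relative values (V(0) = 0, V(1) = 1), uninformative weights.
eta_q_right = dr_average_reward(
    rewards,
    omega=np.ones(4),
    q_sa=np.array([0.0, 0.0, 0.0, 1.0]),
    v_next=np.array([1.0, 1.0, 1.0, 0.0]),
)

# Case 2: correct weights (target 0.5/0.5 over data 0.75/0.25), Q set to zero.
eta_w_right = dr_average_reward(
    rewards,
    omega=np.array([2 / 3, 2 / 3, 2 / 3, 2.0]),
    q_sa=np.zeros(4),
    v_next=np.zeros(4),
)

naive_mean = float(rewards.mean())  # 1.5, biased by the over-sampling
```

Both cases recover the true value of 2, while the unweighted sample mean does not. In practice `omega` and `Q` would themselves be estimated from the batch data.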

Main Results:

  • The doubly robust estimator of the average reward achieves semiparametric efficiency.
  • The optimization algorithm computes the optimal policy within the parameterized stochastic policy class.
  • A finite-sample regret guarantee was established for the learned policy.
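
For reference, regret in this setting is usually defined as the gap in average reward between the best policy in the class and the learned policy. The notation below is an assumption for illustration (the page itself defines no symbols): $\Pi$ is the parameterized policy class, $\hat{\pi}$ the learned policy, and $R_t$ the reward at step $t$.

```latex
\eta(\pi) = \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}_{\pi}\!\left[ \sum_{t=1}^{T} R_t \right],
\qquad
\operatorname{Regret}(\hat{\pi}) = \sup_{\pi \in \Pi} \eta(\pi) - \eta(\hat{\pi}).
```

A finite-sample regret guarantee bounds this gap in terms of the amount of batch data, rather than only in the limit of infinite data.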

Conclusions:

  • The proposed doubly robust method is effective for batch policy learning in MDPs.
  • The approach is suitable for mobile health applications, maximizing long-term rewards.
  • The theoretical guarantees and simulations support the method's practical utility.