Reinforcement learning and human behavior

https://doi.org/10.1016/j.conb.2013.12.004

Highlights

  • Standard RL explains some aspects of operant learning and its underlying neural activity.

  • Nevertheless, some operant learning behaviors seem inconsistent with standard RL.

  • Inferring a world model is an important part of state-based learning.

  • Direct parametric policy learning bypasses the need to learn a model of the world in terms of the relevant state-action pairs.

The dominant computational approach to modeling operant learning and its underlying neural activity is model-free reinforcement learning (RL). However, there is accumulating behavioral and neural evidence that human (and animal) operant learning is far more multifaceted. Theoretical advances in RL, such as hierarchical and model-based RL, extend the explanatory power of RL to account for some of these findings. Nevertheless, some other aspects of human behavior remain inexplicable even in the simplest tasks. Here we review developments and remaining challenges in relating RL models to human operant learning. In particular, we emphasize that learning a model of the world is an essential step before or in parallel to learning the policy in RL, and we discuss alternative models that directly learn a policy without an explicit world model in terms of state-action pairs.


Model-free RL

The computational problem in many operant learning tasks can be formulated in a framework known as Markov Decision Processes (MDPs) [1]. In MDPs, the world can be in one of several states, which determine the consequences of the agent's actions with respect to future rewards and world states. A policy defines the agent's behavior in a given situation. In MDPs, a policy is a mapping from the states of the environment to the actions to be taken in those states [1]. Finding the optimal policy is the computational goal of the agent; model-free RL algorithms, such as Q-learning, approach it by learning the values of state-action pairs directly from experienced rewards, without an explicit model of the environment.
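To make the model-free setting concrete, here is a minimal sketch (ours, not the article's) of tabular Q-learning on a toy two-state MDP; the states, actions, rewards, and learning parameters are illustrative assumptions.

```python
# Minimal sketch (not from the article): tabular Q-learning on a toy MDP.
# States, actions, rewards, and parameters below are illustrative assumptions.
import random

states = ["hungry", "sated"]
actions = ["press_lever", "rest"]

def step(state, action):
    """Toy transition/reward function standing in for the environment."""
    if state == "hungry" and action == "press_lever":
        return "sated", 1.0       # lever press yields food
    if state == "sated" and action == "rest":
        return "hungry", 0.0      # resting lets hunger return
    return state, 0.0

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in states for a in actions}

state = "hungry"
for _ in range(5000):
    # epsilon-greedy policy: a mapping from states to actions
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    next_state, reward = step(state, action)
    # model-free update: the value changes only for the executed state-action pair
    td_error = reward + gamma * max(Q[(next_state, a)] for a in actions) - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    state = next_state

print({k: round(v, 2) for k, v in Q.items()})
```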

Model-based RL

When training is intense, task-independent reward devaluation, for example through satiety, has little immediate effect on behavior. This habitual learning is consistent with model-free RL because, in this framework, the value of an action is updated only when it is executed. By contrast, when training is moderate, the response to reward devaluation is immediate and substantial [21]. This and other behaviors (e.g., planning) are consistent with an alternative RL approach, known as model-based RL, in which action values are computed at decision time from a learned model of the environment's transitions and rewards.
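The devaluation logic can be illustrated with a small sketch (our illustration under simplifying assumptions, not the cited experiments): cached model-free values do not change until the devalued action is executed again, whereas values recomputed from a world model reflect the devaluation immediately.

```python
# Minimal sketch (assumption, not the article's model): contrast between
# cached model-free values and model-based values recomputed from a world
# model after reward devaluation (e.g., satiety). One state, two actions;
# the numbers are illustrative.

# Learned world model: expected reward of each action.
reward_model = {"lever": 1.0, "chain": 0.5}

# Cached model-free values acquired during training.
q_cached = dict(reward_model)

def model_based_values(model):
    # Values are recomputed from the model at decision time.
    return dict(model)

# Outcome devaluation: the food earned by the lever is no longer rewarding.
reward_model["lever"] = 0.0

print("model-free (unchanged until the lever is pressed again):", q_cached)
print("model-based (immediately reflects devaluation):", model_based_values(reward_model))
```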

The curse of dimensionality and the blessing of hierarchical RL

There are theoretical reasons why the RL models described above cannot fully account for operant learning in natural environments. First, the computational problem of finding the values is bedeviled by the 'curse of dimensionality': the number of states grows exponentially with the number of variables that define a state [1]. Second, when the state of the world is only partially known (i.e., the environment is a partially observable MDP, POMDP), applying model-free algorithms such as Q-learning may fail to converge or may converge to suboptimal policies.
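A few lines of arithmetic make the curse of dimensionality concrete (the variable counts and the number of actions below are arbitrary assumptions):

```python
# Minimal sketch of the 'curse of dimensionality' (illustrative numbers):
# with n binary variables defining a state, a tabular learner must represent
# 2**n states, so the table size explodes with n.
n_actions = 4
for n_variables in (5, 10, 20, 40):
    n_states = 2 ** n_variables
    print(f"{n_variables:2d} binary variables -> {n_states:>16,} states "
          f"({n_states * n_actions:,} state-action values)")
```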

Challenges in relating human behavior to RL algorithms

Despite the many successes of the different RL algorithms in explaining some of the observed human operant learning behaviors, others are still difficult to account for. For example, humans tend to alternate rather than repeat an action after receiving a positively surprising payoff. This behavior is observed both in simple repeated two-alternative forced-choice tasks with probabilistic rewards (also known as the 2-armed bandit task, Figure 1a) and in the stock market [28]. Moreover, a recent
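As a concrete (and purely illustrative) rendering of that setting, the sketch below simulates a two-armed bandit with hypothetical reward probabilities and a simple "alternate after a positive surprise" rule of the kind described above; it is not a model fitted to the cited data.

```python
# Minimal sketch (illustrative, not the cited experiments): a repeated
# two-alternative forced-choice task with probabilistic rewards (a 2-armed
# bandit), the setting in which humans are reported to alternate after a
# positively surprising payoff.
import random

p_reward = {"left": 0.7, "right": 0.3}    # hypothetical reward probabilities

def trial(choice):
    return 1.0 if random.random() < p_reward[choice] else 0.0

# A simple "alternate after surprise" heuristic, for contrast with
# reward-maximizing value learning.
choice, expected = "left", 0.5
for t in range(10):
    payoff = trial(choice)
    surprise = payoff - expected
    expected += 0.2 * surprise             # running payoff estimate
    print(t, choice, payoff)
    if surprise > 0:                       # positively surprising payoff
        choice = "right" if choice == "left" else "left"
```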

Heterogeneity in world model

The lack of uniformity in behavior even in simple tasks could be due to heterogeneity in the prior expectations of the participants. From the experimentalist's point of view, the two-armed bandit task, for example, is simple: the world is characterized by a single state and two actions (Figure 1a). However, from the participant's point of view there is, theoretically, an infinite repertoire of possible world models characterized by different sets of states and actions. This could be true
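To illustrate the point (a toy assumption of ours, not taken from the article), here are two of the many world models a participant could entertain for the same bandit: the experimenter's single-state description, and a richer model whose states are defined by the previous outcome. The richer model multiplies the number of state-action values that must be learned and can support different behavior.

```python
# Minimal sketch (hypothetical models for illustration): two candidate world
# models for the same two-armed bandit. The experimenter's model has a single
# state; an alternative model conditions on the previous outcome.
experimenter_model = {
    "states": ["s0"],
    "actions": ["left", "right"],
}
participant_model = {
    "states": ["after_win", "after_loss"],   # hypothetical richer state set
    "actions": ["left", "right"],
}
for name, model in [("experimenter", experimenter_model),
                    ("participant", participant_model)]:
    n = len(model["states"]) * len(model["actions"])
    print(f"{name}: {n} state-action values to learn")
```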

Learning the world model

Models of operant learning often take as given that the learner has already recognized the available sets of states and actions (Figure 2a). Hence, when attempting to account for human behavior, they fail to consider the necessary preliminary step of identifying them (correctly or incorrectly). In machine learning, classification is often preceded by unsupervised dimension reduction for feature extraction [38, 39]. Similarly, it has been suggested that operant learning is a two-step process, in which a representation of the relevant states and actions is first extracted and the policy is then learned over that representation.
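A minimal sketch of such a two-step scheme, under assumptions of our own (synthetic 50-dimensional observations with one relevant latent direction; PCA via SVD as the unsupervised step): the reduced feature is discretized into the states on which a value-based learner would subsequently operate.

```python
# Minimal sketch (our illustration, not the article's algorithm): a two-step
# scheme in which high-dimensional observations are first reduced without
# supervision (projection onto the top principal component) and the reduced
# representation is then discretized into the states used for learning.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical observations: 200 samples of a 50-dimensional sensory vector
# whose relevant structure lies along a single latent direction.
latent = rng.choice([0.0, 1.0], size=200)
obs = np.outer(latent, rng.normal(size=50)) + 0.1 * rng.normal(size=(200, 50))

# Step 1: unsupervised dimension reduction (PCA via SVD).
obs_centered = obs - obs.mean(axis=0)
_, _, vt = np.linalg.svd(obs_centered, full_matrices=False)
projection = obs_centered @ vt[0]

# Step 2: define discrete states from the reduced feature; a value-based
# learner would then operate on these states rather than on raw observations.
states = (projection > projection.mean()).astype(int)
print("recovered", len(set(states)), "states from", obs.shape[1], "dimensional input")
```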

Learning without states

Operant learning can also be accomplished without an explicit representation of states and actions, by directly tuning a parametric policy (Figure 2d). A plausible implementation of such direct policy learning is through stochastic policy-gradient methods [42, 43, 44]. The idea behind these methods is that the gradient of the average reward (with respect to the policy parameters) can be estimated on-line by perturbing a neural network model with noise and considering the effect of these perturbations on the obtained reward.
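The following sketch (an illustration under simplifying assumptions, not the article's implementation) applies this idea to a single-parameter logistic policy on a hypothetical two-armed bandit: the parameter is perturbed with Gaussian noise and updated in proportion to the correlation between the perturbation and the reward obtained, relative to a running baseline.

```python
# Minimal sketch (illustrative assumptions): estimating the gradient of the
# average reward by perturbing the parameter of a stochastic policy with noise
# and correlating the perturbation with the obtained reward (a policy-gradient
# idea akin to weight/node perturbation).
import math
import random

p_reward = {"left": 0.7, "right": 0.3}     # hypothetical two-armed bandit

def policy_prob_left(theta):
    return 1.0 / (1.0 + math.exp(-theta))  # logistic parametric policy

def run_trial(theta):
    choice = "left" if random.random() < policy_prob_left(theta) else "right"
    return 1.0 if random.random() < p_reward[choice] else 0.0

theta, sigma, eta, baseline = 0.0, 0.5, 0.2, 0.5
for _ in range(2000):
    noise = random.gauss(0.0, sigma)        # perturb the policy parameter
    reward = run_trial(theta + noise)
    # gradient estimate: perturbation times reward deviation from baseline
    theta += eta * noise * (reward - baseline)
    baseline += 0.01 * (reward - baseline)  # running reward baseline

print("P(choose left) ->", round(policy_prob_left(theta), 2))
```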

Concluding remarks

RL is the dominant theoretical framework for operant learning in humans and animals. RL models have been partially successful in the quantitative modeling of learning behavior and have provided important insights into the putative role of different brain structures in operant learning. Yet substantial theoretical as well as experimental challenges remain, indicating that these models may be substantially oversimplified. In particular, how state-space representations are learned in operant learning remains an open question.

References and recommended reading

Papers of particular interest, published within the period of review, have been highlighted as:

  • of special interest

  •• of outstanding interest

Acknowledgements

We would like to thank Ido Erev for many fruitful discussions and David Hansel, Gianluigi Mongillo, Tal Neiman and Ran Darshan for carefully reading the manuscript.

This work was supported by the Israel Science Foundation (Grant No. 868/08), by a grant from the Ministry of Science and Technology, Israel, the Ministry of Foreign and European Affairs and the Ministry of Higher Education and Research, France, and by the Gatsby Charitable Foundation.

References (53)

  • J. O’Doherty et al.

    Dissociable roles of ventral and dorsal striatum in instrumental conditioning

    Science

    (2004)
  • S.M. Nicola

    The nucleus accumbens as part of a basal ganglia action selection circuit

    Psychopharmacology (Berl)

    (2007)
  • J. Denrell

    Adaptive learning and risk taking

    Psychol Rev

    (2007)
  • R. Hertwig et al.

    Decisions from experience and the effect of rare events in risky choice

    Psychol Sci

    (2004)
  • E. Yechiam et al.

    Using cognitive models to map relations between neuropsychological disorders and human decision-making deficits

    Psychol Sci

    (2005)
  • T.V. Maia et al.

    From reinforcement learning models to psychiatric and neurological disorders

    Nat Neurosci

    (2011)
  • Q.J.M. Huys et al.

    Bonsai trees in your head: how the pavlovian system sculpts goal-directed choices by pruning decision trees

    PLoS Comput Biol

    (2012)
  • S. Lammel et al.

    Reward and aversion in a heterogeneous midbrain dopamine system

    Neuropharmacology

    (2013)
  • M.D. Iordanova

    Dopamine transmission in the amygdala modulates surprise in an aversive blocking paradigm

    Behav Neurosci

    (2010)
  • M. Joshua et al.

    Midbrain dopaminergic neurons and striatal cholinergic interneurons encode the difference between reward and aversive events at different epochs of probabilistic classical conditioning trials

    J Neurosci

    (2008)
  • C.D. Fiorillo et al.

    Discrete coding of reward probability and uncertainty by dopamine neurons

    Science

    (2003)
  • B. Seymour et al.

    Serotonin selectively modulates reward value in human decision-making

    J Neurosci

    (2012)
  • E. Tricomi et al.

    A specific role for posterior dorsolateral striatum in human habit learning

    Eur J Neurosci

    (2009)
  • K. Wunderlich et al.

    Mapping value based planning and extensively trained choice in the human brain

    Nat Neurosci

    (2012)
  • T. Jaakkola et al.

    Reinforcement learning algorithm for partially observable Markov decision problems

    Adv Neural Inf Process Syst

    (1995)
  • A.G. Barto et al.

    Recent advances in hierarchical reinforcement learning

    Discret Event Dyn Syst

    (2003)