Reinforcement learning and human behavior

https://doi.org/10.1016/j.conb.2013.12.004

Highlights

  • Standard RL explains some aspects of operant learning and its underlying neural activity.

  • Nevertheless, some operant learning behaviors seem inconsistent with standard RL.

  • Inferring a world model is an important part of state-based learning.

  • Direct parametric policy learning bypasses the need to learn a model of the world in terms of the relevant state-action pairs.

The dominant computational approach to modeling operant learning and its underlying neural activity is model-free reinforcement learning (RL). However, there is accumulating behavioral and neural evidence that human (and animal) operant learning is far more multifaceted. Theoretical advances in RL, such as hierarchical and model-based RL, extend the explanatory power of RL to account for some of these findings. Nevertheless, some other aspects of human behavior remain inexplicable even in the simplest tasks. Here we review developments and remaining challenges in relating RL models to human operant learning. In particular, we emphasize that learning a model of the world is an essential step before or in parallel to learning the policy in RL, and we discuss alternative models that directly learn a policy without an explicit world model in terms of state-action pairs.


Model-free RL

The computational problem in many operant learning tasks can be formulated in a framework known as Markov Decision Processes (MDPs) [1]. In MDPs, the world can be in one of several states, which determine the consequences of the agent's actions with respect to future rewards and world states. A policy defines the agent's behavior in a given situation. In MDPs, a policy is a mapping from the states of the environment to the actions to be taken in those states [1]. Finding the optimal policy is the computational goal of the agent; model-free RL algorithms, such as Q-learning, approach it by learning the values of state-action pairs directly from experienced rewards, without an explicit model of the environment.
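To make the model-free setting concrete, here is a minimal sketch (ours, not the article's) of tabular Q-learning on a toy two-state MDP; the states, actions, rewards, and learning parameters are illustrative assumptions.

```python
# Minimal sketch (not from the article): tabular Q-learning on a toy MDP.
# States, actions, rewards, and parameters below are illustrative assumptions.
import random

states = ["hungry", "sated"]
actions = ["press_lever", "rest"]

def step(state, action):
    """Toy transition/reward function standing in for the environment."""
    if state == "hungry" and action == "press_lever":
        return "sated", 1.0       # lever press yields food
    if state == "sated" and action == "rest":
        return "hungry", 0.0      # resting lets hunger return
    return state, 0.0

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in states for a in actions}

state = "hungry"
for _ in range(5000):
    # epsilon-greedy policy: a mapping from states to actions
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    next_state, reward = step(state, action)
    # model-free update: the value changes only for the executed state-action pair
    td_error = reward + gamma * max(Q[(next_state, a)] for a in actions) - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    state = next_state

print({k: round(v, 2) for k, v in Q.items()})
```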

Model-based RL

When training is intense, task-independent reward devaluation, for example through satiety, has little immediate effect on behavior. This habitual learning is consistent with model-free RL because, in this framework, the value of an action is updated only when it is executed. By contrast, when training is moderate, the response to reward devaluation is immediate and substantial [21]. This and other behaviors (e.g., planning) are consistent with an alternative RL approach, known as model-based RL, in which action values are computed at decision time from a learned model of the environment's transitions and rewards.
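The devaluation logic can be illustrated with a small sketch (our illustration under simplifying assumptions, not the cited experiments): cached model-free values do not change until the devalued action is executed again, whereas values recomputed from a world model reflect the devaluation immediately.

```python
# Minimal sketch (assumption, not the article's model): contrast between
# cached model-free values and model-based values recomputed from a world
# model after reward devaluation (e.g., satiety). One state, two actions;
# the numbers are illustrative.

# Learned world model: expected reward of each action.
reward_model = {"lever": 1.0, "chain": 0.5}

# Cached model-free values acquired during training.
q_cached = dict(reward_model)

def model_based_values(model):
    # Values are recomputed from the model at decision time.
    return dict(model)

# Outcome devaluation: the food earned by the lever is no longer rewarding.
reward_model["lever"] = 0.0

print("model-free (unchanged until the lever is pressed again):", q_cached)
print("model-based (immediately reflects devaluation):", model_based_values(reward_model))
```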

The curse of dimensionality and the blessing of hierarchical RL

There are theoretical reasons why the RL models described above cannot fully account for operant learning in natural environments. First, the computational problem of finding the values is bedeviled by the 'curse of dimensionality': the number of states grows exponentially with the number of variables that define a state [1]. Second, when the state of the world is only partially known (i.e., the environment is a partially observable MDP, POMDP), applying model-free algorithms such as Q-learning may fail to converge or may converge to suboptimal policies.
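A few lines of arithmetic make the curse of dimensionality concrete (the variable counts and the number of actions below are arbitrary assumptions):

```python
# Minimal sketch of the 'curse of dimensionality' (illustrative numbers):
# with n binary variables defining a state, a tabular learner must represent
# 2**n states, so the table size explodes with n.
n_actions = 4
for n_variables in (5, 10, 20, 40):
    n_states = 2 ** n_variables
    print(f"{n_variables:2d} binary variables -> {n_states:>16,} states "
          f"({n_states * n_actions:,} state-action values)")
```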

Challenges in relating human behavior to RL algorithms

Despite the many successes of the different RL algorithms in explaining some of the observed human operant learning behaviors, others are still difficult to account for. For example, humans tend to alternate rather than repeat an action after receiving a positively surprising payoff. This behavior is observed both in simple repeated two-alternative forced-choice tasks with probabilistic rewards (also known as the 2-armed bandit task, Figure 1a) and in the stock market [28]. Moreover, a recent
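As a concrete (and purely illustrative) rendering of that setting, the sketch below simulates a two-armed bandit with hypothetical reward probabilities and a simple "alternate after a positive surprise" rule of the kind described above; it is not a model fitted to the cited data.

```python
# Minimal sketch (illustrative, not the cited experiments): a repeated
# two-alternative forced-choice task with probabilistic rewards (a 2-armed
# bandit), the setting in which humans are reported to alternate after a
# positively surprising payoff.
import random

p_reward = {"left": 0.7, "right": 0.3}    # hypothetical reward probabilities

def trial(choice):
    return 1.0 if random.random() < p_reward[choice] else 0.0

# A simple "alternate after surprise" heuristic, for contrast with
# reward-maximizing value learning.
choice, expected = "left", 0.5
for t in range(10):
    payoff = trial(choice)
    surprise = payoff - expected
    expected += 0.2 * surprise             # running payoff estimate
    print(t, choice, payoff)
    if surprise > 0:                       # positively surprising payoff
        choice = "right" if choice == "left" else "left"
```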

Heterogeneity in world model

The lack of uniformity in behavior even in simple tasks could be due to heterogeneity in the prior expectations of the participants. From the experimentalist's point of view, the two-armed bandit task, for example, is simple: the world is characterized by a single state and two actions (Figure 1a). However, from the participant's point of view there is, theoretically, an infinite repertoire of possible world models characterized by different sets of states and actions. This could be true
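To illustrate the point (a toy assumption of ours, not taken from the article), here are two of the many world models a participant could entertain for the same bandit: the experimenter's single-state description, and a richer model whose states are defined by the previous outcome. The richer model multiplies the number of state-action values that must be learned and can support different behavior.

```python
# Minimal sketch (hypothetical models for illustration): two candidate world
# models for the same two-armed bandit. The experimenter's model has a single
# state; an alternative model conditions on the previous outcome.
experimenter_model = {
    "states": ["s0"],
    "actions": ["left", "right"],
}
participant_model = {
    "states": ["after_win", "after_loss"],   # hypothetical richer state set
    "actions": ["left", "right"],
}
for name, model in [("experimenter", experimenter_model),
                    ("participant", participant_model)]:
    n = len(model["states"]) * len(model["actions"])
    print(f"{name}: {n} state-action values to learn")
```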

Learning the world model

Models of operant learning often take as given that the learner has already recognized the available sets of states and actions (Figure 2a). Hence, when attempting to account for human behavior, they fail to consider the necessary preliminary step of identifying them (correctly or incorrectly). In machine learning, classification is often preceded by unsupervised dimension reduction for feature extraction [38, 39]. Similarly, it has been suggested that operant learning is a two-step process, in which a representation of the relevant states and actions is first extracted and the policy is then learned over that representation.
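A minimal sketch of such a two-step scheme, under assumptions of our own (synthetic 50-dimensional observations with one relevant latent direction; PCA via SVD as the unsupervised step): the reduced feature is discretized into the states on which a value-based learner would subsequently operate.

```python
# Minimal sketch (our illustration, not the article's algorithm): a two-step
# scheme in which high-dimensional observations are first reduced without
# supervision (projection onto the top principal component) and the reduced
# representation is then discretized into the states used for learning.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical observations: 200 samples of a 50-dimensional sensory vector
# whose relevant structure lies along a single latent direction.
latent = rng.choice([0.0, 1.0], size=200)
obs = np.outer(latent, rng.normal(size=50)) + 0.1 * rng.normal(size=(200, 50))

# Step 1: unsupervised dimension reduction (PCA via SVD).
obs_centered = obs - obs.mean(axis=0)
_, _, vt = np.linalg.svd(obs_centered, full_matrices=False)
projection = obs_centered @ vt[0]

# Step 2: define discrete states from the reduced feature; a value-based
# learner would then operate on these states rather than on raw observations.
states = (projection > projection.mean()).astype(int)
print("recovered", len(set(states)), "states from", obs.shape[1], "dimensional input")
```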

Learning without states

Operant learning can also be accomplished without an explicit representation of states and actions, by directly tuning a parametric policy (Figure 2d). A plausible implementation of such direct policy learning is through stochastic policy-gradient methods [42, 43, 44]. The idea behind these methods is that the gradient of the average reward (with respect to the policy parameters) can be estimated on-line by perturbing a neural network model with noise and considering the effect of these perturbations on the obtained reward.
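The following sketch (an illustration under simplifying assumptions, not the article's implementation) applies this idea to a single-parameter logistic policy on a hypothetical two-armed bandit: the parameter is perturbed with Gaussian noise and updated in proportion to the correlation between the perturbation and the reward obtained, relative to a running baseline.

```python
# Minimal sketch (illustrative assumptions): estimating the gradient of the
# average reward by perturbing the parameter of a stochastic policy with noise
# and correlating the perturbation with the obtained reward (a policy-gradient
# idea akin to weight/node perturbation).
import math
import random

p_reward = {"left": 0.7, "right": 0.3}     # hypothetical two-armed bandit

def policy_prob_left(theta):
    return 1.0 / (1.0 + math.exp(-theta))  # logistic parametric policy

def run_trial(theta):
    choice = "left" if random.random() < policy_prob_left(theta) else "right"
    return 1.0 if random.random() < p_reward[choice] else 0.0

theta, sigma, eta, baseline = 0.0, 0.5, 0.2, 0.5
for _ in range(2000):
    noise = random.gauss(0.0, sigma)        # perturb the policy parameter
    reward = run_trial(theta + noise)
    # gradient estimate: perturbation times reward deviation from baseline
    theta += eta * noise * (reward - baseline)
    baseline += 0.01 * (reward - baseline)  # running reward baseline

print("P(choose left) ->", round(policy_prob_left(theta), 2))
```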

Concluding remarks

RL is the dominant theoretical framework for operant learning in humans and animals. RL models have been partially successful in the quantitative modeling of learning behavior and have provided important insights into the putative role of different brain structures in operant learning. Yet substantial theoretical as well as experimental challenges remain, indicating that these models may be substantially oversimplified. In particular, how state-space representations are learned in operant learning remains an open question.

References and recommended reading

Papers of particular interest, published within the period of review, have been highlighted as:

  • of special interest

  •• of outstanding interest

Acknowledgements

We would like to thank Ido Erev for many fruitful discussions and David Hansel, Gianluigi Mongillo, Tal Neiman and Ran Darshan for carefully reading the manuscript.

This work was supported by the Israel Science Foundation (Grant No. 868/08), by a grant from the Ministry of Science and Technology, Israel, the Ministry of Foreign and European Affairs and the Ministry of Higher Education and Research, France, and by the Gatsby Charitable Foundation.

References (53)

  • J. O’Doherty et al.

    Dissociable roles of ventral and dorsal striatum in instrumental conditioning

    Science

    (2004)
  • S.M. Nicola

    The nucleus accumbens as part of a basal ganglia action selection circuit

    Psychopharmacology (Berl)

    (2007)
  • J. Denrell

    Adaptive learning and risk taking

    Psychol Rev

    (2007)
  • R. Hertwig et al.

    Decisions from experience and the effect of rare events in risky choice

    Psychol Sci

    (2004)
  • E. Yechiam et al.

    Using cognitive models to map relations between neuropsychological disorders and human decision-making deficits

    Psychol Sci

    (2005)
  • T.V. Maia et al.

    From reinforcement learning models to psychiatric and neurological disorders

    Nat Neurosci

    (2011)
  • Q.J.M. Huys et al.

    Bonsai trees in your head: how the pavlovian system sculpts goal-directed choices by pruning decision trees

    PLoS Comput Biol

    (2012)
  • S. Lammel et al.

    Reward and aversion in a heterogeneous midbrain dopamine system

    Neuropharmacology

    (2013)
  • M.D. Iordanova

    Dopamine transmission in the amygdala modulates surprise in an aversive blocking paradigm

    Behav Neurosci

    (2010)
  • M. Joshua et al.

    Midbrain dopaminergic neurons and striatal cholinergic interneurons encode the difference between reward and aversive events at different epochs of probabilistic classical conditioning trials

    J Neurosci

    (2008)
  • C.D. Fiorillo et al.

    Discrete coding of reward probability and uncertainty by dopamine neurons

    Science

    (2003)
  • B. Seymour et al.

    Serotonin selectively modulates reward value in human decision-making

    J Neurosci

    (2012)
  • E. Tricomi et al.

    A specific role for posterior dorsolateral striatum in human habit learning

    Eur J Neurosci

    (2009)
  • K. Wunderlich et al.

    Mapping value based planning and extensively trained choice in the human brain

    Nat Neurosci

    (2012)
  • T. Jaakkola et al.

    Reinforcement learning algorithm for partially observable Markov decision problems

    Adv Neural Inf Process Syst

    (1995)
  • A.G. Barto et al.

    Recent advances in hierarchical reinforcement learning

    Discret Event Dyn Syst

    (2003)