
Neural Networks

Volume 15, Issues 4–6, June–July 2002, Pages 549-559

2002 Special issue
Dopamine: generalization and bonuses

https://doi.org/10.1016/S0893-6080(02)00048-5

Abstract

In the temporal difference model of primate dopamine neurons, their phasic activity reports a prediction error for future reward. This model is supported by a wealth of experimental data. However, in certain circumstances, the activity of the dopamine cells seems anomalous under the model, as they respond in particular ways to stimuli that are not obviously related to predictions of reward. In this paper, we address two important sets of anomalies, those having to do with generalization and novelty. Generalization responses are treated as the natural consequence of partial information; novelty responses are treated by the suggestion that dopamine cells multiplex information about reward bonuses, including exploration bonuses and shaping bonuses. We interpret this additional role for dopamine in terms of its mechanistic attentional and psychomotor effects, which serve the computational role of guiding exploration.

Introduction

Much evidence, reviewed by Schultz (1998), suggests that dopamine (DA) cells in the primate midbrain play an important role in reward and action learning. Electrophysiological studies in both instrumental (Schultz, 1992, Schultz, 1998) and classical (Waelti, Dickinson, & Schultz, 2001) conditioning tasks support a theory that DA cells signal a global prediction error for summed future reward in appetitive conditioning tasks (Montague et al., 1996, Schultz et al., 1997), in the form of a temporal difference (TD) prediction error term. One use of this term is training the predictions themselves, a standard interpretation for the preparatory aspects of classical conditioning; another is finding the actions that maximize reward, as in a two-factor learning theory for the interaction of classical and instrumental conditioning. Storage of the predictions involves at least the basolateral nuclei of the amygdala (Hatfield et al., 1996, Holland and Gallagher, 1999, Whitelaw et al., 1996) and the orbitofrontal cortex (Gallagher et al., 1999, O'Doherty et al., 2001, Rolls, 2000, Schoenbaum et al., 1998, Schoenbaum et al., 1999, Schultz et al., 2000, Tremblay and Schultz, 2000a, Tremblay and Schultz, 2000b). The neural substrate for the dopaminergic control over action is rather less clear (Dayan, 2000, Dickinson and Balleine, 2001, Houk et al., 1995, Montague et al., 1996).
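For reference, the TD prediction error at time t is δ(t) = r(t) + γV(t + 1) - V(t), where V is the prediction of summed future reward. The following is a minimal sketch, not the original implementation, of how this single error term can drive learning of the predictions; the toy chain task and the parameters are our own illustrative choices.

```python
import numpy as np

def td_error(r, v_next, v_cur, gamma=0.98):
    """TD prediction error: delta = r + gamma * V(next) - V(current)."""
    return r + gamma * v_next - v_cur

# Toy example (ours): a 5-state chain with reward only at the end;
# the same delta that trains V could also be used to score actions.
n_states, alpha, gamma = 5, 0.1, 0.98
V = np.zeros(n_states + 1)                    # extra entry: terminal state with V = 0
reward = np.zeros(n_states); reward[-1] = 1.0

for trial in range(200):
    for s in range(n_states):
        delta = td_error(reward[s], V[s + 1], V[s], gamma)
        V[s] += alpha * delta                 # predictions learn from the error
print(np.round(V[:n_states], 3))              # approaches gamma ** (n_states - 1 - s)
```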

The computational role of dopamine in reward learning is controversial for various reasons (Gray et al., 1997, Ikemoto and Panksepp, 1999, Redgrave et al., 1999). First, stimuli that are not associated with reward prediction are known to activate the dopamine system in a non-trivial manner, including stimuli that are novel and salient, or that physically resemble other stimuli that do predict reward (Schultz, 1998). In both cases, an important aspect of the dopamine response is that it sometimes consists of a short-term increase above baseline followed by a short-term decrease below baseline. Second, dopamine release is associated with a set of motor effects, such as species- and stimulus-specific approach behaviors, that seem either irrelevant or detrimental to the delivery of reward. We call these motor effects mechanistic because of their apparent independence from prediction or action.

In this paper (see also Suri and Schultz, 1999, Suri, 2002), we study several of these apparently anomalous activations of dopamine cells. We interpret the short-term increase and decrease in the light of generalization as an example of partial information: the response is exactly what would be expected were the animal initially uncertain as to whether or not the presented stimulus was the one associated with food. We interpret the short-term effects after new stimuli as suggesting that the DA system multiplexes information about bonuses on top of information about rewards. Bonuses are fictitious quantities added to rewards (Dayan and Sejnowski, 1996, Sutton, 1990) or values (Ng, Harada, & Russell, 1999) to ensure appropriate exploration in new or changing environments.
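To make the distinction concrete, the sketch below contrasts an exploration or novelty bonus added directly to the reward (in the spirit of Sutton, 1990 and Dayan & Sejnowski, 1996) with a potential-based shaping bonus defined over values (Ng, Harada, & Russell, 1999). The count-based novelty measure and the constants are illustrative assumptions of ours, not taken from those papers.

```python
import numpy as np

def exploration_bonus(r, visit_count, kappa=0.1):
    """Fictitious extra reward: rarely visited states look temporarily better."""
    return r + kappa / np.sqrt(visit_count + 1)

def shaping_bonus(phi_s, phi_next, gamma=0.98):
    """Potential-based shaping term F(s, s') = gamma * Phi(s') - Phi(s); equivalent
    to adding the potential Phi to the values, and it provably leaves the optimal
    policy unchanged (Ng, Harada, & Russell, 1999)."""
    return gamma * phi_next - phi_s

# A state visited once gets a larger effective reward than a well-explored one.
print(exploration_bonus(0.0, visit_count=1), exploration_bonus(0.0, visit_count=100))
```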

In Section 2, we describe the TD model of dopamine activity. In Section 3 we discuss generalization; in Section 4 we discuss novelty responses and bonuses.

Section snippets

Temporal difference and dopamine activity

Fig. 1 shows three aspects of the activity of dopamine cells, together with the associated TD model. The electrophysiological data in Fig. 1(A) and (B) are based on a set of reaction-time operant conditioning trials, in which monkeys are learning the relationship between an auditory conditioned stimulus (the CS) and the delivery of a juice reward (the unconditioned stimulus or US). The monkeys had to keep their hands on a resting key until the sound was played, and then they had to depress a
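The model traces accompanying these data come from this kind of TD account. As a rough sketch in the same spirit (the representation, parameters, and trial counts below are our own choices, not those used for the figure), a tapped delay-line or "complete serial compound" representation of the CS lets the prediction error migrate over trials from the time of the juice reward back to the time of the CS:

```python
import numpy as np

T, cs_t, us_t = 20, 5, 15      # time steps per trial; CS onset and juice (US) delivery
alpha = 0.3
w = np.zeros(T)                # one weight per delay-line tap following CS onset

def run_trial(w):
    x = np.zeros((T, T))       # x[t] is the stimulus representation at time t
    for t in range(cs_t, T):
        x[t, t - cs_t] = 1.0   # tap marking "time elapsed since CS onset"
    r = np.zeros(T); r[us_t] = 1.0
    V = x @ w                  # predictions of summed future reward
    delta = np.zeros(T)
    for t in range(1, T):
        delta[t] = r[t] + V[t] - V[t - 1]   # TD error (gamma = 1), aligned with the event at t
        w += alpha * delta[t] * x[t - 1]    # train the prediction made at t - 1
    return delta

before = run_trial(w.copy())   # untrained trial: error at the time of the reward
for _ in range(300):
    after = run_trial(w)       # repeated trials train w in place
print(np.argmax(before), np.argmax(after))  # 15 (US time) before, 5 (CS time) after
```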

Generalization and uncertainty

Fig. 3 shows two aspects of the behavior of dopamine cells that are not obviously in accord with the temporal difference model. These come from two related tasks (Schultz & Romo, 1990) in which there are two boxes in front of a monkey, one of which always contains food (door+) and one of which never contains food (door−). On a trial, the monkey keeps its hand on a resting key until one of the doors opens (usually accompanied by both visual and auditory cues). If door+ opens, the monkey has to
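The partial-information account of the responses in this task can be made quantitative. At the instant a door opens, before the animal has identified which door it is, the prediction jumps to p·V+, where p is the probability it assigns to the opening door being door+; once the stimulus is identified as door−, the prediction collapses to zero and the error is correspondingly negative. A toy numerical sketch (the numbers are purely illustrative):

```python
V_plus = 1.0     # value predicted by door+ (food always present)
p_plus = 0.5     # probability assigned to "this is door+" before identification

# Phase 1: a door opens but has not yet been identified.
delta_open = p_plus * V_plus - 0.0        # +0.5: burst scaled by the uncertainty
# Phase 2: the stimulus is resolved as door- (food never delivered).
delta_resolved = 0.0 - p_plus * V_plus    # -0.5: matching depression below baseline
print(delta_open, delta_resolved)
```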

Novelty responses

Another main difference between the temporal difference model of the activity of dopamine cells and their actual behavior has to do with novelty. Salient, novel stimuli are reported to activate dopamine cells for anywhere from a few to many trials. One example of this may be the small response at the time of the stimulus in the top line of Fig. 1(A). Here, there is a slight increase in the response locked to the stimulus, with no subsequent decrement below baseline. In this case, the activity could
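One way such transient responses can be accommodated is to treat novelty as a bonus added to the reward that decays as the stimulus becomes familiar. A rough sketch, assuming an exponential decay in the number of presentations (both the functional form and the constants are our own illustrative choices):

```python
import numpy as np

def novelty_bonus(n_presentations, b0=0.5, tau=5.0):
    """Fictitious extra reward for a novel stimulus, decaying with familiarity."""
    return b0 * np.exp(-n_presentations / tau)

# The bonus-inflated phasic response shrinks as the stimulus is repeated.
for n in [0, 1, 5, 20]:
    print(n, round(novelty_bonus(n), 3))
```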

Discussion

We have suggested a set of interpretations for the activity of the DA system to complement that of reporting prediction error for reward. First, we considered activating and depressing generalization responses, arguing that they come from short-term ambiguity about the predictive stimuli presented. Second, we considered novelty responses, showing that they are exactly what would be expected were the dopamine cells reporting a prediction error for reward in a sophisticated reinforcement

Acknowledgements

Funding is from the NSF and the Gatsby Charitable Foundation. We are very grateful to Nathaniel Daw, Jon Horvitz, Peter Redgrave, Roland Suri, Rich Sutton, and an anonymous reviewer for helpful comments. This paper is based on Kakade and Dayan (2000).

References (67)

  • P. Redgrave et al., Is the short-latency dopamine response too short to signal reward error? Trends in Neurosciences (1999)
  • J.D. Salamone, The involvement of nucleus accumbens dopamine in appetitive and aversive motivation, Behavioural Brain Research (1994)
  • W. Schultz, Activity of dopamine neurons in the behaving primate, Seminars in the Neurosciences (1992)
  • R.E. Suri, TD models of reward predictive responses in dopamine neurons, Neural Networks (2002)
  • R.E. Suri et al., A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task, Neuroscience (1999)
  • R.S. Sutton, Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, Machine Learning, Proceedings of the Seventh International Conference (1990)
  • A.G. Barto et al., Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics (1983)
  • D.P. Bertsekas et al., Neuro-dynamic programming (1996)
  • R.I. Brafman et al., R-MAX—A general polynomial time algorithm for near-optimal reinforcement learning (2001)
  • K. Breland et al., The misbehavior of organisms, American Psychologist (1961)
  • R.M. Church, Properties of the internal clock, Annals of the New York Academy of Sciences (1984)
  • J.D. Cohen et al.
  • N.D. Daw et al., Opponent interactions between serotonin and dopamine, Neural Networks (2002)
  • P. Dayan, Motivated reinforcement learning
  • P. Dayan et al., Theoretical neuroscience (2001)
  • P. Dayan et al., Exploration bonuses and dual control, Machine Learning (1996)
  • A. Dickinson et al., The role of learning in motivation
  • K. Doya, Reinforcement learning in continuous time and space, Neural Computation (1999)
  • J. Ekelund et al., Association between novelty seeking and type 4 dopamine receptor gene in a large Finnish cohort sample, American Journal of Psychiatry (1999)
  • M. Gallagher et al., Orbitofrontal cortex and representation of incentive value in associative learning, Journal of Neuroscience (1999)
  • J.A. Gray et al., Dopamine's role, Science (1997)
  • S. Grossberg et al., Neural dynamics of attentionally modulated Pavlovian conditioning: Conditioned reinforcement, inhibition, and opponent processing, Psychobiology (1987)
  • J.-S. Han et al., The role of an amygdalo-nigrostriatal pathway in associative learning, Journal of Neuroscience (1997)