Gaze data reveal distinct choice processes underlying model-based and model-free reinforcement learning

Introduction

It is becoming clear that there are multiple modes of learning and decision-making. For instance, when learning which sequence of actions to choose, some decision-makers behave as if they are ‘model-free’, simply repeating actions that previously yielded rewards, while others behave as if they are ‘model-based’, additionally taking into account whether those outcomes were likely or unlikely given their actions1,2,3,4,5,6. Model-free behaviour is thought to reflect reinforcement learning7, guided by reward prediction errors8,9,10,11,12. On the other hand, model-based behaviour is thought to reflect consideration of the task structure, that is, decision-makers form a ‘model’ of the environment. Model-based choice has been linked to goal-based behaviour13,14, cognitive control15, slower habit formation16, declarative memory17 and extraversion18. What remains unclear is whether model-free behaviour is a distinct approach to decision-making or whether it simply reflects suboptimal inference.
Behavioural and neural evidence seems to support a hybrid model in which individuals exhibit mixtures of model-based and model-free behaviour4,5,19. These models assume that the brain uses reward outcomes to update the expected values (‘Q-values’) of the alternatives and then compares those Q-values at the time of choice. They implicitly assume that the evaluation and choice stages occur at the same time for the model-based and model-free components, via a similar reinforcement process, with the model-based component additionally incorporating action-state transition probabilities.
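
For concreteness, the hybrid account can be sketched as a weighted mixture of two valuation routines. The following Python sketch is illustrative only: the parameter names (alpha, w), the one-step temporal-difference rule and the array shapes are assumptions in the spirit of the standard hybrid model, not the exact specification fitted in the studies cited.

```python
import numpy as np

def model_free_update(q_mf, state, action, reward, next_value, alpha=0.5):
    """Temporal-difference update: shift the cached Q-value toward the outcome."""
    delta = reward + next_value - q_mf[state, action]   # reward prediction error
    q_mf[state, action] += alpha * delta
    return q_mf

def model_based_values(q_stage2, transition_probs):
    """First-stage values computed from the task model: the expected value of the
    best second-stage option under the action-state transition probabilities."""
    best_stage2 = q_stage2.max(axis=1)                   # best option in each state
    return transition_probs @ best_stage2                # P(state | action) * value

def hybrid_values(q_mb, q_mf_stage1, w=0.5):
    """Weighted mixture compared at choice time; w indexes model-based control."""
    return w * q_mb + (1 - w) * q_mf_stage1
```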
In support of this view, recent neural evidence suggests that the brain employs an arbitration system that compares the reliability of the two systems and adjusts the relative contribution of the model-based component at the time of choice20. Other evidence argues for a cooperative architecture between the model-based and model-free systems21. Finally, model-based choices have been linked to prospective neural signals reflecting the desired state, supporting the hypothesis of goal-based evaluation22.
At the same time, a parallel literature has used eye tracking and sequential-sampling models to better understand value-based decision-making23,24,25,26,27,28,29. This work has shown that a process of evidence accumulation and comparison drives choices and that this process depends on overt attention. Krajbich et al.23 developed the attentional drift-diffusion model (aDDM) to capture this phenomenon. The idea behind this work is that gaze is an indicator of attentional allocation30 and that attention to an option magnifies its value relative to the other alternative(s). Subsequent studies have identified similar relationships between option values, eye movements, response times (RT) and choices24,31. Despite this work and the vast literature on oculomotor control and visual search, the connection between selective attention and reinforcement learning, particularly model-based learning, remains unclear26,32,33,34,35. Some recent evidence does suggest a link between attention and choice in a simple reinforcement-learning task26, but in general the use of eye tracking in this literature is only just beginning35,36,37.
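
The core mechanism of the aDDM is gaze-contingent evidence accumulation: while one option is fixated, the value of the unattended option is discounted, so gaze tilts the comparison in favour of the attended option. Below is a minimal simulation sketch of that idea; the parameter values (d, theta, sigma, bound) and the gaze schedule are illustrative assumptions, not the fitted values from ref. 23.

```python
import numpy as np

def simulate_addm_trial(v_left, v_right, gaze_left, d=0.002, theta=0.3,
                        sigma=0.02, bound=1.0, dt=1, rng=None):
    """Gaze-weighted drift diffusion: the unattended option's value is discounted
    by theta, so fixating an option boosts its relative evidence.
    gaze_left: callable returning True if gaze is on the left option at time t (ms).
    Returns (choice, response_time_ms)."""
    rng = np.random.default_rng() if rng is None else rng
    evidence, t = 0.0, 0
    while abs(evidence) < bound:
        if gaze_left(t):
            drift = d * (v_left - theta * v_right)
        else:
            drift = d * (theta * v_left - v_right)
        evidence += drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return ('left' if evidence > 0 else 'right'), t
```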
Here we seek to investigate whether model-based and model-free behaviour reflect a common choice mechanism that utilizes the same information uptake and value-comparison process (but with varying degrees of accuracy), or whether these choice modes rely on distinct processes. To do so, we use eye-tracking to study human subjects in a two-stage learning task designed by Daw et al.3 to distinguish between model-free and model-based learning. Gaze data allow us to test whether model-free and model-based subjects engage in the same choice process, and whether model-free subjects ignore task-relevant information or simply misinterpret that information. Interestingly, we find that the choices of model-free subjects show clear signs of an aDDM-like comparison process, whereas model-based subjects appear to already know which option they will choose, showing signs of directed visual search. Furthermore, model-free subjects often ignore task-relevant information, suggesting that they approach the task in a different way than the model-based subjects.

Results

Behavioural results

We carried out an eye-tracking experiment using a two-stage decision-making task that discriminates between model-based and model-free learning3. Forty-three subjects completed the experiment, which consisted of two conditions with 150 trials each.
In the first condition, we replicated the standard design. Each trial had two stages. In the first stage, subjects chose between two Tibetan symbols (arbitrarily labelled ‘A’ and ‘B’ for further analyses) that could lead to one of two second-stage states, ‘purple’ and ‘blue’ (Fig. 1a). The transition was stochastic: one symbol was more likely to lead to the blue state, and the other was more likely to lead to the purple state. Thus, each first-stage symbol had a ‘common’ state (probability 0.7) and a ‘rare’ state (probability 0.3) associated with it. Once one of those states was reached, subjects made another choice between two symbols of the respective colour (Fig. 1b). Each of these four symbols was rewarded with a different probability that drifted slowly over the course of the experiment, independently for each symbol and irrespective of the subjects’ choices.
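
The task structure just described can be summarized in a short simulation sketch. The common/rare transition probabilities (0.7/0.3) come from the design above; the random-walk step size and reflecting bounds on the reward probabilities are illustrative assumptions, since the text only states that the probabilities drifted slowly and independently.

```python
import numpy as np

class TwoStageTask:
    """First-stage symbols 'A'/'B' lead to the blue or purple state with
    common (0.7) / rare (0.3) probability; each of the four second-stage
    symbols pays out with a slowly drifting probability."""
    def __init__(self, step_sd=0.025, lo=0.25, hi=0.75, rng=None):
        self.rng = np.random.default_rng() if rng is None else rng
        self.p_common = 0.7
        self.reward_probs = self.rng.uniform(lo, hi, size=(2, 2))  # (state, symbol)
        self.step_sd, self.lo, self.hi = step_sd, lo, hi

    def first_stage(self, action):                  # action: 0 = 'A', 1 = 'B'
        common = self.rng.random() < self.p_common
        return action if common else 1 - action     # state: 0 = blue, 1 = purple

    def second_stage(self, state, action):
        reward = int(self.rng.random() < self.reward_probs[state, action])
        # Independent Gaussian random walks with reflecting bounds for each symbol
        self.reward_probs = np.clip(
            self.reward_probs + self.rng.normal(0, self.step_sd, (2, 2)),
            self.lo, self.hi)
        return reward
```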
Figure 1: Experimental design.
(a) Choice-trial timeline. Subjects are forced to fixate at the centre of the screen for 2 s before every choice. A first-stage choice between two beige symbols yields one of two second-stage states with either blue or purple symbols. Once one of the symbols is selected, it is shown in the centre of the screen for 2 s, and the stochastic outcome is displayed. (b) Transition structure. One of the first-stage symbols is more likely to lead to the blue state; the other is more likely to lead to the purple state. (c) First-stage choice screen in the second part of the experiment. The blue/purple coloured bars indicate the change in transition probability on a particular trial (‘colour deviation’).