Abstract
Covert and overt spatial selection behaviors are guided both by visual saliency maps derived from early visual features and by priority maps reflecting high-level cognitive factors. However, whether mid-level perceptual processes associated with visual form recognition contribute to covert and overt spatial selection behaviors remains unclear. We hypothesized that if peripheral visual forms contribute to spatial selection behaviors, then they should do so even when the visual forms are task-irrelevant. We tested this hypothesis in male and female human subjects as well as in male macaque monkeys performing a visual detection task. In this task, subjects reported the detection of a suprathreshold target spot presented on top of one of two peripheral images, and they did so with either a speeded manual button press (humans) or a speeded saccadic eye movement response (humans and monkeys). Crucially, the two images, one with a visual form and the other with a partially phase-scrambled visual form, were completely irrelevant to the task. In both manual (covert) and oculomotor (overt) response modalities, and in both humans and monkeys, response times were faster when the target was congruent with a visual form than when it was incongruent. Importantly, incongruent targets were associated with almost all errors, suggesting that forms automatically captured selection behaviors. These findings demonstrate that mid-level perceptual processes associated with visual form recognition contribute to covert and overt spatial selection. This indicates that neural circuits associated with target selection, such as the superior colliculus, may have privileged access to visual form information.
SIGNIFICANCE STATEMENT Spatial selection of visual information either with (overt) or without (covert) foveating eye movements is critical to primate behavior. However, it is still not clear whether spatial maps in sensorimotor regions known to guide overt and covert spatial selection are influenced by peripheral visual forms. We probed the ability of humans and monkeys to perform overt and covert target selection in the presence of spatially congruent or incongruent visual forms. Even when completely task-irrelevant, images of visual objects had a dramatic effect on target selection, acting much like spatial cues used in spatial attention tasks. Our results demonstrate that traditional brain circuits for orienting behaviors, such as the superior colliculus, likely have privileged access to visual object representations.
Introduction
Spatial selection of stimuli in a cluttered visual scene is central to visual behaviors in primates, and it can occur either overtly (with orienting eye movements) or covertly (without such eye movements). The mechanisms underlying both overt and covert spatial selection behaviors rely on spatial maps in sensorimotor regions that are functionally organized into visual saliency maps, primarily derived from low-level visual processes, and priority maps, representing high-level cognitive processes (Fecteau and Munoz, 2006; Veale et al., 2017; Bisley and Mirpour, 2019). Indeed, classic visual saliency map models are computed from early visual features, such as orientation, motion, and color (Itti and Koch, 2000), whereas priority maps are based on cognitive factors, such as behavioral relevance, expectation, and reward (Awh et al., 2012; Chelazzi et al., 2014; Sprague et al., 2018). Accordingly, visual saliency maps and priority maps are believed to be represented in the neuronal activity of visual, sensorimotor, and associative brain regions, such as primary visual cortex (V1), superior colliculus (SC), and regions of the parietal and prefrontal cortices (Gottlieb et al., 1998; Bisley and Goldberg, 2003; Ignashchenkova et al., 2004; White et al., 2017; Sapountzis et al., 2018; Yan et al., 2018; Chen et al., 2020).
The organization of spatial maps based on a dichotomy of early visual features, on the one hand, and high-level cognitive factors, on the other, ignores whether mid-level perceptual processes related to visual form recognition are represented in these maps. This is inconsistent with multiple lines of evidence suggesting a potential link between visual form recognition and spatial selection behaviors. First, recent studies in monkeys identified a novel attention-related region in the temporal cortex (Bogadhi et al., 2019a; Stemmann and Freiwald, 2019). Importantly, neurons in this attention-related region were selective for peripheral visual forms (Bogadhi et al., 2019b), suggesting a functional link between covert spatial selection and peripheral visual form recognition. Second, studies modeling fixation patterns in free-viewing of natural images show that visual objects predict fixation patterns better than saliency maps based on early visual features (Einhäuser et al., 2008; Yanulevskaya et al., 2013; Kümmerer et al., 2014), indicating an influence of visual forms on overt behaviors. Third, behavioral studies in humans show a rapid detection of faces and animals in peripheral images for saccadic eye movements and attentional capture, suggesting a rapid processing of animate visual forms for overt selection (Kirchner and Thorpe, 2006; Bindemann et al., 2007; Crouzet et al., 2010; Drewes et al., 2011; Devue et al., 2012). However, such rapid detection could be explained by low-level image features or unnatural statistics of the image databases (Honey et al., 2008; Wichmann et al., 2010; Crouzet and Thorpe, 2011; Zhu et al., 2013). Importantly, these studies used visual form images as saccade targets, which were always relevant to the task performance. 
Hence, it remains unclear, from studies using goal-directed and free-viewing paradigms, whether peripheral visual forms contribute to spatial selection when they are rendered irrelevant to the task and equated with nonform images for low-level visual features. We hypothesized that, if peripheral visual forms contribute to spatial selection, then they should do so in both covert and overt behaviors, even when the visual forms are task-irrelevant and equated for low-level visual features.
We investigated the contribution of peripheral visual forms to covert and overt spatial selection using a visual detection task pitting visual form images against 50% phase-scrambled images. Most importantly, all images were irrelevant to the task and matched for low-level image properties. In the covert (humans) and overt (humans and monkeys) tasks, subjects reported the detection of a suprathreshold target with a manual or saccadic eye movement response, respectively. We found that response times were significantly faster when the target was congruent with a visual form image in both covert and overt selection tasks, and in both humans and monkeys. Crucially, almost all response errors were captured by visual forms incongruent with targets. Interestingly, during covert selection, microsaccades following image onsets were biased toward visual forms. These findings demonstrate that peripheral visual forms, even when task-irrelevant, contribute to overt and covert spatial selection and perhaps act as spatial cues for orienting movements (Posner, 1980; Tian et al., 2016).
Materials and Methods
Subjects and ethics approvals
Eleven human subjects (3 males and 8 females; mean age ± SD = 27.3 ± 3.9 years) naive to the purpose of the study and 3 male rhesus monkeys (Macaca mulatta; Monkeys A, F, and M) aged 10, 11, and 10 years, respectively, participated in this study. All human subjects provided written informed consent in accordance with the Declaration of Helsinki. Ethics committees at the Medical Faculty of Tuebingen University reviewed and approved protocols for the human experiments. Monkey experiments were approved by regional governmental offices in Tuebingen.
Experimental setups
Human subjects were seated in a dark room at a viewing distance of 57 cm from a CRT monitor with a resolution of 1400 × 1050 pixels (34.13° × 25.93°). Stimulus display on the monitor was controlled by a 2010 Mac Pro (Apple) running MATLAB (The MathWorks) with the Psychophysics Toolbox extensions (Brainard, 1997). Eye position signals and manual responses were acquired using an EyeLink 1000 infrared eye-tracking system (SR Research) and a Viewpixx button box (VPixx Technologies), respectively.
Monkeys were seated and head-fixed in a primate chair (Crist Instrument) inside a darkened booth at a viewing distance of 72.2 cm from a CRT monitor with a 1024 × 768 resolution (30.96° × 23.47°). For experiments in Monkeys A and M, stimulus display on the monitor was controlled using a modified version of PLDAPS with Datapixx and Psychophysics Toolbox extensions on MATLAB (The MathWorks) running on an Ubuntu operating system (Eastman and Huk, 2012). For experiments in Monkey F, stimulus display was controlled using a LabVIEW system (National Instruments) handshaking with a 2010 Mac Pro (Apple) running MATLAB (The MathWorks) with the Psychophysics Toolbox extensions (for details, see Chen and Hafed, 2013; Tian et al., 2016). Eye position signals in Monkeys A and M were measured using a surgically implanted scleral search coil; eye position signals in Monkey F were measured using an EyeLink 1000 infrared eye-tracking system (SR Research). Surgical procedures for implanting head-holders and scleral coils were described in a previous study (Skinner et al., 2019).
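As a rough check on the display geometry reported above, the horizontal resolution and angular extent of each monitor give an average pixels-per-degree factor. The sketch below is illustrative only: the helper name `pixels_per_degree` is hypothetical, and the linear conversion is an assumption that ignores the tangent correction for a flat screen.

```python
def pixels_per_degree(width_px: int, width_deg: float) -> float:
    """Average horizontal pixels per degree of visual angle (linear approximation)."""
    return width_px / width_deg


# Values taken from the setup descriptions above.
human_ppd = pixels_per_degree(1400, 34.13)   # human setup, 57 cm viewing distance
monkey_ppd = pixels_per_degree(1024, 30.96)  # monkey setup, 72.2 cm viewing distance

print(f"human: {human_ppd:.2f} px/deg, monkey: {monkey_ppd:.2f} px/deg")
```

Multiplying a stimulus size in degrees by this factor gives an approximate size in pixels for the corresponding setup.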
Experimental design
Covert selection task (human subjects)
Subjects started each trial by fixating a central spot of 0.1° radius (97.6 cd/m2) displayed on a gray background (43.84 cd/m2). Eye position signals were monitored to enforce fixation within a fixation window of 2° radius. Following a 500–1000 ms randomized fixation duration, an intact visual form image (4.88° × 4.88°) and its corresponding 50% scrambled form image (see Image normalization) were displayed symmetrically on either side of fixation along the horizontal meridian and centered at 8° eccentricity. After a fixed delay of 100, 200, or 300 ms following image onset, a suprathreshold target (black disk; radius = 0.2°) was displayed at the center of one of the two images. Subjects were instructed to report the spatial location of the target with a left or right button press; and most importantly, they were informed that both images were irrelevant to the detection task. We refer to trials in which the target was presented on top of the visual form image as target congruent trials and trials in which the target was presented on top of the 50% phase-scrambled image as target incongruent trials. All three delay conditions (100, 200, or 300 ms) were randomized across trials. In addition, catch trials with no target were also included in the 100 and 300 ms delay conditions on 25% of trials, and subjects were instructed to withhold their responses in these trials. Data in each subject (n = 11) were collected in a single session.
Overt selection task (humans and monkeys)
The trial structure in the overt task was the same as in the covert task with one difference: in the overt task, subjects reported the detection of the target with a saccade to the target location rather than a button press. In addition, at the same time as target onset, the fixation point was extinguished at the center of the display and re-presented on top of the target to aid the subjects with a speeded saccade response to the target location. This was particularly helpful in instructing monkeys, which would otherwise require additional training to generate target-directed saccades. However, it is important to note that fixation point disappearance was at the center of the display and was completely uninformative of the newly appearing target location. Crucially, the identical visual event happened at the same location in both the target congruent and target incongruent trials that we compared throughout this study.
We also included a single-image condition on 40% of trials. In this case, only one image, either a visual form image or a 50% scrambled form image, was presented simultaneously with a target on top of it in one of the four diagonal locations at 8° eccentricity. The single-image condition with diagonal locations was used to control for any spatial biases in eye movements from repetitive target presentations at the same spatial locations. We collected data in 10 of the 11 human subjects in a single session each and in 3 monkeys across 17 sessions. In monkeys, we used the same visual images (now sized 5.14° × 5.14°) as in the human experiments, with a target (black disk) of radius 0.3°–0.45° across monkeys. The background gray luminance for the monkey experiments was 27.21–37.1 cd/m2. The humans performed the overt selection task before performing the covert selection task.
The randomized fixation duration before image presentation for the overt task in humans was the same as in the covert task (500–1000 ms). However, the duration was slightly different across monkeys. In Monkeys A and M, the duration was 100–700 ms compared with 300–900 ms in Monkey F. This was because of the slightly different experimental setup in which data from Monkey F were collected. Nonetheless, to keep the timing of image presentations comparable across monkeys in all analyses, we excluded trials in Monkeys A and M having fixation durations <300 ms. This trial exclusion was blind to whether the trials were target congruent or target incongruent trials. Most importantly, we have repeated the analyses on all trials (i.e., with no exclusion), and the results were unaltered (see Results).
Image normalization
Forty images with visual forms and their corresponding 50% scrambled form images were used in both human and monkey experiments. Visual form images were obtained from previous electrophysiology studies of the inferotemporal cortex (Tsao et al., 2006; Bogadhi et al., 2019b) and were sampled from four different categories, including human faces, fruits, hands, and inanimate objects with 10 examples in each category (see Fig. 1c).
All images were equated iteratively for luminance distributions (mean = background luminance) and Fourier spectra using the SHINE toolbox (Willenbockel et al., 2010). Briefly, all 40 visual form images were resized to the appropriate dimensions, and the mean gray level of each image was equated to the background level (lumMatch in SHINE). The resultant 40 images were iteratively (n = 20) matched for the histogram of gray levels (histMatch in SHINE) and the Fourier spectra (specMatch in SHINE), before generating their corresponding 50% phase-scrambled images by randomizing half of the phase matrix and keeping the amplitude matrix constant. Finally, all of the visual form images and their corresponding phase-scrambled images were iteratively (n = 20) matched, once again, for the histogram of gray levels and the Fourier spectra to yield the final visual form images (see Fig. 1c, top) and phase-scrambled images (see Fig. 1c, bottom) used in this study.
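The phase-scrambling step can be illustrated with a minimal sketch. This is not the SHINE implementation or the procedure actually used in the study. It is one possible reading of "randomizing half of the phase matrix", under these assumptions: the image is a small square array, half of the conjugate-symmetric frequency pairs are chosen at random, their phases are replaced by uniform random values, and amplitudes are left untouched. The function names `dft2` and `phase_scramble` are hypothetical, and a naive DFT is used only to keep the example free of dependencies.

```python
import cmath
import math
import random


def dft2(x, sign=-1):
    """Naive 2-D DFT of a square list of lists (adequate for tiny demo images)."""
    n = len(x)
    w = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]
    return [[sum(x[r][c] * w[(u * r + v * c) % n]
                 for r in range(n) for c in range(n))
             for v in range(n)] for u in range(n)]


def phase_scramble(img, fraction=0.5, seed=0):
    """Randomize the phase of a fraction of frequency pairs, keeping amplitudes."""
    rng = random.Random(seed)
    n = len(img)
    spec = dft2(img)
    # One representative per conjugate pair; self-conjugate terms (e.g. DC)
    # are never selected, so the mean gray level is preserved.
    pairs = [(u, v) for u in range(n) for v in range(n)
             if (u, v) < ((-u) % n, (-v) % n)]
    for u, v in rng.sample(pairs, round(fraction * len(pairs))):
        theta = rng.uniform(-math.pi, math.pi)
        amp = abs(spec[u][v])
        # Assign opposite phases to the two partners so the inverse stays real.
        spec[u][v] = cmath.rect(amp, theta)
        spec[(-u) % n][(-v) % n] = cmath.rect(amp, -theta)
    out = dft2(spec, sign=+1)
    return [[val.real / (n * n) for val in row] for row in out]


demo_rng = random.Random(1)
demo_img = [[demo_rng.random() for _ in range(6)] for _ in range(6)]
demo_scrambled = phase_scramble(demo_img, fraction=0.5)
```

Because only phases change, the amplitude spectrum of `demo_scrambled` matches that of `demo_img`, which is the property the text relies on when it states that the amplitude matrix was kept constant.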
Suprathreshold target detection
We hypothesized that visual form contribution to spatial selection behaviors, in covert and overt tasks, should be evident as a facilitation of response times between target congruent and target incongruent conditions. Hence, it was important that the differences in response times between target congruent and target incongruent conditions could not be attributed to difficulty in the visual detection of the target across conditions. For this reason, we chose a high-contrast (“black” color) and sufficiently large target (0.2° radius disk). We also confirmed that detection performance during the most difficult condition of our covert task (the 100 ms delay condition; see Results) was suprathreshold in both target congruent (% correct performance = 99.48 ± 0.97% SD) and target incongruent (% correct performance = 99.06 ± 0.94% SD) trials, with no significant difference between the two conditions (Wilcoxon signed-rank test, p = 0.43).
Statistical analyses
Response times and proportions of errors
We measured visual form contribution to spatial selection behaviors on correct and error trials separately. Correct trials were defined as the trials in which the first response of the subject correctly matched the target location. Error trials were those in which the subjects erroneously selected the image that had no target dot superimposed on it (i.e., they selected the image opposite to the target). On correct trials, we quantified the effect of visual forms on response time differences between target congruent and target incongruent conditions. On error trials, we quantified the effect of visual forms on the proportion of errors between target congruent and target incongruent conditions. Response time on a correct trial was calculated as the time of response onset relative to target onset. The proportion of errors in each subject was calculated as the ratio of the number of error trials to the sum of correct and error trials.
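The two per-subject measures defined above can be written out as a short sketch. The trial records and the function name `summarize_condition` are hypothetical. The sketch assumes that the input already contains only correct and error trials from one condition (no catch trials) and that response times are expressed relative to target onset.

```python
from statistics import median


def summarize_condition(trials):
    """trials: list of dicts with 'correct' (bool) and 'rt_ms' (time from target onset)."""
    correct = [t for t in trials if t["correct"]]
    n_errors = len(trials) - len(correct)
    return {
        # Response time is summarized on correct trials only.
        "median_rt_ms": median(t["rt_ms"] for t in correct),
        # Errors are expressed relative to correct + error trials.
        "prop_errors": n_errors / len(trials),
    }


# Made-up example data for one subject and one condition.
example_trials = [
    {"correct": True, "rt_ms": 250.0},
    {"correct": True, "rt_ms": 300.0},
    {"correct": False, "rt_ms": 180.0},
    {"correct": True, "rt_ms": 270.0},
]
summary = summarize_condition(example_trials)
```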
Saccadic responses to targets in the overt task and microsaccades during fixation in both the covert and overt tasks were detected using a velocity and acceleration threshold followed by manual inspection (Krauzlis and Miles, 1996). Trials with microsaccades occurring between 100 ms before image onset and response onset were excluded in the analyses of response times and proportion of errors to control for the effects of microsaccades on stimulus onset activity in different brain regions (Chen et al., 2015).
For analysis of response times in the covert task in humans (e.g., see Fig. 2a), we included an average of 80.6 (SD = 26.4) and 77.9 (SD = 24.9) trials from target congruent and target incongruent conditions, respectively, for a given delay and subject. Similarly, in the overt task (see Fig. 2b), we included an average of 68.2 (SD = 9.8) and 68.6 (SD = 9.8) trials in target congruent and target incongruent conditions, respectively. For the response time analysis in monkeys, trial counts in target congruent and target incongruent conditions are shown in Figure 6. It should be noted here that there was a progressively lower trial count with increasing delay in the monkey data (see Fig. 6). This is primarily because of the exclusion of trials with microsaccades occurring between 100 ms before image onset and 100 ms after target onset. That is, the longer the delay period duration, the more likely it was for microsaccades to have occurred during our exclusion period. This progressive reduction in trial count was also more apparent in Monkeys A and M compared with Monkey F. We measured the microsaccade rate in all 3 monkeys after the image onset, from 150 to 300 ms, and confirmed that Monkeys A and M generated more microsaccades compared with Monkey F (microsaccades/s: mean ± SD = 0.79 ± 0.36 and 1.15 ± 0.47 for Monkeys A and M respectively, vs 0.44 ± 0.13 for Monkey F), explaining the greater loss of trials in Monkeys A and M. Nonetheless, we verified that trial count before exclusion was comparable across the three delay conditions in all monkeys (mean ± SD = 527.66 ± 15.13 in Monkey A, 637.5 ± 8.31 in Monkey F, 183.66 ± 13.53 in Monkey M). More importantly, we repeated our analyses without any microsaccade exclusion (i.e., with matched trial numbers across delays), and the results remained consistent with our main findings (see Results). For all paired comparisons in humans and monkeys, we used Wilcoxon signed-rank tests.
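For readers unfamiliar with the paired test named above, the following sketch computes an exact two-sided Wilcoxon signed-rank p value for small samples. It is an illustration rather than the analysis code used in the study. It assumes no tied absolute differences (it raises an error otherwise), and the function name and the example response times are hypothetical.

```python
def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples (no ties)."""
    d = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    mags = sorted(abs(v) for v in d)
    if len(set(mags)) != len(mags):
        raise ValueError("tied absolute differences are not handled in this sketch")
    n = len(d)
    rank = {m: i + 1 for i, m in enumerate(mags)}
    w_plus = sum(rank[abs(v)] for v in d if v > 0)
    w_minus = n * (n + 1) // 2 - w_plus
    w = min(w_plus, w_minus)
    # counts[s] = number of subsets of ranks {1..n} whose sum is s.
    counts = [1] + [0] * (n * (n + 1) // 2)
    for r in range(1, n + 1):
        for s in range(len(counts) - 1, r - 1, -1):
            counts[s] += counts[s - r]
    p = 2 * sum(counts[: w + 1]) / 2 ** n
    return w, min(p, 1.0)


# Made-up median response times (ms) for 11 subjects; congruent is always faster.
congruent = [300, 310, 295, 320, 305, 315, 290, 325, 330, 285, 335]
incongruent = [c + k for k, c in enumerate(congruent, start=5)]
w_stat, p_value = wilcoxon_signed_rank_exact(congruent, incongruent)
```

When all n paired differences have the same sign, the exact two-sided p value is 2/2^n, that is, about 0.00098 for n = 11 and about 0.00195 for n = 10. This sets the smallest p value such a test can return at these sample sizes.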
Microsaccades
We pooled microsaccades flagged during the covert task across subjects, and we separated them into congruent and incongruent microsaccades based on their directions relative to the visual form image or the scrambled image, respectively. Specifically, we found that the great majority of microsaccades were horizontal because of our image configuration in the task. We therefore grouped all microsaccades within ±32 degrees from horizontal into two groups depending on the direction of their horizontal component: either toward the visual form image (congruent microsaccades) or toward the scrambled image (incongruent microsaccades; see Fig. 4a, inset). The rate of congruent and incongruent microsaccades was constructed by counting the corresponding microsaccades in a 50 ms time window sliding in steps of 5 ms. The proportion of microsaccades congruent with visual forms was calculated as the ratio of the number of congruent microsaccades to the sum of congruent and incongruent microsaccades occurring within a given time bin. To test the statistical significance of congruent and incongruent microsaccade proportions, we used the binomial test in each of the 50 ms time windows. These analyses of microsaccade rate and direction congruence are standard analyses in the field of microsaccade research (Engbert and Kliegl, 2003; Pastukhov and Braun, 2010; Hafed, 2013; Pastukhov et al., 2013; Tian et al., 2016; Baumeler et al., 2020). Finally, in the overt task, saccades to the images replaced poststimulus microsaccades; it was therefore not meaningful to analyze microsaccadic modulations in the overt task.
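The direction grouping, sliding-window counting, and per-window binomial test described above can be sketched as follows. This is a simplified illustration, not the authors' code. All function names are hypothetical, the window is assumed to be centered on each time point, and the two-sided binomial p value is computed by summing the probabilities of all outcomes that are no more likely than the observed one.

```python
import math


def classify_microsaccade(dx, dy, form_side, max_angle_deg=32.0):
    """Label a movement vector relative to the side ('left'/'right') of the form image."""
    if dx == 0:
        return None
    angle = math.degrees(math.atan2(abs(dy), abs(dx)))
    if angle > max_angle_deg:  # too far from horizontal to be grouped
        return None
    toward_right = dx > 0
    return "congruent" if toward_right == (form_side == "right") else "incongruent"


def window_counts(times_ms, centers_ms, width_ms=50.0):
    """Count events falling in a window of width_ms centered on each time point."""
    half = width_ms / 2
    return [sum(1 for t in times_ms if c - half <= t < c + half) for c in centers_ms]


def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided binomial test of k successes in n trials against chance p."""
    pmf = [math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9)))


# Window centers stepping by 5 ms, as in the sliding-window description above.
centers = [100.0 + 5.0 * i for i in range(21)]
congruent_counts = window_counts([100, 110, 120, 200], centers)
```

For a given window, the proportion of congruent microsaccades is the congruent count divided by the sum of congruent and incongruent counts, and `binomial_two_sided` tests that split against 0.5.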
ANOVAs
To test the influence of visual form category on the manual response times in humans and saccade response times in humans and monkeys, we performed ANOVAs on the response time data with three factors: visual forms (intact/scrambled), delay (100, 200, 300 ms), and categories (human faces, fruits, hands, and inanimate objects). In the ANOVA of response times in humans, we included the subjects as a random effect factor.
Results
Both covert and overt selection behaviors are facilitated by task-irrelevant visual form images
We hypothesized that, if peripheral visual forms contribute to spatial selection behaviors in an automatic and bottom-up manner, then response times associated with target detection should be influenced by the spatial congruency between target location and task-irrelevant visual form images. To test this, we first ran human subjects on two target detection tasks: one being covert and involving a manual response (Fig. 1a) and the other being overt and using a foveating eye movement response (Fig. 1b). In both tasks, the subjects had to report, as quickly as possible, the onset of a suprathreshold target stimulus that appeared at one of two possible locations centered on top of either a visual form image or its scrambled version (see Materials and Methods). The subjects were informed a priori that the images behind the possible two target locations were completely irrelevant to the task, and we identified correct trials as those in which the required response (one of two buttons corresponding to each target location or accurate saccade landing at the target location) was spatially accurate; this was the majority of trials (see Materials and Methods). In all cases, the target could appear after one of three possible delays after image onset (Fig. 1). We compared response times on correct trials when the target was congruent with the visual form image to response times when the target was incongruent with the visual form image in each delay condition.
A comparison of cumulative distributions of response times in an example subject, during the 100 ms delay condition, clearly shows that response times for congruent targets were faster compared with incongruent targets in both the covert (Fig. 2a) and overt (Fig. 2d) tasks. A paired comparison of median response times across all subjects further demonstrates that response times for congruent targets were significantly faster compared with incongruent targets, in all three delay conditions (100, 200, and 300 ms) tested, and in both covert (Fig. 2b; Wilcoxon signed-rank test, p = 0.0009 in 100 ms, p = 0.0009 in 200 ms, p = 0.0009 in 300 ms) and overt (Fig. 2e; Wilcoxon signed-rank test, p = 0.0019 in 100 ms, p = 0.027 in 200 ms, p = 0.011 in 300 ms) tasks. This facilitation of response times by the visual form was uniformly present across all delays in the covert task (Fig. 2c), and it was the strongest for the 100 ms delay condition in the overt task (Fig. 2f). These findings demonstrate that peripheral visual forms, even when task-irrelevant, bias spatial selection and facilitate target detection as early as 100 ms from image onset, in both covert and overt spatial selection behaviors.
Visual forms capture selection even when incongruent with task requirements
In its complementary form, a spatial selection bias by visual forms could also degrade performance and result in more error trials when visual forms are spatially incongruent with target locations. We tested for this by comparing the proportion of errors in target congruent trials with the proportion of errors in target incongruent trials, in both the covert (Fig. 3a) and overt (Fig. 3b) tasks. On average, our subjects made more errors when the targets were incongruent with visual forms compared with when they were congruent with visual forms, and this occurred in both the covert (Fig. 3a) and overt (Fig. 3b) versions of the task. A paired comparison across subjects revealed a consistent pattern of more errors for incongruent targets in the 200 ms (Wilcoxon signed-rank test; covert task, p = 0.03; overt task, p = 0.01) and 300 ms (Wilcoxon signed-rank test; covert task, p = 0.09; overt task, p = 0.06) delay conditions, in both the covert (Fig. 3a, middle, right) and overt (Fig. 3b, middle, right) tasks. Interestingly, this effect of visual forms on errors for incongruent targets was weaker and less consistent across subjects in the 100 ms delay condition (Wilcoxon signed-rank test; covert task, p = 0.43; overt task, p = 0.22) in both the covert (Fig. 3a, left) and overt (Fig. 3b, left) tasks. These findings provide a complementary demonstration that peripheral visual forms bias spatial selection and produce more errors for incongruent targets in a time-specific manner (Fig. 3).
Microsaccades during fixation reflect capture by peripheral, task-irrelevant visual forms
The behavioral effects of peripheral visual forms, particularly on response times (Fig. 2), in both covert and overt tasks resemble the well-known effects of spatial cues on behavioral performance in attention tasks (Posner, 1980). Because microsaccades provide a sensitive assay of effects related to attention (Hafed and Clark, 2002; Engbert and Kliegl, 2003), we tested whether peripheral visual forms also bias microsaccades before target presentation. We analyzed the incidence of microsaccades before target presentation either toward (congruent) or opposite (incongruent) the suddenly appearing visual form image in the covert version of the task (see Materials and Methods). We first computed a microsaccade rate independently for movements that were either congruent with the visual form image or incongruent with it (Fig. 4a). Immediately after image onset, microsaccade rate for both congruent and incongruent movements decreased reflexively, consistent with previous reports of microsaccadic inhibition (Engbert and Kliegl, 2003; Rolfs et al., 2008; Tian et al., 2016; Buonocore et al., 2017). However, subsequent microsaccades, which likely benefit from frontal cortical drive (Peel et al., 2016), occurred earlier if they were congruent with a visual form than if they were incongruent (Fig. 4a). Importantly, this meant that the proportion of microsaccades in the congruent direction was higher than in the incongruent direction during the interval following inhibition, suggesting a spatial direction bias toward visual form images. This spatial bias was statistically different from chance in each of the 50 ms time bins from 122.5 to 142.5 ms (Fig. 4b; binomial test, p < 0.05). These findings demonstrate that peripheral visual forms, even when task-irrelevant, bias microsaccades and effectively act as spatial cues for selection behaviors.
Nonface visual forms still influence response times and bias microsaccades
The known influence of face stimuli on saccadic eye movements (Bindemann et al., 2007; Xu-Wilson et al., 2009; Morand et al., 2010; Devue et al., 2012; Boucart et al., 2016; Kauffmann et al., 2019; Buonocore et al., 2020) raises a potential question about our results so far: namely, whether our findings of visual form effects on response times, microsaccade biases, and target selection errors are largely restricted to trials with face images. To test this, we excluded trials with face images and reanalyzed all of our data with only nonface images in both covert and overt tasks. We found that response times were faster for targets congruent with nonface visual forms compared with incongruent targets in both covert (Fig. 5a; Wilcoxon signed-rank test, p = 0.002 in 100 ms, p = 0.002 in 200 ms, p = 0.002 in 300 ms) and overt tasks (Fig. 5b; Wilcoxon signed-rank test, p = 0.01 in 100 ms, p = 0.09 in 200 ms, p = 0.05 in 300 ms). Importantly, we also observed a spatial direction bias in microsaccades before the target presentation toward nonface visual forms in each of the 50 ms bins from 112.5 to 147.5 ms (Fig. 5c; binomial test, p < 0.05). These results show that nonface visual forms still influence spatial selection to facilitate target detection in both covert and overt behaviors.
Additionally, we evaluated the complementary effect of nonface visual forms on errors when they were incongruent with the targets. Surprisingly, we found the effect of incongruent visual forms on errors to be weak and inconsistent across three delay conditions in both covert (Wilcoxon signed-rank test, p = 0.74 in 100 ms, p = 0.21 in 200 ms, p = 0.12 in 300 ms) and overt tasks (Wilcoxon signed-rank test, p = 0.68 in 100 ms, p = 0.007 in 200 ms, p = 0.12 in 300 ms). These findings suggest that nonface visual forms bias spatial selection only to an extent where it can facilitate spatially congruent target detection but not necessarily degrade spatially incongruent target detection.
We also tested whether there might be an influence of object category on the visual form effects on manual and saccade response times in humans (Fig. 2). That is, it could be possible that certain visual form categories (e.g., inanimate objects) are less ecologically relevant than other visual form categories (e.g., faces or fruits), and therefore have smaller effects on response times in our tasks. To investigate this, we ran ANOVAs with visual form, delay, and object categories as factors (see Materials and Methods). The results from our ANOVAs showed a main effect of visual form factor on response times in both manual (F(1,4968) = 67.42, p < 0.0001, ANOVA) and saccade (F(1,3866) = 33.09, p = 0.0003, ANOVA) response tasks, consistent with our main findings (Fig. 2). In addition, we also observed a significant interaction effect of visual form and category factors on saccade response times (F(3,3866) = 3.54, p = 0.027, ANOVA), but not on manual response times (F(3,4968) = 1.01, p = 0.4, ANOVA). Subsequent inspection of data revealed that faces were slightly more relevant for the performance of our human subjects, but only in the saccade response task and not in the manual response task. These results indicate that the influence of object category on visual form effects for detection may be limited to overt gaze shifts.
Overt selection behavior is also facilitated by peripheral visual forms in monkeys
Since monkeys are an important animal model for investigating the neural mechanisms of spatial selection behaviors (Schall and Thompson, 1999; Reynolds and Chelazzi, 2004; Krauzlis et al., 2014; Basso and May, 2017), we next asked whether peripheral visual forms can have similar effects in these animals as in our human subjects. We used the same overt task design as in humans (see Materials and Methods), and we analyzed the monkeys' saccades. We confirmed that all 3 monkeys (Monkeys A, F, and M) performed the task correctly (% correct performance: 90.2 ± 2.1% SD, 92.8 ± 4% SD, and 84 ± 2.9% SD for Monkeys A, F, and M, respectively), and we also confirmed that individual monkeys' performance was significantly greater than chance in each of the 17 sessions collected across 3 monkeys (bootstrap test, p < 0.001). Following the same reasoning as in the human experiments, we compared saccadic response times on target congruent and target incongruent trials in each delay condition and monkey (Fig. 6). The results revealed two features that were consistent across all monkeys, and that were also consistent with our observations in humans when considering that monkey response times are generally faster than human response times. First, faster response times to congruent targets were limited to the early saccade responses as evident in the comparison between target congruent and target incongruent trials in the 100 ms delay condition (Fig. 6, first column of panels). Second, this facilitation of early saccade responses by visual forms was very weak in the 300 ms delay condition (Fig. 6, third column of panels) compared with the 100 ms delay condition.
We quantified this differential effect of visual forms on early and late saccade response times by splitting the response time distributions into 7 quantiles, such that the first quantile occupied the express-saccade part of the cumulative distributions in all conditions and monkeys. Express saccades represent a population of early saccades with very short latency, which appear to be distinct from the overall response time distribution (Fischer and Boch, 1983). Thus, in the cumulative distributions of response times (e.g., Fig. 6), the express-saccade part of response time distributions appears as an early distribution of trials before a plateau is reached in cumulative response time (i.e., an early tail in the global cumulative distribution). Paired comparisons of median response times for target congruent and target incongruent trials in the first quantile showed that saccadic responses were significantly faster for congruent targets in the 100 and 200 ms delay conditions (Fig. 7a; Wilcoxon signed-rank test, p = 0.001 in 100 ms, p = 0.003 in 200 ms) but not in the 300 ms delay condition (Fig. 7a; Wilcoxon signed-rank test, p = 0.129), consistent with the observations from Figure 6. The effect on all ranges (quantiles) of saccade response times in the three delay conditions is also shown in Figure 7b. As can be seen, there was a facilitatory effect of peripheral visual forms on saccade response times, but this was limited to the early saccadic responses and fell off abruptly for the 300 ms delay condition after the first quantile. The fall off was milder for the 100 and 200 ms delay conditions (Fig. 7b). These findings demonstrate that task-irrelevant visual forms facilitate early saccade responses in monkeys, and that this facilitation is the strongest in the first 200 ms of visual form processing, consistent with our earlier results in humans (Fig. 2f).
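The quantile-based comparison can be sketched as follows, under stated assumptions. The first helper takes the fastest one-seventh of response times in a condition and returns their median. The second computes an exact two-sided Wilcoxon signed-rank p value for paired values, such as per-session medians for congruent and incongruent trials. The pairing unit, the way the first quantile is delimited, and both function names are assumptions made for this example, and the toy numbers are not data from this study.

```python
from statistics import median

def first_quantile_median(rts, n_quantiles=7):
    """Median of the fastest 1/n_quantiles of the response times."""
    ordered = sorted(rts)
    k = max(1, len(ordered) // n_quantiles)  # number of trials in the first bin
    return median(ordered[:k])

def wilcoxon_signed_rank_p(x, y):
    """Exact two-sided Wilcoxon signed-rank p value for paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank |differences| with ties averaged; store doubled ranks as integers
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # 2 * mean of ranks (i+1)..(j+1)
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)
    # counts[s]: number of sign assignments whose positive rank sum equals s
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]
    w_min2 = min(w_plus2, total2 - w_plus2)
    tail = sum(counts[: w_min2 + 1])
    return min(1.0, 2 * tail / 2 ** n)

# Toy example (hypothetical values, in ms)
q1 = first_quantile_median(list(range(1, 15)))
congruent = [100, 102, 98, 105, 110, 95]
incongruent = [110, 115, 105, 120, 118, 101]
p_value = wilcoxon_signed_rank_p(congruent, incongruent)
```

The exact null distribution is built by counting sign assignments, which is practical for the small number of paired observations typical of session-level comparisons. For larger samples, a normal approximation from a statistics package would be the usual choice.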
We next asked whether the weaker effect in the 300 ms delay condition was because of progressively lower trial counts in the longer delay conditions (Fig. 5; see Materials and Methods). We repeated the same analyses on all trials (with no exclusion of trials based on whether microsaccades occurred or not; see Materials and Methods). This resulted in comparable trial counts across the three delays (see Materials and Methods), and our results showed that median response times in the first quantile were significantly faster for congruent targets in the 100 and 200 ms delay conditions (Wilcoxon signed-rank test, p = 0.0003 in 100 ms, p = 0.0003 in 200 ms) but not in the 300 ms delay condition (Wilcoxon signed-rank test, p = 0.58). This further demonstrates that task-irrelevant visual forms influence early saccades within the first 200 ms of visual form processing.
Additionally, we performed ANOVAs on the first quantile response times with visual form, delay, and object categories as factors, to test the influence of object categories on our findings (see Materials and Methods). The ANOVA results revealed a main effect of visual form factor (F(1,489) = 14.51, p = 0.0002, ANOVA), consistent with our main findings (Fig. 7). We also found a significant interaction effect of visual form and category factors (F(3,489) = 3.47, p = 0.016, ANOVA). As in our human subjects above, this suggests that different categories of visual forms had different facilitatory effects on monkey spatial selection performance with saccades. Interestingly, the largest facilitatory effects were with faces and fruits, and the smallest facilitatory effects were with inanimate objects. Nonetheless, we confirmed that inanimate objects, decidedly the least biologically relevant visual form category to monkeys in our experiments, still significantly influenced saccade response times within the first 200 ms of visual form processing (Wilcoxon signed-rank test, p = 0.02 in 100 ms, p = 0.001 in 200 ms, p = 0.22 in 300 ms). These findings suggest that object category plays a role in modulating the effect of visual forms on response times in monkeys, but that this influence is not necessarily limited by the ecological relevance of visual forms.
Visual forms capture more saccade errors in monkeys when incongruent with task requirements
Finally, in humans, the visual form facilitation of response times for target congruent trials was accompanied by the complementary effect of more errors for targets that were incongruent with visual form images (Fig. 3). We tested whether peripheral visual forms had this complementary effect on errors in monkeys as well. We found that monkeys indeed made significantly more errors when the saccade targets were incongruent with the visual forms compared with when the targets were congruent with the visual forms (Fig. 8). Interestingly, this effect on errors was strong and significant in the 100 ms (Wilcoxon signed-rank test, p = 0.0006) and 200 ms (Wilcoxon signed-rank test, p = 0.02) delay conditions but relatively weak and nonsignificant in the 300 ms delay condition (Wilcoxon signed-rank test, p = 0.45). This stronger effect on errors in the early delay conditions is consistent with the similar results from the saccade response times (Fig. 7). These findings demonstrate that peripheral visual forms capture more error saccades in well-trained monkeys, and that this effect on errors is the strongest in the first 200 ms of visual form processing.
Importantly, we also confirmed in monkeys that nonface visual forms still strongly influenced response times in the first quantile (Wilcoxon signed-rank test, p = 0.008 in 100 ms, p = 0.006 in 200 ms, p = 0.18 in 300 ms) and errors (Wilcoxon signed-rank test, p = 0.0004 in 100 ms, p = 0.03 in 200 ms, p = 0.39 in 300 ms) in the first 200 ms of visual form processing, as demonstrated in our main findings (Figs. 7a, 8).
Discussion
We investigated whether peripheral visual forms contribute to covert and overt spatial selection behaviors using a visual detection task in which visual forms were completely irrelevant. In humans, we found that visual forms facilitate the detection of spatially congruent targets with faster response times in both covert (Fig. 2b) and overt (Fig. 2e) tasks, and that this facilitation is evident in the first 100 ms of visual form processing. Importantly, visual forms incongruent with targets resulted in more errors in both covert (Fig. 3a) and overt (Fig. 3b) tasks, and this effect on errors was most pronounced after 200 ms of visual form processing. In addition, microsaccades before target presentation (but after visual form image onset) were biased toward visual forms in the covert task (Fig. 4b). Our results in monkeys revealed a pattern of visual form effects similar to that seen in humans, with two notable differences. First, visual form facilitation of response times was specific to early saccadic responses (Fig. 7b), likely because monkey saccadic response times are faster than those of humans. Second, the visual form effects on response times and errors were limited to the first 200 ms of visual form processing (Figs. 7, 8). Overall, these findings demonstrate that peripheral visual forms contribute to covert and overt spatial selection in ways that resemble the effects of spatial cues on orienting behaviors (Posner, 1980).
Low-level visual factors cannot explain visual form influences on response times
Low-level visual factors related to luminance, spatial frequency content, and target contrast modulate neuronal activity in visual and sensorimotor regions of the brain (Ohayon et al., 2012; Chen and Hafed, 2018; Chen et al., 2018; Vinke and Ling, 2020), and therefore may influence behavioral responses. For this reason, we took several measures in the design of the image and target stimuli to minimize the contribution of low-level visual factors to our findings, particularly on response times. First, we equalized all visual form images and their corresponding 50% phase-scrambled images iteratively for luminance distributions and the Fourier spectra (see Materials and Methods). Second, we chose the target to be of the highest contrast and adjusted the size so that target detection, and hence the perceived contrast of the target, was suprathreshold for both visual form and phase-scrambled image backgrounds (see Materials and Methods). Thus, we suggest that low-level visual factors were unlikely to have influenced our results showing visual form facilitation of response times.
High-level cognitive factors cannot explain visual form influence on response times
Cognitive factors related to behavioral relevance, novelty, and reward also modulate neuronal activity in visuomotor brain regions, such as the SC, and hence can shape orienting behaviors (Basso and Wurtz, 1997; Ikeda and Hikosaka, 2003; Boehnke et al., 2011; Herman and Krauzlis, 2017). These factors are again unlikely to have influenced our findings for the following reasons. First, we made the visual form and the phase-scrambled images completely irrelevant to behavior in both covert and overt tasks, and in both humans and monkeys. Second, the same subjects participated in both covert and overt tasks that used the same images (see Materials and Methods). In addition, all monkeys were trained with the same images in at least 7 training sessions before the experimental sessions. Third, none of the images was associated with reward, in humans and monkeys, as they were irrelevant to the performance in the task. Thus, cognitive factors related to behavioral relevance, novelty, and reward were unlikely to have influenced our findings showing visual form facilitation of response times.
Face stimuli alone cannot account for visual form influence on response times
Faces are of ecological value, and the influence of faces on goal-directed and free-viewing saccade behaviors is well documented (Bindemann et al., 2007; Xu-Wilson et al., 2009; Morand et al., 2010; Devue et al., 2012; Boucart et al., 2016; Kauffmann et al., 2019; Buonocore et al., 2020). Importantly, there is growing evidence that faces are rapidly processed through a network of subcortical structures, including the SC (Johnson, 2005; Nguyen et al., 2016; Le et al., 2020), which also plays a crucial role in spatial selection (McPeek and Keller, 2004; Lovejoy and Krauzlis, 2010). To confirm that face images alone did not disproportionately contribute to our results, we repeated all of our analyses of covert and overt tasks in humans after excluding trials with face images. Results showed that response times were equally strongly affected by nonface visual forms alone in both covert (Fig. 5a) and overt tasks (Fig. 5b). Importantly, we also observed significant biases in microsaccades to nonface visual forms (Fig. 5c). We also confirmed in monkeys that nonface visual forms strongly influenced the response times. These control analyses show that face stimuli alone cannot account for visual form influence on response times in either covert or overt tasks, and most importantly, demonstrate that all visual forms can influence spatial selection. Nonetheless, it would be interesting in the future to identify potential graded influences of different visual form categories on spatial selection performance. For example, our ANOVAs did show stronger effects of fruits and faces on monkey saccade performance than inanimate objects. This suggests that ecological relevance, even in the oculomotor system, needs to be considered when interpreting neural and behavioral effects. Indeed, increasing evidence supports the idea of an oculomotor system organization that is in line with the image statistics of the environment in which we make eye movements (Hafed and Chen, 2016; Chen et al., 2018).
Comparison of visual form effects in humans and monkeys
A comparison of visual form effects in humans and monkeys during the overt task revealed interesting species differences. For example, visual form effects on response times in monkeys were confined to the earliest saccades (Fig. 7b), unlike in humans where a similar analysis revealed visual form effects across all quantiles in all delay conditions (Fig. 2b). Similarly, visual form effects on errors in monkeys were more pronounced in the early delay conditions (100 and 200 ms; see Fig. 8) with a weaker effect in the late 300 ms delay condition, unlike in humans where this pattern was almost reversed; effects on errors were the weakest in the 100 ms condition (Fig. 3b). We suggest that the predominance of visual form effects on the earliest responses and delay conditions in monkeys may be related to their behavioral training. Specifically, these were highly trained animals with short saccadic response times in general. The longer delay periods (e.g., 200 and 300 ms) were often much longer than the saccadic reaction times that the visual form images themselves would have elicited (e.g., see Fig. 7a). The long delay periods therefore required actively suppressing saccades to properly receive rewards in the task, which eliminated the visual form effects that still occurred automatically with shorter latencies. Indeed, the monkeys' final reaction times on successful trials were much shorter than those of the human subjects in the same task (compare Fig. 2 with Fig. 7).
Visual-form based selection differs from object-based attention
Space-based or spatial attention refers to behavioral benefits conferred by spatial cues exclusively at the cued location (Carrasco, 2011). In object-based attention, the cueing benefits extend to all spatial locations occupied by the object at the cued location (Duncan, 1984; Egly et al., 1994; Abrams and Law, 2000). Our demonstration of spatial selection based on visual forms is different from object-based attention because there were no explicit spatial cues in our task, and, most importantly, the visual forms were irrelevant in our task. However, it is very likely that both object-based and visual form-based selection mechanisms involve common visual processes related to segmentation and perceptual grouping (Driver et al., 2001; Baldauf and Desimone, 2014), and may operate outside of the modulation of sensory processing mechanisms associated with spatial attention (Shomstein and Yantis, 2002; Reynolds and Chelazzi, 2004; Chou et al., 2014; but see Roelfsema et al., 1998).
Neural circuits representing visual-form based spatial maps for orienting
The influence of peripheral visual forms on target detection as early as 100 ms suggests a neural circuit that rapidly links visual form processing with spatial maps in sensorimotor structures, such as the SC (Robinson, 1972; Chen et al., 2019). Recent evidence in a new region of the primate temporal cortex shows rapid object selectivity and detection-related signals that were causally dependent on midbrain SC activity (Bogadhi et al., 2019b). Based on this evidence, we suspect that SC neurons might signal peripheral visual forms and bias spatial selection. Recent findings in monkeys and mice further demonstrate the visual capabilities of SC neurons in representing visual statistics and properties of the natural environment that are innately relevant to our behaviors (Hafed and Chen, 2016; Chen et al., 2018; Lee et al., 2020).
Of course, visual form recognition is also accomplished in the primate inferotemporal cortex through feedforward visual cortical circuits. This can possibly influence sensorimotor structures, such as the SC, for spatial selection through direct projections (Cerkevich et al., 2014). However, the time course of visual form recognition in the traditional inferotemporal regions is not entirely consistent with our results showing rapid visual form facilitation (Kreiman et al., 2006; Tsao et al., 2006). Therefore, we suggest that a circuit linking SC with the temporal cortex, possibly through pulvinar or amygdala, may be at play in linking rapid visual form recognition with spatial selection (Harting et al., 1991; Boussaoud et al., 1992; Hadj-Bouziane et al., 2012; Rafal et al., 2015; Soares et al., 2017). Future studies investigating the subcortical and cortical contributions to visual form recognition, particularly in the periphery, will identify the candidate circuit mediating the visual form influence on spatial selection.
Footnotes
The authors declare no competing financial interests.
This work was supported by Werner Reichardt Center for Integrative Neuroscience, Deutsche Forschungsgemeinschaft EXC 307 excellence cluster, Hertie Institute for Clinical Brain Research, and Deutsche Forschungsgemeinschaft Project BO5681/1-1.
Correspondence should be addressed to Amarender R. Bogadhi at bogadhi.amar@gmail.com