Table 1. ROC metric and KL divergence for saliency maps of static natural scenes (SE: standard error).
As a vital driving force of systems neuroscience, visual neuroscience had its conceptual framework established more than 40 years ago based on Hubel and Wiesel’s groundbreaking work on the receptive-field properties of visual neurons (Hubel & Wiesel, 1977). This framework was subsequently strengthened by David Marr's influential book (Marr, 2010). In this paradigm, visual neurons are conceived to perform bottom-up, image-based processing to build a series of symbolic representations of visual stimuli. This paradigm, however, is deeply misleading since the generative sources in the three-dimensional (3D) physical world of any stimulus, to which visual animals must respond successfully, cannot be determined by image-based processing (due to the inverse optics problem). This is perhaps the reason why "Now, thirty years later, the main problems that occupied Marr remain fundamental open problems in the study of perception" (Marr, 2010), as assessed by two prominent vision scientists and Marr's close associates.
During the last 30 years, dramatic progress in computing hardware, digital imaging, statistical modeling, and visual neuroscience has prompted researchers to re-examine the computations and representations (see above) for natural vision examined in Marr's book. A range of new ideas have been proposed, many of which are summarized in books (Knill & Richards, 1996; Rao et al., 2002; Purves & Lotto, 2003; Doya et al., 2007; Trommershauser et al., 2011) and reviews (Simoncelli & Olshausen, 2001; Yuille & Kersten, 2006; Geisler, 2008; Friston, 2010). The unifying theme is that vision and the structure and function of the visual system must be understood in statistical terms. How this feat can be achieved, however, is not at all clear.
Since humans and other visual animals must respond successfully to visual stimuli whose generative sources cannot be determined in any direct way, the visual system can only generate percepts according to the probability distributions (PDs) of visual variables underlying the stimuli. The information pertinent to the generation of these PDs, namely, the statistics of natural visual environments, must have been incorporated into the visual circuitry by successful behavior in the world over evolutionary and developmental time. During the last two decades, this statistical concept of vision has been successful in explaining aspects of vision that would be difficult to understand otherwise (see references cited above). In this chapter, I will describe several recent studies that relate the statistics of 2D and 3D natural visual scenes to visual percepts of brightness, saliency, and 3D space.
In the second section of this chapter, I will discuss how the PDs of luminance in specific contexts in natural scenes, referred to as the context-mediated PDs in natural scenes, predict brightness, the perception elicited by the luminance of a visual target. Our results show that brightness generated on this statistical basis accounts for a range of observations whose causes have been debated for a long time without consensus. In the third section, I will present a simple, elegant model of the context-mediated PDs in natural scenes and a measure of visual saliency derived from these PDs. Our results show that this measure of visual saliency is a good predictor of human gaze during free viewing of both static and dynamic natural scenes. In the fourth section, I will present the statistics of 3D natural scenes and their relationship to human visual space. Our results show that human visual space is not a direct mapping of 3D physical space but is rather generated probabilistically. Finally, I will discuss the implications of these and other results for our understanding of the response properties of visual neurons, the intricate visual circuitries, the large-scale cortical organizations, the operational dynamics of the visual system, and natural vision.
2. The statistical structure of natural light patterns determines perceived light intensity
In this section, I present evidence that the context-mediated PDs of luminance in natural scenes predict brightness, the perception elicited by the luminance of a visual target. A central puzzle in understanding how such percepts are generated by the visual system is that brightness does not correspond in any simple way to luminance. Thus, the same amount of light arising from a given region in a scene can elicit dramatically different brightness percepts when presented in different contexts (Fig. 1) (Kingdom, 2011). For example, in Fig. 1 (a), the central square (T) in the left panel appears brighter than the same target in the right panel. This is the standard simultaneous brightness contrast effect.
A variety of explanations have been suggested since the basis for such phenomena was first debated by Helmholtz, Hering, Mach, and others (Gilchrist et al., 1999; Purves et al., 2004; Kingdom, 2011). Although lateral inhibition in early visual processing has often been proposed to account for these “illusions”, this mechanism cannot explain instances in which similar overall contexts produce different brightness effects (compare Fig. 1 (a) with Figs. 1 (b) and (e); see also Fig. 1 (c)). This failure has led to several more recent suggestions, including complex filtering (Blakeslee & McCourt, 2004), the idea that brightness depends on detecting edges and junctions that promote the grouping of various luminances into interpretable spatial arrangements (Adelson, 2000; Anderson & Winawer, 2005), and the proposal that brightness is “re-synthesized” from 3D scene properties “inferred” from the stimulus (Wishart et al., 1997).
2.2. Context-mediated PDs of luminance in natural scenes
To examine whether the statistics of natural light patterns predict the perceptual phenomena shown in Fig. 1, we obtained the relevant PDs of luminance in natural scenes by sampling a database of natural scenes (van Hateren & van der Schaaf, 1998) with target-surround configurations that had the same local geometry as the stimuli in Fig. 1. As a first step, these configurations were superimposed on the images to find light patterns in which the luminance values of both the surround and target regions were approximately homogeneous; for those configurations in which the surround comprised more than one region of the same luminance (see Fig. 1), we also required that the relevant sampled regions meet this criterion. The sampling configurations were moved in steps of one pixel to screen the full image. The mean luminance values of the target and the surrounding regions in the samples were then calculated, and their occurrences tallied.
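The sampling procedure described above can be sketched in Python as follows. This is only an illustrative sketch, not the code used in the study: the square target-in-frame template, the function names `sample_target_surround` and `conditional_pd`, the homogeneity criterion (coefficient of variation below `tol`), and the binning are all assumptions introduced for the example.

```python
import numpy as np

def sample_target_surround(image, target=3, margin=3, tol=0.1,
                           n_bins=32, max_lum=1.0):
    """Tally co-occurrences of (surround mean, target mean) luminance.

    A square target of side `target` sits in the middle of a square frame
    that extends `margin` pixels beyond it on every side. A sample is kept
    only if both regions are approximately homogeneous, here taken to mean
    a coefficient of variation below `tol` (an assumed criterion).
    """
    size = target + 2 * margin
    mask = np.zeros((size, size), dtype=bool)
    mask[margin:margin + target, margin:margin + target] = True
    # counts[surround bin, target bin]
    counts = np.zeros((n_bins, n_bins), dtype=np.int64)
    h, w = image.shape
    for r in range(h - size + 1):            # move in steps of one pixel
        for c in range(w - size + 1):
            patch = image[r:r + size, c:c + size]
            t, s = patch[mask], patch[~mask]
            if min(t.mean(), s.mean()) <= 0:
                continue                      # skip empty (black) samples
            if t.std() / t.mean() > tol or s.std() / s.mean() > tol:
                continue                      # not homogeneous enough
            ti = min(int(t.mean() / max_lum * n_bins), n_bins - 1)
            si = min(int(s.mean() / max_lum * n_bins), n_bins - 1)
            counts[si, ti] += 1
    return counts

def conditional_pd(counts, surround_bin):
    """PD of target luminance given one surround luminance bin."""
    row = counts[surround_bin].astype(float)
    total = row.sum()
    return row / total if total > 0 else row
```

Calling `sample_target_surround` on each image of a database and summing the returned arrays gives the joint tallies; `conditional_pd` then normalizes one row into the PD of target luminance for a given surround luminance.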
2.3. Brightness signifies context-mediated PDs of luminance in natural scenes
Natural environments comprise objects of different sizes at various distances that are related to each other and the observer in a variety of ways (Yang & Purves, 2003 a,b). When the light arising from objects is projected onto an image plane, these complex relationships are transformed into 2D patterns of light intensity with highly structured statistics. Thus, the PD of the luminance of, say, the central target in a standard simultaneous brightness contrast stimulus (Fig. 2 (a) ) depends on the surrounding luminance values (Fig. 2 (b) ).
Fig. 2 (c) illustrates the supposition that, for any context, the visual system generates the brightness of a target according to the value of its luminance in the probability distribution function (PDF, the integral of the PD, i.e., the cumulative distribution) of the possible target luminance experienced in that context (Yang & Purves, 2004). This value is referred to subsequently as the percentile of the target luminance among all possible luminance values that co-occur with the contextual luminance pattern in the natural environment. In formal terms, this supposition means that the visual system generates brightness percepts according to the relationship $\mathrm{Brightness} = A\,\Phi(P) + A_0$, where $A$ and $A_0$ are constants, and $\Phi(P)$ is a monotonically increasing function of the PDF, $P$.
By definition, then, the percentile of target luminance for the lowest luminance value within any contextual light pattern is 0% and corresponds to the perception of maximum darkness; the percentile for the highest luminance within any contextual pattern is 100% and corresponds to the maximum perceivable brightness. In any given context, a higher luminance will always have a higher percentile, and will always elicit a perception of greater brightness compared to any luminance that has a lower percentile. Since the relation $\mathrm{Brightness} = A\,\Phi(P) + A_0$ is not based on a particular luminance within the context in question, but rather on the entire PD of possible luminance values experienced in that context, the context-dependent relationship between brightness and luminance is highly nonlinear (see Fig. 2 (c)). In consequence, the same physical difference between two luminance values will often signify different percentile differences, and thus different perceived differences in brightness. Furthermore, because the percentiles change more rapidly as the target luminance approaches the luminance of the surround, one would expect greater changes of brightness there, an expectation that corresponds to the well-known “crispening” effect in perception.
Finally, because the same value of target luminance will often correspond to different percentiles in the PDFs of target luminance in different contexts, two targets having the same luminance can elicit different brightness percepts, the higher percentile always corresponding to a brighter percept. Thus, in the standard simultaneous brightness contrast stimulus in Fig. 1 (a), the target (T) in the left panel in Fig. 2 (a) appears brighter than the equiluminant target in the right panel.
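The percentile argument can be made concrete with a small numerical sketch. Everything here is illustrative: the two Gaussian sample sets are synthetic stand-ins for context-mediated PDs (they are not natural-scene data), the names `percentile_in_context` and `brightness` are hypothetical, and the monotonic function is simply assumed to be the identity with constants 1 and 0.

```python
import numpy as np

def percentile_in_context(target_lum, context_samples):
    """Empirical cumulative probability of `target_lum` among the target
    luminances that co-occurred with one contextual pattern."""
    samples = np.sort(np.asarray(context_samples, dtype=float))
    return np.searchsorted(samples, target_lum, side="right") / samples.size

def brightness(p, a=1.0, a0=0.0, phi=lambda p: p):
    """Brightness as a linear function of a monotonic function of the
    percentile; `a`, `a0`, and `phi` are assumed, illustrative values."""
    return a * phi(p) + a0

rng = np.random.default_rng(0)
# Synthetic assumption: target luminance tends to resemble its surround.
dark_ctx = rng.normal(30, 15, 10_000).clip(0, 100)   # dark surround
light_ctx = rng.normal(70, 15, 10_000).clip(0, 100)  # light surround

p_dark = percentile_in_context(50, dark_ctx)
p_light = percentile_in_context(50, light_ctx)
print(f"percentile in dark context:  {p_dark:.2f}")
print(f"percentile in light context: {p_light:.2f}")
```

Under these assumptions, the same target luminance of 50 falls at a much higher percentile in the dark context than in the light one, so it is assigned the greater brightness, which is the qualitative pattern of simultaneous brightness contrast.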
2.4. White’s illusion
White’s illusion (Fig. 1 (b)) presents a particular challenge for any explanation of brightness (White, 1979). The equiluminant rectangular areas surrounded by predominantly more luminant regions in the stimulus (on the left in the left panel of Fig. 1 (b); see also the area in the red frame in Fig. 3 (a)) appear brighter than areas of identical luminance surrounded by less luminant regions (on the right in the left panel of Fig. 1 (b); see also the area in the blue frame in Fig. 3 (a)). The especially perplexing characteristic of this percept is that the effect is opposite that elicited by standard simultaneous brightness contrast stimuli (Fig. 1 (a)). Even more puzzling, the effect reverses when the luminance of the rectangular targets is either the lowest or highest value in the stimulus (middle and right panels in Fig. 1 (b)).
The explanation for White’s illusion provided by the statistical framework outlined above is shown in Fig. 3. When presented separately, as in Fig. 3 (a), the components of White’s stimulus elicit much the same effect as in the usual presentation. By sampling the images of natural visual environments using configurations based on these components (Fig. 3 (b)), we obtained the PDFs of the luminance of a rectangular target (T) embedded in the two different configurations of surrounding luminance in White’s stimulus. As shown in Fig. 3 (c), when a target in the intermediate range of luminance values (i.e., in between the luminance values at the two crossover points) abuts two dark rectangles laterally (left panel in Fig. 3 (b)), the percentile of the target luminance (red line) is higher than the percentile when the target abuts the two light rectangles (right panel in Fig. 3 (b); blue line in Fig. 3 (c)). If, as we suppose, the percentile in the PDF of target luminance within any specific context determines the brightness perceived, the target with an intermediate luminance on the left in Fig. 3 (b) should appear brighter than the equiluminant target on the right. Finally, when all the luminance values in the stimulus are limited to a very narrow range (e.g., from 0 to 100 cd/m² or from 1000 to 1100 cd/m²), when the sampling configurations are oriented vertically, or when the aspect ratio of the sampling configurations is changed (e.g., from 1:2 to 1:5), the PDFs derived from the database are not much different. These further results are consistent with the observations that White’s stimulus elicits a similar effect when presented at a wide range of overall luminance levels, in a vertical orientation, or with different aspect ratios.
An aspect of White’s illusion that has been particularly difficult to explain is the so-called “inverted White’s effect”: when the target luminance is either the lowest or the highest value in the stimulus, the effect is actually opposite the usual percept (see the middle and right panels in Fig. 1 (b) ). The explanation for this further anomaly is also evident in Fig. 3 (c). When the target luminance is the lowest value in the presentation (see insets), the blue curve is above the red curve. As a result, a relatively dark target surrounded by more light area should now appear darker, as it does (see also the middle panel of Fig. 1 (b) ). By the same token, when the target luminance is the highest value in the stimulus (see insets), the blue curve is also above the red curve. Accordingly, the relatively light target surrounded by more dark area should appear brighter, as it does (see also the right panel of Fig. 1 (b) ). Thus the statistical structure of natural light patterns predicts not only White’s illusion, but the inverted White’s effect as well. Notice further that the two crossover points of the blue and red curves shift to the right when the contextual luminances increase, and to the left when they decrease; thus the inverted effect will be apparent, although altered in magnitude, for any luminance values of the surrounding areas.
2.5. The Wertheimer-Benary illusion
In the Wertheimer-Benary illusion (Fig. 1 (c)), the equiluminant gray triangles differ in apparent brightness, the triangle embedded in the arm of the cross looking slightly brighter than the triangle in the corner of the cross.
The explanation of the Wertheimer-Benary illusion provided by the statistical framework outlined above is shown in Fig. 4. By sampling the images of natural environments using configurations based on the components of the stimulus (Fig. 4 (b)), we obtained the PDFs of target luminance in these contexts. As shown in Fig. 4 (c), when the triangular patch is embedded in a dark bar with its base facing a lighter area, the percentile of the luminance of the triangular patch (red line) is higher than the percentile when the triangular patch abuts a dark corner with its base facing a similar light background (blue line). Accordingly, the same gray patch should appear brighter in the former context than in the latter, as is the case. The PDFs obtained after changing the triangles to rectangles, rotating the configurations in Fig. 4 (b) by 180°, or reflecting the configurations along the diagonal of the cross (cf. middle and right panels in Fig. 1 (c)) were much the same as those shown in Fig. 4 (c). These several observations accord with the fact that the Wertheimer-Benary effect is little changed by such manipulations.
2.6.1. The statistical nature of perception
I have shown that brightness percepts do not encode luminance as such, but rather the statistical relationship between the luminance in an area within a particular contextual light pattern and all possible occurrences of luminance in that context that have been experienced by humans in natural environments during evolution. The statistical basis for this aspect of visual perception is quite different from traditional approaches to rationalizing brightness. In the “relational approach” (Gilchrist et al., 1999), an idea that evolved from the late 19th-century debate between Helmholtz, Hering, and others, brightness percepts are “recovered” by the visual system from explicitly coded luminance contrasts and gradients. Another idea is that brightness depends on intermediate-level visual processes that detect edges, gradients and junctions, which are then grouped into specific spatial layouts (Adelson, 2000; Anderson & Winawer, 2005). Finally, the brightness elicited by a given luminance has also been considered as being “re-synthesized” by processing at several levels of the visual system that is based on inferences about the possible arrangements of surfaces in 3D, their material properties and their illumination (Wishart et al., 1997).
The common deficiency of these several ways of thinking about brightness is their failure to relate the statistics of light patterns experienced in the course of evolution to what the corresponding brightness percepts need to signify (namely, the relationship of a particular occurrence of luminance to all possible occurrences of luminance in a given context). Since light patterns on the retina are the only information the visual system receives, basing brightness percepts on the statistics of natural light patterns allows visual animals to deal optimally with all possible natural occurrences of luminance, employing the full range of perceivable brightness to represent the physical world.
2.6.2. Neural instantiation of context-mediated PDs of luminance in natural scenes
What sort of neural mechanisms, then, could incorporate these statistics of natural light patterns and relate them to brightness percepts? Although the answer is not known, the present results suggest that the circuitry at all levels of the visual system instantiates the statistical structures of light patterns in natural environments. In this conception, the center-surround organization of the receptive fields of retinal ganglion cells provides the initial basis for representing the necessary statistics. A further speculation would be that neural circuitry at the level of visual cortex is organized to instantiate the statistics of luminance patterns with arbitrary target and context shapes and sizes. As a result, the neuronal response at each location would signify the percentile of the target luminance in the PDF pertinent to a given context.
3. Visual saliency emerging from context-mediated PDs in natural scenes
In this section, I present a simple model of the context-mediated PDs in natural scenes and derive a measure of visual saliency from these PDs. Visual saliency is the perceptual quality that makes some items in visual scenes stand out from their immediate contexts (Itti & Koch, 2001). Visual saliency plays important roles in natural vision in that saliency can direct eye movements and facilitate object detection and scene understanding. We developed a model of the context-mediated PDs in natural scenes using a modified algorithm for independent component analysis (ICA) (Hyvarinen, 1999) and demonstrated that visual saliency based on the context-mediated PDs in natural scenes is a good predictor of human gaze in free-viewing both static and dynamic natural scenes (Xu et al., 2010).
3.2. Context-mediated PDs in natural scenes and visual saliency
A visual feature is a random variable and co-occurs at certain probabilities with other visual features in natural scenes. We call these the context-mediated PDs in natural scenes. Here, a context refers to the natural scene patch that co-occurs with a visual target in question in space and/or time domains. We proposed to represent the context-mediated PDs in natural scenes using independent components (ICs) of natural scenes. There are two reasons for this. First, it has been argued extensively that the early visual cortex represents incoming stimuli in an efficient manner (Simoncelli & Olshausen, 2001). Second, the filters of the ICs of natural scenes are very much like the receptive fields of simple cells in V1 (van Hateren & van der Schaaf, 1998).
To model the context-mediated PDs in static natural scenes, we used a center-surround configuration in which the scene patch within the circular center serves as the target and the scene patch in the annular surround as the context (Xu et al., 2010). We sampled a large number of scene patches from the McGill calibrated color image database of natural scenes (Olmos & Kingdom, 2004). Thus, each sample is a pair of a patch in the center ($X_c$) and a patch in the surrounding area ($X_s$) (Fig. 5 (a)). We developed a model of natural scenes in this configuration (Eq. (1)). In Eq. (1), $A_s$, $A_{sc}$, and $A_c$ are ICs, and $U_s$ and $U_c$ are the amplitudes of the ICs. This model allows us to calculate the ICs for the context first and then the other ICs of natural scenes.

$$\begin{pmatrix} X_s \\ X_c \end{pmatrix} = \begin{pmatrix} A_s & 0 \\ A_{sc} & A_c \end{pmatrix} \begin{pmatrix} U_s \\ U_c \end{pmatrix} \tag{1}$$

ICA filters (i.e., $W = A^{-1}$) can be obtained as follows

$$W = \begin{pmatrix} A_s & 0 \\ A_{sc} & A_c \end{pmatrix}^{-1} = \begin{pmatrix} A_s^{-1} & 0 \\ -A_c^{-1} A_{sc} A_s^{-1} & A_c^{-1} \end{pmatrix} \tag{2}$$

Therefore, we obtained three sets of ICs. First, the columns of $A_s$ are the ICs for $X_s$. Second, the columns of $A_{sc}$ are the ICs for $X_c$ that are paired with the ICs for $X_s$. Finally, the columns of $A_c$ are the ICs for $X_c$ that are not paired with any ICs for $X_s$.
Fig. 5 (b) shows paired chromatic ICs for $X_s$ and $X_c$. Fig. 5 (c) shows paired achromatic ICs for $X_s$ and $X_c$. The chromatic ICs for the surround have red-green (L-M) or blue-yellow [S-(LM)] opponency. The chromatic paired ICs for the center are extensions of the ICs for the surround. Fig. 5 (d) shows the ICs for $X_c$, including chromatic and achromatic ICs, that are not paired with any ICs for $X_s$. Fig. 5 (e) shows examples of the ICs for the center computed alone.
To obtain the context-mediated PDs in dynamic natural scenes, we used sequences of image patches in which the current frame served as the target and the three preceding frames as the context. We sampled a large number of sequences of image patches (~490,000) from a video database (Itti & Baldi, 2009) and performed the ICA according to Eq. (1). Fig. 6 (a) shows the paired chromatic spatiotemporal ICs. Fig. 6 (b) shows the paired achromatic spatiotemporal ICs. Fig. 6 (c) shows the unpaired ICs for the current frame, which are oriented bars and have red-green or blue-yellow opponency.
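The context-mediated PDs of natural scenes, i.e., the conditional PDs of the center patch given the surround, $P(X_c|X_s)$, can be derived using the Bayesian formula as follows

$$P(X_c|X_s) = \frac{P(X_c, X_s)}{P(X_s)} \propto \prod_i p(u_i)$$

where $u_i$ is the amplitude of the $i$th unpaired IC for $X_c$. Therefore, the context-mediated PDs depend only on the unpaired ICs for $X_c$. We modeled $p(u_i)$ as generalized Gaussian PDs.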
We proposed a measure of visual saliency as

$$S(X_c; X_s) = \ln \frac{P_{\max}(X_c|X_s)}{P(X_c|X_s)}$$

where $P_{\max}(X_c|X_s)$ is the maximum probability of a target, $X_c$, that co-occurs with a context, $X_s$, in natural scenes. Thus, if the probability of the occurrence of a target is low relative to that of the most likely occurrence in the context in natural scenes, the target is salient within the context (Fig. 7).
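A minimal sketch of how such a saliency value could be computed is given below. It rests on several assumptions that are not confirmed by the text: saliency is taken as the log ratio of the maximum conditional probability to the actual one, the conditional PD factorizes over the amplitudes of the unpaired ICs, each amplitude follows a zero-centered generalized Gaussian with scale `sigma` and shape `theta`, and the filter matrices `w_cs` and `w_c` (hypothetical names) map the surround and center patches to those amplitudes.

```python
import math
import numpy as np

def gen_gauss_logpdf(u, sigma, theta):
    """Log density of a generalized Gaussian,
    p(u) = theta / (2 sigma Gamma(1/theta)) * exp(-|u / sigma|**theta)."""
    u = np.asarray(u, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    theta = np.asarray(theta, dtype=float)
    log_gamma = np.vectorize(math.lgamma)(1.0 / theta)
    log_norm = np.log(theta) - np.log(2.0 * sigma) - log_gamma
    return log_norm - np.abs(u / sigma) ** theta

def saliency(center, surround, w_cs, w_c, sigma, theta):
    """Log ratio of the peak conditional probability to the actual one
    (assumed form of the saliency measure)."""
    # Amplitudes of the unpaired ICs; they depend on both patches.
    u = w_cs @ surround + w_c @ center
    log_p = gen_gauss_logpdf(u, sigma, theta).sum()
    # The generalized Gaussian peaks at zero amplitude.
    log_p_max = gen_gauss_logpdf(np.zeros_like(u), sigma, theta).sum()
    return log_p_max - log_p
```

Under these assumptions the normalization terms cancel and the measure reduces to the sum of `abs(u / sigma) ** theta` over the unpaired ICs, so a target whose unpaired amplitudes are all zero (the most likely target in that context) has zero saliency.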
3.3. Visual saliency and human gaze in free-viewing static natural scenes
Human gaze in free-viewing natural scenes is probably driven by visual saliency in natural scenes. To test this hypothesis, we used a dataset of human gaze collected from 20 human subjects free-viewing 120 images (Bruce & Tsotsos, 2009). Fig. 8 shows the saliency maps based on the context-mediated PDs in natural scenes and the density maps of human gaze for six scenes. The saliency maps based on the attention based on information maximization (AIM) model are also shown (Bruce & Tsotsos, 2009). Evidently, the salient features and objects in these scenes predicted by the saliency maps accord with human observations, and the saliency maps predicted by our model qualitatively match the density maps of human gaze.
To quantitatively examine how well this model of visual saliency predicts human fixation, we used the receiver operating characteristic (ROC) metric and the Kullback–Leibler (KL) divergence. The ROC metric measures the area under the ROC curve. To calculate this metric, we used visual saliency as a feature to classify the locations where the saliency measures are greater than a threshold as fixations and the rest as nonfixated locations. By varying the threshold, we obtained an ROC curve and calculated the area under the curve, which indicates how well the saliency maps predict human gaze.
To control for the central bias in human gaze, we used the ROC measure described in (Tatler et al., 2005). We compared the saliency measures at the attended locations to the saliency measures in that scene at the locations that are attended in different scenes in the dataset, called shuffled fixations. The average area under the ROC curve is 0.6803, well above the chance level of 0.5, which means the saliency measures at fixations are significantly higher than the saliency measures at shuffled fixations. Similarly, we measured the KL divergence between two histograms of saliency measures: the histogram of saliency measures at the fixated locations in a test scene and the histogram of saliency measures at the same locations in a different scene randomly selected from the dataset (Zhang et al., 2008).
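The two metrics can be sketched as follows. This is an illustrative sketch rather than the evaluation code used in the study: the function names, the pairwise-comparison form of the area under the ROC curve (equivalent to sweeping the threshold), the number of histogram bins, and the small constant `eps` that avoids empty bins are all assumptions.

```python
import numpy as np

def shuffled_auc(sal_map, fixations, shuffled_fixations):
    """Area under the ROC curve with saliency at true fixations as
    positives and saliency at fixations borrowed from other scenes
    (shuffled fixations) as negatives."""
    pos = np.array([sal_map[r, c] for r, c in fixations], dtype=float)
    neg = np.array([sal_map[r, c] for r, c in shuffled_fixations], dtype=float)
    # AUC = P(pos > neg) + 0.5 * P(pos == neg)
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

def kl_divergence(sal_fix, sal_shuffled, bins=10, eps=1e-12):
    """KL divergence between the histograms of saliency at fixated and
    at shuffled locations, computed over a common set of bins."""
    lo = min(np.min(sal_fix), np.min(sal_shuffled))
    hi = max(np.max(sal_fix), np.max(sal_shuffled))
    p, _ = np.histogram(sal_fix, bins=bins, range=(lo, hi))
    q, _ = np.histogram(sal_shuffled, bins=bins, range=(lo, hi))
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))
```

An area of 0.5 and a divergence of 0 both correspond to a saliency map that does not distinguish fixated from shuffled locations; larger values indicate better prediction of gaze.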
Our model of visual saliency is a good predictor of human gaze in free-viewing static natural scenes, outperforming all other models that we tested. As shown in Table 1 (Xu et al., 2010), our model has an average KL divergence of 0.3016 and the average ROC measure is 0.6803. The average KL divergence and ROC measure for the AIM model are 0.2879 and 0.6799 respectively, which were calculated using the code provided by the authors. The results for other models in Table 1 were given in (Zhang et al., 2008).
| Model | KL (SE) | ROC (SE) |
|---|---|---|
| Bruce et al. (2009) | 0.2879 (0.0048) | 0.6799 (0.0024) |
| Itti et al. (1998) | 0.1130 (0.0011) | 0.6146 (0.0008) |
| Gao et al. (2009) | 0.1535 (0.0016) | 0.6395 (0.0007) |
| Zhang et al.: DOG (2008) | 0.1723 (0.0012) | 0.6570 (0.0007) |
| Zhang et al.: ICA (2008) | 0.2097 (0.0016) | 0.6682 (0.0008) |
3.4. Visual saliency and human gaze in free-viewing natural movies
We used a database of human gaze collected from 8 subjects in free-viewing 50 videos, including indoor scenes, outdoor scenes, television clips, and video games (Itti & Baldi, 2009). Fig. 9 shows the saliency maps for selected frames in 6 videos. The 3 contextual video frames and the target frame are shown to the left and the saliency maps to the right. As predicted by the saliency maps, the moving objects in these videos appear to be salient (e.g., the character in the game video, the falling water drop, the soccer player and the ball, the moving car and the walking policeman, and the jogger and the football player). These predictions accord well with human observations.
We calculated the KL divergence for this dataset as described above. Humans tend to gaze at visual features that have high saliency, as shown by the KL divergence measures in Table 2 (Xu et al., 2010). The KL divergence for our model is 0.3153, which is higher than those of the saliency metric (0.205) (Itti et al., 1998) and the surprise metric (0.241) (Itti & Baldi, 2009), but slightly lower than that of the AIM model (0.328) (Bruce & Tsotsos, 2009).
3.5.1. Distinctions from other models of visual saliency
Our model of visual saliency differs from all other models, which fall into four classes. The first class of models do not use PDs in natural scenes but involve complex image-based computation that includes feature extraction, feature pooling, and normalization (Itti et al., 1998). The second class of models make use of PDs computed from the current scene the subject is seeing (Bruce & Tsotsos, 2009). The third class of models are based on PDs in natural scenes that are not dependent on specific contexts (Zhang et al., 2008). Finally, there is a biologically inspired neural network model (Zhaoping & May, 2007). Our model is unique in that: 1) the PDs are computed from an ensemble of natural scenes that presumably approximate the statistics humans experienced during evolution and development; and 2) the PDs are dependent on specific contexts in natural scenes.
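Table 2. KL divergence for saliency maps of natural movies (SE: standard error).

| Model | KL (SE) |
|---|---|
| Bruce et al. (2009) | 0.328 (0.009) |
| Itti et al. (2009) | 0.241 (0.006) |
| Zhang et al. (2009) | 0.181 |
| Itti et al. (1998) | 0.205 (0.006) |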
3.5.2. Neurons as estimators of context-mediated PDs in natural scenes
These results support the notion that neurons in the early visual cortex act as estimators of the context-mediated PDs in natural scenes. This way, any single neuron relates an occurrence of any visual variable to the underlying PD in natural scenes. These PDs are related to all possible stimuli in natural scenes experienced by the visual animals over evolutionary and developmental time.
This hypothesis is distinct from the conventional view of neurons as feature detectors, the efficient coding hypothesis (Simoncelli & Olshausen, 2001), predictive coding (Rao & Ballard, 1999), the proposal that neurons encode logarithmic likelihood functions (Rao, 2004), and several recent V1 neuronal models that involve complex spatiotemporal structures but do not function as estimators of PDs in natural scenes (Rust et al., 2005; Chen et al., 2007). Since the response of any single neuron encodes and decodes the PD of the visual variable in natural scenes, this concept is also different from probabilistic population codes, in which populations of neurons automatically encode PDs due to varying tuning among neurons and noise (Ma et al., 2005).
4. Statistics of 3D natural scenes and visual space
In the last two sections, I presented evidence that aspects of human natural vision are generated on the basis of the PDs of visual variables in 2D natural scenes. However, the most fundamental task of vision is to generate visual percepts and visually guided behaviors in the 3D physical world. In this section, I present PDs in 3D natural scenes and relate them to the characteristics of human visual space.
Visual space is characterized by perceived geometrical properties such as distance, linearity, and parallelism. An appealing intuition is that these properties are the result of a direct transformation of the Euclidean characteristics of physical space (Hershenson, 1998; Loomis et al., 1996; Gillam, 1996). This assumption is, however, inconsistent with a variety of puzzling and often subtle discrepancies between the predicted consequences of any direct mapping of physical space and what people actually see. A number of examples in perceived distance, the simplest aspect of visual space, show that the apparent distance of objects bears no simple relation to their physical distance from the observer (Loomis et al., 1996; Gillam, 1996) (Fig. 10). Although a variety of explanations have been proposed, there has been little or no agreement about the basis of this phenomenology.
We tested the hypothesis that these anomalies of perceived distance are all manifestations of a probabilistic strategy for generating visual percepts in response to inevitably ambiguous visual stimuli (Knill & Richards, 1996; Purves & Lotto, 2003; Trommershauser et al., 2011). A straightforward way of examining this idea in the case of visual space is to analyze the statistical relationship between geometrical features (e.g., points, lines and surfaces) in the image plane and the corresponding physical geometry in representative visual scenes. Accordingly, we used a database of natural scene geometry acquired with a laser range scanner to test whether the otherwise puzzling phenomenology of perceived distance can be explained in statistical terms (Fig. 11).
4.2. A probabilistic concept of visual space
The challenge in generating perceptions of distance (and spatial relationships more generally) is the inevitable ambiguity of visual stimuli. When any point in space is projected onto the retina, the corresponding point in projection could have been generated by an infinite number of different locations in the physical world. In consequence, the relationship between any projected image and its source is inherently ambiguous. Nevertheless, the PD of the distances of un-occluded object surfaces from the observer must have a potentially informative statistical structure. Given this inevitable ambiguity, it seems likely that highly evolved visual systems would have taken advantage of this probabilistic information in generating perceptions of physical space.
This probabilistic strategy can be formalized in terms of Bayesian inference (Knill & Richards, 1996; Trommershauser et al., 2011). In this framework, the PD of the physical sources underlying a visual stimulus, $P(S|I)$, can be expressed as

$$P(S|I) = \frac{P(I|S)\,P(S)}{P(I)}$$

where $S$ represents the parameters of physical scene geometry and $I$ the visual image. $P(S)$ is the PD of scene geometry in typical visual environments (the prior), $P(I|S)$ the PD of stimulus $I$ generated by the scene geometry $S$ (the likelihood function), and $P(I)$ a normalization constant.
If visual space is indeed determined by the PD of 3D scene geometry underlying visual stimuli, then, under reduced-cue conditions, the prior PD of distances to the observer in typical viewing environments should bias perceived distances. By the same token, the PD of the distances between locations in a scene should bias the apparent relative distances among them. Finally, when additional information pertinent to distance is present, these biases will be reduced.
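The role of the prior under reduced-cue and cue-rich conditions can be illustrated with a minimal sketch of Bayes' rule on a discrete grid of candidate distances. The grid, the prior values, and the two likelihoods below are hypothetical numbers chosen only for illustration; none of them are taken from the range image database.

```python
def posterior(prior, likelihood):
    """Discrete Bayes' rule: P(S|I) = P(I|S) * P(S) / P(I)."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # P(I), the normalization constant
    return [u / evidence for u in unnormalized]


# Hypothetical candidate distances (m) and illustrative probabilities.
distances = [1.0, 2.0, 3.0, 4.0, 5.0]
prior = [0.10, 0.25, 0.30, 0.20, 0.15]        # P(S), assumed values
flat_cue = [1.0, 1.0, 1.0, 1.0, 1.0]          # P(I|S) under reduced cues
strong_cue = [0.05, 0.10, 0.20, 0.40, 0.25]   # P(I|S) with a distance cue

post_flat = posterior(prior, flat_cue)     # reduces to the prior
post_cue = posterior(prior, strong_cue)    # pulled toward the cue

most_probable_flat = distances[post_flat.index(max(post_flat))]
most_probable_cue = distances[post_cue.index(max(post_cue))]
print(most_probable_flat, most_probable_cue)
```

With a flat likelihood the posterior is simply the prior, so the most probable distance is the prior's maximum; an informative likelihood moves the most probable distance away from that maximum, which is the sense in which the bias is reduced.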
4.3. PDs of distances in natural scenes
The information at each pixel in the range image database is the distance, elevation, and azimuth of the corresponding location in the physical scene relative to the laser scanner (Fig. 11). These data were used to compute the PD of distances from the center of the scanner to locations in the physical scenes in the database.
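As a rough sketch of how such a PD might be estimated, the per-pixel radial distances can be binned into a normalized histogram. The function name `distance_pd`, the bin width, and the synthetic gamma-distributed samples are assumptions made for illustration; the samples merely stand in for range data and are not the scene model referred to in the legend of Fig. 12.

```python
import random


def distance_pd(distances, bin_width, max_distance):
    """Estimate a PD of distances as a normalized histogram.

    Returns (bin_centers, probabilities). Samples at or beyond
    max_distance are ignored.
    """
    n_bins = int(round(max_distance / bin_width))
    counts = [0] * n_bins
    for d in distances:
        index = int(d / bin_width)
        if 0 <= index < n_bins:
            counts[index] += 1
    total = sum(counts)
    centers = [(i + 0.5) * bin_width for i in range(n_bins)]
    probabilities = [c / total for c in counts]
    return centers, probabilities


# Synthetic stand-in for per-pixel radial distances (m). The gamma
# shape (mode at 3 m) is an assumption, not a fit to the database.
random.seed(0)
samples = [random.gammavariate(2.0, 3.0) for _ in range(20000)]

centers, pd = distance_pd(samples, bin_width=2.0, max_distance=50.0)
mode = centers[pd.index(max(pd))]
print(mode)
```

The same routine could be applied to horizontal distances or to subsets of pixels (for example, those at a given height) by filtering the samples before binning.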
The first of several statistical features apparent in the analysis is that the PD of the radial distances from the scanner to physical locations in the scenes has a maximum at about 3 m, declining approximately exponentially over greater distances (Fig. 12 (a)). This PD is scale-invariant, meaning that any scaled version of the geometry of a set of natural scenes will, in statistical terms, be much the same (Lee et al., 2001). A simple model of natural 3D scenes generates a scale-invariant PD of object distances nearly identical to that obtained from natural scenes (see legend of Fig. 12).
A second statistical feature of the analysis concerns how different physical locations in natural scenes are typically related to each other with respect to distance from the observer. The PD of the differences in the distance from the observer to any two physical locations is highly skewed, having a maximum near zero and a long tail (Fig. 12 (b)). Even for angular separations as large as 30°, the most probable difference between the distances from the image plane of two locations is minimal.
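A minimal sketch of this pairwise analysis is given below, assuming a single scan line of distances sampled at a fixed angular step, so that a separation of `step` pixels corresponds to a fixed angular separation. The scan line is synthetic (eight flat surfaces, 45 pixels each) and the function name `difference_pd` is hypothetical; the sketch only illustrates why a scene made of extended surfaces yields a PD of distance differences that peaks near zero and has a long tail.

```python
def difference_pd(scan_line, step, bin_width=1.0, n_bins=30):
    """PD of |d_i - d_(i+step)| over all pixel pairs `step` pixels apart."""
    counts = [0] * n_bins
    for i in range(len(scan_line) - step):
        diff = abs(scan_line[i] - scan_line[i + step])
        index = int(diff / bin_width)
        if index < n_bins:
            counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts]


# Synthetic scan line: each surface spans 45 pixels at one distance (m).
surfaces = [3.0, 3.5, 9.0, 4.0, 12.0, 5.0, 20.0, 6.0]
scan_line = []
for surface_distance in surfaces:
    scan_line.extend([surface_distance] * 45)

pd_small = difference_pd(scan_line, step=1)    # small angular separation
pd_large = difference_pd(scan_line, step=30)   # large angular separation
print(pd_small[0], pd_large[0])
```

In this toy scene both distributions have their maximum in the first bin, but the probability of a near-zero difference is lower at the larger separation, because more pairs straddle a boundary between surfaces.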
A third statistical feature is that the PD of horizontal distances from the scanner to physical locations changes relatively little with height in the scene (the height of the center of the scanner was always 1.65 m above the ground, thus approximating the eye-level of an average adult) (Fig. 12 (c)). The PD of physical distances at eye-level has a maximum at about 4.7 m and decays gradually as the distances increase. The PDs of the horizontal distances of physical locations at different heights above and below eye-level also tend to have a maximum at about 3 m, and are similar in shape.
4.4. Perceived distances in impoverished settings
How, then, do these scale-invariant PDs of distances from the image plane in natural scenes account for the anomalies of visual space summarized in Fig. 10?
When little or no other information is available in a scene, observers tend to perceive objects at a distance of 2-4 m (Owens et al., 1976). In the absence of any distance cues, the likelihood function in Eq. (6) is flat; the apparent distance of a point in physical space should therefore accord with the PD of the distances of all points in typical visual scenes (see Eq. (6)). As indicated in Fig. 12 (a), this distribution has a maximum probability at about 3 m. The agreement between this PD of distances in natural scenes and the relevant psychophysical evidence is thus consistent with a probabilistic explanation of the ‘specific distance tendency’.
The tendency of an object to appear at about the same distance as its near neighbors in the retinal image (the ‘equidistance tendency’ (Owens et al., 1976)) also accords with the PD of the distances of locations in natural scenes. In the absence of additional information about differences in the distances of two nearby locations, the likelihood function is again more or less flat. As a result, the PD of the differences of the physical distances from the image plane to any two locations in natural scenes should strongly bias the perceived difference in their distances. Since this distribution between two locations with relatively small angular separations (the black line in Fig. 12 (b)) has a maximum near zero, any two neighboring objects should be perceived to be at about the same distance from the observer. However, at larger angular separations (the green line in Fig. 12 (b)) the probability associated with small absolute differences in the distance to the two points is lower than the corresponding probabilities for smaller separations, and the distribution is relatively flatter. Accordingly, the tendency to see neighboring points at the same distance from the observer would be expected to decrease somewhat as a function of increasing angular separation. Finally, when more specific information about the distance difference is present, this tendency should decrease. Each of these several tendencies has been observed in psychophysical studies of the ‘equidistance tendency’.
4.5. Perceived distances in more complex circumstances
The following explanations for the phenomena illustrated in Figs. 10 (c) and (d) are somewhat more complex since, in contrast to the ‘specific distance’ and ‘equidistance’ tendencies, the relevant psychophysical observations were made under conditions that entailed some degree of contextual visual information. Thus, the relevant likelihood functions are no longer flat. Since their form is not known, we used a Gaussian to approximate the likelihood function in the following analyses.
The PD of physical distances at eye-level (the black line in Fig. 12 (c)) accounts for the perceptual anomalies in response to stimuli generated by near and far objects presented at this height (see Fig. 10 (c)). As shown in Fig. 13 (a), the distance that should be perceived on this basis is approximately a linear function of physical distance, with near distances being overestimated and far distances underestimated; the physical distance at which overestimation changes to underestimation is about 5-6 m. The effect of these statistics accords both qualitatively and quantitatively with the distances reported under these experimental conditions (Philbeck & Loomis, 1997).
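The qualitative pattern of overestimation at near distances and underestimation at far distances can be reproduced with a small sketch that combines a Gaussian likelihood with a unimodal prior. The prior shape used here, proportional to d·exp(−d/4.7) with its maximum at 4.7 m, and the likelihood width of 1.5 m are assumptions chosen for illustration, as is the use of the posterior maximum as the predicted percept; the sketch is not the analysis behind Fig. 13 and will not reproduce its numbers.

```python
import math


def perceived_distance(physical, prior_scale=4.7, sigma=1.5,
                       grid_step=0.05, grid_max=40.0):
    """Posterior maximum on a distance grid (hypothetical prior and sigma)."""
    best_d, best_log_post = None, -math.inf
    n_points = int(round(grid_max / grid_step))
    for i in range(1, n_points + 1):
        d = i * grid_step
        # Assumed prior: d * exp(-d / prior_scale), maximum at prior_scale.
        log_prior = math.log(d) - d / prior_scale
        # Gaussian likelihood centered on the physical distance.
        log_like = -((d - physical) ** 2) / (2.0 * sigma ** 2)
        log_post = log_prior + log_like
        if log_post > best_log_post:
            best_d, best_log_post = d, log_post
    return best_d


for physical in (2.0, 4.7, 15.0):
    print(physical, perceived_distance(physical))
```

Under these assumptions the predicted distance exceeds the physical distance for objects nearer than the prior maximum and falls short of it for objects beyond, with the crossover located at the maximum of the assumed prior.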
To examine whether the perceptual observations summarized in Fig. 10 (d) can also be explained in these terms, we computed the PD of physical distances of points at different elevation angles of the laser beam relative to the horizontal plane at eye-level (Fig. 14). As shown in Fig. 14 (a), the PD of distances is more dispersed when the line of sight is directed above rather than below eye-level. The distribution shifts toward nearer distances with increasing absolute elevation angle, a tendency that is more pronounced below than above eye-level. A more detailed examination of the distribution within 30 m shows a single salient ridge below eye-level (indicated in red), extending from ~3 m near the ground to ~10 m at an elevation of -10° (Fig. 14 (b)). The distances of the average physical locations at different elevation angles of the scanning beam form a gentle curve. Below eye-level, the height of this curve is relatively near the ground for closer distances, but increases slowly as the horizontal distance from the observer increases. If the portion of the curve at heights below eye-level in Fig. 14 (c) is taken as an index of the average ground, it is apparent that the average ground is neither a horizontal plane nor a plane with constant slant, but a curved surface that is increasingly inclined toward the observer as a function of horizontal distance.
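The construction of such an average profile can be sketched as follows, assuming each sample is a pair (radial distance in meters, elevation angle in degrees) and that elevation is measured from the horizontal plane through the scanner at 1.65 m. The function names and the synthetic flat-ground data are hypothetical; the flat ground serves only as a reference case in which the recovered height should be zero at every elevation.

```python
import math

SCANNER_HEIGHT_M = 1.65  # height of the scanner center given in the text


def to_horizontal_and_height(radial_distance, elevation_deg):
    """Convert (radial distance, elevation) to (horizontal distance, height)."""
    elevation = math.radians(elevation_deg)
    horizontal = radial_distance * math.cos(elevation)
    height = SCANNER_HEIGHT_M + radial_distance * math.sin(elevation)
    return horizontal, height


def mean_location_by_elevation(samples):
    """Average the radial distance at each elevation, then locate that point."""
    sums = {}
    for radial, elevation_deg in samples:
        total, count = sums.get(elevation_deg, (0.0, 0))
        sums[elevation_deg] = (total + radial, count + 1)
    profile = {}
    for elevation_deg, (total, count) in sorted(sums.items()):
        profile[elevation_deg] = to_horizontal_and_height(total / count,
                                                          elevation_deg)
    return profile


# Synthetic reference: beams below eye-level hitting perfectly flat ground.
flat_ground = [(SCANNER_HEIGHT_M / math.sin(math.radians(a)), -a)
               for a in (10, 20, 30)]
profile = mean_location_by_elevation(flat_ground)
for elevation_deg, (horizontal, height) in profile.items():
    print(elevation_deg, round(horizontal, 2), round(height, 2))
```

For real scenes, objects standing on the ground shorten the average distance at each elevation, so the mean locations would lie above this flat reference, in the manner of the curved average ground described in the text.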
These characteristics of distance as a function of the elevation of the line of sight can thus account for the otherwise puzzling perceptual effects shown in Fig. 10 (d). The perceived location of an object on the ground without much additional information varies according to the declination of the line of sight, the object appearing closer and higher than it really is as a function of this angle. The apparent location of an object predicted by the PDs in Fig. 14 is increasingly higher and closer to the observer as the declination of the line of sight decreases, in agreement with the relevant psychophysical data (Ooi et al., 2001) (Fig. 13 (b)).
When projected onto the retina, 3D spatial relationships in the physical world are necessarily transformed into 2D relationships in the image plane. As a result, the physical sources underlying any geometrical configuration in the retinal image are uncertain: a multitude of different scene geometries could underlie any particular configuration in the image. This uncertain link between retinal stimuli and physical sources presents a biological dilemma, since an observer’s fate depends on visually guided behavior that accords with real-world physical sources.
Given this quandary, we set out to explore the hypothesis that the uncertain relationship of images and sources is addressed by a probabilistic strategy, using the phenomenology of visual space to test this idea. If physical and perceptual space are indeed related in this way, then the characteristics of human visual space should accord with the PDs of 3D natural scene geometry. Observers would be expected to perceive objects in positions substantially and systematically different from their physical locations when countervailing empirical information is not available, or at locations predicted by the altered PDs of the possible sources of the stimulus in question when other contextual information is available. Using a database of range images, we showed that the phenomena illustrated in Fig. 10 can all be rationalized in this framework.
If visual space is indeed generated by a probabilistic strategy, then explaining the relevant perceptual phenomenology will inevitably require knowledge of the statistical properties of natural visual environments with respect to observers. Visual space generated probabilistically will necessarily be a space in which perceived distances are not a simple mapping of physical distances; on the contrary, apparent distance will always be determined by the way all the available information at that moment affects the PD of the gamut of the possible sources of any physical point in the scene.
These and many other studies present a strong case for the concept that vision works as a fundamentally statistical machine. In this concept, even the simplest visual percept has a statistical basis, i.e., it is related to certain statistics of natural environments that support routinely successful visually guided behavior. The statistics of natural visual environments must have been incorporated into the visual circuitry by successful behavior in the world over evolutionary and developmental time.
There is a range of statistics in natural environments. These include the statistics of 2D and 3D natural scenes in both the space and time domains. As discussed here and elsewhere (Geisler, 2008), these statistics are related to a range of aspects of human natural vision. Since natural environments consist of objects of various physical properties that are arranged in 3D space and move in a variety of ways, the statistics of natural objects, activities, and events, though not discussed here, are critical for our understanding of human object recognition and of activity and event understanding (Yuille & Kersten, 2006; Doya et al., 2007; Friston, 2010).
What could be the neural mechanisms underlying this fundamentally statistical machine? A broad hypothesis is that the response properties of visual neurons and their connections, the organization of visual cortex, the patterns of activity elicited by visual stimuli, and visual perception are all determined by the PDs of visual stimuli. In this conception, neurons do not detect or encode features, but by virtue of their activity levels, act as estimators of the PDs of the variables underlying any given stimulus. From this perspective, the function of visual cortical circuitry is to propagate, combine, and transform these PDs. The iterated structure of the primary visual cortex in primates may thus be organized in the way it is in order to generate PDs pertinent to simpler aspects of visual stimuli. By the same token, the extrastriate visual cortical areas may serve to generate PDs pertinent to more complex aspects of visual stimuli by propagating, combining, and transforming the PDs elaborated in the V1 area. The activity patterns elicited by any visual stimulus would, in this conception, be determined by the joint PDs of the variables underlying visual stimuli, which, in turn, determine what people actually see.
This statistical concept of vision and visual system structure and function is radically different from the conventional view, where visual neurons are conceived to perform bottom-up, image-based processing (e.g., computing zero-crossings, luminance and texture gradients, stereoscopic and motion correspondence, and grouping) to build a series of symbolic representations of visual stimuli (e.g., primal sketch, 2½D sketch, and 3D representation) (Marr, 2010). Since the statistics of natural scenes, which are, as argued above, fundamental to the generation of natural vision and visually guided behaviors, are not contained in any current stimulus the visual animal is seeing, any image-based feature extraction/representation construction in the current stimulus per se will not generate percepts that allow routinely successful behaviors. The results presented here and many others support this statistical concept of vision and visual system structure and function, and several recent reviews also point to this new concept (Knill & Richards, 1996; Rao et al., 2002; Purves & Lotto, 2003; Doya et al., 2007; Trommershauser et al., 2011; Simoncelli & Olshausen, 2001; Yuille & Kersten, 2006; Geisler, 2008; Friston, 2010), but much is left to the next generation of neuroscientists.