Bio-Inspired Active Vision Paradigms in Surveillance Applications



Introduction
Visual perception was described by Marr (1982) as the processing of visual stimuli through three hierarchical levels of computation. At the first level, or low-level vision, the fundamental components of the observed scene are extracted, such as edges, corners, flow vectors and binocular disparity. At the second level, or medium-level vision, objects are recognised (e.g. model matching and tracking). Finally, at the third level, or high-level vision, the scene is interpreted. A complementary view is presented in (Ratha & Jain, 1999; Weems, 1991): the processing of visual stimuli is analysed under the perspective developed by Marr (1982), but emphasising how much data is processed and how complex the operators used at each level are. Hence, low-level vision is characterised by a large amount of data, small-neighbourhood data access, and simple operators; medium-level vision is characterised by small-neighbourhood data access, a reduced amount of data, and complex operators; and high-level vision is defined by non-local data access, a small amount of data, and complex relational algorithms. Bearing in mind the different processing levels and their specific characteristics, it is plausible to describe a computer vision system as a modular framework in which the low-level vision processes are implemented on parallel processing engines such as GPUs and FPGAs, to exploit the data locality and the simple algorithmic operations of the models, while the medium- and high-level vision processes are implemented on CPUs, to take full advantage of the straightforward way of programming these kinds of devices.
Low-level vision tasks are probably the most studied in computer vision, and they remain an open research area for a great variety of well-defined problems. In particular, the estimation of optic flow and of binocular disparity has earned special attention because of its applicability in segmentation and tracking. On the one hand, stereo information has been proposed as a useful cue to overcome some of the issues inherent to robust pedestrian detection (Zhao & Thorpe, 2000), to segment foreground from background layers (Kolmogorov et al., 2005), and to perform tracking (Harville, 2004). On the other hand, optic flow is commonly used as a robust feature in motion-based segmentation and tracking (Andrade et al., 2006; Yilmaz et al., 2006). Moreover, the human visual system takes full advantage of both optic flow and disparity estimates, not only for tracking and fixation in depth but also for scene segmentation. The most relevant aspect of the proposed framework is its hardware and software modularity. The proposed system integrates three cameras (see Fig. 1): two active cameras with variable-focal-length lenses (the binocular system) and a third fixed camera with a wide-angle lens. The system has been designed to be compatible with the well-known iCub robot interface. The camera movement control, as well as the zoom and iris control, runs on an embedded PC/104 computer. The optic flow and disparity algorithms run on a desktop computer equipped with an Intel Core 2 Quad processor @ 2.40 GHz and about 8 GB of RAM. All system components, namely the desktop computer, the embedded PC/104 computer, and the cameras, are connected in a gigabit Ethernet network through which they interact as a distributed system.

Fig. 1. Trinocular robotic head with 5 degrees of freedom, namely a common tilt movement, and independent zoom-pan movements for the left and right cameras, respectively.
The general features of the moving platform are compiled in Table 1. Likewise, the optic features of the cameras are collected in Table 2. Lastly, it is important to mention that the binocular system has a baseline of 30 cm.

Current surveillance systems typically rely on networks of fixed and active cameras located at positions that are not predetermined, so as to strategically cover a wide area; the term active specifies the camera's ability to change both its angular position and its field of view. The type of cameras used in the network has inspired different calibration processes to automatically find both the intrinsic and extrinsic camera parameters. In this regard, Lee et al. (2000) proposed a method to estimate the 3D positions and orientations of fixed cameras, and the ground plane in a global reference frame, which lets the multiple camera views be aligned into a single planar coordinate frame; this method assumes approximate values for the intrinsic camera parameters and is based on overlapping camera views. Other calibration methods have been proposed for non-overlapping camera views (e.g. Kumar et al., 2008). In the case of active cameras, Tsai (1987) developed a method for estimating both the rotation and translation matrices in the Cartesian reference frame and the intrinsic parameters of the cameras. In addition to the calibration methods, current surveillance systems must deal with the segmentation and identification of complex scenes in order to characterise them and thus obtain a classification that lets the system recognise unusual behaviours in the scene. In this regard, a large variety of algorithms have been developed to detect changes in a scene; for example, applying a threshold to the absolute difference between pixel intensities of two consecutive frames can lead to the identification of moving objects (some methods for threshold selection are described in Kapur et al., 1985; Otsu, 1979; Ridler & Calvar, 1978). Other examples are adaptive background subtraction to detect moving foreground objects
(Stauffer & Grimson, 1999; 2000) and the estimation of optic flow (Barron et al., 1994). Our proposal differs from most current surveillance systems in at least three aspects: (1) the use of a single camera with a wide-angle lens to cover vast areas and a binocular system for tracking areas of interest at different fields of view (the wide-angle camera is used as the reference frame); (2) the estimation of both optic flow and binocular disparity for segmenting the images, a feature that can provide useful information for disambiguating occlusions in dynamic scenarios; and (3) the use of a bio-inspired fixation strategy that lets the system fixate areas of interest accurately.
In order to explain the system behaviour, two different perspectives are described. On the one hand, we present the system as a bio-inspired mathematical model of the primary visual cortex (see section 2); from this viewpoint, we develop a low-level vision architecture for estimating optic flow and binocular disparity. On the other hand, we describe the geometry of the camera positions in order to derive the equations that govern the movement of the cameras (see section 3). Once the system is completely described, in section 4 we define an angular-position control capable of changing the viewpoint of the binocular system by using disparity measures. An interesting case study is described in section 5, where both disparity and optic flow are used to segment images. Finally, in section 6, we present and discuss the system's performance results.

The system: a low-level vision approach
The visual cortex is the largest, and probably the most studied, part of the human brain. It is responsible for the processing of visual stimuli impinging on the retinas. The first stage of processing takes place in the lateral geniculate nucleus (LGN), whose neurons relay the visual information to the primary visual cortex (V1). From there, the visual information flows hierarchically to areas V2, V3, V4 and V5/MT, where visual perception gradually takes place.
The experiments carried out by Hubel & Wiesel (1968) proved that the primary visual cortex (V1) consists of cells responsive to different kinds of spatiotemporal features of the visual information. The apparent complexity with which the brain extracts these spatiotemporal features has been clearly explained by Adelson & Bergen (1991). The light filling a region of space contains information about the objects in that space; in this regard, they proposed the plenoptic function to describe mathematically the pattern of light rays collected by a vision system. By definition, the plenoptic function describes the state of the luminous environment; thus the task of the visual system is to extract structural elements from it.
Structural elements of the plenoptic function can be described as oriented patterns in the plenoptic space, and the primary visual cortex can be interpreted as a set of local Fourier or Gabor operators used to characterise the plenoptic function in the spatiotemporal and frequency domains.

Neuromorphic paradigms for visual processing
Mathematically speaking, the extraction of the most important aspects of the plenoptic function can emulate the neuronal processing of the primary visual cortex (V1). More precisely, qualities or elements of the visual input can be estimated by applying a set of low-order directional derivatives at the sample points; the measures so obtained represent the amount of a particular type of local structure. To effectively characterise a function within a neighbourhood, it is necessary to work with the local average derivative or, in an equivalent form, with oriented linear filters in the function hyperplanes. Consequently, the neurons in V1 can be interpreted as a set of oriented linear filters whose outputs can be combined to obtain more complex feature detectors or, what is the same, more complex receptive fields. The combination of linear filters allows us to measure the magnitude of local changes within a specific region without specifying their exact location or spatial structure. The receptive fields of complex neurons have been modelled as the sum of the squared responses of two linear receptive fields that differ in phase by 90° (Adelson & Bergen, 1985); as a result, the receptive fields of complex cells provide local energy measures.
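The complex-cell energy model described above can be sketched as follows: a quadrature pair of Gabor filters (identical except for a 90° phase offset) is applied to the input, and the squared responses are summed. This is a minimal 1D illustration; the frequency and width parameters are assumed values, not taken from the original text.

```python
import numpy as np

def local_energy(signal, omega0=0.5 * np.pi, sigma=2.0):
    """Local energy from a quadrature pair of 1D Gabor filters.

    The even (cosine) and odd (sine) filters differ in phase by 90 degrees;
    summing their squared responses gives a phase-invariant energy measure,
    as in the complex-cell model of Adelson & Bergen (1985).
    """
    t = np.arange(-6, 7, dtype=float)
    gauss = np.exp(-t**2 / (2.0 * sigma**2))
    even = gauss * np.cos(omega0 * t)   # symmetric receptive field
    odd = gauss * np.sin(omega0 * t)    # antisymmetric receptive field
    r_even = np.convolve(signal, even, mode='same')
    r_odd = np.convolve(signal, odd, mode='same')
    return r_even**2 + r_odd**2         # local energy

# A sinusoid at the tuned frequency yields a near-constant energy profile
# away from the borders, regardless of the local phase of the input.
x = np.cos(0.5 * np.pi * np.arange(64))
e = local_energy(x)
```

The phase invariance is the key property: a single linear filter would respond with an oscillating output, while the quadrature-pair energy is locally smooth.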

Neural Architecture to estimate optic flow and binocular disparity
The combination of receptive fields oriented in space-time can be used to compute local energy measures for optic flow (Adelson & Bergen, 1985). Analogously, by combining the outputs of spatial receptive fields it is possible to compute local energy measures for binocular disparity (Fleet et al., 1996; Ohzawa et al., 1990). On this ground, a neural architecture for the computation of horizontal and vertical disparities and optic flow has recently been proposed (Chessa, Sabatini & Solari, 2009). Structurally, the architecture comprises four processing stages (see Fig. 2): the distributed coding of the features by means of oriented filters that resemble the filtering process in area V1; the decoding of the filter responses; the estimation of the local energy for both optic flow and binocular disparity; and the coarse-to-fine refinement. The neuronal population is composed of a set of 3D Gabor filters, which are capable of uniformly covering the different spatial orientations and of optimally sampling the spatiotemporal domain (Daugman, 1985). The linear, derivative-like computation performed by the Gabor filters lets them take the separable form h(x, t) = g(x) f(t). Both the spatial and the temporal term on the right-hand side are composed of one harmonic function and one Gaussian function, as can easily be deduced from the impulse response of the Gabor filter.
The spatial term of a 3D Gabor filter rotated by an angle θ with respect to the horizontal axis can be written, up to a normalisation constant, as

g(x, y; θ) = exp( −( x_θ² / (2σ_x²) + y_θ² / (2σ_y²) ) ) cos( ω_0 x_θ + ψ ),

where θ ∈ [0, 2π) represents the spatial orientation; ω_0 and ψ are the frequency and phase of the sinusoidal modulation, respectively; the values σ_x and σ_y determine the spatial area of the filter; and (x_θ, y_θ) = (x cos θ + y sin θ, −x sin θ + y cos θ) are the rotated spatial coordinates.
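The rotated spatial Gabor term can be sampled directly on a pixel grid. The sketch below omits the normalisation constant; the default size and parameter values simply echo the figures quoted later in the performance section and are otherwise assumptions.

```python
import numpy as np

def spatial_gabor(size=11, theta=0.0, omega0=0.5 * np.pi, psi=0.0,
                  sigma_x=2.0, sigma_y=2.0):
    """Spatial term of a 3D Gabor filter rotated by theta (unnormalised sketch)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # rotated coordinates (x_theta, y_theta)
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 / (2 * sigma_x**2) + y_t**2 / (2 * sigma_y**2)))
    return envelope * np.cos(omega0 * x_t + psi)

# One oriented kernel of an N-orientation bank (e.g. theta = pi/4):
g = spatial_gabor(theta=np.pi / 4)
```

A bank of N such kernels, with θ uniformly spaced in [0, π), gives the oriented filters of the first processing stage.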
The algorithm to estimate the binocular disparity is based on a phase-shift model; one variation of this model suggests that disparity is coded by phase shifts between receptive fields of the left and right eyes whose centres are in the same retinal position (Ohzawa et al., 1990). Let the left and right receptive fields be g_L(x) and g_R(x), respectively; the binocular phase shift is defined by ∆ψ = ψ_L − ψ_R. Each spatial orientation has a set of K receptive fields with different binocular phase shifts, in order to be sensitive to different disparities (δ_θ = ∆ψ/ω_0); the phase shifts are uniformly distributed between −π and π. The left and right receptive fields are applied to a binocular image pair I_L(x) and I_R(x) by convolution, and the binocular energy at position x_0 is the squared modulus of the summed responses:

E(x_0; δ_θ) = | (g_L ∗ I_L)(x_0) + (g_R ∗ I_R)(x_0) |².

Likewise, the temporal term of a 3D Gabor filter is defined by

f(t) = exp( −t² / (2σ_t²) ) cos( ω_t t ) 1(t),

where σ_t determines the integration window of the filter in the time domain; ω_t is the frequency of the sinusoidal modulation; and 1(t) denotes the unit step function. Each receptive field is tuned to a specific velocity v_θ along the direction orthogonal to the spatial orientation θ. The temporal frequency is varied according to ω_t = v_θ ω_0. Each spatial orientation has a set of receptive fields sensitive to M tuning velocities; M depends on the size of the area covered by each filter according to the Nyquist criterion.
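The phase-shift binocular energy can be sketched in 1D: complex-valued left and right Gabor fields share the same retinal centre and differ only in phase, and the energy is the squared modulus of the summed responses. This is a toy version with assumed parameter values; a real implementation would use the 2D oriented filters of the architecture.

```python
import numpy as np

def binocular_energy(I_left, I_right, dpsi, omega0=0.5 * np.pi, sigma=2.0):
    """Binocular energy for one phase shift dpsi = psi_L - psi_R (1D sketch)."""
    t = np.arange(-6, 7, dtype=float)
    gauss = np.exp(-t**2 / (2.0 * sigma**2))
    g_left = gauss * np.exp(1j * omega0 * t)              # psi_L = 0
    g_right = gauss * np.exp(1j * (omega0 * t - dpsi))    # psi_R = -dpsi
    q = (np.convolve(I_left, g_left, mode='same')
         + np.convolve(I_right, g_right, mode='same'))
    return np.abs(q) ** 2

# For identical left/right images (zero disparity), the summed energy over
# the image peaks at the cell tuned to zero phase shift.
rng = np.random.default_rng(0)
img = rng.standard_normal(128)
shifts = np.linspace(-np.pi, np.pi, 9)   # K = 9 phase shifts in [-pi, pi]
energies = [binocular_energy(img, img, d).sum() for d in shifts]
print(int(np.argmax(energies)))  # 4, i.e. dpsi = 0
```

Shifting one image relative to the other moves the population peak to the cell whose preferred disparity ∆ψ/ω_0 matches the shift, which is what the decoding stage exploits.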
The set of spatiotemporal receptive fields h(x, t) is applied to an image sequence I(x, t) by spatiotemporal convolution, and the motion energy is the squared modulus of the response:

E(x_0, t; v_θ) = | (h ∗ I)(x_0, t) |².

So far, we have described the process of encoding both binocular disparity and optic flow by means of an N × M × K array of filters uniformly distributed in the space domain. Now, it is necessary to extract the component velocity (v_θc) and the component disparity (δ_θc) from the local energy measures at each spatial orientation. The accuracy in the extraction of these components is strictly correlated with the number of filters used per orientation, such that precise estimations require a large number of filters; as a consequence, it is of primary importance to establish a compromise between the desired accuracy and the number of filters used or, what is the same, between accuracy and computational cost.
An affordable computational cost can be achieved by using weighted-sum methods such as the maximum likelihood estimator proposed by Pouget et al. (2003). However, the proposed architecture uses the centre of gravity of the population activity, since it has shown the best compromise between simplicity, computational cost and reliability of the estimates. Therefore, the component velocity v_θc is obtained by pooling the cell responses at each orientation:

v_θc(x_0, t) = Σ_i v_θi E(x_0, t; v_θi) / Σ_i E(x_0, t; v_θi),

where the v_θi are the M tuning velocities and E(x_0, t; v_θi) are the corresponding motion energies at that spatial orientation. The component disparity δ_θc can be estimated in a similar way.
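The centre-of-gravity readout amounts to an energy-weighted average of the filter tunings; the same readout serves for tuning velocities and for preferred disparities. A minimal sketch:

```python
import numpy as np

def decode_component(tunings, energies):
    """Centre-of-gravity readout of the population activity.

    Returns the energy-weighted average of the filter tunings,
    i.e. the centroid of the population response.
    """
    tunings = np.asarray(tunings, dtype=float)
    energies = np.asarray(energies, dtype=float)
    return float((tunings * energies).sum() / energies.sum())

# M = 3 tuning velocities; a response peaked on the middle filter with a
# slight bias towards the positive one pulls the estimate off zero.
v = decode_component([-1.0, 0.0, 1.0], [0.1, 1.0, 0.3])
print(v)  # (-0.1 + 0.3) / 1.4 ~ 0.142857
```

Because the readout interpolates between tunings, a small number of filters per orientation (M = 3 in the performance section) can still yield sub-tuning-step precision.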
Because of the aperture problem, a filter can only estimate the features that are orthogonal to its orientation. We therefore adopt K different binocular and M different motion receptive fields for each spatial orientation; consequently, a robust estimate of the full velocity v and of the full disparity δ is achieved by combining all the estimates v_θc and δ_θc, respectively (Pauwels & Van Hulle, 2006; Theimer & Mallot, 1994).
Finally, the neural architecture uses a coarse-to-fine control strategy in order to increase the detection range in both motion and disparity. The displacement features obtained at coarser levels are expanded and used to warp the images at finer levels, in order to achieve a higher displacement resolution.
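The coarse-to-fine loop can be sketched as follows. Flow estimated at a coarser pyramid level is expanded, doubled, and used to warp the second image at the next finer level, so the single-scale estimator only sees the residual displacement. The 2x-subsampled pyramid and the nearest-neighbour warp are deliberate simplifications; `estimate_flow(a, b)` stands for any single-scale estimator (such as the energy-based one above) returning per-pixel (u, v).

```python
import numpy as np

def warp(img, u, v):
    """Backward-warp img by the (rounded) displacement field, clipping at borders."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    xs = np.clip(np.round(xx + u).astype(int), 0, w - 1)
    ys = np.clip(np.round(yy + v).astype(int), 0, h - 1)
    return img[ys, xs]

def coarse_to_fine(I1, I2, estimate_flow, levels=3):
    """Schematic coarse-to-fine refinement over a 2x-subsampled pyramid."""
    pyr1, pyr2 = [I1], [I2]
    for _ in range(levels - 1):                    # build image pyramids
        pyr1.append(pyr1[-1][::2, ::2])
        pyr2.append(pyr2[-1][::2, ::2])

    u = np.zeros(pyr1[-1].shape, dtype=float)
    v = np.zeros(pyr1[-1].shape, dtype=float)
    for lev in range(levels - 1, -1, -1):          # coarsest to finest
        a, b = pyr1[lev], pyr2[lev]
        if u.shape != a.shape:                     # expand and rescale the flow
            u = 2.0 * np.kron(u, np.ones((2, 2)))[:a.shape[0], :a.shape[1]]
            v = 2.0 * np.kron(v, np.ones((2, 2)))[:a.shape[0], :a.shape[1]]
        du, dv = estimate_flow(a, warp(b, u, v))   # residual displacement only
        u, v = u + du, v + dv
    return u, v
```

The doubling on expansion is what extends the detection range: a displacement of d pixels at the coarsest of L levels corresponds to d·2^(L−1) pixels at full resolution.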

The system: a geometrical description
In the previous section we presented the system from a biological point of view: we summarised a mathematical model of the behaviour of the primary visual cortex and proposed a computational architecture based on linear filters for estimating optic flow and binocular disparity. Now it is necessary to analyse the system from a geometrical point of view in order to link the visual perception to the camera movements, thus letting the system interact with the environment.
To facilitate the reference to the cameras within this text, we will refer to the fixed camera as the wide-angle camera, and to the cameras of the binocular system as the active cameras. The wide-angle camera is used for a wide view of the scene, and it becomes the reference of the system. In vision research, the cyclopean point is considered the most natural centre of a binocular system (Helmholtz, 1925), and it is used to characterise stereopsis in human vision (Hansard & Horaud, 2008; Koenderink & van Doorn, 1976). By a similar approximation, the three-camera model uses the wide-angle-camera image as the cyclopean image of the system. In this regard, the problem is not to construct the cyclopean image from the binocular system, but to use the third camera image as a reference coordinate frame to properly move the active cameras according to potential targets or regions of interest in the wide-range scenario.
Each variable-focal-length camera can be seen as a 3-DOF pan-tilt-zoom (PTZ) camera. However, the three-camera system constrains the active cameras to share the tilt movement, due to the mechanical design of the binocular framework. One of the purposes of our work is to describe the geometry of the three-camera system in order to properly move the pan-tilt-zoom cameras to fixate any object in the field of view of the wide-angle camera, and thus to obtain both a magnified view of the target object and the depth of the scene.
We use three coordinate systems to describe the relative motion of the active cameras with respect to the wide-angle camera (see Fig. 3). The origin of each coordinate system is assumed to be at the focal point of each camera, and the Z-axes are aligned with the optical axes of the cameras. The pan angles are measured with respect to the planes X_L = 0 and X_R = 0, respectively; note that the pan angles are positive for points to the left of these planes (X_L > 0 or X_R > 0). The rotation axes for the pan movement are assumed to be parallel. The common tilt angle is measured with respect to the horizontal plane; note that the tilt angle is positive for points above the horizontal plane (Y_L = Y_R > 0).
The point P(X, Y, Z) can be written in terms of the coordinate systems shown in Fig. 3 as

P_L = P − O_L   (Equation 8)
P_R = P − O_R   (Equation 9)

where O_L = (dx_L, dy_L, dz_L) and O_R = (−dx_R, dy_R, dz_R) are the origins of the coordinate systems of the left and right cameras with respect to the wide-angle camera coordinate system.

Fig. 3. The coordinate systems of the three cameras in the binocular robotic head.
Let f_w be the focal length of the wide-angle camera and f the focal length of the active cameras. Equations 8 and 9 can be written in terms of the image coordinate system of the wide-angle camera if they are multiplied by the factor f_w/Z (Equations 10 and 11). Now, it is possible to link the image coordinate system of the wide-angle camera to the image coordinate systems of the active cameras by multiplying Equations 10 and 11 by the factors f/Z_L and f/Z_R, respectively (Equations 12 and 13). Assuming that the position of the origin along the Z-axis is small enough compared to the distance of the real object in the scene, the approximations Z ≈ Z_L and Z ≈ Z_R can be made. Accordingly, Equations 12 and 13 can be rewritten to obtain the wide-to-active camera mapping equations:

x_L = (f/f_w) x − (f/Z) dx_L,   y_L = (f/f_w) y − (f/Z) dy_L
x_R = (f/f_w) x + (f/Z) dx_R,   y_R = (f/f_w) y − (f/Z) dy_R

These equations map the position of any point in the field of view of the wide-angle camera into the image coordinates of the active cameras.
So far we have described the geometry of the camera system; now the problem is to transform the wide-to-active camera mapping equations into motor stimuli in order to fixate any point in the wide-angle image. The fixation problem can be defined as the computation of the correct angular positions of the motors in charge of the pan and tilt movements of the active cameras, so as to direct the gaze to any point in the wide-angle image. In this sense, the fixation problem is solved when the point p(x, y) in the wide-angle image is seen at the centres of the left and right camera images.
From the geometry of the trinocular head we can consider dx_L = dx_R and dy_L = dy_R. In this way, both the pan angles (θ_L, θ_R) and the tilt angle (θ_y) of the active cameras, according to the wide-to-active camera mapping equations, can be written as

θ_L = arctan( c x / f_w − c dx / Z )
θ_R = arctan( c x / f_w + c dx / Z )
θ_y = arctan( c y / f_w − c dy / Z )

where c is the conversion factor of the cameras from pixels to metres, and dx, dy are the terms dx_L = dx_R and dy_L = dy_R in pixel units.
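The mapping from an image point of the wide-angle camera to motor angles can be sketched directly from the geometry above. All numeric values (focal length, pixel pitch, baseline offsets, depth) are illustrative assumptions, not parameters from the original text.

```python
import numpy as np

def fixation_angles(x, y, Z, f_w=0.004, c=1e-5, dx=100.0, dy=0.0):
    """Pan and tilt angles (radians) that fixate the image point (x, y)
    of the wide-angle camera. Z is the approximate depth of the target;
    dx, dy are the camera-origin offsets expressed in pixel units."""
    theta_left = np.arctan(c * x / f_w - c * dx / Z)
    theta_right = np.arctan(c * x / f_w + c * dx / Z)
    theta_tilt = np.arctan(c * y / f_w - c * dy / Z)
    return theta_left, theta_right, theta_tilt

# A point on the optical axis (x = y = 0, dy = 0) yields symmetric,
# converging pan angles and zero tilt.
l, r, t = fixation_angles(0.0, 0.0, Z=2.0)
```

Note that the pan angles depend on the depth Z; this is why the fixation-in-depth algorithm of the next section iteratively refines Z from disparity measures.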
Bearing in mind the wide-to-active camera mapping equations, in the following section we describe the algorithm that moves the active cameras to gaze at, and fixate in depth, any object in the field of view of the wide-angle camera.

Fixation in depth
Two different eye movements can be distinguished: version movements rotate the two eyes by an equal magnitude in the same direction, whereas vergence movements rotate the two eyes in opposite directions. The vergence angle, together with the version and tilt angles, uniquely describes the fixation point in 3D space according to Donders' law (Donders, 1969).
Fixation in depth is the coordinated eye movement that aligns the two retinal images in the respective foveas. Binocular depth perception has its highest resolution in the well-known Panum area, i.e. a rather small area centred on the point of fixation (Kuon & Rose, 2006). The fixation of a single point in the scene is achieved mainly by vergence eye movements, which are driven by binocular disparity (Rashbass & Westheimer, 1961). It follows that the amount of disparity around the Panum area must be reduced in order to properly align the two retinal images in the respective foveas.


Defining the Panum area
The Panum area is normally set around the centre of uncalibrated images. This particular assumption becomes a problem in systems where the images are captured using variable-focal-length lenses: if the centre of the image does not lie on the optical axis, then any change in the field of view will produce a misalignment of the Panum area after a fixation in depth. Lenz & Tsai (1988) were the first to propose a calibration method that determines the image centre by changing the focal length, even though no zoom lenses were available at that time. In a subsequent work, Lavest et al. (1993) used variable-focal-length lenses for three-dimensional reconstruction and tested the calibration method proposed by Lenz & Tsai (1988).
In perspective projection geometry, parallel lines that are not parallel to the image plane appear to converge to a unique point, as in the case of the two edges of a road which appear to converge in the distance; this point is known as the vanishing point. Lavest et al. (1993) used the properties of the vanishing point to demonstrate that, with a zoom lens, it is possible to estimate the intersection of the optical axis and the image plane, i.e. the image centre.
Equation 18 is the parametric representation of a set of parallel lines defined by the direction vector D = (D_x, D_y, D_z):

P(λ) = P_0 + λD.   (Equation 18)

The vanishing point of these parallel lines can be estimated by using the perspective projection and letting λ → ∞:

p_v = lim_{λ→∞} ( f P_x(λ)/P_z(λ), f P_y(λ)/P_z(λ) ) = ( f D_x/D_z, f D_y/D_z ).   (Equation 19)

The result in Equation 19 demonstrates that the line passing through the optical centre of the camera and the projection of the vanishing point of the parallel lines is collinear with the direction vector D of these lines, since (f D_x/D_z, f D_y/D_z, f) is proportional to D (Equation 20). According to the aforementioned equations, and taking into account that, by convention, the centre of the image is the intersection of the optical axis and the image plane, it is possible to conclude that the vanishing point of a set of lines parallel to the optical axis lies at the image centre. The optical zoom can be considered as a virtual movement of the scene along the optical axis; in this regard, any point in the scene follows a virtual line parallel to the optical axis. This suggests that, by tracing two points across a set of zoomed images, it is possible to define the lines L1 and L2 (see Fig. 4), which represent the projections of these virtual lines onto the image plane. It follows that the intersection of L1 and L2 corresponds with the image centre.

Fig. 4. Geometric determination of the image centre by using zoomed images. The intersection of the lines L1 and L2, defined by tracing two points across the zoomed images, corresponds with the image centre.
Once the equations of lines L1 and L2 have been estimated, it is possible to compute their intersection. The Panum area is then defined as a small neighbourhood around the intersection of these lines, which guarantees the fixation of any object even under changes in the field of view of the active cameras.
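The estimation of the image centre from two zoom tracks can be sketched as a pair of line fits followed by an intersection. Each track is an (N, 2) array with the image position of the same scene point at N zoom settings; the synthetic tracks and the assumed centre (320, 240) below are purely illustrative.

```python
import numpy as np

def image_centre(p1_track, p2_track):
    """Intersection of the two lines traced by two points across zoomed images."""
    def fit_line(track):
        # homogeneous line a*x + b*y + c = 0 via total least squares:
        # the last right-singular vector is the normal to the track direction
        pts = np.asarray(track, dtype=float)
        centroid = pts.mean(axis=0)
        _, _, vt = np.linalg.svd(pts - centroid)
        a, b = vt[-1]
        return a, b, -(a * centroid[0] + b * centroid[1])

    a1, b1, c1 = fit_line(p1_track)
    a2, b2, c2 = fit_line(p2_track)
    A = np.array([[a1, b1], [a2, b2]])
    rhs = -np.array([c1, c2])
    return np.linalg.solve(A, rhs)    # (x, y) of the image centre

# Two points moving radially away from an assumed centre as the zoom changes:
lam = np.array([1.0, 1.5, 2.0, 3.0])[:, None]
centre = np.array([320.0, 240.0])
track1 = centre + lam * np.array([50.0, 20.0])
track2 = centre + lam * np.array([-30.0, 40.0])
est = image_centre(track1, track2)
```

Fitting each line to several zoom settings, rather than to two frames only, makes the intersection robust to tracking noise.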

Developing the fixation-in-depth algorithm
Once the Panum area is properly defined, it is possible to develop an iterative angular-position control, based on disparity estimates, to fixate in depth any point in the field of view of the wide-angle camera. Fig. 5 shows a scheme of the angular-position control of the three-camera system. Any salient feature in the cyclopean image (the wide-angle image) provides the point (x, y), in image coordinates, used to set the version movement. Once the version movement is completed, the disparity estimation module provides information about the depth of the object in the scene; this information is used to iteratively improve the alignment of the images in the active cameras.
Considering that the angular position of the cameras is known at every moment, it is possible to use the disparity information around the Panum area to approximate the scene depth, that is, a new Z in the wide-to-active camera mapping equations (see Equation 16). If we take the left image as the reference, then the disparity information tells us how displaced the right image is; hence, the mean value δ̄ of the disparities around the Panum area can be used to estimate the angular displacement needed to align the left and right images. As the focal length f of the active cameras can be approximated from the current zoom value, the angular displacement θ can be estimated as

θ = arctan( c δ̄ / f ).   (Equation 21)

Once the angular displacement is estimated, the new Z parameter is obtained according to Equation 22. The angle θ_verg is half of the angular displacement θ, following (Rashbass & Westheimer, 1961). In order to iteratively improve the alignment of the images in the active cameras, the angle θ_verg is multiplied by a constant q < 1 in the angular-position control algorithm; this constant defines the convergence speed of the iterative algorithm.

Fig. 5. Angular-position control scheme of the trinocular system.

Benefits of using binocular disparity and optic flow in image segmentation
Image segmentation is an open research area in computer vision. The problem of properly segmenting an image has been widely studied, and several algorithms have been proposed for different practical applications over the last three decades. The perception of what is happening in an image can be thought of as the ability to detect many classes of patterns and statistically significant arrangements of image elements. Lowe (1984) suggests that human perception is mainly a hierarchical process in which prior knowledge of the world is used to provide higher-level structures, which, in their turn, can be further combined to yield new hierarchical structures; this line of thought was followed in (Shi & Malik, 2000). It is worth noting that low-level visual features like motion and disparity (see Fig. 6) can offer a first description of the world in certain practical applications (cf. Harville, 2004; Kolmogorov et al., 2005; Yilmaz et al., 2006; Zhao & Thorpe, 2000). The purpose of this section is to show the benefits of using binocular disparity and optic flow estimates in segmenting surveillance video sequences, rather than to contribute to the solution of the general problem of image segmentation. The following is a case study in which the proposed system segments all the individuals in a scene by using binocular disparity and optic flow. In the first processing stage, the system fixates the individuals in depth according to the aforementioned algorithm (see section 4); that is, an initial fast movement of the cameras (version) triggered by a saliency in the wide-angle camera, followed by a slower movement of the cameras (vergence) guided by the binocular disparity. In the second stage, the system changes the field of view of the active cameras in order to magnify the region of interest. Finally, in the last stage, the system segments the individuals in the scene by applying a threshold to the disparity information
(around disparity zero, i.e. the point of fixation) and a threshold to the orientation of the optic flow vectors. The results of applying the above processing stages are shown in Fig. 7. Good segmentation results can be achieved from the disparity measures alone by defining a set of thresholds (see Fig. 7b); however, a better segmentation is obtained by combining the partial segmentations from binocular disparity and optic flow; an example is shown in Fig. 7c. The segmentation results are constrained by the quality of the disparity and optic flow estimates. For this reason, it is necessary to follow segmentation strategies, like the one proposed by Shi & Malik (2000), in order to achieve the appropriate robustness in the data segmentation. In fact, they argue for the necessity of combining different features, such as colour, edges or, in general, any kind of texture information, to create a hierarchical partition of the image, based on graph theory, in which prior knowledge is used to confirm the current grouping or to guide further classifications.
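The threshold-based combination of the two cues can be sketched as the intersection of two binary masks: pixels near disparity zero (the point of fixation) and pixels whose flow orientation falls in a given range. The threshold values and the orientation range below are illustrative placeholders, not the values used by the system.

```python
import numpy as np

def segment(disparity, flow_u, flow_v, d_thresh=1.0, ang_range=(0.0, np.pi / 2)):
    """Intersect a disparity mask (around disparity zero, i.e. the point of
    fixation) with an optic-flow orientation mask (sketch)."""
    near_fixation = np.abs(disparity) < d_thresh
    angle = np.mod(np.arctan2(flow_v, flow_u), 2 * np.pi)
    moving = np.hypot(flow_u, flow_v) > 1e-3            # ignore static pixels
    in_range = (angle >= ang_range[0]) & (angle < ang_range[1])
    return near_fixation & moving & in_range

# Toy example: a 4x4 patch where only the top-left quadrant is both at the
# fixation depth and moving along a 45-degree direction.
d = np.full((4, 4), 5.0); d[:2, :2] = 0.2
u = np.zeros((4, 4)); v = np.zeros((4, 4))
u[:2, :2] = 1.0; v[:2, :2] = 1.0
mask = segment(d, u, v)
print(mask.sum())  # 4
```

Intersecting the cues is what helps with occlusions: two individuals that overlap in depth may still separate by motion direction, and vice versa.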

The system performance
So far, we have presented an active vision system capable of estimating both optic flow and binocular disparity through a biologically inspired strategy, and of using this information to change the viewpoint of the cameras in an open, uncontrolled environment. This capability lets the system interact with the environment to perform video surveillance tasks. The purpose of this work was to introduce a novel system architecture for an active vision system rather than to present a framework for performing specific surveillance tasks. Under this perspective, we first described the low-level vision approach for optic flow and binocular disparity, and then presented a robotic head which uses this approach to effectively solve the problem of fixation in depth.
In order to evaluate the performance of the system, it is necessary to differentiate the framework instances according to their role in the system. On the one hand, both optic flow and binocular disparity are to be used as prominent features for segmentation; hence, it is important to evaluate the accuracy of the proposed algorithms by using test sequences for which ground truth is available (see http://vision.middlebury.edu/). On the other hand, we must evaluate the system performance in relation to the accuracy of the binocular system in correctly changing the viewpoint of the cameras.

Accuracy of the distributed population code
The accuracy of the estimates has been evaluated for a system with N = 16 oriented filters, each tuned to M = 3 different velocities and to K = 9 binocular phase differences. The Gabor filters used have a spatio-temporal support of (11 × 11) pixels × 7 frames and are characterised by a bandwidth of 0.833 octave and a spatial frequency ω0 = 0.5π. Table 3 shows the results of the distributed population code applied to the most frequently used test sequences. The optic flow was evaluated by using the database described in (Baker et al., 2007) and the disparity was evaluated by using the one described in (Scharstein & Szeliski, 2002); however, the ground truth of the disparity test sequences contains horizontal disparities only; for this reason, the data set described in (Chessa, Solari & Sabatini, 2009) was also used to benchmark the 2D-disparity measures (horizontal and vertical).
Table 3. Performance of the proposed distributed population code. On the one hand, the reliability of the disparity measures has been computed in terms of the percentage of bad pixels (%BP) for non-occluded regions. On the other hand, the reliability of the optic flow measures has been computed by using the average angular error (AAE) proposed by Barron (Barron et al., 1994).
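The two benchmark metrics mentioned above can be computed as follows; this is a minimal sketch of the standard definitions (the AAE of Barron et al., 1994, and the %BP measure used in the Middlebury stereo evaluation), with illustrative function names:

```python
import numpy as np

def average_angular_error(u_est, v_est, u_gt, v_gt):
    """Average angular error (Barron et al., 1994): mean angle, in degrees,
    between the estimated and ground-truth flow vectors, both extended to
    3D as (u, v, 1) so that zero-flow regions remain well defined."""
    num = u_est * u_gt + v_est * v_gt + 1.0
    den = np.sqrt(u_est**2 + v_est**2 + 1.0) * np.sqrt(u_gt**2 + v_gt**2 + 1.0)
    return np.degrees(np.mean(np.arccos(np.clip(num / den, -1.0, 1.0))))

def percent_bad_pixels(d_est, d_gt, valid=None, tol=1.0):
    """Percentage of valid (e.g. non-occluded) pixels whose disparity
    error exceeds the tolerance tol (in pixels)."""
    if valid is None:
        valid = np.ones_like(d_gt, dtype=bool)
    bad = np.abs(d_est - d_gt) > tol
    return 100.0 * np.count_nonzero(bad & valid) / np.count_nonzero(valid)
```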

Distributed population code
A quantitative comparison between the proposed distributed population code and some well-established algorithms in the literature has been performed in (Chessa, Sabatini & Solari, 2009). The performances of the stereo and motion modules are shown in Table 3, which substantiates the suitability of binocular disparity and optic flow estimates for image segmentation; the visual results are shown in Fig. 7.
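The core mechanism behind the stereo module can be illustrated on a single image row with the standard phase-based disparity formulation: the disparity is recovered from the local phase difference of left/right Gabor responses as δ ≈ Δφ/ω0. This is only a one-dimensional sketch under that assumption; the chapter's population code extends it to N = 16 orientations and K = 9 phase differences, and the parameter values below simply mirror those quoted in the text:

```python
import numpy as np

def gabor_1d(x, omega0=0.5 * np.pi, sigma=3.0):
    """Complex 1D Gabor filter with peak frequency omega0 (rad/pixel)."""
    return np.exp(-x**2 / (2 * sigma**2)) * np.exp(1j * omega0 * x)

def phase_based_disparity(left_row, right_row, omega0=0.5 * np.pi, sigma=3.0):
    """Estimate horizontal disparity from the local phase difference of the
    Gabor responses to the left and right rows: delta ~ dphi / omega0."""
    x = np.arange(-5, 6)                    # 11-tap support, as in the chapter
    g = gabor_1d(x, omega0, sigma)
    r_left = np.convolve(left_row, g, mode='same')
    r_right = np.convolve(right_row, g, mode='same')
    dphi = np.angle(r_left * np.conj(r_right))   # wrapped phase difference
    return dphi / omega0
```

Note that the phase difference is wrapped to (−π, π], so a single channel can only measure disparities up to about π/ω0 pixels; this is one reason for using a population of filters tuned to several phase differences.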

Behaviour of the trinocular system
A good perception of the scene's depth is required to properly change the viewpoint of a binocular system. The disparity estimates presented above have proven to be a valuable cue for 3D perception. The purpose now is to demonstrate the capability of the trinocular head to fixate any object in the field of view of the wide-angle camera. In order to evaluate the fixation-in-depth algorithm, two different scenarios have been considered: the long-range scenario, in which the depth along the line of sight is larger than 50 metres (see Fig. 8), and the short-range scenario, in which the depth is between 10 and 50 metres (see Fig. 11). The angular-position control uses the disparity information to align the binocular images in the Panum area. In order to save computational resources, and considering that only a small area around the centre of the image carries the disparity information of the target object, the size of the Panum area has been empirically chosen as a square region of 40 × 40 pixels. Accordingly, the mean value of the disparity in the Panum area is used to iteratively estimate the new Z parameter.
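One iteration of this depth update can be sketched as below. The function names are illustrative, and the conversion from residual disparity to depth assumes a pinhole stereo relation (disparity ≈ b·f·(1/Z − 1/z) after verging at depth z); the chapter does not specify the control law, so this is only a plausible reconstruction:

```python
import numpy as np

def panum_mean_disparity(disparity_map, size=40):
    """Mean disparity inside the central (Panum) window of the image."""
    h, w = disparity_map.shape
    r0, c0 = (h - size) // 2, (w - size) // 2
    return float(np.mean(disparity_map[r0:r0 + size, c0:c0 + size]))

def update_fixation_depth(z, mean_disp, baseline, focal_px, gain=1.0):
    """One step of the iterative Z update: the residual mean disparity
    around the current fixation depth z implies a corrected depth via
    disparity = baseline * focal * (1/Z - 1/z); gain < 1 smooths the step."""
    implied = baseline * focal_px / (baseline * focal_px / z + mean_disp)
    return z + gain * (implied - z)
```

With a gain below one, repeated calls converge gradually, which matches the smooth vergence movements described in the text.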
In order to evaluate the performance of the trinocular head, we first tested the fixation strategy in the long-range scenario. In the performed tests, three points were chosen in the cyclopean image (see Fig. 8(a)). For each point, the active cameras performed a version movement according to the coordinate system of the cyclopean image and, immediately after, the angular-position control started the alignment of the images by changing the pan angles iteratively. Once the images were aligned, a new point in the cyclopean image was provided.
Fig. 9 shows the angular changes of the active cameras during the test in the long-range scenario. In Figs. 9(a) and 9(b) the pan angles of the left and right cameras, respectively, are depicted as a function of time. Fig. 9(c) shows the same variation for the common tilt angle. Each test point of the cyclopean image was manually selected after the fixation in depth of the previous one; consequently, the plots show the behaviour of the angular-position control during changes in the viewpoint of the binocular system. It is worth noting that the version movements correspond, roughly speaking, to the pronounced slopes in the graphs, while the vergence movements are smoother and therefore show a less pronounced slope. In a similar way, the fixation-in-depth algorithm was also evaluated in short-range scenarios by using three test points (see Fig. 11). We followed the same procedure used for the long-range scenarios, and the results are shown in Fig. 10.
From the plots in Figs. 9 and 10 we can observe that only small angular shifts were performed just after a version movement; this behaviour is due to two factors: (1) the inverse relationship between the vergence angle and the depth, by which for large distances the optical axes of the binocular system can be well approximated as parallel; and (2) the appropriate geometrical description of the system, which allows us to properly map the angular positions of the active cameras with respect to the cyclopean image. In fact, there are no significant differences between the long- and short-range scenarios in the angular-position control, because the vergence angles only become considerable for depths smaller than approximately 10 metres; it is worth noting that this value is highly dependent on the baseline of the binocular system.
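The inverse relationship can be made concrete: for symmetric fixation at depth Z with baseline b, the vergence angle is 2·atan(b / 2Z). A short numerical check follows; the 0.3 m baseline is an assumed value for illustration, not the chapter's actual baseline:

```python
import math

def vergence_angle_deg(depth_m, baseline_m=0.3):
    """Vergence angle (degrees) for symmetric fixation at a given depth."""
    return math.degrees(2.0 * math.atan(baseline_m / (2.0 * depth_m)))

# Beyond roughly 10 m the optical axes are nearly parallel for this baseline.
for z in (5, 10, 50, 100):
    print(f"Z = {z:3d} m -> vergence = {vergence_angle_deg(z):.3f} deg")
```

For this baseline the angle drops below two degrees beyond 10 m, which is why the residual vergence corrections after a version movement are small in both scenarios.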

Conclusion
We have described a trinocular active vision framework for video surveillance applications. The framework is able to change the viewpoint of the active cameras toward areas of interest, to fixate a target object at different fields of view, and to follow its motion. This behaviour is possible thanks to a rapid angular-position control of the cameras for object fixation and pursuit based on disparity information. The framework is capable of recording image frames at different scales by zooming into individual areas of interest; in this sense, it is possible to exhibit the target's identity or actions in detail. The proposed visual system is a cognitive model of visual processing that replicates computational strategies supported by neurophysiological studies of the mammalian visual cortex, which provide the system with a powerful framework to characterise and to recognise the environment. In this sense, the optic flow and binocular disparity information are an effective, low-level visual representation of the scene which provides a workable basis for segmenting dynamic scenarios; it is worth noting that these measures can easily disambiguate occlusions in the different scenarios.

Bio-Inspired Active Vision Paradigms in Surveillance Applications www.intechopen.com

Fig. 2. The neural architecture for the computation of disparity and optic flow.

Fig. 6. Example of how different scenes can be described by using our framework. The low-level visual features refer to both disparity and optic flow estimates.

Fig. 7. Case study: the segmentation of an image by using disparity and optic flow estimates.
Left Image, point A. (c) Right Image, point A. (d) Left Image, point B. (e) Right Image, point B. (f) Left Image, point C. (g) Right Image, point C.

Fig. 8. Long-range scenario: fixation of points A, B and C. A zoom factor of 16× was used in the active cameras. Along the line of sight the measured depths were approximately 80 m, 920 m, and 92 m, respectively.

Fig. 9. Temporal changes in the angular position of the active cameras to fixate in depth the points A, B and C in a long-range scenario.

Fig. 11. Short-range scenario: fixation of points A, B, and C. The different zoom factors used in the active cameras were 4×, 16×, and 4×, respectively. Along the line of sight the measured depths were approximately 25 m, 27 m, and 28 m, for points A, B, and C, respectively.

Table 1. General features of the moving platform.

Most video surveillance systems are networks of cameras deployed for a proper coverage of wide areas. These networks use either fixed or active cameras, or even a combination of both, placed

Table 2. Optical features of the cameras.