EM-Based Mixture Models Applied to Video Event Detection

Surveillance system (SS) development requires hi-tech support to prevail over the shortcomings related to the massive quantity of visual information from SSs. Anything but reduced human monitoring became impossible by means of its physical and economic implications, and an advance towards an automated surveillance becomes the only way out. When it comes to a computer vision system, automatic video event comprehension is a challenging task due to motion clutter, event understanding under complex scenes, multilevel semantic event inference, contextualization of events and views obtained from multiple cameras, unevenness of motion scales, shape changes, occlusions and object interactions among lots of other impairments. In recent years, state-of-the-art models for video event classification and recognition include modeling events to discern context, detecting incidents with only one camera, low-level feature extraction and description, high-level semantic event classification, and recognition. Even so, it is still very burdensome to recuperate or label a specific video part relying solely on its content. Principal component analysis (PCA) has been widely known and used, but when combined with other techniques such as the expectation-maximization (EM) algorithm its computation becomes more efficient. This chapter introduces advances associated with the concept of Probabilistic PCA (PPCA) analysis of video event and it also aims at looking closely to ways and metrics to evaluate these less intensive EM implementations of PCA and KPCA.


Introduction
Surveillance system (SS) development requires hi-tech support to prevail over the shortcomings related to the massive quantity of visual information from SSs [Fuentes & Velastin (2001)]. Anything but reduced human monitoring became impossible by means of its physical and economic implications, and an advance towards an automated surveillance becomes the only way out.
When it comes to a computer vision system, automatic video event comprehension is a challenging task due to motion clutter, event understanding under complex scenes, multilevel semantic event inference, contextualization of events and views obtained from multiple cameras, unevenness of motion scales, shape changes, occlusions and object interactions among lots of other impairments. In recent years, state-of-the-art models for video event classification and recognition [Zhang et al. (2011), Yacoob et al. (1999)] include modeling events to discern context, detecting incidents with only one camera (Ma et al. (2009), Zhao et al. (2002, Zelnik-Manor et al. (2006)], low-level feature extraction and description, high-level semantic event classification and recognition. Even so, it is still very burdensome to recuperate or label a specific video part relying solely on its content. prospective bases, and handles the implicit postulation of continuity in a data set. Let X be a matrix representing the original data set , be another matrix related by a linear transformation P that represents a change of basis. X is the original recorded data set and is its new representation PX=Y . (1) Geometrically, is a rotation and a stretch which again transforms X into Y. The covariance matrix of X is PCA also takes for granted that mean and variance are sufficient statistics are enough to depict a probability distribution. This happens to be the case with exponential distributions (Gaussian, Exponential, etc). Deviations from an exponential distribution could nullify this assumption. Diagonalizing a covariance matrix might not give acceptable results. This hypothesis guarantees that the and the covariance matrix totally portray noise and redundancies. The following factors can corrupt data: noise, rotation and redundancy. A common noise metric is the signal-to-noise ratio , or a ratio of variances as follows: ( A high ≫ indicates clean data, while a low points to noisy data. Large variances have important dynamics which means that the data is supposed to have a high . Thus, principal components (PCs) with larger associated variances correspond to interesting dynamics, while those with lower variances may characterize noise.
Returning to (1), X is an m×n matrix, where m is the number of measurement types and n is the number of samples. The goal is to find an orthonormal such that is diagonal and rows of P are the PCs of X. Because lots of real world data are normally distributed, PCA usually provides a robust solution to small deviations from this assumption. Rewriting in terms of P yields 11 () () where ≡XX T is symmetric with .
The eigenvectors of A are arranged as columns of E and is a diagonal matrix. A has orthonormal eigenvectors where is the rank of the matrix. The rank of A is less than when A is degenerate or all data reside in a subspace of dimension . Maintaining the constraint of orthogonality, this situation can be remediated by selecting further orthonormal vectors to complete E. These additional vectors do not influence the final solution because the variances associated with these directions are zero.
Each row p i is an eigenvector of XX T and form ≡ . Combining the previous equations results in A DP.
Since P -1 =P T , then C Y becomes In practice, computing of a data set X requires subtracting off the mean of each measurement type and the calculation of the eigenvectors of .

A more general solution: SVD
PCA relates closely to singular value decomposition , but is a more general method to deal with change of basis. Let X be an arbitrary matrix and be a symmetric square matrix with rank r. V={v 1 , v 2 , …,v r , 0,…,0} is the set of orthonormal eigenvectors associated with eigenvalues , ,…, , ,…, for the symmetric matrix X T X such that , where ≡ are positive real singular values and U={u 1 , u 2 ,…, u r , 0,…,0} is the set of orthonormal vectors defined by u i =(1/σ i )Xv i . V and U contain, respectively, (m-r) and (n-r) appended zeros and ⋯ are the rank-ordered set of singular values. The matrix version of is given by .
Because V is orthogonal, multiplying both sides of the expression above by leads to the final form of the SVD: , which states that any arbitrary matrix X can be converted to an orthogonal matrix, a diagonal matrix and another orthogonal matrix as follows: www.intechopen.com where ≡ . Note that is a change of basis from X to Z. The fact that the orthonormal basis transforms column vectors means that is a basis that spans the columns of X. Bases that span the columns are termed the column space of X. If ≡ , then the rows of (or the columns of V) are an orthonormal basis for transforming into Z. Because of the transpose of X, it follows that V is an orthonormal basis spanning the row space of X.
Matrices V and U are and respectively. Σ is a matrix with a small amount of non-zero values along its diagonal. The SVD allows for creating a new m ×n matrix as follows: where each column of has zero mean. The definition of becomes obvious by looking at : hence, is an and by construction equals the covariance matrix of X. The PCs of X are the eigenvectors of . Applying to , the columns of matrix V contain the eigenvectors of . Therefore, the columns of V are the PCs of X.
V spans the row space of ≡ √ . Therefore, V must also span the column space of √ .
We can conclude that finding the PCs amounts to finding an orthonormal basis that spans the column space of X. If the final goal is to find an orthonormal basis for the column space of X then we can calculate it directly without constructing Y. By symmetry the columns of produced by the S of √ must also be the PCs.
One benefit of is that we can examine the variances associated with the principal components. Often one finds that large variances associated with the first PCs, and then a precipitous drop-off. One can conclude that most interesting dynamics occur only in the first dimensions.
Both the strength and weakness of is that it is a non-parametric analysis. When data are not normally distributed fails. In exponentially distributed data, the axes with the largest variance do not correspond to the underlying basis. There are no parameters to tweak and no coefficients to adjust based on user experience: the answer is unique and independent of the user.
This also poses a problem, if some system characteristics are not known a-priori, then it makes sense to incorporate these assumptions into a parametric algorithm or an algorithm with selected parameters.
This prior non-linear transformation is sometimes termed a kernel transformation and the entire parametric algorithm is called . This procedure is parametric because the user must incorporate prior knowledge of the structure in the selection of the kernel but it is also more optimal in the sense that the structure is more concisely described.
One might envision situations where the PCs need not be orthogonal. Only the subspace is unique because the PCs are not uniquely defined. In addition, eigenvectors beyond the rank of a matrix (i.e. for rank) can be selected almost capriciously. Nevertheless, these degrees of freedom do not influence the qualitative features of the solution nor a dimensional reduction.
For instance, if an image contains a 2-D exponentially distributed data set, then the largest variances will not correspond to the meaningful axes and fails.

EM algorithm for PCA
There is a close relationship between the expectation-maximization (EM) algorithm and PCA, which leads to a faster implementation of the PPCA. The algorithm extracts a small number of eigenvectors and eigenvalues from large sets of high dimensional data. It is computationally efficient in space and time and does not require computing the sample covariance of the data.
PCA is largely used in data analysis due to its optimality in terms of mean squared error, and its linear scheme to reduce the dimensions of vectors, so that compression and decompression become simple operations to carry out given the model parameters.
Notwithstanding these interesting features, PCA has some deficiencies. The other two methods for finding the PCs are impractical for high dimensional data. Difficulties can arise in both computational complexity and data scarcity when diagonalizing a covariance matrix of vectors in a p-dimensional space when and amount to hundreds or several thousands of elements. It is often the case that there is not enough data in high dimensions for the sample covariance to be of full rank (data scarcity). Moreover, care needs to be taken in order to use techniques such as the SVD, which do not need full rank matrices. Complexity makes the direct diagonalization of a symmetric matrix with thousands of rows tremendously expensive (it is O(p 3 ) for inputs). There are procedures such as the one proposed by Wilkinson (1965) which is O(p 2 ) that decrease this cost when only the first most important eigenvectors and eigenvalues are necessary. The sample covariance calculation calls for O(np 2 ) operations.
In most cases, the explicit computation of the sample covariance matrix should be avoided. Methods such as the snap-shot algorithm from Sirovich (1987), which has complexity of O(n 3 ), take for granted that the eigenvectors sought out are linear combinations of the data points. In this section, a version of the EM algorithm from Dempster (1977) is presented for learning the PCs of a dataset. The algorithm does not require computing the sample covariance and has a complexity limited by O(knp) operations, where is the number of leading eigenvectors to be learned.
Usual PCA approaches cannot handle missing values: incomplete data must either be discarded or completed via ad-hoc interpolation techniques. A possible and uncomplicated solution is to replace missing coordinates with the mean of the known values in the corresponding coordinate or with estimation values relying on the known values. The EM algorithm for PCA benefits from the estimation of the maximum likelihood (ML) values for missing information directly at each iteration as stated by Ghahramani & Jordan (1994).
As a final point, independently of the technique used to perform PCA, there is no accurate probability model in the input space, because the probability density is not normalized in the PS . This means that once applying PCA to some data, the only criterion on hand to verify if the new data fit well the model is the squared distance of the new data from their projections into the PS. A data point distant from the training data but close to the PS will have a high pseudo-likelihood or low error. This chapter also brings in a model called sensible PCA SPCA) which delineates a proper covariance structure in the data space as proposed by Roweis (1998) whose main contribution was to alleviate the computational load of the other two techniques with the help of the EM algorithm.
PCA can be interpreted as a limiting case of a particular class of linear-Gaussian models (LGMs), because these models capture the covariance structure of an observed pdimensional variable y using less than p(p+1)/2 free parameters when compared to the full covariance matrix calculation.
LGMs do this by assuming that y is the result from a linear transformation of some k-dimensional x plus additive Gaussian noise. Denoting the transformation by the matrix and the (p-dimensional) noise by v (with covariance matrix R) the generative model can be written as , where ∼ , and ∼ , .
is considered independent and identically distributed (iid) according to a unit variance spherical Gaussian. Since is also iid and independent of x, the model reduces to a single Gaussian model for as follows: With the purpose of saving parameters over the direct covariance representation in p-space, it is indispensable to select and to curb the covariance structure of by constraining R. The second constraint allows the model to capture any interesting or informative projections in . If was not limited, then the algorithm could choose and would be the sample covariance of the data considering any deviation in the data as noise.
There are two central problems of interest when working with LGMs. Firstly, the compression problem asks if given fixed model parameters C and R, it is possible to gather information about the unknown x given a few observations y must be gathered. Since the data points are independent, one needs the posterior probability P(x|y) given the corresponding single observation, resulting in where 1 () TT   β CC C R gives the expected value βy of the unknown and an estimate of the uncertainty in this value in the form of the covariance (I-βC) . y from x can be obtained from P(x|y). Finally, the likelihood of any data point y comes from (16).
The second problem is called learning or parameter fitting. It seeks the matrices and that assign the highest likelihood to the observed data. There is a family of EM algorithms employing the inference formula above in the E-step to estimate the unknown and then choose as well as in the M-step, in order to maximize the expected joint likelihood of the estimated and the observed .
PCA is a limiting case of the LGM as the covariance of the noise becomes infinitesimally small and equal in all directions, that is 0 lim     RI . This makes the likelihood of subject exclusively to the squared distance between it and its reconstruction C X . The directions of the columns of which minimize this error are the PCs. Inference now becomes a simple least squares projection: or alternatively, Given that the noise became insignificant, the posterior over x collapses to a single point and the covariance turns out to be zero. Albeit the PCs can be computed explicitly, there is still an EM algorithm for the limiting case of zero noise. It can be easily derived from the standard algorithms (Sangers (1989), Oja (1989), Everitt (1984), Ghahramani et al. (1997)) by replacing the common E-step by the above projection as follows: where is a matrix containing all the observed data and X is a matrix with the unknowns. The columns of span the space of the first PCs. To explicitly compute the corresponding eigenvectors and eigenvalues, the data can be projected onto this kdimensional subspace to construct an ordered orthogonal covariance basis. This means that once an orientation for the PS was guessed, the presumed subspace is corrected and the data is projected onto it to give the values of . Next, the values of x are corrected and a subspace orientation is chosen to minimize the squared reconstruction errors of the data points.
Bear in mind that if is with and is rank then left multiplication by , which appears not to be well defined because (CC T ) is not invertible, is exactly equivalent to left multiplication by . This is the same as the SVD idea of defining the inverse of the diagonal singular value matrix as the inverse of an element except if it is zero when it stays zero. The perception is that even if CC T in fact is not invertible, the directions along which it is not invertible are just those that is about to project out.
The EM algorithm for PCA amounts to an iterative procedure for finding the leading eigenvectors without explicit computation of the sample covariance. Its complexity is limited by per iteration and so depends only linearly on both the dimensionality of the data and the number of points. Explicitly computing the sample covariance matrix result in complexities of , while other methods that form linear combinations of the data must calculate and diagonalize a matrix with all possible inner products between points and as a result have complexity.
According to Roweis (1998), the standard convergence proofs for EM given by Dempster (1977) are appropriate to this algorithm as well, so a solution will always attain a local maximum of likelihood. Additionally, it is assumed that PCA learning do not have a stable maxima other than the global optimum which results in convergence to the true PS. The rate of convergence depends on the ratio of the largest eigenvalue to the second largest eigenvalue; the closer the two are in magnitude the slower the convergence will be.
In the complete data setting, the values of the projections are viewed as missing information for EM. During the E-step these values are computed by means of projecting the observed data into the current subspace. This minimizes the model error given the observed data and the model parameters. However, if some of the input points lack certain coordinate values, then those values can be easily estimated in the same fashion. The E-step can be generalized as follows: E-step: For each (possibly incomplete) point find the unique pair of points * and * such that * lies in the current PS and * lies in the subspace defined by the known coordinates of y) which minimize the norm ‖ Cx * * ‖. Set the corresponding column of X to * and the corresponding column of to * .
If is complete then * and * is found exactly as before. If not, then * and * are the solution to a least squares problem and can be found by, for instance, factorization. Observe that this method is not restricted to missing coordinates in the data; the unknown degrees of freedom may lie in any directions in the space. This outperforms replacing each missing coordinate with the mean of known coordinates.

EM algorithm for Sensible PCA (SPCA)
If must have the form , but do not take the limit as → , then this model is called SPCA according to Roweis (1998). The columns of are still known as the PCs. From now on, the scalar value on the diagonal of is called global noise level. It is worth noting that SPCA uses / free parameters to model the covariance. Once again, inference is done with (17) and learning by an EM algorithm. Because it has a finite noise level, SPCA defines the following model and probability distribution in the data space: which makes possible to evaluate the actual likelihood of new test data under an SPCA model. Furthermore, this likelihood will be much lower for data far from the training set even if they are near the PS, unlike the reconstruction error from PCA. The EM algorithm for SPCA is: Since is diagonal, the inversion in the E-step can be performed efficiently using the matrix inversion lemma: Because only the trace of the matrix in the M-step is taken, there is no need to compute the full sample covariance XX T . Instead only the variance along each coordinate need to be computed. These two observations suggest that for small , learning for SPCA also have complexities limited by and not worse. Tipping & Bishop (1999) analyzed PCA from a probabilistic point of view and realized that probabilistic PCA (PPCA) is a special case of factor analysis. In addition, Rubin & Thayer (1984) developed the expectation-maximization (EM) learning algorithm for factor analysis. So, considering PPCA within the factor analysis framework, the principal components can be straightforwardly extracted by using the EM algorithm rather than performing eigenvalue decomposition. So, the computational burden on high dimensional data can be alleviated. Rosipal and Girolami (2001), transformed the EM procedure from data space to a nonlinearly related feature space. Thus, an EM approach to kernel PCA (KPCA) has arisen, which is very useful to find the nonlinear PCs. Scholkopf et al. (1998) introduced KPCA and demonstrated its value in machine learning as well as pattern recognition.

EM algorithm for KPCA
 KPCA needs to diagonalize the kernel matrix K, whose dimensionality N is equal to the number of data points and as the data set increases, KPCA becomes less viable due to the augmenting computational complexity O(N 3 ), which prohibits it from being used in many applications. Moreover, there is still the problem of numerical precision when diagonalizing large matrices directly according to Rosipal and Girolami, (2001 (2003) have introduced a constrained EM algorithm by using a coupled latent variables model. Their proposed EM approach can directly compute the eigensystem of sample covariance matrix in data space as well as that of the kernel matrix. For the most part, when it is applied to the kernel matrix K, it is a dual form of the constrained EM algorithm for performing KPCA.

EM algorithm for any positive semi-definite matrix
Let ,…, be the matrix consisting of N p-dimensional vectors known as observations, and ,…, be the q-dimensional latent variables associated with the data points of . The linear model relating an observed data vector to a corresponding latent variable is given by with n = 1, …, N; the parametrical matrix ∈ determines the connection between the data space, and the latent space. The p-dimensional noise vector is normally distributed with zero mean and covariance matrix . Vector x n is also zero mean and normally distributed with identity covariance. By marginalizing with respect to and optimizing W using the ML principle, Tipping & Bishop (1999) proved that the ML solution correspond to the situation when spans the PS of the observed data (PPCA model).
where the element-wise lower operator  was defined such that  (a ij ) = a ij , for and zero otherwise, and the upper operator  corresponds to  (a ij ) = a ij , for and zero if not. Ahn & Oh (2003) verified that as the noise level became infinitesimal, would converge to be the ML estimator / , where the columns of were the eigenvectors of the sample covariance matrix / and contained the related eigenvalues arranged diagonally in descending order of magnitudes. This removes the rotational ambiguity of algorithm as stated in (27) and (28).
The EM algorithm for PCA considers the latent variables as missing data. The -step of the LM algorithm evaluates the expectation of the corresponding complete-data loglikelihood with respect to the posterior distribution of given the observed . The expectations | and | form the basis of the E-step. In the M-step, the parameter W is updated to maximize the expected complete-data log-likelihood function, which is guaranteed to increase the likelihood of the observed samples as follows: Combining the E-step and M-stes yields The modified constrained EM algorithm comes from (33) and (34), can be further generalized to any positive semi-definite matrix S. In fact, by the incomplete Cholesky decomposition any positive semi-definite matrix can be factorized as , where ∈ and S have rank . The columns of L are samples of y n and apply the model (26). The modified constrained EM algorithm comes from (33) and (34). After convergence, the normalized columns of are the leading eigenvectors of (please, refer to Roweis (1998) and Dempster (1977)). The related eigenvalues are the diagonal elements of . The computational complexity of the proposed EM algorithm is per iteration.
The modified constrained EM algorithm (MCEM) even though simple in the derivation, is very useful in computing not only the eigen-system of sample covariance matrix in data space, but also that of kernel-based nonlinear algorithms. For instancee, the MCEM can be applied to KPCA directly. Given the set of observations Y, the basic idea is to first map these input data into some new feature space via a nonlinear function , followed by standard linear PCA using the mapped samples . Thus, KPCA turns out to compute the most important eigenvectors of the kernel matrix which is defined such that the elements , . , where is the kernel function which calculates the dot product between two mapped samples and . Hence, the mapping of does not need to be computed explicitly. If is a positive definite kernel, then there exists a mapping into the dot product space such that (36) holds.
To compute the leading eigenvectors of , it is enough to replace with using the MCEM algorithm which can be viewed as a dual form of the constrained EM algorithm for performing KPCA. The computational complexity is per iteration. The projection of a test point , whose image is onto the nonlinear principal axes is given by , where the columns of are normalized, i.e., the i-th column is divided by T ii wK w , such that the eigenvectors of the sample covariance matrix have unitary norms, and is the vector . ,…, .

PCA in video event detection
Visual surveillance demands video sequence understanding, the detection of predefined events prone to activate an alarm, the tasks to performed, environment/scenario acquaintance and, consequently, a superior computational performance [Siebel and Maybank (2002), Fuentes and Velastin (2001)]. Nevertheless, smart SSs face the intricate task of analyzing people and their activities. A number of clues may be acquired from the investigation of people trajectories and their relations (tracking). The investigation of a single blob location or path can decide whether a person is located in an illegal area, running, jumping or hiding as in Figures (1) and (2). These data from two or more people may reveal facts regarding their interaction.
The amount of people present in a scene is called density. Events may also be classified into as position-based and dynamic-based events. SSs need to handle more than one image channel or cameras simultaneously in real time to be effectively applied to security, and this calls for a simplification of the image processing stages. A background recognition technique relying on motion detection along with a simple tracking algorithm to extract real-time blob and scene features such as blob position, blob speed and people density can help building semantic descriptions of predefined occurrences. Contrasting sequence parameters with the semantic description of the associated events related to the current scenario, the system is able to spot them and to alert the ultimate decision-maker. The case study presented here is concerned with people-oriented SSs applied to public environments. Background regions can be segmented out by detecting the presence or absence of motion between at least two consecutive frames as can be seen in Figure (3). An SS may hint that an event has indeed occurred because of frame-by-frame blob investigation and tracking, which offer sufficient data to analyze some prior events and the occurrence odds of others in a video sequence. A matching procedure involving blobs from the current and preceding frames, and the overlapping of bounding boxes [Rivera et al. (2004)] can be used as evaluation criteria for ruling the interaction among people and things. This condition has proved helpful in other techniques [MacKenna et al. (2000)] and it does not necessitate the calculation of the blob position since blob motions are always considered smaller than their dimensions. Each new blob can be updated by means of the data accumulated from previous frames. If a new blob emerges, then the centroid position can be used to construct a new path. If two blobs join to form a new one, then this blob is labelled as a group and the information about combined blobs is stored for future use. This new blob group can be tracked independently. If the group splits once more, the system uses motion direction and blob characteristics to properly classify splitting blobs. Trajectories of single persons or cars can be effortlessly obtained following centroids in adjacent frames throughout a video sequence. Whenever needed, the tracked blob position can be interpolated before and after it became part of a group.
Camera networks, control rooms along with human resources and operators for surveillance have prompted lots of interest in automation of inspection tasks. These systems are also concerned with citizens' safety, people flow patterns (for counting purposes or as an aid to facilities planning), overcrowding of restricted or semi-open areas, atypical crowd movements, obstruction of exits, brawls, vandalism, falls/accidents, unattended objects, invasion of forbidden regions, and so on [Fuentes (2002)].
Blob recognition offers information about people coordinates in a 3 setting. A more accurate position calls for either geometric camera calibration or stereo processing. Some examples of event detection that can be done by means of centroid position analysis [Zhang et al. (2011), Fuentes et al. (2002] are 1. Unattended objects laying on the floor (blobs are normally smaller than people) presenting little or no motion, people falling on the ground and objects that move away from a person or reference point can be detected by means of a time analysis. 2. Suspicious activities related to hidden people may correspond to blobs disappearance along several consecutive frames. 3. Vandalism may involve isolation of one person/group present in the scene with irregular centroids motion and, possibly, changes in the background. 4. Temporal thresholding may be used to detect invasions and activate the alarms in case necessary actions must be taken. 5. Prevention of Attacks and Fights: Using people sensation of distance, and knowledge on social patterns of conduct, it is possible to establish a range of distances and profiles corresponding to different types of social interaction [MacKenna et al. (2000)]. By means of the analysis of fast centroids changes, important clues about the people present in a scene can be gathered, such as blob coincidence, merging of blobs and blob splitting to name a few.
Another way of posing the blob detection algorithm is to cluster displacement vectors and, then learn the blob centroids motions.

Case study: Motion estimation
Motion provides important information. Significant events, such as collision paths, object docking, sensor obstruction, object properties and occlusion can be characterized and understood with the help of the optic flow (OF). Segmenting an OF field (OFF) into coherent motion groups and estimating each underlying motion are very challenging tasks when a scene has several independently moving objects. The problem is further complicated by noise and/or data scarcity.
The main problem with motion analysis is the difficulty to get accurate motion estimates without prior motion segmentation and vice-versa. Pel-recursive (PR) schemes [Franz & Krapp (2000), Franz & Chahl (2003), Kim et al. (2005), Tekalp (1995)] can theoretically overcome some of the limitations associated with blocks by assigning a unique motion vector to each pixel.
Segmenting OF via EM algorithm for mixtures of PCs can be done successfully [Estrela & Galatsanos (2000), Tipping & Bishop (1999)] because both techniques share a close relationship. Most methods assume that there is little or no interference between the individual sample constituents or that all the constituents in the samples are known ahead of time. In real world samples, it is very unusual, if not entirely impossible, to know the entire composition of a mixture sample. Sometimes, only the quantities of a few constituents in very complex mixtures of multiple constituents are of interest [Blekas et al. (2005), Kim et al. (2005), Tipping & Bishop (1999)]. This section intends to solve OF problems by means of two different takes on PCA regression (PCR): 1) a combination of regularized least squares (RLS) and PCA (PCR 1 ); and 2) RLS followed by regularized PCA regression (PCR 2 ). Both involve simpler computational procedures than previous attempts at addressing mixtures [Blekas et al. (2005), Kim et al. (2005), Tipping & Bishop (1999), Jolliffe (2002), Wold et. al. (1983].

Problem formulation
The displacement of every pixel in each frame forms the displacement vector field (DVF) and its estimation can be done using at least two successive frames. The goal is to find the corresponding intensity value I k (r) of the k-th frame at location r = [x, y] T , and d(r) = [d x , d y ] T the corresponding displacement vector (DV) at the working point r in the current frame by means of algorithms that minimize the DFD function in a small area containing the working point assuming constant image intensity along the motion trajectory. The perfect registration of frames will result in I k (r)=I k-1 (r-d(r)) as seen in Figure (4). Figure (5) shows some examples of pixel neighborhoods. The DFD represents the error due to the nonlinear temporal prediction of the intensity field through the DV and is given by ∆(r;d(r))=I k (r)-I k-1 (r-d(r)) . An estimate of d(r), is obtained by directly minimizing (r,d(r)) or by determining a linear relationship between these two variables through some model. This is accomplished by using a Taylor series expansion of I k-1 (r-d(r)) about the location (r-d i (r)), where d i (r) represents a prediction of d(r) in the i-th step. This results in iT i K-1 ( , -( )) I ( -( )) e( , ( ))     rrd r u rd r rdr , where the displacement update vector is u=[u x , e(r, d(r)) stands for the truncation error resulting from higher order terms (linearization error) and =[/ x , / y ] T represents the spatial gradient operator. Considering all points in a neighborhood of pixels around r gives where the temporal gradients  (r, r-d i (r)) have been stacked to form the N×1 observation vector z containing DFD information on all the pixels in a neighborhood, the N×2 matrix G is obtained by stacking the spatial gradient operators at each observation, and the error terms have formed the N×1 noise vector n. The PR estimator for each pixel located at position r of a frame k can be written as where u i (r) is the current motion update vector obtained through a motion estimation procedure that attempts to find u, d i (r) is the DV at iteration i and d i+1 (r) is the corrected DV. The ordinary least squares (OLS) estimate of the update vector is which is given by the minimizer of the functional J(u)=║z-Gu║ 2 . The assumptions made about n for least squares estimation are E(n) = 0, and Var(n) = E(nn T ) = σ 2 I N , where E(n) is the expected value (mean) of n, and I N , is the identity matrix of order N. From now on, G will be analyzed as being an N×p matrix in order to make the whole theoretical discussion easier. Since G may be very often ill conditioned, the solution given by the previous expression will be usually unacceptable due to the noise amplification resulting from the calculation of the inverse matrix G T G. In other words, the data are erroneous or noisy.
The regularized minimum norm solution also known as regularized least square (RLS) solution is given by The RLS estimate of the motion update vector can be improved by a strategy that uses local properties of the image. Each row of G has entries [g xi , g yi ] T , with i = 1, …, N. The spatial gradients of I k-1 are calculated through a bilinear interpolation scheme similar to what is done in [Estrela & Galatsanos (2000), Estrela & Galatsanos (1998)]. The entries 1 () k f  r corresponding to a given pixel location inside a causal mask is needed to compute the spatial gradients by means of bilinear interpolation [Estrela & Galatsanos (2000), Estrela & Galatsanos (1998) where x   is the largest integer that is smaller than or equal to x, the bilinear interpolated intensity 1 () k f  r is specified by 00 10 1 01 11 (1 )(

On the use of PCA in regression
The main idea behind the two proposed PCR procedures is the PCA of the G matrix [Jackson (1991), Jolliffe (2002)].
Each successive component explains portions of the variance in the total sample. PCA relates to the second statistical moment of G, which is proportional to G T G and it partitions G into matrices T and P (sometimes called scores and loadings, respectively), such that: T contains the eigenvectors of G T G ordered by their eigenvalues with the largest first and in descending order. When dimensionality reduction is needed, the number of components can be chosen via examination of the eigenvalues or, for instance, considering the residual error from cross-validation [Estrela & Galatsanos (2000), Jolliffe (2002]. The PCR motion estimation algorithms will keep the PCs and use them to group DVs inside a neighborhood. The resulting clusters will give an idea about the mixture of MVs inside a mask. The formal solution PCR 1 may be written as 1 () 

T-1 T PCR uP T T T z , and 
where a regularization matrix Λ tries to compensate for deviations from the smoothness constraint. In PCR 1 , the scores vectors (columns in T) of different components are orthogonal. PCR 1 uses a truncated inverse where only the scores corresponding to large eigenvalues are included. The criteria for deciding when the PCR 1 estimator is superior to OLS estimators depend on the values of the true regression coefficients in the model. The previous solution can also be regularized: 2 ()  T-1 T PCR uP T T Ξ Tz.
www.intechopen.com with Ξ standing for a regularization matrix in the PC domain. Grouping objects can be posed as a mathematical problem consisting of finding region boundaries. Sometimes the problem is such that a sample may belong to more than one class at the same time, or not belong to any class. In this method, each class is modeled by a multivariate normal in the score space from PCA. Two measures are used to determine whether a sample belongs to a specific class or not: the leverage-the Mahalanobis distance to the center of the class, the class boundary being computable as an ellipse and the norm of the residual, which must be lower than a critical value. Figure (6) shows a set of observations plotted with respect to the first two principal components (PCs). It is likely that the four clusters correspond to four different types of DVs (see ellipses). For a big neighborhood, it could happen that these vectors would not be readily distinguished using only one variable at a time, but the plot with respect to the two PCs clearly distinguishes the populations. PCR estimates are biased, but may be more accurate than OLS estimates in terms of mean square error. Nevertheless, when severe multicollinearity is suspected, it is recommended that at least one set of estimates in addition to the OLS estimates be computed since these estimates may help interpreting the data in a different way. When PCA reveals the instability of a particular data set, one should first consider using least squares regression on a reduced set of variables. If least squares regression is still unsatisfactory, only then should principal components be used. Besides exploring the most obvious approach, it reduces the computer load. Outliers and other observations should not be automatically removed, because they are not necessarily bad observations. As a matter of fact, they can signal some change in the scene context and if they make sense according to the above-mentioned criteria, they may be the most informative points in the data. For example, they may indicate that the data did not come from a normal population or that the model is not linear.
When cluster analysis is used for video scene dissection, the aim of a two-dimensional plot with respect to the first two PCs will almost always be to verify that a given dissection 'looks' reasonable. Hence, the diagnosis of areas containing motion discontinuities can be significantly improved. If additional knowledge on the existence of borders is used, then one's ability to predict the correct motion will increase. PCs can be used for clustering, given the links between regression and discrimination. The fact that separation among populations may be in the directions of the last few PCs does not mean that PCs should not be used at all. In regression, their uncorrelatedness implies that each PC can be assessed independently. To classify a new observation, the least distance cluster is picked up. If a datum is not close to any of the existing groups, it may be an outlier or come from a new group about which there is currently no information. Conversely, if the classes are not well separated, some future observations may have small distances from more than one class. In such cases, it may again be undesirable to decide on a single possible class; instead, two or more groups may be listed as possible loci for the observation.
The average improvement in motion compensation for a sequence of K frames it turns out to be [Estrela & Galatsanos (2000)]:   When it comes to motion estimation, one seeks algorithms that have high values of () IMC dB . A perfect registration of motion leads to () IMC dB =. Figure (7) illustrates the evolution of () k IMC dB as a function of the frame number for two noiseless sequences: "Foreman" and "Mother and Daughter". PCR 2 works outperforms the other estimators due to the use of regularization in the PC domain. Figure (8) shows the DVFs for the "Rubik Cube" sequence with SNR=20 dB.  Corresponding displacement vector field for a 31×31 mask obtained by means of PCR 1 with SNR=20 dB; and (c) PCR 2 , 31×31 mask with SNR=20 dB.

Final comments
The methods developed in this chapter allow simple and efficient computation of a few eigenvectors and eigenvalues when working with many data points in high dimensions. They rely on PPCA and the MCEM algorithm, which permit this calculation even in the presence of missing data.
The EM algorithms for PCA and KPCA derived above using probabilistic arguments are closely related to two well know sets of algorithms. The first are power iteration methods for solving matrix eigenvalue problems. Roughly speaking these methods iteratively update their eigenvector estimates through repeated multiplication by the matrix to be diagonalized. In the case of PCA explicitly forming the sample covariance and multiplying by it to perform such power iterations would be disastrous. However, since the sample covariance is in fact a sum of outer products of individual vectors we can multiply by it efficiently without ever computing it. In fact, the EM algorithm is exactly equivalent to performing power iterations for finding using this trick. Iterative methods for partial least squares (e.g. the NIPALS algorithm) are doing the same trick for regression. Taking the singular value decomposition (SVD) of the data matrix directly is a related way to find the PS. If the Lanczos' and Arnoldi's methods are used to compute this SVD, then the resulting iterations are similar to those of the EM algorithm. The second class of methods comprises competitive learning methods for finding the PS, such as Sanger (1989) and Oja (1989) suggest. These methods enjoy the same storage and time complexities as the EM algorithm however, their update steps reduce but do not minimize the cost and so they typically need more iterations and require a learning rate parameter to be set by hand.
In this chapter, two PCR frameworks for the detection of motion fields are discussed. Both algorithms combine regression and PCA. The resulting transformed variables are uncorrelated. Unlike other works ([8, 11, 12]), the interest here is not in reducing the dimensionality of the feature space describing different types of motion inside a neighborhood surrounding a pixel. Instead, we use them in order to validate motion estimates. They can be seen as simple alternative ways of dealing with mixtures of motion displacement vectors. PCR 1 and PCR2 performed better than RLS estimators for noiseless and noisy images. More experiments are still needed in order to test the proposed algorithms with different types and levels of noise, so that the classification can be improved. It is also necessary to incorporate more statistical information in our models and to analyze if this knowledge will improve the outcome.