Robust Principal Component Analysis for Background Subtraction: Systematic Evaluation and Comparative Analysis

The analysis and understanding of video sequences is currently an active research field. Many applications, such as video surveillance, optical motion capture, and multimedia, first need to detect the objects moving in a scene filmed by a static camera. This requires the basic operation of separating the moving objects, called the "foreground", from the static information, called the "background". Many background subtraction methods have been developed (Bouwmans et al. (2010); Bouwmans et al. (2008)). A recent survey (Bouwmans (2009)) shows that subspace learning models are well suited for background subtraction. Principal Component Analysis (PCA) has been used to model the background by significantly reducing the dimension of the data. To make PCA robust to outliers, different Robust Principal Component Analysis (RPCA) models have recently been developed in the literature. The background sequence is then modeled by a low-rank subspace that can gradually change over time, while the moving foreground objects constitute the correlated sparse outliers. However, authors compare their algorithm only with PCA (Oliver et al. (1999)) or with another RPCA model. Furthermore, the evaluation is not made with the datasets and the measures currently used in the field of background subtraction. Considering all of this, we propose to evaluate RPCA models in the field of video surveillance. The contributions of this chapter can be summarized as follows: 1) a survey of robust principal component analysis and 2) an evaluation and comparison on different video surveillance datasets.


Introduction
The analysis and understanding of video sequences is currently an active research field. Many applications, such as video surveillance, optical motion capture, and multimedia, first need to detect the objects moving in a scene filmed by a static camera. This requires the basic operation of separating the moving objects, called the "foreground", from the static information, called the "background". Many background subtraction methods have been developed (Bouwmans et al. (2010); Bouwmans et al. (2008)). A recent survey (Bouwmans (2009)) shows that subspace learning models are well suited for background subtraction. Principal Component Analysis (PCA) has been used to model the background by significantly reducing the dimension of the data. To make PCA robust to outliers, different Robust Principal Component Analysis (RPCA) models have recently been developed in the literature. The background sequence is then modeled by a low-rank subspace that can gradually change over time, while the moving foreground objects constitute the correlated sparse outliers. However, authors compare their algorithm only with PCA (Oliver et al. (1999)) or with another RPCA model. Furthermore, the evaluation is not made with the datasets and the measures currently used in the field of background subtraction. Considering all of this, we propose to evaluate RPCA models in the field of video surveillance. The contributions of this chapter can be summarized as follows:
• A survey regarding robust principal component analysis
• An evaluation and comparison on different video surveillance datasets
The rest of this chapter is organized as follows. In Section 2, we first provide the survey on robust principal component analysis. In Section 3, we evaluate and compare robust principal component analysis models for background subtraction. Finally, the conclusion is given in Section 4.

Principal component analysis
Assume that the video is composed of n frames of size width × height. We arrange this training video in a rectangular matrix $A \in \mathbb{R}^{m \times n}$, where m is the total number of pixels: each video frame is vectorized into a column of A, and each row corresponds to a specific pixel and its evolution over time. PCA first consists of decomposing the matrix A into the product $USV'$, where $S \in \mathbb{R}^{n \times n}$ is the diagonal matrix of singular values, and $U \in \mathbb{R}^{m \times n}$ and $V \in \mathbb{R}^{n \times n}$ contain the singular vectors. Then only the principal components are retained. To solve this decomposition, the following function is minimized:
$$(S_0, U_0, V_0) = \arg\min_{S,U,V} \sum_{r=1}^{\min(n,m)} \Big\| A - \sum_{k=1}^{r} S_{kk}\, U_{:k} V_{:k}' \Big\|_F^2 \quad \text{subject to} \quad U'U = V'V = I_n,\; S_{ij} = 0 \text{ if } i \neq j \qquad (1)$$
where $U_{:k}$ and $V_{:k}$ denote the k-th columns of U and V. This implies that the singular values are strictly sorted and that the singular vectors are mutually orthogonal. The solutions $S_0$, $U_0$ and $V_0$ of (1) are not unique.
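We can define $U_1$ and $V_1$, the set of cardinality $2^{\min(n,m)}$ of all solutions, obtained by flipping the sign of matching columns of $U_0$ and $V_0$. We choose k (small) principal components, that is, the first k columns of $U_1$, collected in $U_k$. The background is computed as follows:
$$Bg = U_k U_k' v$$
where v is the current frame. The foreground detection is made by thresholding the difference between the current frame v and the reconstructed background image (in Iverson notation):
$$Fg = [\, |v - Bg| > T \,]$$
where T is a constant threshold.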
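The projection and thresholding steps described above can be sketched in a few lines. This is a minimal illustration only, not the implementation evaluated in this chapter: it assumes a single principal component (k = 1), replaces the full SVD by a power iteration on AA′, and uses hypothetical function names and toy 4-pixel frames.

```python
import math

def top_component(frames, iters=50):
    """First left singular vector of A, whose columns are the vectorized frames.

    Assumption: a power iteration on A A' stands in for a full SVD (k = 1 only).
    """
    m = len(frames[0])
    u = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        # w = A (A' u): project u on every frame, then recombine the frames.
        coeffs = [sum(f[p] * u[p] for p in range(m)) for f in frames]
        w = [sum(c * f[p] for c, f in zip(coeffs, frames)) for p in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    return u

def background(u, v):
    """Bg = u u' v: projection of the current frame v on the component u."""
    coeff = sum(up * vp for up, vp in zip(u, v))
    return [coeff * up for up in u]

def foreground_mask(v, bg, threshold):
    """Fg = [|v - Bg| > T], evaluated pixel by pixel."""
    return [abs(vp - bp) > threshold for vp, bp in zip(v, bg)]

# Toy training video: one static scene seen under three illumination gains.
scene = [10.0, 200.0, 50.0, 120.0]
training = [[g * x for x in scene] for g in (0.9, 1.0, 1.1)]
u = top_component(training)

# Current frame: a bright object covers the third pixel.
current = [10.0, 200.0, 250.0, 120.0]
bg = background(u, current)
mask = foreground_mask(current, bg, threshold=50.0)
print(mask)  # [False, False, True, False]
```

In this toy case the object is detected, but the reconstructed background is visibly biased by it (the second pixel is reconstructed at about 235 instead of 200), which illustrates how a plain projection lets outliers leak into the background estimate.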
Results obtained by Oliver et al. (1999) show that PCA provides a robust model of the probability distribution function of the background, but not of the moving objects, since they do not make a significant contribution to the model. As developed in Bouwmans (2009), this model presents several limitations. The first limitation is that the foreground objects must be small and must not appear in the same location for a long period in the training sequence. The second limitation concerns background maintenance. Indeed, it is computationally intensive to perform model updating using batch-mode PCA. Moreover, without a mechanism of robust analysis, the outliers or foreground objects may be absorbed into the background model. The third limitation is that the application of this model is mostly limited to gray-scale images, since the integration of multi-channel data is not straightforward: it involves a much higher-dimensional space and makes the data more difficult to manage in general. Another limitation is that the representation is not multimodal, so various illumination changes cannot be handled correctly. In this context, several robust PCA models can be used to alleviate these limitations.

RPCA via Robust Subspace Learning
Torre & Black (2003) proposed Robust Subspace Learning (RSL), a batch robust PCA method that aims at recovering a good low-rank approximation that best fits the majority of the data. RSL solves a nonconvex optimization via alternating minimization, based on the idea of soft-detecting and down-weighting the outliers. In earlier formulations, the reconstruction coefficients can be arbitrarily biased by an outlier, and a binary outlier process is used, which either completely rejects or includes a sample. RSL instead introduces a more general analogue outlier process that has computational advantages and provides a connection to robust M-estimation.
The energy function to minimize is then:
$$E(U, S, V, \mu) = \sum_{i=1}^{n} \sum_{p=1}^{m} \rho\big(e_{pi}, \sigma_p\big), \qquad e_{pi} = \big(A - \mu 1_n' - USV'\big)_{pi}$$
where µ is the mean vector, $1_n$ is the vector of n ones, and ρ belongs to a particular class of robust ρ-functions (Black & Rangarajan (1996)). They use the Geman-McClure error function $\rho(x, \sigma_p) = \frac{x^2}{x^2 + \sigma_p^2}$, where $\sigma_p$ is a scale parameter that controls the convexity of the robust function. Similarly, the penalty term associated with the outlier process $L_{pi}$ is $(\sqrt{L_{pi}} - 1)^2$. The robustness of De La Torre's algorithm is due to this ρ-function. This is confirmed by the presented results, which show that RSL outperforms standard PCA on scenes with illumination changes and people in various locations.
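To make the role of the ρ-function concrete, the sketch below evaluates the Geman-McClure function and the matching outlier weight for a few residuals. It is an illustration under stated assumptions, not the RSL code: the outlier-process objective $L x^2/\sigma^2 + (\sqrt{L} - 1)^2$ is assumed here as the form associated with the Geman-McClure function, and all function names are hypothetical.

```python
def geman_mcclure(x, sigma):
    """Geman-McClure error function rho(x, sigma) = x^2 / (x^2 + sigma^2)."""
    return x * x / (x * x + sigma * sigma)

def outlier_weight(x, sigma):
    """Weight L in (0, 1] minimizing L * x^2 / sigma^2 + (sqrt(L) - 1)^2.

    Assumption: this objective is the outlier-process form of the
    Geman-McClure function; setting its derivative with respect to L
    to zero gives the closed form below.
    """
    s = sigma * sigma / (sigma * sigma + x * x)
    return s * s

def outlier_process_energy(x, sigma, weight):
    """Weighted squared residual plus the penalty (sqrt(L) - 1)^2."""
    return weight * x * x / (sigma * sigma) + (weight ** 0.5 - 1.0) ** 2

sigma = 10.0
for residual in (1.0, 10.0, 100.0):
    w = outlier_weight(residual, sigma)
    # The energy at the optimal weight matches the Geman-McClure value.
    print(residual, geman_mcclure(residual, sigma), w,
          outlier_process_energy(residual, sigma, w))
```

The error saturates below 1 however large the residual becomes, and the weight of a sample decays smoothly towards 0 instead of being switched off, which is the "soft" down-weighting of outliers mentioned above.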

RPCA via Principal Component Pursuit
Candes et al. (2009) achieved robust PCA by the following decomposition:
$$A = L + S$$
where L is a low-rank matrix and S must be a sparse matrix. The straightforward formulation is to use the $L_0$ norm to minimize the energy function:
$$\arg\min_{L,S} \operatorname{rank}(L) + \lambda \|S\|_0 \quad \text{subject to} \quad A = L + S$$
where λ is a balancing parameter. But this problem is NP-hard, and a typical solution might involve a search with combinatorial complexity. To solve it more easily, the natural way is to relax the minimization with the $L_1$ norm, which provides an approximate convex problem:
$$\arg\min_{L,S} \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad A = L + S$$
where $\|\cdot\|_*$ is the nuclear norm (the $L_1$ norm of the singular values). Under minimal assumptions, the PCP solution perfectly recovers the low-rank and sparse matrices, provided that the rank of the low-rank matrix and the number of non-zero entries of the sparse matrix are bounded by the following inequalities:
$$\operatorname{rank}(L) \le \rho_r \frac{\min(m,n)}{\mu\, (\log \max(m,n))^2}, \qquad \|S\|_0 \le \rho_s\, m n$$
where $\rho_r$ and $\rho_s$ are positive numerical constants, µ is the incoherence parameter of L, and m and n are the dimensions of the matrix A.
In the following, λ is chosen as follows:
$$\lambda = \frac{1}{\sqrt{\max(m,n)}}$$
The presented results show that PCP outperforms RSL in the case of varying illumination and bootstrapping issues.
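As a small worked example of the quantities involved in PCP, the sketch below computes the balancing parameter using the choice λ = 1/√max(m, n) from Candes et al. (2009), and compares the $L_0$ and $L_1$ measures of a toy sparse component. The matrix sizes, the toy values, and the function names are assumptions made for illustration.

```python
import math

def pcp_lambda(m, n):
    """Balancing parameter lambda = 1 / sqrt(max(m, n)) for an m x n matrix."""
    return 1.0 / math.sqrt(max(m, n))

def l0_norm(matrix):
    """Number of non-zero entries (the combinatorial sparsity measure)."""
    return sum(1 for row in matrix for x in row if x != 0)

def l1_norm(matrix):
    """Sum of absolute values (the convex surrogate used by PCP)."""
    return sum(abs(x) for row in matrix for x in row)

# Hypothetical setting: 200 frames of 160 x 120 pixels, so A is 19200 x 200.
m, n = 160 * 120, 200
lam = pcp_lambda(m, n)

# Toy sparse component with two outlier pixels.
S = [[0.0, 0.0, 40.0],
     [0.0, -5.0, 0.0]]
print(lam, l0_norm(S), l1_norm(S))
```

The $L_0$ measure only counts the outliers, whereas the $L_1$ surrogate also grows with their magnitude, which is the price paid for convexity.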

RPCA via templates for first-order conic solvers
Becker et al. (2011) used the same idea as Candes et al. (2009): a matrix A can be broken into two components, A = L + S, where L is low-rank and S is sparse.
The inequality-constrained version of RPCA uses the same objective function, but instead of the constraint L + S = A, the constraint is:
$$\arg\min_{L,S} \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad \|L + S - A\|_\infty \le \alpha$$
where α bounds the quantization error. In practice, the matrix A is composed of data generated by a camera; consequently, its values are quantized (rounded) to 8 bits and bounded between 0 and 255. Suppose $A_0 \in \mathbb{R}^{m \times n}$ is the ideal data composed of real values; it is more accurate to perform the exact decomposition on $A_0$. Thus, we can assert $\|A_0 - A\|_\infty < \frac{1}{2}$ with $A_0 = L + S$. The results show improvements for dynamic backgrounds.
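The quantization argument above can be checked numerically. The following sketch rounds hypothetical real-valued intensities to 8 bits and measures the largest deviation; it assumes that the infinity norm is taken entrywise, and the values and function names are illustrative only.

```python
def quantize_8bit(value):
    """Round a real intensity to the nearest integer and clamp it to [0, 255]."""
    return min(255, max(0, round(value)))

def max_abs_difference(a0, a):
    """Entrywise infinity norm of A0 - A: largest absolute deviation."""
    return max(abs(x0 - x) for row0, row in zip(a0, a) for x0, x in zip(row0, row))

# Hypothetical ideal intensities A0 (real-valued, already inside [0, 255]).
A0 = [[0.2, 17.49, 128.51],
      [254.9, 63.3, 99.75]]
A = [[quantize_8bit(x) for x in row] for row in A0]

print(A)
print(max_abs_difference(A0, A) < 0.5)
```

A value lying exactly halfway between two integers would reach the bound 1/2 itself, so the strict inequality holds for all other inputs.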

RPCA via Bayesian framework
Ding et al. (2011) proposed a hierarchical Bayesian framework for decomposing a matrix (A) into low-rank (L), sparse (S), and noise (E) matrices. In addition, the Bayesian framework allows the exploitation of additional structure in the matrix. A Markov dependency is introduced between consecutive columns of the matrix (one per frame), which imposes an appropriate temporal dependency, because moving objects are strongly correlated across consecutive frames. A spatial dependency assumption is also added, which introduces the same Markov constraint as the temporal one using the local neighborhood. Indeed, it forces the sparse outlier component to be spatially and temporally connected. Thus the decomposition is made as follows:
$$A = L + S + E$$
where L is the low-rank matrix, S is the sparse matrix and E is the noise matrix. Then some assumptions about the distributions of the components are made:
• The singular vectors (U and V′) are drawn from a normal distribution.
• The singular values and the sparse matrix values (S and X) are drawn from a normal-gamma distribution.
• The sparseness masks ($B_L$ and $B_S$) are drawn from a Bernoulli-beta process.
Note that the $L_1$ minimization is carried out as an $l_0$ minimization (the number of non-zero values is fixed by the sparseness mask), followed by an $l_2$ minimization on the non-zero values.
The matrix A is assumed noisy, with unknown and possibly non-stationary noise statistics.
The Bayesian framework infers an approximate representation for the noise statistics while simultaneously inferring the low-rank and sparse-outlier contributions: the model is robust to a broad range of noise levels, without having to change the model hyperparameter settings. The properties of this Markov process are also inferred from the observed matrix, while simultaneously denoising and recovering the low-rank and sparse components. Ding et al. (2011) applied it to background modelling, and the results obtained show more robustness to noisy backgrounds, slowly changing foregrounds, and bootstrapping issues than RPCA via convex optimization (Wright et al. (2009)).

Comparison
In this section, we present the evaluation of the five RPCA models (RSL, PCP, TFOCS, IALM, Bayesian) and the basic average algorithm (SUB) on three different datasets used in video surveillance: the Wallflower dataset provided by Toyama et al. (1999), the dataset of Li et al. (2004), and the dataset of Sheikh & Shah (2005). Qualitative and quantitative results are provided for each dataset.

Wallflower dataset
We have chosen this particular dataset provided by Toyama et al. (1999) because it is frequently used in this field. This frequency is due to its faithful representation of real-life situations typical of scenes subject to video surveillance. Moreover, it consists of seven video sequences, each presenting one of the difficulties a practical task is likely to encounter (e.g. illumination changes, dynamic backgrounds). The size of the images is 160 × 120 pixels. A brief description of the Wallflower image sequences follows:
• Moved Object (MO): A person enters a room, makes a phone call, and leaves. The phone and the chair are left in a different position. This video contains 1747 images.
• Time of Day (TOD): The light in a room gradually changes from dark to bright. Then, a person enters the room and sits down. This video contains 5890 images.
• Light Switch (LS): A room scene begins with the lights on. Then a person enters the room and turns off the lights for a long period. Later, a person walks into the room and switches on the light. This video contains 2715 images.
• Waving Trees (WT): A tree is swaying and a person walks in front of the tree. This video contains 287 images.
• Camouflage (C): A person walks in front of a monitor, which has rolling interference bars on the screen. The bars include colors similar to the person's clothing. This video contains 353 images.
• Bootstrapping (B): The image sequence shows a busy cafeteria and each frame contains people. This video contains 3055 images.
• Foreground Aperture (FA): A person with a uniformly colored shirt wakes up and begins to move slowly. This video contains 2113 images.
For each sequence, the ground truth is provided for one image, taken when the algorithm has to show its robustness to a specific change in the scene. Thus, the performance is evaluated against hand-segmented ground truth. Four terms are used in the evaluation:
• True Positive (TP) is the number of foreground pixels that are correctly marked as foreground.
• False Positive (FP) is the number of background pixels that are wrongly marked as foreground.
• True Negative (TN) is the number of background pixels that are correctly marked as background.
• False Negative (FN) is the number of foreground pixels that are wrongly marked as background.
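We also compute the detection rate and the precision:
$$DR = \frac{TP}{TP + FN}, \qquad Precision = \frac{TP}{TP + FP}$$
The detection rate gives the percentage of foreground pixels of the ground truth that are correctly classified as foreground. Precision gives the percentage of correctly classified foreground pixels among all the pixels classified as foreground by the method. A good performance is obtained when the detection rate is high without altering the precision.
We also computed the F-measure used in Maddalena & Petrosino (2010) as follows:
$$F = \frac{2 \cdot DR \cdot Precision}{DR + Precision}$$
Table 2 shows the results obtained by the different algorithms on each sequence. For each sequence, the first column shows the original image and the corresponding ground truth.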
The second part presents the sparse matrix in the first row and the optimal foreground mask in the second row. The detection rate (DR), Precision (Prec) and F-measure (F) are indicated below each foreground mask. Fig. 1 shows two cumulative histograms of the F-measure: the left (resp. right) figure shows the cumulative score by method (resp. sequence). PCP gives the best result, followed by RSL, IALM, TFOCS, and Bayesian RPCA. This ranking has to be taken with caution, because a poor performance on one sequence influences the overall F-measure and can then change the rank because of just one sequence. For example, the Bayesian RPCA obtained a bad score because of the following assumption: the background necessarily has a bigger area than the foreground. This assumption is violated in the sequences Camouflage and Light Switch.
In the first case, the person hides more than half of the screen space. In the second case, when the person switches on the light, all the pixels are affected and the algorithm exchanges the foreground and background. PCP seems to be robust in all critical situations.
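As a reference for the DR, Prec and F values reported in this evaluation, the sketch below computes the four counts and the three measures from a detected mask and its ground truth. It assumes the standard definitions DR = TP/(TP+FN), Precision = TP/(TP+FP) and F = 2·DR·Precision/(DR+Precision); the masks are toy data, the function names are hypothetical, and empty denominators are not handled.

```python
def confusion_counts(mask, truth):
    """Count TP, FP, TN, FN between a detected mask and the ground truth."""
    tp = fp = tn = fn = 0
    for detected, actual in zip(mask, truth):
        if detected and actual:
            tp += 1          # foreground correctly marked as foreground
        elif detected and not actual:
            fp += 1          # background wrongly marked as foreground
        elif not detected and not actual:
            tn += 1          # background correctly marked as background
        else:
            fn += 1          # foreground wrongly marked as background
    return tp, fp, tn, fn

def detection_rate(tp, fn):
    return tp / (tp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def f_measure(dr, prec):
    """Harmonic mean of detection rate and precision."""
    return 2 * dr * prec / (dr + prec)

# Toy flattened masks (1 = foreground, 0 = background).
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
mask = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(mask, truth)
dr = detection_rate(tp, fn)
prec = precision(tp, fp)
print(tp, fp, tn, fn, dr, prec, f_measure(dr, prec))
```

Here half of the true foreground is found (DR = 0.5) and two of the three detected pixels are correct, which gives an F-measure of 4/7.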

Shah's dataset
This sequence comes from Sheikh & Shah (2005) and was recorded with a camera mounted on a tall tripod. It contains 500 images and the corresponding ground truth. The wind caused the tripod to sway back and forth, causing nominal motion in the scene. Table 3 shows the results obtained by the different algorithms on three images of the sequence: frame 309, which contains a walking person, frame 395, when a car arrives in the scene, and frame 462, when the same car leaves the scene. For each frame, the first column shows the original image and the corresponding ground truth. The second part presents the sparse matrix in the first row and the optimal foreground mask in the second row. The detection rate (DR), Precision (Prec) and F-measure (F) are indicated below each foreground mask. Fig. 2 shows two cumulative histograms of the F-measure, as in the previous performance evaluation. PCP gives the best result, followed by Bayesian RPCA, RSL, TFOCS and IALM. We can notice that the Bayesian RPCA gives better performance on this dataset because none of the moving objects is bigger than the background area.

Li's dataset
This dataset provided by Li et al. (2004) consists of nine video sequences, each presenting dynamic backgrounds or illumination changes. The size of the images is 176 × 144 pixels. From this dataset, we have chosen the following seven sequences:
• Campus: Persons walk and vehicles pass on a road in front of waving trees. This sequence contains 1439 images.
• Water Surface: A person arrives in front of the sea. There are many waves. This sequence contains 633 images.
• Curtain: A person gives a lecture in a meeting room with a moving curtain. This sequence contains 23893 images.
• Escalator: This image sequence shows a busy hall where an escalator is used by people. This sequence contains 4787 images.
• Airport: This image sequence shows a busy airport hall and each frame contains people. This sequence contains 3584 images.
• Shopping Mall: This image sequence shows a busy shopping center and each frame contains people. This sequence contains 1286 images.
• Restaurant: This sequence comes from the Wallflower dataset and shows a busy cafeteria. This video contains 3055 images.
The sequences Campus, Water Surface and Curtain present dynamic backgrounds, whereas the sequences Restaurant, Airport and Shopping Mall show bootstrapping issues. For each sequence, the ground truth is provided for twenty images in which the algorithms have to show their robustness. Table 4 shows the results obtained by the different algorithms on the sequence Campus. Table 5 presents the results on dynamic backgrounds for the sequences Water Surface, Curtain and Escalator, whereas Table 6 presents the results on bootstrapping issues for the sequences Airport, Shopping Mall and Restaurant. For each table, the first column shows the original image and the corresponding ground truth. The second part presents the sparse matrix in the first row and the optimal foreground mask in the second row. The detection rate (DR), Precision (Prec) and F-measure (F) are indicated below each foreground mask.
Fig. 3 and Fig. 4 show the cumulative histograms of the F-measure for dynamic backgrounds and bootstrapping issues, respectively. In each case, Bayesian RPCA gives the best results, followed by PCP, TFOCS, IALM and RSL.

Implementation and time issues
Regarding the code, we have used the following implementation in MATLAB: RSL provided by F. De La [...]
Lin et al. (2009) proposed to substitute the equality constraint with a penalty function, subject to a minimization under the $L_2$ norm:
$$\arg\min_{L,S} \operatorname{rank}(L) + \lambda \|S\|_0 + \mu \frac{1}{2} \|L + S - A\|_F^2 \qquad (13)$$
This algorithm solves a slightly relaxed version of the original equation. The constant µ balances exact and inexact recovery. Lin et al. (2009) did not present results on background subtraction.
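The effect of the penalty constant µ can be illustrated on the sparse component alone. The sketch below is not the algorithm of Lin et al. (2009): it assumes that the $L_0$ term of (13) is replaced by its $L_1$ surrogate and that L is held fixed, in which case the minimizer over S has a closed form given by entrywise soft-thresholding with threshold λ/µ. The values and function names are hypothetical.

```python
def soft_threshold(r, tau):
    """Minimizer over s of tau * |s| + 0.5 * (s - r)^2."""
    if r > tau:
        return r - tau
    if r < -tau:
        return r + tau
    return 0.0

def update_sparse(A, L, lam, mu):
    """Minimizer over S of lam * ||S||_1 + (mu / 2) * ||L + S - A||_F^2, L fixed.

    Assumption: the L1 norm stands in for the L0 norm of equation (13).
    """
    tau = lam / mu
    return [[soft_threshold(a - l, tau) for a, l in zip(row_a, row_l)]
            for row_a, row_l in zip(A, L)]

# Toy data: a flat background of intensity 10 with one bright outlier.
A = [[10.0, 12.0, 200.0],
     [11.0, 9.0, 10.0]]
L = [[10.0, 10.0, 10.0],
     [10.0, 10.0, 10.0]]

S = update_sparse(A, L, lam=2.0, mu=0.25)
print(S)  # [[0.0, 0.0, 182.0], [0.0, 0.0, 0.0]]
```

With a larger µ the threshold λ/µ shrinks, so S absorbs smaller residuals and L + S fits A more exactly; with a smaller µ, small deviations are left unexplained, which corresponds to the inexact recovery mentioned above.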

Fig. 1. Performance on the Wallflower dataset. The left (resp. right) figure shows the cumulative score by method (resp. sequence).

Fig. 2. Performance on the Shah dataset. The left (resp. right) figure shows the cumulative score by method (resp. sequence).