Estimation error of shift vectors of deformation field.
The effectiveness of the use of stochastic gradient estimation to detect motion in a sequence of images is investigated. Pixel-by-pixel reverse estimation of the deformation field is used. The representation of the shift vector is considered both by the projections on the basic axes and by the polar parameters. Two approaches to estimate the parameters of the deformation field are proposed and analyzed. In the first approach, the stochastic gradient procedure sequentially processes all rows of an image to find estimates of shifts for all points of the reference image. The joint processing of the results allows compensating the inertia of the stochastic estimation. In the second approach, to improve the accuracy of estimation, the correlation of image rows is taken into account. As a criterion for the formation of the resulting estimate, the minimum of gradient estimation and correlation maximum were investigated. The computational complexity of the proposed algorithms is investigated. The algorithms are compared with the MVFAST algorithm. Examples of experimental results on the formation of the deformation field, the selection of a moving object area, and the finding of the movement and trajectory parameters of a moving object in a video sequence are given.
- image processing
- video sequence
- shift vector
- stochastic estimation
- stochastic gradient
- reverse estimation
One of the challenges in video processing is moving object detection and tracking. Some tasks require only detection of the motion, while others require extraction of the moving object or the motion area boundary. The biggest challenge is to estimate trajectory parameters of a moving object in a video sequence. A solution quality to the problem largely depends on the accuracy of moving object area detection, since all the information needed to determine motion parameters and trajectory of the object is extracted from the image.
There are various approaches to identify the area of moving object based on the inter-frame difference [1, 2], background subtraction [2, 3], the use of statistics [2, 4], block estimation , and optical flow analysis . The processing can be presented as estimation of inter-frame geometric deformations of two images, one of which can be considered as the reference image and the second as deformed image , where is the value in node of the image. The field of inter-frame shift vectors of all points of the reference image corresponding to the nodes of the sample grid will be called deformation field or shift vectors’ field.
An important task of estimating the trajectory of a moving object is to increase the spatial accuracy of detecting the area of motion in an image sequence. The solution of this problem depends on the quality of estimation of the deformation field. When finding estimates of shifts for each node of the deformed image, the approach of using stochastic gradient estimation [7, 8, 9] is promising. It finds coordinates of points of the reference image on the deformed image. That is, it evaluates vectors of inter-frame shifts of points (Figure 1) of the reference image.
In view of the insignificant change in brightness on adjacent frames of a video sequence, it is advisable to use mean square inter-frame difference as the objective function of estimation. Then, for the projections, and of shift vector, we get 
where is the learning rate, which determines the rate of change of the estimated parameters, is the gradient estimation of an objective function, and is the image size.
The projections of stochastic gradient can be represented as 
where are the coordinates of the reference image point on the deformed image, is the brightness of the continuous image obtained from by means of interpolation, and are the steps of finding estimates of derivatives and .
Note that inter-frame shift vector can be also represented in polar form: (Figure 1), where is the length of the vector and is the angle with respect to the axis (Figure 1). Functionally, these representations are equivalent. However, due to the inertia of estimates of shift vectors’ field , when using stochastic gradient descent, the use of parameters and does not give equivalent estimates compared with the use of and . This is due to the fact that the sets of parameters have different physical meanings . The answer to the question of which set is preferable for solving the problem of moving object area detection is not obvious and requires research.
2. Variants of estimation algorithms
Let us compare the efficiency of moving object area identification using the sets of parameters and . The following method for estimating the shift vectors’ field is used. At the first stage, stochastic gradient descent algorithm sequentially processes all nodes of the reference image row by row with the set of parameters . It processes each row bidirectionally , first, from the left to the right, getting, according to Eq. (1), the estimates
and then from the right to the left getting the estimates
where parameter is determined by the maximum speed of moving objects.
When using the set of parameters (), considering that , , , and , for gradient estimation , we get 
Then according to Eq. (1), we get the estimates
where the parameters and are also determined by the maximum speed of moving objects.
At the second stage, for each node , the estimates and are processed jointly . The optimal value of is found between the estimates with some step . Step is determined by the required accuracy. If the absolute difference between the estimates and is less than , then the estimate of the deformation parameter is assumed to be equal . Otherwise, the set of possible values of the deformation parameter estimate is given by
The optimal value of from the resulting set is found using some criterion.
2.1 Criteria for the formation of the resulting estimation
Two criteria for the formation of the resulting estimation are studied :
Gradient estimation minimumE2
Correlation coefficient maximum (CC)
where () are the coordinates of the point of the image in the image for the estimate and is the window size for calculation of CC. The optimal value is determined similarly.
Joint processing of estimates allows compensating the inertia of the recursive estimation. The results given below were obtained with .
The joint processing of estimates , and , and as well as finding the optimal values of and are performed as described above using the same criteria of gradient estimation minimum and CC maximum.
Thus for joint processing of the estimates, depending on the criterion, we obtain two versions of the estimation algorithm:
Algorithm A: reverse processing using the criterion Eq. (2);
Algorithm B: reverse processing using the criterion Eq. (3).
2.2 Comparison of the estimation algorithm efficiency
Let us first consider the efficiency of the criterion Eq. (2). Figure 2 shows an example of images used for the study. These are two adjacent frames of a video sequence where the vehicle located in the center is moving and the vehicle on the right is motionless. The parameters of inter-frame spatial shift of the moving vehicle are , or , and .
For example, Figure 3 shows typical results of shift magnitude estimation for a single row of the reference image (Figure 2) while using Algorithm A. For a correct comparison of the estimates , and , and , the latter is recalculated to polar parameters:
Figure 3a shows dependencies and on point in a row, Figure 3b shows the result of their joint processing, Figure 3c shows dependencies and on , and Figure 3d shows the result of their joint processing. The solid gray line represents the true value of the deformation parameter. The results for parameters are visually better. Estimates given in Table 1 also confirm that. Table 1 shows mean value and variance of estimation error for the area with and without motion for results shown in Figure 3b and d.
|Algorithm||Motion area||Area without motion|
|For a single row|
|For the entire image|
Note that with the use of parameters in the motion area, the mean value of the estimation error is lower, and the variance is more than two times less than with the use of the set of parameters . In the area without motion, the use of parameters is also preferable since the bias of the estimates is less despite the slightly larger variance. For comparison, Table 1 also shows the results for the mean value and estimation of the variance of estimation errors averaged over the entire image.
Figure 4 shows the results of the joint processing of the sets of estimates and using the criterion Eq. (3) of the formation of the resulting estimate (Algorithm B). The results are for the same row as the results in Figure 3. Figure 4a corresponds to the set of parameters , and Figure 4b corresponds to . Table 1 summarizes the numerical data of the estimation error for the row and averaging over the entire image.
The results show that the use of CC maximum as the criterion can significantly improve the results of the joint processing of the estimates. For both sets of parameters, the mean value of the error for the motion area does not exceed 0.2. The variance of the error is approximately halved for the set of parameters and 15 times less for parameters . For the area without motion, the mean error is less by about 2.6 and 3.3 times. The variance of the estimates also decreases. However, when using CC, the computational complexity of the estimation procedure increases significantly.
Figure 5 shows the visualization of estimates of the shift vectors’ field for the four estimation algorithms described above. It shows the magnitudes of the estimated vectors as a function of coordinates of nodes of the reference image. Figure 5a and b shows the results for criteria Eqs. (2) and (3) for the set of parameters and Figure 5c and d for the same criteria when parameters are used. The figures illustrate and confirm the conclusions made above.
2.3 Use of inter-row correlation of adjacent rows of the image
The approach to estimate the field considered above processes images row by row as one-dimensional signals. Let us consider the possibility to increase processing efficiency by taking into account correlation of the adjacent rows of image [13, 17]. For each node of the reference image, we will also form two estimates and of the parameters of the vector . To form the estimate , the stochastic gradient procedure processes rows one after the other with the change in direction after each row, for example, the odd rows from the left to the right, forming estimates , and the even rows, from the right to the left, forming estimates . The current shift estimate in a node is compared with the estimate , obtained from the previous row. If the estimate is considered better using some criterion, then it is taken as the current estimate.
As the criterion we use CC maximum Eq. (3) with the difference that here the set of possible values is limited by the estimates and . The estimate is also formed in a similar way with the difference that, when processing, the odd rows are processed from the right to the left and the even rows from the left to the right.
The joint processing of estimates and is carried out as described above. On the one hand, when taking into account inter-row correlation, the formation of and requires a greater amount of computation. On the other hand, the set of possible estimates for their joint processing turns out to be significantly smaller, which reduces the computational complexity.
Thus we have two more algorithms to estimate the shift vectors’ field:
2.4 Comparison of algorithms C and D
Figure 6 shows the typical results of joint processing of the sets of estimates , (Figure 6a) and , and (Figure 6b) using Algorithm C. Figure 6c and d shows the results for the same sets and Algorithm D. It can be seen from the figures that the use of inter-row correlation can significantly improve the estimation results. For the parameters and Algorithm C, mean value of the error for motion area decreases by 1.2 times, error variance by 1.1 times, and for Algorithm D by 3 times and 29 times, respectively. For the set of parameters and Algorithm C, the mean value of the error for motion area decreases by 5 times, variance by 2.1 times, and for Algorithm D by 10 times and 2.5 times, respectively. The numerical values are listed in Table 1.
Figure 7 shows visualization of the estimates of shift vectors’ field for the estimation algorithms. Figure 7a and b corresponds to the algorithms C and D and set of parameters , and Figure 7c and d corresponds to the same algorithms and the set of parameters . It can be seen that criterion Eq. (3) provides a greater accuracy of estimation (but also a larger computational costs). When maximum spatial accuracy is required, it is advisable to use one of algorithms B and D. The error for the Algorithm D is smaller, but the algorithm has a greater computational complexity.
Figure 8 shows moving object area and its contour obtained from the results of algorithms C (Figure 8a) and D (Figure 8b) by thresholding with the threshold equal to 0.1 of the maximum value of shift magnitude. Note that there are practically no errors of the second kind for Algorithm D.
3. Efficiency of the algorithms for complex motion
Let us consider the possibilities of the developed algorithms when estimating shift vectors’ field for more complex types of motion, in particular, for the case when inter-frame geometric deformations of images of a moving object can be described by a similarity model with a vector of parameters . To compare the results of estimation of the shift vectors’ field with the results obtained for the images in Figure 2, the right image was modeled with the corresponding parameters of inter-frame deformations: , , and .
The results presented below are for algorithms B and D which were shown to have the highest efficiency with the use of Eq. (3) as a criterion. Figure 9 shows typical results for a single row of the image; here graphs (a) and (b) correspond to the Algorithm B and the set of parameters and , respectively, and graphs (c) and (d) correspond to Algorithm D. In this case the estimates for the set of parameters are closer to the true values (gray line). Figure 10 shows visualization of estimates of the shift vectors’ field for the entire image. Figure 10a and b corresponds to Algorithm B, and Figure 10c and d corresponds to Algorithm D with the set of parameters and , respectively.
In the presented figures, in order to remove outliers significantly exceeding the true value of the shift, the estimation module was limited to . When using Algorithm B, we note a significant number of false estimates both for and outside the area of motion. In particular, when using the parameters , the estimates for a significant number of bottom rows of the image exceed the threshold. The use of the inter-row correlation significantly reduces the number of outliers for the set of parameters and virtually eliminates them for the set .
4. Comparison of the developed algorithms with the MVFAST algorithm
For block approach [18, 19], one of the most effective algorithms for the formation of shift vectors’ field is the MVFAST (Motion Vector Field Adaptive Search Technique) algorithm . Therefore, we compare the effectiveness of the proposed algorithms and the MVFAST algorithm.
In the MVFAST algorithm, the deformed image is divided into many nonoverlapping square blocks of a predetermined size. In most cases the block sizes are 2 × 2, 4 × 4, etc., but the algorithm allows setting blocks of any size, starting from one pixel. The estimation of the shift vector for the current block is determined by the estimates of the motion vectors of the neighboring blocks. The maximum value of the motion vector length of neighboring blocks is compared with two threshold values and , the values of which depend on the type of video sequence. As a result motion activity is classified as low, medium, and high. The criterion is the minimum of the sum of absolute differences (SAD). This characteristic for a block is the sum of the modules of the differences between the nodes of the current block and the corresponding nodes of the image on which the block location is searched. To improve performance and eliminate the influence of noise in the MVFAST algorithm, an early stop of the search is used. The search is terminated if the SAD is less than the specified threshold value . Early stopping of the search can be disabled by setting the threshold to zero.
Let us compare the results of estimation of the deformation field by algorithms D and MVFAST. For correctness of the comparison, below are the results obtained for the images in Figure 2 with the parameters of Algorithm D, , , , , and , and parameters of the MVFAST algorithm, , , , and block size 1x1. Figure 11a shows the estimates of the shift magnitude of the image points corresponding to the nodes of a single row of the reference image and formed by the MVFAST algorithm. The results of Algorithm D with the sets of parameters and are shown in Figure 6c and d, respectively.
It can be seen from the figure that the MVFAST algorithm forms fairly accurate estimates of node shifts, but, unlike Algorithm D, it has errors at the boundaries of the image of an object. Errors also occur in low-contrast areas inside the object. The latter is due to the fact that SAD in the center of the search for them often does not exceed the threshold . Setting the threshold when a block size is 1 pixel also does not solve this problem. Algorithm D, due to the inertia of changes of estimates, is deprived of this disadvantage.
The values of the mathematical expectation and variance of estimation errors when using the MVFAST algorithm for a row and for the entire image are shown in Table 1. The table shows that the mathematical expectation of the MVFAST algorithm for the motion area is almost the same as for Algorithm D with the set of parameters but several times higher (about five times for a row, eight times for the entire image) than for Algorithm D with the set of parameters . The variance of the estimation error for the motion area for the MVFAST algorithm significantly exceeds the variance for Algorithm D (13 times with the set of parameters and 27 times with the set of parameters . In the absence of noise for area without motion, the MVFAST algorithm shows slightly better results compared to Algorithm D with the set of parameters , but the results for Algorithm D with the set of parameters is better. This is well illustrated by Figure 11b, which shows the visualization of the deformation field estimates for the entire image. The results of using Algorithm D with the sets of parameters and are shown in Figure 7b and d, respectively.
Let us also compare the efficiency of the algorithms with the complex motion for the same conditions and parameters of inter-frame deformations (, , ) which were used in the analysis of algorithms B and D.
Figure 12a shows the estimates of shift magnitudes for a single row of the reference image (the same row that was selected in the analysis of Algorithm D), when the MVFAST algorithm is used. The results of Algorithm D with the sets of parameters and are shown in Figure 9c and d, respectively. The visualization of deformation field for the MVFAST algorithm is shown in Figure 12b. The deformation fields obtained by Algorithm D with the sets of parameters and are shown in Figure 10c and d, respectively.
It can be seen from the figure that when estimating a deformation field for complex motion, the MVFAST algorithm gives significantly worse results in terms of spatial accuracy compared to the developed algorithms, and it is practically inapplicable for estimating the parameters of the motion trajectory.
Note that the increase of block size in the MVFAST algorithm allows to get rid of detection gaps inside the object image but still does not solve the problem of errors at its boundaries. At the same time, the increase of block size reduces the accuracy of moving object area detection. The proposed algorithms, in contrast to the MVFAST algorithm, provide subpixel accuracy while estimating the parameters of the shifts that is fundamentally impossible for the MVFAST algorithm.
5. Analysis of computational costs
Along with the accuracy of estimation, computational complexity of the algorithms is also important. In this case, the required amount of computations per pixel of the reference image with coordinates consists of two components:
Computational cost of finding estimates and or , and , ;
Computational cost of obtaining the resulting estimate in the joint processing of estimates and or a set of estimates , and , .
The second component depends on many factors: complexity of the background, noise, number and speed of moving objects, etc. It can be evaluated with the given limitations of the particular problem being solved. Therefore, in the framework of this work, we restrict ourselves from analyzing the first component of computational costs with two sets of parameters for the estimated shift vectors and for the three algorithms discussed above:
Algorithm A: single row reverse processing
Algorithm C: joint processing of adjacent rows using gradient estimation minimum criterion
Algorithm D: joint processing of adjacent rows using CC maximum criterion
Detailed analysis of the computational cost requires taking into account not only algorithm itself but a large number of other factors (type of the computing device, time of memory access and for other operations, etc.). Many of these factors depend on the particular imaging device and computing resources. Therefore, we analyze only the computational complexity of the calculated ratios of the algorithms. We will consider the number of operations of addition (+), subtraction (−), multiplication (), division (), tacking the root (), trigonometric functions (), and interpolations for a node ().Table 2 shows the number of such operations needed to process one node of an image.
|Operation||Number of operations|
|Algorithm A||Algorithm C||Algorithm D|
The total computational cost is defined as
where are the coefficients characterizing the time needed to perform corresponding operations and are the number of operations.
Equation (4) allows estimating the computational cost of the algorithms for particular computing facilities. For example, for a PC with an AMD Turion II X2 M500 processor: = 1.4, = 1.4, = 13.6, = 63, = 135, and = 212 ns. Then for the set of parameters , average time needed to process one node of an image is 0.53 μs for Algorithm A, 1.06 μs for Algorithm C, and 3.74 μs for Algorithm D. The average time for parameters is 1.4 μs for Algorithm A, 2.8 μs for Algorithm C, and 4.62 μs for Algorithm D.
Let us note the features of the computational complexity of algorithms D and MVFAST. Studies have shown that the amount of computation of the latter is largely determined by the nature of the processed images. So if the adjacent frames of a video sequence do not contain motion, the computational complexity of the algorithm will be determined only by computing one SAD for each block. In the presence of motion, depending on its intensity, the computational complexity may increase in tens or even hundreds of times. In particular, when processing two frames of a video sequence without movement by the MVFAST algorithm with a block size of 1 pixel on the same PC, the average time spent on one node was 0.64 μs. This is the lowest possible computational cost. The average processing time for a single node for the images shown in Figure 2 was 8.91 μs. For similar images with a complex type of motion, for example, with , processing time is 21.84 μs. Note that the time expenses when using Algorithm D with the set of parameters were 7.44, 8.8, and 10.75 μs under similar conditions, while with the parameters , the time expenses were 8.13, 10.27, and 11.8 μs.
From the above results, it can be seen that the computational complexity of Algorithm D weakly depends on the presence or absence of moving objects, their size, and type of motion. The computational complexity of the MVFAST algorithm is largely determined by these factors. So for the case when the inter-frame deformations of images of a moving object correspond only to a parallel shift, the computation time for algorithms D and MVFAST is approximately the same. When motion is complex, it increases in several times for the MVFAST algorithm.
6. An example of estimating the trajectory of a moving object using deformation field
Let us give an example of estimating the trajectory of a moving object using the developed algorithms. Figure 13a and b shows two frames from the video sequence being processed in which the aircraft lands on an aircraft carrier.
A complicating factor is the uneven movement of the camera toward the landing plane. Therefore, to estimate the position of the moving object relative to the scene (not to the camera), it was necessary not only to detect and identify the moving object area but also to stabilize the image. First, the distortion caused by the movement of the camera is compensated. To do this the inter-frame deformation of adjacent frames presented as similarity model is estimated using deformation fields . Then the area of a moving object is detected. Using the estimates of the deformation field in the area of moving object, the parameters of the inter-frame motion are found, which for 34 frames of the video sequence are isometrically presented in Figure 13c. These parameters are used to estimate a three-dimensional trajectory of a moving object. Figure 13d shows the result in relative coordinates XYZ. The camera position at the initial moment of the shooting is taken as the origin.
The effectiveness of the motion area detection in sequence of images using pixel-by-pixel stochastic gradient estimation of the inter-frame shift vectors for all points of the reference image corresponding to its nodes (deformation field) is investigated. It is shown that stochastic gradient estimation of shift vectors does not give equivalent results for their representation as projections on the basic axes of the image and polar parameters. This is due to the fact that these sets of parameters have a different physical meaning. The analysis of variants of estimation algorithms for the deformation field has shown that the use of polar parameters provides higher accuracy of estimation.
Two approaches to estimate the parameters of the deformation field are considered. In the first approach, the stochastic gradient procedure sequentially processes all rows of an image to find estimates of shifts for all points of the reference image. It processes each row bidirectionally, i.e., from the left to the right and from the right to the left. The joint processing of the results allows compensating the inertia of the stochastic estimation. However, this approach does not take into account inter-row correlation, and images are processed as one-dimensional signals. In the second approach, to improve the accuracy of estimation, the correlation of image rows is taken into account. Algorithm processes rows one after the other with change in direction after each row and uses obtained values to form resulting estimate for each node. From the results obtained for each node of the reference image, the resulting estimate of the shift vector is formed. Studies have shown that the second approach, with approximately the same computational complexity, showed significantly higher accuracy in the estimation of the shift vectors’ field.
As a criterion for the formation of the resulting estimate, the minimum of gradient estimation of the objective function and the correlation maximum of local neighborhoods of the deformed and reference images were investigated. Correlation maximum criterion showed the best results. When using this criterion, not only the mean value and the variance of the estimation errors turned out to be less but also the effect of “oscillations” outside the motion area. But the use of this criterion requires a large computational cost.
The computational complexity of the proposed algorithms is investigated. Under equal conditions, the estimation of the projections of the shift vectors requires a slightly smaller amount of computations than the estimation of polar parameters. Using reverse processing of a row is less time-consuming compared with processing of the adjacent rows. The use of the correlation maximum criterion implies a larger amount of calculations than the use of the gradient estimation minimum criterion. The algorithms with large computational costs provide less error; therefore the choice of algorithm depends on the specific task.
The experiments have confirmed the effectiveness of using the developed algorithms to find the motion parameters and the trajectories of objects in a video sequence. In particular, an example of estimating three-dimensional trajectory of an aircraft in relative coordinates using 34 frames of a video sequence is considered. The parameters of the similarity model (parallel shift, rotation angle, and scale factor) were used to describe the inter-frame motion.
Comparison of the developed algorithms with the block algorithm MVFAST (Motion Vector Field Adaptive Search Technique) showed that under identical conditions, the latter shows worse results in accuracy of detecting the boundaries of motion. In addition, in contrast to the MVFAST algorithm, the proposed algorithms allow obtaining subpixel estimation accuracy.
Thus, the conducted study allows us to conclude that the pixel-by-pixel inter-frame estimation of the shift vectors of the reference image points corresponding to its nodes is a promising approach in detecting the area of a moving object in a series of images.
The reported study was funded by the RFBR and Government of Ulyanovsk Region according to the research projects 18-41-730006 and 18-41-730011.