Camera Motion Estimation Based on Edge Structure Analysis

The estimation of camera motion is important for several video analysis tasks such as indexing and retrieval, motion compensation, and scientific film analysis. From an aesthetic point of view, camera motion is often used as an expressive element in film production. Motion content can be used as a powerful cue for structuring video data, similarity-based video retrieval, and video abstraction.


Introduction
The estimation of camera motion is important for several video analysis tasks such as indexing and retrieval, motion compensation, and scientific film analysis. From an aesthetic point of view, camera motion is often used as an expressive element in film production. Motion content can be used as a powerful cue for structuring video data, similarity-based video retrieval, and video abstraction. Motion estimation and motion pattern classification have been extensively investigated by the scientific community for semantic characterization and discrimination of video streams. Moving object trajectories have been used for video retrieval [1][2][3]. Camera motion pattern characterization has been efficiently applied to video indexing and retrieval [4][5][6][7]. However, the main limitation of the latter methods is that they only characterize the detected camera motion patterns, without explicitly measuring the camera motion parameters. As a result, the acquired information is of limited interest, since it can be used primarily for video indexing and retrieval.
There are different types of camera motion: rotation around one of the three axes and translation along the x- and y-axes. Furthermore, zooming in and out can be considered equivalent to translation along the z-axis. Existing methods can be classified into optical flow methods and feature-correspondence-based approaches. We also mention recursive techniques based on extended Kalman filters [8], which track camera motion and estimate the structure of the scene. For the case of an uncalibrated camera, interesting approaches are described in [9,10]. The use of optical flow avoids the choice of "good features". In [11], differential approaches to the epipolar constraint are described. In [12], the optical flow computed between two adjacent images in a video sequence is linearly decomposed on a database of optical flow models. The authors of [13] compare algorithms that use only optical flow for estimating camera motion.
In this work we present an approach for recovering graph-based structures from images. This structure is then used for estimating 3D camera motion. Our approach is based on detecting straight line segments. After prefiltering with a bilateral filter, edge detection is applied, followed by the Hough transform. The result of the transform is then analyzed to detect local maxima. The inverse transform of these maxima gives the straight lines present in the image. We intersect these analytical lines with the detected edges to find straight line segments. At this step we also detect intersections between line segments. For each intersection point we compute a rank based on the number of connected points. After transforming the image into a graph, we search for similar structural elements in the graph of the previous frame. This search looks for subgraphs consisting of vertices with similar ranks. After the corresponding points are found, camera motion is estimated as a combination of translation and rotation.

Camera motion model
In this work we consider an eight-parameter perspective model with parameters $a_0, \dots, a_7$. Restricting it with two equality constraints on the affine parameters, together with $a_6 = a_7 = 0$, gives a translation-zoom-rotation model.
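As a minimal sketch, one common way to write an eight-parameter perspective model is shown below. The parameter ordering and the names `apply_perspective` and `similarity_params` are assumptions made for illustration; the ordering used in this work may differ.

```python
import math


def apply_perspective(a, x, y):
    """Map the point (x, y) with an eight-parameter perspective model.

    Assumed ordering (hypothetical, for illustration only):
        x' = (a0 + a1*x + a2*y) / (1 + a6*x + a7*y)
        y' = (a3 + a4*x + a5*y) / (1 + a6*x + a7*y)
    """
    w = 1.0 + a[6] * x + a[7] * y
    return ((a[0] + a[1] * x + a[2] * y) / w,
            (a[3] + a[4] * x + a[5] * y) / w)


def similarity_params(tx, ty, zoom, angle):
    """Parameters of the translation-zoom-rotation special case.

    Under the assumed ordering this means a1 = a5, a2 = -a4 and a6 = a7 = 0.
    """
    c = zoom * math.cos(angle)
    s = zoom * math.sin(angle)
    return [tx, c, -s, ty, s, c, 0.0, 0.0]
```

With `a6 = a7 = 0` the denominator is 1, so the model reduces to a rotation and uniform scaling followed by a translation.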
In [14], any vector field is approximated by a linear combination of a divergent field, a rotation field, and two hyperbolic fields, which establishes the relationship between the motion model parameters and their symbol-level interpretation. The error of the estimated parameters is defined over the corresponding points, where $N$ is the number of corresponding points and $M$ denotes the estimated camera motion parameters.
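It is well known that, taking some particular point $\mathbf{p}$ as the origin of a coordinate system with coordinates $\mathbf{x}$, any infinitely differentiable function $f(\mathbf{x})$ can be approximated by its Taylor series:

$$f(\mathbf{x}) = f(\mathbf{p}) + \sum_i \frac{\partial f}{\partial x_i} x_i + \frac{1}{2} \sum_{i,j} \frac{\partial^2 f}{\partial x_i \partial x_j} x_i x_j + \dots \approx c - \mathbf{b} \cdot \mathbf{x} + \frac{1}{2}\, \mathbf{x} \cdot \mathbf{A} \cdot \mathbf{x} \qquad (4)$$

where

$$c \equiv f(\mathbf{p}), \qquad \mathbf{b} \equiv -\nabla f \big|_{\mathbf{p}}, \qquad [\mathbf{A}]_{ij} \equiv \frac{\partial^2 f}{\partial x_i \partial x_j} \bigg|_{\mathbf{p}}$$

The matrix $\mathbf{A}$, which consists of the second partial derivatives of the function, is also called the Hessian matrix of the function at $\mathbf{p}$ [15]. In the approximation (4) the gradient of $f$ is easily calculated as $\nabla f = \mathbf{A} \cdot \mathbf{x} - \mathbf{b}$. In Newton's method the gradient is set to 0 to determine the next iteration point.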
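The gradient of the error function $\varepsilon$ with respect to the parameters $a$ has components $\partial \varepsilon / \partial a_k$, $k = 0, 1, \dots, 7$. Taking second-order partial derivatives gives $\partial^2 \varepsilon / \partial a_k \partial a_l$. It is conventional to remove the factors of 2 by defining

$$\beta_k \equiv -\frac{1}{2} \frac{\partial \varepsilon}{\partial a_k}, \qquad \alpha_{kl} \equiv \frac{1}{2} \frac{\partial^2 \varepsilon}{\partial a_k \partial a_l}$$

making $[\alpha] = \frac{1}{2} \mathbf{A}$ in equation (6), in terms of which that equation can be rewritten as the set of linear equations

$$\sum_{l=0}^{7} \alpha_{kl}\, \delta a_l = \beta_k \qquad (11)$$

where $\delta a_l$ is the increment of parameter $a_l$. SVD is used to compute the transformation parameters from the overdetermined set of linear equations (11).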
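To illustrate how SVD can solve an overdetermined linear system for motion parameters, here is a minimal sketch restricted to the translation-zoom-rotation case, where the model is linear in the unknowns. It is a simplification and not the iterative scheme described above; the function name and the unknown vector `[tx, ty, c, s]` (with `c = zoom*cos(angle)` and `s = zoom*sin(angle)`) are assumptions for illustration.

```python
import numpy as np


def fit_translation_zoom_rotation(P, Q):
    """Least-squares fit of x' = tx + c*x - s*y, y' = ty + s*x + c*y.

    P and Q are sequences of corresponding (x, y) points.
    Returns the array [tx, ty, c, s].
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    n = len(P)

    # Two equations per correspondence: one for x', one for y'.
    A = np.zeros((2 * n, 4))
    b = np.zeros(2 * n)
    A[0::2, 0] = 1.0
    A[0::2, 2] = P[:, 0]
    A[0::2, 3] = -P[:, 1]
    A[1::2, 1] = 1.0
    A[1::2, 2] = P[:, 1]
    A[1::2, 3] = P[:, 0]
    b[0::2] = Q[:, 0]
    b[1::2] = Q[:, 1]

    # Pseudo-inverse through SVD; near-zero singular values are dropped.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    keep = S > 1e-10 * S.max()
    S_inv = np.zeros_like(S)
    S_inv[keep] = 1.0 / S[keep]
    return Vt.T @ (S_inv * (U.T @ b))
```

Zoom and rotation angle can then be recovered as `np.hypot(c, s)` and `np.arctan2(s, c)`.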
In the proposed work, the initial translation was estimated prior to pan-tilt-zoom estimation, based on the centers of gravity of the corresponding feature points in consecutive frames. To remove outliers, a voting scheme was used. After correspondences between frames are found, each pair of matching points "votes" for its offset. Points whose offsets receive a small number of votes are then discarded.
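The voting step and the center-of-gravity translation estimate can be sketched as follows. The bin size, the vote threshold, and the function names are hypothetical choices for illustration; the text does not specify them.

```python
from collections import Counter


def filter_by_offset_votes(P, Q, bin_size=4, min_votes=3):
    """Keep matched pairs whose quantized offset gets at least min_votes votes."""
    def offset_bin(p, q):
        return (round((q[0] - p[0]) / bin_size),
                round((q[1] - p[1]) / bin_size))

    votes = Counter(offset_bin(p, q) for p, q in zip(P, Q))
    return [(p, q) for p, q in zip(P, Q) if votes[offset_bin(p, q)] >= min_votes]


def centroid_offset(pairs):
    """Translation between the centers of gravity of the two point sets."""
    n = len(pairs)
    dx = sum(q[0] - p[0] for p, q in pairs) / n
    dy = sum(q[1] - p[1] for p, q in pairs) / n
    return dx, dy
```

The difference of the two centers of gravity equals the mean of the individual offsets, which is why `centroid_offset` averages the offsets directly.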

Image to graph conversion
The most challenging part of camera motion estimation is finding correspondences between two frames. In the proposed work this process is based on searching for similar edge structures in the frames. The input image is represented as a graph based on edges and their intersections. Intersection points are ranked according to the number of connections. The algorithm for transforming an image into a graph is presented in Fig. 1.
The proposed method is based on matching edge structures inside the images, so the robustness of the algorithm depends on the quality of the detected edges. We therefore try to discard small edges and edges with small magnitude before recovering structures. One of the most effective ways to remove such edges is bilateral filtering [18]. We used bilateral filtering to blur the image while preserving strong edges (Fig. 2(b)). In the filter, $\sigma_s$ controls the influence of the spatial neighbourhood, $\sigma_I$ controls the influence of the intensity difference, and $k$ is a normalizing coefficient equal to the sum of the filter weights over the neighbourhood.
In the graph representation, $E$ is a set of edges. Each vertex is described by its coordinates and by its rank, which shows how many connections the vertex has. The edges of the graph show connections between vertices. To compute the rank of a vertex we used the following idea. Consider the structure presented in Fig. 3 as a graph obtained after edge detection. It has nine edges $e_1, \dots, e_9$ and ten vertices $v_1, \dots, v_{10}$, and can be described by the following connectivity matrix:

$$\begin{pmatrix}
0&0&0&1&0&0&0&0&0&0\\
0&0&0&1&0&0&0&0&0&0\\
0&0&0&1&0&0&0&0&0&0\\
1&1&1&0&1&1&0&0&1&0\\
0&0&0&1&0&0&0&0&0&0\\
0&0&0&1&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&0&0&1&0\\
0&0&0&1&0&0&1&1&0&1\\
0&0&0&0&0&0&0&0&1&0
\end{pmatrix}$$

We define the $j$-th order rank $r_i^j$ of the $i$-th vertex as the number of other vertices that can be reached from the $i$-th vertex in at most $j$ steps. The first-order rank of the $i$-th vertex shows how many vertices are directly connected to it. Thus, $v_1$ has first-order rank $r_1^1 = 1$ and $v_4$ has $r_4^1 = 6$. The second-order rank shows how many vertices can be reached from the $i$-th vertex in two steps. Thus, $r_1^2 = 6$ and $r_4^2 = 9$. Finally, the $n$-th order rank shows how many vertices can be reached from the $i$-th vertex in $n$ steps. This ranking can be used to evaluate the complexity and size of the substructures. The rank table for the graph presented in Fig. 3 is shown in Table 1. We define the highest rank order, rMax, as the minimal order satisfying a condition on the ranks that involves $n$, the number of edges in the graph.
The highest order for this graph is 3. Using higher-order ranks is useless for this graph, since they would have the same values. In the general case, however, using higher-order ranks makes the matching process more effective by detecting more complex structures inside the image.
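The rank values quoted for the graph of Fig. 3 can be reproduced with a short breadth-first sketch. It assumes that a rank counts the other vertices reachable in at most j steps, which agrees with the values given in the text. The helper `stable_order` (the first order at which no rank changes any more) is an illustrative assumption and not necessarily the authors' rMax condition.

```python
# Connectivity matrix of the example graph, one string per vertex v1..v10.
ROWS = [
    "0001000000", "0001000000", "0001000000", "1110110010", "0001000000",
    "0001000000", "0000000010", "0000000010", "0001001101", "0000000010",
]


def neighbours(rows):
    """Adjacency sets (0-based indices) from rows of '0'/'1' characters."""
    return [{j for j, c in enumerate(row) if c == "1"} for row in rows]


def rank(adj, i, order):
    """Number of other vertices reachable from vertex i in at most `order` steps."""
    reached = {i}
    frontier = {i}
    for _ in range(order):
        frontier = {k for j in frontier for k in adj[j]} - reached
        reached |= frontier
    return len(reached) - 1


def stable_order(adj):
    """Smallest order at which the ranks of all vertices stop changing."""
    order = 1
    while any(rank(adj, i, order) != rank(adj, i, order + 1)
              for i in range(len(adj))):
        order += 1
    return order


adj = neighbours(ROWS)
print(rank(adj, 0, 1), rank(adj, 3, 1))  # first-order ranks of v1 and v4
print(rank(adj, 0, 2), rank(adj, 3, 2))  # second-order ranks of v1 and v4
```

For this graph every vertex reaches all nine others within three steps, which matches the statement that orders above 3 carry no additional information.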
After the two frames are transformed into graphs, the problem of finding correspondences between them can be reformulated as the problem of finding subgraphs with similar structure. We start the search by finding correspondences for vertices with high rank. During this process spatial information can also be considered: we try to find a corresponding vertex that has the same rank and is located close to the reference vertex. In future work this step may be improved by searching for correspondences between groups of vertices instead of matching them one by one. As a result of this step we obtain two sets of points, P and Q, which are used to estimate camera motion as described above.
The matching idea is based on searching for similar substructures in the image graphs. Typically, images contain many simple substructures, which have a small maximum rank order rMax. Furthermore, most vertices have small first- and second-order ranks, while very few are ranked higher than 6. Thus we start the search for correspondences from the structures with the highest ranks. In the graph presented in Fig. 3 such vertices are $v_4$ and $v_9$, which have ranks 6 and 4 respectively. Since they are connected, we search for two connected vertices with ranks 4 and 6 in the next frame.
To simplify the matching process we try to find more complex substructures. This is done by computing vertex ranks of higher orders.
Some edges can be lost in an image sequence due to noise, camera movement, and movement of objects in the scene. This may cause differences in the structure of the image graphs. To solve this problem the following ideas were used:
1. We try to match structures located closer to the center of the frame first, in order to avoid losing edges due to camera motion.
2. We assume that vertex i in one frame may correspond to vertex j in the other frame even if their ranks are not exactly the same. Thus, for each vertex in the first frame we select several candidates whose ranks differ by less than 30%.
After candidates are selected for all vertices, we choose the best corresponding pairs using spatial information.
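A minimal sketch of the candidate selection described above is given below. The greedy strategy (highest rank first, nearest unused candidate wins), the vertex representation `(x, y, rank)`, and the function name are assumptions for illustration; the text does not prescribe how the spatial information is combined with the ranks.

```python
import math


def match_vertices(verts_a, verts_b, tolerance=0.3):
    """Greedy matching of (x, y, rank) vertices between two frames.

    A vertex in frame B is a candidate if its rank differs from the
    reference rank by less than `tolerance` (30% by default).
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    # Process vertices of the first frame from the highest rank down.
    order = sorted(range(len(verts_a)), key=lambda i: -verts_a[i][2])
    used = set()
    pairs = []
    for i in order:
        xa, ya, ra = verts_a[i]
        candidates = [j for j, (_, _, rb) in enumerate(verts_b)
                      if j not in used and abs(rb - ra) < tolerance * ra]
        if not candidates:
            continue
        # Spatial information: prefer the candidate closest to the reference.
        best = min(candidates,
                   key=lambda j: math.hypot(verts_b[j][0] - xa,
                                            verts_b[j][1] - ya))
        used.add(best)
        pairs.append((i, best))
    return pairs
```

Ordering by distance to the frame center instead of by rank would follow the first idea in the list more closely; both orderings fit the same loop.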
To decrease the number of incorrect matches and increase the precision of the matching algorithm, matched points were additionally compared using the Cross-Hexagon Search (CHS) algorithm [19]. The coordinates of matched graph vertices were used as block centers, and the offset between matched points in consecutive frames was used as the initial offset. Since CHS is used only for verification of matched feature points, it can be replaced by any block-based matching algorithm such as Diamond Search [20] or Three-Step Search [21].

Experimental results
All experiments were run on an Intel Core 2 Duo with 2 GB of memory in the Borland Builder environment. The program was not optimized for maximum performance, so the computational time shown in Table 2 could be decreased. To evaluate matching quality, several types of experiments were performed. The first group of tests considered planar camera motion (translation and rotation around the camera's optical axis).

Image sequences were recorded with the camera smoothly shifted by 20 cm in different directions and rotated by 20 degrees around its optical axis (see Fig. 4 for an example). To evaluate the quality of camera motion estimation, the average and maximum absolute errors were computed.
Results for all groups of tests are shown in Table 3. Additionally, these tests were used to evaluate the performance of the algorithm with and without CHS correction (Figs. 7 and 8). The additional verification step for corresponding points can decrease the estimation error at the cost of a relatively small increase in computational time (about 0.07 s). In the last group of tests, an image sequence with a predefined camera motion trajectory was used to evaluate the error as a function of the number of graph vertices used for matching; the result is shown in Fig. 9. Table 3 shows that the proposed method can effectively estimate camera motion for scenes with moving objects. However, its weak point is natural scenes with a small number of geometrical objects.

Conclusion
In this paper we introduced an algorithm for converting images into graphs based on edge structure, and a graph matching algorithm based on vertex ranking. The proposed method can be used in different applications such as camera motion estimation, stereo matching, motion compensation, and background model generation. It provides efficient image-to-graph mapping for urban scenes. For natural scenes, additional prefiltering may be required to remove unimportant information from the edge image. We also presented a simple algorithm for camera motion estimation based on a parametric motion model. In future work we would like to improve the performance of the proposed algorithm so that it can run in real-time applications. Furthermore, the algorithm could be improved for natural scenes by using different types of features.

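Fig. 1. Transforming image into graph.

Bilateral filtering (Fig. 2(b)) allows us to reduce the number of small edges. With a Gaussian function $g_\sigma$, the bilateral filter $I^{bf}$ of input image $I$ at pixel $p$ is defined as

$$I^{bf}_p = \frac{1}{k} \sum_{q \in S} g_{\sigma_s}(\lVert p - q \rVert)\, g_{\sigma_I}(\lvert I_p - I_q \rvert)\, I_q$$

where the sum runs over the pixels $q$ of the neighbourhood $S$ of $p$.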
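A brute-force bilateral filter of the kind used in the prefiltering step can be sketched as follows. The Gaussian form `exp(-d^2 / (2*sigma^2))`, the square window, and the default parameter values are assumptions for illustration and are not taken from this work.

```python
import math


def bilateral_filter(image, sigma_s=2.0, sigma_i=25.0, radius=2):
    """Bilateral filter for a grayscale image given as a list of rows."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            norm = 0.0  # normalizing coefficient: sum of all weights
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < h and 0 <= xx < w):
                        continue
                    # Spatial weight: falls off with distance in the image.
                    spatial = math.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
                    # Intensity weight: falls off with intensity difference.
                    diff = image[yy][xx] - image[y][x]
                    tonal = math.exp(-(diff * diff) / (2 * sigma_i ** 2))
                    weight = spatial * tonal
                    total += weight * image[yy][xx]
                    norm += weight
            out[y][x] = total / norm
    return out
```

Because pixels across a strong edge receive a near-zero intensity weight, a sharp step stays sharp while small variations on either side are averaged out.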

Fig. 2. Example of image to graph transformation. Refer to the text for details.

Fig. 3. Example of graph structure.

The next step is edge detection (Fig. 2(c)). In this work the Sobel edge detector was used. Edges with magnitude less than 50 were filtered out during this step. The remaining edges were mapped into $(\rho, \theta)$ space using the Hough transform, which generates a two-dimensional voting array. Local maxima of this array correspond to the strongest edges in the image. Applying the inverse Hough transform to these maxima gives a set of lines (Fig. 2(d)). Intersecting these lines with the detected edges gives straight line segments (Fig. 2(e)) and their intersections (Fig. 2(f)). The purpose of these steps is to obtain a graph-like structure. In further processing we use the information about edge intersections (vertices) and their connections (edges). The computational time of the proposed algorithm might be reduced by removing the Hough transform: it is not very important to detect exactly straight lines between vertices, since the important information is their connectivity. However, using straight lines instead of edges allows more robust detection of edge intersections. The information about edges can be represented in the form of a graph $G = (V, E)$, where $V = \{v_1, \dots, v_m\}$ is a set of vertices obtained from intersections and $E$ is a set of edges.
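The voting step of the Hough transform can be sketched as follows, using the normal form rho = x*cos(theta) + y*sin(theta). The angular resolution, the rho bin width, and the function name are hypothetical; the sketch returns only the single strongest line, whereas the method above looks for several local maxima.

```python
import math
from collections import Counter


def hough_peak(points, n_theta=180, rho_step=1.0):
    """Return (rho, theta, votes) of the strongest line through edge points."""
    votes = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            # Each edge point votes for every line that passes through it.
            votes[(round(rho / rho_step), t)] += 1
    (rho_bin, t), count = votes.most_common(1)[0]
    return rho_bin * rho_step, math.pi * t / n_theta, count


# Edge points of a vertical line x = 5.
rho, theta, count = hough_peak([(5, y) for y in range(100)])
```

Mapping a peak `(rho, theta)` back to the image gives the line of all points satisfying the normal-form equation, which is the inverse step mentioned in the text.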

Fig. 4. Planar camera motion estimation for translation (a) and translation with rotation (b).

The second group of tests was used to evaluate algorithm performance for full 3D camera motion with a known real camera trajectory. Three kinds of scenes were used: static scenes, scenes with moving objects, and scenes with a large number of natural objects (trees, grass, etc.). Examples of camera trajectory estimation are shown in Figs. 5 and 6.

Table 2. Average computational time for 200 vertices.