An Introduction to Model-Based Pose Estimation and 3-D Tracking Techniques

The aim of this chapter is to present a general overview of feature-based 3-D pose estimation and tracking techniques. Principles, classical techniques and recent advances are presented and discussed in the context of a monocular camera. The focus is on techniques employed within both the visual servoing and registration fields for the wide class of rigid objects. The main assumption is that a 3-D model of the object to track is available.


LSIIT - University of Strasbourg
France


Introduction: Model-based tracking and pose estimation
The recovery of 3-D geometric information from 2-D images is a fundamental problem in computer vision. When only one view is available, the appearance or the relative arrangement of the object features of interest should be modelled in a symbolic description, so that it can be compared with the image descriptors by means of a similarity criterion. Geometric-based approaches restrict the search for correspondences to a sparse set of geometrical features and use the numerical and symbolic properties of the available entities. To automatically compute a rigid-body transformation (the pose), it is necessary to match the features of a 3-D model with part of the visible 2-D image features, a process referred to as the correspondence problem. For the past four decades, the model-based pose estimation of objects with a simple geometry has been intensively studied.
The major goal is to track the pose parameters in the world space at camera frame rate. Therefore, features such as points, lines and ellipses are not only extracted from 2-D images; the 3-D model and the pose of the object also have to be exploited.

Related work on model-based 3-D tracking
Model-based 3-D tracking is closely related to the pose estimation problem. It can cope with abrupt motions and generally deals with partial occlusions of the object of interest more efficiently than 2-D tracking. However, it requires the correspondence problem to be solved at least once. The definition of object tracking algorithms in image sequences is an important issue for research and applications related to robot vision. A robust extraction and real-time spatio-temporal tracking of measurements is one of the keys to successful visual servoing-based tracking, in particular for position-based visual servoing approaches.

The importance of feature grouping
The pose determination from a single perspective view requires a local geometric description of objects and image features of interest. The search space for the one-to-one correspondence may be very large when no constraint is used to associate them. Various recognition schemes have been proposed in the past to solve the correspondence problem, such as the interpretation tree (Grimson, 1990), geometric hashing (Lamdan, 1988), aspect graphs (Ikeuchi, 1987; Hansen, 1989), focal features (Bolles, 1982 and 1986), pose clustering (Olson, 2001), soft assignment (Gold, 1998; David, 2002) and alignment techniques (Huttenlocher, 1990; Torr, 1999; Bartoli, 2001).
Many authors have presented solutions to this problem in the context of registration with alignment techniques. The goal is then to match two subsets (a subset of image features with a subset of features of the 3-D model) corresponding to a geometric transformation consistent with the data (Haralick, 1984; Lowe, 1987; Ikeuchi, 1987; Thompson, 1987; Sugihara, 1988; Grimson, 1991; Jurie, 1999; David, 2002). For linear primitives such as points (vertices of a polyhedron, corners, zero-curvature points, intersections of lines, ...) or straight lines (edges, tangent lines at zero-curvature points, polar lines, ...), the search space of a consistent viewpoint is large, as several features are needed to constrain it (typically 3 to 5 matched linear features are required). Quadratic primitives constrain the viewpoint more strongly: in many situations, only one quadratic feature is needed to find a small set of consistent pose parameters. On the other hand, quadratic primitives are trickier to detect with a sufficient level of reliability (see (Weiss, 1993; Werghi, 1995; Pilu, 1996)). The size of these subsets depends both on the geometric transformation and on the number of degrees of freedom (dof) constrained by the features used. Feature grouping and classification are therefore important stages, which speed up the correspondence search by rejecting a priori the subsets that are very likely to be inconsistent. For instance, the determination of a one-dimensional homography between two sets of points requires the grouping of 3 collinear points in both sets, since a one-dimensional projective basis is defined by three points on the same line.

An introduction to the correspondence problem
We introduce the correspondence problem because it is an essential vision component for automating the pose computation and detecting outliers. A set of $n_m$ 3-D model features and a set of $n_I$ 2-D image features are used through a matching process to determine the "best" Euclidean transformation. Usually $n_m$ and $n_I$ differ for several reasons: a model feature may be occluded or outside the camera field of view, or some irrelevant feature may be detected in the image because of the acquisition and segmentation processes. These artefacts act as outliers and must be discarded. In any case, even if $n_I = n_m$, a feature-based pose determination algorithm needs the one-to-one correspondence between subsets of all available features. A naive approach would compute the large set of geometric transformations for all combinations of $n_m$ ordered model features and $n_m$ unordered image points, and select the transformation with the lowest target registration error (the "best" transformation). However, this is impractical, since the number of combinations increases very quickly with the number of features and leads to a prohibitive computing time.
The matching process must be fast, and its computing time should be as insensitive as possible to a small variation of the number of features involved. That is why it is of prime importance to design pose algorithms that need few features, even if the final registration computation takes advantage of all available inliers.
For instance, when one solves for the pose by means of the well-known perspective 3-point problem (Horaud, 1987; Haralick, 1989; DeMenthon, 1992; Alter, 1994), $n_m$ model points should be matched with $n_I$ image points, considering every arrangement of triples of feature points. Hence, for $n_m$ 3-D model points and $n_I$ image points, there are $\binom{n_m}{3}\binom{n_I}{3}\,3!$ candidates (couples of triples) (Huttenlocher, 1990). To register the polyhedron with respect to the camera frame in Figure 1, there are typically $n_m = 8$ vertices and at most $n_I = 7$ visible points in the image. This leads to a large search space, since 11 760 putative couples of triples should be scanned to find the right alignment. One more image point issuing from artefacts raises the number of candidates to 18 816, and two more image points to 28 224. To provide a practical solution, some authors have turned to estimation theory. There are many robust estimation algorithms; we mention one of the most popular families of techniques, referred to as the hypothesize-and-test approach (Grimson, 1990), where a small set of correspondences is first hypothesized and the corresponding transformation is computed. The best-known example of this approach is the RANdom SAmple Consensus (RANSAC) algorithm of Fischler and Bolles (Fischler, 1981), which is able to cope with a large proportion of outliers and to automatically compute a geometrical transformation. A closely related algorithm is that of Rousseeuw & Leroy (Rousseeuw, 1987), applied by Rosin (Rosin, 1999) to pose determination.
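The candidate counts quoted above (11 760, 18 816 and 28 224) can be checked with a few lines of Python. This is only an illustrative sketch: the function name `p3p_candidates` is ours, and the count assumes that every unordered triple of model points is paired with every unordered triple of image points under all 3! orderings.

```python
from math import comb, factorial


def p3p_candidates(n_model: int, n_image: int) -> int:
    """Number of (model triple, image triple) pairings to test.

    Assumption: C(n_model, 3) model triples, C(n_image, 3) image triples,
    and 3! possible orderings of the image triple for each pairing.
    """
    return comb(n_model, 3) * comb(n_image, 3) * factorial(3)


if __name__ == "__main__":
    for n_image in (7, 8, 9):
        print(n_image, p3p_candidates(8, n_image))
```

The growth is cubic in both $n_m$ and $n_I$, which is why hypothesize-and-test schemes that sample only a few triples are preferred to an exhaustive scan.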

Some pose estimation problems
We begin this section with a short review of well-known techniques for the pose from feature points and lines in general configuration, so as to orient the discussion toward less well-known techniques based on collinear points, spheres, cylinders and multiple heterogeneous features.

Pose recovery from lines
Usually, the extension of the previously mentioned methods to the case of lines is straightforward for some specific arrangements, as lines and points are dual entities in a projective plane. For instance, with a non-parametric planar curve, it is possible to extract zero-curvature points (which are invariant under perspective projection in most cases), but since such points are poorly localized, one may prefer to use tangent lines instead of zero-curvature points to match the 2-D image features with the object model (Mokhtarian, 1986; Richetin, 1991). In the case of a polyhedron (see Figure 1 and Figure 3), a 3-D model can be built with a set of 3-D straight lines which are generally not in the same plane. The line-point duality does not hold any more and a specific method for 3-D lines (4 dof) should be investigated. This is a more difficult problem than that of the pose from points, since the equations which relate the 3-D line representation (see Figure 2) to its perspective projection in the image are quadratic functions of the pose parameters (Hartley, 2000). Dhome et al. (Dhome, 1989), Chen (Chen, 1991b), Liu et al. (Liu, 1990), Navab & Faugeras (Navab, 1993), Andreff et al. (Andreff, 2000), Bartoli & Sturm (Bartoli, 2001) and Ansar & Daniilidis (Ansar, 2003) have proposed pose and displacement algorithms from lines. In (Dhome, 1989), the solution is given by solving a polynomial equation of degree 8 with at least three lines. Chen (Chen, 1991) points out some particular and important arrangements of 3 lines (concurrent lines, three lines with two parallel lines, coplanar lines, perpendicular lines, ...) in order to reduce the degree of the polynomial equation, and consequently the number of solutions. The various approaches differ in the geometric constraint used (Liu, 1990; Chen, 1991) and in the type of representation used, such as the Plückerian representation for the pose and motion analysis (Navab, 1993; Mitiche, 1995) or for large displacements (Bartoli, 2001), its normalized version (Andreff, 2000), or a representation leading to a linear algorithm with large data redundancy (Ansar, 2003).
The perspective projection of a 3-D line represented with the Plücker matrix $L$ (whose components are described by the Plücker coordinates $(\mathbf{r}, \mathbf{w})$, with $\mathbf{r}^T \mathbf{w} = 0$, in the object frame) is the line $\mathbf{l}^c$ such that
$$[\mathbf{l}^c]_\times = P^c \, L \, (P^c)^T \qquad (1)$$
Figure 2. (b) The back-projection of an image line does not completely define the 3-D line from which it came.
The vector $\mathbf{w}$ is perpendicular to the plane which contains the straight line and the origin of the frame; when this origin is the projection centre $C$ (Figure 2b), this plane is the interpretation plane (also called the pull-back plane). The Plücker representation is suitable since one may easily deal with geometrical transformations (Bartoli, 2001), including the perspective projection. The dual of the Plücker matrix, $L^*$, may also represent the 3-D straight line, as the intersection of two planes. $L$ and $L^*$ are related by a simple rule ($L L^* = 0$) and both representations are commonly used. These two (4 x 4) matrices are defined up to a scale, skew-symmetric and singular. The rank value (2) expresses the orthogonality constraint between the two vectors $\mathbf{r}$ and $\mathbf{w}$. Moreover, from the characteristic polynomial of the Plücker matrix, one can easily show that its eigenvalues are the complex conjugate scalars $(i\mu, -i\mu, 0, 0)$ with $\mu^2 = \|\mathbf{r}\|^2 + \|\mathbf{w}\|^2$; $\mu$ can be arbitrarily set to any non-zero value in order to normalize $L$. $P^c$ is the perspectivity, a (3 x 4) real matrix composed of the camera parameters matrix $K^c$, the 3-D rotation $R$ and the position vector $\mathbf{t}$ of the object frame w.r.t. the camera frame:
$$P^c = K^c \, [\, R \;\; \mathbf{t} \,] \qquad (3)$$
Figure 3. Tracking a simple shape in a structured environment. The model-based polyhedral object pose estimation is used to compute the camera displacements during the image sequence.
It is now clear that (1) is nonlinear with respect to the pose parameters $\mathbf{r}_1$, $\mathbf{r}_2$, $\mathbf{r}_3$ and $\mathbf{t}$. The pose computation needs to solve for these parameters given a set of $n$ 3-D lines $L_i$ and the corresponding imaged lines $\mathbf{l}^c_i$. It may be shown from equation (1), applied to the $i$-th line, that a relation (4) holds, in which $\otimes$ denotes the matrix Kronecker product. This is a linear system with respect to the (18 x 1) vector of algebraically dependent unknowns, and it can be solved with at least $n = 3$ straight lines.
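The projection of a Plücker line can be illustrated numerically. The sketch below is not taken from the chapter: it assumes the standard construction $L = A B^T - B A^T$ from two homogeneous points $A$ and $B$ of the line and the standard relation $[\mathbf{l}]_\times = P L P^T$; the function names and the camera values are purely illustrative.

```python
import numpy as np


def plucker_matrix(A, B):
    """4x4 skew-symmetric Plücker matrix of the line through the
    homogeneous 3-D points A and B (assumed construction A B^T - B A^T)."""
    return np.outer(A, B) - np.outer(B, A)


def project_line(P, L):
    """Image line l (3-vector, up to scale) such that [l]_x = P L P^T."""
    S = P @ L @ P.T  # 3x3 skew-symmetric matrix
    return np.array([S[2, 1], S[0, 2], S[1, 0]])


# Illustrative camera P = K [R t]; all numerical values are assumptions.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([[0.1], [-0.2], [2.0]])
P = K @ np.hstack([R, t])

A = np.array([0.0, 0.0, 0.0, 1.0])
B = np.array([1.0, 0.5, 0.3, 1.0])
L = plucker_matrix(A, B)
l = project_line(P, L)

# The images of A and B must lie on the projected line.
a, b = P @ A, P @ B
print(l @ a, l @ b)  # both values should be close to zero
```

The check uses two points only to build and validate the line; a pose algorithm would instead treat the entries of $R$ and $\mathbf{t}$ inside `P` as unknowns.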

Pose recovery from collinear points
In this part, we discuss a very particular case of pose from points, in which all points lie on a common line, as is the case with fiducial markers or with patterns of structured lighting.
Recovering the relative orientation (2 dof, a unit vector $\mathbf{r}$) and the position (a 3-vector $\mathbf{t}$) of a set of $n$ collinear points, such as the markers in Figure 4, with respect to the camera frame was investigated by Haralick (Haralick, 1992). The interpoint distances and the focal length $f$ of the camera are assumed to be known. Haralick solved this problem with a linear algorithm. Let $P_0 = \mathbf{t}$, $P_1 = \mathbf{t} + \lambda_1 \mathbf{r}$, $P_2 = \mathbf{t} + \lambda_2 \mathbf{r}$, ..., $P_{n-1} = \mathbf{t} + \lambda_{n-1} \mathbf{r}$ be $n$ distinct points, where $\lambda_i$ is the signed distance from the first point $P_0$ to the point $P_i$. The first point $P_0$ is arbitrarily chosen as the origin ($\lambda_0 = 0$); hence the perspective projection $Q_i$ of the point $P_i$, with homogeneous coordinates $(u_i, v_i, 1)$, is given by
$$Q_i = (u_i, v_i, 1)^T \propto K^c \, (\mathbf{t} + \lambda_i \mathbf{r}) \qquad (5)$$
where $K^c$ is a (3 x 3) upper triangular matrix containing the parameters of the camera. From the above equation, Haralick built a homogeneous linear system with the matrix $K^c = \mathrm{diag}(f, f, 1)$ and the vectors $\mathbf{t}$ and $\mathbf{r}$ as unknowns:
$$A_r \, \mathbf{r} + A_t \, \mathbf{t} = 0 \qquad (6)$$
$A_r$ and $A_t$ are two (2n x 3) real matrices whose components are functions of the camera parameters, the image coordinates and the $\lambda_i$'s. A closed-form solution can be found with $n \geq 3$ distinct points. This system may be reformulated as a classical optimization problem with the equality constraint $\|\mathbf{r}\| = 1$. The solution for $\mathbf{r}$ is given by the eigenvector associated with the smallest eigenvalue of the symmetric matrix $A_r^T \left( I - A_t (A_t^T A_t)^{-1} A_t^T \right) A_r$, and the position vector is then straightforwardly given by $\mathbf{t} = -(A_t^T A_t)^{-1} A_t^T A_r \, \mathbf{r}$. We end up with two estimates for $\mathbf{r}$ (a twofold ambiguity in the sign). However, for real objects placed in front of the camera, the third component of the vector $\mathbf{t}$ must be strictly positive, assuming that the camera z-axis (usually the optical axis) points towards the scene. This leads to the uniqueness of the solution for the pose. It is worth pointing out that in the presence of both noisy data and closely spaced points in the object pattern, the matrices $A_r$ and $A_t$ are ill-conditioned, which may introduce a significant bias in the results. The use of least mean squares for $n > 3$ and the lack of data normalization in the original algorithm make the solution sensitive to the matrix condition number. One has to pay attention to data normalization (Hartley, 1997), since the pose may have to be computed with points that are not well scattered, which may also lead to numerical problems. To lower the condition number, it seems advisable to normalize the data coordinates with an affine transformation (Trucco, 1998).
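A minimal NumPy sketch of this constrained least-squares solution is given below. It is a reading of the linear method under stated assumptions, not Haralick's original code: the row layout of `A_r` and `A_t` is derived here from $K^c = \mathrm{diag}(f, f, 1)$ (principal point at the image origin), the name `pose_from_collinear_points` is ours, and no data normalization is applied.

```python
import numpy as np


def pose_from_collinear_points(uv, lam, f):
    """Estimate the unit direction r and position t of collinear points.

    uv  : (n, 2) image coordinates, principal point at the origin (assumption)
    lam : (n,) signed distances of each point to the first one (lam[0] = 0)
    f   : focal length, with K = diag(f, f, 1)
    """
    uv = np.asarray(uv, dtype=float)
    lam = np.asarray(lam, dtype=float)
    n = len(lam)
    # Each point gives f*X - u*Z = 0 and f*Y - v*Z = 0, with X = t + lam*r.
    A_t = np.zeros((2 * n, 3))
    A_t[0::2] = np.column_stack([np.full(n, f), np.zeros(n), -uv[:, 0]])
    A_t[1::2] = np.column_stack([np.zeros(n), np.full(n, f), -uv[:, 1]])
    A_r = np.repeat(lam, 2)[:, None] * A_t
    # Eliminate t: t = -(A_t^T A_t)^-1 A_t^T A_r r
    pinv_t = np.linalg.solve(A_t.T @ A_t, A_t.T)
    proj = np.eye(2 * n) - A_t @ pinv_t
    M = A_r.T @ proj @ A_r
    M = 0.5 * (M + M.T)          # enforce symmetry against round-off
    _, vecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    r = vecs[:, 0]               # eigenvector of the smallest eigenvalue
    t = -pinv_t @ A_r @ r
    if t[2] < 0:                 # object must lie in front of the camera
        r, t = -r, -t
    return r, t


# Synthetic, noise-free example (all values are illustrative).
r_true = np.array([0.3, -0.2, 0.5])
r_true /= np.linalg.norm(r_true)
t_true = np.array([0.1, -0.05, 2.0])
lam = np.array([0.0, 0.1, 0.25, 0.4])
f = 800.0
pts = t_true + lam[:, None] * r_true
uv = f * pts[:, :2] / pts[:, 2:3]
r_est, t_est = pose_from_collinear_points(uv, lam, f)
```

With noisy or closely spaced points, normalizing the image coordinates before building the matrices would be the natural next step, in line with the remarks on conditioning above.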

Pose recovery from spheres
To deal with quadratic primitives, we begin with the pose from spheres. Under central projection, the projection rays tangent to the sphere surface form a cone whose vertex is the projection centre (see Figure 5a). The intersection of that cone with the sphere surface is called the contour generator, whereas its intersection with the image plane provides the apparent contour; both are elliptic planar curves in general. To our knowledge, the mathematical formulation and a solution to the 3-D pose of spherical objects were first proposed by Shin and Ahmad (Shin, 1989). The solution was based on 3-D analytical geometry and a closed-form solution is given. Safaee-Rad et al. (Safaee-Rad, 1992) have also studied the pose of spherical objects in the context of mobile robotics and have pointed out some major practical limitations of the pose accuracy, such as the location of the edges used in the fitting of the ellipse parameters, the uncertainty of the intrinsic scale factors (due to timing mismatches between the camera scanning hardware and the image acquisition hardware) and the radial distortion of the lens. Ferri et al. (Ferri, 1993) present some algorithms for linear and quadratic primitives which include the computation of the 3-D pose, in particular for a quadric of revolution. They provide a simple pose recovery procedure from the eigendecomposition of the matrix representation and mention that, in the case of a sphere, it must have two equal eigenvalues. Pose determination and camera calibration by means of images of spherical objects have also been studied, in particular by Teramoto and Xu (Teramoto, 2002), by Agrawal and Davis (Agrawal, 2003), by Dhome et al. (Dhome, 1990) and by Daucher et al. (Daucher, 1994).
A sphere $\mathcal{S}$ is a quadric and, with the homogeneous coordinates of any point $M$ on the sphere surface, it can be represented by a (4 x 4) symmetric matrix $S$ such that $M^T S M = 0$. The matrix $S$ depends on the radius $r_s$ and on the position vector $\mathbf{t} = (t_x, t_y, t_z)^T$ of the sphere centre in the world reference frame. When the camera reference frame coincides with the world reference frame, the z-axis is chosen perpendicular to the image plane and pointing towards the scene. When expressed in that frame, $S$ is given by
$$S = \begin{bmatrix} I & -\mathbf{t} \\ -\mathbf{t}^T & \mathbf{t}^T\mathbf{t} - r_s^2 \end{bmatrix} \qquad (7)$$
where $I$ is the (3 x 3) identity matrix. Since $\|\mathbf{t}\|$ is the distance between the sphere centre and the projection centre $C$, the scalar $\|\mathbf{t}\|^2 - r_s^2$ must be positive, as is the case for a sphere placed in front of the camera, taking into account its own size. The dual of the sphere is represented by the adjoint $S^*$ of $S$; since $S$ is a nonsingular symmetric matrix, it is given, up to scale, by $S^* = S^{-1}$. Both matrices $S$ and $S^*$ will be used hereinafter, since a sphere and its image are easily related by dual matrices. With the pinhole camera model and homogeneous coordinates, a 3-D point $M$ is projected onto the image plane in a 2-D point $m$ such that
$$\alpha \, m = P^c \, M \qquad (8)$$
with
$$P^c = K^c \, [\, I \;\; 0 \,] \qquad (9)$$
where the (3 x 4) matrix $P^c$ is a perspectivity (or camera matrix) and $\alpha$ is a non-null scalar. When expressed in the camera frame, with the projection centre $C$ as the origin, the camera matrix is of the form given by (9). Under the camera matrix $P^c$, the outline of the sphere is an ellipse in the image which can be represented by a (3 x 3) symmetric matrix $E$. The dual of $E$ is the adjoint $E^*$, which is related to the adjoint of $S$ as follows:
$$E^* \propto P^c \, S^* \, (P^c)^T \qquad (10)$$
By substituting equations (7) and (9) in (10), it has been shown by Agrawal (Agrawal, 2003) that
$$E^* \propto K^c \left( I - \frac{\mathbf{t}\,\mathbf{t}^T}{r_s^2} \right) (K^c)^T \qquad (11)$$
The work of Teramoto et al. (Teramoto, 2002) is a pose determination method in which the direction of the position vector and the size of a ball are derived in the case of known intrinsic parameters. It is extended here to the recovery of the full 3-D position, with a slight modification of the original work which will serve to analyze the eigendecomposition in the presence of noisy data. Given the dual matrix $E^*$, let us denote by $\mathbf{t}_u = \mathbf{t}/s$ a unit vector, with a non-null scalar $s = \pm\|\mathbf{t}\|$. Starting from equation (11), we have:
$$Q^* \propto I - \frac{s^2}{r_s^2} \, \mathbf{t}_u \mathbf{t}_u^T \qquad (12)$$
with $Q^* = (K^c)^{-1} E^* (K^c)^{-T}$. Since $|s|/r_s$ is always greater than 1, the right-hand side of the above equation is a rank-3 matrix; it can be written as
$$I - \frac{s^2}{r_s^2} \, \mathbf{t}_u \mathbf{t}_u^T = V \, \mathrm{diag}\!\left(1, \, 1, \, 1 - \frac{s^2}{r_s^2}\right) V^T \qquad (13)$$
where $V$ is an orthonormal matrix whose third column is $\mathbf{t}_u$, and the left-hand side can be decomposed as $U \, \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3) \, U^T$, where $U = [U_1, U_2, U_3]$ is an orthonormal matrix. It is clear that we have
$$s^2 = r_s^2 \left( 1 - \frac{\lambda_3}{\lambda_1} \right) \qquad (14)$$
and
$$\mathbf{t}_u = U_3 \qquad (15)$$
The solution is unique, since the sign of $t_z$ ($t_z > 0$) reveals the sign of the scalar $s$. It is worth pointing out that, on the one hand, the symmetric matrix $E^*$ has 5 independent components, as it represents an ellipse in the image. On the other hand, the symmetric matrix $I - \mathbf{t}\mathbf{t}^T / r_s^2$ has two eigenvalues equal to one (clearly $\mathbf{t}\mathbf{t}^T$ is a rank-1 matrix), hence representing a "calibrated ellipse". The geometric information issued from equation (11) has therefore not been fully exploited. This means that either the position vector $\mathbf{t}$ can be solved with an overdetermined system (with data redundancy), or other parameters (like intrinsic parameters) may be determined from a unique sphere and its corresponding ellipse in a single image. Let us now consider the former case. Following this remark, a simple improvement of Teramoto's method consists of a slight modification of equation (14), since the two singular values $\lambda_1$ and $\lambda_2$ must be equal with uncorrupted data. Thus, we propose to replace equation (14) by
$$s^2 = r_s^2 \left( 1 - \frac{2\,\lambda_3}{\lambda_1 + \lambda_2} \right) \qquad (16)$$
that is, $\lambda_1$ is replaced by the mid-value of the first two singular values in the presence of noise.
The resulting matrix $Q_m^*$ is then the closest symmetric matrix to $Q^*$, in the sense of the Frobenius norm, that has two equal eigenvalues. With $U \, \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3) \, U^T$ the eigendecomposition of $Q^*$, it is given by $Q_m^* = U \, \mathrm{diag}(\bar{\lambda}, \bar{\lambda}, \lambda_3) \, U^T$ with $\bar{\lambda} = (\lambda_1 + \lambda_2)/2$. Once the matrix $Q_m^*$ has been derived, the (4 x 4) matrix $S$ may be computed with the position vector estimated by Teramoto's method applied to $Q_m^*$. Simulation results reported in Figure 6 show that the modified version we propose behaves better with respect to noise than the original algorithm. It is simple to implement and does not require more computations. It has been used as an efficient tool for tracking the 3-D position of a moving camera mounted on a wheeled robot in a robotic competition (see Figure 7).
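The eigendecomposition-based recovery of the sphere position, with the averaging of the two nearly equal eigenvalues, can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the dual ellipse is taken to satisfy $E^* \propto K^c (I - \mathbf{t}\mathbf{t}^T/r_s^2)(K^c)^T$ with an arbitrary (possibly negative) scale, and the function name `sphere_position` and the numerical values are ours.

```python
import numpy as np


def sphere_position(E_dual, K, r_s):
    """Position t of a sphere centre from the dual of its image ellipse.

    E_dual : (3, 3) symmetric dual conic of the outline, known up to scale
    K      : (3, 3) camera parameters matrix
    r_s    : known sphere radius
    """
    K_inv = np.linalg.inv(K)
    Q = K_inv @ E_dual @ K_inv.T
    Q = 0.5 * (Q + Q.T)                 # remove round-off asymmetry
    lam, U = np.linalg.eigh(Q)          # eigenvalues in ascending order
    # Two eigenvalues share a sign; the third (isolated) one belongs to t_u.
    odd = 0 if np.sum(lam > 0) == 2 else 2
    pair = [i for i in range(3) if i != odd]
    lam_bar = lam[pair].mean()          # mid-value of the two close eigenvalues
    s = r_s * np.sqrt(1.0 - lam[odd] / lam_bar)
    t = s * U[:, odd]
    return t if t[2] > 0 else -t        # sphere in front of the camera


# Synthetic, noise-free check (illustrative values).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
t_true = np.array([0.2, -0.1, 1.5])
r_s = 0.1
E_dual = -3.7 * K @ (np.eye(3) - np.outer(t_true, t_true) / r_s**2) @ K.T
t_est = sphere_position(E_dual, K, r_s)
```

Because only the ratio of the isolated eigenvalue to the mean of the other two is used, the estimate does not depend on the scale or sign of the fitted conic.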

Pose estimation from cylinders
To continue with quadratic primitives, we now discuss the pose from cylinders, especially straight homogeneous circular cylinders (SHCC), that is, the class of cylinders with a straight axis and a circular section of constant radius (see Figure 8). In the late 1980s and early 1990s, shape-from-contour approaches were developed in an attempt to determine constraints on a three-dimensional scene structure based on different assumptions about the shape. Understanding the relations between the geometry of image contours, the shape of the observed object and the viewing parameters is still a challenging problem, and it is essential that special shapes are not represented by free-form surfaces without regard to their special properties, but treated in a way more appropriate to their simple nature. Explicit relations from occluding contours to the model of a curved three-dimensional object have been presented for objects with geometrical properties such as generalized cylinders or surfaces of revolution (Dhome, 1992; Ferri, 1993; Kriegman, 1990; Ponce, 1989). More recent works are based on the image contour of a cylinder cross-section when it is visible. Puech et al. (Puech, 1997) used the image of two cross-sections to locate a straight uniform generalized cylinder in 3-D space, and Shiu and Huang (Shiu, 1991) solve the problem for a finite and known cylinder height, that is, a 3-D pose determination with 5 degrees of freedom. Safaee-Rad et al. (Safaee-Rad, 1992) estimate the 3-D circle centre and orientation from the projection of one of the two circles on the cylinder ends. However, ellipse fitting generally becomes inaccurate when the cylinder radius is small with respect to the cylinder height, and also because the two circles on the cylinder are not completely visible. Huang et al. (Huang, 1996) solve the pose determination of a cylinder through a reprojection transformation which may be thought of as a rectification transformation. The computed transformation is applied to the image of the cylinder and brings the camera optical axis to perpendicularly intersect the cylinder axis, which is then parallel to one of the two image axes. The new image (called the "canonical" image) is a symmetrical pattern which simplifies the computation of the pose. It is an interesting method which provides an analytical solution to the problem, including the recovery of the height of the cylinder. However, it requires an image transformation, and errors in the estimation of the reprojection transformation may lead to a significant bias in the location of the contours in the resulting canonical image, and consequently in the pose parameters. In a similar way, Wong et al. (Wong, 2004) take advantage of the invariance of a surface of revolution (SOR) to harmonic homology and have proposed to recover the depth and the focal length (by assuming that the principal point is located at the image centre and that the camera has unit aspect ratio) from the resulting silhouette, which exhibits a bilateral symmetry. It is also a rectification, which brings the revolution axis to coincide with the y-axis of the image. If the image of a latitude circle of the SOR can be located, the orientation of the revolution axis can also be estimated.
Figure 8. A straight homogeneous circular cylinder and its perspective projection. The back-projection of the apparent lines $(\mathbf{l}^-, \mathbf{l}^+)$ is a pair of planes $(P^c)^T \mathbf{l}^-$ and $(P^c)^T \mathbf{l}^+$ passing through the projection centre. The image of the cylinder axis is the axis $\mathbf{l}_s$ of the harmonic homology $H$ relating the two apparent lines.
Some other approaches based on contours and shading have also been proposed. Asada et al. (Asada, 1992) provide a technique to recognize the shape of a cylinder cross-section and to determine the orientation of the axis under the weak Lambertian assumption for the reflectance of the object surface, when the surface does not include a specular component. Caglioti and Castelli (Caglioti, 1999) focus rather on metallic surfaces (cylinders and cones) and recover the pose parameters by means of the axial symmetric reflection model. However, although methods which do not solely involve geometric features deserve further investigation to achieve an accurate pose estimation, these two latter methods assume an orthographic projection for the camera model, which is too strong an approximation when the cylinder orientation is partly embedded in the perspective effect. It has been shown in (Doignon, 2007) that the Plücker coordinates $(\mathbf{r}, \mathbf{w})$ of the cylinder axis can be directly recovered, with a calibrated camera, from the degenerate conic $C = \mathbf{l}^- (\mathbf{l}^+)^T + \mathbf{l}^+ (\mathbf{l}^-)^T$ built with the two apparent lines $(\mathbf{l}^-, \mathbf{l}^+)$ and from the cylinder's radius $r_c$. On the one hand, we have equation (17); on the other hand, after some computations (see (Doignon, 2007) for details), we obtain equation (18), which involves the unit vector $\mathbf{z}_u = \mathbf{z}/\|\mathbf{z}\|$. The Plücker coordinates of the axis follow from these two relations. Some results, illustrated in Figure 9 and Figure 10, show the efficacy of the proposed fitting and pose estimation. Firstly, the pose determination is performed in a simple and controlled environment (Figure 9). Secondly, in a complex environment with a moving background (the abdomen of a pig), a cylindrical-shaped laparoscopic instrument is detected (thanks to a joint hue-saturation colour attribute). The Plücker coordinates are computed with the method described above. It is shown in (Doignon, 2007) that the direction of the vector $\mathbf{w}$ is directly related to the image of the cylinder's axis.
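The idea of reading the axis of a cylinder from the degenerate conic of its two apparent lines can be sketched as follows. This is our own derivation under stated assumptions, not necessarily the formulation of (Doignon, 2007): the camera is calibrated, the axis lies at a distance $d > \sqrt{2}\, r_c$ from the projection centre, the point of the axis closest to the projection centre has a positive depth, and the Plücker convention $\mathbf{w} = \mathbf{p} \times \mathbf{r}$ is used for a point $\mathbf{p}$ of the axis. With unit normals $\mathbf{n}^\pm = K^T \mathbf{l}^\pm$ of the two tangent planes, $\mathbf{n}^-(\mathbf{n}^+)^T + \mathbf{n}^+(\mathbf{n}^-)^T$ has eigenvalues proportional to $1 - \rho^2$, $-\rho^2$ and $0$, with $\rho = r_c/d$. The function name is hypothetical.

```python
import numpy as np


def cylinder_axis_from_lines(l_minus, l_plus, K, r_c):
    """Plücker coordinates (r, w) of a cylinder axis in the camera frame.

    l_minus, l_plus : apparent lines in the image (3-vectors, up to scale)
    K               : (3, 3) camera parameters matrix
    r_c             : known cylinder radius
    Assumes d > sqrt(2) * r_c and a closest axis point with positive depth.
    """
    n_m, n_p = K.T @ l_minus, K.T @ l_plus       # tangent-plane normals
    C = np.outer(n_m, n_p) + np.outer(n_p, n_m)  # calibrated degenerate conic
    lam, U = np.linalg.eigh(C)
    order = np.argsort(np.abs(lam))              # ~0, small, large magnitude
    r = U[:, order[0]]                           # axis direction (null vector)
    y = U[:, order[1]]                           # direction of closest point
    small, large = abs(lam[order[1]]), abs(lam[order[2]])
    rho2 = small / (small + large)               # (r_c / d)^2
    d = r_c / np.sqrt(rho2)
    p0 = d * y
    if p0[2] < 0:                                # keep the axis in front
        p0 = -p0
    w = np.cross(p0, r)                          # moment vector
    return r, w


# Synthetic check (illustrative values).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
r_true = np.array([1.0, 0.2, 0.1])
r_true /= np.linalg.norm(r_true)
p = np.array([0.0, 0.0, 2.0])
p0_true = p - (p @ r_true) * r_true              # closest point to the origin
d_true, r_c = np.linalg.norm(p0_true), 0.05
z = np.cross(p0_true, r_true) / d_true
y = p0_true / d_true
rho = r_c / d_true
K_invT = np.linalg.inv(K).T
l_minus = 2.0 * K_invT @ (np.sqrt(1 - rho**2) * z - rho * y)
l_plus = -0.5 * K_invT @ (np.sqrt(1 - rho**2) * z + rho * y)
r_est, w_est = cylinder_axis_from_lines(l_minus, l_plus, K, r_c)
```

The pairs $(\mathbf{r}, \mathbf{w})$ and $(-\mathbf{r}, -\mathbf{w})$ describe the same line, so only the line itself (for instance through $\mathbf{r} \times \mathbf{w}$, the point of the axis closest to the projection centre) should be compared with a reference.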
Figure 9. The 3-D tracking of the cylinder's axis through the detection of the conjugate apparent lines in a uniform background.
Figure 10. The 3-D tracking of the axis of cylindrical-shaped laparoscopic instruments through the detection of the conjugate apparent lines in a complex environment (abdomen of a pig).

Pose estimation with multiple types of geometrical features
We now turn towards slightly more complex environments. As some features may be occluded during the tracking, it is necessary to estimate the pose with multiple features. We introduce this approach with an example in Figure 11. The virtual visual servoing-based pose estimation (see the end of paragraph 2.1) is carried out with three geometrical features: a cylinder, a circular needle and marker points. Only four degrees of freedom are necessary to estimate the attitude of the instrument axis. These 4 dof of the pose can be determined using the contour generator of the cylinder and its image (the apparent contour) (see paragraph 3.5). However, the positions of the marking spots not only define the proper rotations and translations, but also give information on the orientation and position of the axis of the shaft. We therefore chose to estimate the 6 dof of the instrument with all the available features. This can be done with analytical methods using both the apparent contours and one known point on the cylinder's surface (Nageotte, 2006). The full pose estimation is interesting for robustness considerations only if all the available information given by the apparent lines and all the spots is used. For this purpose, the Virtual Visual Servoing (VVS) approach, due to Sundareswaran (Sundareswaran, 1998) and also to Marchand (Marchand, 2002), can handle the information redundancy. VVS is a numerical iterative method which minimizes the error between the extracted features and the forward projection of the object in the images, and which is based on image-based visual servoing (IBVS) schemes. This process needs the computation of an interaction matrix which relates the variations of each image feature to the camera velocity screw. With these image features, the full interaction matrix $L_s$ has the generic form (19), obtained by stacking the interaction matrices of the individual features. The interaction matrices associated with a point $p$, $L_{pt}$, with a line $l$, $L_{line}$, and with an ellipse $E = (x_e, y_e, r_{min}, r_{max}, \alpha_e)$, $L_{ellipse}$, can be found in Espiau et al. (Espiau, 1992) or in Chaumette et al. (Chaumette, 1993). In order to guarantee a fast convergence and a good stability of the VVS, it is useful to initialize the algorithm close enough to the real pose. For this purpose, we use the modified version of Haralick's method and the DeMenthon iterative method (DeMenthon, 1995) for points, that of Dhome (Dhome, 1992) to get the 4 solutions for the pose of a circle, and the pose determination of the axis of a circular cylinder described in paragraph 3.5. With these initial estimates of the pose parameters for the attitude of the camera with respect to the object of interest (a laparoscopic surgical instrument, see Figure 11), the control law (20) is applied to the virtual camera until the norm of the control vector becomes smaller than a specified value. The process converges quickly towards the real pose of the camera (see Figure 11).
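To make the iteration concrete, here is a minimal virtual visual servoing sketch restricted to point features. It is an illustration under assumptions, not the implementation used for Figure 11: the control law is taken to be the classical IBVS law $\mathbf{v} = -\lambda\, L_s^{+} (\mathbf{s} - \mathbf{s}^*)$, the point interaction matrix is the standard one for normalized image coordinates, and all function names and numerical values are ours.

```python
import numpy as np


def skew(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])


def exp_se3(v, w):
    """Homogeneous transform obtained as the exponential of the twist (v, w)."""
    theta = np.linalg.norm(w)
    W = skew(w)
    if theta < 1e-12:
        R, V = np.eye(3) + W, np.eye(3)
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta**2
        c = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + a * W + b * (W @ W)
        V = np.eye(3) + b * W + c * (W @ W)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T


def point_features(T, X_obj):
    """Normalized image points s and stacked interaction matrix L_s."""
    Xc = X_obj @ T[:3, :3].T + T[:3, 3]
    x, y, Z = Xc[:, 0] / Xc[:, 2], Xc[:, 1] / Xc[:, 2], Xc[:, 2]
    s = np.column_stack([x, y]).ravel()
    L = np.zeros((2 * len(Z), 6))
    L[0::2] = np.column_stack([-1 / Z, 0 * Z, x / Z, x * y, -(1 + x**2), y])
    L[1::2] = np.column_stack([0 * Z, -1 / Z, y / Z, 1 + y**2, -x * y, -x])
    return s, L


def vvs_pose(s_star, X_obj, T0, gain=0.5, tol=1e-10, max_iter=500):
    """Move a virtual camera until the projected model matches s_star."""
    T = T0.copy()
    for _ in range(max_iter):
        s, L = point_features(T, X_obj)
        v = -gain * np.linalg.pinv(L) @ (s - s_star)  # assumed control law
        if np.linalg.norm(v) < tol:
            break
        # A camera twist v moves the scene by exp(-v) in the camera frame.
        T = exp_se3(-v[:3], -v[3:]) @ T
    return T


# Synthetic example: the 8 corners of a cube (illustrative values).
X_obj = 0.1 * np.array([[i, j, k] for i in (-1, 1) for j in (-1, 1) for k in (-1, 1)], dtype=float)
T_true = exp_se3(np.zeros(3), np.array([0.1, -0.2, 0.15]))
T_true[:3, 3] = [0.05, -0.03, 1.0]
s_star, _ = point_features(T_true, X_obj)
T0 = np.eye(4)
T0[2, 3] = 1.3
T_est = vvs_pose(s_star, X_obj, T0)
```

Lines and ellipses would be handled in the same loop by appending their feature values to `s` and their interaction matrices to `L`, which is where the redundancy of heterogeneous features is exploited.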

Further Readings : Pose estimation with multiple cues
We close the chapter by touching upon the difficult problem of the pose estimation and tracking in a complex and unknown environment.In many situations, like for assistance domestic environments, outdoor navigation or tracking inside the human body, the observed scene is unstructured and the background is not uniform nor constant.It is then not a trivial task to detect object of interests or patterns, since brightness and colour are changing and the background moves due to human displacements, the wind or the breathing or heart beating in the third case, leading to shadows, occlusions or specularities.Some rather recent works integrate multiple visual cues like colour (Vincze, 2005), texture (Pressigout, 2006) or global&local descriptions (Kragic, 2002) or a learning stage (Vacchetti, 2004) to improve the tracking process.
The Vision for Robotics (V4R) software package proposed by Vincze et al. (Vincze, 2005) integrates multiple cues like edge gradient, colour, intensity, topological interrelations among features and pose from preceding frames to provide an efficient visual modelbased tracking tool in realistic and unconstrained environments.These cues are derived not only from the images but also from object parameters (the model) and pose information stored within previous tracking cycles.A fourstage tracking scheme is designed, from the 2-D feature extraction to the pose computation and validation.
For domestic environment (living room), Kragic & Christensen propose to use both the appearance and geometrical models to estimate the position and orientation of the object relative to the camera/robot coordinate system.Following a threestage strategy (initialization, pose estimation, tracking), this system may be seen as a coarsetofine tracking.
The initialization step provides an approximation to the current object pose by means of Principal Component Analysis. Once the model is found, a local fitting is used to extract linear primitives from which the pose is computed. Finally, if the object or the eye-in-hand robot starts to move, Drummond's method (Drummond & Cipolla, 2000) is adopted to provide a real-time estimate of the object pose.
Vacchetti et al. (Vacchetti, 2004) have formulated the tracking problem using a single camera in terms of local bundle adjustment and have developed an image correspondence method that can handle both short- and wide-baseline matching. The video information of a very limited number of reference images (keyframes) created during a training stage is merged with that of preceding frames during the tracking. The tracking process needs a 3-D model of the object, which can be any object representable by a 3-D mesh. Keyframes then serve to register an incoming image by means of a set of extracted corners, through the minimization of the reprojection error with the Tukey M-estimator. The reported results demonstrate very good robustness with respect to aspect changes, model inaccuracies, partial occlusions, focal length and scale changes, and illumination changes.
Pressigout and Marchand (Pressigout, 2006) propose a real-time hybrid 3-D tracking with the integration of texture information in a nonlinear edge-based pose estimation algorithm. Pose and camera displacements are formulated in terms of a full-scale nonlinear optimization instead of a point-of-interest-based pose estimation. In particular, the camera displacement estimation is based on intensity matching between two images. A nonlinear criterion based on the intensity mapping error is defined for that purpose, from which an interaction matrix is derived.
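The Tukey M-estimator mentioned above for the minimization of the reprojection error can be illustrated with an iteratively reweighted least-squares loop. The sketch below is a generic linear example (a line fit contaminated by gross outliers), not the bundle-adjustment formulation of Vacchetti et al. The tuning constant 4.685, the scale estimate based on the median absolute residual and the function names are common but assumed choices.

```python
import numpy as np

def tukey_weights(scaled_residuals, c=4.685):
    """Tukey biweight: smooth down-weighting, exactly zero beyond the cut-off c."""
    r = np.abs(scaled_residuals) / c
    return np.where(r < 1.0, (1.0 - r**2) ** 2, 0.0)

def robust_scale(residuals):
    """Robust standard-deviation estimate from the median absolute residual."""
    return 1.4826 * np.median(np.abs(residuals)) + 1e-12

def irls(A, b, n_iter=20):
    """Solve A x ~ b by iteratively reweighted least squares with Tukey weights."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]          # ordinary least-squares start
    for _ in range(n_iter):
        r = b - A @ x
        w = tukey_weights(r / robust_scale(r))
        sw = np.sqrt(w)                               # weight rows by sqrt(w)
        x = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]
    return x

# Synthetic data: the line b = 2 t + 1 with small deterministic perturbations
# and five gross outliers (hypothetical values chosen for illustration).
t = np.linspace(0.0, 1.0, 50)
b = 2.0 * t + 1.0 + 0.01 * np.sin(37.0 * t)
b[::10] += 5.0
A = np.column_stack([t, np.ones_like(t)])

plain = np.linalg.lstsq(A, b, rcond=None)[0]          # biased by the outliers
robust = irls(A, b)                                   # outliers receive zero weight
```

In a pose estimation setting the same weights would multiply the rows of the linearized reprojection error at each iteration, so that mismatched corners or occluded features stop influencing the estimate once their residuals exceed the cut-off.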

Conclusion
In this chapter, we have addressed some issues in pose estimation with geometrical features and a model-based approach in the context of monocular vision. While 3-D tracking and estimation may be performed with optimal estimators (the Kalman filter and its extended/nonlinear versions, or the particle filter, also referred to as the sequential Monte Carlo method), their system models and state vectors require the recovery of the pose parameters or of the 3-D motion from the motion field. Pose determination is needed for applications with highly accurate 3-D positioning requirements, when occlusions, shadows or abrupt motions have to be handled. To this end, several geometrical feature-based approaches have been reviewed for solving the pose with various constrained degrees of freedom.
Cue integration leads to robustness and automatic measurement of scene complexity. This is of prime importance to exploit the video information captured in a complex environment with dynamical changes. Composite features, colour, edges and texture integration are some additional data which have brought significant improvements of the tracker's behaviour thanks to robust estimators and M-estimators. This is a key factor of success for a vision-based module with some autonomous capabilities, hence for the achievement of some vision-based (semi-)autonomous tasks.

Figure 1. The 2D/3D rigid registration as an alignment technique from a subset of model features (in green) to a subset of visible image features

Figure 2. (a) The geometric representation of a 3-D line with the Plücker coordinates = (r, ). (b) A back-projection of an image line does not completely define the 3-D line from which it came

Figure 4. Collinear points (centroids of blue markers) stuck on a metallic and cylindrical surface

(a) A sphere, its contour generator ( ) and its apparent contour ( ) in the image plane, which is a conic. (b) Each detected ellipse is represented by a matrix E* related to the 3-D position of the sphere centre

Windowing techniques are used to define the search space of the area of interest (the closest red ball). The diagonal of the inner blue square of that area corresponds to the estimated depth, whereas the blue cross corresponds to the position of the centre.

Figure 6. Mean position errors with Teramoto's method (red) and with the modified version (blue) with respect to varying noise level

Figure 11. The pose estimation as a Virtual Visual Servoing process with multiple geometric features (apparent lines, markers and a circular needle). (a) The blue lines are the projections with the initial virtual camera position. (b-c) The projections when the error vector (s - s*) tends to 0