The Use of Contour, Shape and Form in an Integrated Neural Approach for Object Recognition



Introduction
How humans recognise objects is still an open research field. In general, however, there is agreement that humans recognise objects according to, among others, the similarity principle of the Gestalt theory of visual perception, which states that things which share visual characteristics such as contour, shape, form, size, colour, texture, value or orientation will be seen as belonging together (Ellis, 1950). This principle applies to human operators: for instance, when an operator is given the task of picking up a specific object from a set of similar objects, the first approaching action will probably be guided solely by visual cues such as shape similarity. If further information is given (e.g. the type of surface), then a finer clustering can be accomplished to identify the target object.
The task described above can also be accomplished by automated systems such as industrial robots, which can benefit from the integration of a robust invariant object recognition capability that follows the above assumptions and uses image features from the object's contour (boundary object information), its shape (i.e. type of curvature or topographical surface information) and its form (depth information). These features can be concatenated to form an invariant vector descriptor which can be mapped onto specific objects using Artificial Intelligence schemes such as an Artificial Neural Network (ANN). Previous work demonstrated the feasibility of learning and recognising multiple 3D working pieces using their contour from 2D images and a vector descriptor called the Boundary Object Function (BOF) (Peña-Cabrera, et al., 2005). The BOF exhibited invariance with different geometrical pieces, but did not consider surface topographical information. To overcome this limitation and obtain a more robust descriptor, a methodology is presented that includes a shape index obtained with the Shape From Shading (SFS) method (Horn, 1970) as well as the depth information coming from a stereo vision system. The main idea of the approach is to concatenate three vectors (BOF+SFS+DI) so that not only the contour but also the object's curvature information (shape) and form are taken into account by the ANN.
In this article, after presenting related work in Section 2 and original work in Section 3, the contour vector description (BOF), the SFS vector and the stereo disparity map (Depth) are explained in Sections 4, 5 and 6 respectively. A description of the learning algorithm using the FuzzyARTMAP ANN is given in Section 7, followed by Section 8, which describes the results of the proposed integrated approach. Finally, conclusions and future work are described in Section 9.

Related work
Some authors have contributed techniques for invariant pattern classification using classical methods such as invariant moments (Hu, 1962), or artificial intelligence techniques, as used by Cem Yüceer & Kemal Oflazer (Yüceer and Oflazer, 1993), who describe a hybrid pattern classification system based on a pattern pre-processor and an ANN invariant to rotation, scaling and translation. Stavros J. & Paulo Lisboa developed a method to reduce and control the number of weights of a third-order network using moment classifiers (Stavros and Lisboa, 1992), and Shingchern D. You & G. Ford proposed a network for invariant recognition of objects in binary images using four sub-networks (Shingchern and Ford, 1994). Montenegro used the Hough transform to invariantly recognise rectangular objects (chocolates) including simple defects (Montenegro, 2006). This was achieved by using the polar properties of the Hough transform, with the Euclidean distance used to classify the descriptive vector. This method proved robust with geometric figures; however, for complex objects it would require more information from other techniques, such as histogram information or images taken with different illumination sources and levels. Gonzalez et al. used a Fourier descriptor, which obtains image features through silhouettes from 3D objects (Gonzalez Garcia, et al., 2004). Their method is based on the extraction of silhouettes from 3D images obtained by laser scanning, which increases recognition times. Another interesting method for 2D invariant object representation is the compactness measure of a shape, sometimes called the shape factor, which is a numerical quantity representing the degree to which a shape is compact. Relevant work in this area within the theory of shape numbers was proposed by Bribiesca and Guzman (Bribiesca and Guzman, 1980).
Worthington studied topographical information from image intensity data in grey scale using the Shape from Shading (SFS) algorithm (Worthington and Hancock, 2001). This information is used for object recognition, on the premise that the shape index can be used to recognise objects based on surface curvature. Two attributes were used: one was based on low-level information using a curvature histogram, and the other was based on the structural arrangement of the shape index maximal patches and their attributes in the associated region.
Lowe defined a descriptor vector named SIFT (Scale Invariant Feature Transform), an algorithm that detects distinctive image points and calculates their descriptors based on histograms of the orientation around the key points found (Lowe, 2004). The extracted points are invariant to scale and rotation as well as to changes in illumination source and level. These points are located at the maxima and minima of a difference of Gaussians applied in scale space. The algorithm is very effective, but the processing time is relatively high and, furthermore, the working pieces must have a rich texture.

Original work
Classic algorithms such as moment invariants are popular descriptors for image regions and boundary segments; however, computing the moments of a 2D image with a direct method involves a significant number of multiplications and additions. In many real-time industrial applications the speed of computation is very important; the 2D moment computation is intensive and involves parallel processing, which can become the bottleneck of the system when moments are used as major features. In addition to this limitation, observing only the piece's contour is not enough to recognise an object, since different objects with the same contour can still be confused.
To cope with this limitation, the main contribution of this paper is a novel method that combines a parameter describing the piece's contour (BOF), the shape of the object's curvature (SFS) and the depth information from the stereo disparity map (Depth).
The BOF algorithm determines the distance from the centroid to the object's perimeter and the SFS calculates the surface curvature from the way that light is reflected on the parts, whereas the depth information is useful to differentiate similar objects with different heights. These features (contour, form and depth) are concatenated to form an invariant vector descriptor which is the input to an Artificial Neural Network (ANN).

Object's contour
As mentioned earlier, the Boundary Object Function (BOF) method considers only the object's contour to recognise different objects. It is very important to obtain, as accurately as possible, metric properties such as area, perimeter, centroid, and the distance from the centroid to the points on the contour of the object. This section describes the BOF method.

Metric properties
The metric properties for the algorithm are based on the Euclidean distance between two points in the image plane. The first step is to find the object in the image by performing a pixel-level scan from top to bottom (first criterion) and left to right (second criterion). For instance, if an object in the image is higher than the others, this object will be considered first. If all objects are at the same height, then the second criterion applies and the selected object will be the one located furthest to the left.

Perimeter
The perimeter is defined as the set of points that make up the shape of the object; in discrete form it is the sum of all pixels that lie on the contour, which can be expressed as:
$$P = \sum_{(x,y)\,\in\,\text{contour}} 1 \qquad (1)$$
Equation (1) shows how to calculate the perimeter; the problem lies in finding which pixels in the image belong to the perimeter. For searching purposes, the system calculates the perimeter by counting the points around a piece, grouping the X and Y coordinates of the perimeter points of the measured piece in clockwise direction.
The perimeter calculation for every piece in the Region of Interest (ROI) is performed after binarization. The search always proceeds, as mentioned earlier, from top to bottom and left to right. Once a white pixel is found, the whole perimeter is calculated with a search function, as shown in Figure 1. The following definitions are useful to understand the algorithm:

- A nearer pixel to the boundary is any pixel surrounded mostly by black pixels in 8-connectivity.
- A farther pixel to the boundary is any pixel that is not surrounded by black pixels in 8-connectivity.
- The highest and lowest coordinates are the ones that define a rectangle (bounding box).

Once it has found a white pixel, the search algorithm executes the following steps:

1. Search for the nearer pixel to the boundary that has not already been located.
2. Assign the label of current pixel to the nearer pixel to the boundary just found.
3. Mark the previous pixel as visited.
4. If the new coordinates are higher than the last highest coordinates, assign the new values to the highest coordinates.
5. If the new coordinates are lower than the last lowest coordinates, assign the new values to the lowest coordinates.
6. Repeat steps 1 to 5 until the procedure returns to the initial point or no other nearer pixel to the boundary is found.

This technique traces the outline of any irregular shape very quickly, because it does not process pixels of the image that lie away from the contour.
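As a rough illustration of the definitions above, the following sketch finds the first white pixel, marks the boundary pixels of a binarized image and accumulates the bounding box. It is not the tracing routine described in the text: instead of following the contour from the first white pixel, it tests every white pixel for a black 8-neighbour, which is simpler but visits the whole image. The function names and the list-of-lists image format are assumptions made for the example.

```python
def first_white_pixel(img):
    """Scan top to bottom, then left to right; return (row, col) or None."""
    for r, row in enumerate(img):
        for c, value in enumerate(row):
            if value:
                return (r, c)
    return None


def is_boundary(img, r, c):
    """True if white pixel (r, c) has a black or out-of-image 8-neighbour."""
    h, w = len(img), len(img[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            rr, cc = r + dr, c + dc
            if not (0 <= rr < h and 0 <= cc < w) or not img[rr][cc]:
                return True
    return False


def boundary_pixels(img):
    """Simplification (assumption): test every white pixel instead of tracing."""
    return [(r, c)
            for r, row in enumerate(img)
            for c, value in enumerate(row)
            if value and is_boundary(img, r, c)]


def bounding_box(points):
    """Return (min_row, min_col, max_row, max_col) of a list of pixels."""
    rows = [p[0] for p in points]
    cols = [p[1] for p in points]
    return (min(rows), min(cols), max(rows), max(cols))


# 5x5 binary image with a 3x3 white square in the middle.
image = [[0, 0, 0, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 0, 0, 0]]
contour = boundary_pixels(image)
print(first_white_pixel(image), len(contour), bounding_box(contour))
```

In this sketch the perimeter is simply the number of boundary pixels found; a contour-following implementation would additionally return them in clockwise order.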

Area
The area of an object is defined as the space enclosed by a region, in other words, the sum of all pixels that form the object, which can be defined by equation (2):
$$A = \sum_{(x,y)\,\in\,\text{object}} 1 \qquad (2)$$

Centroid
The centre of mass of an arbitrary shape is a pair of coordinates (Xc, Yc) at which all its mass is considered to be concentrated and on which the resultant of all forces acts. In other words, it is the point where a single support can balance the object. Mathematically, in the discrete domain, the centroid is defined as:
$$X_c = \frac{1}{A}\sum_{(x,y)\,\in\,\text{object}} x, \qquad Y_c = \frac{1}{A}\sum_{(x,y)\,\in\,\text{object}} y \qquad (3)$$
where A is obtained from eq. (2).
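The area and centroid definitions can be illustrated with a few lines of code. This is a minimal sketch that assumes the binarized image is a list of rows in which object pixels are non-zero; the function name `area_and_centroid` is chosen for the example only.

```python
def area_and_centroid(img):
    """Return (A, (Xc, Yc)) for the non-zero pixels of a binary image."""
    area = 0
    sum_x = 0
    sum_y = 0
    for y, row in enumerate(img):
        for x, value in enumerate(row):
            if value:
                area += 1      # each object pixel contributes 1 to the area
                sum_x += x
                sum_y += y
    if area == 0:
        raise ValueError("image contains no object pixels")
    return area, (sum_x / area, sum_y / area)


# A 2x4 white rectangle inside a 4x6 image.
image = [[0, 0, 0, 0, 0, 0],
         [0, 1, 1, 1, 1, 0],
         [0, 1, 1, 1, 1, 0],
         [0, 0, 0, 0, 0, 0]]
area, (xc, yc) = area_and_centroid(image)
print(area, xc, yc)
```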

Generation of descriptive vector (BOF)
The generation of the descriptive vector called the Boundary Object Function (BOF) is based on the Euclidean distance between the object's centroid and the contour. If we assume that P1(X1, Y1) are the centroid coordinates (XC, YC) and P2(X2, Y2) is a point on the perimeter, then this distance is determined by the following equation:
$$d(P_1, P_2) = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} \qquad (4)$$
The descriptive vector (BOF) in 2D contains the distance calculated in eq. (4) for the whole object's contour. The vector is composed of 180 elements, where each element represents the distance collected every two degrees. The vector is normalized by dividing all its elements by the element with maximum value. Figure 2 shows an example where the object is a triangle. In general, the starting point for the vector generation is crucial, so the following rules apply: the first step is to find the longest line passing through the centre of the piece; as shown in Figure 2(a), there are several lines. The longest line is taken and divided in two, taking the centre of the object as reference. The longer half of this line is then taken, as shown in Figure 2(b), and used as the starting point for the BOF vector descriptor generation, as shown in Figure 2(c). The object's pattern representation is depicted in Figure 2(d).
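The construction of the BOF vector can be sketched as follows. The code assumes the contour is available as a list of (x, y) points and the centroid is known; it bins the contour by angle in steps of two degrees, keeps the largest centroid-to-contour distance per bin and normalizes by the maximum. Two details are assumptions of this sketch rather than part of the method as described: empty angular bins copy the previous bin, and the starting point is approximated by the bin with the largest distance instead of an explicit search for the longest line through the centre.

```python
import math


def bof_vector(contour, centroid, n_bins=180):
    """Return n_bins normalised centroid-to-contour distances (2 degrees per bin)."""
    xc, yc = centroid
    step = 360.0 / n_bins
    bins = [None] * n_bins
    for x, y in contour:
        dist = math.hypot(x - xc, y - yc)  # Euclidean distance, eq. (4)
        angle = math.degrees(math.atan2(y - yc, x - xc)) % 360.0
        k = int(angle // step) % n_bins
        if bins[k] is None or dist > bins[k]:
            bins[k] = dist
    if all(b is None for b in bins):
        raise ValueError("empty contour")
    # Assumption: a bin without any contour sample copies the previous bin.
    for k in range(n_bins):
        if bins[k] is None:
            j = k - 1
            while bins[j % n_bins] is None:
                j -= 1
            bins[k] = bins[j % n_bins]
    # Assumption: the largest distance stands in for the "longest half line".
    start = bins.index(max(bins))
    ordered = bins[start:] + bins[:start]
    peak = ordered[0]
    if peak == 0:
        raise ValueError("contour collapses onto the centroid")
    return [d / peak for d in ordered]


# Elliptical contour sampled every degree (semi-axes 20 and 10).
ellipse = [(20 * math.cos(math.radians(t + 0.5)),
            10 * math.sin(math.radians(t + 0.5))) for t in range(360)]
bof = bof_vector(ellipse, (0.0, 0.0))
print(len(bof), bof[0], round(min(bof), 3))
```

Starting the vector at a shape-dependent point is what makes the descriptor independent of the in-plane rotation of the piece, and dividing by the maximum removes the dependence on scale.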

Object's form
The use of shading is taught in art class as an important cue to convey 3D shape in a 2D image. Smooth objects, such as an apple, often present a highlight at points where the ray received from the light source makes equal angles with the reflection toward the viewer. At the same time, smooth objects get increasingly darker as the surface normal becomes perpendicular to the rays of illumination. Planar surfaces tend to have a homogeneous appearance in the image, with intensity proportional to the angle between the normal to the plane and the rays of illumination. In other words, Shape From Shading (SFS) is the process of obtaining the three-dimensional surface shape from the reflection of light in a greyscale image. It consists primarily of obtaining the orientation of the surface from local variations in the brightness reflected by the object, and the intensities of the greyscale image are taken as a topographic surface.
In the 70's, Horn formulated the Shape From Shading problem as finding the solution of the brightness (reflectance) equation, seeking a single solution (Horn, 1970). Today, Shape from Shading is known to be an ill-posed problem, as mentioned by Brooks: it causes ambiguity between concave and convex surfaces, which is due to changes in lighting parameters (Brooks, 1983). To solve the problem, it is important to study how the image is formed, as mentioned by Zhang (Zhang, et al., 1999). A simple model of image formation is the Lambertian model, where the grey value of the pixels in the image depends on the direction of light and the surface normal. So, if we assume Lambertian reflection, the brightness can be described as a function of the object surface and the direction of light, and the problem becomes a little simpler.
The algorithm consists of finding the gradient of the surface to determine the normals. The gradient is perpendicular to the normals and appears in the reflectance cone whose centre is given by the direction of light. A smoothing operation is performed so that the normal directions of local regions are not very uneven. After smoothing, some normals still lie outside the reflectance cone, so it is necessary to rotate them back onto the cone. This is an iterative process that finally yields the type of local surface curvature.
The procedure is as follows. First, the light reflectance E at (i, j) is calculated using the expression:
$$E(i,j) = n^{k}_{i,j} \cdot S \qquad (5)$$
where S is the unit vector for the light direction and $n^{k}_{i,j}$ is the normal estimate in the k-th iteration. The reflectance equation of the image defines a cone of possible normal directions to the surface, as shown in Figure 3, where the reflectance cone has an angle of $\cos^{-1}(E(i,j))$. If the normals satisfy the recovered reflectance equation of the image, then these normals must fall on their respective reflectance cones.
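The relation between a normal, the light direction and the reflectance cone can be checked numerically. The sketch below assumes Lambertian reflection with unit albedo, so that the reflectance is the dot product of the unit normal and the unit light direction; the helper names are illustrative.

```python
import math


def normalise(v):
    """Scale a 3-vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0:
        raise ValueError("zero-length vector")
    return tuple(c / length for c in v)


def reflectance(normal, light):
    """Lambertian reflectance E = n . S for unit vectors n and S."""
    n = normalise(normal)
    s = normalise(light)
    return sum(a * b for a, b in zip(n, s))


def cone_angle(e):
    """Half-angle of the reflectance cone, acos(E), in radians."""
    return math.acos(max(-1.0, min(1.0, e)))


light = (0.0, 0.0, 1.0)     # light along the viewing axis (assumption)
tilted = (1.0, 0.0, 1.0)    # normal tilted 45 degrees away from the light
e = reflectance(tilted, light)
print(round(e, 4), round(math.degrees(cone_angle(e)), 1))
```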

Image's gradient
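The first step is to calculate the surface normals, which are obtained from the gradient of the image (I), as shown in equation (6):
$$\nabla I = \begin{bmatrix} p & q \end{bmatrix}^{T} = \begin{bmatrix} \dfrac{\partial I}{\partial x} & \dfrac{\partial I}{\partial y} \end{bmatrix}^{T} \qquad (6)$$
where p and q are the gradient components, which are computed with the Sobel operators.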

Normals
Since the normal is perpendicular to the surface tangents, it can be found from their cross product, which is parallel to (-p, -q, 1)^T. The normal can then be written as:
$$n = \frac{1}{\sqrt{p^{2} + q^{2} + 1}}\,(-p,\; -q,\; 1)^{T}$$
assuming that the z component of the normal to the surface is positive.
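A minimal sketch of these two steps is given below: the gradient components p and q are estimated with 3x3 Sobel kernels at an interior pixel, and the unit normal is built from (-p, -q, 1). Dividing the Sobel response by 8 so that it approximates a derivative per pixel is an assumption of this example, as are the function names.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_gradient(img, y, x):
    """Return (p, q) at interior pixel (y, x) of a greyscale image."""
    p = 0.0
    q = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            value = img[y + dy][x + dx]
            p += SOBEL_X[dy + 1][dx + 1] * value
            q += SOBEL_Y[dy + 1][dx + 1] * value
    # Assumption: scale by 1/8 so a unit ramp gives a gradient of 1.
    return p / 8.0, q / 8.0


def surface_normal(p, q):
    """Unit normal parallel to (-p, -q, 1), with positive z component."""
    norm = math.sqrt(p * p + q * q + 1.0)
    return (-p / norm, -q / norm, 1.0 / norm)


# Intensity ramp that grows by 2 grey levels per pixel along x.
ramp = [[2 * x for x in range(4)] for _ in range(3)]
p, q = sobel_gradient(ramp, 1, 1)
print(p, q, surface_normal(p, q))
```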

Smoothness and rotation
Smoothing, in a few words, can be described as avoiding abrupt changes between a normal and its neighbours. The Sigmoidal Smoothness Constraint imposes the smoothness (regularization) restriction, forcing the brightness error to satisfy the rotation matrix θ and deterring sudden changes in the direction of the normal across the surface.
Once the normals have been smoothed, the next step is to rotate them so that they lie on the reflectance cone, as shown in Figure 4. The rotation takes the smoothed normals, that is, the normals after smoothing and before rotation, and turns them by θ degrees onto the cone. The smoothing and rotation of the normals involve several iterations, indexed by the letter k.
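The rotation step can be illustrated with a small geometric sketch. It assumes that a smoothed normal is moved onto the cone by the smallest possible rotation, that is, within the plane spanned by the normal and the light direction; the text does not state how the rotation is chosen, so this choice and the function names are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    length = math.sqrt(dot(v, v))
    if length == 0:
        raise ValueError("zero-length vector")
    return tuple(c / length for c in v)


def rotate_onto_cone(normal, light, e):
    """Return the unit vector nearest to `normal` at angle acos(e) from `light`."""
    s = normalise(light)
    n = normalise(normal)
    along = dot(n, s)
    perp = tuple(nc - along * sc for nc, sc in zip(n, s))
    if math.sqrt(dot(perp, perp)) < 1e-12:
        # Normal parallel to the light: every direction on the cone is equally
        # close, so pick an arbitrary perpendicular direction (assumption).
        axis = (1.0, 0.0, 0.0) if abs(s[0]) < 0.9 else (0.0, 1.0, 0.0)
        perp = tuple(ac - dot(axis, s) * sc for ac, sc in zip(axis, s))
    u = normalise(perp)
    theta = math.acos(max(-1.0, min(1.0, e)))
    return tuple(math.cos(theta) * sc + math.sin(theta) * uc
                 for sc, uc in zip(s, u))


light = (0.0, 0.0, 1.0)
smoothed = (1.0, 0.0, 0.0)          # smoothed normal lying off the cone
rotated = rotate_onto_cone(smoothed, light, 0.5)
print(tuple(round(c, 4) for c in rotated))
```

After the rotation, the dot product of the new normal with the light direction equals the measured brightness again, which is the condition for lying on the reflectance cone.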

Shape index
Koenderink divided the shape index into different regions depending on the type of curvature, which is obtained from the eigenvalues of the Hessian matrix, denoted K1 and K2, as given by the following equation (Koenderink & Van Doorn, 1992):
$$\phi = \frac{2}{\pi}\arctan\!\left(\frac{K_2 + K_1}{K_2 - K_1}\right), \qquad K_1 \ge K_2$$
The shape index φ takes values in the range [-1, 1], which can be classified, according to Koenderink, depending on the local topography, as shown in Table 1.

Table 1. Classification of the Shape Index

Figure 5 shows the local surface form depending on the value of the Shape Index, and Figure 6 shows an example of the SFS vector from a rectangular piece used during experiments.
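The shape index computation can be sketched from the eigenvalues of a 2x2 symmetric Hessian. The code below uses the standard form of Koenderink's index, φ = (2/π) arctan((K2 + K1)/(K2 − K1)) with K1 ≥ K2; the handling of the equal-eigenvalue case by its limit value, and returning `None` for a planar point, are choices made for this example.

```python
import math


def hessian_eigenvalues(hxx, hxy, hyy):
    """Eigenvalues (k1 >= k2) of the symmetric matrix [[hxx, hxy], [hxy, hyy]]."""
    mean = (hxx + hyy) / 2.0
    radius = math.sqrt(((hxx - hyy) / 2.0) ** 2 + hxy ** 2)
    return mean + radius, mean - radius


def shape_index(k1, k2):
    """Shape index in [-1, 1]; None where both curvatures vanish (planar)."""
    if k1 < k2:
        k1, k2 = k2, k1
    if k1 == k2:
        if k1 == 0:
            return None
        # Limit of the arctan expression as k1 approaches k2.
        return -1.0 if k1 > 0 else 1.0
    return (2.0 / math.pi) * math.atan((k2 + k1) / (k2 - k1))


k1, k2 = hessian_eigenvalues(2.0, 0.0, -2.0)   # symmetric saddle
print(k1, k2, shape_index(k1, k2))
```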

Histogram of disparity map (depth)
With binocular vision, the vision system is able to interact with a three-dimensional world, coping with volume and distance within the environment. Due to the separation between the two cameras, two images are obtained with small differences between them; such differences are called disparities and form a so-called disparity map. The epipolar geometry describes the geometric relationships between the images formed in two or more cameras focused on a point or pole.
The most important elements of this geometric system, as illustrated in Figure 7, are: the epipolar plane, defined by the pole (P) and the two optical centres (O and O') of the two cameras; the epipoles (e and e'), which are the virtual images of the optical centres (O and O'); the baseline, which joins the two optical centres; and the epipolar lines (l and l'), formed by the intersection of the epipolar plane with both images (ILEFT and IRIGHT), which connect the epipoles with the images of the observed point (p, p').
The epipolar line is crucial in stereoscopic vision, because one of the most difficult parts of stereoscopic analysis is to establish the correspondence between the two images (stereo matching), that is, deciding which point in the right image corresponds to which point in the left.
The epipolar constraint narrows the search for stereoscopic correspondence from a two-dimensional search (the whole image) to a one-dimensional search along the epipolar line. One way to further simplify the calculations associated with stereoscopic algorithms is to use rectified images; that is, to replace the images by their equivalent projections on a common plane parallel to the baseline. With a suitable choice of coordinate system, the rectified epipolar lines are parallel to the baseline and coincide with single scan lines of the images.
In the case of rectified images, given two points p and p' located on the same scan line of the left and right images, with coordinates (u, v) and (u', v'), the disparity is given as the difference d = u' − u. If B is the distance between the optical centres, also known as the baseline, it can be shown that the depth of P is z = −B / d.
www.intechopen.com Advances in Object Recognition Systems

Stereoscopic matching algorithms
Stereoscopic matching algorithms reproduce the human stereopsis process so that a machine, for instance a robot, can perceive the depth of each point in the observed scene and is thus able to manipulate objects, avoid them, or recreate three-dimensional models. For a pair of stereoscopic images, the main goal of these algorithms is to find, for each pixel in one image, its corresponding pixel in the other image (matching), in order to obtain a disparity map that contains the position difference of each pixel between the two images, from which the depth map can be derived (depth is inversely proportional to disparity). To determine the actual depth of the scene it is necessary to take into account the geometry of the stereoscopic system to obtain a metric map. As matching a single pixel is almost impossible, each pixel is represented by a small region that contains it, a so-called correlation window, and the correlation between the windows of one image and the other is computed using the colour of the pixels within them. Once the disparity map is obtained, the histogram of this map within the region of interest is taken as the depth descriptor.
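Window-based matching on rectified images can be sketched in a few functions. The example below is an assumption-laden illustration rather than the matching algorithm used in the experiments: it works on grey levels instead of colour, uses the sum of absolute differences as the window similarity, searches only towards the left of the right image (so that d = u' − u is zero or negative), and leaves border pixels at zero disparity. The 16-bin histogram size follows the experimental section; the binning rule itself is an assumption.

```python
def window_cost(left, right, v, u, u_right, half):
    """Sum of absolute differences between two (2*half+1)^2 windows."""
    return sum(abs(left[v + dv][u + du] - right[v + dv][u_right + du])
               for dv in range(-half, half + 1)
               for du in range(-half, half + 1))


def disparity_map(left, right, max_disp, half=1):
    """Disparity d = u' - u for each left pixel; border pixels stay at 0."""
    height, width = len(left), len(left[0])
    disp = [[0] * width for _ in range(height)]
    for v in range(half, height - half):
        for u in range(half, width - half):
            best_d, best_cost = 0, None
            for shift in range(max_disp + 1):
                u_right = u - shift          # search along the same scan line
                if u_right - half < 0:
                    break
                cost = window_cost(left, right, v, u, u_right, half)
                if best_cost is None or cost < best_cost:
                    best_d, best_cost = -shift, cost
            disp[v][u] = best_d
    return disp


def disparity_histogram(disp, max_disp, n_bins=16):
    """Histogram of disparity magnitudes (assumed binning rule)."""
    counts = [0] * n_bins
    for row in disp:
        for d in row:
            k = min(n_bins - 1, abs(d) * n_bins // (max_disp + 1))
            counts[k] += 1
    return counts


# Synthetic pair: the right image is the left image shifted 2 pixels to the left.
left = [[u * u + v for u in range(10)] for v in range(5)]
right = [[row[u + 2] if u + 2 < len(row) else 0 for u in range(len(row))]
         for row in left]
disp = disparity_map(left, right, max_disp=4)
hist = disparity_histogram(disp, max_disp=4)
print(disp[2][5], hist)
```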

Learning and recognition
The selection of the ANN for this purpose was based on previous results in which the convergence time of several ANN architectures was evaluated during recognition tasks of simple geometrical parts. The assessed networks were Backpropagation, Perceptron and Fuzzy ARTMAP using the BOF vector. Results showed that the FuzzyARTMAP network outperformed the other networks, with lower training/testing times (0.838 ms/0.0722 ms) compared with Perceptron (5.78 ms/0.159 ms) and Backpropagation (367.577 ms/0.217 ms) (Lopez-Juarez, et al., 2010).
In the Fuzzy ARTMAP (FAM) network there are two modules, ARTa and ARTb, and an inter-ART module, the "Map-field", that controls the learning of an associative map from ARTa recognition categories to ARTb categories (Carpenter and Grossberg, 1992). This is illustrated in Figure 8.
The Map-field module also controls the match tracking of the ARTa vigilance parameter. A mismatch at the Map-field between the ARTa category activated by input Ia and the ARTb category activated by input Ib increases the ARTa vigilance by the minimum amount needed for the system to search for and, if necessary, learn a new ARTa category whose prediction matches the ARTb category. The search initiated by the inter-ART reset can shift attention to a novel cluster of features that can be incorporated through learning into a new ARTa recognition category, which can then be linked to a new ART prediction via associative learning at the Map-field.
A vigilance parameter measures the difference allowed between the input data and the stored patterns. Therefore, this parameter affects the selectivity or granularity of the network prediction. For learning, the FuzzyARTMAP has four important factors: vigilance in the input module (ρa), vigilance in the output module (ρb), vigilance in the Map-field (ρab) and learning rate (β). For the specific case of the work presented in this article, the input information is concatenated and presented as a single input vector A, while the vector B receives the correspondence associated with the respective component during the training process.
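The role of the vigilance and learning-rate parameters can be illustrated with the standard Fuzzy ART match and update rules. This is only a sketch of a single category test, not a full FuzzyARTMAP implementation: complement coding of the input, the match value |I ∧ w| / |I| and the update w ← β(I ∧ w) + (1 − β)w are taken from the general Fuzzy ART formulation and are assumptions with respect to the system described here.

```python
def complement_code(a):
    """Complement coding: append 1 - a_i for every component a_i in [0, 1]."""
    return list(a) + [1.0 - x for x in a]


def fuzzy_and(x, y):
    """Component-wise minimum of two vectors."""
    return [min(p, q) for p, q in zip(x, y)]


def match_value(inp, weights):
    """Degree of match |I ^ w| / |I| using the L1 norm."""
    return sum(fuzzy_and(inp, weights)) / sum(inp)


def passes_vigilance(inp, weights, rho):
    """A category is accepted only if its match reaches the vigilance rho."""
    return match_value(inp, weights) >= rho


def learn(inp, weights, beta=1.0):
    """Weight update; beta = 1 corresponds to fast learning."""
    return [beta * m + (1.0 - beta) * w
            for m, w in zip(fuzzy_and(inp, weights), weights)]


pattern = complement_code([0.2, 0.8])
category = complement_code([0.4, 0.8])
print(round(match_value(pattern, category), 3),
      passes_vigilance(pattern, category, 0.95))
```

A higher vigilance makes the test stricter, so more categories are created and each one covers a narrower range of inputs.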

Experimental results
The experimental results were obtained using two sets of four 3D working pieces of different cross-section: square, triangle, cross and star. One set had its top surface rounded, so these pieces were referred to as being of rounded type. The other set had a flat top surface and was referred to as pyramidal type. The working pieces are shown in Figure 9. The object recognition experiments with the FuzzyARTMAP (FAM) neural network were carried out using the above working pieces. The network parameters were set for fast learning (β = 1) and a high vigilance parameter (ρab = 0.9). Four types of experiments were carried out. The first experiment considered only the BOF, taking data from the contour of the piece; the second considered information from the SFS algorithm, taking into account the reflectance of the light on the surface; and the third was performed using the depth information. The fourth experiment used the concatenated vector from the three object descriptors (BOF+SFS+Depth).
An example of how an object was coded using the three descriptors is shown in Figure 10. Two graphs are presented: the first corresponds to the descriptive vector of the Rounded-Square object and the second to the Pyramidal-Square object. The BOF descriptive vector is formed by the first 180 elements (observe that both patterns are very similar, since the object's cross-sectional shape is the same). Next, there are 175 elements corresponding to the SFS values (each of the 7 shape index values was repeated 25 times). The following 176 values correspond to the Depth information obtained from the Disparity Histogram, which contained 16 values that were repeated 11 times.
Several experiments were defined to test the invariant object recognition capability of the system. For these experiments, the FuzzyARTMAP network was trained with 3 patterns; the objects were placed in different orientations and locations within a defined working space of 20 cm x 27 cm, using different scales, and the slope of the plane was also modified.
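The layout of the concatenated descriptor described above (180 BOF elements, 7 SFS values repeated 25 times and 16 histogram values repeated 11 times, 531 elements in total) can be written down directly. The function names and the placeholder input values are hypothetical; only the segment lengths come from the text.

```python
def repeat_each(values, times):
    """Repeat every value `times` times, keeping the original order."""
    return [v for v in values for _ in range(times)]


def build_descriptor(bof, sfs, depth_hist):
    """Concatenate BOF (180), SFS (7 x 25) and Depth (16 x 11) segments."""
    if len(bof) != 180 or len(sfs) != 7 or len(depth_hist) != 16:
        raise ValueError("expected 180 BOF, 7 SFS and 16 depth values")
    return list(bof) + repeat_each(sfs, 25) + repeat_each(depth_hist, 11)


# Placeholder inputs, used only to show the segment boundaries.
bof = [1.0] * 180
sfs = [0.1 * i for i in range(7)]
depth_hist = [float(i) for i in range(16)]
descriptor = build_descriptor(bof, sfs, depth_hist)
print(len(descriptor))
```

Repeating the short SFS and Depth segments gives them a length comparable to the BOF segment, so that no single descriptor dominates the input vector by size alone.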
The overall results under the above conditions are illustrated in Figure 11. The first row corresponds to the recognition rates obtained using only the BOF, SFS and Depth vectors. A high recognition rate was observed; for instance, using only the BOF, the system was able to recognise 99.8% of the whole set of objects. The second row shows the recognition rate using the combined BOF+SFS and BOF+Depth vectors. It is important to notice that the recognition rate in both cases was lower than using the BOF vector alone (99.4% and 98.61%, respectively). In the last experiment, the complete concatenated BOF+SFS+Depth vector was used, achieving a 100% recognition rate while varying the scale by up to 20% and using a slope of 15°.

Conclusions
The research presented in this article provides an alternative methodology to integrate a robust invariant object recognition system using image features from the object's contour (boundary object information), its form (i.e. type of curvature or topographical surface information) and depth information from a stereo camera. The features can be concatenated to form an invariant vector descriptor which is the input to an Artificial Neural Network (ANN) for learning and recognition purposes.
Experimental results were obtained using two sets of four 3D working pieces of different cross-section: square, triangle, cross and star. One set had a rounded surface curvature and the other had a flat surface curvature, so that these objects were named pyramidal type. Using the BOF information and training the neural network with this vector, it was demonstrated that all pieces were recognised irrespective of their location and orientation within the viewable area. When information was concatenated (BOF + SFS and BOF + Depth), the robustness of the vision system decreased, since the recognition rate in both cases was lower than using the BOF vector alone (99.4% and 98.61% respectively). However, using the complete concatenated vector BOF+SFS+Depth achieved a 100% recognition rate, invariant to scale changes of up to 20% and to inclinations of the plane of up to 15°. Further tests were conducted, but the recognition was lower since, for instance, increasing the slope angle distorted the contour as detected by the BOF, making the recognition rate very sensitive.
Fig. 2. Example for the generation of the BOF vector: (a) the lines passing through the centre of the object; (b) the longest line is divided in two at the centre and its longer half is taken; (c) this half is the starting point for the BOF vector generation; (d) the resulting pattern representation of the object.

Fig. 3. Possible normal directions to the surface over the reflectance cone.

Fig. 5. Representation of local forms in the Shape Index classification. The shape index takes values in [-1, 1], which can be classified, according to Koenderink, depending on the local topography, as shown in Table 1.