Open access peer-reviewed chapter

Multi-Object Recognition Using a Feature Descriptor and Neural Classifier

Written By

Enrique Guzmán-Ramírez, Ayax García, Esteban Guerrero-Ramírez, Antonio Orantes-Molina, Oscar Ramírez-Cárdenas and Ignacio Arroyo-Fernández

Submitted: 14 June 2022 Reviewed: 26 July 2022 Published: 26 August 2022

DOI: 10.5772/intechopen.106754

From the Edited Volume

Vision Sensors - Recent Advances

Edited by Francisco Javier Gallegos-Funes


Abstract

In the field of object recognition, feature descriptors have proven able to provide accurate representations of objects, facilitating the recognition task. In this sense, Histograms of Oriented Gradients (HOG), a descriptor that uses this approach, together with Support Vector Machines (SVM), have proven to be successful human detection methods. In this paper, we propose a scheme consisting of an improved HOG and a classifier with a neural approach to produce a robust system for object recognition. The main contributions of this work are: First, we propose an improved gradient calculation that allows for better discrimination by the classifier, which consists of applying a threshold to both the magnitude and direction of the gradients. This improvement reduces the rate of false positives. Second, although HOG is particularly suited for human detection, we demonstrate that it can be used to represent different objects accurately, and even perform well in multi-class applications. Third, we show that a classifier that uses a neural approach is an excellent complement to a HOG-based feature extractor. Finally, experimental results on the well-known Caltech 101 dataset illustrate the benefits of the proposed scheme.

Keywords

  • multi-object recognition systems
  • object representation based on feature descriptor
  • histogram of oriented gradients
  • classifier with a neural approach

1. Introduction

Computer vision is a discipline through which a machine is enabled to recognize the world around it using visual perception, allowing it to deduce the structure and properties of a three-dimensional world from one or more two-dimensional images. In this respect, Forsyth-Ponce and Ballard-Brown argue that computer vision refers to the construction of explicit and meaningful descriptions of physical objects from images [1, 2]. That is, computer vision enables a machine to extract and analyze the spectral, spatial, and temporal information of the different objects contained within an image. Spectral information includes frequency (color) and intensity (grayscale); spatial information refers to aspects such as shape and position (one, two, and three dimensions); and temporal information comprises stationary aspects (presence and/or absence) and time-dependent aspects (events, movements, and processes).

Due to this ability, tasks such as failure detection [3, 4]; verification [5, 6]; identification [7, 8]; tracking analysis [9, 10]; and recognition [11, 12] can be performed by a computer vision system.

The object recognition task is of particular interest to this research, as it plays a significant role in a computer vision system and is even necessary to complete some of the tasks listed above. There are an increasing number of areas and/or applications requiring object recognition, e.g., fruit sorting [13], face detection [14], people detection [15], face recognition [16], object tracking [17], automatic traffic sign recognition [18], and vehicle license plate recognition [7], among many others. Gonzales and Woods define object recognition as the task of organizing input data into previously defined classes, using significant features extracted from objects that are immersed in an image containing irrelevant details [19]. Considering this definition, it is evident that both feature extraction and classification are extremely important for an object recognition task to achieve its aim. The feature extraction process applies operations to an image in order to obtain information describing the objects it contains. Moreover, this information should be able to discriminate between different object classes. The goal of this process is to improve the effectiveness and efficiency of the classification process [20]. The classification process uses the information generated by the feature extractor to perform the two phases comprising it: learning (which creates a bank of models) and recognition (responsible for determining which object from the bank of models is present in the analyzed image) [21].

The development of the work presented in this paper is motivated by the increasing demand for techniques and algorithms that efficiently perform tasks related to object recognition, as well as their implementation in real applications, both industrial and research, where a visual perception system is required. Considering the above, this work proposes the design and implementation of an object recognition system using methods based on a feature descriptor and a neural classifier. We will focus on a family of descriptors known as feature descriptors. In their original form, these descriptors characterize shape in 2D images as histograms of edge pixels; the HOG (Histograms of Oriented Gradients) algorithm belongs to this family [22]. Two practical objectives have been defined to meet the proposal: first, to study and develop an improved HOG algorithm that can accurately represent different objects; second, to demonstrate that the performance of a neural approach in the task of classifying and labeling the data generated by HOG is competitive with the techniques currently used.

The HOG method for object representation, proposed by Dalal and Triggs [23], describes the appearance of local regions (objects) within an image by means of the distribution of intensity gradients or edge directions. For this purpose, the HOG method applies a principle similar to that used by methods such as edge orientation histograms [24], shape contexts [25], and the scale-invariant feature transform (SIFT) [26]: it counts the number of occurrences of gradient orientations in specific portions of an image. It differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.

HOG has demonstrated that it is capable of generating representations that provide discriminative information from the objects in an image using normalized representations of objects. Since it operates on local cells, it is invariant to geometric and photometric transformations. For improved accuracy, the local histograms can be contrast-normalized by calculating a measure of the intensity across a larger region of the image, called a block, and then using this value to normalize all cells within the block. This normalization results in better invariance to changes in illumination and shadowing. Furthermore, HOG is invariant to changes in the data background as well as object position.

Although the HOG method is particularly suited for human detection in images [11, 12, 23, 27, 28, 29], in this paper we show that it can be used to represent different objects accurately, and even perform well in multi-class applications.

On the other hand, artificial neural networks (ANN) have been successfully used in a variety of classification tasks in real industrial, scientific, and business applications [13, 30, 31, 32, 33]. Several kinds of neural networks can be used for this purpose, but we decided to use feedforward multi-layer networks or multi-layer perceptron (MLP), which are the most widely studied and used neural network classifiers.

MLP offers features that make it a competitive alternative to various conventional classification methods [34]. First, adaptability, MLP is capable of developing its own feature representation. That is to say, MLP can organize data into the vital aspects or features that enable one pattern to be distinguished from another. Second, generalization, MLP has the ability to respond appropriately to input patterns different from those involved in the learning process. Third, MLP is a universal function approximator, thus it can approximate any function with arbitrary accuracy. Fourth, since MLP uses a nonlinear activation function, it is a nonlinear model, which makes it flexible in modeling real-world complex relationships.

1.1 Related work

Similar works with which our proposal has been compared, because they use descriptors with characteristics similar to ours, are described below. In [35], the object model is generative and probabilistic, so appearance, scale, shape, and occlusion are all modeled by probability density functions, which here are Gaussians. Learning is carried out using the expectation-maximization algorithm, presented in [36], which iteratively converges to learn an object category by detecting regions and their scales and then estimating the parameters of the above densities from these regions, such that the model gives a maximum-likelihood description of the training data. Recognition proceeds by first detecting features and then evaluating these features through a Bayesian process, using the model parameters estimated during learning.

Zhang et al. proposed a scheme that represents the local features of the object by means of the PCA-SIFT method and the global features using the shape context method [37]. Both sets of features are presented to a two-layer AdaBoost training network. Boosting refers to the general method of producing a very accurate prediction rule by combining relatively inaccurate rules of thumb. It has been used widely in computer vision, particularly for object recognition. Layer 1 chooses as "good" features those that have the best ability to discriminate the target object class from the nontarget object class. Then, layer 2 locates the final "good" features based on the distances between the most discriminant local features selected by layer 1. This two-layered boosting method produces two strong classifiers, which can then be used in a cascade for recognition tasks.

On the other hand, in [38] Zhang et al. presented a scheme that represents images as distributions (signatures or histograms) of features extracted with different keypoint detectors and descriptors. The proposed scheme represents an object from the union of a detector with a descriptor. Two complementary local region detector types are used: the Harris-Laplace detector, which responds to corner-like regions, and the Laplacian detector, which extracts blob-like regions. At the most basic level, these two detectors are invariant to scale transformations only. To achieve invariance to other transformations, such as rotation or illumination, this scheme may include the SIFT, SPIN, and RIFT descriptors. For the classification process, the authors use the well-known Support Vector Machines.

In [39], Leibe et al. propose a novel method for detecting and localizing objects of a visual category in cluttered real-world scenes. For the object representation, this proposal introduces a basic model known as the Implicit Shape Model (ISM). An ISM consists of a class-specific alphabet of local appearances that are prototypical for the object category, and of a spatial probability distribution that specifies where each codebook entry may be found on the object. Object detection is then implemented as a probabilistic Hough voting procedure from which hypotheses are found by a scale-adaptive Mean-Shift search.

More recently, Laptev showed how histogram-based image descriptors can be combined with a boosting classifier to provide a robust object detector [40]. Each feature is represented by a histogram of local image measurements within a region. For this purpose, HOG features are adopted, and histograms of alternative image measurements, such as color and second-order image derivatives, are also considered. The HOG features are formed from the orientation of the local image gradient at each point, using Gaussian derivatives of the image computed for a defined scale parameter. The histograms are normalized to the l1 unit norm. During training, the features for normalized training images are computed, and AdaBoost is applied to select a set of features and the corresponding weak classifiers that optimize classification performance.

Table 1 shows a summary of the main features of the schemes mentioned above and of the scheme proposed in this paper.

The overall organization of the paper is as follows: after the introduction, we examine theoretical issues of HOG and MLP in Section 2. Section 3 presents the object recognition system based on HOG and MLP proposed in this paper. Experimental results are discussed in Section 4. Finally, Section 5 concludes the paper.

| Method | Feature extraction | Classifier |
|---|---|---|
| [35] | Detector of Kadir and Brady | Bayesian approach |
| [37] | PCA-SIFT / shape context | Multi-layer AdaBoost |
| [38] | SIFT / SPIN / RIFT | SVM |
| [39] | Hessian-Laplace / Harris-Laplace | k-means / Agglomerative / RNN |
| [40] | Histogram-based descriptor | AdaBoost |
| Our proposal | HOG | MLP |

Table 1.

Main features summary of the mentioned schemes and our proposal.


2. Theoretical background of HOG and MLP models

As mentioned earlier, we will focus on a family of descriptors known as feature descriptors. In general terms, this approach uses four steps to compute a descriptor from an image:

1. An edge detector is applied to the image.

2. A basis point is chosen, i.e., a coordinate in the edge map. A template, usually circular, is defined centered at that point and divided into sections of the same size. These sections divide the image into regions, each of which corresponds to one dimension of the feature vector.

3. The value of a dimension is calculated as the number of edge pixels that fall into the corresponding region (a histogram that summarizes the spatial distribution of edges in the image relative to the chosen basis point). It is common to use the term "bin" to refer to the region in the image as well as the dimension in the feature vector.

4. The feature vector is normalized.

With this approach, if the bins were small enough to each contain one pixel, then the histogram would be an exact description of the shape in the support region.

The HOG descriptor is a member of this family; it was proposed by Dalal and Triggs in [23] and extensively documented by Dalal in his PhD thesis, which was supervised by Triggs [41]. HOG offers a successful and popular object representation, particularly for human representation [22, 42]. HOG is inspired by SIFT and can thus be regarded as a dense version of SIFT. Both algorithms are based on histograms of gradient orientations weighted by gradient magnitudes. However, there are important differences between them. First, the two algorithms differ slightly with regard to the type of spatial bins they use, as HOG has a more sophisticated way of binning. Second, HOG computes the descriptors by means of blocks in dense grids at a single scale without orientation alignment, while in SIFT, descriptors are computed at sparse, scale-invariant key image points and rotated to align orientation. Third, SIFT only computes the gradient histogram for patches around specific interest points obtained by taking the difference of Gaussians in scale space (it is a local descriptor, and SIFT features are usually compared by computing the Euclidean distance between them). HOG, on the other hand, is computed for an entire image by dividing it into smaller cells and summing up the gradients over every pixel within each cell (HOG is used to classify patches using classifiers such as SVM). Finally, SIFT is used for the identification of specific objects, since the Gaussian weighting involved enables it to describe the importance of a particular point, while HOG does not have such a bias. HOG, therefore, is better suited to the classification task than SIFT.

The main idea behind the HOG descriptors is that the appearance and shape of a local object within an image can be described by the distribution of intensity gradients or edge directions. That is to say, the HOG features concentrate on the contrast of silhouette contours against the background. HOG is a window-based descriptor, whereby the window is typically computed by dense sampling over all image points.

In general, the HOG algorithm can be divided into four phases: gradient computation, orientation binning, descriptor blocks, and block normalization. In the first phase, the gradients are computed using Gaussian smoothing followed by discrete derivative masks. The experiments by Dalal and Triggs demonstrated that a simple 1-D $[-1, 0, 1]$ mask with no smoothing ($\sigma = 0$) works best. The second phase is responsible for dividing the gradient image into small connected regions called cells (typical cell size is 6×6 or 8×8 pixels); within each cell, a frequency histogram is computed representing the distribution of edge orientations within the cell. For this purpose, each pixel calculates a weighted function of the gradient magnitude (called a vote; typically, the magnitude itself is used) based on the orientation of the gradient element centered on it. Then, the edge orientations are quantized into $q$ bins uniformly spaced over 0-180° when an unsigned gradient is used, or 0-360° when a signed gradient is used. In the third phase, groups of adjacent cells are considered as spatial regions called blocks (typical block size is 2×2 or 3×3 cells); using an overlap between blocks significantly improves the algorithm's performance. The grouping of cells into a block is the basis for the grouping and normalization of histograms. Two commonly used classes of block geometries are square or rectangular blocks (R-HOG), with associations of spatial cells in squares or rectangles, and circular blocks (C-HOG), partitioned into cells in log-polar fashion. Finally, the blocks defined in the previous phase are normalized. For this purpose, let $v$ be the unnormalized block descriptor, $\|v\|_k$ be its $k$-norm for $k = 1, 2$, and $\epsilon$ be a small normalization constant to avoid division by zero [41]; then the following schemes can be used:

  1. L2-norm: $v \leftarrow v / \sqrt{\|v\|_2^2 + \epsilon^2}$;

  2. L2-Hys: the L2-norm followed by clipping (limiting the maximum values of $v$ to 0.2) and renormalizing;

  3. L1-norm: $v \leftarrow v / (\|v\|_1 + \epsilon)$;

  4. L1-sqrt: $v \leftarrow \sqrt{v / (\|v\|_1 + \epsilon)}$.

The final descriptor is then the vector of all components of the normalized cell responses from all of the blocks in the detection window.
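To make these schemes concrete, the following is a minimal NumPy sketch; the function name and the value of $\epsilon$ are our own illustrative choices, while the clipping at 0.2 in L2-Hys follows the description above:

```python
import numpy as np

def normalize_block(v, scheme="L2-Hys", eps=1e-5):
    """Normalize an unnormalized block vector v with one of the four schemes."""
    v = np.asarray(v, dtype=np.float64)
    if scheme == "L2-norm":
        return v / np.sqrt(np.sum(v**2) + eps**2)
    if scheme == "L2-Hys":
        # L2-norm, clip every component at 0.2, then renormalize
        v = v / np.sqrt(np.sum(v**2) + eps**2)
        v = np.minimum(v, 0.2)
        return v / np.sqrt(np.sum(v**2) + eps**2)
    if scheme == "L1-norm":
        return v / (np.sum(np.abs(v)) + eps)
    if scheme == "L1-sqrt":
        # valid for HOG blocks, whose components are non-negative
        return np.sqrt(v / (np.sum(np.abs(v)) + eps))
    raise ValueError("unknown scheme: " + scheme)
```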

On the other hand, it is common to find a support vector machine (SVM) classifier complementing a HOG feature extractor within an object recognition scheme. One purpose of this paper is to determine the performance of an object recognition system that uses HOG and a neural network-based classifier, particularly a multi-layer perceptron (MLP).

The MLP was derived from the neuronal model known as the perceptron, which was presented by Rosenblatt in 1958 [43]. The perceptron is based on the model of McCulloch and Pitts [44] and uses a learning rule based on error correction. The MLP is an ANN composed of an input layer, $n$ hidden layers, and an output layer, all consisting of perceptron-type neurons. The MLP is the most studied neural network model and can approximate any continuous nonlinear function arbitrarily well on a compact interval. Due to this property, the MLP became popular for parametrizing nonlinear models and for classification purposes. Furthermore, the well-known characteristics of the MLP help to address the problem that arises when the data available for training are insufficient to cover the variability of the object's appearance. This problem is present in most recognition systems, including our own.

The MLP belongs to the category of supervised classifiers. The training set, composed of $p$ pairs of input-output vectors that define the behavior the ANN will adapt to, is defined as

$$\{(\mathbf{x}^1, \mathbf{t}^1), (\mathbf{x}^2, \mathbf{t}^2), \ldots, (\mathbf{x}^p, \mathbf{t}^p)\} = \{(\mathbf{x}^\mu, \mathbf{t}^\mu) \mid \mu = 1, 2, \ldots, p\} \tag{1}$$

where $\mathbf{x}^\mu = (x_i^\mu) \in \mathbb{R}^n$ and $\mathbf{t}^\mu = (t_k^\mu) \in \mathbb{R}^q$ are the input and target vectors, respectively.

The MLP structure is defined as follows: it has an input layer of $n$ units, one or more successive hidden layers of intermediate units, and a layer of $q$ output units. Consider an MLP with $r$ hidden layers, where $\mathbf{x}^\mu = (x_i^\mu) \in \mathbb{R}^n$ is the $\mu$-th input vector belonging to the training set (defined by Eq. (1)). Then, $\mathbf{y}^{l\mu} = (y_{j_l}^{l\mu}) \in \mathbb{R}^{m_l}$, $\mathbf{z}^\mu = (z_k^\mu) \in \mathbb{R}^q$, and $\mathbf{t}^\mu = (t_k^\mu) \in \mathbb{R}^q$ represent the output of the $l$-th hidden layer, the output generated by the MLP, and the target output the MLP must generate, respectively, when $\mathbf{x}^\mu$ is presented to the network; here $l = 1, 2, \ldots, r$, and $m_l$ and $q$ indicate the number of units comprising the $l$-th hidden layer and the output layer, respectively.

Each unit's output in layer $l$ is connected to the input of each unit in layer $l+1$, and a synaptic weight is associated with each of these connections. Thus, the synaptic weights and thresholds of the first hidden layer are represented by $W^1 = (w_{j_1 i}^1) \in \mathbb{R}^{m_1 \times n}$ and $\boldsymbol{\theta}^1 = (\theta_{j_1}^1) \in \mathbb{R}^{m_1}$, respectively. For the remaining hidden layers, $l = 2, \ldots, r$, the synaptic weights and thresholds are defined as $W^l = (w_{j_l j_{l-1}}^l) \in \mathbb{R}^{m_l \times m_{l-1}}$ and $\boldsymbol{\theta}^l = (\theta_{j_l}^l) \in \mathbb{R}^{m_l}$. Finally, the synaptic weights and thresholds of the output layer are defined as $W^{r+1} = (w_{k j_r}^{r+1}) \in \mathbb{R}^{q \times m_r}$ and $\boldsymbol{\theta}^{r+1} = (\theta_k^{r+1}) \in \mathbb{R}^q$.

MLP is a feed-forward neural network, and its operation is defined as follows. The output of first hidden layer is expressed mathematically as follows:

$$y_{j_1}^1 = f\left(\sum_i w_{j_1 i}^1\, x_i - \theta_{j_1}^1\right) \tag{2}$$

Whereas the output of the l-th hidden layer, with l=2r, is computed by

$$y_{j_l}^l = f\left(\sum_{j_{l-1}} w_{j_l j_{l-1}}^l\, y_{j_{l-1}}^{l-1} - \theta_{j_l}^l\right) \tag{3}$$

Finally, the MLP operation is defined as:

$$z_k = g\left(\sum_{j_r} w_{k j_r}^{r+1}\, y_{j_r}^r - \theta_k^{r+1}\right) \tag{4}$$

Typically, the activation functions of the hidden-layer units, $f$, are nonlinear; e.g., the unipolar sigmoid function $1/(1+e^{-x})$ and the bipolar sigmoid function $(1-e^{-x})/(1+e^{-x})$. Activation functions of this type introduce nonlinearity into the network and enable the MLP to approximate any nonlinear function with arbitrary accuracy. The activation functions of the units in the output layer, $g$, may be linear or nonlinear, depending on the application.
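As an illustrative sketch (the function names and list-based parameter layout are ours), the forward pass of Eqs. (2)-(4) can be written in NumPy as:

```python
import numpy as np

def sigmoid(u):
    """Unipolar sigmoid activation: 1 / (1 + e^{-u})."""
    return 1.0 / (1.0 + np.exp(-u))

def mlp_forward(x, weights, thresholds, f=sigmoid, g=sigmoid):
    """Forward pass of an MLP with r hidden layers, per Eqs. (2)-(4).

    weights:    list [W^1, ..., W^{r+1}], where W^l has shape (m_l, m_{l-1})
    thresholds: list [theta^1, ..., theta^{r+1}]
    """
    y = np.asarray(x, dtype=np.float64)
    for W, theta in zip(weights[:-1], thresholds[:-1]):
        y = f(W @ y - theta)                      # Eqs. (2) and (3)
    return g(weights[-1] @ y - thresholds[-1])    # Eq. (4)
```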

The success of the MLP rests on the design of training algorithms that minimize the error committed by the network by adequately and automatically modifying the values of the synaptic weights. In this sense, the training algorithm called backpropagation is the most popular one used to adapt the MLP to a specific application, because it is conceptually simple and computationally efficient. In 1974, Paul Werbos developed the basic principles of backpropagation while writing his PhD thesis, by implementing a system that estimated a dynamic model for predicting social communications and nationalism [45]. In 1986, Rumelhart et al. formalized the backpropagation algorithm as a method that allows an MLP-type ANN to learn the association that exists between a set of input patterns and the corresponding classes [46]. Over time, backpropagation has become one of the most widely used neural learning methods, proving to be an efficient tool in applications of pattern recognition, dynamic modeling, sensitivity analysis, and the control of systems over time, among others.

The backpropagation algorithm searches for the minimum of the error function in weight space using the method of gradient descent. This method is applied to train the units of the hidden layers of an MLP; that is to say, the basic idea of this algorithm is that updating the synaptic weights of the units of a layer depends on the error generated by the layer itself and the errors generated by the following layers. This is established by the mathematical structure of the backpropagation algorithm, which can be expressed as follows:

$$\Delta w_{k j_r}^{r+1} = \varepsilon\, \delta z_k\, y_{j_r}^r, \tag{5a}$$

where $\delta z_k = (t_k - z_k)\, g'(u_k^{r+1})$ and $u_k^{r+1} = \sum_{j_r} w_{k j_r}^{r+1} y_{j_r}^r - \theta_k^{r+1}$;

$$\Delta w_{j_l j_{l-1}}^l = \varepsilon\, \delta y_{j_l}^l\, y_{j_{l-1}}^{l-1}, \tag{5b}$$

where $\delta y_{j_l}^l = f'(u_{j_l}^l) \sum_k \delta z_k\, w_{k j_l}^{l+1}$ and $u_{j_l}^l = \sum_{j_{l-1}} w_{j_l j_{l-1}}^l y_{j_{l-1}}^{l-1} - \theta_{j_l}^l$, for $l = 2, \ldots, r$;

$$\Delta w_{j_1 i}^1 = \varepsilon\, \delta y_{j_1}^1\, x_i, \tag{5c}$$

where $\delta y_{j_1}^1 = f'(u_{j_1}^1) \sum_{j_2} \delta y_{j_2}^2\, w_{j_2 j_1}^2$ and $u_{j_1}^1 = \sum_i w_{j_1 i}^1 x_i - \theta_{j_1}^1$. Here $\varepsilon$ is a small-valued constant that defines the learning rate of the network.
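For the single-hidden-layer case used later in this chapter, one gradient-descent step of Eqs. (5a) and (5c) can be sketched as follows, assuming sigmoid activations in both layers so that $f'(u) = f(u)(1 - f(u))$:

```python
import numpy as np

def backprop_step(x, t, W1, th1, W2, th2, eps=0.01):
    """One weight update for a single-hidden-layer MLP (Eqs. (5a) and (5c))."""
    sig = lambda u: 1.0 / (1.0 + np.exp(-u))
    # forward pass
    y1 = sig(W1 @ x - th1)          # hidden-layer output, Eq. (2)
    z = sig(W2 @ y1 - th2)          # network output, Eq. (4)
    # output-layer error terms: delta_z_k = (t_k - z_k) g'(u_k)
    dz = (t - z) * z * (1.0 - z)
    # hidden-layer error terms: delta_y_j = f'(u_j) * sum_k delta_z_k w_kj
    dy = y1 * (1.0 - y1) * (W2.T @ dz)
    # weight and threshold updates; thresholds enter the net input with a minus sign
    W2 += eps * np.outer(dz, y1)
    th2 -= eps * dz
    W1 += eps * np.outer(dy, x)
    th1 -= eps * dy
    return W1, th1, W2, th2
```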


3. Object recognition based on histogram of oriented gradients and multi-layer perceptron

The object recognition system based on the histogram of oriented gradients and the multi-layer perceptron (ORS HOG-MLP) proposed in this paper makes the following contributions: (1) it offers good performance in multi-class applications; (2) it determines the performance of an object recognition system that uses HOG together with a neural network-based classifier, particularly an MLP; and (3) it proposes a modification to the gradient computation step that improves the representation of object properties and thereby the characterization of the image.

The ORS HOG-MLP is an algorithm that is geared to automatic object recognition. Figure 1 shows the elements that are part of this system and the relationship that exists between them.

Figure 1.

Block diagram of ORS HOG-MLP.

Initially, a window detector process takes samples over the entire image; each sample has a fixed size and is called a detection window. The algorithm performs better when there is an overlap between detection windows. The image is sampled several times, and on each occasion, the detection window size is different. This feature is intended to avoid image segmentation and to make the algorithm robust to the object's size.
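A minimal sketch of such a window detector is shown below; the window sizes and overlap fraction are illustrative assumptions, since the text only specifies that windows have a fixed size per pass, overlap each other, and are re-sampled at several sizes:

```python
import numpy as np

def detection_windows(image, win_sizes=((64, 64), (96, 96), (128, 128)), overlap=0.5):
    """Yield detection windows sampled densely over the whole image."""
    h, w = image.shape[:2]
    for wh, ww in win_sizes:            # one pass per window size
        step_y = max(1, int(wh * (1.0 - overlap)))
        step_x = max(1, int(ww * (1.0 - overlap)))
        for y in range(0, h - wh + 1, step_y):
            for x in range(0, w - ww + 1, step_x):
                yield image[y:y + wh, x:x + ww]
```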

Then the ORS HOG-MLP algorithm is applied to the extracted detection window. First, the gradient of the image at each pixel is computed. Let the discrete version of a detection window be represented as the matrix $A = (a_{ij})_{w_i \times h_i}$, where $w_i$ and $h_i$ are the width and height of the detection window, respectively, and $a_{ij}$ represents the $ij$-th pixel value, $0 \le a_{ij} \le 2^n - 1$, with $n$ the number of bits necessary to represent the value of a pixel. The aim of this process is to compute the magnitude and direction of the gradient at each pixel. For this purpose, we use a 1-D $[-1, 0, 1]$ mask with no smoothing ($\sigma = 0$), which is applied over all image pixels. The magnitudes and directions obtained by this process are grouped into two matrices ($M$ and $Th$), defined as

$$M = (m_{ij})_{w_i \times h_i}, \quad m_{ij} = \sqrt{(a_{i+1,j} - a_{i-1,j})^2 + (a_{i,j+1} - a_{i,j-1})^2}$$
$$Th = (th_{ij})_{w_i \times h_i}, \quad th_{ij} = \tan^{-1}\!\left(\frac{a_{i,j+1} - a_{i,j-1}}{a_{i+1,j} - a_{i-1,j}}\right) \tag{6}$$
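A direct NumPy translation of Eq. (6) might look as follows; zero-padding the border pixels and folding directions into 0-180° are our own conventions:

```python
import numpy as np

def gradient_magnitude_direction(A):
    """Compute the magnitude matrix M and direction matrix Th of Eq. (6)."""
    A = A.astype(np.float64)
    gy = np.zeros_like(A)
    gx = np.zeros_like(A)
    gy[1:-1, :] = A[2:, :] - A[:-2, :]   # a_{i+1,j} - a_{i-1,j}: 1-D [-1, 0, 1] mask
    gx[:, 1:-1] = A[:, 2:] - A[:, :-2]   # a_{i,j+1} - a_{i,j-1}
    M = np.sqrt(gy**2 + gx**2)
    # tan^{-1} of the ratio in Eq. (6), folded into 0-180 deg (unsigned gradient)
    Th = np.mod(np.degrees(np.arctan2(gx, gy)), 180.0)
    return M, Th
```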

Figure 2 shows the results of the gradient computation process. A detection window includes information that does not belong to the object of interest (information about other objects and background information); during the gradient computation process, this information corrupts the representation of the object of interest.

Figure 2.

Results of the gradient computation process. (a) Original image. (b) Gradient magnitude. (c) Pixel intensity proportional to the gradient direction (gradient direction). (d) Improved gradient magnitude. (e) Pixel intensity proportional to the improved gradient direction.

In order to improve the representation of the properties of the object of interest by reducing this information, we propose to apply a threshold operation to the matrices $M$ and $Th$:

$$m_{ij} = \begin{cases} 0, & \text{if } m_{ij} < umbral\_grad \\ m_{ij}, & \text{otherwise} \end{cases} \qquad th_{ij} = \begin{cases} 0, & \text{if } m_{ij} < umbral\_grad \\ th_{ij}, & \text{otherwise} \end{cases} \tag{7}$$

Figure 2 also shows the result obtained by applying this process on the matrices.
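The threshold operation of Eq. (7) then reduces to a simple masking step; umbral_grad is the threshold value, which the chapter leaves as a free parameter:

```python
import numpy as np

def threshold_gradients(M, Th, umbral_grad):
    """Zero out gradients whose magnitude falls below umbral_grad (Eq. (7))."""
    weak = M < umbral_grad
    M, Th = M.copy(), Th.copy()
    M[weak] = 0.0
    Th[weak] = 0.0
    return M, Th
```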

Now, both matrices $M$ and $Th$ are divided into $c_x \times c_y$ small connected regions of $p_x \times p_y$ pixels, called magnitude cells and direction cells, and defined as $cm^{lk} = (c\_m_{ij}^{lk})_{p_x \times p_y}$ and $cth^{lk} = (c\_th_{ij}^{lk})_{p_x \times p_y}$, with $c_x = w_i / p_x$ and $c_y = h_i / p_y$ (see Figure 3). Then, each pixel calculates a weighted function of its gradient magnitude based on its gradient orientation to contribute to the histogram of the cell to which it belongs. For this purpose, a vector of $p = 9$ bins uniformly spaced over 0-180° is defined for the quantification of gradient orientations: $(bin_0, bin_1, \ldots, bin_8) = (10°, 30°, \ldots, 170°)$. The distance between bins is denoted by $d_{bins} = 20°$.

Figure 3.

Details of orientation binning and descriptor block phases.

Let $MC = (Mc^{lk})$ be a matrix of $c_x \times c_y$ elements, called the bins matrix, where the element $Mc^{lk} = (mc_o^{lk})$ is a $p$-dimensional vector containing the bins $mc_o^{lk}$ of the histogram of the $lk$-th cell, with $o = 0, 1, \ldots, p-1$ (see Figure 3). Furthermore, considering two adjacent bins, $bin_o$ and $bin_{o+1}$, with $bin_o \le c\_th_{ij}^{lk} \le bin_{o+1}$, the distances between $c\_th_{ij}^{lk}$ and each of the bins are defined as $d_o = c\_th_{ij}^{lk} - bin_o$ and $d_{o+1} = bin_{o+1} - c\_th_{ij}^{lk}$.

Thus, the result of this process is defined as

$$mc_o^{lk} = \begin{cases} mc_o^{lk} + \left(1 - \dfrac{d_o}{d_{bins}}\right) c\_m_{ij}^{lk}, & \text{if } bin_o < c\_th_{ij}^{lk} \\ mc_o^{lk} + c\_m_{ij}^{lk}, & \text{if } bin_o = c\_th_{ij}^{lk} \end{cases}$$
$$mc_{o+1}^{lk} = \begin{cases} mc_{o+1}^{lk} + \left(1 - \dfrac{d_{o+1}}{d_{bins}}\right) c\_m_{ij}^{lk}, & \text{if } c\_th_{ij}^{lk} < bin_{o+1} \\ mc_{o+1}^{lk} + c\_m_{ij}^{lk}, & \text{if } bin_{o+1} = c\_th_{ij}^{lk} \end{cases} \tag{8}$$
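A sketch of the per-cell voting of Eq. (8) follows; directions are assumed to be already folded into 0-180°, and angles outside the first and last bin centers are clamped, which is one of several reasonable boundary conventions:

```python
import numpy as np

def cell_histogram(c_m, c_th, p=9, d_bins=20.0):
    """Build the p-bin histogram of one cell by interpolated voting (Eq. (8))."""
    bins = 10.0 + d_bins * np.arange(p)            # bin centers 10, 30, ..., 170 deg
    hist = np.zeros(p)
    for m, th in zip(c_m.ravel(), c_th.ravel()):
        if m == 0.0:                               # pixels removed by Eq. (7)
            continue
        o = int(np.clip((th - bins[0]) // d_bins, 0, p - 2))   # lower adjacent bin
        d_o = np.clip((th - bins[o]) / d_bins, 0.0, 1.0)       # normalized distance
        hist[o] += (1.0 - d_o) * m                 # vote split between the two bins
        hist[o + 1] += d_o * m
    return hist
```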

Figure 4 shows the result of applying the orientation binning process.

Figure 4.

Result of orientation binning process. (a) Original image. (b) Gradient magnitude. (c) Gradient direction. (d) Histograms of the cells. (e) Descriptor blocks phase.

In the next process, sets of 2×2 adjacent cells are grouped into spatial regions called descriptor blocks ($n\_Mc$ denotes the number of $Mc$ per block). In order to obtain better performance, the blocks are formed using an overlap of one cell on both the x-axis and the y-axis (see Figures 3 and 4e). The result of this process is the matrix $BL = (bl^{b_x, b_y})$, composed of $(c_x - 1) \times (c_y - 1)$ descriptor blocks, where $b_x = 1, 2, \ldots, c_x - 1$, $b_y = 1, 2, \ldots, c_y - 1$, and a block $bl$ is defined as

$$bl^{b_x, b_y} = \begin{pmatrix} Mc^{b_x, b_y} & Mc^{b_x, b_y+1} \\ Mc^{b_x+1, b_y} & Mc^{b_x+1, b_y+1} \end{pmatrix} \tag{9}$$

Then, each descriptor block must be normalized. For this purpose, we decided to use the L2-Hys block normalization scheme, which first applies the L2-norm:

$$bl^{b_x,b_y} \leftarrow \frac{bl^{b_x,b_y}}{\sqrt{\|bl^{b_x,b_y}\|_2^2 + \epsilon^2}} = \frac{bl^{b_x,b_y}}{\sqrt{\sum_o \left[(mc_o^{b_x,b_y})^2 + (mc_o^{b_x,b_y+1})^2 + (mc_o^{b_x+1,b_y})^2 + (mc_o^{b_x+1,b_y+1})^2\right] + \epsilon^2}} \tag{10}$$

Thus, each $mc_o$ (bin) of the $bl^{b_x,b_y}$ block is limited to a maximum value of 0.2 ($mc_o \leftarrow 0.2$ if $mc_o > 0.2$), and the block is then renormalized with the L2-norm.

Eventually, the final object descriptor, $Od$, is an $r$-dimensional vector of all components (bins) of the normalized cell responses from all of the blocks in the detection window:

$$Od = (bl^{1,1}, bl^{1,2}, \ldots, bl^{b_x,b_y}) = (mc_0^{1,1}, \ldots, mc_8^{1,1},\; mc_0^{1,2}, \ldots, mc_8^{1,2},\; mc_0^{2,1}, \ldots, mc_8^{2,2},\; \ldots,\; mc_0^{b_x+1,b_y+1}, \ldots, mc_8^{b_x+1,b_y+1}) \tag{11}$$

where $r = b_x \times b_y \times n\_Mc \times p$ and $od_0 = mc_0^{1,1}$, $od_1 = mc_1^{1,1}$, ..., $od_8 = mc_8^{1,1}$, $od_9 = mc_0^{1,2}$, ..., $od_{17} = mc_8^{1,2}$, ..., $od_{r-9} = mc_0^{b_x+1,b_y+1}$, ..., $od_{r-1} = mc_8^{b_x+1,b_y+1}$.
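Putting the last two phases together, the block grouping of Eq. (9), the L2-Hys normalization of Eq. (10), and the concatenation into $Od$ of Eq. (11) can be sketched as:

```python
import numpy as np

def hog_descriptor(cell_hists, eps=1e-5):
    """Assemble Od from the c_x-by-c_y grid of 9-bin cell histograms."""
    cx, cy, p = cell_hists.shape
    blocks = []
    for bx in range(cx - 1):                    # one-cell overlap on both axes
        for by in range(cy - 1):
            bl = cell_hists[bx:bx + 2, by:by + 2, :].ravel()   # 2x2 cells, Eq. (9)
            bl = bl / np.sqrt(np.sum(bl**2) + eps**2)          # L2-norm, Eq. (10)
            bl = np.minimum(bl, 0.2)                           # clip bins at 0.2
            bl = bl / np.sqrt(np.sum(bl**2) + eps**2)          # renormalize (L2-Hys)
            blocks.append(bl)
    return np.concatenate(blocks)               # r-dimensional descriptor, Eq. (11)
```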

Figure 5 shows the sequence of processes that generate the HOG descriptor of the object.

Figure 5.

Sequence of processes for generating the HOG descriptor of the object.

Now, the MLP belonging to ORS HOG-MLP must be adapted to recognize the objects of interest. For this purpose, the processes described above are applied to $q$ objects of interest, and each descriptor obtained is associated with its corresponding class, $c = (c_k)$; thus, the training set of the MLP is defined as

$$\{(Od^1, c_1), (Od^2, c_2), \ldots, (Od^q, c_q)\} = \{(Od^\mu, c_\mu) \mid \mu = 1, 2, \ldots, q\} \tag{12}$$

The MLP structure is established as follows: it has one hidden layer of $h$ units, the input layer has $r$ units, and the output layer has $v$ units. Considering the training set, the backpropagation learning algorithm is used to generate the bank of models, which includes the vital information of the objects that make up the training set. Essentially, the bank of models resides in the synaptic weights, $W^1 = (w_{jo}^1) \in \mathbb{R}^{h \times r}$ and $W^2 = (w_{kj}^2) \in \mathbb{R}^{v \times h}$, that define the connections between the neurons of the MLP.

Finally, the operation of ORS HOG-MLP when the object descriptor $Od^\mu$ is presented is defined as

$$c_k^\mu = g\left(\sum_{j=0}^{h-1} w_{kj}^2\, f\left(\sum_{o=0}^{r-1} w_{jo}^1\, od_o^\mu - \theta_j^1\right) - \theta_k^2\right) \tag{13}$$

where $k = 0, 1, \ldots, v-1$.


4. Experimental results and discussion

This section is intended to measure the performance of ORS HOG-MLP through a set of experiments. In the first experiment, system performance is analyzed when the system is adapted to recognize only one object class; this experiment was conducted on several object classes. In the second experiment, the system behavior is analyzed when it is configured to recognize multiple classes. Finally, our scheme's performance is compared with other results reported in the literature.

Before we present and discuss the data obtained from the experiments, let us introduce the concepts and methods used to measure ORS HOG-MLP performance. Let $O$ be an object presented to ORS HOG-MLP, and let the system generate the class $c_0$ to indicate that the object does not belong to its bank of models. Then:

A true positive (TP) occurs when the class generated by the system is $c_\mu$ and $O$ belongs to the class $c_\mu$, for $\mu = 1, \ldots, q$; this indicates a successful classification.

A true negative (TN) occurs when the class generated by the system is $c_0$ and $O$ does not belong to the bank of models; this indicates a successful rejection.

A false positive (FP) occurs when the class generated by the system is $c_\mu$ and $O$ does not belong to this class, for $\mu = 1, \ldots, q$; this indicates an incorrect classification.

A false negative (FN) occurs when the class generated by the system is $c_0$ and $O$ belongs to the bank of models; this indicates an incorrect rejection.

From these concepts, objective measures that give information concerning system performance are obtained.

True positive rate (TPR). TPR determines the sensitivity of the system, i.e., it measures the proportion of successful classifications obtained by the system; TPR is defined as

$$\mathrm{TPR} = \frac{\text{number of TP}}{\text{number of TP} + \text{number of FN}} \tag{14}$$

False positive rate (FPR). FPR indicates the proportion of wrongly classified objects; FPR is defined as

$$\mathrm{FPR} = \frac{\text{number of FP}}{\text{number of FP} + \text{number of TN}} \tag{15}$$

Accuracy (ACC). ACC is used to evaluate the tendency of the system to ascertain correct TPs, defined as

$$\mathrm{ACC} = \frac{\text{number of TP} + \text{number of TN}}{\text{number of TP} + \text{number of TN} + \text{number of FP} + \text{number of FN}} \tag{16}$$

False negative rate (FNR). FNR indicates the miss rate of the system, defined as

$$\mathrm{FNR} = \frac{\text{number of FN}}{\text{number of TP} + \text{number of FN}} = 1 - \mathrm{TPR} \tag{17}$$

False positives per window (FPPW). FPPW indicates the number of errors by detection window and it is defined as

$$\mathrm{FPPW} = \frac{\text{number of FP}}{N} \tag{18}$$

where $N$ is the total number of windows processed. Finally, the proposed MLP classifier returns a real value for each detection window; this value is thresholded with a fixed value $u$ in order to determine whether or not the window contains an object belonging to a class of the bank of models. Thus, FNR and FPPW are functions of $u$, allowing the plotting of ROC (Receiver Operating Characteristic) evaluation curves [47], Eq. (19), that show the tradeoff between the miss rate and the FPPW for each $u$:

$$E(u) = \left(\mathrm{FPPW}(u), \mathrm{FNR}(u)\right) \tag{19}$$
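Given raw detection counts, the five measures of Eqs. (14)-(18) reduce to a few lines; a minimal sketch:

```python
def performance_metrics(tp, tn, fp, fn, n_windows):
    """Evaluation measures of Eqs. (14)-(18) from raw detection counts."""
    return {
        "TPR":  tp / (tp + fn),                   # sensitivity, Eq. (14)
        "FPR":  fp / (fp + tn),                   # false positive rate, Eq. (15)
        "ACC":  (tp + tn) / (tp + tn + fp + fn),  # accuracy, Eq. (16)
        "FNR":  fn / (tp + fn),                   # miss rate = 1 - TPR, Eq. (17)
        "FPPW": fp / n_windows,                   # false positives per window, Eq. (18)
    }
```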

The Caltech 101 dataset, collected by Fei-Fei et al. [48], was used for benchmarking our proposal. For both the training and operation phases of the system, positive images (those that contain the objects of interest) and negative images (those that do not contain objects of interest, i.e., background images) were generated.

From the dataset, a set of 1179 negative images was taken for the system training phase, $I_{train}$, and a set of 1179 negative images for the system operation phase, $I_{test}$. The number of positive images for the training and operation phases, $W^+_{train\_\gamma}$ and $W^+_{test\_\gamma}$, respectively, varies for each class (see Table 2).

| Data set | Positive images, training ($W^+_{train\_\gamma}$) | Positive images, testing ($W^+_{test\_\gamma}$) |
|---|---|---|
| Airplane (cA) | 200 | 200 |
| Butterfly (cB) | 40 | 40 |
| CarSide (cC) | 86 | 86 |
| Chair (cD) | 30 | 30 |
| Electric guitar (cE) | 30 | 30 |
| Faces (cF) | 100 | 100 |
| Helicopter (cG) | 40 | 40 |
| Horses (cH) | 85 | 85 |
| Ketch (cI) | 50 | 50 |
| Laptop (cJ) | 40 | 40 |
| Motorbikes (cK) | 200 | 200 |
| Piano (cL) | 45 | 45 |
| Revolver (cM) | 40 | 40 |
| SoccerBall (cN) | 30 | 30 |

Table 2.

Positive images per class.

In order to demonstrate the performance of our proposal when only one object needs to be identified, in the first experiment ORS HOG-MLP was adapted to individually identify each class of objects in Table 2. With the intention of adding robustness to the system, additional negative windows for training and testing, $W^-_{train}$ (1743 windows) and $W^-_{test}$ (1909 windows), respectively, were generated from $I_{train}$ and $I_{test}$. Thus, the final negative training and testing sets were defined as $I^-_{train} = I_{train} \cup W^-_{train}$, with 2922 images, and $I^-_{test} = I_{test} \cup W^-_{test}$, with 3288 images.

Then, the negative images are associated with class 0, and the positive images of the object under study, identified by $\gamma$, with class 1. The training set is defined as $\{(I^-_{train}, c_0), (W^+_{train\_\gamma}, c_1)\}$; this set is presented to ORS HOG-MLP in order to adapt it to recognize the object $\gamma$. On the other hand, the ORS HOG-MLP performance is evaluated by applying the recognition phase to the sets $I^-_{test}$ and $W^+_{test\_\gamma}$. This process was repeated for all objects belonging to Table 2.

The HOG algorithm parameters were adjusted as follows: the number of cells per detection window varies depending on the object shape; 9 bins uniformly spaced over 0-180° are defined; the block size is 2×2 adjacent cells; and an overlap of one cell on both the x-axis and y-axis is used. The MLP is defined with the following features: the hidden layer has 5 neurons; the activation functions of the neurons in the hidden layer and output layer are sigmoid; the initial weights are random values in the range $[-0.25, 0.25]$; the neuron bias is −1.0; the learning rate is $\varepsilon = 0.01$; and, for each object, 20,000 iterations were carried out to train the network. Figure 6 shows some examples of detections, and Figure 7 and Table 3 summarize the results of the first experiment.

Figure 6.

Examples of detection results on the Caltech 101 dataset. Detected objects are enclosed in rectangles.

Figure 7.

Performance of our proposal based on evaluation curves ROC: (a) Airplane class, (b) Butterfly class, (c) Motorbikes class.

| Data set | TPR | FPR | ACC | FP | FN | Number of cells |
|---|---|---|---|---|---|---|
| Airplane | 0.955 | 0.00097 | 0.9963 | 3 | 9 | 16×8 |
| Airplane | 0.945 | 0.00060 | 0.9960 | 2 | 11 | 8×8 |
| Butterfly | 0.45 | 0.00130 | 0.9916 | 4 | 22 | 12×8 |
| Butterfly | 0.45 | 0.00190 | 0.9910 | 6 | 22 | 8×8 |
| Carside | 0.767 | 0.00097 | 0.9927 | 3 | 20 | 12×8 |
| Carside | 0.732 | 0.00097 | 0.9918 | 3 | 23 | 8×8 |
| Chair | 0.133 | 0.0 | 0.9916 | 0 | 26 | 12×8 |
| Chair | 0.233 | 0.00032 | 0.9923 | 1 | 23 | 8×8 |
| Electric guitar | 0.400 | 0.00032 | 0.9939 | 1 | 18 | 12×8 |
| Electric guitar | 0.400 | 0.00032 | 0.9939 | 1 | 18 | 8×8 |
| Faces | 0.960 | 0.00010 | 0.9987 | 0 | 4 | 16×8 |
| Faces | 0.760 | 0.00010 | 0.9924 | 0 | 24 | 8×8 |
| Helicopter | 0.375 | 0.00356 | 0.9884 | 11 | 25 | 16×8 |
| Helicopter | 0.500 | 0.00453 | 0.9891 | 14 | 20 | 8×8 |
| Horses | 0.741 | 0.0 | 0.9930 | 0 | 22 | 12×8 |
| Horses | 0.823 | 0.00388 | 0.9914 | 12 | 15 | 8×8 |
| Ketch | 0.680 | 0.00226 | 0.9926 | 7 | 16 | 10×10 |
| Ketch | 0.740 | 0.00064 | 0.9952 | 2 | 13 | 8×8 |
| Laptop | 0.650 | 0.00032 | 0.9952 | 1 | 14 | 16×16 |
| Laptop | 0.625 | 0.00064 | 0.9945 | 2 | 15 | 8×8 |
| Motorbikes | 0.960 | 0.00032 | 0.9972 | 1 | 8 | 16×8 |
| Motorbikes | 0.980 | 0.00001 | 0.9987 | 0 | 4 | 8×8 |
| Piano | 0.800 | 0.00064 | 0.9964 | 2 | 9 | 12×8 |
| Piano | 0.777 | 0.0 | 0.9968 | 0 | 10 | 8×8 |
| Revolver | 0.700 | 0.00194 | 0.9942 | 6 | 12 | 16×8 |
| Revolver | 0.675 | 0.00129 | 0.9945 | 4 | 13 | 8×8 |
| Soccerball | 0.241 | 0.00032 | 0.9926 | 1 | 22 | 12×12 |
| Soccerball | 0.241 | 0.00064 | 0.9923 | 1 | 22 | 8×8 |
| Watch | 0.660 | 0.00226 | 0.9871 | 7 | 34 | 12×8 |
| Watch | 0.700 | 0.00259 | 0.9880 | 8 | 30 | 8×8 |

Table 3.

Results of first experiment.

Considering the results shown in Table 3, the average values of the TPR and FPR parameters, 0.6387 and 0.001228, respectively, show that the probability of correctly classifying an object is approximately 64%, and the probability of misclassifying an object is less than 1%. Meanwhile, the ACC parameter indicates that the system accuracy is over 98% (e.g., for motorbikes, corresponding to 198 out of 200 correct detections with 2 false positives). These results also show that using a detection window size of 8×8 or 16×8 cells does not significantly affect system performance.

In the definition of the MLP structure, tests were performed using 5, 10, 15, and 20 neurons in the hidden layer. The results show variations of 0.5% at $10^{-3}$ FPPW, so we decided to work with the smallest number of neurons to reduce the computational cost of the system.

In the second experiment, ORS HOG-MLP is configured as a multi-class recognition system. The system is trained to recognize several groups of objects belonging to Table 2, where the number of objects per group can be 2, 3, 4, or 6. Thus, the training set is defined as $\{(I^-_{train}, c_0), (W^+_{train\_\gamma_1}, c_1), \ldots, (W^+_{train\_\gamma_\mu}, c_\mu)\}$, $\mu = 2, 3, 4, 6$, where $\gamma$ identifies the different objects belonging to a group; e.g., $\{(I^-_{train}, c_0), (W^+_{train\_cA}, c_1), (W^+_{train\_cC}, c_2), (W^+_{train\_cK}, c_3)\}$ is a group that includes the airplane, carside, and motorbike objects, with the negative images associated with class 0. For this group, the recognition phase uses $I^-_{test}$, $W^+_{test\_cA}$, $W^+_{test\_cC}$, and $W^+_{test\_cK}$ to evaluate system performance.

The configuration parameters of the HOG algorithm and the MLP are the same as those used in the first experiment. Table 4 shows the results of this experiment. They show that system performance is not affected when operating in multi-class mode. This is deduced from the average values of TPR and ACC, which indicate that the probability of correctly classifying an object is approximately 68% and that the system accuracy fluctuates between 97% and 99%. For example, for group2 = {airplane1, carside2, electricguitar3, motorbikes4, revolver5, watch6}, using a window of 8×8 cells, ORS HOG-MLP presents an ACC of 98% for airplane1 (corresponding to 196 out of 200 correct detections with 4 false positives) and an ACC of 99% for motorbikes4 (corresponding to 198 out of 200 correct detections with 2 false positives). These results make evident the robustness of the proposed system when it is used in multi-object recognition applications.

| Group | Objects number | TPR1 | ACC1 | TPR2 | ACC2 | TPR3 | ACC3 | TPR4 | ACC4 | TPR5 | ACC5 | TPR6 | ACC6 | #Cells |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 6 | 0.025 | 0.976 | 0.03 | 0.979 | 1.000 | 0.989 | 0.660 | 0.983 | 0.75 | 0.985 | 0.103 | 0.980 | 8×8 |
| 2 | 6 | 0.930 | 0.988 | 0.670 | 0.983 | 0.100 | 0.983 | 0.970 | 0.991 | 0.600 | 0.987 | 0.630 | 0.981 | 8×8 |
| 2 | 6 | 0.960 | 0.993 | 0.770 | 0.989 | 0.030 | 0.986 | 0.995 | 0.996 | 0.680 | 0.992 | 0.630 | 0.984 | 12×8 |
| 3 | 6 | 0.960 | 0.993 | 0.180 | 0.985 | 0.130 | 0.987 | 0.970 | 0.99 | 0.450 | 0.988 | 0.975 | 0.994 | 8×8 |
| 4 | 4 | 0.935 | 0.994 | 0.710 | 0.990 | 0.350 | 0.990 | 0.955 | 0.995 | | | | | 8×8 |
| 4 | 4 | 0.945 | 0.993 | 0.670 | 0.988 | 0.28 | 0.987 | 0.975 | 0.995 | | | | | 12×8 |
| 5 | 3 | 0.920 | 0.993 | 0.700 | 0.989 | 0.950 | 0.995 | | | | | | | 8×8 |
| 5 | 3 | 0.930 | 0.994 | 0.690 | 0.990 | 0.960 | 0.996 | | | | | | | 16×8 |
| 6 | 2 | 0.950 | 0.995 | 0.980 | 0.997 | | | | | | | | | 8×8 |
| 6 | 2 | 0.975 | 0.997 | 0.980 | 0.997 | | | | | | | | | 12×8 |
| 7 | 2 | 0.930 | 0.997 | 0.600 | 0.994 | | | | | | | | | 8×8 |

Table 4.

Results of second experiment.

The groups are composed as follows: group1 = {butterfly1, chair2, faces3, ketch4, laptop5, soccerball6}, group2 = {airplane1, carside2, electricguitar3, motorbikes4, revolver5, watch6}, group3 = {airplane1, butterfly2, chair3, faces4, laptop5, motorbikes6}, group4 = {airplane1, carside2, helicopter3, motorbikes4}, group5 = {airplane1, carside2, motorbikes3}, group6 = {airplane1, motorbikes2}, group7 = {faces1, revolver2}.

Finally, the performance of the proposed scheme is compared with several object recognition schemes that have been cited frequently in related studies in this area. Table 5 shows the results of this comparison.

| Method | Motorbikes | Cars |
|---|---|---|
| Object recognition scheme proposed in [35] | 92.5% | 88.5% |
| Object recognition scheme proposed in [37] | 99.0% | |
| Object recognition scheme proposed in [38] | 98.5% | 95.0% |
| Object recognition scheme proposed in [39] | 97.4% | 96.7% |
| Object recognition scheme proposed in [40] | 89.6% | 66.3% |
| HOG + MLP (our proposal, one object) | 99.87% | 99.27% |
| HOG + MLP (our proposal, multi-objects), 3 objects | 99.6% | 99.0% |
| HOG + MLP (our proposal, multi-objects), 4 objects | 99.5% | 99.0% |
| HOG + MLP (our proposal, multi-objects), 6 objects | 99.1% | 98.9% |

Table 5.

Performance comparison on accuracy for object recognition for two of the Caltech categories with other methods from the literature.

Table 5 compares our scheme's performance with other results reported in the literature. With an ACC of over 99%, our method presents a significant improvement over the previously reported results. The table also shows that the performance of the proposed scheme is not significantly affected when used in multi-object recognition applications. Zhang et al. reported a similar performance of 99%, using a scheme composed of the PCA-SIFT method and the shape context method for object representation, with a two-layer AdaBoost network as the classification technique [37]. However, due to the methods and techniques used by Zhang et al., their scheme presents a computational cost relatively greater than that of our proposal.


5. Conclusions

Although object recognition is a very active area of research, it is still considered an overly complex task due to the following difficulties: 1. Objects of the same class exhibit high variability in appearance. A class of objects can include elements with variations in shape, color, and texture. In addition, multiple factors such as position, lighting, and occlusions, among others, can increase these differences. 2. There is a lack of reference images for the training phase of the classifier. The available data are generally not enough to cover the variability in the appearance of objects. Furthermore, there may be significant differences between the conditions of training and those of system operation.

This research has placed special emphasis on the study of the HOG algorithm for the feature extraction stage. This is because HOG has demonstrated that, using normalized representations of objects, it can generate representations that provide discriminative information about the objects in an image. Furthermore, since HOG operates on local cells, it is invariant to geometric and photometric transformations as well as to changes in background and object position. On the other hand, this work seeks to exploit the well-known features of the MLP to address the problem that arises when the limited data available during training are not enough to cover the variability in the appearance of objects.

It is important to emphasize that the proposed improvement in the gradient computation step of the HOG algorithm reduces the rate of false positives. It was also demonstrated that HOG can accurately represent different objects and offers good performance in multi-class applications. Finally, we showed that a classifier that uses a neural approach is an excellent complement to a HOG-based feature extractor.

It is the intention of this working group to use the proposed system in autonomous systems applications through its modeling on reconfigurable logic.

References

  1. Forsyth DA, Ponce J. Computer Vision: A Modern Approach. 2nd ed. New Delhi: Pearson; 2011. p. 792
  2. Ballard DH, Brown CM. Computer Vision. New York: Prentice Hall; 1982. p. 539
  3. Mak KL, Peng P, Yiu KFC. Fabric defect detection using morphological filters. Image and Vision Computing. 2009;27(10):1585-1592. DOI: 10.1016/j.imavis.2009.03.007
  4. Abouelela A, Abbas HM, Eldeeb H, Wahdan AA, Nassar SM. Automated vision system for localizing structural defects in textile fabrics. Pattern Recognition Letters. 2005;26(10):1435-1443. DOI: 10.1016/j.patrec.2004.11.016
  5. Saeidi RG, Latifi M, Najar SS, Saeidi AG. Computer vision-aided fabric inspection system for on-circular knitting machine. Textile Research Journal. 2005;75(6):492-497. DOI: 10.1177/0040517505053874
  6. Li O, Wang M, Gu W. Computer vision based system for apple surface defect detection. Computers and Electronics in Agriculture. 2002;36(2-3):215-223. DOI: 10.1016/S0168-1699(02)00093-5
  7. Kocer HE, Cevik KK. Artificial neural networks based vehicle license plate recognition. Procedia Computer Science. 2011;3:1033-1037. DOI: 10.1016/j.procs.2010.12.169
  8. Ozbay S, Ercelebi E. Automatic vehicle identification by plate recognition. In: Proceedings of World Academy of Science, Engineering and Technology; Turkey. 2005. pp. 222-225
  9. McKenna SJ, Jabri S, Duric Z, Rosenfeld A, Wechsler H. Tracking groups of people. Computer Vision and Image Understanding. 2000;80(1):42-56. DOI: 10.1006/cviu.2000.0870
  10. Comaniciu D, Ramesh V, Meer P. Kernel-based object tracking. Transactions on Pattern Analysis and Machine Intelligence. 2003;25(5):564-577. DOI: 10.1109/TPAMI.2003.1195991
  11. Wang X, Han TX, Yan S. An HOG-LBP human detector with partial occlusion handling. In: Proceedings of IEEE 12th International Conference on Computer Vision; 29 September - 2 October 2009; Japan. 2010. pp. 32-39. DOI: 10.1109/ICCV.2009.5459207
  12. Yang S, Liao X, Borasy U. A pedestrian detection method based on the HOG-LBP feature and gentle AdaBoost. International Journal of Advancements in Computing Technology. 2012;4(19):553-560. DOI: 10.4156/IJACT.VOL4.ISSUE19.66
  13. Nandi CS, Tudu B, Koley C. An automated machine vision based system for fruit sorting and grading. In: Proceedings of Sixth International Conference on Sensing Technology; 18-21 December 2012; India. 2013. pp. 195-200. DOI: 10.1109/ICSensT.2012.6461669
  14. Viola P, Jones MJ. Robust real-time face detection. International Journal of Computer Vision. 2004;57(2):137-154. DOI: 10.1023/B:VISI.0000013087.49260.fb
  15. Muñoz-Salinas R, Aguirre E, García-Silvente M. People detection and tracking using stereo vision and color. Image and Vision Computing. 2007;25(6):995-1007. DOI: 10.1016/j.imavis.2006.07.012
  16. Zhao X, Lin Y, Ou B, Yang J. A wavelet-based image preprocessing method for illumination insensitive face recognition. Journal of Information Science and Engineering. 2015;31(5):1711-1731
  17. Yilmaz A, Javed O, Shah M. Object tracking: A survey. Journal ACM Computing Surveys. 2006;38(4):13. DOI: 10.1145/1177352.1177355
  18. Miura J, Kanda T, Shirai Y. An active vision system for real-time traffic sign recognition. In: Proceedings of Intelligent Transportation Systems; 1-3 October 2000; USA. 2002. pp. 52-57. DOI: 10.1109/ITSC.2000.881017
  19. Gonzales RC, Woods RE. Digital Image Processing. 4th ed. New York: Pearson; 2018. p. 1022
  20. Nixon M, Aguado A. Feature Extraction and Image Processing for Computer Vision. 3rd ed. London: Academic Press; 2012. p. 609
  21. Theodoridis S, Koutroumbas K. Pattern Recognition. 4th ed. London: Academic Press; 2009. p. 961
  22. Belongie S, Malik J, Puzicha J. Shape matching and object recognition using shape contexts. Transactions on Pattern Analysis and Machine Intelligence. 2002;24(4):509-522. DOI: 10.1109/34.993558
  23. Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of Conference on Computer Vision and Pattern Recognition; 20-25 June 2005; USA. 2005. pp. 886-893. DOI: 10.1109/CVPR.2005.177
  24. Freeman WT, Roth M. Orientation histograms for hand gesture recognition. In: Proceedings of International Workshop on Automatic Face and Gesture Recognition. 1995. pp. 296-301
  25. Belongie S, Malik J, Puzicha J. Matching shapes. In: Proceedings of 8th International Conference on Computer Vision; 7-14 July 2001; Canada. 2002. pp. 454-461. DOI: 10.1109/ICCV.2001.937552
  26. Lowe DG. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision. 2004;60(2):91-110. DOI: 10.1023/B:VISI.0000029664.99615.94
  27. Zhu Q, Avidan S, Yeh M, Cheng K. Fast human detection using a cascade of histograms of oriented gradients. In: Proceedings of Conference on Computer Vision and Pattern Recognition; 17-22 June 2006; USA. 2006. pp. 1491-1498. DOI: 10.1109/CVPR.2006.119
  28. Kobayashi T, Hidaka A, Kurita T. Selection of histograms of oriented gradients features for pedestrian detection. In: Proceedings of Conference on Neural Information Processing; 13-16 November 2007; Japan. 2008. pp. 598-607. DOI: 10.1007/978-3-540-69162-4_62
  29. Socarras Y, Vázquez D, López AM, Gerónimo D, Gevers T. Improving HOG with image segmentation: Application to human detection. In: International Conference on Advanced Concepts for Intelligent Vision Systems; 4-7 September 2012; Czech Republic. 2012. pp. 178-189. DOI: 10.1007/978-3-642-33140-4_16
  30. Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y. OverFeat: Integrated recognition, localization and detection using convolutional networks. In: International Conference on Learning Representations; 14-16 April 2014; Canada. 2014. pp. 1-16. DOI: 10.48550/arXiv.1312.6229
  31. He H, Chen S. IMORL: Incremental multiple-object recognition and localization. Transactions on Neural Networks. 2008;19(10):1727-1738. DOI: 10.1109/TNN.2008.2001774
  32. Hanson SJ, Matsuka T, Haxby JV. Combinatorial codes in ventral temporal lobe for object recognition: Haxby (2001) revisited: Is there a "face" area? NeuroImage. 2004;23(1):156-166. DOI: 10.1016/j.neuroimage.2004.05.020
  33. Markou M, Singh S. Novelty detection, a review - part 2: Neural network based approach. Signal Processing. 2003;83(12):2499-2521. DOI: 10.1016/j.sigpro.2003.07.019
  34. Guoqiang PZ. Neural networks for classification: A survey. Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews. 2000;30(4):451-462. DOI: 10.1109/5326.897072
  35. Fergus R, Perona P, Zisserman A. Object class recognition by unsupervised scale-invariant learning. In: Proceedings of Conference on Computer Vision and Pattern Recognition; 18-20 June 2003; USA. 2003. pp. II-II. DOI: 10.1109/CVPR.2003.1211479
  36. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological). 1977;39(1):1-38
  37. Zhang W, Yu B, Zelinsky GJ, Samaras D. Object class recognition using multiple layer boosting with heterogeneous features. In: Proceedings of Conference on Computer Vision and Pattern Recognition; 20-25 June 2005; USA. 2005. pp. 323-330. DOI: 10.1109/CVPR.2005.251
  38. Zhang J, Marszałek M, Lazebnik S, Schmid C. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision. 2007;73(2):213-238. DOI: 10.1007/s11263-006-9794-4
  39. Leibe B, Leonardis A, Schiele B. Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision. 2008;77:259-289. DOI: 10.1007/s11263-007-0095-3
  40. Laptev I. Improving object detection with boosted histograms. Image and Vision Computing. 2009;27(5):535-544. DOI: 10.1016/j.imavis.2008.08.010
  41. Dalal N. Finding People in Images and Videos [doctoral thesis]. Saint Ismier, France: Institut National Polytechnique de Grenoble - INPG; 2006
  42. Kim S, Cho K. Design of high-performance HOG feature calculation circuit for real-time pedestrian detection. Journal of Information Science and Engineering. 2015;31(6):2055-2073. DOI: 10.6688/JISE.2015.31.6.13
  43. Rosenblatt F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review. 1958;65:386-408
  44. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics. 1943;5(4):115-133
  45. Werbos PJ. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences [doctoral thesis]. Cambridge, MA: Committee on Applied Mathematics, Harvard University; 1974
  46. Rumelhart DE, Hinton GE, Williams RJ. Learning Internal Representations by Error Propagation. San Diego: University of California, Institute for Cognitive Science; 1985
  47. Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters. 2006;27:861-874. DOI: 10.1016/j.patrec.2005.10.010
  48. Fei-Fei L, Fergus R, Perona P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding. 2007;106(1):59-70. DOI: 10.1016/j.cviu.2005.09.012
