Open access peer-reviewed chapter

SVM Classifiers – Concepts and Applications to Character Recognition

By Antonio Carlos Gay Thome

Submitted: December 9th 2011Reviewed: July 31st 2012Published: November 7th 2012

DOI: 10.5772/52009

Downloaded: 2796

1. Introduction

Support Vector Machines – SVMs, represent the cutting edge of ranking algorithms and have been receiving special attention from the international scientific community. Many successful applications, based on SVMs, can be found in different domains of knowledge, such as in text categorization, digital image analysis, character recognition and bioinformatics.

SVMs are relatively new approach compared to other supervised classification techniques, they are based on statistical learning theory developed by the Russian scientist Vladimir Naumovich Vapnik back in 1962 and since then, his original ideas have been perfected by a series of new techniques and algorithms.

Since the introduction of the concepts by Vladimir, a large and increasing number of researchers have worked on the algorithmic and the theoretical analysis of SVM, merging concepts from disciplines as distant as statistics, functional analysis, optimization, and machine learning. The soft margin classifier was introduced few years later by Cortes and Vapnik [1], and in 1995 the algorithm was extended to the regression case.

There are several published studies that compare the paradigm of neural networks against to the support vector machines. The main difference between the two paradigms lies in how the decision boundaries between classes are defined. While the neural network algorithms seek to minimize the error between the desired output and the generated by the network, the training of an SVM seeks to maximize the margins between the borders of both classes.

SVM approach has some advantages compared to others classifiers. They are robust, accurate and very effective even in cases where the number of training samples is small. SVM technique also shows greater ability to generalize and greater likelihood of generating good classifiers.

By nature SVMs are essentially binary classifiers, however, based on several researchers’ contributions they were adapted to handle multiple classes cases. The two most common approaches used are the One-Against-All and One-Against-One techniques, but this scenario is still an ongoing research topic.

In this chapter we briefly discuss some basic concepts on SVM, describe novel approaches proposed in the literature and discuss some experimental tests applied to character recognition. The chapter is divided into 4 sections. Section 2 presents the theoretical aspects of the Support Vector Machines. Section 3 reviews some strategies to deal with multiple classes. Section 4 details some experiments on the usage of One-Against-All and One-Against-One approach applied to character recognition.

2. Theoretical foundations of the SVM

Support vector machines are computational algorithms that construct a hyperplane or a set of hyperplanes in a high or infinite dimensional space. SVMs can be used for classification, regression, or other tasks. Intuitively, a separation between two linearly separable classes is achieved by any hyperplane that provides no misclassification on all data points of any of the considered classes, that is, all points belonging to class A are labeled as +1, for example, and all points belonging to class B are labeled as -1.

This approach is called linear classification however there are many hyperplanes that might classify the same set of data as can be seen in the figure 1 below. SVM is an approach where the objective is to find the best separation hyperplane, that is, the hyperplane that provides the highest margin distance between the nearest points of the two classes (called functional margin). This approach, in general, guarantees that the larger the margin is the lower is the generalization error of the classifier.

Figure 1.

Separation hyperplanes. H1 does not separate the two classes; H2 separates but with a very tinny margin between the classes and H3 separates the two classes with much better margin than H2

If such hyperplane exists, it is clear that it provides the best separation border between the two classes and it is known as the maximum-margin hyperplane and such a linear classifier is known as the maximum margin classifier.

2.1. Brief history

Research on pattern recognition started in 1936 through the work done by R. A. Fisher who suggested the first algorithm for pattern recognition [2]. After him we have the work done by Frank Rosemblat in 1957 that invented the nowadays well known linear classifier named PERCEPTRON that is the simplest kind of feed forward neural network [3]. In 1963 Vapnik and Lerner introduced the Generalized Portrait algorithm (the algorithm implemented by support vector machines is a nonlinear generalization of the Generalized Portrait algorithm) [4]. Aizerman, Braverman and Rozonoer in 1964, introduced the geometrical interpretation of the kernels as inner products in a feature space [5] and Cover in 1965 discussed large margin hyperplanes in the input space and also sparseness [6].

The field of statistical learning theory was first developed and proposed by Vapnik and Chervonenkis in 1974 [7] and, based on this theory, appears in the year of 1979, the first concepts about SVMs [8]. SVMs close to their current form were first introduced by Boser at al. with a paper presented at the COLT 1992 conference in 1992 [9].

2.2. Formal definition of the SVM classifier – The linear model

The surface model used by SVM to perform the separation is the hyperplane. Let then W and b be, respectively, the vector normal to the hyperplane and its displacement relative to the origin [10]. Thus, we have that the decision function for an input x is given by equation (1).

D(x)=WxbE1

where,

x{AifD(x)>0BifD(x)<0E2

As can be seen in figure 2 below, the distance from x (with signal) to the hyperplane is given by 3.

D(x)WE3

Thus, D(x1) and D(x2) will have opposite signs (belong to different sets) if and only if x1 and x2 are on opposite sides of the separation hyperplane.

Figure 2 shoes that the Vector W is perpendicular to the hyperplane and the parameter |b|Wdetermines the offset of the hyperplane from the origin along the normal vector. It is desired to choose W and b to maximize the margin M that represents the distance between the parallel hyperplanes that are as far apart as possible while still separating the both set of data. These two hyperplanes can be described respectively by the following equations (4).

Figure 2.

Example of the separating hyperplane (in two dimensions), distances and margins (from Boser et al, 1992 [9]).

Wxb=+1andWxb=1E4

Let the set of sample points be represented by x1, …, xp and their respective group classification be represented by y1, …, yp where

yi={+1ifxiA1ifxiBE5

If the two groups of samples in the training data are linearly separable it is then possible to select the two hyperplanes in a way that there are no points between them and then try to maximize the distance between the two hyperplanes [11].

The distance between these two hyperplanes is given by 2Wand to maximize it implies to minimize W and, in order to prevent data points falling into the margin M, we add the following constraint to each equation (6):

Wxib+1i,yi=+1andWxib1i,yi=1E6

Multiplying each equation by its corresponding yi they are transformed into just one equation as following (7):

yi(Wxib)1i,i=1...pE7

Dividing now both sides of the equation by wit turns into (8)

yi(Wxib)w1w=Mi,i=1...pE8

To maximize M we need to minimize wsubject to the following constraint (9).

min(inw,b)wsubjecttoi,i=1...pyi(wxib)10E9

The optimization problem above is difficult to solve because it depends onw, the norm of w, which involves a square root. Fortunately it is possible to alter the equation substitutingwby 12w2without changing the solution (the minimum of the original and the modified equations have the same w* and b*). The problem now belongs to the quadratic programming (QP) optimization that is easier to be computed and is stated as in (10).

min(inw,b)12w2subjecttoi,i=1...pyi(wxib)10E10

The factor of 1/2 is used for mathematical convenience and the problem can now be solved by standard quadratic programming techniques. Applying non negative Lagrange multipliers αi (i = 1 … p) to the objective function turns the problem into its dual form as in (11).

L(w,b,α)=12w2i=1pαi(yi(wxib)1)subjecttoαi0i,i=1...pE11

Considering now that in the solution point the gradient of L() is null, the equation can be handled in order to obtain a new quadratic programming problem as in (12):

Lw|w=w*=w*i=1pαiyixi=0w*=i=1pαiyixiLw|b=b*=i=1pαiyi=0i=1pαiyi=0E12

In this case, the minimum point with respect to w and b is the same to the maximum with respect to α, and the problem can be stated as in (13).

max(inαi)αT112αTHαsubjecttoi,i=1...pαi0,αTy=0E13

Where α = (α1,..., αp)T, y = (y1,..., yp)T, 0 and 1 have size p, and Hp×p is such that

Hi,j=yiyjxiTxjE14

A condition imposed by the Kühn-Tucker Theorem is that

αi*(yi(w*xib*)1)=0i,i=1...pE15

so that, if αi*0then

yi(w*xib*)1=0i,i=1...pE16

that is,

yi(w*xib*)=1E17

Any xi that satisfies equation (17) is called support vector and the SVM trainings are reduced to the set of such vectors.

In the cases where the samples are not linearly separable the approach described above would diverge and grow arbitrarily. In order to deal with the problem it is then introduced a set of slack variables (δ) in equation (6) as showed in (18).

D(xi)=Wxib+1δii,yi=+1andD(xi)=Wxib1+δii,yi=1whereδi0,i,i=1...pE18

These equations can be rewritten as

yiD(xi)+1δii,i=1...pE19

The slack variables provide some freedom to the system allowing some samples do not respect the original equations. It is necessary however to minimize the number of such samples and also the absolute value of the slack variables. The way to do this is introducing a penalization term into the objective function as follows (19):

max(inw,b)12w2+Ci=1pδisubjecttoi,i=1...pyi(wxib)1+δi0,δi0E20

Variable C indicates the strength of the penalization to be applied. Introducing Lagrange multipliers on the penalization variables the dual form of the problem becomes as in (21).

L(w,b,α,δ,λ)=12w2+Ci=1pδii=1pαi(yi(wxib)1)i=1pλiδisubjecttoi,i=1...pαi0λi0E21

Where

δ=(δi...δp)Tandλ=(λi...λp)TE22

From here, as before, the problem can be represented into its quadratic form in terms of α(22).

max(inαi)αT112αTHαsubjecttoi,i=1...pαTy=0,0αcE23

Where c = (C … C) is a p dimension vector with all values equal C.

2.3. The non-linear model

Whereas the original problem as proposed by Vladimir Vapnik in 1979 [8], was stated for a finite dimensional space, it often happens that the sets to be discriminated are not linearly separable in their original space. For this reason, it was proposed by Isabelle Guyon, Bernhard Boser and Vapnik in 1992 [9], that the original finite-dimensional space was mapped into a higher-dimensional space, presumably making the separation easier in the new space.

In order to achieve non-linear separation, instead of generating a new quadratic programming problem as in previous section, it is possible to modify the vectors of the input space into vectors of a feature space through a chosen transform function ϕ with N ≥ n and then compute the separation hyperplane on the feature space. Figure 3 shows an example of such scheme.

Φ:nNE24

Figure 3.

The transform function maintains the same dimension of the input space but makes the representation in feature space be linearly separable

The computation of the separation hyperplane is not done explicit on the feature space but using a scheme where every occurrence of ϕ(u).ϕ(v) is replaced by a function K(u,v) called kernel function and the H() function as seen before becomes (23):

Hi,j=yiyjK(xi,xj)E25

The optimum W vector is given by

w*=i=1pαiyiΦ(xi)E26

And the support vector machine decision function becomes

D(x)=i=1pαiyiK(xi,x)bE27

To keep the computational load reasonable, the mapping used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function K(x,y) selected to suit the problem. The hyperplanes in the higher dimensional space are defined as the set of points whose inner product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations of feature vectors that occur in the data base. With this choice of a hyperplane, the points x in the feature space that are mapped into the hyperplane are defined by the relation:

iαiK(xi,x)=constantE28

This approach allows the algorithm to find the maximum-margin hyperplane into the transformed feature space. The transformation may be non-linear and / or the transformed space may be of high dimension. The classifier, in the feature space, draws a hyperplane that represents a non-linear separation curve in the original input space.

If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimension. Maximum margin classifiers are well regularized, so the infinite dimension does not spoil the results. Some common kernels include:

Polynomial (homogeneous): K(xi,x)=(xix)d

Radial Basis Function: K(xi,x)=exp(λxix2);λ>0

Gaussian Radial basis function: K(xi,x)=exp(xix22σ2)

Sigmoid:

K(xi,x)=tanh(kxix+c);forsome(butnotevery)k>0andc<0E29

3. The multiclass classification strategies

Multiclass SVM approach aims to assign labels to a finite set of several elements based on a set of linear or non-linear basic SVMs. The dominant approach for doing so in the literature is to reduce the single multiclass problem into multiple binary problems [12 – 15].

Doing so, each of the problems can be seen then as a binary classification, which is assumed to produce an output function that gives relatively large values for those examples that belong to the positive class and relatively small values for the examples that belong to the negative class.

Two common methods to build such binary classifiers are those where each classifier is trained to distinguish: (i) one of the labels against to all the rest of labels (known as one-versus-all) [16] or (ii) every pair of classes (known as one-versus-one). Classification of new instances for one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class. The classification of one-versus-one case is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one vote, and finally, the class with more votes determines the instance classification.

3.1. the one-versus-all strategy

One-Against-All multiclassifier is compound by a number of binary classifiers, one for each class. Using a Winner-Takes-All strategy, each binary classifier is trained taking the examples from one of the classes as positive and the examples from all other classes as negative. The multiclassifier output is activated for the class whose binary classifier gives the greatest output amongst all. Formally, given a vector y with the outputs of the binary classifiers, the multiclassifier generates a vector L = (l1,..., ls), in the following way (27):

i*=argmaxi=1...s{yi},Li={+1ifi=i*,i=1...s1otherwiseE30

Where ‘s’ represents the total number of classes.

As seen, a one-against-all multiclassifier for 's' different classes requires the construction of 's' distinct binary classifiers, each one responsible for distinguishing one class from all the others. However, doing so does not guarantee that the resulting multi-class classifier is good. The problem is that all binary classifiers are assumed to show equal competence distinguishing their respective class, in other words, there is an underlying assumption that all binary classifiers are totally trustable and equally reliable, which does not always hold in multi-class cases as Yi Liu [17] shows through a simple example as in figure 4.

Figure 4.

a) Three classes problem and respective boundaries; (b) binary classifier that distinguishes well class 3 from all others (dashed line); (c) binary classifier that does not distinguish well class 1 from all others (dashed line). The example was taken from [15].

The same error occurs with the binary classifier for class 2 and so, the multi-class classifier based on these three binary classifiers would not provide good accuracy. In order to mitigate such problem, Liu [15] suggests two reliability measures: SRM – static reliability measure and DRM – dynamic reliability measure.

3.1.1. Static reliability measure

As pointed out by Vapnik [17] and Cortes [1], a small training set error does not guarantee a small generalization error when the number of training samples is relative small with respect to the feature vector x dimension. SVM training is done minimizing the objective function and as the objective function becomes smaller, smaller also becomes the generalization error. Based on this fact, Liu rewrites the objective function as seen in (28) and proposes the SRM as in (29).

Obj=12w2+Ci=1N(1yiD(xi))+E31

Where (u)+ = u if u >0 and 0 if u ≤0.

λSRM=exp(1/2w2+Ci=1N(1yiD(xi))+σ)E32

Where D(xi) = wTxi + b, and the parameter σ = CN is a normalization factor to offset the effect of the different regularization parameter C and training size N. This λSRM metric is reduced to (30) for those linearly separable cases where (1 – yiD(xi))+ = 0 for all training samples.

λSRM=exp(w2CN2)E33

From (28) we notice that 2/w2is the classification margin. Small wcorresponds to large margin and more accurate classifier. Small walso corresponds to larger reliability measureλSRM.

3.1.2. Dynamic reliability measure

The basic idea, differently of the static measure that is global over the whole training samples, is to estimate the classifier’s reliability in a local region of feature space surrounding the test sample x. The ‘k’ surrounding samples of x are denoted by Nk(x).

Suppose A(x) ∈ {1, -1} is the class label assigned to x by a SVM classifier and let NkA(x)(x)denote the set of the training samples that belong to the set of ‘k’ nearest neighbors of x and are classified to the same class of x. Now, rewriting equation (28) as in (31)

Obj=i=1N12Nw2+Ci=1N(1yiD(xi))+=i=1NObj(xi)E34

Liu formulate the local version of OBJ as in (32)

Objlocal=Obj(x^i)=i=1kx(12Nw2+Ci=1N(1yiD(xi))+)E35

Where x^iNkA(x)(x),(x^i,y^i)is the training pair, and kx is the number of training samples in the setNkA(x)(x). And the dynamic reliability measure becomes as in (33).

λDRM(x)=exp(ObjlocalCkx)E36

3.1.3. SRM and DRM decision rule

For a test sample x, assuming ‘M’ trained support vector machines each with its decision function, we evaluate D(x) for each classifier and after, generate the corresponding soft decision output yi[1,+1]assuming that all classifiers are completely trustable (34)

yi=sign(Di(x))(1exp(|Di(x)|)E37

Now, assuming that λidenotes either the SRM or DRM reliability measure we have (35)

y˜i=yiλiandi*=argmaxy˜ii=1...ME38

Mota in [18] sees the same problem from a different point of view. According to them in the One-Against-All method, SVM binary classifiers are obtained by solving different optimization problems and the outputs from these binary classifiers may have different distributions, even when they are trained with the same set of parameters and so, comparing these outputs using equation (27) may not work very well.

The output mapping, as suggested in [18], tries to mitigate such problem normalizing the outputs of the binary classifiers in such way to make them comparable by the equation (27). Four strategies are suggested: MND, BND, DNCD and MLP, based, respectively, on distance normalization (the first three) and base on a neural network model (the last one).

3.1.4. MND output mapping strategy

Analyzing the histogram of the raw outputs (original outputs) from a typical SVM binary classifier (figure 5) we observe a bimodal distribution consisting of two normal functions each with different mean and standard deviation. Different binary classifiers show different values of mean and standard deviation which makes unfair to compare their outputs. Then, before applying equation (10), the outputs from each binary classifier are normalized in a way that they all provide a normal distribution with mean at −1 or +1 and a standard deviation equal to 1.

Using a validation data set the samples are grouped into two groups A1 (the current class) and A2 (all the other classes) and the respective output distribution mean and standard deviation are computed and then the normalized output is obtained by equation (36).

ui=d(yi,1)'+d(yi,+1)'2E39

Where

d(yi,k)'=yiμkσk,k={1,+1}E40

Figure 5.

Output histogram of a binary one-against-all classifier

3.1.5. BND output mapping strategy

BND Strategy takes into account both normalized distances using the equation (38). When both distances d(yi,k)',k={1,+1}are positive (i.e., yi is on the right side of the centers of both normal functions) then ui is +1. When both distances are negative yi is on the left side of both centers and ui is −1, but when the distance signals are different, ui is between −1 and +1, closer to +1 if d(yi,+1)'is greater thand(yi,1)', 0 when the distances are equal and closer to −1 otherwise.

ui=d(yi,1)'+d(yi,+1)'|d(yi,1)'|+|d(yi,+1)'|E41

3.1.6. DNCD output mapping strategy

Instead of using normalized distances (like in MND Strategy), DNCD Strategy builds a normalized output by joining the non-normalized distances and normalizing it by the distance between the centers of the normal functions as in equation (39).

ui=d(yi,1)'+d(yi,+1)'|μ1μ+1|E42

3.1.7. MLP output mapping strategy

In this case, instead of having a function which maps each raw output yi to a normalized output ui, we have a function which maps the entire vector y into the vector u. The idea of this strategy is to implement the mapping function by using an MLP neural network, trained using a validation data set. The training samples are the outputs given by the multiclassifier for the validation data set. The expected outputs for those samples are the multiclassifier expected outputs, that is, a vector for which all positions have value −1, except for the one which corresponds to the class of that sample, whose value is +1.

Homogeneous M class multiclassifier is the one where its M binary classifiers are all trained with the same set of parameters. This approach, however, may not be the best option once the training of each classifier is independent and so, the chance is high to find a better set of classifiers if the search for different parameters is allowed in each case. But, in these cases, if a number ‘g’ of such parameters is used then the number of possible combinations of them is gsand, obviously, even for reasonable values of ‘g’ the test for all possible combinations is impracticable.

One approach is to choose a subset of alternative parameters composition and train a set of L distinct homogeneous multiclass SVMs. The output mapping is then applied to each of the ‘L*s’ binary classifiers and the heterogeneous multiclassifier is formed by selecting the best binary classifier from the ‘L’ homogeneous multiclassifiers. The selection is done through the classification quality metric ‘q’ as in (40) computed from the confusion matrix of each binary classifier.

qi=2Miij=1sMij+j=1iMjiE43

Where Mij is the value of the i-th row and j-th column of the confusion matrix, which corresponds to the number of samples of class Ai that were missclassified as being of class Aj by the homogeneous multiclassifier. The more qi approaches to 1 the better is the interaction of the i-th binary SVM among the other ones of the same homogeneous multiclassifier. Thus, not only we take into account the number of hits of an SVM, but also we penalize it for possible confusions in that multiclassifier. Finally, the heterogeneous multiclassifier is produced by the binary SVMs of greatest quality for each class.

3.2. The one-versus-one strategy

This method constructs one binary classifier for every pair of distinct classes and so, for M classes, a number of M*(M-1)/2 binary classifiers are constructed. The binary classifier Aij is trained taking the examples from class ‘i’ as positive and the examples from class ‘j’ as negative. For a new example x, if classifier Aij classifies it as class ‘i’, then the vote for class ‘i’ is added by one. Otherwise, the vote for class ‘j’ is increased by one. After each of the M*(M-1)/2 binary classifiers makes its vote, the strategy assigns the current example to the class with the largest number of votes.

Two interesting variations for the One-Against-One strategy, not using maximum vote, were proposed, one by Hastie and Tibshirani [19] known as pairwise coupling and other by Platt [20] that is a sigmoid version of the same pairwise coupling approach suggested by Hastie. Another interesting variation of this pairwise is proposed by Moreira and Mayoraz [21].

3.2.1. Pairwise coupling

Considering that each binary classifier Cij on a One-Against-One strategy provides a probabilistic output asrij=prob(Ai/AiAj),ij, Hastie and Tibshirani propose to combine them in order to obtain an estimation of the posterior probabilities for all classifiers together pi=prob(Ai/x),i=1...M.To estimate the pi’s, M*(M-1)/2 auxiliary variables μij’s related to the pi’s are introduced such as: μij = pi/(pi + pj) and then, pi’s are determined so that μij’s are close to rij’s. Kullback-Leibler distance [22, 23] between rij and uij was chosen as the measure of closeness (28).

lp=i<jnij(rijlogrijμij+(1rij)log1rij1μij)E44

where nij is the number of examples that belongs to the union of both classes (Ai U Aj) in the training set. The associated score equations are (29).

ijnijμij=ijnijrij,i=1...M,subjecttok=1Mpk=1E45

The pi’s are computed using the following iterative procedure:

  1. Start from an initial guess of pi’s

  2. Compute the corresponding μij’s: μij=pipi+pj

  3. Repeat (i = 1 … M and so on) until the convergence is reached:

  4. Renormalize pi’s: pi=pii=1Mpi

  5. Recomputed μij’s

3.2.2. Sigmoid pairwise coupling

Platt criticized Hastie and Tibshirani’s method of generating posterior class probabilities for each binary SVM, and suggested the use of a properly designed sigmoid applied to the SVM output to form these probabilities such as in (30).

Pr(w1|x)=11+eAf+BE47

Where ‘f’ is the output of the SVM associated with the example x and the parameters ‘A’ and ‘B’ are determined by the minimization of the negative log-likelihood function over the validation data. In [20] Platt suggests a pseudo-code for the determination of the parameters ‘A’ and ‘B’.

4. Character recognition experiments

The ability to identify machine printed characters in an automated manner has obvious applications in numerous fields (figure 6). Optical character recognition (OCR), as this field is commonly known, has been a topic of interest for a long time since the late 1940’s, when Jacob Rabinow started his work. Jacob was an engineer and inventor, he lived from 1910 to 1999 and during his life he earned 230 U.S. patents on a variety of mechanical, optical and electrical devices.

Figure 6.

a) Example of a LPR – License Plate Recognition application; b) Example of a text reading from scanned paper

The earliest OCR machines were primitive mechanical devices with fairly high failure rates. As the amount of new written material increased, so did the need to process it all in a fast and reliable manner, and these machines were clearly not up to the task. They quickly gave way to computer-based OCR devices that could outperform them both in terms of speed and reliability.

Today there are many OCR devices in use based on a variety of algorithms. Despite the fact that these OCR devices can offer good accuracy and high speed, they are still far away compared to the performance reached by the human being. Many challenges are still opened not only with respect to the variety of scenarios, as well as, types of printed characters and handwritings, but also with respect to the accuracy by itself. There is no device able to recognize 100%, they always make mistake and, sometimes, bad mistakes like find a character that does not exist or recognize a complete different character than it really is (example: recognize as an ‘M’ what in fact is an ‘S’).

4.1. Remarks

The research field on automatic algorithms for character recognition is very large including different forms of characters like Chinese, Arabic and others; different origin like printed and handwritten and different approaches to obtain the character image like on line and off line.

The experiments on character recognition reported in the literature vary in many factors such as the sample data, pre-processing techniques, feature representation, classifier structure and learning algorithm. Only a reduced number of these works have compared their proposed methods based on the same set of characters. Obviously that this fact makes tough to get a fair comparison among the reported results.

Some databases were created and divulgated to the researcher’s community with the objective to offer a generic and common set of characters to be used as patterns for the researches. Some of the most popular databases are CENPARMI, NIST, MNIST and DEVNAGARI.

License Plate and handwritten numeral recognition are on the most addressed research topics in nowadays and the experiments on handwritten numeral have been done basically using CENPARMI and NIST Special Database 19.

CENPARMI database, for example, contains 4,000 training samples and 2,000 test samples segmented from USPS envelope images. This set is considered difficult but it is easy to achieve in the literature recognition rates reported over 98%. Suen et al. reported accuracy of 98.85% by training neural networks on 450,000 samples [24] training it with 4,000 samples. Liu et al. report rates over 99% using polynomial classifier (PC) and SVMs [25], [26]. They report an accuracy of 99.58% using RBF SVM and 99.45% using Polynomial SVM. In [27] Ahmad et al. report the usage of a hybrid RBF kernel SVM and a HMM – Hidden Markov Model system over an online handwriting problem taken from the IRONOFF-UNIPEN database. The same authors in [28] report a work done on the recognition of words. Pal et al. also report in [29] the usage of a hybrid system based on SVM and MQDF – Modified Quadratic Discriminant Function for the problem of Devnagari Character Recognition. Arora et al., all from India, report in [30] a performance comparison between SVM and ANN – Artificial Neural Network on the problem of Devnagari Character Recognition.

License Plate recognition as well as off line handwritten recognition represents a very tough challenge for the researchers. There are a number of possible difficulties that the recognition algorithm must be able to cope with, which includes, for example: a) poor image resolution, usually because the camera is too far away from the plate; b) poor lighting and low contrast due to overexposure, reflection or shadows; c) object obscuring (part of) the plate, quite often a tow bar, or dirt on the plate; d) bad conservation state of the plate; e) Blurry images, particularly motion blur; and f) lack of global pattern, sometimes even inside a same country or state (figure 7).

Figure 7.

Example of License plate samples from 50 states of USA [31]

There is plenty of research work on this subject reported in the literature but the accuracy comparison among them is even more complex and difficult than the work done on handwritten. The accuracy not only depends on the type of the plates itself but also on the conditions on which the images were taken and on the level of the problems cited on previous paragraph. Waghmare et al. [32] report the use of 36 One-Against-All multiclass SVM classifier trained to recognize the 10 numeral and 26 letters from Indian plates (figure 8a). Parasuraman and Subin in [33] also report the usage of a Multiclass SVM classifier to recognize plates from Indian motorcycles (figure 8b). Other works on LPR can be found in [34 – 37].

In summary, character recognition is intrinsically a non linear and high dimensional problem. Among to the variety of OCR algorithms found in the literature, the SVM classifier is one of the most popular based on its good accuracy, high response speed and robustness. In the following subsections we describe some experiments in character recognition using both One-Against-All and One-Against-One multiclass SVMs.

Figure 8.

Indian plates for car (a) and motorcycle (b)

4.2. Using one-against-all strategy

The strategies proposed in [18] are here evaluated when applied to a problem of classifying characters extracted from vehicle plates. Two multiclass SVMs are trained: one to recognize the 10 digits (10 classes) and other to distinguish the 26 letters (26 classes).

For each group of characters, three data sets were formed based on the feature extraction used. In data set 1 (DS1) the feature vector has dimensionality of 288 formed by the 16 × 16 character bit matrix and 32 additional values from the character horizontal and vertical projections. Principal Component Analysis [38] reduced the original dimension to 124 (for digits) and 103 (for letters). Data sets 2 (DS2) and 3 (DS3) were generated respectively by 56 and 42 statistical moments extracted from the 16 x 16 character bit matrix.

Each data set was divided in three subsets: one for training, one for validation, and one for test. Table 1 shows how the samples were divided in these three subsets.

SubsetDigitsLetters
Training27942088
Validation25001875
Test75005625
Total12,7949,588

Table 1.

Number of samples for each data subset

4.2.1. Digit recognition

Based on two kernel functions and a set of different values for two variables as shown in table 2 (standard deviation for the Gaussian kernel and exponent order for the polynomial kernel), a set of 55 Homogeneous multiclassifiers were trained.

Kernel FunctionKernel ParameterC value
Polynomial[1, 4][0.1, 10]
Gaussian[1.0, 3.0][5, 80]

Table 2.

Number of samples for each data subset

The heterogeneous multiclassifier was formed with 10 binary classifiers selected each one of the 55 homogeneous multiclassifiers using the output mapping and confusion matrices as explained in previous section. The best results achieved for the test subsets of each data set are seen in Table 3. WTA-SVM is the common Winner-Takes-All strategy for One-Against-All approach.

StrategyDS1DS2DS3
WTA-SVM4.16%2.95%2.77%
MND4.07%2.84%2.73%
BND4.23%2.91%2.83%
DNCD4.33%3.05%2.87%
MLP3.48%2.85%2.57%

Table 3.

Error results on the test sets

4.2.2. Letter recognition

The multiclassifier construction was also based on the same two kernel functions and the same two variables as shown in table 4 (standard deviation for the Gaussian kernel and exponent order for the polynomial kernel), a set of 30 Homogeneous multiclassifiers were trained.

Kernel FunctionKernel ParameterC value
Polynomial[2, 4][0.05, 5]
Gaussian[1.0, 3.5][1, 20]

Table 4.

Number of samples for each data subset

The heterogeneous multiclassifier was formed with 26 binary classifiers selected each one of the 30 homogeneous multiclassifiers using the output mapping and confusion matrices as explained in previous section. The best results achieved for the test subsets of each data set are seen in Table 5. WTA-SVM is the common Winner-Takes-All strategy for One-Against-All approach.

StrategyDS1DS2DS3
WTA-SVM5.24%4.36%3.29%
MND5.74%4.37%3.57%
BND6.24%4.52%3.72%
DNCD5.64%4.46%3.45%
MLP4.46%4.12%2.88%

Table 5.

Error results on the test sets

4.3. Using one-against-one strategy

Experiments using One-Against-One RBF Kernel SVM are described in [39] for Brazilian plates, a total of 22464 numerals and 16848 letters were used for training and testing the system (Table 6). Brazilian plates show two different patterns (figure 9 – black characters over gray background and white characters over red background).

SubsetDigitsLetters
Training39423024
Test1852213824
Total22,46416,848

Table 6.

Number of samples for each data subset

Figure 9.

Brazilian patterns of plate

4.3.1. Digit recognition

As reported, an average accuracy of 99.61% was achieved (Table 7) and the worst performance occurred on the recognition of the number 8, which was misclassified 17 times from a total of 1725 samples (Table 8 – 7 times misclassified as 6 and 7 as 9).

Correct ClassificationError
LabelNumber%Number%
0222299.64%80.36%
1209599.90%20.10%
2184099.67%60.33%
3179999.89%20.11%
4171699.42%100.58%
5170099.77%40.23%
6175199.38%110.62%
7182599.67%60.33%
8170899.01%170.99%
9179399.61%70.39%
Total1844973
Average99.61%0.39%

Table 7.

Percentage of classification for Digits

SVM \ Target0123456789
02222000200000
10209500000000
21018400120002
30001799010010
41000171616101
50100017002001
61000121751043
70021000182503
82000017017087
91002030011793

Table 8.

Digits confusion matrix

4.3.2. Letter recognition

As reported, an average accuracy of 98.60% was achieved (Table 9X) and the worst performances showed by letters D (89.76%), O (93.31%) and Q (94.51%).

Correct ClassificationErrorCorrect ClassificationError
LabelNumber%Number%LabelNumber%Number%
A408100.00%00.00%N71998.76%91.24%
B45098.90%51.10%O66993.31%486.69%
C57699.31%40.69%P44099.32%30.68%
D29889.76%3410.24%Q37994.51%225.49%
E22198.66%31.34%R40199.01%40.99%
F17798.88%21.12%S32799.09%30.91%
G25098.43%41.57%T34099.71%10.29%
H30998.10%61.90%U53198.88%61.12%
I16897.67%42.33%V37499.73%10.27%
J33799.70%10.30%W28498.95%31.05%
K159099.69%50.31%X28798.97%31.03%
L292599.59%120.41%Y28399.30%20.70%
M34997.76%82.24%Z53899.81%10.19%
Total13630194
Average98.60%1.40%

Table 9.

Percentage of classification for Digits

Table 10 shows and explains the reasons for such reduced performance in comparison to the other 23 letters. The fact is that these three letters show very similar visual aspect and the SVM misclassified 23 letters ‘O’ as ‘D’, 26 ‘D’ as ‘O’, 17 ‘Q’ as ‘O’ and 18 ‘O’ as ‘Q’.

Figure 10.

Similarity among letters D, O and Q

BCDEFGHIJKLMNOPQRSTUVWZ
B2111
C1111
D33233
E21
F2
G31
H11112
I31
J1
K1112
L12612
M11321
N11313
O1326117
P12
Q11218
R22
S111
T1
U1122
V1
W21
X3
Y11
Z1

Table 10.

Letters confusion matrix

© 2012 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Link to this chapter Copy to clipboard

Cite this chapter Copy to clipboard

Antonio Carlos Gay Thome (November 7th 2012). SVM Classifiers – Concepts and Applications to Character Recognition, Advances in Character Recognition, Xiaoqing Ding, IntechOpen, DOI: 10.5772/52009. Available from:

chapter statistics

2796total chapter downloads

2Crossref citations

More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

Access personal reporting

Related Content

This Book

Next chapter

Online Handwritten Chinese/Japanese Character Recognition

By Bilan Zhu and Masaki Nakagawa

Related Book

First chapter

Neural Forecasting Systems

By Takashi Kuremoto, Masanao Obayashi and Kunikazu Kobayashi

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

More About Us