## 1. Introduction

Support Vector Machines – SVMs, represent the cutting edge of ranking algorithms and have been receiving special attention from the international scientific community. Many successful applications, based on SVMs, can be found in different domains of knowledge, such as in text categorization, digital image analysis, character recognition and bioinformatics.

SVMs are relatively new approach compared to other supervised classification techniques, they are based on statistical learning theory developed by the Russian scientist Vladimir Naumovich Vapnik back in 1962 and since then, his original ideas have been perfected by a series of new techniques and algorithms.

Since the introduction of the concepts by Vladimir, a large and increasing number of researchers have worked on the algorithmic and the theoretical analysis of SVM, merging concepts from disciplines as distant as statistics, functional analysis, optimization, and machine learning. The soft margin classifier was introduced few years later by Cortes and Vapnik [1], and in 1995 the algorithm was extended to the regression case.

There are several published studies that compare the paradigm of neural networks against to the support vector machines. The main difference between the two paradigms lies in how the decision boundaries between classes are defined. While the neural network algorithms seek to minimize the error between the desired output and the generated by the network, the training of an SVM seeks to maximize the margins between the borders of both classes.

SVM approach has some advantages compared to others classifiers. They are robust, accurate and very effective even in cases where the number of training samples is small. SVM technique also shows greater ability to generalize and greater likelihood of generating good classifiers.

By nature SVMs are essentially binary classifiers, however, based on several researchers’ contributions they were adapted to handle multiple classes cases. The two most common approaches used are the One-Against-All and One-Against-One techniques, but this scenario is still an ongoing research topic.

In this chapter we briefly discuss some basic concepts on SVM, describe novel approaches proposed in the literature and discuss some experimental tests applied to character recognition. The chapter is divided into 4 sections. Section 2 presents the theoretical aspects of the Support Vector Machines. Section 3 reviews some strategies to deal with multiple classes. Section 4 details some experiments on the usage of One-Against-All and One-Against-One approach applied to character recognition.

## 2. Theoretical foundations of the SVM

Support vector machines are computational algorithms that construct a hyperplane or a set of hyperplanes in a high or infinite dimensional space. SVMs can be used for classification, regression, or other tasks. Intuitively, a separation between two linearly separable classes is achieved by any hyperplane that provides no misclassification on all data points of any of the considered classes, that is, all points belonging to class A are labeled as +1, for example, and all points belonging to class B are labeled as -1.

This approach is called linear classification however there are many hyperplanes that might classify the same set of data as can be seen in the figure 1 below. SVM is an approach where the objective is to find the best separation hyperplane, that is, the hyperplane that provides the highest margin distance between the nearest points of the two classes (called functional margin). This approach, in general, guarantees that the larger the margin is the lower is the generalization error of the classifier.

If such hyperplane exists, it is clear that it provides the best separation border between the two classes and it is known as the maximum-margin hyperplane and such a linear classifier is known as the maximum margin classifier.

### 2.1. Brief history

Research on pattern recognition started in 1936 through the work done by R. A. Fisher who suggested the first algorithm for pattern recognition [2]. After him we have the work done by Frank Rosemblat in 1957 that invented the nowadays well known linear classifier named PERCEPTRON that is the simplest kind of feed forward neural network [3]. In 1963 Vapnik and Lerner introduced the Generalized Portrait algorithm (the algorithm implemented by support vector machines is a nonlinear generalization of the Generalized Portrait algorithm) [4]. Aizerman, Braverman and Rozonoer in 1964, introduced the geometrical interpretation of the kernels as inner products in a feature space [5] and Cover in 1965 discussed large margin hyperplanes in the input space and also sparseness [6].

The field of statistical learning theory was first developed and proposed by Vapnik and Chervonenkis in 1974 [7] and, based on this theory, appears in the year of 1979, the first concepts about SVMs [8]. SVMs close to their current form were first introduced by Boser at al. with a paper presented at the COLT 1992 conference in 1992 [9].

### 2.2. Formal definition of the SVM classifier – The linear model

The surface model used by SVM to perform the separation is the hyperplane. Let then W and b be, respectively, the vector normal to the hyperplane and its displacement relative to the origin [10]. Thus, we have that the decision function for an input x is given by equation (1).

where,

As can be seen in figure 2 below, the distance from x (with signal) to the hyperplane is given by 3.

Thus, D(x1) and D(x2) will have opposite signs (belong to different sets) if and only if x1 and x2 are on opposite sides of the separation hyperplane.

Figure 2 shoes that the Vector W is perpendicular to the hyperplane and the parameter

Let the set of sample points be represented by x_{1}, …, x_{p} and their respective group classification be represented by y_{1}, …, y_{p} where

If the two groups of samples in the training data are linearly separable it is then possible to select the two hyperplanes in a way that there are no points between them and then try to maximize the distance between the two hyperplanes [11].

The distance between these two hyperplanes is given by

Multiplying each equation by its corresponding y_{i} they are transformed into just one equation as following (7):

Dividing now both sides of the equation by

To maximize M we need to minimize

The optimization problem above is difficult to solve because it depends on

The factor of 1/2 is used for mathematical convenience and the problem can now be solved by standard quadratic programming techniques. Applying non negative Lagrange multipliers α_{i} (i = 1 … p) to the objective function turns the problem into its dual form as in (11).

Considering now that in the solution point the gradient of L() is null, the equation can be handled in order to obtain a new quadratic programming problem as in (12):

In this case, the minimum point with respect to w and b is the same to the maximum with respect to α, and the problem can be stated as in (13).

Where α = (α_{1},..., α_{p})^{T}, y = (y_{1},..., y_{p})^{T}, 0 and 1 have size p, and H_{p×p} is such that

A condition imposed by the Kühn-Tucker Theorem is that

so that, if

that is,

Any x_{i} that satisfies equation (17) is called support vector and the SVM trainings are reduced to the set of such vectors.

In the cases where the samples are not linearly separable the approach described above would diverge and grow arbitrarily. In order to deal with the problem it is then introduced a set of slack variables (δ) in equation (6) as showed in (18).

These equations can be rewritten as

The slack variables provide some freedom to the system allowing some samples do not respect the original equations. It is necessary however to minimize the number of such samples and also the absolute value of the slack variables. The way to do this is introducing a penalization term into the objective function as follows (19):

Variable C indicates the strength of the penalization to be applied. Introducing Lagrange multipliers on the penalization variables the dual form of the problem becomes as in (21).

Where

From here, as before, the problem can be represented into its quadratic form in terms of α(22).

Where c = (C … C) is a p dimension vector with all values equal C.

### 2.3. The non-linear model

Whereas the original problem as proposed by Vladimir Vapnik in 1979 [8], was stated for a finite dimensional space, it often happens that the sets to be discriminated are not linearly separable in their original space. For this reason, it was proposed by Isabelle Guyon, Bernhard Boser and Vapnik in 1992 [9], that the original finite-dimensional space was mapped into a higher-dimensional space, presumably making the separation easier in the new space.

In order to achieve non-linear separation, instead of generating a new quadratic programming problem as in previous section, it is possible to modify the vectors of the input space into vectors of a feature space through a chosen transform function ϕ with N ≥ n and then compute the separation hyperplane on the feature space. Figure 3 shows an example of such scheme.

The computation of the separation hyperplane is not done explicit on the feature space but using a scheme where every occurrence of ϕ(u).ϕ(v) is replaced by a function K(u,v) called kernel function and the H() function as seen before becomes (23):

The optimum W vector is given by

And the support vector machine decision function becomes

To keep the computational load reasonable, the mapping used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function K(x,y) selected to suit the problem. The hyperplanes in the higher dimensional space are defined as the set of points whose inner product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations of feature vectors that occur in the data base. With this choice of a hyperplane, the points x in the feature space that are mapped into the hyperplane are defined by the relation:

This approach allows the algorithm to find the maximum-margin hyperplane into the transformed feature space. The transformation may be non-linear and / or the transformed space may be of high dimension. The classifier, in the feature space, draws a hyperplane that represents a non-linear separation curve in the original input space.

If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimension. Maximum margin classifiers are well regularized, so the infinite dimension does not spoil the results. Some common kernels include:

Polynomial (homogeneous):

Radial Basis Function:

Gaussian Radial basis function:

Sigmoid:

## 3. The multiclass classification strategies

Multiclass SVM approach aims to assign labels to a finite set of several elements based on a set of linear or non-linear basic SVMs. The dominant approach for doing so in the literature is to reduce the single multiclass problem into multiple binary problems [12 – 15].

Doing so, each of the problems can be seen then as a binary classification, which is assumed to produce an output function that gives relatively large values for those examples that belong to the positive class and relatively small values for the examples that belong to the negative class.

Two common methods to build such binary classifiers are those where each classifier is trained to distinguish: (i) one of the labels against to all the rest of labels (known as one-versus-all) [16] or (ii) every pair of classes (known as one-versus-one). Classification of new instances for one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class. The classification of one-versus-one case is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one vote, and finally, the class with more votes determines the instance classification.

### 3.1. the one-versus-all strategy

One-Against-All multiclassiﬁer is compound by a number of binary classiﬁers, one for each class. Using a Winner-Takes-All strategy, each binary classiﬁer is trained taking the examples from one of the classes as positive and the examples from all other classes as negative. The multiclassiﬁer output is activated for the class whose binary classiﬁer gives the greatest output amongst all. Formally, given a vector y with the outputs of the binary classiﬁers, the multiclassifier generates a vector L = (l_{1},..., l_{s}), in the following way (27):

Where ‘s’ represents the total number of classes.

As seen, a one-against-all multiclassiﬁer for 's' different classes requires the construction of 's' distinct binary classiﬁers, each one responsible for distinguishing one class from all the others. However, doing so does not guarantee that the resulting multi-class classifier is good. The problem is that all binary classifiers are assumed to show equal competence distinguishing their respective class, in other words, there is an underlying assumption that all binary classifiers are totally trustable and equally reliable, which does not always hold in multi-class cases as Yi Liu [17] shows through a simple example as in figure 4.

The same error occurs with the binary classifier for class 2 and so, the multi-class classifier based on these three binary classifiers would not provide good accuracy. In order to mitigate such problem, Liu [15] suggests two reliability measures: SRM – static reliability measure and DRM – dynamic reliability measure.

#### 3.1.1. Static reliability measure

As pointed out by Vapnik [17] and Cortes [1], a small training set error does not guarantee a small generalization error when the number of training samples is relative small with respect to the feature vector x dimension. SVM training is done minimizing the objective function and as the objective function becomes smaller, smaller also becomes the generalization error. Based on this fact, Liu rewrites the objective function as seen in (28) and proposes the SRM as in (29).

Where (u)_{+} = u if u >0 and 0 if u ≤0.

Where D(x_{i}) = w^{T}x_{i} + b, and the parameter σ = CN is a normalization factor to offset the effect of the different regularization parameter C and training size N. This λ_{SRM} metric is reduced to (30) for those linearly separable cases where (1 – y_{i}D(x_{i}))_{+} = 0 for all training samples.

From (28) we notice that

#### 3.1.2. Dynamic reliability measure

The basic idea, differently of the static measure that is global over the whole training samples, is to estimate the classifier’s reliability in a local region of feature space surrounding the test sample x. The ‘k’ surrounding samples of x are denoted by N_{k}(x).

Suppose A(x) ∈ {1, -1} is the class label assigned to x by a SVM classifier and let

Liu formulate the local version of OBJ as in (32)

Where _{x} is the number of training samples in the set

#### 3.1.3. SRM and DRM decision rule

For a test sample x, assuming ‘M’ trained support vector machines each with its decision function, we evaluate D(x) for each classifier and after, generate the corresponding soft decision output

Now, assuming that

Mota in [18] sees the same problem from a different point of view. According to them in the One-Against-All method, SVM binary classiﬁers are obtained by solving different optimization problems and the outputs from these binary classiﬁers may have different distributions, even when they are trained with the same set of parameters and so, comparing these outputs using equation (27) may not work very well.

The output mapping, as suggested in [18], tries to mitigate such problem normalizing the outputs of the binary classiﬁers in such way to make them comparable by the equation (27). Four strategies are suggested: MND, BND, DNCD and MLP, based, respectively, on distance normalization (the first three) and base on a neural network model (the last one).

#### 3.1.4. MND output mapping strategy

Analyzing the histogram of the raw outputs (original outputs) from a typical SVM binary classiﬁer (figure 5) we observe a bimodal distribution consisting of two normal functions each with different mean and standard deviation. Different binary classiﬁers show different values of mean and standard deviation which makes unfair to compare their outputs. Then, before applying equation (10), the outputs from each binary classiﬁer are normalized in a way that they all provide a normal distribution with mean at −1 or +1 and a standard deviation equal to 1.

Using a validation data set the samples are grouped into two groups A_{1} (the current class) and A_{2} (all the other classes) and the respective output distribution mean and standard deviation are computed and then the normalized output is obtained by equation (36).

Where

#### 3.1.5. BND output mapping strategy

BND Strategy takes into account both normalized distances using the equation (38). When both distances _{i} is on the right side of the centers of both normal functions) then u_{i} is +1. When both distances are negative y_{i} is on the left side of both centers and u_{i} is −1, but when the distance signals are different, u_{i} is between −1 and +1, closer to +1 if

#### 3.1.6. DNCD output mapping strategy

Instead of using normalized distances (like in MND Strategy), DNCD Strategy builds a normalized output by joining the non-normalized distances and normalizing it by the distance between the centers of the normal functions as in equation (39).

#### 3.1.7. MLP output mapping strategy

In this case, instead of having a function which maps each raw output y_{i} to a normalized output u_{i}, we have a function which maps the entire vector y into the vector u. The idea of this strategy is to implement the mapping function by using an MLP neural network, trained using a validation data set. The training samples are the outputs given by the multiclassiﬁer for the validation data set. The expected outputs for those samples are the multiclassiﬁer expected outputs, that is, a vector for which all positions have value −1, except for the one which corresponds to the class of that sample, whose value is +1.

Homogeneous M class multiclassifier is the one where its M binary classifiers are all trained with the same set of parameters. This approach, however, may not be the best option once the training of each classifier is independent and so, the chance is high to find a better set of classifiers if the search for different parameters is allowed in each case. But, in these cases, if a number ‘g’ of such parameters is used then the number of possible combinations of them is

One approach is to choose a subset of alternative parameters composition and train a set of L distinct homogeneous multiclass SVMs. The output mapping is then applied to each of the ‘L*s’ binary classifiers and the heterogeneous multiclassifier is formed by selecting the best binary classifier from the ‘L’ homogeneous multiclassifiers. The selection is done through the classification quality metric ‘q’ as in (40) computed from the confusion matrix of each binary classifier.

Where M_{ij} is the value of the i-th row and j-th column of the confusion matrix, which corresponds to the number of samples of class A_{i} that were missclassiﬁed as being of class A_{j} by the homogeneous multiclassiﬁer. The more q_{i} approaches to 1 the better is the interaction of the i-th binary SVM among the other ones of the same homogeneous multiclassiﬁer. Thus, not only we take into account the number of hits of an SVM, but also we penalize it for possible confusions in that multiclassiﬁer. Finally, the heterogeneous multiclassiﬁer is produced by the binary SVMs of greatest quality for each class.

### 3.2. The one-versus-one strategy

This method constructs one binary classiﬁer for every pair of distinct classes and so, for M classes, a number of M*(M-1)/2 binary classiﬁers are constructed. The binary classiﬁer A_{ij} is trained taking the examples from class ‘i’ as positive and the examples from class ‘j’ as negative. For a new example x, if classiﬁer A_{ij} classifies it as class ‘i’, then the vote for class ‘i’ is added by one. Otherwise, the vote for class ‘j’ is increased by one. After each of the M*(M-1)/2 binary classiﬁers makes its vote, the strategy assigns the current example to the class with the largest number of votes.

Two interesting variations for the One-Against-One strategy, not using maximum vote, were proposed, one by Hastie and Tibshirani [19] known as pairwise coupling and other by Platt [20] that is a sigmoid version of the same pairwise coupling approach suggested by Hastie. Another interesting variation of this pairwise is proposed by Moreira and Mayoraz [21].

#### 3.2.1. Pairwise coupling

Considering that each binary classifier C_{ij} on a One-Against-One strategy provides a probabilistic output as_{i}’s, M*(M-1)/2 auxiliary variables μ_{ij}’s related to the p_{i}’s are introduced such as: μ_{ij} = p_{i}/(p_{i} + p_{j}) and then, p_{i}’s are determined so that μ_{ij}’s are close to r_{ij}’s. Kullback-Leibler distance [22, 23] between r_{ij} and u_{ij} was chosen as the measure of closeness (28).

where n_{ij} is the number of examples that belongs to the union of both classes (A_{i} U A_{j}) in the training set. The associated score equations are (29).

The p_{i}’s are computed using the following iterative procedure:

#### 3.2.2. Sigmoid pairwise coupling

Platt criticized Hastie and Tibshirani’s method of generating posterior class probabilities for each binary SVM, and suggested the use of a properly designed sigmoid applied to the SVM output to form these probabilities such as in (30).

Where ‘f’ is the output of the SVM associated with the example x and the parameters ‘A’ and ‘B’ are determined by the minimization of the negative log-likelihood function over the validation data. In [20] Platt suggests a pseudo-code for the determination of the parameters ‘A’ and ‘B’.

## 4. Character recognition experiments

The ability to identify machine printed characters in an automated manner has obvious applications in numerous fields (figure 6). *Optical character recognition (OCR), as this field is commonly known,* has been a topic of interest for a long time since the late 1940’s, when Jacob Rabinow started his work. Jacob was an engineer and inventor, he lived from 1910 to 1999 and during his life he earned 230 U.S. patents on a variety of mechanical, optical and electrical devices.

The earliest OCR machines were primitive mechanical devices with fairly high failure rates. As the amount of new written material increased, so did the need to process it all in a fast and reliable manner, and these machines were clearly not up to the task. They quickly gave way to computer-based OCR devices that could outperform them both in terms of speed and reliability.

Today there are many OCR devices in use based on a variety of algorithms. Despite the fact that these OCR devices can offer good accuracy and high speed, they are still far away compared to the performance reached by the human being. Many challenges are still opened not only with respect to the variety of scenarios, as well as, types of printed characters and handwritings, but also with respect to the accuracy by itself. There is no device able to recognize 100%, they always make mistake and, sometimes, bad mistakes like find a character that does not exist or recognize a complete different character than it really is (example: recognize as an ‘M’ what in fact is an ‘S’).

### 4.1. Remarks

The research field on automatic algorithms for character recognition is very large including different forms of characters like Chinese, Arabic and others; different origin like printed and handwritten and different approaches to obtain the character image like on line and off line.

The experiments on character recognition reported in the literature vary in many factors such as the sample data, pre-processing techniques, feature representation, classifier structure and learning algorithm. Only a reduced number of these works have compared their proposed methods based on the same set of characters. Obviously that this fact makes tough to get a fair comparison among the reported results.

Some databases were created and divulgated to the researcher’s community with the objective to offer a generic and common set of characters to be used as patterns for the researches. Some of the most popular databases are CENPARMI, NIST, MNIST and DEVNAGARI.

License Plate and handwritten numeral recognition are on the most addressed research topics in nowadays and the experiments on handwritten numeral have been done basically using CENPARMI and NIST Special Database 19.

CENPARMI database, for example, contains 4,000 training samples and 2,000 test samples segmented from USPS envelope images. This set is considered difficult but it is easy to achieve in the literature recognition rates reported over 98%. Suen et al. reported accuracy of 98.85% by training neural networks on 450,000 samples [24] training it with 4,000 samples. Liu et al. report rates over 99% using polynomial classifier (PC) and SVMs [25], [26]. They report an accuracy of 99.58% using RBF SVM and 99.45% using Polynomial SVM. In [27] Ahmad et al. report the usage of a hybrid RBF kernel SVM and a HMM – Hidden Markov Model system over an online handwriting problem taken from the IRONOFF-UNIPEN database. The same authors in [28] report a work done on the recognition of words. Pal et al. also report in [29] the usage of a hybrid system based on SVM and MQDF – Modified Quadratic Discriminant Function for the problem of Devnagari Character Recognition. Arora et al., all from India, report in [30] a performance comparison between SVM and ANN – Artificial Neural Network on the problem of Devnagari Character Recognition.

License Plate recognition as well as off line handwritten recognition represents a very tough challenge for the researchers. There are a number of possible difficulties that the recognition algorithm must be able to cope with, which includes, for example: a) poor image resolution, usually because the camera is too far away from the plate; b) poor lighting and low contrast due to overexposure, reflection or shadows; c) object obscuring (part of) the plate, quite often a tow bar, or dirt on the plate; d) bad conservation state of the plate; e) Blurry images, particularly motion blur; and f) lack of global pattern, sometimes even inside a same country or state (figure 7).

There is plenty of research work on this subject reported in the literature but the accuracy comparison among them is even more complex and difficult than the work done on handwritten. The accuracy not only depends on the type of the plates itself but also on the conditions on which the images were taken and on the level of the problems cited on previous paragraph. Waghmare et al. [32] report the use of 36 One-Against-All multiclass SVM classifier trained to recognize the 10 numeral and 26 letters from Indian plates (figure 8a). Parasuraman and Subin in [33] also report the usage of a Multiclass SVM classifier to recognize plates from Indian motorcycles (figure 8b). Other works on LPR can be found in [34 – 37].

In summary, character recognition is intrinsically a non linear and high dimensional problem. Among to the variety of OCR algorithms found in the literature, the SVM classifier is one of the most popular based on its good accuracy, high response speed and robustness. In the following subsections we describe some experiments in character recognition using both One-Against-All and One-Against-One multiclass SVMs.

### 4.2. Using one-against-all strategy

The strategies proposed in [18] are here evaluated when applied to a problem of classifying characters extracted from vehicle plates. Two multiclass SVMs are trained: one to recognize the 10 digits (10 classes) and other to distinguish the 26 letters (26 classes).

For each group of characters, three data sets were formed based on the feature extraction used. In data set 1 (DS1) the feature vector has dimensionality of 288 formed by the 16 × 16 character bit matrix and 32 additional values from the character horizontal and vertical projections. Principal Component Analysis [38] reduced the original dimension to 124 (for digits) and 103 (for letters). Data sets 2 (DS2) and 3 (DS3) were generated respectively by 56 and 42 statistical moments extracted from the 16 x 16 character bit matrix.

Each data set was divided in three subsets: one for training, one for validation, and one for test. Table 1 shows how the samples were divided in these three subsets.

#### 4.2.1. Digit recognition

Based on two kernel functions and a set of different values for two variables as shown in table 2 (standard deviation for the Gaussian kernel and exponent order for the polynomial kernel), a set of 55 Homogeneous multiclassiﬁers were trained.

The heterogeneous multiclassifier was formed with 10 binary classiﬁers selected each one of the 55 homogeneous multiclassiﬁers using the output mapping and confusion matrices as explained in previous section. The best results achieved for the test subsets of each data set are seen in Table 3. WTA-SVM is the common Winner-Takes-All strategy for One-Against-All approach.

#### 4.2.2. Letter recognition

The multiclassifier construction was also based on the same two kernel functions and the same two variables as shown in table 4 (standard deviation for the Gaussian kernel and exponent order for the polynomial kernel), a set of 30 Homogeneous multiclassiﬁers were trained.

The heterogeneous multiclassifier was formed with 26 binary classiﬁers selected each one of the 30 homogeneous multiclassiﬁers using the output mapping and confusion matrices as explained in previous section. The best results achieved for the test subsets of each data set are seen in Table 5. WTA-SVM is the common Winner-Takes-All strategy for One-Against-All approach.

### 4.3. Using one-against-one strategy

Experiments using One-Against-One RBF Kernel SVM are described in [39] for Brazilian plates, a total of 22464 numerals and 16848 letters were used for training and testing the system (Table 6). Brazilian plates show two different patterns (figure 9 – black characters over gray background and white characters over red background).

#### 4.3.1. Digit recognition

As reported, an average accuracy of 99.61% was achieved (Table 7) and the worst performance occurred on the recognition of the number 8, which was misclassified 17 times from a total of 1725 samples (Table 8 – 7 times misclassified as 6 and 7 as 9).

#### 4.3.2. Letter recognition

As reported, an average accuracy of 98.60% was achieved (Table 9X) and the worst performances showed by letters D (89.76%), O (93.31%) and Q (94.51%).

Table 10 shows and explains the reasons for such reduced performance in comparison to the other 23 letters. The fact is that these three letters show very similar visual aspect and the SVM misclassified 23 letters ‘O’ as ‘D’, 26 ‘D’ as ‘O’, 17 ‘Q’ as ‘O’ and 18 ‘O’ as ‘Q’.