Online Handwritten Chinese / Japanese Character Recognition

Handwritten character pattern recognition methods are generally divided into two types: online recognition and offline recognition [1]. Online recognition recognizes character patterns captured from a pen-based or touch-based input device where trajectories of pentip or finger-tip movements are recorded, while offline recognition recognizes character patterns captured from a scanner or a camera device as two dimensional images.


Introduction
Handwritten character pattern recognition methods are generally divided into two types: online recognition and offline recognition [1]. Online recognition recognizes character patterns captured from a pen-based or touch-based input device where trajectories of pentip or finger-tip movements are recorded, while offline recognition recognizes character patterns captured from a scanner or a camera device as two dimensional images.
Structural methods are weak at collecting global character pattern information, while they are robust against character shape variations. In contrast, un-structural methods are robust against noises but very weak against character shape variations. By combining a structural method (structural recognizer) with an un-structural method (un-structural recognizer), the recognition accuracy improves since they compensate for their respective disadvantages [27,28].
For online recognition, structural features are often employed with hidden Markov models (HMMs) [12][13][14][15][16][17][18][19][20][21] or Markov random field (MRF) [29,30]. However, since the un-structural features are easily extracted from an online handwritten pattern by discarding temporal and structural information, we can apply the un-structural method as well. Therefore, we can combine the structural and un-structural methods. In freely written string recognition, we need to consider whether we should select segmentation-free or over-segmentation-based methods. Character segmentation of cursive handwriting is difficult due to the fact that spaces between characters are not obvious. Without character recognition cues and linguistic context, characters cannot be segmented unambiguously. A feasible way to overcome the ambiguity of segmentation is called integrated segmentation and recognition, which is classified into segmentation-free and over-segmentation-based methods [31,32] as shown in Figure 1. Segmentation-free methods, mostly combined with hidden Markov model (HMM)-based recognition [20,21,[33][34][35][36][37][38], simply slice the word pattern into frames (primitive segments) and label the sliced frames, which are concatenated into characters during recognition. Such methods do not sufficiently incorporate character shape information. On the other hand, over-segmentationbased methods attempt to split character patterns at their true boundaries and label the split character patterns [39][40][41][42][43][44][45][46]. Character patterns may also be split within them, but they are merged later. This is called over-segmentation. It is usually accomplished in two steps: oversegmentation and path search. The string pattern is over-segmented into primitive segments such that each segment composes a single character or part of a character. The segments are combined to generate candidate character patterns (forming a segmentation candidate lattice), which are evaluated using character recognitions incorporating geometric and linguistic contexts. However, character segmentation is a hard problem to solve. When handwritten words are not easily segmented, such as with cursive writing, oversegmentation may result in misrecognition. In this case, a segmentation-free method is appropriate. For Chinese/Japanese recognition, however, segmentation is easier than in English or other western languages.
Since over-segmentation-based methods can better utilize character shapes, they are considered to outperform segmentation-free methods [2]. Moreover, over-segmentationbased methods produce less primitive segments since they attempt to find the true boundaries of character patterns as segmentation point candidates; therefore, we consider that over-segmentation-based methods are effective and more efficient compared with segmentation-free methods for Chinese/Japanese string recognition. We show our online handwritten Chinese/Japanese string recognition system in Figure 2, where an oversegmentation-based method is used.
In this chapter, we describe the recent technology trends, problems and methods to solve them for the online handwritten Chinese/Japanese character recognition. The rest of this chapter is organized as follows. Section 2 presents an overview of our online handwritten string recognition system. Section 3 presents structural and un-structural recognitions. Section 4 describes coarse classification. Section 5 describes combination of structural and un-structural recognitions. Section 6 presents string recognition, and Section 7 draws conclusions.

Recognition system overview
We process each online handwritten string pattern as follows:

Segmentation candidate lattice construction
Strokes in a string are grouped into blocks (primitive segments) in accordance with the features such as off-stroke (pen lift between two adjacent strokes) distance and overlap of bounding boxes of adjacent strokes. Each primitive segment is assumed to be a character or a part of a character. An off-stroke between adjacent blocks is called a candidate segmentation point, which can be a true segmentation point (SP) or a non-segmentation point (NSP). One or more consecutive primitive segments form a candidate character pattern. The combination of all candidate patterns is represented by a segmentation candidate lattice.

Character pattern recognition
There are thousands of categories for the Chinese/Japanese language. First, to improve the recognition speed, we reduce recognition candidates by using a coarse classifier for each online candidate character pattern. After coarse classification, we apply a structural recognizer and an un-structural recognizer to recognize the input pattern, and obtain two sets of character candidate classes from the two recognizers. Each candidate class of each set has a corresponding structural or un-structural recognition score. We combine the two sets of candidate classes considering their recognition scores to output a set of candidate classes to nominate them into the candidate lattice.

Search and recognition
We apply the beam search strategy to search for the candidate lattice. During the search, the paths are evaluated in accordance with the path evaluation criterion proposed Zhu et al. [47], which combines the scores of character recognition, linguistic context, and geometric features (character pattern sizes, inner gaps, single-character positions, pair-character positions, candidate segmentation points) with the weighting parameters estimated by a genetic algorithm (GA). This method selects an optimal path as the recognition result.

Structural recognition
First, we include a brief description here on our structural character pattern recognition system [48,49]. Our system first normalizes an input online character pattern by a linear or nonlinear normalization method. An online character pattern is a sequence of strokes and a stroke is a time sequence of coordinates of pen-tip or finger-tip movements. Then, it extracts feature points by such a method as Ramner [50]. First, the start and end points of every stroke are picked up as feature points. Then, the most distant point from the straight line between adjacent feature points is selected as a feature point if the distance to the straight line is greater than a threshold value. This selection is done recursively until no more feature points are selected. The feature point extracting process is shown in Figure 3(a).
Then it uses a MRF model to match the feature points with the states of each character class as shown in Figure 3(b) and obtain a similarity for each character class. It then selects the character class with the largest similarity as the recognition result.
Site: Feature points from an input pattern S={s1, s2, s3,…,s12} Label: States of a character class L={l1, l2, l3,…, l12} Labeling problem: Assign labels to the sites such as {s1= l1, s2 = l1, s3 = l3,…,s11 = l9, s12 = l11} The system recognizes the input pattern by assigning labels to the sites to make the matching between the input pattern and each character class. MRF model is used to solve the labeling problem.
Structural methods can be further divided into template-based structural methods [3,[6][7][8][9][10][11] and statistical structural methods [4, 5, 12-21, 39, 40]. Template-based structural methods work well with handwriting recognition for user dependent systems. However, these methods do not take into account the distributions of training patterns, resulting in limited recognition accuracy. Statistical structural methods measure probabilistic primitives and/or relationships so as to better model the shape variations of input patterns [29,30].
HMMs probabilistically treat a sequence of feature vectors in writing or position order, so that they can only use the neighborhood relationships between the successively adjacent feature vectors in writing or position order (the so-called one-dimensional neighborhood relationships) and the two-dimensional neighborhood relationships, such as those among the neighboring feature vectors, on distances are not explicitly expressed. For onedimensional neighborhood relationships, HMMs only use the state transition probabilities and unary features, but binary features are not used well. Moreover, the neighborhood relationships among more than two neighboring feature vectors, such as ternary features, cannot be used. Although some HMMs apply binary features, they only merge the binary features into the unary features and use a vector of larger dimension because HMMs do not take a new view of the binary features, which limits recognition accuracy [31]. The MRF model is described using an undirected graph in which a set of random variables have a Markov property, and MRFs can be used to effectively integrate information among neighboring feature vectors, such as binary and ternary features, and two-dimensional neighborhood relationships [54]. Therefore, MRFs have been effectively applied to stroke- analysis-based structural offline character recognition [4,5]. They have also been widely and successfully applied to image processing [55,56] and online stroke classification [57]. However, MRFs had not been applied to online character recognition until our reports [48,49]. Current online handwritten character recognition tends to use HMMs (note that HMMs can be viewed as specific cases of MRFs). MRFs have more degrees of freedom than HMMs for explicitly expressing relations among multiple feature vectors.
Saon et al. [33] proposed an HMM-based offline English word recognition method that uses neighboring pixels to estimate the pixel observation probabilities and discussed its performance. However, it is still an HMM-based method, although it uses the neighborhood relationships of the recognition. Based on the advantages of MRFs, we can assume that applying MRFs instead of HMMs to integrate the information among the neighboring feature vectors can improve performance of offline English or other western word recognition using segmentation-free methods [32][33][34][35][36][37][38].
Since online character patterns contain temporal information on pen movements, structural methods that discard temporal information and only apply structural information can result in stroke-order independence. However, it is computationally expensive since the neighborhood relationships must be examined in two dimensions. Although the method introducing temporal information is very sensitive to stroke order variations, it is efficient in recognition speed, and combining it with an un-structural method can deal with the strokeorder variations [27,28]. Even for the one-dimensional neighborhood relationships applying MRFs instead of HMMs to integrate the information of binary features between the successively adjacent feature vectors in writing or position order can improve performance.
Cho et al. [58] proposed a Bayesian network (BN)-based framework for online handwriting recognition. BNs share similarities with MRFs. They are directional acyclic graphs and model the relationships among the neighboring feature vectors as conditional probability distributions, while MRFs are undirected graphs and model the relationships among the neighboring feature vectors as probability distributions of binary or ternary features.
Introducing weighting parameters to MRFs and optimizing them based on conditional random fields (CRFs) [59] or the minimum classification error (MCE) [60] may result in even higher recognition accuracy; CRFs have been successfully applied to online string and offline word recognition [42,61].
We have proposed an MRF model with weighting parameters optimized by CRFs for online recognition of handwritten Japanese characters [48,49]. We focused on an online structural method introducing temporal information into one-dimensional neighborhood relationships and compared their effects on HMMs and MRFs. The model effectively integrates unary and binary features and introduces adjustable weighting parameters to the MRFs, which are optimized according to CRF. The proposed method extracts feature points along the pen-tip trace from pen-down to pen-up and matches those feature points with states for character classes probabilistically based on this model. Experimental results demonstrated the superiority of the method and that MRFs exhibited higher recognition accuracy than HMMs.
Online Handwritten Chinese/Japanese Character Recognition 57

Un-structural recognition
For the un-structural recognizer, we do not need to transform each online character pattern to an offline character pattern (two dimensional images), and extract directional features directly from the online character pattern. From an online character pattern, we extract directional features: histograms of normalized stroke direction [26]. For coordinate normalization, we apply pseudo 2D bi-moment normalization (P2DBMN) [22]. The local stroke direction is decomposed into eight directions, and from the feature map of each direction, 8x8 values are extracted by Gaussian blurring so that the dimensionality of feature vectors is 512. To improve the Gaussianity of feature distribution, each value of the 512 features is transformed by the Box-Cox transformation (also called variable transformation). The input feature vector is reduced from 512D to nD by the Fisher linear discriminant analysis (FLDA) [31]. Then we use the nD feature vectors to create a modified quadratic discriminant function (MQDF) recognizer [24] as follows: x where μi is the mean vector of class ωi, λij (j = 1, …, k) are the largest eigenvalues of the covariance matrix and φij are the corresponding eigenvectors, k denotes the number of principal axes, and δ is a modified eigenvector that is set as a constant. The value of δ can be optimized on the training data set. However, for a convenience, we simply set it as γλaverage where λaverage is the average of λij (i,j = 1, …, n) for all features of all classes and γ is a constant that is larger than 0 and smaller than 1.
According to previous works [23,26], the best un-structural recognition performance is obtained when n is about 160 and k is about 50 for the MQDF recognizer. When combining structural and un-structural recognizers and then combining them with linguistic context and geometric features for the string recognition, we have found the best combination performance is obtained when n is about 90 and k is about 10 for the MQDF recognizer. Therefore, we take n as 90 and k as 10, respectively.

Coarse classification
Although character classifiers with high recognition accuracy have been reported [26,[47][48][49], the demand for speeding up recognition is very high for portable devices as well as for desktop applications for which handwriting recognition is incorporated as one of the modules. The performance of these relatively small devices requires having a fast as possible recognition speed while maintaining high accuracy. Even for a desktop PC with relatively high performance, a recognition speed of within 0.5 seconds per page is required in actual applications. Therefore, we need to refine the recognition scheme to improve the processing speed.
Chinese, Japanese, or Korean have thousands of different categories, and their large character set is problematic not only for the recognition rate but also for the recognition speed. A general approach to improving the recognition speed is to perform coarse classification, pre-classification, or candidate selection before the fine classification [62,63].
The coarse classification typically uses simpler classification algorithms or fewer features in order to achieve a better speed than does the fine classification. It is used to select a relatively small subset of candidates out of a very large set quickly. The fine classification would then be used on these candidates to match an input pattern so that the whole recognition time is reduced. Current approaches for coarse classification typically use distance measures that are simpler than those for fine classification [64,65]. The confidence evaluation provides even more precise candidate selection [66]. Others use simple features different from those for fine classification [67]. Sequential (multi-stage) classifications using a partial set of features at each stage have also been used [68].
On the other hands, prototypes may be organized prior to the search itself so that the search is performed on a subset of prototypes. We could mention a number of methodologies that vary slightly in how the data is organized. The simplest ones are proposals for ordered spaces and tree structures. The search on pre-structured spaces aims particularly at alleviating problems with search costs. As a result, recognition is accelerated.

Combined recognition
How to combine different classifiers is an important problem in multiple classifier approaches.
In Japanese character recognition, Oda et al. improved recognition performance by combining a recognizer by a structural method and that by an un-structural method using probabilistic tables to normalize the combination scores [69]. The combination method by probabilistic tables is a generative method, and applying a discriminative method such as the MCE criterion and neural network to estimate and to optimize the combination may bring about higher performance.
Liu investigated the effects of confidence transformation in combining multiple classifiers using various combination rules [70]. Kermorvant et al. constructed a neural network to combine the top rank candidates of three word recognizers [37]. The two works used the discriminative methods to estimate the combination parameters. However, when optimizing the parameters the previous works always only considered the character/word recognition performance, not the string recognition performance. In fact, real applications usually use the string recognition rather than the character recognition. The character recognition is a part of the string recognition. Therefore, when we create a character recognizer, we have to consider the string recognition performance, as done by Tonouchi [71] and Cheriet et al. [71]. The methods that only guarantee the character recognition accuracy do not necessarily provide high string recognition performance. They cannot even be applied for string recognition.
On the other hand, we have to point out that introducing more parameters for a discriminative method dose not bring about higher performance, since we have only a limited amount of samples for training. However, previous works tended to introduce too many parameters for a discriminative method.

Online Handwritten Chinese/Japanese Character Recognition 59
We have applied a discriminative method for MCE to optimize the parameters for combinations of structural and un-structural recognizers with a linear or nonlinear function for online handwritten Japanese string recognition [28]. To introduce an effective set of parameters, we applied a k-means method to cluster the parameters of all character categories into groups, and for categories belonging to the same group, we introduced the same weight parameters. We investigated how to construct the function and how to introduce effective parameters for discriminative methods under the condition of a limited amount of samples for classifier training. We designed the objective functions of parameter optimization so as to optimize the string performance. Moreover, we used GA to estimate super parameters such as the number of clusters, initial learning rate, and maximum learning times as well as the sigmoid function parameter for the MCE optimization. Experimental results demonstrated the superiority of our method.

Linguistic contextual processing
String recognition applies not only character recognition, but also linguistic contextual processing. As shown in Figure 4 (a), by character recognition, each candidate character pattern is associated with a number of candidate classes with confidence scores. The combination of all character classes is represented by a character recognition candidate lattice. The linguistic contextual processing evaluates the combinations from character classes to character classes. By searching the candidate lattice by the Viterbi algorithm, the optimal path with maximum score gives the final result of string recognition. Linguistic contextual processing methods can be roughly divided into two classes: methods using the character combinations and methods using the word combinations. As shown in Figure 4 (b), the linguistic contextual processing evaluates the probability P(C) of the string C that comprises a sequence of characters {c1, c2, …} or a sequence of words {w1, w2,…}.
The methods using the character combinations evaluate the probability of the character combinations for each string candidate. We can use the appearance probability of only one character (unigram), bi-gram of two characters, tri-gram of three characters and generally called n-gram of n characters. The tri-gram is smoothed to overcome the imprecision of training with insufficient text by combining unigram, bi-gram and tri-gram using a linear function with weighting parameters.
In our experiment, under the condition with character writing boxes, using bi-gram improved the character recognition rate by 5 points from 92.9%, and using tri-gram improved the character recognition rate by one point. Moreover, under the condition without character writing boxes, using bi-gram improved the character recognition rate by 10 points from 81.3%, and using tri-gram improved the character recognition rate by 3 points.
The methods using the word combinations first divide string into words by morphological analysis, and then evaluate the probability of the word combinations for each string candidate. We can also use the appearance probability of only one word (unigram), bi-gram of two words, tri-gram of three words and generally called n-gram of n words. Although the methods have some problems such as unknown words and word dictionary memory, Nagata et al. have presented it could save more than 2/3 misrecognitions in a handwriting OCR simulation by dealing with unknown words [73].
Recognition of a limited vocabulary such as addresses, person names, dates and departments is important issues in string recognition. Applying a general-purpose contextual processing method such as tri-gram to recognize specific words would cause misrecognitions. Applying domain specific methods to recognize specific words into the vocabulary set can improve recognition rate significantly [45,74].

Freely written string recognition
With pen-based or touch-based input devices of large writing areas such as tablet PCs, Pad PCs, electronic whiteboards and digital pens, people tend to write text continuously with little constraints. This urges the need of handwritten string recognition. Compared to isolated character recognition, handwritten string recognition faces the difficulty of character segmentation because characters cannot be reliably segmented before they are recognized. Moreover, in continuous handwriting, characters tend to be written more cursively.
The integrated segmentation and recognition to overcome the ambiguity of segmentation, is dichotomized into segmentation-free method and over-segmentation-based method as shown in Fig. 1 [31]. Based on the advantages of over-segmentation-based method, we apply it for our recognition system. Figure 5. Character segmentation-recognition candidate lattice By character recognition, each candidate character pattern is associated with a number of candidate classes with confidence scores. The combination of all candidate patterns and character classes is represented by a character segmentation-recognition candidate lattice, where each arc denotes a segmentation point and each node denotes a character class assigned to a candidate pattern as shown in Figure 5. The segmentation paths and corresponding string classes in the candidate lattice are evaluated by combining the scores of candidate characters and between-character compatibilities (geometric and linguistic contexts).
In over-segmentation based string recognition, how to evaluate the candidate characters (lying on paths in the candidate lattice) is a key issue. A desirable criterion should make the path of correct segmentation have the largest score. Unlike HMM-based recognition that classifies a unique sequence of feature vectors (each for a frame) on a string, the candidate lattice of over-segmentation has paths of different lengths, each corresponding to a different sequence of feature vectors, thus the comparison of different paths cannot be based on the Bayesian decision theory as for HMM-based recognition. Instead, candidate character recognition and context scores are heuristically combined to evaluate the paths. Such heuristic evaluation criteria can be divided into summation-based ones [75] and normalization-based ones [71]. A summation criterion is the summation of character-wise log-likelihood or the product of probabilistic likelihood. Since the likelihood measure is usually smaller than one, the summation (product) criterion is often biased to paths with fewer characters, and so, tends to over-merge characters. On the other hand, the normalized criterion, obtained by dividing the summation criterion by the number of segmented characters (segmentation length), tends to over-split characters.
To solve the problems, we have proposed a robust context integration model for online handwritten Japanese string recognition [47]. By labeling primitive segments, the proposed method can not only integrate the character shape information into recognition by introducing some adjustable parameters, but also is insensitive to the number of segmented character patterns because the summation is over the primitive segments. Experimental results demonstrated the superiority of our proposed string recognition model.
We include a brief description here on our recognition model [47]. Denote X = x1…xm as successive candidate character patterns of one path, and every candidate character pattern xi is assigned a candidate class Ci. Then f(X,C) is the score of the path (X,C) where C = C1…Cm. The path evaluation criterion is expressed as follows:   X,C where Ph, h=1,…,6, stand for the probabilities of P(Ci|Ci-2Ci-1), P(bi|Ci), P(qi|Ci), P(p u i|Ci), P(xi|Ci), and P(p b i|Ci-1Ci), respectively. bi, qi, p u i, and p b i are the feature vectors for character pattern sizes, inner gaps, single-character positions, and pair-character positions, respectively. gi is the between-segment gap feature vector. P(Ci|Ci-2,Ci-1) is the tri-gram probability. ki is the number of primitive segments contained in the candidate character pattern xi. λh1, λh2 (h=1~7) and λ are the weighting parameters estimated by GA. P(xi|Ci) is estimated by the combination score of the structural and un-structural recognizers. We can also divide it into two parts P(x str i|Ci), P(x un-str i|Ci) where x str i denotes the structural features of xi, x un-str i denotes the un-structural features of xi, P(x str i|Ci) is estimated by the score of the structural recognizer and P(x un-str i|Ci) is estimated by the score of the un-structural recognizer. The path evaluation criterion is changed as follows:   X,C where Ph, h=1,…,7, stand for the probabilities of P(Ci|Ci-2Ci-1), P(bi|Ci), P(qi|Ci), P(p u i|Ci), P(x str i|Ci), P(x un-str i|Ci), and P(p b i|Ci-1Ci), respectively. λh1, λh2 (h=1~8), and λ are the weighting parameters estimated by GA. By the path evaluation criterion, we re-estimate the combination of the structural and un-structural recognizers.

Conclusion
This chapter described the recent trends in online handwritten Chinese/Japanese character recognition and our recognition system. We apply an over-segmentation-based method for our recognition system where the paths are evaluated in accordance with our path evaluation criterion, which combines the scores of character recognition, linguistic context, and geometric features (character pattern sizes, inner gaps, single-character positions, paircharacter positions, candidate segmentation points) with the weighting parameters estimated by GA. We combine structural and un-structural methods to recognize each character pattern so that the recognition accuracy improves.
Improving recognition performance is the aim of our future work. This can be achieved by incorporating more effective geometric features, exploiting better geometric context likelihood functions and weighting parameter learning methods, and improving the accuracy of character recognizer. To speed up recognition and reduce memory size is another dimension of our future work. We should consider effective methods to remove invalid patterns from the lattice.

Author details
Bilan Zhu and Masaki Nakagawa Department of Computer and Information Sciences, Tokyo University of Agriculture and Technology, Tokyo, Japan