Open access peer-reviewed chapter

A Novel Method for Multifont Arabic Characters Features Extraction

By Nadia Ben Amor and Najoua Essoukri Ben Amara

Submitted: December 3rd 2011; Reviewed: August 9th 2012; Published: November 7th 2012

DOI: 10.5772/52245

1. Introduction

Recently, many researchers around the world have focused on Arabic document analysis, and promising results have been reported.

However, there is no standard Arabic database that can serve as a benchmark. Each research group has implemented its own system on the data set it gathered, and different recognition rates have been reported. It is therefore very difficult to compare the proposed methods objectively.

The aim of our work is to test several feature extraction algorithms and classification methods on the same database, which we developed and which is composed of some 664 488 Arabic characters in nine different fonts, and to determine the method best suited to the morphological specificities of Arabic.

2. A review of Arabic characteristics

In this section we describe the important aspects of Arabic characters, since the characteristics of Arabic writing are different from those of other alphabets.

Arabic script is cursive in both its handwritten and printed forms and letter shape is context sensitive.

The cursive nature of Arabic script is the main challenge for any Arabic text recognition system. This cursiveness nevertheless obeys well-defined rules: some letters of the alphabet are never connected to their successors, while others link to their within-word successors by a horizontal connection line.
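The connection rule above can be illustrated with a small sketch. It assumes the bare 28-letter alphabet, in which six letters (alif, dal, dhal, ra', zay and waw) never connect to their successor; hamza forms, ta' marbuta, diacritics and ligatures are ignored. The function name `positional_forms` and the four form labels are illustrative choices, not part of the system described in this chapter.

```python
# Letters that never connect to their successor (assumed set: the basic
# 28-letter alphabet only, without hamza forms, ta' marbuta or ligatures).
NON_CONNECTING = {
    "\u0627",  # alif
    "\u062f",  # dal
    "\u0630",  # dhal
    "\u0631",  # ra'
    "\u0632",  # zay
    "\u0648",  # waw
}


def positional_forms(word):
    """Return (letter, form) pairs, where form is the shape the letter takes."""
    forms = []
    for i, letter in enumerate(word):
        # A letter is joined to its predecessor only if that predecessor connects forward.
        joins_prev = i > 0 and word[i - 1] not in NON_CONNECTING
        # A letter joins its successor only if it is not last and connects forward itself.
        joins_next = i < len(word) - 1 and letter not in NON_CONNECTING
        if joins_prev and joins_next:
            form = "medial"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        forms.append((letter, form))
    return forms


kitab = "\u0643\u062a\u0627\u0628"  # kaf, ta', alif, ba'
print([form for _, form in positional_forms(kitab)])
# ['initial', 'medial', 'final', 'isolated']
```

In this example the alif does not connect forward, so the final ba' takes its isolated shape even though it sits inside a word. This is why a character database has to cover several shapes per letter.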

In addition to the cursive aspect, we can also note the multitude of directions that can be described by the same Arabic character, especially in the multifont context.

Arabic writing may be classified into three different styles [1, 9]:

  • Typewritten: This is a computer-generated style. It is the simplest one because the characters are written without ligatures or overlaps (Figure 1).

Figure 1.

Example of typewritten Arabic style

  • Typeset: This style is more difficult than typewritten because it has many ligatures and overlaps. It is used to write books and newspapers.

Nowadays, this style may also be generated using computers (Figure 2).

Figure 2.

Example of typeset Arabic style

  • Handwritten: This is the most difficult style because the way Arabic letters are written varies from one writer to another.

Figure 3.

Examples of handwritten Arabic

Besides the different styles of writing, there are many Arabic fonts, which makes the recognition process even more difficult.

In our work we deal with multifont isolated Arabic characters. In fact, segmenting Arabic script into characters is very difficult and is a recurrent source of errors in segmentation-based systems. This work addresses the cursiveness problem by presenting a segmentation-free system.

Due to the lack of a common Arabic script database, we had to develop our own, which includes all the shapes of the Arabic characters, segmented beforehand.

These characters were considered in nine different fonts: Arabic Transparent, Badr, AlHada, Diwani, Kufi, Cordoba, Andalus, Ferisi and Salam (Figure 4, Figure 5).

Figure 4.

Samples of isolated Arabic characters considered in nine different fonts.

Besides, these characters were considered in the different shapes they can take depending on their position within a word. Some samples of these different shapes are shown in Figure 5.

In fact, more and more Arabic documents are composite and use several fonts, such as newspapers, magazines and even official documents. Figure 6, extracted from an official Tunisian newspaper, includes three different fonts, namely Arabic Transparent, Ferisi and Andalus, used in the main title and the subtitles.

Figure 5.

Samples of different Arabic characters shape according to their font and position in a word.

Figure 6.

Examples of Arabic multifont documents, extracted from two official newspapers

We have so far developed several processes for multifont Arabic character recognition. All of these methods have shown the importance of combining different types of information at different levels (feature extraction, classification, post-processing…). This cooperation helps to overcome the variability of Arabic script, especially in a multifont context [12, 13, 14, 15].

In this paper we highlight the role of Contourlets in the feature extraction step in an Arabic OCR context. This allows us to compare the performance of Contourlets with that of Wavelets and the Standard Hough Transform (SHT), which we previously used for the same purpose in our multifont Arabic recognition system. This comparison leads us to conclusions about the valuable contribution of Contourlets to the field of Arabic character recognition.

In Section 3, we present the first approaches we developed for the feature extraction step, then we introduce the Contourlet transform in Section 4. In Section 5, we detail the system performance and the experimental results. Finally, we conclude this paper in Section 6.

3. Wavelets and SHT approach for Arabic characters feature extraction

Feature extraction is a preliminary step for characters recognition. However, there is no perfect edge detector or feature extraction algorithm.

Many approaches have so far been developed for alphabets such as Latin and Japanese. Yet, given the specificity of Arabic writing, we cannot apply them as they are to Arabic characters. Indeed, Arabic writing presents a very specific morphology. Thus the field remains one of the most challenging, even though some work has been done [6, 17].

Arabic script is mainly composed of graphemes of a cursive and structural nature. That is why we first developed two approaches, based on the wavelet transform and the standard Hough transform (SHT). The wavelet transform is suitable for extracting cursive characteristics, while the SHT is well known for extracting directional features.

Even though these methods have allowed us to achieve good recognition rates, it is worth mentioning that they presented some weaknesses regarding the pure directional and cursive aspect of some Arabic characters such as....

In fact, the wavelet transform has been proven to be powerful in many signal and image processing applications such as compression [11], noise removal, image edge enhancement and feature extraction.

However, wavelets are not optimal in capturing the two-dimensional singularities found in images. They are not effective in representing images with smooth contours in different directions, even though they offer multiscale and time-frequency localization of an image (Figure 7, Figure 8). Wavelets are known to be quite efficient in representing image textures, but they prove insufficient as far as the localization of smooth contours is concerned [16].

Typically, a separable 2-D wavelet transform provides:

  • multiresolution, which is the ability to visualize the transform with varying resolution from coarse to fine

  • localization, which is the ability of the basis elements to be localized in both the spatial and frequency domains

  • critical sampling, which is the ability for the basis elements to have little redundancy.

Figure 7.

Examples with good recognition results using wavelets as feature extractor (cursive aspect)

Figure 8.

Examples with less good recognition results using wavelets as feature extractor (directional aspect)

However, a separable 2-D wavelet transform is not capable of providing:

  • directionality, which is having basis elements defined in a variety of directions

  • anisotropy, which is having basis elements defined in various aspect ratios and shapes.

In fact, despite its efficiency, the wavelet transform can only capture limited directional information. This can affect the performance of the recognition system, especially since the cursive nature of Arabic characters leads to a large number of directions to be considered. Thus the introduction of a direction-based feature extraction method was a necessity.

The other feature extraction method we focused on was the SHT.

The SHT is known to be a popular and powerful technique for finding multiple lines in a binary image, and it has been used in various applications.

It is very useful for identifying features of a particular shape within a character image, such as straight lines, but it fails when it comes to localizing curves and circles [9]. This is shown in Figure 9 and Figure 10.
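The limited directionality of a separable 2-D wavelet transform can be seen in a minimal sketch. The code below uses one level of an unnormalised Haar transform, chosen here only for brevity (it is an assumption, not the wavelet used in our system), and all function names are illustrative. Whatever the orientation of an edge, the transform can only report it through three fixed detail subbands.

```python
def haar2d_level(img):
    """One level of an unnormalised separable 2-D Haar transform.

    `img` is a list of rows with even height and width. Returns the coarse
    approximation and the three detail subbands.
    """
    approx, horizontal, vertical, diagonal = [], [], [], []
    for r in range(0, len(img), 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, len(img[0]), 2):
            tl, tr = img[r][c], img[r][c + 1]
            bl, br = img[r + 1][c], img[r + 1][c + 1]
            a_row.append((tl + tr + bl + br) / 4)  # local average
            h_row.append((tl + tr - bl - br) / 4)  # top vs bottom: horizontal edges
            v_row.append((tl - tr + bl - br) / 4)  # left vs right: vertical edges
            d_row.append((tl - tr - bl + br) / 4)  # diagonal differences
        approx.append(a_row)
        horizontal.append(h_row)
        vertical.append(v_row)
        diagonal.append(d_row)
    return approx, horizontal, vertical, diagonal


def energy(band):
    """Sum of squared coefficients of one subband."""
    return sum(v * v for row in band for v in row)


# A horizontal edge pattern and an oblique (staircase) edge, both 4 x 4.
horizontal_edge = [[0, 0, 0, 0], [1, 1, 1, 1]] * 2
oblique_edge = [[1 if c >= r else 0 for c in range(4)] for r in range(4)]

for name, img in (("horizontal", horizontal_edge), ("oblique", oblique_edge)):
    _, h, v, d = haar2d_level(img)
    print(name, energy(h), energy(v), energy(d))
```

The horizontal edge falls entirely into one subband, whereas the oblique edge is spread over all three, so its actual direction is not isolated by any single subband.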

Figure 9.

Examples of characters where the SHT fails in capturing cursive forms

Figure 10.

Examples of characters where the SHT succeeds in capturing straight forms
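The behaviour of the SHT on straight and curved strokes can be sketched as follows. This is a minimal accumulator over the usual line parameterisation rho = x cos(theta) + y sin(theta), written for illustration under assumed settings (180 angle bins, unit rho bins, synthetic 17 x 17 images); it is not the implementation used in our system.

```python
import math


def hough_accumulator(binary, n_theta=180):
    """Vote in (rho, theta) space for every foreground pixel of `binary`."""
    rows, cols = len(binary), len(binary[0])
    diag = math.ceil(math.hypot(rows, cols))      # upper bound on |rho|
    thetas = [math.pi * t / n_theta for t in range(n_theta)]
    acc = [[0] * n_theta for _ in range(2 * diag + 1)]
    for y in range(rows):
        for x in range(cols):
            if binary[y][x]:
                for t, theta in enumerate(thetas):
                    rho = round(x * math.cos(theta) + y * math.sin(theta))
                    acc[rho + diag][t] += 1       # shift so the row index is >= 0
    return acc, diag


def peak_votes(binary):
    """Largest number of pixels found on a single (rho, theta) line."""
    acc, _ = hough_accumulator(binary)
    return max(max(row) for row in acc)


def line_image(size=17, x0=8):
    """Synthetic image with one vertical stroke at column x0."""
    return [[1 if x == x0 else 0 for x in range(size)] for _ in range(size)]


def circle_image(size=17, radius=6):
    """Synthetic image with one circular stroke around the centre."""
    img = [[0] * size for _ in range(size)]
    centre = size // 2
    for k in range(360):
        a = math.radians(k)
        img[round(centre + radius * math.sin(a))][round(centre + radius * math.cos(a))] = 1
    return img


for name, img in (("line", line_image()), ("circle", circle_image())):
    n_pixels = sum(map(sum, img))
    print(name, "pixels:", n_pixels, "peak votes:", peak_votes(img))
```

For the straight stroke, every pixel votes for the same accumulator cell, so the peak equals the number of pixels. For the circular stroke the votes are scattered over many cells and no single line accounts for the shape.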

Besides, to take advantage of both previous methods, we integrated them in a hybrid approach. This hybridization made it possible to localize image texture as well as straight lines and directional features. Although the results improved, the computation time increased considerably [14].

4. Discrete Contourlet transform and feature extraction

Recently, several transforms that incorporate directionality and multiresolution have been proposed for image analysis; they can capture edges in the processed images more efficiently. Well-known examples of these more elaborate signal processing techniques include the Steerable Pyramid [4], Curvelets [3] and Contourlets [7]. The Contourlet transform is one of the new geometrical image transforms, and it seems promising since it allows both directional and cursive primitives to be extracted.

The contourlet transform uses a stage of subband decomposition followed by a directional transform. In the contourlet transform, a Laplacian pyramid is applied in the first stage, while directional filter banks (DFB) are used in the angular decomposition stage [7].

Unlike Wavelets, the contourlet transform is a directional transform capable of capturing contours and fine details in images.

In addition, the contourlet expansion is composed of basis functions oriented at a variety of directions in multiple scales. With this rich set of basis functions, the contourlets can effectively capture smooth contours.

Contourlets not only possess the main features of wavelets (multiscale and time-frequency localization), but also offer a high degree of directionality and anisotropy. More precisely, the Contourlet transform involves basis functions that are oriented in any power-of-two number of directions, with flexible aspect ratios [8].

The double filter bank structure of the contourlet is shown in Figure 11 for obtaining sparse expansions for typical images having smooth contours.

4.1. Laplacian Pyramid decomposition

The first filter bank, known as the Laplacian Pyramid (LP), is used to generate a multiscale representation of the image of interest. At each level, the LP decomposition generates a down-sampled low-pass version of the original image and the difference between the original and the prediction, which results in a band-pass image. The LP decomposition is shown in Figure 12. In the LP decomposition process, H and G are one-dimensional low-pass analysis and synthesis filters respectively, and M is the sampling matrix. The band-pass image obtained in the LP decomposition is then processed by the directional filter bank stage to reveal the directional details at each specific scale level.
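One LP level can be sketched on a 1-D signal, which is enough to show the roles of H, G and the down-sampling step. The filter taps below, the down-sampling factor of 2 and the border replication are assumptions made for this illustration; they are not the 'Pkva' filters used in our experiments. For an image, the same one-dimensional filters would be applied along rows and then along columns.

```python
def filter_same(x, taps):
    """Filter `x` with symmetric odd-length `taps`, replicating the borders."""
    n, half = len(x), len(taps) // 2
    return [
        sum(t * x[min(max(i + k - half, 0), n - 1)] for k, t in enumerate(taps))
        for i in range(n)
    ]


def predict(coarse, n, g):
    """Up-sample `coarse` to length `n` by zero insertion, then filter with G."""
    up = [0.0] * n
    up[::2] = coarse
    return filter_same(up, g)


def lp_analysis(x, h, g):
    """Return the down-sampled low-pass signal and the band-pass residual."""
    coarse = filter_same(x, h)[::2]              # low-pass with H, keep every 2nd sample
    prediction = predict(coarse, len(x), g)
    bandpass = [xi - pi for xi, pi in zip(x, prediction)]
    return coarse, bandpass


def lp_synthesis(coarse, bandpass, g):
    """Rebuild the signal as prediction plus band-pass residual."""
    prediction = predict(coarse, len(bandpass), g)
    return [pi + bi for pi, bi in zip(prediction, bandpass)]


H = [0.25, 0.5, 0.25]   # assumed analysis low-pass filter
G = [0.5, 1.0, 0.5]     # assumed synthesis (interpolation) filter

signal = [3.0, 5.0, 4.0, 8.0, 9.0, 7.0, 2.0, 1.0]
coarse, bandpass = lp_analysis(signal, H, G)
rebuilt = lp_synthesis(coarse, bandpass, G)
print(len(coarse), len(bandpass))
print(max(abs(a - b) for a, b in zip(signal, rebuilt)))
```

The two outputs together hold 12 values for 8 input samples, which reflects the redundancy of the LP. Iterating `lp_analysis` on `coarse` would give the next, coarser level.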

The output values from the second filter bank are called “contourlet coefficients”. Any analysis performed with the contourlet coefficients is considered as in the “contourlet domain.”

Figure 11.

Double Filter Bank Decomposition of Contourlets transform.

Figure 12.

The principle of LP

4.2. Directional Filter Bank decomposition

The Directional Filter Bank (DFB) is designed to capture high-frequency content such as smooth contours and directional edges. Several implementations of DFBs are available in the literature [8]. The combination of a Laplacian Pyramid (LP) and a DFB gives a double filter bank structure known as the contourlet filter bank. Band-pass images from the LP are fed to the DFB so that directional information can be captured.

The scheme can be iterated on the coarse image. The resulting double iterated filter bank decomposes the given image into directional sub-bands at multiple scales.

Since the purpose of using Contourlets is to focus on the cursive nature of the Arabic characters, we take an example of a cursive area and examine the behaviour of both wavelets and Contourlets on it (Figure 13).

Figure 13(a) shows how wavelets are arranged along the edge at different resolutions. The small blue squares represent the wavelets at the finest resolution, the green ones represent an intermediate resolution and the red squares represent wavelets at the coarsest resolution. Figure 13(b) shows the alignment of Contourlets; we can notice that the squares are replaced by rectangles.

Besides, we notice that, at each resolution, the edge can be represented by far fewer contourlets than wavelets. As wavelets are isotropic, they cannot take advantage of the underlying geometry of the edge. They approximate the edge as a collection of dots (small squares), so many points are needed to represent an edge. Contourlets, in contrast, represent the edge as a collection of small needles, hence only a few needle-shaped line segments are needed to represent it.
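This counting argument can be made explicit with a rough estimate. The figures below rest on assumptions that are not stated in this chapter: the edge is a smooth curve of length \ell, wavelets at scale j have square supports of side 2^{-j}, and contourlets at the same scale have needle-shaped supports of width 2^{-j} and length 2^{-j/2} (the parabolic scaling commonly assumed in the contourlet literature). The number N of elements needed to cover the edge is then approximately:

```latex
N_{\text{wavelet}}(j) \approx \frac{\ell}{2^{-j}} = \ell \cdot 2^{j},
\qquad
N_{\text{contourlet}}(j) \approx \frac{\ell}{2^{-j/2}} = \ell \cdot 2^{j/2},
\qquad
\frac{N_{\text{wavelet}}(j)}{N_{\text{contourlet}}(j)} \approx 2^{j/2}.
```

For instance, at scale j = 4 this estimate gives about 16\ell squares against about 4\ell needles, and the gap widens as the resolution gets finer.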

Figure 13.

Wavelets (a) vs. Contourlets (b)

To sum up, one contourlet may be assumed to be formed by grouping several wavelets at the same resolution.

In Figure 14, we present some examples of the decomposition of Arabic character images using Contourlets, Wavelets and the SHT. The Contourlet results are visibly of better quality than those of Wavelets and the SHT.

5. Experimental results

Due to the lack of a standard database in Arabic to be considered as a benchmark, we developed our own database including all the Arabic characters beforehand segmented and presented in the different shapes they could have in a word.

All images in the database are stored in TIFF format and processed in grey level.

Each image is decomposed in the contourlet domain. The resulting coefficients are structured in a special cellular form. Many experiments were conducted and we retained the Standard Deviation (SD) vector as a set of features.

Edge and texture orientations are captured by using a three-level contourlet decomposition, with directional decomposition levels 0, 2 and 3. The numbers of directional subbands at these levels are 3, 4 and 8 respectively. ‘Pkva’ filters are used for both the LP decomposition and the directional subband decomposition.

Figure 14.

Examples of images of features extraction using Contourlets, Wavelets and SHT: Better quality and recognition rates than Wavelets at greater level of resolution. Better curves detection than SHT.

As a result of this process, we obtain as output a cell vector in which output{1} corresponds to the lowpass subband. Each of the other cells corresponds to one pyramidal level and is itself a cell vector that contains the band-pass directional subbands from the DFB at that level. These parameters result in a 16-dimensional feature vector (n=16). The standard deviation vector used as the image feature is computed on each directional sub-band of the contourlet-decomposed image and then normalized. These normalized feature vectors are fed to the input of the Artificial Neural Network classification stage.

Two neural network architectures were implemented: a global Multilayer Perceptron (MLP) and a modular one.

In Table 1, we present the different recognition rates achieved when using Contourlets [13], Wavelets [10] and SHT [11] for feature extraction. These results show the efficiency of the contourlet transform compared with the results obtained previously with the SHT and the wavelet transform, even though the directional filter used is a predefined one.
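The feature computation described above can be sketched as follows. The nested list `coeffs` is a hypothetical stand-in for the cell vector (toy random values, not a real contourlet decomposition). Including the lowpass subband in the vector is one reading that is consistent with n = 16, since 1 + 3 + 4 + 8 = 16, and normalisation to unit Euclidean norm is an assumption, as the normalisation scheme is not specified here.

```python
import math
import random
import statistics


def sd_features(coeffs):
    """Standard deviation of each subband, normalised to unit Euclidean norm.

    `coeffs[0]` is the lowpass subband (a 2-D list); every other entry is a
    list of 2-D directional subbands for one pyramidal level.
    """
    subbands = [coeffs[0]] + [band for level in coeffs[1:] for band in level]
    sds = [statistics.pstdev([v for row in band for v in row]) for band in subbands]
    norm = math.sqrt(sum(s * s for s in sds))
    return [s / norm for s in sds] if norm else sds


# Toy coefficients mimicking the structure only: 3, 4 and 8 directional
# subbands at the three levels, plus one lowpass subband.
rng = random.Random(0)


def toy_band(height=4, width=4):
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


coeffs = [toy_band()] + [[toy_band() for _ in range(n)] for n in (3, 4, 8)]
features = sd_features(coeffs)
print(len(features))
```

Each character image is thus summarised by a short fixed-length vector, whatever the size of its subbands, which keeps the input layer of the classifier small.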

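Recognition rates (%) by feature extraction method and classifier

| Character | Contourlets, modular MLP | Contourlets, global MLP | Wavelets, modular MLP | Wavelets, global MLP | SHT, modular MLP |
|---|---|---|---|---|---|
| ا Alif | 99.43 | 99.58 | 99.52 | 97.70 | 99.10 |
| ب Ba’ | 99.41 | 99.85 | 99.43 | 98.85 | 95.13 |
| ت Ta’ | 98.66 | 99.56 | 98.56 | 96.79 | 97.16 |
| ث Tha’ | 99.18 | 99.31 | 99.04 | 98.75 | 98.86 |
| ج Jim | 99.28 | 100 | 97.99 | 97.50 | 97.02 |
| ح Ha’ | 99.90 | 100 | 99.43 | 98 | 96.80 |
| خ Kha’ | 99.40 | 100 | 98.85 | 97.23 | 96.26 |
| د Dal | 99.27 | 99.73 | 100 | 98.37 | 95.69 |
| ذ Dhal | 98.43 | 99.60 | 98.18 | 97.96 | 96.73 |
| ر Ra’ | 98.84 | 99.51 | 97.89 | 96.79 | 96.73 |
| ز Zay | 98.57 | 99.74 | 95.41 | 95.08 | 96.55 |
| س Sin | 99.15 | 99.85 | 99.52 | 98.18 | 94.65 |
| ش Chin | 98.59 | 99.90 | 98.76 | 97.75 | 96.16 |
| ص Sad | 98.35 | 99.83 | 99.52 | 100 | 96.24 |
| ض Dhad | 97.96 | 99.71 | 96.47 | 96.55 | 94.62 |
| ط Ta’ | 98.83 | 99.76 | 98.37 | 98.23 | 94.64 |
| ظ Dha’ | 98.87 | 99.22 | 98.66 | 97.85 | 95.24 |
| ع ‘Ayn | 99.23 | 99.82 | 98.47 | 95.61 | 96.76 |
| غ Ghayn | 99.18 | 99.78 | 98.76 | 98.63 | 96.22 |
| ف Fa’ | 98.29 | 99.20 | 99.23 | 100 | 95.50 |
| ق Qaf | 99.69 | 99.64 | 98 | 98.97 | 95.34 |
| ك Kaf | 99.02 | 99.71 | 99.52 | 98.69 | 95.16 |
| ل Lam | 99.45 | 99.60 | 100 | 98.93 | 98.38 |
| م Mim | 99.24 | 99.47 | 99.33 | 99.16 | 97.02 |
| ن Noun | 99.19 | 99.42 | 97.51 | 96.92 | 97.18 |
| ه Ha’ | 98.40 | 99.85 | 98.76 | 98.37 | 98.55 |
| و Waw | 99.31 | 98.98 | 98.95 | 97.86 | 97.31 |
| ي Ya’ | 99.66 | 99.88 | 98.09 | 98.12 | 97.98 |
| Average rate (%) | 99.03 | 99.66 | 98.65 | 97.95 | 96.53 |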

Table 1.

Recognition rate per character corresponding to the MLP models

6. Conclusion and perspectives

In this paper, we were interested in the challenges of Arabic characters feature extraction especially in a multifont context. We proposed a new approach for Arabic character recognition based on contourlet transform for feature extraction.

The achieved results show the efficiency of this transform compared with the Wavelet transform and the SHT. They demonstrate its superiority in describing the different morphological variations of isolated Arabic characters. In fact, the contourlet transform has the advantage of highlighting both the directional and the cursive nature of Arabic script.

As a major perspective of this work, we can consider optimizing the Contourlet algorithm by developing an adaptive filter that depends on the character’s class and form, for example by implementing filters adapted to the directions best recognized by the SHT and, of course, to the main directions of the Arabic script itself.

© 2012 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Nadia Ben Amor and Najoua Essoukri Ben Amara (November 7th 2012). A Novel Method for Multifont Arabic Characters Features Extraction, Advances in Character Recognition, Xiaoqing Ding, IntechOpen, DOI: 10.5772/52245. Available from:
