Open access peer-reviewed chapter

A Deterministic Algorithm for Arabic Character Recognition Based on Letter Properties

By Evon Abu-Taieh, Auhood Alfaries, Nabeel Zanoon, Issam H. Al Hadid and Alia M. Abu-Tayeh

Submitted: March 11th 2018. Reviewed: April 3rd 2018. Published: June 27th 2018.

DOI: 10.5772/intechopen.76944

Abstract

Handheld devices are flooding the market, and their use has become essential. Hence, there is a need for fast and accurate character recognition methods that ease data entry for users. Many methods have been developed for handwritten character recognition, especially for Latin-based languages. Character recognition methods for the Arabic language, on the other hand, are rare. The Arabic language has traits that differentiate it from other languages: first, it is written from right to left; second, a letter changes shape according to its position in the word; and third, the writing is cursive. Such traits call for a dedicated character recognition method to support applications for the Arabic language. This research proposes a deterministic algorithm that recognizes the letters of the Arabic alphabet. The algorithm is based on four categorizations of Arabic letters. It is composed of 34 rules that predict the character by using all four categorizations as attributes assembled in a matrix.

Keywords

  • conditional random field
  • rule-based mode
  • word prediction
  • virtual keyboard
  • Arabic text entry
  • enhancement
  • text entry system
  • theory of randomized search heuristics
  • structured prediction
  • theory of computation

1. Introduction

Arabic is one of the top five languages spoken in the world, used by more than 422 million native and non-native speakers. The letters of the Arabic alphabet are also used in other languages such as Urdu (65 million native and 94 million non-native speakers) and Persian (110 million), as well as Baluchi, Brahui, Pashto, Central Kurdish, Sindhi, Kashmiri, Punjabi, and Uyghur. Hence, there is a need to develop a character recognition algorithm for the Arabic language. Yet major challenges arise. First, Arabic is cursive: letters change shape as they are written, and separately written pieces are usually sub-words rather than stand-alone words. Second, Arabic is written from right to left, unlike Latin-based languages. Third, Arabic has 28 letters, some of which change shape based on their location in the word. Also, some letters are very similar in form and are differentiated only by secondary marks. For all these reasons, Arabic character recognition systems are underdeveloped and lacking.

This research is composed of five sections. It first presents 20 related works. Then it explains the letter shapes in the Arabic language and the four categorizations used in the proposed algorithm. The first categorization depends on the number of dots used with each letter. The second depends on the shape of the letter. The third is based on the shape of the letter as used at the beginning, middle, and end of a word. The fourth relies on the proportion method, a method used in Arabic calligraphy that is based on the rhombic dot. Finally, the research suggests a deterministic algorithm composed of 34 rules that predicts the character by using all four categorizations as attributes assembled in a matrix.

2. Related work

Character recognition is an open-ended problem: a computer cannot by itself recognize characters or alphabets. Great progress has been made in character recognition, as can be seen in the applications used on smartphones, tablets, notebooks, and PCs. The problems that arise with non-Latin languages are well known, and many researchers have studied their respective languages: Hindi [1], Chinese [2, 3, 4], and Arabic [5, 6]. Many researchers have also studied the Arabic alphabet specifically. Parvez and Mahmoud [7] surveyed text recognition in a paper titled Offline Arabic Handwritten Text Recognition: A Survey. Three researchers [8] published a paper titled Robust Named Entity Detection Using an Arabic Offline Handwriting Recognition System, whose focus was “on extraction of a predefined set of Arabic named entities (NEs) in Arabic handwritten text.” Fouad Slimane, Slim Kanoun, Adel M. Alimi, Jean Hennebert, and Rolf Ingold conducted research in 2010 that concentrated on printed rather than handwritten characters. The research by Fard, Moghadam, Bidgoli, and Hussain [9] was very promising and addressed the Persian language, yet its method was based on neural networks and is therefore not deterministic. Abu-Taieh [10] used an enhanced neural network method. Another promising study, by Aljarrah et al. [11], concentrated on printed rather than handwritten Arabic in order to produce an Arabic optical character recognition system. The need for Arabic character recognition is evident according to Ali and Sagheer [12], who relate it to the spread of smart mobile phones and tablets.

Supriana and Nasution [13] cited nine works, including their own, all of which are non-deterministic. Sarfraz, Ahmed, and Ghazi [14] developed a license plate recognition system. Izakian, Monadjemi, Ladani, and Zamanifar [15] used chain codes, while Abandah, Khedher, and Mohammed [16] used selected feature extraction techniques. Al-Taani and Al-Haj [17] used structural features, while Kapogiannopoulos and Kalouptsidis [18] used the skew angle. Zidouri [19] proposed a general method for Arabic letter segmentation, while Amin [20] used global features and a decision tree technique on printed, not handwritten, letters. Cowell and Hussain [21] used feature extraction.

3. Letters in Arabic language

To develop the proposed algorithm, the researchers studied four categorizations of Arabic letters. The first depends on the number of dots used with each letter. The second depends on the shape of the letter. The third is based on the shape of the letter as used at the beginning, middle, and end of a word. The fourth relies on the proportion method, a method used in Arabic calligraphy that is based on the rhombic dot. The categorizations are explained in Sections 3.1, 3.2, 3.3, and 3.4, respectively.

3.1. First categorization: number of dots in the letter

The use of dots to distinguish letters is familiar from Latin-based languages. In English, the lowercase letters i and j carry a dot on top of the letter. In Arabic, dots are used extensively; in fact, only 12 of the 28 letters are not dotted. The dotted letters use one, two, or three dots. Next, the concept of dotted letters is explained.

The first categorization is according to the number of dots used with each letter. This categorization splits the 28 letters (Table 1) into five branches, and from within it come two extra letters. The first branch is composed of 12 letters that have no dots at all. The second branch is composed of ten letters: eight have their dot above the body of the letter, and the other two have their dot below it. The third branch is composed of four letters: three have their two dots above the body of the letter, and the other one has two dots below it. The fourth branch has two letters with three dots above the body.

Branch | Dots | Count | Letters
First branch | No dots | 12 | ح د ر س ص ط ع ل م ه و ا
Second branch | One dot | 10 | ب ج خ ز ذ ض ظ غ ف ن
Third branch | Two dots | 3 | ي ت ق ة
Fourth branch | Three dots | 2 | ث ش
Fifth branch | With hamza | 3 | ؤ أ ك

Table 1.

Arabic alphabets (according to dots).

The fifth branch deals with the hamza: in one basic letter, “ك,” the hamza is part of the letter, while in “أ” and “ؤ” the hamza is added to another letter. The categorization is summarized in Table 1.
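The first categorization can be sketched as a simple lookup. The snippet below is only an illustration of Table 1: the names `DOT_BRANCHES` and `dot_branch` are hypothetical and do not come from the chapter, and the hamza-on-waw form is assumed to be the character “ؤ”.

```python
# Letters grouped by number of dots, transcribed from Table 1.
# Names and keys are illustrative assumptions, not the authors' code.
DOT_BRANCHES = {
    "no_dots": "ح د ر س ص ط ع ل م ه و ا".split(),
    "one_dot": "ب ج خ ز ذ ض ظ غ ف ن".split(),
    "two_dots": "ي ت ق ة".split(),
    "three_dots": "ث ش".split(),
    "hamza": "ؤ أ ك".split(),
}


def dot_branch(letter):
    """Return the Table 1 branch name for a letter, or None if it is not listed."""
    for branch, letters in DOT_BRANCHES.items():
        if letter in letters:
            return branch
    return None
```

A call such as `dot_branch("ش")` would select the three-dot branch, which is the first decision a rule tree built on this categorization needs to make.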

3.2. Second categorization: letter shape

The second categorization is according to the shape of the letter: it splits the 28 letters into 15 branches based on the body of the letter rather than the dots on it (see Table 2). Some, however, increase the number of shapes to 18 [22]. The first branch is made of four letters that are all very similar in shape: two are differentiated by one dot (one above the letter and one below it), one has two dots above it, and one has three dots above it. The second branch has two letters that are very similar to each other: one has a dot above it, while the other has no dot. The third and fourth branches follow the same idea: the letters are similar in shape, yet one dot makes the difference. The same happens with the fifth and sixth branches. The seventh branch has three letters that are similar in shape: one without a dot, one with a dot above it, and one with a dot below it. The eighth branch has two letters that are very similar in shape: one with one dot and the other with two dots. The ninth branch has two letters similar in shape: one with no dots and the other with three dots. The tenth has two letters: one with no dots and the other with two dots. The 11th branch has two letters: one with no hamza and the other with a hamza shape above the body of the letter. The letters of the 12th, 13th, 14th, and 15th branches are similar neither to each other nor to the rest of the letters.

Table 2.

Arabic alphabets (according to shape).

A note about the shape of the letters may be added here: nine letters include an enclosed space that resembles a circle. These nine letters are (م و ه ف ق ط ظ ص ض). The enclosed-space property is an important aspect of these nine letters and will be used in the algorithm at a later stage.
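As a minimal sketch, the enclosed-space property can be held as a set of the nine letters listed above. The names `ENCLOSED_SPACE_LETTERS` and `has_enclosed_space` are hypothetical; a real recognizer would have to derive this property from the drawn strokes rather than from the letter itself.

```python
# The nine letters that contain an enclosed, circle-like space (from the text).
# The constant and function names are illustrative assumptions.
ENCLOSED_SPACE_LETTERS = frozenset("موهفقطظصض")


def has_enclosed_space(letter):
    """Return True if the letter is one of the nine enclosed-space letters."""
    return letter in ENCLOSED_SPACE_LETTERS
```

For example, the property separates “ص” (enclosed space) from “س” (no enclosed space), two letters that otherwise share the same attributes.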

3.3. Third categorization: letter location in a word

The third categorization of the Arabic alphabet is based on the location of the letter in a word. Generally, the shapes of Arabic letters change according to the position of the letter in the word (beginning, middle, or end); some letters can be connected to the succeeding or preceding letter, and others cannot. The shapes of the letters can be generated with ligatures or character overlaps [23, 24]. Six letters, (ا د ذ ر ز و), must stand alone when they fall at the beginning of a word; the rest of the letters change form, as seen in Table 3. In the middle or at the end of a word, these six letters connect only to the preceding letter and do not change form. All letters used at the end of a word have two states: connected and stand-alone.

Table 3.

Arabic alphabets: stand alone, beginning, middle, and end of a word.

From the above, one can notice that six letters have special characteristics, namely (ا, و, د, ذ, ر, ز). When used at the beginning of a word, these letters must stand alone, and when they end a group of connected characters, they must be followed by an independent character. Hence, they connect only to their predecessor, not to their successor.
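The connection rule for these six letters can be sketched as a function that cuts a word into connected sub-words after every letter that does not join its successor. This is a simplified illustration: the names `NON_CONNECTING` and `split_into_subwords` are hypothetical, and the sketch considers only the six base letters named in the text, ignoring diacritics and hamza-carrying variants.

```python
# The six letters that connect only to their predecessor (from the text).
# Names are illustrative assumptions, not part of the chapter's algorithm.
NON_CONNECTING = frozenset("اودذرز")


def split_into_subwords(word):
    """Split a word after each letter that does not connect to its successor."""
    subwords = []
    current = ""
    for letter in word:
        current += letter
        if letter in NON_CONNECTING:
            # The cursive stroke ends here; the next letter starts a new piece.
            subwords.append(current)
            current = ""
    if current:
        subwords.append(current)
    return subwords
```

Under these assumptions, `split_into_subwords("كتاب")` yields two pieces, because the alif in third position does not connect to the final letter.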

The matrix in Table 4 represents the different combinations of all 28 letters. The first column of the matrix holds the letter that comes first in the pair, and the first row holds the letter that comes second. Each cell shows the two letter shapes and how they change as the order differs. The highlighted letters are the six letters mentioned previously, namely (ا, و, د, ذ, ر, ز), which stand alone if they appear at the beginning of a word. When two of these letters appear consecutively within a word, both are written as stand-alone, independent letters.

Table 4.

Matrix of the different combinations for all 28 letters.

3.4. Fourth categorization: letter proportion

To keep letters proportional to each other, calligraphers used two devices: the rhombic dot and the circle. Arabic calligraphy was used as decoration in mosques and castles since Islam forbids pictures and statues [22]; hence, there was a need to decorate with words. Proportion is an essential part of the written word. The circle proportion was suggested by Ibn Muqla, a well-known calligrapher from the tenth century [25]. Three elements are the basis of proportion in Arabic calligraphy [26, 27]:

  • The height of the alif, a straight, vertical stroke measuring 3–12 rhombic dots.

  • The width of the alif, (the rhombic dot) which is the square impression formed by pressing the tip of the calligrapher’s reed pen to paper (see Figures 1 and 3).

  • An imaginary circle with alif as its diameter, within which all Arabic letters could fit and be written (see Figure 2).
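The three elements above relate to each other by simple arithmetic, which can be sketched as follows. The class `Proportion` and its field names are hypothetical, and the sketch assumes that each rhombic dot contributes a fixed height `dot_size` (in any unit, for example pixels) to the alif.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proportion:
    """Illustrative proportion system; names are assumptions, not from the chapter."""

    dot_size: float  # height contributed by one rhombic dot, in any unit
    alif_dots: int   # height of the alif in rhombic dots (3 to 12 per the text)

    def __post_init__(self):
        if not 3 <= self.alif_dots <= 12:
            raise ValueError("alif height is expected to be 3 to 12 rhombic dots")

    @property
    def alif_height(self):
        # The alif is a vertical stroke measured in stacked rhombic dots.
        return self.alif_dots * self.dot_size

    @property
    def circle_radius(self):
        # The imaginary circle has the alif as its diameter.
        return self.alif_height / 2
```

With `Proportion(dot_size=4.0, alif_dots=7)`, the alif is 28 units tall and the imaginary circle has a radius of 14 units.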

Figure 1.

Example of measuring the letter by using rhomboid dots [28].

Figure 2.

Example of measuring the letter by using circle [28].

Figure 3.

The rhombic dot as a guide to proportions [22].

The circle is halved vertically and horizontally, and its diameter equals the height of the alif, the first letter of the Arabic alphabet. Looking back at Table 2 and at Figure 4, which represent the shape and form of the letters, the letters of the first branch lie in the lower half of the circle. The second branch takes up the first and third quarters. In the third branch, the letter falls in the first quarter of the circle, with the upper half of the diameter aligned with half the height of the alif. In the fourth branch, the letter lies on the left half of the circle. In the fifth branch, two letters are located in the fourth quarter of the circle. In the sixth branch, two letters also fall in the fourth quarter. In the seventh branch, the letter lies on the left half of the circle. In the eighth branch, both letters span the first and second quarters, with a circular part above the horizontal diameter. In the tenth branch, the letter takes the first quarter of the circle. The eleventh branch takes the second and third quarters. In the twelfth branch, the letter is at the center of the circle and uses the bottom half of the alif. The thirteenth branch is the alif itself, which is the diameter of the circle. The fourteenth branch takes the third quarter of the circle. The fifteenth branch takes the first and fourth quarters.
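To illustrate how a letter could be assigned to quarters of the circle, the sketch below counts stroke points per quarter. It is not the chapter's method: the function `occupied_quarters` and the `min_share` threshold are hypothetical, and the quarter numbering is assumed to follow the Cartesian convention (1 upper right, 2 upper left, 3 lower left, 4 lower right), which is at least consistent with the chapter placing the lower half of the circle in quarters 3 and 4.

```python
def occupied_quarters(points, center, min_share=0.1):
    """Return the set of quarters (1-4) holding at least min_share of the points.

    Assumes image coordinates (y grows downward) and Cartesian quarter
    numbering; both are assumptions made for this illustration.
    """
    cx, cy = center
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    for x, y in points:
        right = x >= cx
        upper = y <= cy  # smaller y is higher up in image coordinates
        if upper:
            counts[1 if right else 2] += 1
        else:
            counts[4 if right else 3] += 1
    total = len(points)
    if total == 0:
        return set()
    return {quarter for quarter, n in counts.items() if n / total >= min_share}
```

A stroke whose points all lie below the horizontal diameter, on both sides of the vertical one, would be reported as occupying quarters 3 and 4, matching the description of the first branch.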

Figure 4.

Proportions in Arabic calligraphy.

One can conclude by studying the second categorization and the proportion categorization through the following:

  • First, the second branch and the ninth branch (four letters) both take the same area of the circle.

  • Second, the third branch and the tenth branch (four letters) both use the first quarter of the circle.

  • Third, the fourth branch and seventh branch use the left edge of the circle, yet the differentiation between the two is that one letter is written from right to left and one letter is written from left to right as seen in Figure 5.

  • Fourth, the fifth and sixth branches (four letters) use the fourth quarter of the circle.

Figure 5.

Direction of writing with two circle edge letters.

These observations give insight into how to further classify the letters and arrange them into groups. The previous sections explained in detail the four categorizations used in the proposed algorithm. Each categorization is an essential building block for the rules of the algorithm.

3.5. Findings of the four categorizations

Based on the four categorizations explained above, a tree of rules can be built as seen in Figure 6. The rule tree has five branches: the first branch is for Arabic alphabets that contain no dots. The second branch is for letters with one dot. The third branch is for letters with two dots. The fourth branch includes all letters with three dots. The fifth branch is for letters with hamza.

Figure 6.

Categorization of the tree drawn based on the four categorizations.

For the first branch, which includes 12 letters, the logic of the fourth categorization was used to distinguish among the letters. Each letter in this branch was located in the quarters of the circle suggested in the fourth categorization. Two letters, (س ص), use the same quarters: both fall in the first and third quarters of the imaginary circle described in the fourth categorization. Still, letter (ص) has an enclosed space, while letter (س) has none; hence, the two letters are differentiated by the enclosed space. The enclosed-space property was explained previously with the second categorization (Section 3.2). The edge of the circle from the fourth categorization was used to differentiate the letters (ح ع) from the other ten letters. Furthermore, to differentiate between these two letters, the direction of writing was used, as explained previously in Figure 5.

The second branch, consisting of all letters with one dot, includes ten letters. The branch splits further into dot below and dot above the body of the letter. Again, the imaginary circle from the fourth categorization was used in this branch: the letters were located according to the quarters of the imaginary circle, as seen in Figure 6. The distinguishing feature of a letter falling on the edge of the imaginary circle is also used, as is the property of writing direction seen in Figure 5.

The third branch, consisting of all letters with two dots, included three letters. The branch splits further into dots below and dots above the body of the letter. There is only one letter in the whole alphabet that has two dots below it (ي), and there are three letters with two dots above the letter body. To distinguish between these three letters, the imaginary circle from the fourth categorization was used; again, none of them share the same quarters of the circle.

The fourth branch included all letters with three dots. It included only two letters, both of which have the dots above their body. Hence, the quarters of the imaginary circle were used to distinguish between them.

The fifth branch is the hamza (ء) branch which included three letters, and the distinguishing features were the imaginary circle quarters: letter (ك) is in the second and third quarters of the circle, letter (ؤ) is in the first and fourth quarters, and letter (أ) falls on the diameter of the circle.

The hamza (ء) can be seen on top of the letters (أ, ؤ, ك). The hamza is sometimes considered an independent letter, as in words like (علاء), and is used as part of a letter in other words. The hamza is the distinguishing feature between the two letters (ك, ل). Hence, it is treated as a semi-letter and is not listed in the alphabet.

4. Suggested algorithm

After studying all the previously mentioned categorizations, one can conclude that a deterministic algorithm can predict the character being drawn based on the matrix in Figure 7. Along with the matrix, the suggested algorithm is given in Figure 8; it reduces the determination of a letter to 38 rules.

Figure 7.

The property rules to define each letter in the Arabic alphabets.

Figure 8.

Determine_Character (input:one_character).

The suggested algorithm shown in Figure 8 is composed of five major if-then statements, which are based on the first categorization explained above and later summarized in Figure 7. The first if-then statement runs from line 1 to line 12 in Figure 8. It deals with all the letters that have no dots, whose locations in the circle are given by the proportion categorization. The enclosed-space property mentioned earlier is very important for distinguishing letter “س” from letter “ص”: both letters fall in the same location in the circle, Q1 and Q2, yet the latter has an enclosed space. Also, notice that “ح” and “ع” have the same properties; to differentiate them, the direction of writing is used [9].

The second major if-then statement starts at line 13 and deals with letters with one dot. As shown in Figure 6, the dot can be either above or below the body of the letter. For example, both letters “ن” and “ب” fall in the lower part of the circle, quarters 3 and 4. To differentiate them, the position of the dot is essential: the latter has the dot below, as seen in lines 23 and 20 in Figure 8. Line 21, in the same figure, deals with two letters that fall in essentially the same location and both have the dot above them; they are distinguished by the writing direction, left to right or right to left. The algorithm can be improved by eliminating line 24, since a single statement suffices, hence reducing the number of rules to 37.

The third major if-then statement starts at line 25 and ends at line 31 in Figure 8. It deals with letters that have two dots according to the first categorization and the matrix seen in Figure 7. The nested if-statement checks whether the two dots are above or below the letter. Three letters have two dots above them, yet their locations on the circle are clearly distinguishable; hence, using the attribute “enclosed space” was not necessary. Furthermore, one can eliminate line 31, since it handles the only letter in the alphabet that has two dots below it. Still, for the purpose of clarity, line 31 was left in the suggested algorithm. If line 31 were eliminated, the number of rules would be further reduced to 36.

The fourth major if-then statement deals with letters that have three dots; there are only two of them. Both letters can be distinguished based on their respective locations according to the proportion categorization. Again, line 34 can be eliminated but was left for the purpose of clarity. If line 34 were eliminated, the number of rules would be further reduced to 35.

The last major if-then statement starts at line 35 and deals with the case of the hamza. The hamza is an essential part of letter “ك” and is used with other letters such as “أ” and “ؤ.” The three letters are distinguished by their location within the circle according to the proportion categorization. Line 38 can be eliminated but was left to clarify the algorithm; eliminating it would reduce the number of rules to 34.
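The flavor of such nested if-then rules can be sketched in code. The sketch below is not the algorithm of Figure 8: it covers only the few rules spelled out in the text (the hamza letters, the single letter with two dots below, and the pair “ن”/“ب”), returns None for everything else, and uses hypothetical names (`Features`, `determine_character`) and a hypothetical attribute encoding. The order in which the branches are tested is also a choice made for this illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Features:
    """Hypothetical attribute record for one drawn character."""

    dots: int = 0                           # number of dots (0 to 3)
    dots_below: bool = False                # True if the dots sit below the body
    hamza: bool = False                     # True if a hamza is present
    quarters: FrozenSet[int] = frozenset()  # circle quarters the body occupies
    on_diameter: bool = False               # True if the body lies on the diameter


def determine_character(f: Features) -> Optional[str]:
    """Apply a small, illustrative subset of deterministic rules."""
    if f.hamza:
        if f.on_diameter:
            return "أ"
        if f.quarters == {2, 3}:
            return "ك"
        if f.quarters == {1, 4}:
            return "ؤ"
        return None
    if f.dots == 2 and f.dots_below:
        return "ي"  # the only letter with two dots below it
    if f.dots == 1 and f.quarters == {3, 4}:
        return "ب" if f.dots_below else "ن"
    return None  # rules not described in the text are left out
```

Because every path through the function depends only on the attribute values, the same attributes always yield the same letter, which is the sense in which such a rule set is deterministic.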

5. Conclusion

The proposed algorithm stems from needs that are increasingly apparent today. First, there is a rise in the use of handheld devices, whose character recognition methods mainly serve Latin-based languages. Arabic is one of the top five languages spoken in the world, used by more than 422 million native and non-native speakers. Arabic differs from other languages: it is cursive, it is written from right to left, and its letters change shape according to their position in the word. Hence, there is a dire need to develop a character recognition algorithm for the Arabic language. Many existing algorithms use artificial intelligence methods to recognize characters, which makes them non-deterministic, while the proposed algorithm is deterministic. This research presents four categorization methods that are employed to develop a deterministic algorithm. The first categorization depends on the number of dots used with each letter. The second depends on the shape of the letter. The third is based on the shape of the letter as used at the beginning, middle, and end of a word. The fourth relies on the proportion method, a method used in Arabic calligraphy that is based on the rhombic dot. The research then suggests a deterministic algorithm composed of 34 rules that predicts the character by using all four categorizations as attributes assembled in a matrix [29].

The proposed algorithm is only one piece of the whole puzzle, and many parts still need to be developed. One major part is the input section of the algorithm, which must exist for the puzzle to be complete. The input section needs to parse the word into segments and detect the shape of the letters, the dots, and the hamza. This research will serve as a building block for further research and development.

Biography

Evon M. O. Abu-Taieh, PhD, is an associate professor and an author/editor of four scholarly books, and has contributed to more than eight scholarly books. She has more than 40 published papers. She was previously the acting dean at the University of Jordan (Aqaba) for 3 years. Dr. Evon is an editorial board member of five renowned journals. She has more than 29 years of experience in education, computers, aviation, transport, AI, ciphering, routing algorithms, compression algorithms, multimedia, and simulation.

Auhood Abdullah Alfaries, PhD, is an assistant professor in the IT Department at King Saud University (KSU). Dr. Auhood received her PhD degree in Semantic Web and Web Services from the School of Computing and Information Systems, Brunel University, UK. She has held a number of IT-related academic and administrative positions both at KSU and at Princess Noura bint Abdulrahman University (PNU). She has experience in quality and program accreditation, having served in a number of quality-related roles since 2011. Auhood is associated with a number of important bodies: she is an associate of the UK Higher Education Academy, a member of the Institute of Electrical and Electronics Engineers (IEEE), and a member of the Saudi Computer Society. She is also an ABET program evaluator. She has participated as a conference and journal reviewer and as a member of a number of national and international workshop and conference program committees. She has served as the vice dean and dean of the E-Learning and Distance Learning Deanship at KSU and then at PNU for 2 years and has also served as the assistant general director and then the director of the General Directorate of Information and Communications Technology (ITC) at PNU. Currently, she is the dean of the College of Computer and Information Sciences. Auhood’s research interests include semantic web, ontology engineering, natural language processing, machine learning, and cloud computing. She is a member of IWAN Research Group.

Dr. Nabeel Mohammed Zanoon received his PhD degree in Computer Systems Engineering from the South-West State University, Kursk, Russia, in 2011. He has been a faculty member at Al-Balqa’ Applied University since 2011, where he is currently an assistant professor and the head of the Department of Applied Sciences as well as the director of the ICDL Computer Centre and Cisco Academy Branch of Aqaba University College. He has published research in several areas: security of e-banking, algorithm scheduling in grid and cloud, meta-grammar, hardware and architecture, fiber optical, and mobile ad hoc networks.

Issam Hamad Al Hadid is a lecturer at the University of Jordan. He completed his PhD degree at the University of Banking and Financial Sciences (Jordan) in 2010, obtained his MSc degree in Computer Science at Amman Arab University (Jordan) in 2005, and earned his BSc degree in Computer Science at Al-Zaytoonah University (Jordan) in 2002. He has published many research papers in different fields of science in refereed journals and international conference proceedings. His research focuses on self-healing architecture; his research interests also include AI, knowledge-based systems, security systems, compression techniques and algorithms, and information retrieval.

Alia Abu-Tayeh earned her PhD degree in 1995. She is a lecturer in the University of Jordan (Aqaba) and an ex-lecturer in King Hussein University. She published many scientific articles in renowned journals. Her interest ranges from linguistics to applied mathematics in computer and languages.

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Evon Abu-Taieh, Auhood Alfaries, Nabeel Zanoon, Issam H. Al Hadid and Alia M. Abu-Tayeh (June 27th 2018). A Deterministic Algorithm for Arabic Character Recognition Based on Letter Properties, Artificial Intelligence - Emerging Trends and Applications, Marco Antonio Aceves-Fernandez, IntechOpen, DOI: 10.5772/intechopen.76944. Available from:
