New Distance Measures for Arabic Handwritten Text Recognition

In recent years, optical character recognition has attracted scientists and researchers. Latin, Chinese, Korean and Thai characters have been researched more thoroughly than Arabic characters. Research concentrated first on printed and typeset characters until acceptable recognition accuracy...

Bibliographic Details
Main Author: El-Bashir, Mohammad Said Mansur
Format: Thesis
Language: English
Published: 2008
Online Access: http://psasir.upm.edu.my/id/eprint/5233/1/FSKTM_2008_8a.pdf
http://psasir.upm.edu.my/id/eprint/5233/
id my.upm.eprints.5233
record_format eprints
spelling my.upm.eprints.5233 2013-05-27T07:21:21Z http://psasir.upm.edu.my/id/eprint/5233/ New Distance Measures for Arabic Handwritten Text Recognition El-Bashir, Mohammad Said Mansur 2008 Thesis NonPeerReviewed application/pdf en http://psasir.upm.edu.my/id/eprint/5233/1/FSKTM_2008_8a.pdf El-Bashir, Mohammad Said Mansur (2008) New Distance Measures for Arabic Handwritten Text Recognition. PhD thesis, Universiti Putra Malaysia. English
institution Universiti Putra Malaysia
building UPM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Putra Malaysia
content_source UPM Institutional Repository
url_provider http://psasir.upm.edu.my/
language English
description In recent years, optical character recognition has attracted scientists and researchers. Latin, Chinese, Korean and Thai characters have been researched more thoroughly than Arabic characters. Research concentrated first on printed and typeset characters until acceptable recognition accuracy was achieved. Nowadays, most research has moved toward handwritten character recognition. Arabic text is cursive, as the characters in a sub-word are connected to each other. This makes the recognition process more complex, and a segmentation procedure is required to separate the connected characters from each other before they can be recognized. The extracted features must be chosen carefully, since they play a very important role in the segmentation and recognition process. The recognition accuracy depends mostly on the classifier applied and on the segmentation procedure. In this research work, a framework for recognizing Arabic handwriting is presented. Two approaches are proposed. The first approach is designed to recognize the word as a whole, to fit applications such as sorting postal mail and bank checks, where the number of words or digits that need to be recognized is limited. The words may include country and city names written on postal mail, or reserved words or amounts used on bank checks. The second approach represents the general case, in which any type of document or handwritten text can be recognized. Both approaches include a preprocessing stage comprising image enhancement and normalization. The most significant features are extracted by applying Principal Component Analysis. A new segmentation-based approach is designed and implemented for the second approach to segment the text into characters, whereas the first approach uses either no segmentation or only a simple segmentation procedure. The recognition step is performed by applying the nearest neighbor algorithm.
Four different distance measures are used with the nearest neighbor algorithm: the first norm, the second norm (Euclidean), and two newly proposed norms called ENorm and EEuclidean. The two new norms are derived from the first and second norm respectively, and using them enhances the recognition accuracy. Both approaches have been tested, and a number of experiments are discussed in detail. The first approach is evaluated on four datasets: sub-words containing two characters, sub-words containing three characters, Latin letters, and Hindi digits, which are used with the Arabic language nowadays. Recognition accuracy is the attribute used for measurement, and 8-fold cross-validation is used to test it. The average recognition accuracy is 94.8% for the digits, 78% for the three-character sub-words, 77% for the two-character sub-words and 67% for Latin letters. The second approach achieved a recognition accuracy of 73% without dot detection and 77% with dot detection.
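The description above outlines a pipeline of Principal Component Analysis for feature extraction, nearest neighbor classification under the first norm or the second (Euclidean) norm, and 8-fold cross-validation for measuring accuracy. The following Python sketch illustrates that general pipeline under stated assumptions: it uses NumPy, random synthetic vectors in place of handwriting images, and hypothetical function names (pca_fit, pca_transform, nearest_neighbor, cross_validate) that do not come from the thesis. The record does not define ENorm or EEuclidean, so they are not shown, and the sketch should not be read as the author's implementation.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the training mean and the top principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project samples onto the retained principal directions."""
    return (X - mean) @ components.T


def nearest_neighbor(train_X, train_y, query_X, norm_order):
    """Label each query with the class of its closest training sample.

    norm_order=1 is the first norm (city block); norm_order=2 is Euclidean.
    """
    diffs = query_X[:, None, :] - train_X[None, :, :]
    dists = np.linalg.norm(diffs, ord=norm_order, axis=2)
    return train_y[np.argmin(dists, axis=1)]


def cross_validate(X, y, n_folds=8, n_components=5, norm_order=2, seed=0):
    """Mean nearest neighbor accuracy over n_folds held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Fit PCA on the training folds only, so the test fold stays unseen.
        mean, comps = pca_fit(X[train_idx], n_components)
        train_Z = pca_transform(X[train_idx], mean, comps)
        test_Z = pca_transform(X[test_idx], mean, comps)
        pred = nearest_neighbor(train_Z, y[train_idx], test_Z, norm_order)
        accuracies.append(np.mean(pred == y[test_idx]))
    return float(np.mean(accuracies))


if __name__ == "__main__":
    # Synthetic stand-in for image feature vectors: 3 classes, 20 dimensions.
    rng = np.random.default_rng(42)
    centers = rng.normal(scale=5.0, size=(3, 20))
    y = np.repeat(np.arange(3), 40)
    X = centers[y] + rng.normal(size=(120, 20))
    for order in (1, 2):
        acc = cross_validate(X, y, norm_order=order)
        print(f"norm {order}: accuracy {acc:.3f}")
```

Fitting the PCA projection inside each fold, rather than once on all samples, is a design choice of this sketch that keeps the held-out fold out of the feature extraction step. A different distance measure could be compared by swapping the norm computation inside nearest_neighbor.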
format Thesis
author El-Bashir, Mohammad Said Mansur
title New Distance Measures for Arabic Handwritten Text Recognition
publishDate 2008
url http://psasir.upm.edu.my/id/eprint/5233/1/FSKTM_2008_8a.pdf
http://psasir.upm.edu.my/id/eprint/5233/