The comparisons of OCR tools: a conversion case in the Malaysian Hansard Corpus development / Anis Nadiah Che Abdul Rahman ...[et al.]
Optical Character Recognition (OCR) is a computational technology that recognizes printed characters using photoelectric devices and computer software. It works by converting previously scanned images or texts into machine-readable, editable text. There are...
Main Authors: | Che Abdul Rahman, Anis Nadiah; Ho Abdullah, Imran; Zainuddin, Intan Safinaz; Jaludin, Azhar |
---|---|
Format: | Article |
Language: | English |
Published: | Penerbit UiTM 2019 |
Subjects: | Neural networks (Computer science) |
Online Access: | https://ir.uitm.edu.my/id/eprint/61451/1/61451.pdf https://ir.uitm.edu.my/id/eprint/61451/ https://mjoc.uitm.edu.my/ |
id |
my.uitm.ir.61451 |
---|---|
record_format |
eprints |
spelling |
my.uitm.ir.61451 2022-06-14T03:09:44Z https://ir.uitm.edu.my/id/eprint/61451/ Penerbit UiTM 2019-12 Article PeerReviewed text en https://ir.uitm.edu.my/id/eprint/61451/1/61451.pdf The comparisons of OCR tools: a conversion case in the Malaysian Hansard Corpus development / Anis Nadiah Che Abdul Rahman ...[et al.]. (2019) Malaysian Journal of Computing (MJoC), 4 (2): 7. pp. 335-348. ISSN 2600-8238 https://mjoc.uitm.edu.my/ |
institution |
Universiti Teknologi Mara |
building |
Tun Abdul Razak Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Teknologi Mara |
content_source |
UiTM Institutional Repository |
url_provider |
http://ir.uitm.edu.my/ |
language |
English |
topic |
Neural networks (Computer science) |
description |
Optical Character Recognition (OCR) is a computational technology that recognizes printed characters using photoelectric devices and computer software. It works by converting previously scanned images or texts into machine-readable, editable text. Various OCR tools are available on the market for commercial and research use, some free of charge and others requiring purchase. The accuracy of an OCR tool's results also depends on its pre-processing and segmentation algorithms. This study investigates the performance of OCR tools in converting the Parliamentary Reports of Hansard Malaysia for the development of the Malaysian Hansard Corpus (MHC). Comparing four OCR tools, the study converted ten Parliamentary Reports, totalling 62 pages, to measure the conversion accuracy and error rate of each tool. All of the tools were used to convert Adobe Portable Document Format (PDF) files into plain text (txt) files. The objective is to give an overview, based on accuracy and error rate, of how each OCR tool works and how it can be used to assist corpus building. The results indicate that the tools differ in their accuracy and error rates when converting whole documents from PDF into plain text. The study proposes that corpus building becomes easier and more manageable when a researcher understands how an OCR tool works and can therefore choose the best tool before corpus development begins. |
format |
Article |
author |
Che Abdul Rahman, Anis Nadiah; Ho Abdullah, Imran; Zainuddin, Intan Safinaz; Jaludin, Azhar |
title |
The comparisons of OCR tools: a conversion case in the Malaysian Hansard Corpus development / Anis Nadiah Che Abdul Rahman ...[et al.] |
publisher |
Penerbit UiTM |
publishDate |
2019 |
url |
https://ir.uitm.edu.my/id/eprint/61451/1/61451.pdf https://ir.uitm.edu.my/id/eprint/61451/ https://mjoc.uitm.edu.my/ |
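The abstract compares OCR tools by conversion accuracy and error rate but does not say how those figures were computed. The sketch below shows one common way to score OCR output against a hand-corrected reference text: edit distance divided by reference length, at character or word level. It is an assumption for illustration, not the authors' method; the function names, the tool labels `tool_a` and `tool_b`, and the sample strings are all hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete an item from the reference
                curr[j - 1] + 1,     # insert an item from the OCR output
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, ocr_output, unit="char"):
    """Edit distance normalised by reference length (unit: 'char' or 'word')."""
    if unit == "word":
        ref, hyp = reference.split(), ocr_output.split()
    else:
        ref, hyp = list(reference), list(ocr_output)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def accuracy(reference, ocr_output, unit="char"):
    """Accuracy as 1 - error rate, floored at 0 for very noisy output."""
    return max(0.0, 1.0 - error_rate(reference, ocr_output, unit))


# Hypothetical sample: one reference line and the output of two made-up tools.
reference = "parliamentary report"
outputs = {
    "tool_a": "parliamentary report",
    "tool_b": "parliarnentary rep0rt",
}
for name, text in outputs.items():
    print(name, "char error rate:", round(error_rate(reference, text), 4))
```

Character-level scoring is sensitive to single misread glyphs, while word-level scoring (`unit="word"`) is closer to what matters for corpus word counts; which one suits a given corpus project is a design choice.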