A model for islamic istilahnet in Malay manuscripts for big data analytics and linguistics consortium

The research on Malay manuscripts content in Information Technology is limited especially on statistical approach as compared to rule-based approach. This research aims to propose a hybrid model, which combines the two approaches for jawi-roman transliteration of Malay manuscript contents. This rese...

Full description

Saved in:
Bibliographic Details
Main Authors: Abu Seman, Muhamad Sadry, Wan Mamat, Wan Ali @ Wan Yusoff, Noordin, Mohamad Fauzan, Othman, Roslina
Format: Monograph
Language:English
Published: 2019
Subjects:
Online Access:http://irep.iium.edu.my/73052/1/Research%20Report%20RIGS%202015%20-%20MSAS.pdf
http://irep.iium.edu.my/73052/
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:The research on Malay manuscripts content in Information Technology is limited especially on statistical approach as compared to rule-based approach. This research aims to propose a hybrid model, which combines the two approaches for jawi-roman transliteration of Malay manuscript contents. This research assesses the quality scores of utilizing a prevalent statistical model, Statistical Model Transliteration (SMT) for jawi-roman transliteration. This research utilizes exploratory approach. The data used were extracted from 3 Malay manuscripts: Bidāyat al-Mubtadī bi-Faḍlillāh al-Muhdī, Kashf al-Asrār and Hujjat al-Balighah, acquired from ISTAC with a total of 3,420 rows of data transliterated into old jawi, modern jawi and roman form. Quality scores of Bilingual Evaluation Understudy (BLEU) score and word error rate are used for evaluation of SMT output. The findings show that E-Jawi.net word error rate for old jawi-roman is 55.8% error while modern jawi-roman is 32.42% on the initial data. Hence, the research opted for human expert to develop a quality corpus for SMT consisting of multiple transliterations of the manuscript contents in modern jawi and roman. Significantly, the model is dependable on a quality parallel corpus.