A phonetically rich and balanced lexical corpus using zipfian distribution for an under resourced language / Aminath Farshana

In recent times, speech technology and its related applications are becoming a popular topic among researchers. There are many applications of speech technology developed for businesses, military, transport, aerospace, PDAs, and so on. The importance of speech technology-based applications has promp...

Bibliographic Details
Main Author: Aminath, Farshana
Format: Thesis
Published: 2018
Subjects:
Online Access: http://studentsrepo.um.edu.my/11964/1/Aminath.pdf
http://studentsrepo.um.edu.my/11964/2/Aminath.pdf
http://studentsrepo.um.edu.my/11964/
id my.um.stud.11964
record_format eprints
spelling my.um.stud.11964 2021-01-30T19:51:29Z A phonetically rich and balanced lexical corpus using zipfian distribution for an under resourced language / Aminath Farshana Aminath, Farshana QA76 Computer software TA Engineering (General). Civil engineering (General) 2018-04 Thesis NonPeerReviewed application/pdf http://studentsrepo.um.edu.my/11964/1/Aminath.pdf application/pdf http://studentsrepo.um.edu.my/11964/2/Aminath.pdf Aminath, Farshana (2018) A phonetically rich and balanced lexical corpus using zipfian distribution for an under resourced language / Aminath Farshana. Masters thesis, University of Malaya. http://studentsrepo.um.edu.my/11964/
institution Universiti Malaya
building UM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaya
content_source UM Student Repository
url_provider http://studentsrepo.um.edu.my/
topic QA76 Computer software
TA Engineering (General). Civil engineering (General)
description In recent times, speech technology and its related applications have become a popular topic among researchers. Many applications of speech technology have been developed for business, the military, transport, aerospace, PDAs, and so on. The importance of speech technology-based applications has prompted researchers to improve the techniques behind these applications for many languages around the world. However, only a limited number of languages have benefited from speech technology applications such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems. One of the main reasons for this technological gap between languages is the lack of basic resources such as lexical and speech corpora, which are the foundation for developing this technology. Though researchers have managed to assemble these basic resources for some languages, the methods used for accumulating them are not as efficient as those used for the established languages. Some of these methods also depend on the types of resources needed for developing lexical and speech corpora. This research focuses on developing a lexical corpus for an under-resourced language that lacks the basic resources. It also focuses on improving the quality of the corpus, in terms of phonetic coverage and corpus size, for that under-resourced language. Developing a lexical corpus involves collecting a large initial corpus and selecting suitable sentences from it. The selected set of sentences must cover all possible phonetic units of the language and ensure a uniform distribution of those units. This research proposed a novel method for the development of a lexical corpus for Dhivehi, a language that lacks key resources for developing speech technology-based applications. The method uses a Zipfian distribution to select sentences from the large initial corpus.
From 109,208 sentences collected from web sources, 360 sentences were selected to ensure a phonetically rich and balanced lexical corpus. The performance of the developed corpus is evaluated in terms of phonetic coverage and corpus size. Phonetic coverage is measured by finding the sum of the sequence of phonemes in the corpus. The size of the corpus is evaluated using cosine similarity, which compares the frequency distribution of the phonemes occurring in the final corpus with that of the large initial corpus. The closer the similarity between the final and the large corpus, the better the phonetic coverage. High similarity between the two corpora indicates that the corpus developed using the proposed method can perform as efficiently as the large initial corpus. The similarity between the phonetic unit distribution of the selected sentences and the phoneme distribution of the large corpus was 0.988. Since the two distributions are this similar, the optimized corpus can perform as efficiently as the larger corpus. The performance of the proposed method was also evaluated by comparing its results with an existing benchmark method (a greedy algorithm). The results show that the sentences selected using the proposed method cover all the phonetic units and that the resulting corpus is 14 times smaller than the corpus developed using the benchmark method.
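The evaluation described in the abstract can be illustrated with a small sketch. It is not the thesis's implementation, and it does not reproduce the Zipfian selection method, whose details the record does not give. It only shows, under stated assumptions, how a coverage check, a cosine similarity between unit frequency distributions, and a generic greedy coverage baseline might look. Each non-space character stands in for a phoneme, and the function names and the toy corpus are hypothetical.

```python
from collections import Counter
from math import sqrt


def phoneme_counts(sentences):
    """Count unit frequencies. Assumption: each non-space character
    stands in for one phoneme (a real system would use a phonemizer)."""
    counts = Counter()
    for s in sentences:
        counts.update(ch for ch in s if not ch.isspace())
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two frequency vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cover(sentences):
    """Generic greedy baseline: repeatedly pick the sentence that adds
    the most units not yet covered, until every unit is covered."""
    needed = set(phoneme_counts(sentences))
    chosen = []
    remaining = list(sentences)
    while needed and remaining:
        best = max(remaining, key=lambda s: len(needed & set(s)))
        gained = needed & set(best)
        if not gained:
            break
        chosen.append(best)
        needed -= gained
        remaining.remove(best)
    return chosen


# Toy corpus (hypothetical data, not from the thesis).
large = ["ab ab ab", "abc", "cd", "a"]
small = greedy_cover(large)

# Coverage: does the small set contain every unit seen in the large set?
covered = set(phoneme_counts(small)) == set(phoneme_counts(large))

# Balance: how close is the small set's distribution to the large set's?
similarity = cosine_similarity(phoneme_counts(small), phoneme_counts(large))
```

In this sketch a similarity near 1 would mean the selected sentences preserve the relative unit frequencies of the full collection, which is the sense in which the abstract reports 0.988 for the 360 selected sentences.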
format Thesis
author Aminath, Farshana
author_facet Aminath, Farshana
author_sort Aminath, Farshana
title A phonetically rich and balanced lexical corpus using zipfian distribution for an under resourced language / Aminath Farshana
publishDate 2018