Improving named entity recognition accuracy for gene and protein in biomedical text literature

The task of recognising biomedical named entities in natural language documents, called biomedical Named Entity Recognition (NER), is the focus of many researchers due to the complex nature of such texts. This complexity includes the issues of character-level, word-level and word order variations. In this...

Bibliographic Details
Main Authors: Tohidi, Hossein; Ibrahim, Hamidah; Azmi Murad, Masrah Azrifah
Format: Article
Language: English
Published: Inderscience Publishers 2014
Online Access: http://psasir.upm.edu.my/id/eprint/37986/1/Improving%20named%20entity%20recognition%20accuracy%20for%20gene%20and%20protein%20in%20biomedical%20text%20literature.pdf
http://psasir.upm.edu.my/id/eprint/37986/
http://www.inderscienceonline.com/doi/abs/10.1504/IJDMB.2014.064523
id my.upm.eprints.37986
record_format eprints
citation Tohidi, Hossein and Ibrahim, Hamidah and Azmi Murad, Masrah Azrifah (2014) Improving named entity recognition accuracy for gene and protein in biomedical text literature. International Journal of Data Mining and Bioinformatics, 10 (3). pp. 239-268. ISSN 1748-5673; ESSN: 1748-5681. DOI: 10.1504/IJDMB.2014.064523. Peer reviewed.
institution Universiti Putra Malaysia
building UPM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Putra Malaysia
content_source UPM Institutional Repository
url_provider http://psasir.upm.edu.my/
language English
description The task of recognising biomedical named entities in natural language documents, called biomedical Named Entity Recognition (NER), is the focus of many researchers due to the complex nature of such texts. This complexity includes the issues of character-level, word-level and word order variations. In this study, an approach for recognising gene and protein names that handles the above issues is proposed. Similar to previous related works, our approach is based on the assumption that a named entity occurs within a noun group. The strength of our proposed approach lies in a Statistical Character-based Syntax Similarity (SCSS) algorithm, which measures the similarity between the extracted candidates and the well-known biomedical named entities from the GENIA V3.0 corpus. The proposed approach is evaluated and the results are satisfactory. For recognition of both gene and protein names, we achieved 97.2% for precision (P), 95.2% for recall (R) and 96.1% for F-measure, while for recognition of protein names alone we achieved 98.1% for P, 97.5% for R and 97.7% for F-measure.
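The precision, recall and F-measure figures in the description can be related with a short calculation. The sketch below assumes that "F-measure" means the balanced F1 score (the harmonic mean of P and R); the record does not say which variant the authors used, and the function name `f_measure` is only illustrative. Because the inputs are the rounded percentages quoted above, the recomputed values may differ from the reported ones in the last digit.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): the harmonic mean of precision and recall.

    Assumption: the record does not state which F-measure variant was used.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded percentages quoted in the description (P, R).
reported = {
    "gene and protein names": (97.2, 95.2),
    "protein names only": (98.1, 97.5),
}

for task, (p, r) in reported.items():
    print(f"{task}: F = {f_measure(p, r):.2f}")
```

The harmonic mean always lies between P and R and is pulled toward the smaller of the two, which is why the F-measure for the combined task sits closer to the recall than a simple average would suggest.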
format Article
author Tohidi, Hossein
Ibrahim, Hamidah
Azmi Murad, Masrah Azrifah
title Improving named entity recognition accuracy for gene and protein in biomedical text literature
publisher Inderscience Publishers
publishDate 2014