A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides
Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and in a variety of applications in the food industry. As a result, the development of computational models for rapidly and accurately ide...
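Main Authors: | Charoenkwan, Phasit; Chotpatiwetchkul, Warot; Lee, Vannajan Sanghiran; Nantasenamat, Chanin; Shoombuatong, Watshara |
---|---|
Format: | Article |
Published: | Nature Portfolio, 2021 |
Subjects: | QD Chemistry |
Online Access: | http://eprints.um.edu.my/26800/ |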
id |
my.um.eprints.26800 |
---|---|
record_format |
eprints |
spelling |
my.um.eprints.26800 2022-04-14T07:03:55Z http://eprints.um.edu.my/26800/ A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides Charoenkwan, Phasit Chotpatiwetchkul, Warot Lee, Vannajan Sanghiran Nantasenamat, Chanin Shoombuatong, Watshara QD Chemistry Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and in a variety of applications in the food industry. As a result, the development of computational models for rapidly and accurately identifying novel TPPs from a large number of uncharacterized protein sequences is desirable. Although computational models have already been developed for characterizing thermophilic proteins, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, for improving model predictability and interpretability. First, an up-to-date and high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments revealed that SCMTPP had a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906-0.910) and 2-17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with accuracy and MCC of 0.865 and 0.731, respectively. Finally, the SCMTPP-derived propensity scores were used to elucidate the critical physicochemical properties for protein thermostability enhancement.
In terms of interpretability and generalizability, comparative results showed that SCMTPP was effective for identifying and characterizing TPPs. We have implemented the proposed predictor as a user-friendly online web server at http://pmlastack.pythonanywhere.com/SCMTPP to allow easy access to the model. SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and for guiding experimental characterization of TPPs. Nature Portfolio 2021-12 Article PeerReviewed Charoenkwan, Phasit and Chotpatiwetchkul, Warot and Lee, Vannajan Sanghiran and Nantasenamat, Chanin and Shoombuatong, Watshara (2021) A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides. Scientific Reports, 11 (1). ISSN 2045-2322, DOI https://doi.org/10.1038/s41598-021-03293-w. |
institution |
Universiti Malaya |
building |
UM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Malaya |
content_source |
UM Research Repository |
url_provider |
http://eprints.um.edu.my/ |
topic |
QD Chemistry |
spellingShingle |
QD Chemistry Charoenkwan, Phasit Chotpatiwetchkul, Warot Lee, Vannajan Sanghiran Nantasenamat, Chanin Shoombuatong, Watshara A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides |
description |
Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and in a variety of applications in the food industry. As a result, the development of computational models for rapidly and accurately identifying novel TPPs from a large number of uncharacterized protein sequences is desirable. Although computational models have already been developed for characterizing thermophilic proteins, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, for improving model predictability and interpretability. First, an up-to-date and high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments revealed that SCMTPP had a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906-0.910) and 2-17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with accuracy and MCC of 0.865 and 0.731, respectively. Finally, the SCMTPP-derived propensity scores were used to elucidate the critical physicochemical properties for protein thermostability enhancement. In terms of interpretability and generalizability, comparative results showed that SCMTPP was effective for identifying and characterizing TPPs. We have implemented the proposed predictor as a user-friendly online web server at http://pmlastack.pythonanywhere.com/SCMTPP to allow easy access to the model.
SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and for guiding experimental characterization of TPPs. |
format |
Article |
author |
Charoenkwan, Phasit Chotpatiwetchkul, Warot Lee, Vannajan Sanghiran Nantasenamat, Chanin Shoombuatong, Watshara |
author_facet |
Charoenkwan, Phasit Chotpatiwetchkul, Warot Lee, Vannajan Sanghiran Nantasenamat, Chanin Shoombuatong, Watshara |
author_sort |
Charoenkwan, Phasit |
title |
A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides |
publisher |
Nature Portfolio |
publishDate |
2021 |
url |
http://eprints.um.edu.my/26800/ |
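The abstract in this record describes SCMTPP as a scoring card method (SCM) applied to propensity scores of g-gap dipeptides and reports accuracy and MCC. The following is only a rough sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the toy sequences, the threshold, and the choice to estimate initial propensities as the scaled difference in mean dipeptide composition between the two classes. Any score optimization step the published method may use is omitted.

```python
from collections import Counter
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
DIPEPTIDE_SET = set(DIPEPTIDES)


def ggap_composition(seq, g=0):
    """Fraction of each residue pair (seq[i], seq[i+g+1]); g=0 means adjacent residues."""
    pairs = (seq[i] + seq[i + g + 1] for i in range(len(seq) - g - 1))
    counts = Counter(p for p in pairs if p in DIPEPTIDE_SET)  # skip non-standard residues
    total = sum(counts.values())
    if total == 0:
        return {dp: 0.0 for dp in DIPEPTIDES}
    return {dp: counts[dp] / total for dp in DIPEPTIDES}


def initial_propensities(positives, negatives, g=0):
    """Hypothetical initial estimate: mean composition in positives minus negatives,
    min-max scaled to the range 0..1000."""
    def mean_comp(seqs):
        comps = [ggap_composition(s, g) for s in seqs]
        return {dp: sum(c[dp] for c in comps) / len(comps) for dp in DIPEPTIDES}

    pos, neg = mean_comp(positives), mean_comp(negatives)
    diff = {dp: pos[dp] - neg[dp] for dp in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all differences are equal
    return {dp: 1000.0 * (diff[dp] - lo) / span for dp in DIPEPTIDES}


def scm_score(seq, propensities, g=0):
    """Composition-weighted sum of dipeptide propensity scores."""
    comp = ggap_composition(seq, g)
    return sum(comp[dp] * propensities[dp] for dp in DIPEPTIDES)


def predict(seq, propensities, threshold, g=0):
    """Label a sequence as thermophilic when its score exceeds the threshold."""
    return scm_score(seq, propensities, g) > threshold


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data, invented purely for illustration (not from the SCMTPP dataset).
toy_positives = ["KEKEKE"]
toy_negatives = ["QSQSQS"]
toy_scores = initial_propensities(toy_positives, toy_negatives)
print(predict("KEKEKE", toy_scores, threshold=500.0))
print(predict("QSQSQS", toy_scores, threshold=500.0))
```

One reason a scheme of this kind is easy to interpret is that the 400 dipeptide scores can be read directly: a high score marks a residue pair that is more common in the positive class under the chosen estimate.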