Distributed indexing: performance analysis of Solr, Terrier and Katta Information Retrievals

Information Retrieval (IR) systems face a continuous challenge from the increasing size of datasets. Extremely large volumes of data from different sources are gathered each day, resulting in a huge increase in the scale of raw data available across the internet. Indexing, as the main functi...

Bibliographic Details
Main Authors: Aldailamy, Ali Y., Abdul Hamid, Nor Asila Wati, Abdulkarem, Mohammed
Format: Article
Language: English
Published: Faculty of Computer Science and Information Technology, University of Malaya 2018
Online Access: http://psasir.upm.edu.my/id/eprint/72313/1/Distributed%20indexing.pdf
http://psasir.upm.edu.my/id/eprint/72313/
https://ejournal.um.edu.my/index.php/MJCS/article/view/15490
id my.upm.eprints.72313
record_format eprints
spelling my.upm.eprints.72313 2020-05-04T01:34:52Z Aldailamy, Ali Y. and Abdul Hamid, Nor Asila Wati and Abdulkarem, Mohammed (2018) Distributed indexing: performance analysis of Solr, Terrier and Katta Information Retrievals. Malaysian Journal of Computer Science (spec. 2018). 87-104. ISSN 0127-9084. DOI 10.22452/mjcs.sp2018no1.7. Faculty of Computer Science and Information Technology, University of Malaya, 2018-12. Article, NonPeerReviewed, text, en.
institution Universiti Putra Malaysia
building UPM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Putra Malaysia
content_source UPM Institutional Repository
url_provider http://psasir.upm.edu.my/
language English
description Information Retrieval (IR) systems face a continuous challenge from the increasing size of datasets. Extremely large volumes of data from different sources are gathered each day, resulting in a huge increase in the scale of raw data available across the internet. Indexing, the main function of IR systems, is becoming a time-consuming problem, so efficient indexing of large volumes of data is now a critical requirement of modern IR systems. High-performance indexing is nowadays performed using the MapReduce programming model. MapReduce is a programming paradigm that enables massive processing and the distribution of large collections across multiple (hundreds or thousands of) commodity computers. To shed some light on this issue, this paper presents a detailed performance analysis of distributed indexing for Solr, Katta and Terrier in the context of MapReduce. In particular, the study compares and analyzes the distributed indexing performance of the three frameworks using 1 GB, 3 GB, 6 GB, and 9 GB subsets of the TREC dataset as the processing power increases. The experiments measure the average indexing time, throughput, speedup, and efficiency of the indexing process. The results show that Terrier performs best with large collections and scalable processing power, while Solr performs best with limited computing power and small document collections. Finally, the experimental results show that Katta produced the worst average indexing time among the three frameworks, but its speedup scales linearly with processing power and collection size.
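The description names throughput, speedup, and efficiency without defining them in this record. The sketch below uses the conventional parallel-computing definitions (throughput as data indexed per second, speedup as baseline time over time on n nodes, efficiency as speedup divided by n); whether the paper uses exactly these formulas is an assumption. The function names and all timing values are hypothetical and are not results from the study.

```python
def throughput(size_gb, seconds):
    """Data indexed per unit time, in GB per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_gb / seconds


def speedup(t_base, t_n):
    """Ratio of the baseline (single-node) time to the time on n nodes."""
    if t_n <= 0:
        raise ValueError("t_n must be positive")
    return t_base / t_n


def efficiency(t_base, t_n, n_nodes):
    """Speedup per node; 1.0 corresponds to ideal linear scaling."""
    if n_nodes <= 0:
        raise ValueError("n_nodes must be positive")
    return speedup(t_base, t_n) / n_nodes


# Hypothetical average indexing times in seconds, keyed by node count.
# These are illustrative placeholders, NOT measurements from the paper.
timings = {1: 1200.0, 2: 640.0, 4: 350.0}
collection_gb = 9.0  # one of the collection sizes named in the description

base = timings[1]
report = {
    n: {
        "throughput_gb_s": throughput(collection_gb, t),
        "speedup": speedup(base, t),
        "efficiency": efficiency(base, t, n),
    }
    for n, t in timings.items()
}

for n in sorted(report):
    row = report[n]
    print(
        f"{n} node(s): {row['throughput_gb_s']:.4f} GB/s, "
        f"speedup {row['speedup']:.2f}, efficiency {row['efficiency']:.2f}"
    )
```

Under these definitions, a framework whose speedup "scales linearly" keeps its efficiency close to 1.0 as nodes are added, even if its absolute indexing time is the slowest of the three.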
format Article
author Aldailamy, Ali Y.
Abdul Hamid, Nor Asila Wati
Abdulkarem, Mohammed
title Distributed indexing: performance analysis of Solr, Terrier and Katta Information Retrievals
publisher Faculty of Computer Science and Information Technology, University of Malaya
publishDate 2018