An improved framework for content and link-based web spam detection: a combined approach


Bibliographic Details
Main Author: Shahzad, Asim
Format: Thesis
Language: English
Published: 2021
Subjects:
Online Access: http://eprints.uthm.edu.my/1777/2/ASIM%20SHAHZAD%20-%20declaration.pdf
http://eprints.uthm.edu.my/1777/1/ASIM%20SHAHZAD%20-%2024p.pdf
http://eprints.uthm.edu.my/1777/3/ASIM%20SHAHZAD%20-%20fulltext.pdf
http://eprints.uthm.edu.my/1777/
id my.uthm.eprints.1777
record_format eprints
spelling my.uthm.eprints.1777 2021-10-11T07:58:48Z 2021-05 Thesis NonPeerReviewed Shahzad, Asim (2021) An improved framework for content and link-based web spam detection: a combined approach. Doctoral thesis, Universiti Tun Hussein Onn Malaysia.
institution Universiti Tun Hussein Onn Malaysia
building UTHM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Tun Hussein Onn Malaysia
content_source UTHM Institutional Repository
url_provider http://eprints.uthm.edu.my/
language English
topic QA76.75-76.765 Computer software
description Search engines (SE) are now the primary tool for finding information on the Web. However, web spammers exploit them for financial gain, using web spamming techniques to rank irrelevant and spam pages above relevant pages in the search engine results pages (SERPs). These top-ranked but unrelated pages give the user insufficient or inappropriate information, and web spamming therefore severely degrades search engine quality. Researchers have introduced several web spam detection techniques, based on content-based features, link-based features, label propagation, label refinement, click-based web spam detection, and real-time web spam detection. However, identifying all spam pages on the Web with high accuracy remains an unsolved problem. This work proposes a content-based web spam detection framework, a link-based web spam detection framework, and a combined approach that identifies both types of web spam with high accuracy and can detect the newly evolved link pyramid. The content-based framework uses three proposed and two improved content-based algorithms. The link-based framework first exposes the relationship network behind link spamming and then applies the paid-links database algorithm, the spam signals algorithm, and the improved link farms algorithm to identify link-based web spam. Finally, combining the content-based and link-based frameworks enhances the accuracy of web spam detection. The performance of the proposed combined approach was evaluated and compared with the J48 classifier, the C4.5 decision tree classifier, the SVM classifier, and a heuristic combined approach. Experiments were conducted to obtain the threshold values using the proposed collection architecture on the well-known WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets. The results show that the proposed methods outperform the other methods, achieving 82.1% precision and an F-measure of 80.6%, which illustrates the framework's effectiveness and applicability.
format Thesis
author Shahzad, Asim
title An improved framework for content and link-based web spam detection: a combined approach
publishDate 2021
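
The description reports precision and F-measure but not recall. The short Python sketch below shows how the two reported figures relate, assuming that "F-measure" means the balanced F1 score (the harmonic mean of precision and recall); the record does not state which variant the thesis uses. The function names are illustrative and do not come from the thesis.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def recall_from_precision_and_f1(precision: float, f1: float) -> float:
    """Invert F1 = 2PR / (P + R) to obtain R = F1 * P / (2P - F1)."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this precision/F1 pair")
    return f1 * precision / denominator


# Figures quoted in the description (as fractions).
reported_precision = 0.821
reported_f_measure = 0.806

# Assumption: the reported F-measure is the balanced F1 score.
implied_recall = recall_from_precision_and_f1(reported_precision, reported_f_measure)
print(f"implied recall: {implied_recall:.3f}")
```

Under that assumption the implied recall is about 0.79; if the thesis uses a weighted F-measure instead, this figure does not apply.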