Term frequency and inverse document frequency with position score and mean value for mining web content outliers
Main Author: | |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2013 |
Online Access: | http://psasir.upm.edu.my/id/eprint/39114/1/FSKTM%202013%208%20IR.pdf http://psasir.upm.edu.my/id/eprint/39114/ |
Summary: | In the past few years, activity in the Web Content Mining area has expanded rapidly. However, the focus has been on technical aspects, visual design, and frequent web content patterns, while less frequent web content patterns, called outliers, have been undervalued. Mining Web Content Outliers is used to detect irrelevant web content within a web portal. Detecting outliers is especially important when a web portal has been hacked. So far, only a few approaches to Mining Web Content Outliers have been suggested, such as the Signed-with-Weight technique and mining through a mathematical approach. The mathematical approach is based on two-way rectangular representations and a correlation method. However, these approaches do not take advantage of position score and a stemmed domain dictionary. Both are very useful in mining web content outliers because they can affect, and in particular reduce, the relevance of documents.
Therefore, this study set out to resolve these problems in Mining Web Content Outliers by combining the strengths of word-based techniques, a position score weighting technique, and a stemmed domain dictionary. The existing weighting technique was transformed into the Term Frequency and Inverse Document Frequency with Position Score and Mean Value (TF.IDF.PSM) weighting technique by bringing into Web Content Mining a standard weighting technique from Information Retrieval, Term Frequency and Inverse Document Frequency (TF.IDF), and a weighting technique from Text Categorization, Term Frequency and Relevance Frequency (TF.RF). The technique starts by extracting the web pages, preprocessing them, and generating the full word profile. Depending on the character length of each word, the respective index of the stemmed domain dictionary is searched. If the word is present in both the dictionary and the document, the positive count is incremented by one. Then the word frequency in a web page, the word frequency across all web pages, and the position score are counted. Finally, the dissimilarity measure is computed to determine outliers. In the dissimilarity measure, TF.IDF.PSM is used not only to calculate and analyze the relevant words but also to account for the importance of the irrelevant words by assigning weights based on word position in a page. A statistical measure, the mean, is added to balance the weight of the position score.
The technique was tested on 431 web pages from the Course folder of the University of Wisconsin, provided by the World Wide Knowledge Base, while the benchmark dataset of 43 items comes from the Science Medical folder of the 20 Newsgroups dataset. The Term Frequency and Inverse Document Frequency (TF.IDF) weighting technique from Information Retrieval (IR) and the Term Frequency and Relevance Frequency (TF.RF) weighting technique from Text Categorization are used during the experimental phase, and the results are evaluated using two measures: percentage accuracy and the F1-measure. The experimental results show that the TF.IDF.PSM weighting technique achieves up to 98.95% accuracy, which is about 3.21% higher than the Signed-with-Weight technique. It also achieves an F1-measure of up to 94.19%, an 18.12% improvement over the Signed-with-Weight technique. |
---|
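The steps named in the summary (dictionary lookup by word length, term and document frequency, position score, mean, dissimilarity) can be illustrated with a minimal Python sketch. The summary does not give the actual TF.IDF.PSM formula, so everything specific here is an assumption made for illustration: the `POSITION_SCORE` values, the smoothed IDF form, the product `tf * idf * mean position score`, the definition of dissimilarity as the share of weight carried by non-dictionary words, and all function names. Words are assumed to be already stemmed, and each page is a list of hypothetical `(position, word)` pairs.

```python
import math
from collections import defaultdict

# Hypothetical position scores; the summary does not state the real values.
POSITION_SCORE = {"title": 3.0, "heading": 2.0, "body": 1.0}


def build_length_index(dictionary_words):
    """Group stemmed dictionary words by character length."""
    index = defaultdict(set)
    for word in dictionary_words:
        index[len(word)].add(word)
    return index


def in_dictionary(word, index):
    """Search only the dictionary bucket that matches the word's length."""
    return word in index.get(len(word), set())


def page_weights(page, pages):
    """Weight each word as TF * IDF * mean position score (assumed form)."""
    positions = defaultdict(list)
    for position, word in page:
        positions[word].append(POSITION_SCORE[position])
    weights = {}
    for word, scores in positions.items():
        tf = len(scores)  # occurrences of the word in this page
        # Number of pages containing the word; at least 1 to avoid dividing by 0.
        df = max(1, sum(1 for other in pages if any(w == word for _, w in other)))
        idf = math.log(1 + len(pages) / df)  # smoothed IDF, always positive
        mean_position = sum(scores) / len(scores)
        weights[word] = tf * idf * mean_position
    return weights


def dissimilarity(page, pages, index):
    """Share of a page's total weight carried by words outside the dictionary."""
    weights = page_weights(page, pages)
    total = sum(weights.values())
    if total == 0:
        return 1.0  # an empty page has no evidence of relevance
    relevant = sum(w for word, w in weights.items() if in_dictionary(word, index))
    return 1.0 - relevant / total


# Toy data, invented for illustration only.
pages = [
    [("title", "course"), ("body", "lecture"), ("body", "exam")],
    [("title", "course"), ("body", "syllabus"), ("body", "bonus")],
    [("title", "casino"), ("body", "bonus"), ("body", "poker")],
]
index = build_length_index({"course", "lecture", "exam", "syllabus"})
scores = [dissimilarity(page, pages, index) for page in pages]
# Pages sorted from most to least dissimilar; the first is the outlier candidate.
ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
```

In this toy setup, a page made up entirely of dictionary words scores 0, a page with none scores 1, and mixed pages fall in between. Averaging the position scores of a repeated word keeps one prominent occurrence from dominating, which is one plausible reading of how the mean balances the position score.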
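The two evaluation measures named in the summary, accuracy and the F1-measure, follow their standard definitions and can be computed from a confusion matrix. The sketch below assumes that outlier pages are the positive class, which the summary does not state; the function name and the example counts are hypothetical and are not the thesis results.

```python
def accuracy_and_f1(tp, fp, tn, fn):
    """Return (accuracy, F1) from confusion-matrix counts.

    tp: outliers correctly flagged, fp: normal pages wrongly flagged,
    tn: normal pages correctly passed, fn: outliers missed.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Invented counts for illustration only.
example_accuracy, example_f1 = accuracy_and_f1(tp=8, fp=2, tn=88, fn=2)
```

Because outliers are rare compared with normal pages, accuracy can stay high even when many outliers are missed, which is why reporting F1 alongside accuracy is informative.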