Processing skyline queries in centralised and distributed incomplete databases

Bibliographic Details
Main Author: Alwan, Ali Amer
Format: Thesis
Language: English
Published: 2013
Online Access: http://psasir.upm.edu.my/id/eprint/43004/1/FSKTM%202013%207R.pdf
http://psasir.upm.edu.my/id/eprint/43004/
Description
Summary: Skyline queries provide a flexible query operator that returns the data items (skylines) that are not dominated by any other data item across all dimensions (attributes) of the database. Most existing skyline techniques determine the skylines by assuming that the values of all dimensions are available (complete) for every data item. However, this assumption does not always hold, particularly for multidimensional databases, where some values may be missing. Incomplete data causes the dominance relation to lose its transitivity property and causes dominance tests to fail, since some data items are incomparable with each other. Furthermore, missing values negatively affect the process of finding skylines, leading to high overhead due to exhaustive pairwise comparisons between the data items. The problem becomes more complicated when multiple tables with incomplete data must be accessed to determine the skylines. Because the tables of a distributed database are spread over various locations, a join operation is needed to identify the skylines. Performing the join without any filtering results in a huge amount of data to be joined. Furthermore, most previous work on skyline queries in incomplete databases has focused only on retrieving skylines without estimating the missing values. In other words, the derived skylines have missing values in one or more dimensions, although in many cases users are concerned about the values in these dimensions. This thesis proposes an efficient approach for identifying skylines in an incomplete database. The approach uses data clustering to partition the initial database into a set of distinct clusters. The derived clusters are then further divided into smaller groups, and the local skylines of each cluster are identified.
Next, a set of virtual skylines called k-dom, derived from the local skylines, is merged to derive a global k-dom skyline, which is inserted at the top of each cluster to identify the candidate skylines. The final skylines are retrieved after conducting pairwise comparisons among the candidate skylines. The approach is extended to process skyline queries in incomplete distributed databases by pruning the input relations before conducting the join and skyline operations. The thesis also proposes an approach to estimate the missing values in the skylines. This approach mines attribute correlations to generate approximate functional dependencies (AFDs) that capture the relationships between dimensions. In addition, the strength of the probability correlations between dimensions is computed in order to estimate the values. The skylines are then ranked according to the confidence of the generated AFD and the strength of the probability correlations of the dimensions. Several experiments on synthetic and real datasets have been conducted. The results show that the proposed approach for processing skyline queries in incomplete databases reduces the number of pairwise comparisons by 75% to 93% and the processing time by 50% to 89% compared to the previous approach. The approach for processing skyline queries in incomplete distributed databases achieves a 56% to 88% reduction in processing time and an 84% to 90% reduction in network cost compared to the previous approach. Lastly, the results for imputing the missing values of the skylines show that the approach achieves a 25% error rate between the real missing values and the estimated values of the skylines.
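
Illustration (not from the thesis): the loss of transitivity mentioned in the summary can be seen in a small Python sketch. It assumes a common dominance definition for incomplete data, which the summary does not spell out: two items are compared only on the dimensions where both have a value, and smaller values are preferred. The names `dominates` and `naive_skyline` and the three sample items are hypothetical.

```python
from typing import List, Optional, Tuple

# An item is a tuple of dimension values; None marks a missing value.
Item = Tuple[Optional[float], ...]


def dominates(p: Item, q: Item) -> bool:
    """Assumed dominance test for incomplete data (smaller is better).

    p dominates q if, on the dimensions where BOTH values are known,
    p is no worse everywhere and strictly better at least once.
    Items with no dimension in common are treated as incomparable.
    """
    common = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    if not common:
        return False
    return all(a <= b for a, b in common) and any(a < b for a, b in common)


def naive_skyline(items: List[Item]) -> List[Item]:
    """Exhaustive pairwise comparison: keep items that nothing dominates."""
    return [
        p for p in items
        if not any(dominates(q, p) for q in items if q is not p)
    ]


# Hypothetical sample data with one missing value per item.
p1: Item = (1, None, 5)
p2: Item = (None, 2, 3)
p3: Item = (4, 1, None)

# p3 dominates p2 and p2 dominates p1, yet p3 does not dominate p1;
# in fact p1 dominates p3, so the relation is cyclic.
print(dominates(p3, p2), dominates(p2, p1), dominates(p3, p1), dominates(p1, p3))
print(naive_skyline([p1, p2, p3]))
```

Under these assumptions every item in the sample is dominated by another one, so the exhaustive comparison returns an empty result. Cases like this are why pruning rules that rely on transitivity cannot be applied directly to incomplete data.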