Processing skyline queries in centralised and distributed incomplete databases
Skyline queries provide a flexible query operator that returns the data items (skylines) that are not dominated by any other data item across all dimensions (attributes) of the database. Most existing skyline techniques determine the skylines by assuming that the values of dimens...
Main Author: | Alwan, Ali Amer |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2013 |
Online Access: | http://psasir.upm.edu.my/id/eprint/43004/1/FSKTM%202013%207R.pdf http://psasir.upm.edu.my/id/eprint/43004/ |
id |
my.upm.eprints.43004 |
---|---|
record_format |
eprints |
spelling |
my.upm.eprints.43004 2016-07-12T02:37:51Z http://psasir.upm.edu.my/id/eprint/43004/ Processing skyline queries in centralised and distributed incomplete databases Alwan, Ali Amer
2013-06 Thesis NonPeerReviewed application/pdf en http://psasir.upm.edu.my/id/eprint/43004/1/FSKTM%202013%207R.pdf Alwan, Ali Amer (2013) Processing skyline queries in centralised and distributed incomplete databases. PhD thesis, Universiti Putra Malaysia. |
institution |
Universiti Putra Malaysia |
building |
UPM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Putra Malaysia |
content_source |
UPM Institutional Repository |
url_provider |
http://psasir.upm.edu.my/ |
language |
English |
description |
Skyline queries provide a flexible query operator that returns the data items (skylines) that are not dominated by any other data item across all dimensions (attributes) of the database. Most existing skyline techniques determine the skylines by assuming that the values of every dimension are available (complete) for every data item. However, this assumption does not always hold, particularly for multidimensional databases, as some values may be missing. Incomplete data causes the dominance relation to lose its transitivity property, and dominance tests fail because some data items are incomparable with each other. Furthermore, missing values negatively affect the process of finding skylines, leading to high overhead due to exhaustive pairwise comparisons between data items. The problem becomes more complicated when multiple tables with incomplete data must be accessed to determine the skylines. Because the tables of a distributed database are spread over various locations, a join operation is needed to identify the skylines. Performing the join without any filtering results in a huge amount of data to be joined. Moreover, most previous work on skyline queries in incomplete databases focused only on retrieving skylines without estimating the missing values. In other words, the derived skylines have missing values in one or more dimensions. However, in many cases users are concerned about the values in these dimensions.
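The loss of transitivity described above can be illustrated with a small sketch. It assumes the dominance definition commonly used for incomplete data (compare two items only on the dimensions where both have a value) and assumes that smaller values are preferred; the function name `dominates` and the three sample items are hypothetical and are not taken from the thesis.

```python
from typing import Optional, Sequence

Item = Sequence[Optional[float]]  # None marks a missing value


def dominates(p: Item, q: Item) -> bool:
    """Return True if p dominates q on their commonly observed dimensions.

    Assumption: smaller is better. p must be no worse than q on every
    shared dimension and strictly better on at least one of them.
    """
    common = [(x, y) for x, y in zip(p, q) if x is not None and y is not None]
    if not common:
        return False  # no shared dimension: the items are incomparable
    return all(x <= y for x, y in common) and any(x < y for x, y in common)


# Three hypothetical items over three dimensions.
a = (1, None, 2)
b = (None, 1, 3)
c = (0, 2, None)

print(dominates(a, b))  # a beats b on the third dimension
print(dominates(b, c))  # b beats c on the second dimension
print(dominates(a, c))  # transitivity would require True here
print(dominates(c, a))  # instead c beats a on the first dimension
```

Here `a` dominates `b` and `b` dominates `c`, yet `c` dominates `a`, so the relation is cyclic. An item can therefore not be discarded for good as soon as one dominator is found, which is why naive processing falls back to exhaustive pairwise comparisons.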
This thesis proposes an efficient approach for identifying skylines in an incomplete database. The approach uses data clustering to partition the initial database into a set of distinct clusters. The derived clusters are then further divided into smaller groups, and the local skylines of each cluster are identified. Next, a set of virtual skylines, called k-dom, derived from the local skylines is merged to produce a global k-dom skyline, which is inserted at the top of each cluster to identify the candidate skylines. The final skylines are retrieved after conducting pairwise comparisons among the candidate skylines. The approach is extended to process skyline queries in incomplete distributed databases by pruning the input relations before conducting the join and skyline operations.
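The cluster-then-filter idea can be sketched in a much simplified form. The abstract does not state the clustering criterion, so this sketch assumes items are grouped by their pattern of missing dimensions; it omits the k-dom virtual skylines entirely and instead checks each candidate against every item, which is correct but saves fewer comparisons than the thesis reports. All function names are hypothetical, and smaller values are assumed to be better.

```python
from collections import defaultdict


def dominates(p, q):
    """p dominates q on shared non-missing dimensions (smaller is better)."""
    common = [(x, y) for x, y in zip(p, q) if x is not None and y is not None]
    if not common:
        return False
    return all(x <= y for x, y in common) and any(x < y for x, y in common)


def cluster_by_pattern(items):
    """Group items by which dimensions are present (assumed criterion)."""
    clusters = defaultdict(list)
    for item in items:
        pattern = tuple(v is not None for v in item)
        clusters[pattern].append(item)
    return dict(clusters)


def local_skyline(cluster):
    """Items of one cluster that no other item of that cluster dominates."""
    return [p for p in cluster if not any(dominates(q, p) for q in cluster)]


def candidate_skylines(items):
    """Union of the local skylines of all clusters."""
    return [p for c in cluster_by_pattern(items).values() for p in local_skyline(c)]


def final_skyline(items):
    """Keep only the candidates that no item in the whole dataset dominates."""
    return [p for p in candidate_skylines(items)
            if not any(dominates(q, p) for q in items)]


items = [(1, None, 2), (3, None, 4), (None, 1, 3), (None, 2, 1), (0, 2, None)]
print(cluster_by_pattern(items))
print(candidate_skylines(items))
print(final_skyline(items))
```

Within one cluster every item has values on the same dimensions, so dominance is transitive there and the local skyline can be computed safely. Across clusters it is not, which is why this sketch compares candidates with all items rather than only with each other.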
The thesis also proposes an approach for estimating the missing values in the skylines. The approach mines attribute correlations to generate approximate functional dependencies (AFDs) that capture the relationships between dimensions. In addition, the strength of the probability correlations between dimensions is computed in order to estimate the values. The skylines are then ranked according to the confidence of the generated AFDs and the strength of the probability correlations of the dimensions.
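As a rough illustration of AFD-based estimation, the sketch below scores a single dependency X → Y and uses it to fill in a missing Y. The abstract does not give the confidence formula, so this assumes a common definition: the fraction of complete rows that can be kept so that X determines Y exactly. The function `afd_confidence`, the column indices, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def afd_confidence(rows, x, y):
    """Return (confidence, mapping) for the approximate dependency x -> y.

    Only rows where both columns are present are counted. For each value
    of column x, the most frequent value of column y is kept; confidence
    is the share of counted rows that agree with that choice.
    """
    groups = defaultdict(Counter)
    for row in rows:
        if row[x] is not None and row[y] is not None:
            groups[row[x]][row[y]] += 1
    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return 0.0, {}
    kept = sum(max(counts.values()) for counts in groups.values())
    mapping = {xv: counts.most_common(1)[0][0] for xv, counts in groups.items()}
    return kept / total, mapping


rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3), ("b", None)]
confidence, mapping = afd_confidence(rows, x=0, y=1)

# Estimate the missing second column of the last row from its first column.
estimate = mapping.get(rows[-1][0])
print(confidence, estimate)
```

A higher confidence means the dependency holds for more of the observed rows, so an estimate derived from it could be ranked above one derived from a weaker dependency.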
Several experiments on synthetic and real datasets have been conducted. The results show that our proposed approach for processing skyline queries in an incomplete database reduces the number of pairwise comparisons by 75% to 93% and the processing time by 50% to 89% compared to the previous approach. The approach for processing skyline queries in incomplete distributed databases achieves a 56% to 88% reduction in processing time and an 84% to 90% reduction in network cost compared to the previous approach. Lastly, the results for imputing the missing values of the skylines show that our approach achieves a 25% error rate between the real missing values and the estimated values of the skylines. |
format |
Thesis |
author |
Alwan, Ali Amer |
spellingShingle |
Alwan, Ali Amer Processing skyline queries in centralised and distributed incomplete databases |
author_facet |
Alwan, Ali Amer |
author_sort |
Alwan, Ali Amer |
title |
Processing skyline queries in centralised and distributed incomplete databases |
title_short |
Processing skyline queries in centralised and distributed incomplete databases |
title_full |
Processing skyline queries in centralised and distributed incomplete databases |
title_fullStr |
Processing skyline queries in centralised and distributed incomplete databases |
title_full_unstemmed |
Processing skyline queries in centralised and distributed incomplete databases |
title_sort |
processing skyline queries in centralised and distributed incomplete databases |
publishDate |
2013 |
url |
http://psasir.upm.edu.my/id/eprint/43004/1/FSKTM%202013%207R.pdf http://psasir.upm.edu.my/id/eprint/43004/ |