Long gap imputation in air quality (PM10) data set using improvised EOF-based method with roughness penalty approach / Shamihah Muhammad Ghazali

Reliable analysis requires good-quality data. However, missing data are a common problem in many areas of scientific research, including air quality monitoring. Missing values lower prediction accuracy and bias analysis results. This situation shows the importance...

Bibliographic Details
Main Author: Muhammad Ghazali, Shamihah
Format: Thesis
Language: English
Published: 2022
Subjects: Data processing; Indoor air pollution. Including indoor air quality
Online Access: https://ir.uitm.edu.my/id/eprint/66929/1/66929.pdf
https://ir.uitm.edu.my/id/eprint/66929/
id my.uitm.ir.66929
record_format eprints
spelling my.uitm.ir.66929 2022-09-19T07:00:23Z https://ir.uitm.edu.my/id/eprint/66929/ Long gap imputation in air quality (PM10) data set using improvised EOF-based method with roughness penalty approach / Shamihah Muhammad Ghazali Muhammad Ghazali, Shamihah Data processing Indoor air pollution. Including indoor air quality 2022 Thesis NonPeerReviewed text en https://ir.uitm.edu.my/id/eprint/66929/1/66929.pdf Long gap imputation in air quality (PM10) data set using improvised EOF-based method with roughness penalty approach / Shamihah Muhammad Ghazali. (2022) Masters thesis, thesis, Universiti Teknologi MARA (UiTM). <http://terminalib.uitm.edu.my/66929.pdf>
institution Universiti Teknologi Mara
building Tun Abdul Razak Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Mara
content_source UiTM Institutional Repository
url_provider http://ir.uitm.edu.my/
language English
topic Data processing
Indoor air pollution. Including indoor air quality
description Reliable analysis requires good-quality data. However, missing data are a common problem in many areas of scientific research, including air quality monitoring. Missing values lower prediction accuracy and bias analysis results, which makes imputation methods that replace missing values with estimates important. A literature search found little discussion of suitable imputation methods for Single-Site Temporal Time-Dependent (SSTTD) multivariate air quality data sets, particularly those with long sequences (gaps) of missing values. This study proposes several imputation methods based on empirical orthogonal functions (EOF) to fill this gap in the literature. EOF analysis, sometimes called Principal Component Analysis (PCA), is a promising technique for estimating missing values. However, the existing EOF imputation method has a drawback: it centres the data matrix on the statistical mean before the EOF computation. The approach needs to be adapted for air quality data sets, which often contain extreme observations caused by climatic variations and random processes. Centring the matrix on the median or the trimmed mean therefore appears more suitable. The study introduces several proposed EOF-based methods and investigates their ability to estimate missing values in long gaps, focusing on air quality (PM10) in an SSTTD multivariate data set from Malaysia. The existing EOF-based method with mean centring (EOF-mean) is compared with the proposed methods: EOF based on the median (EOF-median), EOF based on the trimmed mean (EOF-trimmean), and the newly applied Regularized Expectation Maximization Principal Component Analysis (R-EMPCA).
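The idea of swapping the centring statistic in an EOF-based imputation can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `column_centre` and `eof_impute`, the iterative truncated-SVD scheme, the 10% trimming proportion, the number of retained modes and the synthetic data are all assumptions made for illustration only.

```python
import numpy as np


def column_centre(X, method="trimmean", trim=0.1):
    """Column-wise location estimate computed from observed (non-NaN) values."""
    centres = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        col = np.sort(X[~np.isnan(X[:, j]), j])
        if method == "mean":
            centres[j] = col.mean()
        elif method == "median":
            centres[j] = np.median(col)
        elif method == "trimmean":
            k = int(trim * col.size)  # number of values cut from each tail
            centres[j] = col[k:col.size - k].mean()
        else:
            raise ValueError(f"unknown centring method: {method}")
    return centres


def eof_impute(X, n_modes=1, method="trimmean", trim=0.1,
               max_iter=200, tol=1e-6):
    """Fill NaNs by iterating a truncated SVD (EOF) reconstruction.

    Assumed scheme: centre the columns, set missing cells to zero, then
    repeatedly replace the missing cells with the rank-n_modes
    reconstruction until the update becomes small.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    centre = column_centre(X, method, trim)
    A = X - centre
    A[missing] = 0.0
    if missing.any():
        for _ in range(max_iter):
            U, s, Vt = np.linalg.svd(A, full_matrices=False)
            recon = (U[:, :n_modes] * s[:n_modes]) @ Vt[:n_modes]
            change = np.sqrt(np.mean((recon[missing] - A[missing]) ** 2))
            A[missing] = recon[missing]  # observed cells are never altered
            if change < tol:
                break
    return A + centre


# Synthetic hourly data (NOT real PM10): three correlated columns,
# with a 24-hour gap removed from the first column.
rng = np.random.default_rng(0)
hours = np.arange(240)
base = 50 + 20 * np.sin(2 * np.pi * hours / 24)
data = np.column_stack([base, 0.8 * base, 1.2 * base])
data += rng.normal(0, 2, data.shape)
gappy = data.copy()
gappy[100:124, 0] = np.nan
filled = eof_impute(gappy, n_modes=1, method="trimmean")
```

Passing `method="mean"` reproduces the conventional mean-centred variant in this sketch, so the three centring choices can be compared on the same gappy matrix.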
The study used real PM10 data from the Klang and Shah Alam air quality monitoring stations. The methods were evaluated by comparing the imputed values in artificially created missing-value data sets with the true observed values in the reference (complete) data set. The artificial data sets were created from the reference data set at both study locations, following several patterns defined by four percentages of missing values (5, 10, 20 and 30) and four long-gap sizes (12, 24, 168 and 720 consecutive missing hours). Based on several performance indicators, including RMSE, MAE, R-square and AI, R-EMPCA estimated the missing values most accurately, followed by EOF-trimmean. The estimates were then refined with a B-spline Roughness Penalty (RP) smoothing approach, giving the proposed R-EMPCA-RP and EOF-trimmean-RP imputation methods. The application of the RP approach proved fruitful.
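The evaluation procedure described above (remove a block of hours from a complete series, impute it, then score the estimates against the held-back values) can be sketched as follows. The helper names `insert_gap` and `scores` are hypothetical. Two definitions are assumptions, because the abstract does not spell them out: R-square is taken as the squared Pearson correlation, and AI is taken to mean Willmott's index of agreement.

```python
import numpy as np


def insert_gap(series, gap_length, start):
    """Return a copy of `series` with `gap_length` consecutive values set to NaN."""
    out = np.array(series, dtype=float)
    out[start:start + gap_length] = np.nan
    return out


def scores(observed, imputed):
    """Compare imputed values with the held-back observations."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(imputed, dtype=float)
    err = p - o
    o_bar = o.mean()
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        # Assumption: R-square as the squared Pearson correlation.
        "r2": float(np.corrcoef(o, p)[0, 1] ** 2),
        # Assumption: AI as Willmott's index of agreement.
        "ai": float(1 - np.sum(err ** 2)
                    / np.sum((np.abs(p - o_bar) + np.abs(o - o_bar)) ** 2)),
    }


# Synthetic 60-day hourly series (NOT real PM10) with a one-week gap.
hours = np.arange(24 * 60)
reference = 50 + 20 * np.sin(2 * np.pi * hours / 24)
gap_start, gap_length = 300, 168
gappy = insert_gap(reference, gap_length, gap_start)

# Placeholder imputation: fill the gap with the mean of the observed values.
gap = np.isnan(gappy)
imputed = np.where(gap, np.nanmean(gappy), gappy)
result = scores(reference[gap], imputed[gap])
```

Only the cells inside the gap are scored, so the indicators reflect imputation error and not agreement on values that were never removed. A constant fill such as the placeholder here has zero variance, so its `r2` entry is undefined (NaN); any real imputation method would be substituted at that step.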
format Thesis
author Muhammad Ghazali, Shamihah
title Long gap imputation in air quality (PM10) data set using improvised EOF-based method with roughness penalty approach / Shamihah Muhammad Ghazali
publishDate 2022
url https://ir.uitm.edu.my/id/eprint/66929/1/66929.pdf
https://ir.uitm.edu.my/id/eprint/66929/