Ground truth dataset: Objectionable web content

Cyber parental control aims to filter objectionable web content and prevent children from being exposed to harmful content. Succeeding in detecting and blocking objectionable content depends heavily on the accuracy of the topic model. A reliable ground truth dataset is essential for building effecti...

Bibliographic Details
Main Authors: Altarturi, Hamza H. M.; Anuar, Nor Badrul
Format: Article
Published: MDPI 2022
Subjects: QA75 Electronic computers. Computer science
Online Access: http://eprints.um.edu.my/43853/
id my.um.eprints.43853
record_format eprints
spelling my.um.eprints.43853 2024-01-31T03:58:52Z http://eprints.um.edu.my/43853/ Ground truth dataset: Objectionable web content. Altarturi, Hamza H. M.; Anuar, Nor Badrul. QA75 Electronic computers. Computer science. MDPI 2022. Article. PeerReviewed. Altarturi, Hamza H. M. and Anuar, Nor Badrul (2022) Ground truth dataset: Objectionable web content. Data, 7 (11). ISSN 2306-5729, DOI https://doi.org/10.3390/data7110153.
institution Universiti Malaya
building UM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaya
content_source UM Research Repository
url_provider http://eprints.um.edu.my/
topic QA75 Electronic computers. Computer science
description Cyber parental control aims to filter objectionable web content and prevent children from being exposed to harmful material. Success in detecting and blocking objectionable content depends heavily on the accuracy of the topic model. A reliable ground truth dataset is essential for building effective cyber parental control models and for validating new detection methods. Here, the ground truth is the reference labeling of websites in the cyber parental control dataset as objectionable or unobjectionable. The lack of publicly accessible datasets with a reliable ground truth has prevented a fair and coherent comparison of the different methods proposed in the field of cyber parental control. This paper presents a ground truth dataset of 8000 labeled websites: 4000 objectionable and 4000 unobjectionable. Together, these websites comprise more than 2 million web pages. Creating the dataset involved several phases, including data collection, extraction, and labeling. Finally, the presence of bias is addressed using the kappa coefficient. The ground truth dataset is publicly available in the Mendeley repository. Dataset: 10.17632/f239556fkr.2; https://data.mendeley.com/datasets/f239556fkr. Dataset License: CC BY 4.0. © 2022 by the authors.
format Article
author Altarturi, Hamza H. M.
Anuar, Nor Badrul
title Ground truth dataset: Objectionable web content
publisher MDPI
publishDate 2022
url http://eprints.um.edu.my/43853/
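
The description in this record mentions a kappa coefficient for assessing labeling bias but does not say which variant was used or how it was computed. As a minimal sketch, assuming Cohen's kappa for two annotators who each label the same websites, the calculation might look like the following. The function name `cohen_kappa`, the label strings, and the sample labels are all hypothetical and are not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for eight websites (not real data from the dataset).
annotator_1 = ["obj", "obj", "unobj", "unobj", "obj", "unobj", "obj", "unobj"]
annotator_2 = ["obj", "obj", "unobj", "obj", "obj", "unobj", "unobj", "unobj"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the annotators agree on 6 of 8 websites (observed agreement 0.75) while chance agreement is 0.5, so kappa is 0.5. A value near 1 indicates agreement well beyond chance, and a value near 0 indicates agreement no better than chance.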