Automated dataset generation for training peer-to-peer machine learning classifiers

Peer-to-peer (P2P) traffic classification based on flow statistics has been proven accurate in detecting P2P traffic. However, a machine learning classifier is affected by the quality and recency of its training dataset. Classifying P2P traffic on-line therefore requires overcoming these limitations.

Bibliographic Details
Main Authors: Zarei, Roozbeh; Monemi, Alireza; Marsono, Muhammad Nadzir
Format: Article
Published: Springer Science and Business Media, LLC 2015
Subjects: TK Electrical engineering. Electronics Nuclear engineering
Online Access: http://eprints.utm.my/id/eprint/57928/
http://dx.doi.org/10.1007/s10922-013-9279-z
id my.utm.57928
record_format eprints
spelling my.utm.57928 2021-12-19T06:19:23Z Article PeerReviewed Zarei, Roozbeh and Monemi, Alireza and Marsono, Muhammad Nadzir (2015) Automated dataset generation for training peer-to-peer machine learning classifiers. Journal of Network and Systems Management, 23 (1). pp. 89-110. ISSN 1064-7570 http://dx.doi.org/10.1007/s10922-013-9279-z DOI:10.1007/s10922-013-9279-z
institution Universiti Teknologi Malaysia
building UTM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Malaysia
content_source UTM Institutional Repository
url_provider http://eprints.utm.my/
topic TK Electrical engineering. Electronics Nuclear engineering
description Peer-to-peer (P2P) traffic classification based on flow statistics has been proven accurate in detecting P2P traffic. However, a machine learning classifier is affected by the quality and recency of its training dataset. Classifying P2P traffic on-line therefore requires overcoming these limitations. In this paper, an automated training dataset generator for on-line P2P traffic classification is proposed to allow frequent classifier retraining. The two-stage training dataset generator (TSTDG) combines a 3-class heuristic classification and a 3-class statistical classification to generate a training dataset automatically. In the heuristic stage, traffic is classified as P2P, non-P2P, or unknown. In the statistical stage, a dual Decision Tree is built from the dataset generated in the heuristic stage to reduce the amount of traffic classified as unknown. The final training dataset consists of all flows classified in these two stages. The proposed system has been evaluated on traces captured from a campus network. The overall results show that the TSTDG can generate an accurate training dataset, classifying around 94% of total flows with high accuracy (98.59%) and a low false positive rate (1.27%).
format Article
author Zarei, Roozbeh
Monemi, Alireza
Marsono, Muhammad Nadzir
title Automated dataset generation for training peer-to-peer machine learning classifiers
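
The two-stage idea in the description can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the port lists, the `Flow` fields, the `margin` parameter, and every function name are hypothetical, and a depth-1 decision tree (a stump on a single flow statistic) stands in for the paper's dual Decision Tree, whose details the abstract does not give.

```python
from dataclasses import dataclass
from typing import List, Tuple

P2P, NON_P2P, UNKNOWN = "p2p", "non-p2p", "unknown"

# Hypothetical port lists, for illustration only.
P2P_PORTS = {6881, 4662}
WELL_KNOWN_PORTS = {25, 53, 80, 443}


@dataclass
class Flow:
    dst_port: int
    mean_pkt_size: float  # assumed flow statistic


def heuristic_stage(flow: Flow) -> str:
    """Stage 1: 3-class heuristic (P2P, non-P2P, or unknown)."""
    if flow.dst_port in P2P_PORTS:
        return P2P
    if flow.dst_port in WELL_KNOWN_PORTS:
        return NON_P2P
    return UNKNOWN


def fit_stump(samples: List[Tuple[float, str]]) -> Tuple[float, str, str]:
    """Fit a depth-1 tree: the threshold and orientation with fewest errors."""
    values = sorted({v for v, _ in samples})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        for below, above in ((P2P, NON_P2P), (NON_P2P, P2P)):
            errors = sum((below if v <= t else above) != y for v, y in samples)
            if best is None or errors < best[0]:
                best = (errors, t, below, above)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1], best[2], best[3]


def statistical_stage(flow: Flow, stump: Tuple[float, str, str], margin: float) -> str:
    """Stage 2: 3-class statistical step; flows near the threshold stay unknown."""
    t, below, above = stump
    if abs(flow.mean_pkt_size - t) < margin:
        return UNKNOWN
    return below if flow.mean_pkt_size <= t else above


def generate_training_set(flows: List[Flow], margin: float = 50.0) -> List[Tuple[Flow, str]]:
    """Keep every flow that either stage manages to classify."""
    labels = [heuristic_stage(f) for f in flows]
    seed = [(f.mean_pkt_size, y) for f, y in zip(flows, labels) if y != UNKNOWN]
    stump = fit_stump(seed)
    dataset = []
    for f, y in zip(flows, labels):
        if y == UNKNOWN:
            y = statistical_stage(f, stump, margin)
        if y != UNKNOWN:
            dataset.append((f, y))
    return dataset
```

In this sketch, the fraction of flows that end up in the returned dataset plays the role of the coverage figure quoted in the description; the accuracy and false positive rate would have to be measured against separately obtained ground truth.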