Automated dataset generation for training peer-to-peer machine learning classifiers
Peer-to-peer (P2P) classifications based on flow statistics have been proven accurate in detecting P2P traffic. A machine learning classification is affected by the quality and recency of the training dataset used. Hence, classifying P2P traffic on-line requires removing these limitations. In...
Main Authors: | Zarei, Roozbeh; Monemi, Alireza; Marsono, Muhammad Nadzir |
---|---|
Format: | Article |
Published: | Springer Science and Business Media, LLC, 2015 |
Subjects: | TK Electrical engineering. Electronics Nuclear engineering |
Online Access: | http://eprints.utm.my/id/eprint/57928/ http://dx.doi.org/10.1007/s10922-013-9279-z |
id |
my.utm.57928 |
---|---|
record_format |
eprints |
spelling |
my.utm.579282021-12-19T06:19:23Z http://eprints.utm.my/id/eprint/57928/ Automated dataset generation for training peer-to-peer machine learning classifiers Zarei, Roozbeh Monemi, Alireza Marsono, Muhammad Nadzir TK Electrical engineering. Electronics Nuclear engineering Peer-to-peer (P2P) classifications based on flow statistics have been proven accurate in detecting P2P traffic. A machine learning classification is affected by the quality and recency of the training dataset used. Hence, to classify P2P traffic on-line requires the removal of these limitations. In this paper, an automated training dataset generation for an on-line P2P traffic classification is proposed to allow frequent classifier retraining. A two-stage training dataset generator (TSTDG) is proposed by combining a 3-class heuristic and a 3-class statistical classification to automatically generate a training dataset. In the heuristic stage, traffic is classified as P2P, non-P2P, or unknown. In the statistical stage, a dual Decision Tree is built based on a dataset generated in the heuristic stage to reduce the amount of classified unknown traffic. The final training dataset is generated based on all flows that are classified in these two stages. The proposed system has been evaluated on traces captured from a campus network. The overall results show that the TSTDG can generate an accurate training dataset by classifying around 94 % of total flows with high accuracy (98.59 %) and a low false positive rate (1.27 %). Springer Science and Business Media, LLC 2015 Article PeerReviewed Zarei, Roozbeh and Monemi, Alireza and Marsono, Muhammad Nadzir (2015) Automated dataset generation for training peer-to-peer machine learning classifiers. Journal of Network and Systems Management, 23 (1). pp. 89-110. ISSN 1064-7570 http://dx.doi.org/10.1007/s10922-013-9279-z DOI:10.1007/s10922-013-9279-z |
institution |
Universiti Teknologi Malaysia |
building |
UTM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Teknologi Malaysia |
content_source |
UTM Institutional Repository |
url_provider |
http://eprints.utm.my/ |
topic |
TK Electrical engineering. Electronics Nuclear engineering |
description |
Peer-to-peer (P2P) classifications based on flow statistics have been proven accurate in detecting P2P traffic. A machine learning classification is affected by the quality and recency of the training dataset used. Hence, classifying P2P traffic on-line requires removing these limitations. In this paper, automated training dataset generation for on-line P2P traffic classification is proposed to allow frequent classifier retraining. A two-stage training dataset generator (TSTDG) is proposed that combines a 3-class heuristic and a 3-class statistical classification to automatically generate a training dataset. In the heuristic stage, traffic is classified as P2P, non-P2P, or unknown. In the statistical stage, a dual Decision Tree is built from the dataset generated in the heuristic stage to reduce the amount of traffic classified as unknown. The final training dataset is generated from all flows that are classified in these two stages. The proposed system has been evaluated on traces captured from a campus network. The overall results show that the TSTDG can generate an accurate training dataset by classifying around 94 % of total flows with high accuracy (98.59 %) and a low false positive rate (1.27 %). |
format |
Article |
author |
Zarei, Roozbeh Monemi, Alireza Marsono, Muhammad Nadzir |
author_sort |
Zarei, Roozbeh |
title |
Automated dataset generation for training peer-to-peer machine learning classifiers |
publisher |
Springer Science and Business Media, LLC |
publishDate |
2015 |
url |
http://eprints.utm.my/id/eprint/57928/ http://dx.doi.org/10.1007/s10922-013-9279-z |
_version_ |
1720436856541151232 |
score |
13.159267 |
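
The two-stage flow described in the abstract above (a 3-class heuristic stage, then a statistical stage trained on the heuristic output) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Flow` fields, the port lists, the `min_purity` parameter, and in particular the statistical stage, where a single depth-1 decision tree with abstaining leaves stands in for the paper's dual Decision Tree, whose construction the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List

P2P, NON_P2P, UNKNOWN = "p2p", "non-p2p", "unknown"


@dataclass
class Flow:
    dst_port: int          # hypothetical feature used by the heuristic stage
    mean_pkt_size: float   # hypothetical flow statistic (bytes)
    label: str = UNKNOWN


# Hypothetical port lists; the paper's actual heuristics are not given in the abstract.
WELL_KNOWN_PORTS = {25, 53, 80, 443}
P2P_PORTS = {4662, 6881, 6882}


def heuristic_stage(flow: Flow) -> str:
    """Stage 1: 3-class heuristic (P2P, non-P2P, or unknown)."""
    if flow.dst_port in WELL_KNOWN_PORTS:
        return NON_P2P
    if flow.dst_port in P2P_PORTS:
        return P2P
    return UNKNOWN


def leaf_label(leaf: List[Flow], min_purity: float) -> str:
    """Label a leaf with its majority class, or abstain if it is too mixed."""
    if not leaf:
        return UNKNOWN
    p2p_share = sum(f.label == P2P for f in leaf) / len(leaf)
    if p2p_share >= min_purity:
        return P2P
    if 1.0 - p2p_share >= min_purity:
        return NON_P2P
    return UNKNOWN


def fit_stump(labelled: List[Flow], min_purity: float = 0.9) -> Callable[[Flow], str]:
    """Stage 2 stand-in: a depth-1 decision tree on mean_pkt_size."""
    if not labelled:
        raise ValueError("need at least one heuristically labelled flow")

    def split(t: float):
        low = [f for f in labelled if f.mean_pkt_size < t]
        high = [f for f in labelled if f.mean_pkt_size >= t]
        return low, high

    def errors(t: float) -> int:
        # Misclassified flows = size of the minority class in each leaf.
        return sum(
            min(sum(f.label == P2P for f in leaf), sum(f.label == NON_P2P for f in leaf))
            for leaf in split(t)
        )

    threshold = min(sorted({f.mean_pkt_size for f in labelled}), key=errors)
    low, high = split(threshold)
    low_label, high_label = leaf_label(low, min_purity), leaf_label(high, min_purity)
    return lambda f: low_label if f.mean_pkt_size < threshold else high_label


def generate_training_set(flows: List[Flow], min_purity: float = 0.9) -> List[Flow]:
    """Run both stages and keep every flow that ends up with a class label."""
    for f in flows:
        f.label = heuristic_stage(f)
    classify = fit_stump([f for f in flows if f.label != UNKNOWN], min_purity)
    for f in flows:
        if f.label == UNKNOWN:
            f.label = classify(f)  # may remain UNKNOWN if the leaf abstains
    return [f for f in flows if f.label != UNKNOWN]


if __name__ == "__main__":
    flows = [Flow(80, 1200.0), Flow(6881, 300.0), Flow(50000, 280.0)]
    for f in generate_training_set(flows):
        print(f.dst_port, f.label)
```

In this sketch, flows that the heuristic cannot label are passed to the tree, and only flows labelled by either stage enter the final training set. Flows that still come out as unknown are left out, which corresponds to the abstract's point that only part of the total traffic is classified.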