Jorge Guerra Torres, Carlos Catania, Veas Eduardo Enrique
Active learning approach to label network traffic datasets
Journal of Information Security and Applications, Elsevier, Elsevier, 2019
Modern Network Intrusion Detection systems depend on models trained with up-to-date labeled data. Yet, the process of labeling a network traffic dataset is specially expensive, since expert knowledge is required to perform the annotations. Visual analytics applications exist that claim to considerably reduce the labeling effort, but the expert still needs to ponder several factors before issuing a label. And, most often the effect of bad labels (noise) in the final model is not evaluated. The present article introduces a novel active learning strategy that learns to predict labels in (pseudo) real-time as the user performs the annotation. The system called RiskID, presents several innovations: i) a set of statistical methods summarize the information, which is illustrated in a visual analytics application, ii) that interfaces with the active learning strategy forbuilding a random forest model as the user issues annotations; iii) the (pseudo) real-time predictions of the model are fed back visually to scaffold the traffic annotation task. Finally, iv) an evaluation framework is introduced that represents a complete methodology for evaluating active learning solutions, including resilience against noise.