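An Explainable Phishing Email Sample Generation Algorithm and Its Application in Machine Learning
First published: 2020-04-07
Abstract: Because phishing emails contain private information, they cannot be released publicly, which greatly hinders researchers from obtaining large-scale samples of real phishing emails. Meanwhile, the legitimate email datasets used in research are often collected from a single domain and differ greatly from the phishing samples, which easily leads to overfitting when the model is built. This paper proposes a phishing email generation method based on data insertion. It increases the number of phishing samples without changing the malicious attributes of the phishing emails, addresses the spatial bias that arises during model training, and to some extent narrows the difference in statistical features between benign and malicious samples. Based on the differences in email HTML content between the phishing dataset and the Enron dataset, this paper implements six different resource generators and one communication relationship selector, controls the generation of new samples through control-quantity sequence pairs, proposes a quantitative evaluation method and metric for classifier generalization performance, and verifies that the newly generated samples can be used to train classifiers with stronger generalization ability. The core contribution of this paper is a method for providing models with higher-quality data.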
Keywords: information security; sample generation; phishing email detection; spatial bias; machine learning
An Explainable Method of Phishing Emails Generation and Its Application in Machine Learning
Abstract: Phishing emails cannot be released because they contain private information, which greatly hinders researchers from obtaining large-scale samples of real phishing emails. At the same time, the legitimate email dataset is often collected from a single field and is quite different from the phishing samples, so it easily causes overfitting or spatial bias while the model is being built. This paper proposes a method for generating phishing emails based on data insertion, which increases the number of phishing samples without changing their malicious attributes, addresses the problem of spatial bias during model training, and reduces the difference in statistical characteristics between benign and malicious samples to a certain extent. Based on the differences in email HTML content between the phishing dataset and the Enron dataset, this paper implements six resource generators and a communication relationship selector. It controls the generation of new samples through control-quantity sequence pairs, proposes a quantitative evaluation method and metric for the classifier's generalization ability, and verifies that the newly generated samples can be used to train a classifier with stronger generalization ability. The main contribution of this paper is a method for providing the model with higher-quality data.
Keywords: information security; sample generation; phishing email detection; spatial bias; machine learning
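
The abstract describes the generation step only at a high level, so the following is a minimal illustrative sketch of insertion-based augmentation driven by control-quantity pairs, not the paper's implementation. Everything named here is an assumption made for illustration: the two generators (`make_link`, `make_image`), the `GENERATORS` table, the `augment` function, and the choice to insert before the closing `</body>` tag. The paper's six resource generators and its communication relationship selector are not specified in the abstract and are not reproduced.

```python
import random

# Hypothetical generators (assumed for illustration): each returns one
# benign HTML fragment. The paper's actual six generators are not given.
def make_link(rng):
    return f'<a href="https://example.com/page{rng.randint(1, 999)}">more</a>'

def make_image(rng):
    return f'<img src="https://example.com/img{rng.randint(1, 999)}.png" alt="">'

GENERATORS = {"link": make_link, "image": make_image}

def augment(html, pairs, seed=0):
    """Insert benign fragments into an email body without touching its
    original content. `pairs` is a list of (control, quantity) tuples,
    where `control` names a generator and `quantity` is how many
    fragments that generator should produce."""
    rng = random.Random(seed)
    extra = "".join(
        GENERATORS[control](rng)
        for control, quantity in pairs
        for _ in range(quantity)
    )
    head, sep, tail = html.rpartition("</body>")
    if not sep:
        # No closing body tag found: append the fragments at the end.
        return html + extra
    return head + extra + sep + tail
```

Calling `augment(email_html, [("link", 2), ("image", 3)])` would add two benign links and three benign images while leaving the original markup, including any malicious link, byte-for-byte intact. Varying the pairs and the seed yields several new samples from a single source email.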