An Explainable Method of Phishing Emails Generation and Its Application in Machine Learning

YU Gaoqing; FAN Wenqing; HUANG Wei



First published: 2020-04-07

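YU Gaoqing ¹
YU Gaoqing (b. 1994), female, master's student; main research interest: information security
FAN Wenqing ¹
FAN Wenqing (b. 1983), male, associate professor, master's supervisor; main research interest: information security
HUANG Wei ¹
HUANG Wei (b. 1983), male, associate professor, master's supervisor; main research interest: information security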

1. School of Computer Science and Cybersecurity, Communication University of China, Beijing 100024

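Abstract: Phishing emails contain private information and therefore cannot be made public, which greatly hinders researchers from obtaining large-scale samples of real phishing emails. Moreover, the legitimate email datasets used in research are often collected from one specific domain and differ greatly from the phishing samples, which easily leads to overfitting when a model is built. This paper proposes a phishing email generation method based on data insertion that increases the number of phishing samples without changing the malicious attributes of the phishing emails, addresses the spatial bias that arises during model training, and to some extent narrows the gap in statistical features between benign and malicious samples. Based on the differences in email HTML content between a phishing dataset and the Enron dataset, the paper implements six different resource generators and one communication relationship selector, controls the generation of new samples through control-quantity sequence pairs, proposes a quantitative evaluation method and metric for the generalization performance of classifiers, and verifies that the newly generated samples can be used to train classifiers with stronger generalization ability. The core contribution of this paper is a method for supplying models with higher-quality data.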

Keywords: information security; sample generation; phishing email detection; spatial bias; machine learning
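The data-insertion idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not specify the six resource generators or the format of the control-quantity pairs, so everything below is an assumption made for illustration: the three generator functions, the `GENERATORS` table, the `generate_sample` helper, and the choice to insert fragments just before `</body>` are hypothetical and are not the authors' implementation.

```python
import random

# Hypothetical resource generators (assumed for illustration, not from the paper).
# Each returns one benign HTML fragment.
def gen_paragraph(rng):
    words = ["meeting", "schedule", "report", "team", "update", "review"]
    return "<p>" + " ".join(rng.choice(words) for _ in range(5)) + "</p>"

def gen_image(rng):
    return '<img src="cid:img{}" alt="">'.format(rng.randint(0, 999))

def gen_line_break(rng):
    return "<br>"

GENERATORS = {"paragraph": gen_paragraph, "image": gen_image, "br": gen_line_break}

def generate_sample(html, plan, seed=0):
    """Insert benign fragments before </body>; the original content is not modified.

    `plan` is a list of (control, quantity) pairs: which generator to call
    and how many fragments it should contribute.
    """
    rng = random.Random(seed)
    fragments = []
    for control, quantity in plan:
        fragments.extend(GENERATORS[control](rng) for _ in range(quantity))
    inserted = "".join(fragments)
    idx = html.lower().rfind("</body>")
    if idx == -1:  # no closing body tag: append at the end
        return html + inserted
    return html[:idx] + inserted + html[idx:]

# Usage: the phishing link survives unchanged while benign content is added.
phish = '<html><body><a href="http://example.test/login">Verify account</a></body></html>'
plan = [("paragraph", 2), ("image", 1)]
new_sample = generate_sample(phish, plan, seed=42)
```

Because fragments are only inserted and nothing is deleted or rewritten, the original link and text remain byte-for-byte present in the new sample, which is one simple way to read "without changing the malicious attributes". Fixing the seed makes each generated sample reproducible from its plan.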


An Explainable Method of Phishing Emails Generation and Its Application in Machine Learning

YU Gaoqing ¹
FAN Wenqing ¹
HUANG Wei ¹

1. School of Computer Science and Cybersecurity, Communication University of China, Beijing 100024

Abstract: Phishing emails cannot be released because they contain private information, which greatly hinders researchers from obtaining large-scale samples of real phishing emails. At the same time, the legitimate email datasets used in research are often collected from a single domain and differ considerably from the phishing samples, which can easily cause overfitting or spatial bias when the model is built. This paper proposes a method for generating phishing emails based on data insertion. The method increases the number of phishing samples without changing their malicious attributes, addresses spatial bias during model training, and to some extent reduces the difference in statistical characteristics between benign and malicious samples. Based on the differences in email HTML content between the phishing dataset and the Enron dataset, the paper implements six resource generators and a communication relationship selector. It controls the generation of new samples through control-quantity sequence pairs, proposes a quantitative evaluation method and metric for a classifier's generalization ability, and verifies that the newly generated samples can be used to train a classifier with stronger generalization ability. The main contribution of this paper is a method for providing the model with higher-quality data.

Keywords: information security; sample generation; phishing email detection; spatial bias; machine learning
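The abstract mentions a quantitative metric for classifier generalization but does not define it. As a hypothetical stand-in, and not the paper's metric, the sketch below measures the drop in accuracy when a classifier is moved from the data distribution it was built on to a different one. The names `accuracy`, `generalization_gap`, and `biased_predict`, and the toy tag-count feature, are assumptions for illustration only.

```python
def accuracy(predict, samples):
    """Fraction of (feature, label) pairs that predict() labels correctly."""
    correct = sum(1 for x, y in samples if predict(x) == y)
    return correct / len(samples)

def generalization_gap(predict, in_domain, out_of_domain):
    """In-domain accuracy minus out-of-domain accuracy (smaller is better)."""
    return accuracy(predict, in_domain) - accuracy(predict, out_of_domain)

# Toy classifier that relies on a dataset-specific cue: the HTML tag count.
# Label 1 = phishing, label 0 = legitimate.
def biased_predict(tag_count):
    return 1 if tag_count > 10 else 0

# In the first set the cue separates the classes; in the second it does not.
in_domain = [(25, 1), (30, 1), (2, 0), (3, 0)]
out_of_domain = [(25, 1), (4, 1), (20, 0), (3, 0)]

gap = generalization_gap(biased_predict, in_domain, out_of_domain)
print(gap)  # 0.5: perfect in-domain, half correct out-of-domain
```

A large gap of this kind is what one would expect from a model that has learned a dataset artifact rather than phishing behavior, which is the failure mode the abstract attributes to mismatched benign and malicious datasets.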


Citation

YU Gaoqing, FAN Wenqing, HUANG Wei. An Explainable Method of Phishing Emails Generation and Its Application in Machine Learning [EB/OL]. Beijing: Sciencepaper Online [2020-04-07]. https://www.paper.edu.cn/releasepaper/content/202004-59.


Paper No.	202004-59