Label noise filtering algorithm based on random thresholds in a double interval
First published: 2020-12-30
Abstract: In recent years, supervised learning has been widely applied across many fields, yet common datasets often contain a large amount of label noise. Label noise readily causes models to overfit and generalize poorly. Existing label noise filtering algorithms alleviate these problems to some extent, but they still suffer from weak recognition ability and low filtering efficiency. To address these issues, this paper proposes a random threshold division algorithm within a double interval. The algorithm takes the loss values of samples in the early stage of training as the basis for distinguishing classes. Within a preset first-level interval, it uses the largest and second-largest differences between adjacent samples to construct a second-level interval, which narrows the range of the random threshold and increases its reliability. In addition, to account for the error introduced by a single threshold, the method draws on ensemble learning and performs multiple random threshold divisions, which blurs the threshold boundary and makes the filtering result more reliable. In particular, when the sample noise rate is unknown, as is common in practice, the method achieves relatively high accuracy and recall. Experimental results on the CIFAR10 dataset show that, in the presence of label noise, the proposed method identifies noisy samples well at relatively high noise rates.
Keywords: label noise; loss value; noise filtering; ensemble learning
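
The abstract gives only an outline of the procedure, so the following is a minimal Python sketch of one possible reading of it, not the authors' implementation. The function name `double_interval_filter`, the quantile bounds `lo_q` and `hi_q` used for the first-level interval, the number of rounds `n_rounds`, and the majority-vote rule are all assumptions introduced for illustration; the abstract does not specify them.

```python
import numpy as np


def double_interval_filter(losses, lo_q=0.2, hi_q=0.9, n_rounds=25, seed=0):
    """Flag likely-noisy samples from early-training losses (illustrative sketch)."""
    losses = np.asarray(losses, dtype=float)
    order = np.sort(losses)

    # First-level interval: a preset slice of the sorted losses
    # (quantile bounds are an assumption, not taken from the paper).
    a, b = int(lo_q * len(order)), int(hi_q * len(order))
    window = order[a:b]
    if window.size < 3:
        raise ValueError("first-level interval needs at least three samples")

    # Differences between adjacent sorted losses inside the first-level interval.
    gaps = np.diff(window)
    top_two = np.argsort(gaps)[-2:]  # positions of the two largest gaps

    # Second-level interval: from the lower edge of the earlier gap to the
    # upper edge of the later gap (one possible reading of the abstract).
    low = window[top_two.min()]
    high = window[top_two.max() + 1]

    # Draw several random thresholds and let them vote on each sample.
    rng = np.random.default_rng(seed)
    thresholds = rng.uniform(low, high, size=n_rounds)
    votes = (losses[:, None] > thresholds[None, :]).sum(axis=1)

    # A sample is flagged as noisy if most thresholds place it on the high-loss side.
    return votes > n_rounds / 2
```

Given a one-dimensional array of per-sample losses recorded early in training, the function returns a boolean mask in which `True` marks a sample treated as noisy. Voting over several thresholds, rather than trusting a single cut, is how this sketch mirrors the ensemble idea described in the abstract.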