Notes on Sampling Methods

A study of sampling methods based on imbalanced-learn, focusing mainly on oversampling.

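1 Overview

Imbalanced datasets need to be resampled.

When there are too few positive samples and too many negative samples, the model cannot learn enough information about the positive class.

At the same time, the model ends up giving too much weight to the negative samples.

So whenever the positive and negative classes are unevenly distributed, resampling is needed.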

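2 Sampling methods

2.1 Undersampling

Undersampling randomly draws a subset of the negative examples and merges it with the positive examples.

It discards a large amount of data, so some information may be lost. It may also lead to overfitting, since a model fit on the reduced sample may fail to generalize to the larger dataset.

I have not tested its actual effect, but I do not recommend it.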
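To make the idea concrete, here is a minimal pure-Python sketch of random undersampling. It assumes binary labels where 1 is the positive (minority) class; the function name `random_undersample` and the toy data are made up for this illustration. For real use, imbalanced-learn provides `RandomUnderSampler`.

```python
import random

def random_undersample(X, y, seed=0):
    """Keep every positive row and an equally sized random subset of negatives."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    # Draw without replacement so that no negative row is repeated.
    kept_neg = rng.sample(neg, k=min(len(pos), len(neg)))
    keep = sorted(pos + kept_neg)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 2 positives and 8 negatives.
X = [[float(i)] for i in range(10)]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X_res, y_res = random_undersample(X, y)
print(sum(y_res), len(y_res) - sum(y_res))
```

Six of the eight negative rows are thrown away here, which is exactly the information loss described above.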

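2.2 Oversampling

Oversampling expands the data of the under-represented positive class, either by duplicating existing samples (random) or by synthesizing new ones (for example SMOTE).

For the implementation, see the imbalanced-learn library.

Documentation: https://imbalanced-learn.org/stable/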

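2.2.1 Random oversampling

Introduction

Random oversampling repeats positive examples by sampling them at random. It does not actually introduce any new data. Duplicating positives too heavily can amplify the effect of noise on the model and can also lead to overfitting.

Code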

from imblearn.over_sampling import RandomOverSampler 
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)

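Notes

The sampler used for random oversampling has the following signature:

RandomOverSampler(*, sampling_strategy='auto', random_state=None, shrinkage=None)

The sampling_strategy parameter controls the sampling ratio; passing a float sets the ratio directly. With the default, the positive class is automatically oversampled until it is 1:1 with the negative class, which is very likely to cause overfitting.

From the documentation: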

sampling_strategy : float, str, dict or callable, default='auto'

Sampling information to resample the data set.

When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as $\alpha_{os} = N_{rm} / N_{M}$ where $N_{rm}$ is the number of samples in the minority class after resampling and $N_{M}$ is the number of samples in the majority class.

Warning

float is only available for binary classification. An error is raised for multi-class classification.

When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:

'minority': resample only the minority class;

'not minority': resample all classes but the minority class;

'not majority': resample all classes but the majority class;

'all': resample all classes;

'auto': equivalent to 'not majority'.

When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.

When callable, function taking y and returns a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
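To show what a float sampling_strategy means in terms of counts, here is a small pure-Python sketch of random oversampling up to a target ratio. It is an illustration only, not the library's implementation; `random_oversample` and the toy data are hypothetical, and label 1 is assumed to be the minority class.

```python
import random

def random_oversample(X, y, ratio=1.0, seed=0):
    """Duplicate minority rows until n_minority / n_majority reaches `ratio`."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    n_majority = len(y) - len(minority)
    # Target minority count after resampling: N_rm = ratio * N_M.
    target = int(ratio * n_majority)
    n_extra = max(0, target - len(minority))
    # Sample with replacement: every extra row is an exact copy of an existing one.
    extra = [rng.choice(minority) for _ in range(n_extra)]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 10 positives and 90 negatives.
X = [[float(i)] for i in range(100)]
y = [1] * 10 + [0] * 90
X_res, y_res = random_oversample(X, y, ratio=0.5)
n_pos = sum(y_res)
n_neg = len(y_res) - n_pos
print(n_pos, n_neg)
```

With a ratio of 0.5 the positive class grows from 10 to 45 rows while the 90 negatives are untouched. In imbalanced-learn the corresponding setting would be `RandomOverSampler(sampling_strategy=0.5)`, which stops well short of the 1:1 default.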

2.2.2 SMOTE

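Introduction

SMOTE stands for Synthetic Minority Oversampling Technique. The basic idea is to synthesize new samples for the minority class; it is a variant of random oversampling.

Sampling steps:

1. Set a sampling rate N according to the degree of class imbalance.
2. For each minority sample, find its K nearest neighbors among the minority samples (Euclidean distance) using KNN.
3. Randomly pick one of those neighbors and interpolate linearly with the original sample: choose a random point on the line segment joining the two.
4. Repeat until the number of synthetic samples satisfies the rate N.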
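The steps above can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation: `smote_sketch` is a hypothetical name, the neighbor search is brute force, and `n_per_sample` plays the role of the rate N.

```python
import math
import random

def smote_sketch(minority, n_per_sample, k=5, seed=0):
    """Generate n_per_sample synthetic points for every minority point."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        # K nearest minority neighbors of x by Euclidean distance (x itself excluded).
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_per_sample):
            z = rng.choice(neighbors)   # pick one neighbor at random
            lam = rng.random()          # position on the segment from x to z
            synthetic.append([a + lam * (b - a) for a, b in zip(x, z)])
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_per_sample=2, k=2)
print(len(new_points))
```

Every synthetic point lies on a segment between two real minority points, so unlike random oversampling the new rows are not exact copies.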

Code

from imblearn.over_sampling import SMOTE 
sm = SMOTE(random_state=42)
x_res, y_res = sm.fit_resample(x, y)

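Notes

The current version of the library does not accept string-typed data; it has to be encoded first.
NaN values are not supported; they have to be filled first.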
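As a rough sketch of that preprocessing, the snippet below encodes a string column as integer codes and fills NaN with the column mean, using only the standard library. The helper names and the toy columns are hypothetical, and mean filling is just one possible choice.

```python
import math

def encode_column(values):
    """Map each distinct string to an integer code (ordinal encoding)."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

def fill_nan_with_mean(values):
    """Replace NaN entries with the mean of the non-missing values."""
    present = [v for v in values if not math.isnan(v)]
    mean = sum(present) / len(present)
    return [mean if math.isnan(v) else v for v in values]

city = ["bj", "sh", "bj", "gz"]
income = [10.0, float("nan"), 30.0, 20.0]
city_codes, mapping = encode_column(city)
income_filled = fill_nan_with_mean(income)
print(city_codes, income_filled)
```

One caveat: integer codes give the categories an artificial order, and plain SMOTE will interpolate between them as if they were numbers. For categorical columns, SMOTENC (section 2.2.3) is usually the better fit.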

2.2.3 SMOTENC

Introduction

The principle is the same as SMOTE, except that it supports a mix of numerical and categorical variables.

Code

from imblearn.over_sampling import SMOTENC
sm = SMOTENC(random_state=42, categorical_features=[18, 19])  # the indices of the categorical columns must be given
X_res, y_res = sm.fit_resample(X, y)
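Categorical values cannot be interpolated, so something else has to happen for those columns. As far as I understand the imbalanced-learn documentation, a synthetic sample takes the most frequent category among the nearest neighbors. The sketch below illustrates only that voting step; the function name and the toy rows are made up.

```python
from collections import Counter

def most_frequent_category(neighbor_rows, cat_index):
    """Return the category that occurs most often in column cat_index."""
    counts = Counter(row[cat_index] for row in neighbor_rows)
    return counts.most_common(1)[0][0]

# Three neighbors, each with one numerical and one categorical feature.
neighbors = [[1.0, "A"], [2.0, "B"], [3.0, "A"]]
category = most_frequent_category(neighbors, cat_index=1)
print(category)
```

The numerical columns of the synthetic sample are still produced by linear interpolation, as in SMOTE.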

2.2.4 SMOTEN

Introduction

The principle is the same as SMOTE, except that it only supports categorical variables.

Code

from imblearn.over_sampling import SMOTEN
sampler = SMOTEN(random_state=0)
x_res, y_res = sampler.fit_resample(x, y)

2.2.5 ADASYN

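Introduction

ADASYN (Adaptive Synthetic) is an adaptive synthetic oversampling method. The steps are:

1. Compute the ratio of the positive class to the negative class.
2. Compute the total number of positive examples to generate, $G$. The parameter $\beta$ controls the target ratio, and $m_l$, $m_s$ are the sizes of the majority and minority classes:
   $G = (m_l - m_s)\beta$
3. For each positive example $x_i$, find its $k$ nearest neighbors and compute $r_i$:
   $r_i = \frac{\#\text{negative neighbors}}{k}$
   $r_i$ measures how much the negative (majority) class dominates the neighborhood. The higher the value, the harder the sample is to learn.
4. Normalize the $r_i$ so that they sum to 1.
5. Compute the number of positive examples to generate in each neighborhood: $G_i = G \cdot r_i$
6. Generate that many samples in each neighborhood:
   $s_i = x_i + (x_{zi} - x_i)\lambda$
   where $x_i$ is the positive example, $x_{zi}$ is another positive example chosen at random from its neighborhood, and $\lambda$ is a random number between 0 and 1.
7. Repeat until the required number of samples is reached.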
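The allocation part of these steps (computing $G$, $r_i$ and $G_i$) can be sketched as follows. This is a simplified pure-Python illustration with a brute-force neighbor search; `adasyn_allocation` and the toy points are hypothetical, and the counts are rounded to integers for readability.

```python
import math

def adasyn_allocation(minority, majority, beta=1.0, k=5):
    """Return G and the per-sample counts G_i for the minority points."""
    G = (len(majority) - len(minority)) * beta
    labeled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    r = []
    for x in minority:
        # k nearest neighbors of x over the whole dataset (x itself excluded).
        others = [(p, lab) for p, lab in labeled if p is not x]
        nearest = sorted(others, key=lambda item: math.dist(x, item[0]))[:k]
        n_majority = sum(1 for _, lab in nearest if lab == 0)
        r.append(n_majority / k)
    total = sum(r)  # assumes at least one r_i is greater than 0
    r_hat = [ri / total for ri in r]
    return G, [round(G * ri) for ri in r_hat]

# Two minority points sit far from the majority; one sits right next to it.
minority = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
majority = [[5.0, 6.0], [6.0, 5.0], [9.0, 9.0], [10.0, 10.0], [9.0, 10.0]]
G, counts = adasyn_allocation(minority, majority, beta=1.0, k=2)
print(G, counts)
```

In this toy case all of the synthetic samples are assigned to the point at (5, 5), the one surrounded by majority neighbors. That is the adaptive part of the method: harder samples receive more synthetic data.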

Code

from imblearn.over_sampling import ADASYN 
ada = ADASYN(random_state=42)
x_res, y_res = ada.fit_resample(x, y)

2.2.6 BorderlineSMOTE

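Introduction

An improvement on SMOTE whose core idea is to oversample only the samples near the class boundary. Roughly:

1. Before the KNN interpolation step, screen each minority sample by the class distribution of its neighbors.
   - If all of its neighbors are negative, the sample is treated as noise and skipped.
   - If more than half of its neighbors are negative, the sample is treated as a borderline sample and added to the danger set.
2. Oversample the minority (positive) samples in the danger set, using the same interpolation as SMOTE.

Note: there is also a variant of the algorithm that additionally uses the majority-class samples near the borderline when sampling.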
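The screening step can be sketched like this. It is a minimal pure-Python illustration with brute-force neighbor search, not the library code; `classify_minority`, the labels 'noise', 'danger' and 'safe', and the one-dimensional toy data are chosen for this example.

```python
import math

def classify_minority(minority, majority, k=5):
    """Label each minority point as 'noise', 'danger' or 'safe'."""
    labeled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    result = []
    for x in minority:
        others = [(p, lab) for p, lab in labeled if p is not x]
        nearest = sorted(others, key=lambda item: math.dist(x, item[0]))[:k]
        n_majority = sum(1 for _, lab in nearest if lab == 0)
        if n_majority == k:
            result.append("noise")    # surrounded by the majority class
        elif n_majority > k / 2:
            result.append("danger")   # borderline: will be oversampled
        else:
            result.append("safe")     # deep inside the minority region
    return result

minority = [[0.0], [0.1], [0.2], [0.3], [5.0], [10.0]]
majority = [[5.1], [5.2], [9.9], [10.1], [10.2]]
labels = classify_minority(minority, majority, k=3)
print(labels)
```

Only the point at 5.0 lands in the danger set here, so it is the only one that would be used as a base for interpolation.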

Code

from imblearn.over_sampling import BorderlineSMOTE 
sm = BorderlineSMOTE(random_state=42)
x_res, y_res = sm.fit_resample(x, y)

2.2.7 KMeansSMOTE

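Introduction

1. Cluster the whole dataset with K-means.
2. Assign more synthetic samples to clusters where the minority class is sparsely distributed.
3. Apply SMOTE within each minority-class cluster.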
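How the second step might distribute samples can be sketched as below. This follows my reading of the method and is simplified: density is taken as the minority count divided by the mean minority distance raised to the number of features, and sparsity is its inverse. The function name, the dictionary keys, and the toy clusters are hypothetical, and the clustering itself is assumed to have been done already.

```python
def allocate_by_sparsity(clusters, n_new, n_features):
    """Split n_new synthetic samples across clusters, favoring sparse ones."""
    sparsity = []
    for c in clusters:
        # Fewer minority points spread over a larger region means lower density.
        density = c["n_minority"] / (c["mean_distance"] ** n_features)
        sparsity.append(1.0 / density)
    total = sum(sparsity)
    return [round(n_new * s / total) for s in sparsity]

clusters = [
    {"n_minority": 10, "mean_distance": 1.0},  # dense cluster
    {"n_minority": 10, "mean_distance": 2.0},  # sparse cluster
]
allocation = allocate_by_sparsity(clusters, n_new=100, n_features=2)
print(allocation)
```

Both clusters hold the same number of minority points, but the one whose points are twice as far apart receives four times as many synthetic samples.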

Code

from imblearn.over_sampling import KMeansSMOTE
sm = KMeansSMOTE(random_state=42)
x_res, y_res = sm.fit_resample(x, y)

2.2.8 SVMSMOTE

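Introduction

Similar to Borderline SMOTE, except that an SVM replaces KNN for finding the borderline, so the resulting borderline is different.

Positive samples in the borderline region are then interpolated with their n nearest positive neighbors, as in SMOTE.

Code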

from imblearn.over_sampling import SVMSMOTE
sm = SVMSMOTE(random_state=42)
x_res, y_res = sm.fit_resample(x, y)

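3 Summary of results

I tested these methods on a real risk-control business. Almost all of them ended up overfitting the training set, while on the actual test set the results were about the same as without any resampling. With too large a sampling ratio, the results even got worse.

So whether and how to resample should only be decided after testing on the actual business data.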

Max Blog