
Quick Study: XGBoost Algorithm Principles and Implementation, Plus a Hands-On Python Project

admin  Views: 32  2024-03-19


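Foreword
I. XGBoost Model Overview
  1. History
  2. Improvements over GBDT
    1. Loss function
    2. Split point selection
    3. Pruning strategy
    4. Regularization
    5. Learning rate
    6. Early stopping
II. How the XGBoost Algorithm Works
  1. Initializing and constructing the objective function
  2. Transforming the objective function
    Summary of the transformation's advantages
  3. Introducing the tree into the objective function
  4. Building the optimal tree (greedy algorithm)
III. XGBoost in Practice: A Loan Default Prediction Model
  1. Data background and description
    Field table
  2. Data quality checks
    Checking for duplicates
    Missing value statistics
    Outlier analysis: the MAD method
  3. Handling categorical features
    1. grade
    2. subGrade
  4. Training the XGBoost model
    1. xgboost.get_config()
    2. Maximum tree depth and minimum child weight
    3. gamma
    4. subsample and colsample_bytree
    5. Regularization terms
    6. Learning rate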

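Foreword

This is the third installment of my series on completing the Boosting family of ensemble models. The earlier posts covered the AdaBoost algorithm and the principles and practice of GBDT in detail. Those two are enough to grasp the core idea of Boosting and its basic computational framework. The remaining Boosting algorithms are refinements of them, in particular the three derived from GBDT: XGBoost, LightGBM, and CatBoost. Many readers have used XGBoost without knowing much about GBDT. If you read the earlier GBDT post, you should already have a clear picture of GBDT and a good foundation for understanding XGBoost.

XGBoost has achieved very high accuracy and performance across data mining, prediction, and classification tasks, and it is one of the most widely used machine learning algorithms today. Its rapid development and wide adoption have pushed forward the refinement of machine learning algorithms and helped lay the groundwork for applying AI in practice. In this post I will do my best to make you familiar with the XGBoost framework, using as few formulas and as little specialist theory as I can while still keeping the explanation and the derivation easy to follow. The goal is to understand and implement the algorithm within a single article and to get to productive use as efficiently as possible.

I have focused on modeling for four years and taken part in dozens of mathematical modeling contests of all sizes, so I know how the various models work, how each is built, and how different problem types are analyzed. This column exists to help readers with no background start using these models and their code quickly; every article includes a hands-on project and runnable code. I follow the modeling contests closely, and after each one I add the latest ideas, detailed reasoning, and complete code to this column. If that is what you need, don't miss it.

The full article follows.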

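I. XGBoost Model Overview

1. History

2014: Tianqi Chen started XGBoost as a research project. The paper "XGBoost: A Scalable Tree Boosting System" (Chen and Guestrin) followed at KDD 2016.
2015: XGBoost stood out in Kaggle competitions and became one of the first-choice algorithms for data scientists and machine learning engineers.
2016: With its C++ core and Python package, XGBoost supported more feature engineering and model tuning functionality, which greatly improved efficiency and scalability.
2017: Solutions built on XGBoost took several prizes in KDD Cup 2017, and XGBoost became usable in Spark pipelines (through the XGBoost4J-Spark integration rather than as part of Spark MLlib itself).
2018: XGBoost was brought into Microsoft Azure ML Studio as one of the components of Azure Machine Learning.
2019: By this point XGBoost offered GPU support, so training and prediction could be accelerated on a GPU.
2020: XGBoost was being applied in many fields, such as finance, healthcare, natural language processing, and image recognition.

The takeaway from this short history is that the algorithm can be called from Python and can use a GPU to speed up computation, which says a lot about how popular it is.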

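2. Improvements over GBDT

XGBoost is a machine learning algorithm based on the Gradient Boosting Decision Tree (GBDT). It aims to optimize and speed up GBDT training while improving accuracy and generalization.

Here is how XGBoost compares with the other GBDT-style algorithms: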

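Difference	GBDT	XGBoost	LightGBM	CatBoost
Weak learner	CART regression tree	1. CART regression tree 2. Linear learner 3. DART tree	Leaf-wise tree	Symmetric tree
Split finding	Greedy algorithm	Exact greedy (pre-sorted), approximate, and histogram algorithms	Histogram algorithm	Pre-sorted algorithm
Sparse value handling	None	Sparsity-aware algorithm	EFB (Exclusive Feature Bundling)	None
Categorical features	Not supported directly; encode them yourself before feeding the model	Same as GBDT	Supported directly, GS encoding	Supported directly, Ordered TS encoding
Parallel support	No	Yes	Yes	Yes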

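Compared with GBDT, XGBoost improves the algorithm in the following ways:

1. Loss function

XGBoost approximates the loss function with a Taylor expansion. The loss can be, for example, mean squared error (MSE) for regression or cross-entropy for classification. Because the expansion is second-order, XGBoost uses both the first and the second derivative of the loss, which improves predictive accuracy.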
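To make the second-order expansion concrete, here is a minimal sketch in plain Python. It computes the first derivative (g) and second derivative (h) of the binary cross-entropy with respect to the raw score, then compares the Taylor approximation with the exact loss after a small step. The function names are illustrative and are not part of the xgboost API.

```python
import math

def log_loss(y_true, raw_score):
    """Binary cross-entropy for a single sample, given the raw (pre-sigmoid) score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))

def log_loss_grad_hess(y_true, raw_score):
    """First and second derivative of the loss with respect to the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)

def taylor_approx(loss_now, g, h, step):
    """Second-order Taylor estimate of the loss after moving the score by `step`."""
    return loss_now + g * step + 0.5 * h * step ** 2

y, score, step = 1.0, 0.0, 0.1
g, h = log_loss_grad_hess(y, score)
approx = taylor_approx(log_loss(y, score), g, h, step)
exact = log_loss(y, score + step)
print(g, h, approx, exact)
```

For small steps the two values agree closely, which is why a new tree can be fitted using only g and h per sample.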

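2. Split point selection

XGBoost uses a greedy algorithm to choose the best split point, that is, the split that reduces the loss the most. To speed up the search it pre-sorts the feature values, and it also offers histogram-based and approximate algorithms to make the computation more efficient still.

3. Pruning strategy

XGBoost uses a pruning strategy similar to that of CART decision trees: a minimum weight for leaf nodes and a maximum depth limit model complexity and help avoid overfitting. Pruning can be applied while a tree is being grown or afterwards.

4. Regularization

XGBoost also uses regularization to control model complexity and avoid overfitting. Specifically, it supports two kinds: L1 regularization and L2 regularization. Adjusting the regularization parameters gives flexible control over model complexity.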
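As an illustration of how the two penalties act on a single leaf, the sketch below computes a leaf value from the summed gradients G and hessians H of the samples in that leaf. The L2 term (reg_lambda) enlarges the denominator, and the L1 term (reg_alpha) soft-thresholds G. This is the form commonly given for tree boosting; the function name and the toy numbers are assumptions for illustration only.

```python
def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf value from summed gradients G and hessians H, with L1/L2 penalties."""
    # L1: shrink G toward zero by reg_alpha (soft-thresholding).
    if G > reg_alpha:
        shrunk = G - reg_alpha
    elif G < -reg_alpha:
        shrunk = G + reg_alpha
    else:
        shrunk = 0.0
    # L2: reg_lambda in the denominator pulls the weight toward zero.
    return -shrunk / (H + reg_lambda)

print(leaf_weight(-4.0, 3.0, reg_lambda=1.0, reg_alpha=0.0))  # L2 only
print(leaf_weight(-4.0, 3.0, reg_lambda=1.0, reg_alpha=1.0))  # L1 and L2
print(leaf_weight(0.5, 3.0, reg_lambda=1.0, reg_alpha=1.0))   # |G| below alpha
```

Larger reg_lambda or reg_alpha both push leaf values toward zero, so each tree makes smaller corrections.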

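5. Learning rate

XGBoost also uses a learning rate (shrinkage), which scales down the contribution of the tree added at each iteration and so helps avoid overfitting. It is usually set below 1, for example 0.1 or 0.01.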
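The effect of the learning rate can be seen in a deliberately tiny toy, sketched below. Each "weak learner" here is just one constant fitted to the current residuals under squared error, which is far simpler than a real tree; the function name and the data are made up for illustration.

```python
def boost_constant_learners(y, learning_rate, n_rounds):
    """Boosting for squared error where every weak learner is a single constant."""
    prediction = 0.0
    history = []
    for _ in range(n_rounds):
        residuals = [yi - prediction for yi in y]
        step = sum(residuals) / len(residuals)  # best constant fit to the residuals
        prediction += learning_rate * step      # shrink the learner's contribution
        history.append(prediction)
    return history

y = [2.0, 4.0, 6.0]
fast = boost_constant_learners(y, learning_rate=1.0, n_rounds=3)
slow = boost_constant_learners(y, learning_rate=0.1, n_rounds=3)
print(fast)
print(slow)
```

With a learning rate of 1.0 the first learner already jumps to the mean of y; with 0.1 the prediction approaches it gradually, which is why a smaller learning rate usually needs more trees.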

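6. Early stopping

To avoid overfitting and make training more efficient, XGBoost also supports early stopping. During training it monitors the loss on a validation set, and when that loss has not decreased for a given number of consecutive iterations, training stops. This avoids both overfitting and wasted computation.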
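The stopping rule itself is simple enough to sketch without any library. The helper below is hypothetical (it is not the xgboost implementation): it scans a list of per-round validation losses and reports the best round and the round at which training would stop for a given patience.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stop_round) for a sequence of validation losses."""
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_round, round_idx
    return best_round, len(val_losses) - 1

losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
best, stopped = early_stopping_round(losses, patience=3)
print(best, stopped)
```

Note that the last value (0.50) is never reached: the rule trades a possible later improvement for a shorter training run.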

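II. How the XGBoost Algorithm Works

Now that we know how XGBoost differs from GBDT, let's walk through the whole algorithm the way we did for GBDT, to see which computation steps gained new handling.

The basic flow is still the same as in GBDT:

1. Initializing and constructing the objective function

All samples start with equal weights, and an initial model is built as the baseline; it can simply be the mean or the median. XGBoost additionally adds a regularization term that controls the structure of the base-learner trees. The objective function is defined as follows:

For example, build a weak learner $F_{0}(x)=\arg\min_{c}\sum_{i=1}^{N}L(y_{i},c)$

Because our $g_{i}$ ...

The loss is computed by iterating over the sample indices, $\sum_{i=1}^{n}$ ...

At this point the optimal value of the objective for a given tree structure is determined. A single tree can take many possible structures, however, so we still have to choose the best one. XGBoost uses a greedy algorithm to build an optimal tree.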
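For reference, the chain of formulas leading to that result can be sketched as follows. This is the standard derivation from the XGBoost paper, not necessarily the author's exact notation: $\hat{y}_i^{(t-1)}$ (the prediction after $t-1$ trees), $f_t$ (the new tree), $T$ (number of leaves), $w_j$ (value of leaf $j$), $I_j$ (samples falling into leaf $j$), $\gamma$ and $\lambda$ are symbols assumed here.

```latex
% Objective at round t: loss plus regularization of the new tree
Obj^{(t)} = \sum_{i=1}^{n} L\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t),
\qquad \Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}

% Second-order Taylor expansion around \hat{y}_i^{(t-1)} (constant terms dropped)
Obj^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^{2} \Big] + \Omega(f_t),
\qquad g_i = \frac{\partial L(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}},\quad
h_i = \frac{\partial^{2} L(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^{2}}

% Switch from a sum over samples to a sum over leaves
G_j = \sum_{i \in I_j} g_i,\qquad H_j = \sum_{i \in I_j} h_i,\qquad
Obj^{(t)} \approx \sum_{j=1}^{T} \Big[ G_j w_j + \tfrac{1}{2}(H_j + \lambda) w_j^{2} \Big] + \gamma T

% Each leaf is a quadratic in w_j, so the optimum is closed-form
w_j^{*} = -\frac{G_j}{H_j + \lambda},\qquad
Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T
```

Since $g_i$ and $h_i$ depend only on the previous $t-1$ trees, they are constants while the new tree is being fitted, and $Obj^{*}$ serves as a score for comparing tree structures.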

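I don't know how many of you are still willing to read on through the greedy construction of the optimal tree, but this is how XGBoost actually computes, and there is no way around it if you want to understand the calculation. What has to be learned takes some effort.

4. Building the optimal tree (greedy algorithm)

A general template for a greedy algorithm:
1. Determine the optimization objective, that is, the metric to be optimized.
2. Break the problem into subproblems, each of which can be solved with a greedy strategy.
3. For each subproblem, define a locally optimal solution, use the greedy strategy to pick the best choice in the current state, and update the global best.
4. Check whether the updated global best satisfies the termination condition. If it does, stop; if not, go back to step 3.
For tree models in general, quality is judged by information gain, that is, entropy is used to measure uncertainty; CART uses the Gini index instead. For XGBoost, we compare ...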
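For reference, the quantity usually compared in XGBoost is a split gain computed from the summed gradients and hessians on each side of a candidate split, minus the penalty gamma for adding a leaf. The sketch below applies the greedy template to a single feature. It is a simplified illustration under the standard formulation (no missing-value handling, one feature only); the function names and toy data are assumptions.

```python
def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children (higher is better)."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma

def best_split(x, g, h, reg_lambda=1.0, gamma=0.0):
    """Exact greedy search on one feature: scan sorted values, keep the best gain."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G_total, H_total = sum(g), sum(h)
    G_left = H_left = 0.0
    best = (0.0, None)  # (gain, threshold); None means "do not split"
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_left += g[i]
        H_left += h[i]
        if x[i] == x[nxt]:
            continue  # cannot place a threshold between equal values
        gain = split_gain(G_left, H_left, G_total - G_left, H_total - H_left,
                          reg_lambda, gamma)
        if gain > best[0]:
            best = (gain, (x[i] + x[nxt]) / 2)
    return best

# Squared error with current prediction 0 and targets [1, 1, -1, -1]:
# g = prediction - target, h = 1 for every sample.
x = [1.0, 2.0, 3.0, 4.0]
g = [-1.0, -1.0, 1.0, 1.0]
h = [1.0, 1.0, 1.0, 1.0]
print(best_split(x, g, h))             # splits between 2 and 3
print(best_split(x, g, h, gamma=2.0))  # gamma too high: no split is worth it
```

A full tree is grown by repeating this search over every feature at every node and always taking the split with the largest positive gain.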

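Data preview:

2. Data quality checks

We use pandas to analyze the data, that is, to look at missing values, duplicates, and outliers, together with the corresponding descriptive statistics.
import pandas as pd
df = pd.read_csv("train.csv")
test = pd.read_csv("testA.csv")
Basic information about the data:
df.shape    # (800000, 47)
test.shape  # (200000, 46)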
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   id                  800000 non-null  int64
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object
 6   subGrade            800000 non-null  object
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object
 9   homeOwnership       800000 non-null  int64
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64
 12  issueDate           800000 non-null  object
 13  isDefault           800000 non-null  int64
 14  purpose             800000 non-null  int64
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64
 28  applicationType     800000 non-null  int64
 29  earliesCreditLine   800000 non-null  object
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n3                  759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
Checking for duplicates:
df[df.duplicated()]  # show the duplicated rows

Missing value statistics
# plot the share of NaN values for each column that has any
missing = df.isnull().sum() / len(df)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
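Outlier analysis: the MAD method

Only numeric features are checked here. If you are not familiar with the method, see my other article, "Quick Study (6): Data Analysis with Pandas, Outlier Detection and Handling Methods Explained, with Code".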
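A minimal sketch of MAD-based detection on one numeric column is given below. It uses the modified z-score with the constant 0.6745 and a cutoff of 3.5, which is a common convention; the author's exact thresholds are not shown in this article, so treat both numbers, the function name, and the toy data as assumptions.

```python
import pandas as pd

def mad_outlier_mask(series, threshold=3.5):
    """Flag values whose modified z-score (based on the MAD) exceeds `threshold`."""
    median = series.median()
    mad = (series - median).abs().median()  # median absolute deviation
    if mad == 0:
        # All deviations are zero for at least half the data: flag nothing.
        return pd.Series(False, index=series.index)
    modified_z = 0.6745 * (series - median) / mad
    return modified_z.abs() > threshold

# Toy stand-in for a numeric column such as annualIncome.
income = pd.Series([48.0, 50.0, 51.0, 52.0, 49.0, 500.0], name="annualIncome")
mask = mad_outlier_mask(income)
print(income[mask])
```

To cover every numeric feature, the same function could be applied column by column, for example over `df.select_dtypes("number")`.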

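3. Handling categorical features

At this point we can see that five features have dtype object. They need to be converted; for nominal features you can refer to my article "Quick Study: Categorical Feature Analysis and Preprocessing Methods Explained, with Python Code".

1. grade

This feature is the loan grade, with seven levels in total: A, B, C, D, E, F, G. One-hot encoding works fine here:
pd.get_dummies(df.grade, drop_first=False)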

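2. subGrade

This feature is the sub-level of the loan grade, named as the grade letter followed by one of five levels [1,2,3,4,5]. One-hot encoding it directly is clearly not a good fit, and it is also redundant with grade. It can instead be merged with the grade feature by adding a suitably scaled offset to the one-hot value. There are better ways, of course; to keep things easy to follow I use the following approach:
df_grade = pd.get_dummies(df.grade, drop_first=False, dtype=float)

def shine_convert(x):
    # scale the sub-level (1..5) to an offset of 0.2..1.0
    return x * 2 / 10

se_subGrade = df.subGrade.str[1:].astype(int).apply(shine_convert)

for i in range(df_grade.shape[0]):
    data = df_grade.iloc[i, :].copy()
    non_zero_index = data != 0
    # add the offset to the single active grade column of this row
    data.loc[non_zero_index] += se_subGrade.iloc[i]
    print(data)
This adds a level ordering on top of the nominal data. That said, looping over 800,000 rows is already very slow on my machine and I did not let it finish; feel free to try it yourself.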
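Because the row-by-row loop is the slow part, the same encoding can also be sketched as a single vectorized operation: multiply each one-hot row by (1 + sub-level * 2 / 10), which leaves the zeros untouched and turns the single 1 into 1.2 to 2.0. The function name and the three-row demo frame below are illustrative assumptions, not the author's code.

```python
import pandas as pd

def encode_grade_with_subgrade(grade, sub_grade):
    """One-hot encode `grade`, then add the scaled sub-level to the active column."""
    one_hot = pd.get_dummies(grade, dtype=float)
    level = sub_grade.str[1:].astype(int)          # "B3" -> 3
    # Row-wise multiply: the active column becomes 1 + level * 2 / 10.
    return one_hot.mul(1.0 + level * 2 / 10, axis=0)

demo = pd.DataFrame({"grade": ["A", "B", "A"], "subGrade": ["A1", "B3", "A5"]})
encoded = encode_grade_with_subgrade(demo["grade"], demo["subGrade"])
print(encoded)
```

On the full data this would be called as `encode_grade_with_subgrade(df.grade, df.subGrade)`, and it avoids the Python-level loop entirely.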

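4. Training the XGBoost model

I won't demonstrate much more of the data processing, since this post is mainly about using XGBoost and an overly long article gets tiring. From here on we use XGBoost directly; just assume the data has already been processed. What follows is a detailed look at how to tune XGBoost's parameters.

1. xgboost.get_config()

Returns the current values of the global configuration.
The global configuration is the set of parameters that apply globally. For the full list of supported parameters, see: Global Configuration
import xgboost as xgb
xgb.get_config()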
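2. Maximum tree depth and minimum child weight

First tune max_depth and min_child_weight together. max_depth controls the structure of the tree, while min_child_weight (the minimum sum of sample weights required in a child node) guards against overfitting. A larger value keeps the model from learning highly local, special-case samples, but a value that is too high leads to underfitting.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor as XGBR

param_test1 = {'max_depth': range(3, 10, 2), 'min_child_weight': range(2, 7, 2)}
# Xtrain / Ytrain: the preprocessed training features and target
gsearch1 = GridSearchCV(
    estimator=XGBR(learning_rate=0.1, n_estimators=140, max_depth=5,
                   min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                   objective='reg:squarederror',  # replaces the deprecated 'reg:linear'
                   n_jobs=4, scale_pos_weight=1, random_state=27),
    param_grid=param_test1, scoring='r2', n_jobs=4, cv=5)
gsearch1.fit(Xtrain, Ytrain)
gsearch1.best_params_, gsearch1.best_score_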
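To make min_child_weight more concrete: in XGBoost the "weight" of a child is the sum of the hessians of the samples in it. Under logistic loss each hessian is p * (1 - p), so a child full of confidently predicted samples has a very small weight and is not split off. The helper below is a hypothetical illustration of that check, not a call into the library.

```python
def child_is_allowed(probabilities, min_child_weight):
    """Check the hessian-sum constraint for a child node under logistic loss."""
    hessian_sum = sum(p * (1.0 - p) for p in probabilities)
    return hessian_sum >= min_child_weight, hessian_sum

# Four uncertain samples (p = 0.5) versus four confident ones (p = 0.99).
print(child_is_allowed([0.5] * 4, min_child_weight=1.0))
print(child_is_allowed([0.99] * 4, min_child_weight=1.0))
```

For squared error every hessian equals 1, so in that case min_child_weight is simply a minimum number of samples per child.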
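3. gamma

Next tune gamma. When XGBoost splits a node, it only does so if the split lowers the loss. gamma sets the minimum loss reduction required for a split, so it determines how conservative the model is: the higher the value, the more conservative the model.
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

param_test3 = {
    'gamma': [i / 10.0 for i in range(0, 5)]
}
# X / y: the preprocessed training features and binary target
gsearch3 = GridSearchCV(
    estimator=xgb.XGBClassifier(learning_rate=0.1, n_estimators=8, max_depth=8,
                                min_child_weight=3, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                objective='binary:logistic', n_jobs=4, scale_pos_weight=1,
                                random_state=27),
    param_grid=param_test3, scoring='roc_auc', cv=5)  # the old iid argument no longer exists
gsearch3.fit(X, y)
gsearch3.best_params_, gsearch3.best_score_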
4. subsample and colsample_bytree

Next tune subsample and colsample_bytree. subsample controls the fraction of rows randomly sampled for each tree. Lowering it makes the algorithm more conservative and helps avoid overfitting, but a value that is too small can cause underfitting. colsample_bytree controls the fraction of columns (each column is one feature) randomly sampled for each tree.
param_test4 = {
    'subsample': [i / 100.0 for i in range(75, 90, 5)],
    'colsample_bytree': [i / 100.0 for i in range(75, 90, 5)]
}
gsearch5 = GridSearchCV(
    estimator=xgb.XGBClassifier(learning_rate=0.1, n_estimators=8, max_depth=8,
                                min_child_weight=3, gamma=0.4, subsample=0.8, colsample_bytree=0.8,
                                objective='binary:logistic', n_jobs=4, scale_pos_weight=1,
                                random_state=27),
    param_grid=param_test4, scoring='roc_auc', cv=5)  # use the grid defined above
gsearch5.fit(X_train, y_train)
gsearch5.best_params_, gsearch5.best_score_
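The idea behind the two sampling parameters can be sketched with the standard library alone: before each tree is built, a random subset of rows and a random subset of feature columns is drawn, and the tree only sees that subset. This illustrates the concept and is not how xgboost implements sampling internally; the function and feature names are made up.

```python
import random

def sample_rows_and_columns(n_rows, feature_names, subsample, colsample_bytree, seed=0):
    """Pick the row indices and feature names one tree would be trained on."""
    rng = random.Random(seed)
    n_pick_rows = max(1, int(n_rows * subsample))
    n_pick_cols = max(1, int(len(feature_names) * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_pick_rows))
    cols = sorted(rng.sample(feature_names, n_pick_cols))
    return rows, cols

features = ["loanAmnt", "term", "interestRate", "dti", "annualIncome"]
rows, cols = sample_rows_and_columns(10, features, subsample=0.8, colsample_bytree=0.8)
print(rows)
print(cols)
```

Because every tree sees a slightly different slice of the data, the trees are less correlated with each other, which is where the regularizing effect comes from.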
5. Regularization terms

These control the model's regularization terms to prevent overfitting.
param_test5 = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(
    estimator=xgb.XGBClassifier(learning_rate=0.1, n_estimators=8, max_depth=8,
                                min_child_weight=3, gamma=0.2, subsample=0.85, colsample_bytree=0.85,
                                objective='binary:logistic', n_jobs=4, scale_pos_weight=1,
                                random_state=27),
    param_grid=param_test5, scoring='roc_auc', cv=5)  # use the grid defined above
gsearch6.fit(X_train, y_train)
gsearch6.best_params_, gsearch6.best_score_
6. Learning rate

Finally, tune the learning rate and pick the best value to settle on a suitable model.
param_test6 = {
    'learning_rate': [0.01, 0.02, 0.1, 0.2]
}
gsearch7 = GridSearchCV(
    estimator=xgb.XGBClassifier(learning_rate=0.1, n_estimators=8, max_depth=8,
                                min_child_weight=1, gamma=0.2, subsample=0.8, colsample_bytree=0.85,
                                objective='binary:logistic', n_jobs=4, scale_pos_weight=1,
                                random_state=27),
    param_grid=param_test6, scoring='roc_auc', cv=5)
gsearch7.fit(X_train, y_train)
gsearch7.best_params_, gsearch7.best_score_
XGBoost with default parameters:
import pandas as pd
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

train = pd.read_csv("df2.csv")
train = train.iloc[:10000, :]

# split features and target
X = train.drop(columns='isDefault')
Y = train['isDefault']
# split into training and validation sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# train the model
clf = xgb.XGBClassifier()
labels = [0, 1]
clf.fit(x_train, y_train)
y_pred = clf.predict_proba(x_test)[:, 1]   # predicted probability of default
list_predict = (y_pred > 0.5).astype(int)  # threshold at 0.5 to get class labels

# confusion matrix
cm = confusion_matrix(y_test.values, list_predict)
sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels)
plt.title('confusion matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

# evaluate the model
print('Validation AUC: {}'.format(roc_auc_score(y_test, y_pred)))
print('F1_score: {}'.format(f1_score(y_test.values, list_predict, average='weighted')))
print('Recall_score: {}'.format(recall_score(y_test.values, list_predict, average='weighted')))
print('Precision: {}'.format(precision_score(y_test.values, list_predict, average='weighted')))
With tuned parameters:
import pandas as pd
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

train = pd.read_csv("df2.csv")
train = train.iloc[:10000, :]

# split features and target
X = train.drop(columns='isDefault')
Y = train['isDefault']
# split into training and validation sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# train the model with the parameters chosen above
clf = xgb.XGBClassifier(learning_rate=0.1,
                        n_estimators=8,
                        max_depth=8,
                        min_child_weight=2,
                        gamma=0.8,
                        subsample=0.85,
                        colsample_bytree=0.8)
labels = [0, 1]
clf.fit(x_train, y_train)
y_pred = clf.predict_proba(x_test)[:, 1]   # predicted probability of default
list_predict = (y_pred > 0.5).astype(int)  # threshold at 0.5 to get class labels

# confusion matrix
cm = confusion_matrix(y_test.values, list_predict)
sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels)
plt.title('confusion matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

# evaluate the model
print('Validation AUC: {}'.format(roc_auc_score(y_test, y_pred)))
print('F1_score: {}'.format(f1_score(y_test.values, list_predict, average='weighted')))
print('Recall_score: {}'.format(recall_score(y_test.values, list_predict, average='weighted')))
print('Precision: {}'.format(precision_score(y_test.values, list_predict, average='weighted')))
These are not the optimal parameters; feel free to tune them yourself. With that, the model is built.

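To wrap up, here is a summary of XGBoost's characteristics:

The core idea of XGBoost is to learn weak learners by minimizing a loss function and to combine many weak learners into one strong learner, improving accuracy and generalization. XGBoost trains the model with gradient boosting plus regularization, and features such as the learning rate and early stopping further improve efficiency and robustness.

The next chapter continues filling in the Boosting family with LightGBM.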