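Insurance Studies (《保险研究》) 20221206: "A Comparative Study of the Effectiveness of Mainstream Machine Learning Methods in Identifying Auto Insurance Fraud" (CHEN Kai, LI Bin-jie)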

[CLC number] F840; TP181 [Document code] A [Article ID] 1004-3306(2022)12-0090-13 DOI: 10.13497/j.cnki.is.2022.12.006

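[Abstract] In recent years, the sheer size of China's auto insurance market has given rise to many auto insurance fraud cases, yet traditional fraud identification methods are inefficient. This paper applies machine learning methods and conducts an empirical analysis on four data sets, including one from China, to compare six mainstream machine learning methods in terms of their predictive performance for auto insurance fraud and the robustness of that performance. Each of the four original data sets is split into a training set and a test set: the training set is used to build the machine learning models and the test set is used to evaluate them, so that both the predictive performance of each method and the stability of that performance can be assessed. First, SMOTE sampling, which operates in the feature space, is applied so that the numbers of fraud and non-fraud samples in the training set are balanced. Then 10-fold cross-validation is used to select the best parameter combination, which determines the optimal tuning parameters of each model, and the ROC curve together with the area under it (AUC) serves as the criterion for evaluating predictive performance, so as to avoid the influence of a subjectively chosen cut-off point. Finally, the study finds that the extreme gradient boosting decision tree model and the random forest model do better in both predictive performance and the stability of that performance.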

[Keywords] auto insurance; machine learning; SMOTE sampling; ROC curve

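[About the authors] CHEN Kai (corresponding author), Associate Professor, School of Economics, Peking University; LI Bin-jie, master's student, School of Economics, Peking University.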


A Comparative Study on the Effectiveness of Machine Learning Methods in Auto Insurance Fraud Identification

CHEN Kai, LI Bin-jie

Abstract: The sheer size of China's auto insurance market has given rise to a large number of auto insurance fraud cases. However, traditional auto insurance fraud identification methods are not effective. This paper uses machine learning methods and conducts an empirical analysis based on four data sets to compare the prediction performance and robustness of six mainstream machine learning methods in auto insurance fraud detection. We split each of the four original data sets into a training set and a test set. The training set is used to build the machine learning models, and the test set is used to evaluate them. In this way, we evaluate the prediction performance of each machine learning method and the robustness of that performance. First, we use the SMOTE sampling method to generate new data in order to balance the number of fraud samples and non-fraud samples in the training set. We then use 10-fold cross-validation to select the best parameter combination, which determines the optimal tuning parameters of each model. We use the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) as the evaluation standard for the prediction performance of the models. Finally, we find that the random forest model and the extreme gradient boosting decision tree model have better prediction performance and robustness.

Key words: auto insurance; machine learning; SMOTE sampling; ROC curve
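
Two steps named in the abstract, SMOTE balancing of the training set and cut-off-free evaluation by AUC, can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' implementation: the function names `smote` and `auc`, the neighbour count `k`, and the toy data are all assumptions made for illustration. The 10-fold cross-validation step and the six models compared in the paper are omitted.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def auc(labels, scores):
    """AUC as the probability that a random positive is scored above a
    random negative (ties count one half); no cut-off point is needed."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy training data: 3 fraud claims versus 9 non-fraud claims.
fraud = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]]
non_fraud = [[float(i), 0.0] for i in range(9)]

# Oversample the fraud class until both classes have the same size.
balanced_fraud = fraud + smote(fraud, n_new=len(non_fraud) - len(fraud), k=2)

# Hypothetical model scores for four test claims (1 = fraud, 0 = non-fraud).
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Only the training set is oversampled here, matching the abstract; the test set keeps its original class ratio so that the AUC reflects performance on unaltered data.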