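Comparing XGBoost, LightGBM, and CatBoost

This article is mainly based on "Battle of the Boosting Algos: LGB, XGB, Catboost"; the results here differ from those in the original.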

Contents

1. Comparison criteria
2. Results
3. Summary
4. Code
References

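1. Comparison criteria

1.1 Datasets

Classification: Fashion MNIST (60,000 samples, 784 features)
Regression: NYC Taxi fares (60,000 samples, 7 features)
Large-scale dataset: NYC Taxi fares (2 million samples, 7 features)

Note: this article only runs the classification comparison.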

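1.2 Rules

Use baseline models
Train with the same parameters and tune with GridSearchCV
Compare training and prediction time, prediction score, and interpretability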
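
The timing and scoring rule can be sketched as a small harness. This is only a minimal illustration, not the code used for the results in this article: `benchmark` and `MajorityClassifier` are hypothetical names, and the stand-in classifier exists only so the sketch runs without extra packages. Any object with `fit` and `predict` methods could be passed in its place.

```python
import time


def benchmark(model, X_train, y_train, X_test, y_test):
    """Time fit and predict, then score accuracy on the test split."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    correct = sum(int(p == t) for p, t in zip(predictions, y_test))
    return {
        "train_seconds": train_seconds,
        "predict_seconds": predict_seconds,
        "accuracy": correct / len(y_test),
    }


class MajorityClassifier:
    """Hypothetical stand-in model: always predicts the most frequent label."""

    def fit(self, X, y):
        # Remember the most frequent training label
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


result = benchmark(MajorityClassifier(), [[0], [1], [2]], [1, 1, 0], [[3], [4]], [1, 0])
print(result)
```

Running the same function on each of the three models, with the same data split, keeps the time and accuracy numbers comparable.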

1.3 Versions

xgboost==0.90
lightgbm==2.3.1
catboost==0.21
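
Since results may vary between library versions, it can help to check the installed versions against the pins above before rerunning. A minimal sketch using only the standard library; `check_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Versions listed in this article
EXPECTED = {"xgboost": "0.90", "lightgbm": "2.3.1", "catboost": "0.21"}


def check_versions(expected):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in check_versions(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {installed}")
```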

2. Results

2.1 Accuracy

LightGBM > XGBoost > CatBoost
[Figure: accuracy comparison]

2.2 Training and prediction time

CatBoost < LightGBM < XGBoost

[Figure: training and prediction time comparison]

2.3 Interpretability

XGBoost = LightGBM > CatBoost

2.3.1 Feature importance

[Figure: feature importance]
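
Raw importance scores from different libraries are not necessarily on the same scale, so one way to put them side by side is to normalize each vector to sum to 1 and map feature indices back to pixel positions. A minimal sketch, assuming the 784 Fashion MNIST features are a row-major flattening of a 28x28 image; `normalize` and `top_pixels` are hypothetical helper names, and the scores below are toy values, not results from this comparison.

```python
def normalize(importances):
    """Scale non-negative importance scores so they sum to 1."""
    total = sum(importances)
    if total == 0:
        return [0.0 for _ in importances]
    return [value / total for value in importances]


def top_pixels(importances, k=3, width=28):
    """Return the k most important features as (row, col, share) tuples."""
    shares = normalize(importances)
    ranked = sorted(range(len(shares)), key=lambda i: shares[i], reverse=True)
    # Assumes row-major flattening: index = row * width + col
    return [(i // width, i % width, shares[i]) for i in ranked[:k]]


# Toy scores for a 2x2 "image"; real Fashion MNIST vectors have 784 entries
print(top_pixels([1, 3, 0, 4], k=2, width=2))
```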

2.3.2 SHAP values

Class	Meaning
0	T-shirt/top
1	Trouser
2	Pullover
3	Dress
4	Coat
5	Sandal
6	Shirt
7	Sneaker
8	Bag
9	Ankle boot

XGBoost
[Figure: SHAP values for XGBoost]

LightGBM
[Figure: SHAP values for LightGBM]

CatBoost does not work out of the box.
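
For readers unfamiliar with SHAP values: they are Shapley values, which split the difference between a model's prediction and a baseline prediction among the features. The sketch below computes them by brute force over all feature orderings for a toy function. It only illustrates the definition; the shap library uses much faster algorithms for tree models, and `shapley_values` and `toy_model` are hypothetical names.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]  # switch feature i from its baseline to its real value
            value = predict(current)
            totals[i] += value - previous  # marginal contribution of feature i
            previous = value
    return [total / len(orderings) for total in totals]


def toy_model(features):
    # Hypothetical model: one additive term plus one interaction term
    return 2 * features[0] + features[1] * features[2]


values = shapley_values(toy_model, x=[1, 1, 1], baseline=[0, 0, 0])
print(values)
```

The values always sum to the prediction at `x` minus the prediction at the baseline, which is what makes per-feature plots additive. The cost grows factorially with the number of features, so this approach is only usable on toy inputs.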

2.3.3 Visualizing the binary trees

XGBoost
[Figure: tree plot for XGBoost]

LightGBM
[Figure: tree plot for LightGBM]

CatBoost tree-plotting function

3. Summary

Choose LightGBM for competitions and CatBoost for industrial use.

4. Code

https://download.csdn.net/download/lly1122334/12171980

References

Battle of the Boosting Algos: LGB, XGB, Catboost
Battle of the Boosting Algorithms
mlxtend: A library of extension and helper modules for Python’s data analysis and machine learning libraries
shap: A game theoretic approach to explain the output of any machine learning model
http://www.picnet.com.au/blogs/guido/post/2016/09/22/xgboost-windows-x64-binaries-for-download/
Graphviz – Graph Visualization Software Windows Packages

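Backup

# Feature Engineering
# this cell was adapted from https://www.kaggle.com/mahtieu/nyc-taxi-fare-prediction-data-expl-xgboost
import pandas as pd
from geopy import distance  # assumption: geopy provides the distance.distance(...).miles calls below

def feature_engineering(df):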
    df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
    #Drop rows with null values
    df = df.dropna(how = 'any', axis = 'rows')
    #Free rides, negative fares and passenger count filtering
    df = df[df.eval('(fare_amount > 0) & (passenger_count <= 6)')]
    # Coordinates filtering - Pickup and dropoff locations should be within the limits of NYC
    df = df[(df.pickup_longitude >= -77) &
                  (df.pickup_longitude <= -70) &
                  (df.dropoff_longitude >= -77) &
                  (df.dropoff_longitude <= -70) &
                  (df.pickup_latitude >= 35) &
                  (df.pickup_latitude <= 45) &
                  (df.dropoff_latitude >= 35) &
                  (df.dropoff_latitude <= 45)]

    df.pickup_datetime = df.pickup_datetime.dt.tz_convert('UTC')
    df.pickup_datetime = df.pickup_datetime.dt.tz_convert('America/New_York')

    # Fares may change every year
    df['year'] = df.pickup_datetime.dt.year

    # Different fares during weekdays and weekends
    df['dayofweek'] = df.pickup_datetime.dt.dayofweek

    # Different fares during public holidays
    df['dayofyear'] = df.pickup_datetime.dt.dayofyear

    # Different fares in peak periods and off-peak periods
    df['hourofday'] = df.pickup_datetime.dt.hour

    df = df.drop('pickup_datetime', axis=1)

    # Computes the distance (in miles) between the pickup and the dropoff locations
    df['distance'] = df.apply(
        lambda x: distance.distance((x.pickup_latitude, x.pickup_longitude), (x.dropoff_latitude, x.dropoff_longitude)).miles,
        axis = 1)

    df = df[df.eval('(distance > 0) & (distance < 150)')]
    fare_distance_ratio = (df.fare_amount/df.distance)
    fare_distance_ratio.describe()

    (fare_distance_ratio[fare_distance_ratio < 45]).hist()

    # Drop incoherent fares
    df = df[fare_distance_ratio < 45]
    del fare_distance_ratio

    # Coordinates of the 3 airports of NYC
    airports = {'jfk': [40.6441666, -73.7822222],
                'laguardia': [40.7747222, -73.8719444],
                'newark': [40.6897222, -74.175]}

    # Computes the distance between the pickup location and the airport
    pickup = df.apply(lambda x: distance.distance((x.pickup_latitude, x.pickup_longitude), (airports.get('jfk'))).miles, axis=1)
    # Computes the distance between the dropoff location and the airport
    dropoff = df.apply(lambda x: distance.distance((x.dropoff_latitude, x.dropoff_longitude), (airports.get('jfk'))).miles, axis=1)
    # Selects the shortest distance
    df['to_jfk'] = pd.concat((pickup, dropoff), axis=1).min(axis=1)

    pickup = df.apply(lambda x: distance.distance((x.pickup_latitude, x.pickup_longitude), (airports.get('laguardia'))).miles, axis=1)
    dropoff = df.apply(lambda x: distance.distance((x.dropoff_latitude, x.dropoff_longitude), (airports.get('laguardia'))).miles, axis=1)
    df['to_laguardia'] = pd.concat((pickup, dropoff), axis=1).min(axis=1)

    pickup = df.apply(lambda x: distance.distance((x.pickup_latitude, x.pickup_longitude), (airports.get('newark'))).miles, axis=1)
    dropoff = df.apply(lambda x: distance.distance((x.dropoff_latitude, x.dropoff_longitude), (airports.get('newark'))).miles, axis=1)
    df['to_newark'] = pd.concat((pickup, dropoff), axis=1).min(axis=1)
    del pickup, dropoff
    return df
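
The row-by-row distance calls above can be slow on the 2-million-row dataset. As a dependency-free alternative, the haversine formula gives the great-circle distance on a sphere. This is a different technique from the `distance.distance` call in the backup code, which, as far as I know, computes a geodesic distance on an ellipsoid, so the two results will differ slightly. A minimal sketch; `haversine_miles` is a hypothetical helper name.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Airport coordinates taken from the backup code above
jfk = (40.6441666, -73.7822222)
laguardia = (40.7747222, -73.8719444)
print(round(haversine_miles(*jfk, *laguardia), 1))
```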

This article comes from CSDN; the views expressed do not represent the position of 一起大数据-技术文章心得. If you repost it, please cite the source: https://blog.csdn.net/lly1122334/article/details/104294112

Note: this article belongs to its author and may not be reposted without the author's permission.
