Hyperparameter Tuning in Machine Learning

Tuning the random forest parameters
Tuning the XGB parameters
Prediction

Tuning the random forest parameters

rf_model

RandomForestClassifier (random_state=0)

Random forest parameters tuned here

max_depth : maximum depth of each decision tree
n_estimators : number of decision trees that take part in the majority vote

import numpy as np

params = {
    'max_depth': [2, 5, 10],
    'n_estimators': np.linspace(10, 100, 5, dtype='int'),
}

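About np.linspace
numpy.linspace() also generates an arithmetic sequence, but you specify the number of elements rather than the step (common difference). The first argument, start, is the first value; the second argument, stop, is the last value; the third argument, num, is the number of elements.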

print(np.linspace(0, 10, 3))
# [ 0.  5. 10.]
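As a small illustration of how this applies to the n_estimators grid in params, the sketch below evaluates the same call with and without dtype='int'. With an integer dtype the evenly spaced values are rounded down, which is why the grid contains 32 and 77 rather than 32.5 and 77.5. The variable names as_float and as_int are only for this example.

```python
import numpy as np

# Five evenly spaced values from 10 to 100, both ends included
as_float = np.linspace(10, 100, 5)

# Same call with an integer dtype: fractional parts are dropped
as_int = np.linspace(10, 100, 5, dtype='int')

print(as_float.tolist())  # [10.0, 32.5, 55.0, 77.5, 100.0]
print(as_int.tolist())    # [10, 32, 55, 77, 100]
```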

Use GridSearchCV() for tuning

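The arguments are as follows:

The first argument is the machine learning model.
param_grid takes a dict (or a list of dicts) holding the names of the parameters to tune and the values to try.
cv sets the number of cross-validation folds used to find the best parameters.
scoring sets the metric evaluated during tuning (the F1 score here).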

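from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid_rf_model = GridSearchCV(
    RandomForestClassifier(random_state=0),  # machine learning model
    param_grid=params,  # dictionary of parameters to tune
    cv=5,  # number of cross-validation folds
    scoring='f1',  # evaluation metric
    n_jobs=-1,  # number of jobs to run in parallel; -1 uses all available CPU cores
    )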

grid_rf_model.fit(x_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=0), n_jobs=-1,
             param_grid={'max_depth': [2, 5, 10],
                         'n_estimators': array([ 10,  32,  55,  77, 100])},
             scoring='f1')

Best model according to the F1 score

grid_rf_model.best_estimator_

RandomForestClassifier(max_depth=10, n_estimators=10, random_state=0)
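Besides best_estimator_, a fitted GridSearchCV also exposes best_params_, best_score_ and cv_results_, which help show how close the other candidates were. The following is a minimal sketch on synthetic data, not the dataset used in this post; the data from make_classification, the reduced grid and cv=3 are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (stand-in for x_train, y_train)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'max_depth': [2, 5], 'n_estimators': [10, 20]},
    cv=3,
    scoring='f1',
)
search.fit(X, y)

# Best parameter combination and its mean cross-validated F1 score
print(search.best_params_)
print(search.best_score_)

# Mean F1 score for every candidate in the grid
results = search.cv_results_
for candidate, mean_f1 in zip(results['params'], results['mean_test_score']):
    print(candidate, round(mean_f1, 3))
```

Looking at the full table is useful when several candidates score almost the same, since the simplest of them may be preferable.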

Saving the model

import pickle

with open('grid_rf_model.pkl', mode='wb') as f:
    pickle.dump(grid_rf_model.best_estimator_, f)

Tuning the XGB parameters

xgb_model

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

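XGB parameters tuned here

max_depth: maximum depth of each decision tree
min_child_weight: minimum sum of instance weights required in a leaf
gamma: minimum loss reduction required to add another leaf
subsample: fraction of samples (rows) drawn at random for each tree
colsample_bytree: fraction of columns drawn at random for each tree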

gridParams = {
    'max_depth': [3, 6, 8, 10],
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}
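A grid search trains one model per parameter combination per fold, so it is worth checking the size of the grid before running it. The sketch below counts the candidates for the grid above using only the standard library; the name grid_params is local to this example, and cv=5 is the value passed to GridSearchCV in this post.

```python
from itertools import product

grid_params = {
    'max_depth': [3, 6, 8, 10],
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

# Every combination of one value per parameter
candidates = list(product(*grid_params.values()))
n_candidates = len(candidates)

# Each candidate is trained once per cross-validation fold
cv = 5
n_cv_fits = n_candidates * cv

print(n_candidates)  # 540
print(n_cv_fits)     # 2700
```

With the default refit=True, GridSearchCV adds one more fit of the best candidate on the full training set.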

Early stopping parameters

early_stopping_rounds: number of rounds without a decrease in the loss after which training stops
eval_set: dataset used for evaluation

fitParams = {'early_stopping_rounds':10,
              'eval_set':[[x_test, y_test]]}
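The idea behind early_stopping_rounds can be sketched in plain Python: keep track of the best validation loss so far and stop once it has not improved for the given number of rounds. This is a simplified illustration of the concept, not XGBoost's actual implementation; the function name early_stop and the loss values are hypothetical.

```python
def early_stop(losses, patience):
    """Return (best_round, last_round) for a sequence of validation losses.

    Training stops at the first round that is `patience` rounds past
    the best round without any improvement.
    """
    best_round = 0
    best_loss = float('inf')
    for round_idx, loss in enumerate(losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, round_idx
    # Ran through all rounds without triggering the stop
    return best_round, len(losses) - 1


# Hypothetical validation losses, one per boosting round
example_losses = [0.60, 0.50, 0.45, 0.44, 0.45, 0.46, 0.47]
best_round, last_round = early_stop(example_losses, patience=2)
print(best_round, last_round)  # 3 5
```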

import xgboost as xgb

grid_xgb_model = GridSearchCV(
    xgb.XGBClassifier(random_state=0),
    param_grid=gridParams,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    )

grid_xgb_model.fit(
    x_train,
    y_train,
    **fitParams,
    verbose=2)

[0]	validation_0-logloss:0.59234
[2]	validation_0-logloss:0.49255
[4]	validation_0-logloss:0.45351
[6]	validation_0-logloss:0.44101
[8]	validation_0-logloss:0.43809
[10]	validation_0-logloss:0.43891
[12]	validation_0-logloss:0.43864
[14]	validation_0-logloss:0.43950
[16]	validation_0-logloss:0.44204
[18]	validation_0-logloss:0.44149
[20]	validation_0-logloss:0.43812
[22]	validation_0-logloss:0.44352
[23]	validation_0-logloss:0.44228

GridSearchCV(cv=5,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     gamma=None, gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_bin=None,
                                     max_ca...
                                     max_leaves=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, n_jobs=None,
                                     num_parallel_tree=None, predictor=None,
                                     random_state=0, reg_alpha=None,
                                     reg_lambda=None, ...),
             n_jobs=-1,
             param_grid={'colsample_bytree': [0.6, 0.8, 1.0],
                         'gamma': [0.5, 1, 1.5, 2, 5],
                         'max_depth': [3, 6, 8, 10],
                         'min_child_weight': [1, 5, 10],
                         'subsample': [0.6, 0.8, 1.0]},
             scoring='f1')

Best model according to the F1 score

grid_xgb_model.best_estimator_

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.6,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=1.5, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=10,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

Saving the model

with open('grid_xgb_model.pkl', mode='wb') as f:
    pickle.dump(grid_xgb_model.best_estimator_, f)

pickle.dump writes the object to the file.

Prediction

Predicting with the random forest

with open('grid_rf_model.pkl', mode='rb') as f:
   rf_best_model=pickle.load(f)

pickle.load(f) loads the saved model.
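The save and load steps form a round trip: whatever object was dumped comes back as an equal copy. The sketch below shows this with a plain dictionary standing in for the fitted model; the dictionary, the temporary directory and the file name example.pkl are assumptions for illustration only.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model object
model_like = {'max_depth': 10, 'n_estimators': 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.pkl')

    # Serialize the object to a binary file
    with open(path, mode='wb') as f:
        pickle.dump(model_like, f)

    # Read it back into a new object
    with open(path, mode='rb') as f:
        restored = pickle.load(f)

print(restored == model_like)  # True
print(restored is model_like)  # False
```

Unpickling can execute arbitrary code, so only load pickle files from a source you trust.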

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, roc_curve
rf_pred = rf_best_model.predict(x_test)

rf_best_model.score(x_test, y_test)

0.788

accuracy_score(y_test, rf_pred)

0.788

f1_score(y_test, rf_pred)

0.273972602739726
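The gap between the accuracy (0.788) and the F1 score (about 0.274) is easier to see from the confusion-matrix counts. The counts below are not stated in the original; they are inferred from the accuracy and from the ROC rates reported in this section, assuming 250 test samples of which 54 are positive.

```python
# Confusion-matrix counts inferred from the reported metrics (an assumption)
tp, fp, fn, tn = 10, 9, 44, 187

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(accuracy, 3))  # 0.788
print(round(f1, 4))        # 0.274
```

Under these counts the model finds only 10 of the 54 positives, so recall is low even though most samples are classified correctly.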

fpr, tpr, thresholds = roc_curve(y_test, rf_pred)
fpr, tpr, thresholds

(array([0.        , 0.04591837, 1.        ]),
 array([0.        , 0.18518519, 1.        ]),
 array([2, 1, 0], dtype=int64))

auc = roc_auc_score(y_test, rf_pred)
auc

0.5696334089191232
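Because rf_pred holds hard 0/1 labels, the ROC curve above has only one point between (0, 0) and (1, 1), and the area under it reduces to (1 + TPR - FPR) / 2. The sketch below checks that arithmetic against the printed values; the fractions 10/54 and 9/196 are inferred from the printed rates rather than given in the original.

```python
# Rates at the single interior ROC point (fractions inferred from the output)
tpr = 10 / 54   # 0.18518519
fpr = 9 / 196   # 0.04591837

# Area of the two trapezoids under (0,0) -> (fpr,tpr) -> (1,1)
auc_single_point = (1 + tpr - fpr) / 2

print(round(auc_single_point, 4))  # 0.5696
```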

import matplotlib.pyplot as plt
plt.plot(fpr, tpr, label='ROC curve (area = %.2f)' % auc)  # %.2f formats to two decimal places
plt.legend()
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)

Predicting with XGB

with open('grid_xgb_model.pkl', mode='rb') as f:
    xgb_best_model = pickle.load(f)

xgb_pred= xgb_best_model.predict(x_test)

xgb_best_model.score(x_test, y_test)

0.812

accuracy_score(y_test, xgb_pred)

0.812

f1_score(y_test, xgb_pred)

0.37333333333333335

fpr , tpr, thresholds = roc_curve(y_test, xgb_pred)
fpr, tpr, thresholds

(array([0.        , 0.03571429, 1.        ]),
 array([0.        , 0.25925926, 1.        ]),
 array([2, 1, 0]))

auc = roc_auc_score(y_test, xgb_pred)
auc

0.6117724867724869

plt.plot(fpr, tpr, label='ROC curve (area = %.2f)' % auc)
plt.legend()
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)
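Both ROC curves in this post are computed from predicted labels, which gives roc_curve only one threshold to work with. A common alternative is to pass the predicted probability of the positive class from predict_proba. The following is a rough sketch on synthetic data, not the dataset used here; the data, the model settings and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (stand-in for the real dataset)
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

label_pred = model.predict(X_te)              # hard 0/1 labels
proba_pred = model.predict_proba(X_te)[:, 1]  # probability of class 1

# ROC from labels has at most one interior point; probabilities give more
_, _, label_thresholds = roc_curve(y_te, label_pred)
_, _, proba_thresholds = roc_curve(y_te, proba_pred)
print(len(label_thresholds), len(proba_thresholds))

auc_from_labels = roc_auc_score(y_te, label_pred)
auc_from_proba = roc_auc_score(y_te, proba_pred)
print(auc_from_labels, auc_from_proba)
```

The probability-based curve reflects how well the model ranks samples across all thresholds, while the label-based one describes only the default cutoff.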