Data Formatting 3

Data Replacement
Dummy Variable Encoding
Handling Missing Values
Dropping Missing Values
Filling Missing Values with the Mean

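Data Replacement

Data analysis often requires converting qualitative variables into quantitative variables.

A qualitative variable is data expressed as something other than a number, such as a name, gender, or color.
A quantitative variable is data that can be expressed as a number, such as height or weight.

So let's use data replacement to convert a qualitative variable into a quantitative one.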

print(df.head())
df = df.replace({True:1,False:0})
print(df.head())

         date home_team away_team  home_score  away_score tournament     city  \
0  1872-11-30  Scotland   England           0           0   Friendly  Glasgow   
1  1873-03-08   England  Scotland           4           2   Friendly   London   
2  1874-03-07  Scotland   England           2           1   Friendly  Glasgow   
3  1875-03-06   England  Scotland           2           2   Friendly   London   
4  1876-03-04  Scotland   England           3           0   Friendly  Glasgow   

    country  neutral  
0  Scotland    False  
1   England    False  
2  Scotland    False  
3   England    False  
4  Scotland    False  
         date home_team away_team  home_score  away_score tournament     city  \
0  1872-11-30  Scotland   England           0           0   Friendly  Glasgow   
1  1873-03-08   England  Scotland           4           2   Friendly   London   
2  1874-03-07  Scotland   England           2           1   Friendly  Glasgow   
3  1875-03-06   England  Scotland           2           2   Friendly   London   
4  1876-03-04  Scotland   England           3           0   Friendly  Glasgow   

    country  neutral  <- this column has been replaced
0  Scotland        0  
1   England        0  
2  Scotland        0  
3   England        0  
4  Scotland        0

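We used the replace() function to perform the replacement.
- Comparing the output before and after replace() confirms that the values were replaced.
To perform several replacements at once, separate the pairs inside the braces with commas (,).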

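Summary

replace({old_value: new_value}) replaces old_value with new_value.

Note

Pay attention to the types of brackets: the pairs go in braces {} inside the parentheses ().
True and False are bool values, so they do not need quotes (' ').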
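
The points above (several pairs separated by commas, no quotes for bool values, quotes for strings) can be sketched on a small example. The DataFrame df_demo and its values are hypothetical and are not part of the match dataset shown earlier.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({
    "team": ["Scotland", "England", "Wales"],
    "neutral": [False, True, False],
    "result": ["win", "lose", "draw"],
})

# bool values are written without quotes
df_demo["neutral"] = df_demo["neutral"].replace({True: 1, False: 0})

# strings need quotes; several pairs are separated by commas
df_demo["result"] = df_demo["result"].replace({"win": 2, "draw": 1, "lose": 0})

print(df_demo)
```

Applying replace() to a single column, as in this sketch, keeps the replacement from touching other columns that happen to contain the same values.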

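Dummy Variable Encoding

"Dummy variable encoding" is a technique that converts a qualitative column into quantitative variables that take only the values 0 and 1.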

pd.get_dummies(df_titanic["Embarked"])

	C	Q	S
0	0	0	1
1	1	0	0
2	0	0	1
3	0	0	1
4	0	0	1
…	…	…	…
741	0	0	1
742	0	0	1
743	0	0	1
744	0	0	1
745	0	0	1

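We used get_dummies() to perform dummy variable encoding.
It creates one column for each category of the qualitative variable and assigns 1 to the rows that belong to that category and 0 to all others.

Note

Be aware that dummy variable encoding increases the number of variables (the number of columns in the data).

Summary

To perform dummy variable encoding: pd.get_dummies(dataframe_name["column_name"])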
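
As a minimal sketch of the behavior described above, the following uses a hypothetical four-row Series named embarked in place of df_titanic["Embarked"]. In pandas 2.0 and later, get_dummies() returns True/False by default, so dtype=int is passed here to obtain 0/1 values like those in the output shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for df_titanic["Embarked"]
embarked = pd.Series(["S", "C", "S", "Q"], name="Embarked")

# One column per category; dtype=int requests 0/1 instead of True/False
dummies = pd.get_dummies(embarked, dtype=int)
print(dummies)

# One original column has become three columns (C, Q, S)
print(dummies.shape)
```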

Handling Missing Values

How many missing values are there?

df2.head()

	Age	Blood_pressure	Vital_capacity	Gender	Disease	Weight	Height
0	22.0	110.0	4300.0	M	1	79.0	183.0
1	NaN	128.0	NaN	M	1	NaN	NaN
2	24.0	104.0	3900.0	F	0	53.0	165.0
3	25.0	112.0	3000.0	F	0	45.0	155.0
4	27.0	108.0	4800.0	M	0	80.0	192.0

# Count per column
print(df2.isna().sum())

Age               5
Blood_pressure    3
Vital_capacity    3
Gender            0
Disease           0
Weight            5
Height            7
dtype: int64

# Count per row (first 5 rows only)
print(df2.isna().sum(axis=1).head())

0    0
1    4
2    0
3    0
4    0
dtype: int64

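isna() is a function that checks whether each value in the DataFrame is a missing value (NaN).
- If the value is missing (NaN): True
- If the value is not missing: False
When applied to the result of isna(), sum() counts the number of True values.
- axis=0 counts per column (this is also the default when nothing is specified)
- axis=1 counts per row
We can see that every column except Gender and Disease contains missing values.
We can see that the second row (index label 1) has 4 missing values.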
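
The way isna() and sum() work together can be sketched on a small example. The DataFrame df_demo below is hypothetical and much smaller than df2.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values in the second row
df_demo = pd.DataFrame({
    "Age": [22.0, np.nan, 24.0],
    "Gender": ["M", "M", "F"],
    "Weight": [79.0, np.nan, 53.0],
})

# True where a value is missing, False otherwise
mask = df_demo.isna()
print(mask)

# True counts as 1 when summed, so sum() gives the number of missing values
col_counts = mask.sum()        # axis=0 (default): per column
row_counts = mask.sum(axis=1)  # axis=1: per row
print(col_counts)
print(row_counts)
```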

Dropping Missing Values

df2.dropna().head()

	Age	Blood_pressure	Vital_capacity	Gender	Disease	Weight	Height
0	22.0	110.0	4300.0	M	1	79.0	183.0
2	24.0	104.0	3900.0	F	0	53.0	165.0
3	25.0	112.0	3000.0	F	0	45.0	155.0
4	27.0	108.0	4800.0	M	0	80.0	192.0
6	28.0	126.0	3800.0	F	1	43.0	164.0

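We used the dropna() function to delete the rows that contain missing values.
- The row with index label 1, where we found missing values earlier, has been removed.

Let's use isna() to check whether all the missing values are really gone.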
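
One way to do this check is sketched below on a hypothetical DataFrame df_demo. Summing the result of isna() twice gives the total number of missing values in the whole table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the row with index label 1 contains missing values
df_demo = pd.DataFrame({
    "Age": [22.0, np.nan, 24.0],
    "Gender": ["M", "M", "F"],
    "Weight": [79.0, np.nan, 53.0],
})

# dropna() returns a new DataFrame; df_demo itself is not modified
df_clean = df_demo.dropna()

# First sum(): count per column. Second sum(): total over all columns.
remaining = int(df_clean.isna().sum().sum())
print(remaining)
print(df_clean.index.tolist())
```

Because dropna() returns a new DataFrame, the result has to be assigned to a variable (df_clean here) if it is to be used later.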

Filling Missing Values with the Mean

df2.fillna(df2.mean(numeric_only=True)).head()

	Age	Blood_pressure	Vital_capacity	Gender	Disease	Weight	Height
0	22.00	110.0	4300.000000	M	1	79.00	183.000000
1	38.72	128.0	3510.740741	M	1	57.12	166.173913
2	24.00	104.0	3900.000000	F	0	53.00	165.000000
3	25.00	112.0	3000.000000	F	0	45.00	155.000000
4	27.00	108.0	4800.000000	M	0	80.00	192.000000

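The missing values are filled in with the mean of each column.
- The row with index label 1, where we found missing values earlier, has been filled in.
The fillna() function specifies how the missing values are filled.
- fillna() supports several ways of filling missing values, so look them up as needed.
The mean() function computes the mean of each column.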
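
A minimal sketch of filling with column means follows, using a hypothetical DataFrame df_demo. It assumes the table contains a non-numeric column such as Gender, so numeric_only=True is passed to mean(); recent pandas versions raise an error when asked for the mean of a string column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each numeric column
df_demo = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0],
    "Gender": ["M", "M", "F"],
    "Weight": [60.0, 50.0, np.nan],
})

# Mean of each numeric column (NaN is skipped); Gender is left out
col_means = df_demo.mean(numeric_only=True)
print(col_means)

# Each NaN is filled with the mean of its own column
df_filled = df_demo.fillna(col_means)
print(df_filled)
```

Passing a Series to fillna() matches its index labels against the column names, which is why each column receives its own mean.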