Extracting rows with pandas

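judge = df['country'] == 'England'  # compare every row's country with 'England'
print(judge)  # a boolean Series: True or False for each row
df[judge]  # a DataFrame containing only the rows where judge is True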

0        False
1         True
2        False
3         True
4        False
         ...  
39003    False
39004    False
39005    False
39006    False
39007    False
Name: country, Length: 39008, dtype: bool

	date	home_team	away_team	home_score	away_score	tournament	city	country	neutral
1	1873-03-08	England	Scotland	4	2	Friendly	London	England	False
3	1875-03-06	England	Scotland	2	2	Friendly	London	England	False
6	1877-03-03	England	Scotland	1	3	Friendly	London	England	False
10	1879-01-18	England	Wales	2	1	Friendly	London	England	False
11	1879-04-05	England	Scotland	5	4	Friendly	London	England	False
…	…	…	…	…	…	…	…	…	…
38873	2018-03-27	Australia	Colombia	0	0	Friendly	London	England	True
38881	2018-03-27	England	Italy	1	1	Friendly	London	England	False
38907	2018-03-27	Serbia	Nigeria	2	0	Friendly	London	England	True
38981	2018-06-02	England	Nigeria	2	1	Friendly	London	England	False
38997	2018-06-03	Croatia	Brazil	0	2	Friendly	Liverpool	England	True

568 rows × 9 columns

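Two outputs are shown above: the result of print(judge) and the result of df[judge].
The code extracts only the matches that were played in England.
df['country'] == 'England' produces True or False for each row.
- == is a comparison operator that tests whether its left and right sides are equal
- The result has the same number of rows as df
df[judge] outputs only the rows for which the value inside df[] (judge) is True.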
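
The same steps can be tried on a tiny table. This is only a sketch: the `matches` DataFrame and its values are made up for illustration and are not taken from the football dataset.

```python
import pandas as pd

# Hypothetical miniature of the match data (values invented for illustration)
matches = pd.DataFrame({
    "home_team": ["Scotland", "England", "Wales", "England"],
    "country": ["Scotland", "England", "Wales", "England"],
})

mask = matches["country"] == "England"  # boolean Series, one value per row
england_only = matches[mask]            # keeps only the rows where mask is True

print(mask.tolist())                # [False, True, False, True]
print(len(mask) == len(matches))    # True
print(england_only.index.tolist())  # [1, 3]
```

Note that the original row labels (1 and 3) are kept in the filtered result, which matches the gaps in the index column of the output above.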

Other basic comparison operators

a > b   a is greater than b
a < b   a is less than b
a <= b  a is less than or equal to b
a >= b  a is greater than or equal to b
a != b  a is not equal to b
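
Applied to a pandas Series, each of these operators compares element by element and returns a boolean Series. A small sketch, using a made-up `scores` Series:

```python
import pandas as pd

scores = pd.Series([0, 2, 3, 5])  # invented goal counts

print((scores > 3).tolist())   # [False, False, False, True]
print((scores < 3).tolist())   # [True, True, False, False]
print((scores <= 3).tolist())  # [True, True, True, False]
print((scores >= 3).tolist())  # [False, False, True, True]
print((scores != 3).tolist())  # [True, True, False, True]
```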

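Rows can also be extracted in other ways, for example:

df['column_name'].str.contains('search_string')
- Tests, for each value in the specified column, whether it contains the search string, and returns True or False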

import pandas as pd

df1 = pd.read_csv("International_football_results.csv")
df1.info()
match_2010 = df1["date"].str.contains("2010")  # choose the column, then give the search string
df1[match_2010]  # match_2010 is a boolean Series, so it can be used to filter rows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39008 entries, 0 to 39007
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        39008 non-null  object
 1   home_team   39008 non-null  object
 2   away_team   39008 non-null  object
 3   home_score  39008 non-null  int64 
 4   away_score  39008 non-null  int64 
 5   tournament  39008 non-null  object
 6   city        39008 non-null  object
 7   country     39008 non-null  object
 8   neutral     39008 non-null  bool  
dtypes: bool(1), int64(2), object(6)
memory usage: 2.4+ MB

	date	home_team	away_team	home_score	away_score	tournament	city	country	neutral
31408	2010-01-02	Iran	Korea DPR	1	0	Friendly	Doha	Qatar	True
31409	2010-01-02	Qatar	Mali	0	0	Friendly	Doha	Qatar	False
31410	2010-01-02	Syria	Zimbabwe	6	0	Friendly	Kuala Lumpur	Malaysia	True
31411	2010-01-02	Yemen	Tajikistan	0	1	Friendly	Sana’a	Yemen	False
31412	2010-01-03	Angola	Gambia	1	1	Friendly	Vila Real de Santo António	Portugal	True
…	…	…	…	…	…	…	…	…	…
32241	2010-12-28	Saudi Arabia	Iraq	0	1	Friendly	Dammam	Saudi Arabia	False
32242	2010-12-29	Indonesia	Malaysia	2	1	AFF Championship	Jakarta	Indonesia	False
32243	2010-12-30	Syria	Korea Republic	0	1	Friendly	Abu Dhabi	United Arab Emirates	True
32244	2010-12-31	Kuwait	Zambia	4	0	Friendly	6th of October City	Egypt	True
32245	2010-12-31	Qatar	Korea DPR	0	1	Friendly	Doha	Qatar	False

838 rows × 9 columns
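
A minimal sketch of the same idea on a made-up `dates` Series. The `regex=False` and `na=False` arguments are optional additions that are not in the notes above: `regex=False` treats the search string as plain text, and `na=False` makes missing values count as "no match". `str.startswith` is shown as a stricter alternative that only matches at the beginning of the string.

```python
import pandas as pd

# Invented sample dates in YYYY-MM-DD form, including one missing value
dates = pd.Series(["2009-12-31", "2010-01-02", "2010-12-31", None])

# Plain-text substring search; missing values are treated as False
has_2010 = dates.str.contains("2010", regex=False, na=False)

# Only matches when the string begins with "2010" (the year part)
starts_2010 = dates.str.startswith("2010", na=False)

print(has_2010.tolist())         # [False, True, True, False]
print(starts_2010.tolist())      # [False, True, True, False]
print(dates[has_2010].tolist())  # ['2010-01-02', '2010-12-31']
```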

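Extracting rows that match multiple conditions

Combining several conditions requires two operators.

Union: the symbol | (Shift + ¥ on a Japanese keyboard), meaning "or"
Intersection: the symbol & (Shift + 6), meaning "and"

Conditions are combined with these two operators.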
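
As a quick illustration, here is how the two operators behave on a pair of made-up boolean Series `a` and `b`, covering all four True/False combinations:

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

print((a | b).tolist())  # [True, True, True, False]    "or": at least one is True
print((a & b).tolist())  # [True, False, False, False]  "and": both are True
```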

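# Each condition produces a boolean Series, and combining them with | gives another boolean Series
# Wrap each condition in parentheses, because | binds more tightly than >=
judge2 = (df['home_score'] >= 3) | (df['away_score'] >= 3)

print(judge2)
df[judge2]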

0        False
1         True
2        False
3        False
4         True
         ...  
39003    False
39004    False
39005    False
39006    False
39007     True
Length: 39008, dtype: bool

	date	home_team	away_team	home_score	away_score	tournament	city	country	neutral
1	1873-03-08	England	Scotland	4	2	Friendly	London	England	False
4	1876-03-04	Scotland	England	3	0	Friendly	Glasgow	Scotland	False
5	1876-03-25	Scotland	Wales	4	0	Friendly	Glasgow	Scotland	False
6	1877-03-03	England	Scotland	1	3	Friendly	London	England	False
8	1878-03-02	Scotland	England	7	2	Friendly	Glasgow	Scotland	False
…	…	…	…	…	…	…	…	…	…
38989	2018-06-02	Iceland	Norway	2	3	Friendly	Reykjavík	Iceland	False
38994	2018-06-03	Albania	Ukraine	1	4	Friendly	Évian-les-Bains	France	True
38995	2018-06-03	Saudi Arabia	Peru	0	3	Friendly	St. Gallen	Switzerland	True
38998	2018-06-03	Costa Rica	Northern Ireland	3	0	Friendly	San José	Costa Rica	False
39007	2018-06-04	India	Kenya	3	0	Friendly	Mumbai	India	False

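Two outputs are shown above: the result of print(judge2) and the result of df[judge2].
The code extracts the rows in which either home_score or away_score is 3 or more.
Joining the first condition and the second condition with | gives the meaning "or".
dataframe[(condition1) | (condition2)] means "select the rows that satisfy condition1 or condition2".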
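
The one-line form can be sketched as follows. The `games` table is invented for illustration; only the column names follow the football dataset.

```python
import pandas as pd

# Hypothetical scores (values made up)
games = pd.DataFrame({
    "home_score": [4, 0, 1, 2],
    "away_score": [2, 0, 3, 2],
})

# Both conditions are written directly inside the brackets, each in parentheses
high_scoring = games[(games["home_score"] >= 3) | (games["away_score"] >= 3)]

print(high_scoring.index.tolist())  # [0, 2]
```

Row 0 passes because of the home score and row 2 because of the away score; rows 1 and 3 satisfy neither condition.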

df2 = pd.read_csv("sample-data.csv")
df2.info()
young = (df2["Age"] <= 40) & (df2["Gender"] == "M")
df2[young]  # young is a boolean Series, so it can be used to filter rows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             25 non-null     float64
 1   Blood_pressure  27 non-null     float64
 2   Vital_capacity  27 non-null     float64
 3   Gender          30 non-null     object 
 4   Disease         30 non-null     int64  
 5   Weight          25 non-null     float64
 6   Height          23 non-null     float64
dtypes: float64(5), int64(1), object(1)
memory usage: 1.8+ KB

	Age	Blood_pressure	Vital_capacity	Gender	Disease	Weight	Height
0	22.0	110.0	4300.0	M	1	79.0	183.0
4	27.0	108.0	4800.0	M	0	80.0	192.0
11	32.0	124.0	3900.0	M	0	61.0	177.0
15	36.0	128.0	3420.0	M	1	55.0	154.0
16	37.0	116.0	3800.0	M	1	70.0	171.0
17	37.0	NaN	4150.0	M	1	NaN	NaN
19	39.0	116.0	4550.0	M	1	86.0	187.0
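
The info() output above shows that Age has only 25 non-null values out of 30. A comparison against a missing value (NaN) evaluates to False, so rows with a missing Age are left out of a filter like this one. A small sketch with a made-up `people` table:

```python
import numpy as np
import pandas as pd

# Invented sample with one missing Age
people = pd.DataFrame({
    "Age": [22.0, np.nan, 45.0, 37.0],
    "Gender": ["M", "M", "M", "F"],
})

# "and": both conditions must hold for a row to be kept
young_male = (people["Age"] <= 40) & (people["Gender"] == "M")

print(young_male.tolist())                # [True, False, False, False]
print(people[young_male].index.tolist())  # [0]
print(people["Age"].isna().tolist())      # [False, True, False, False]
```

Row 1 is excluded because its Age is missing, row 2 because the age is over 40, and row 3 because the gender is not "M".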