Titanic Survivor Prediction

Loading the data

In [3]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

raw_data = pd.read_excel("https://github.com/PinkWink/ML_tutorial/raw/refs/heads/master/dataset/titanic.xls")
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB

Column	Description	Data type
pclass	Passenger class (1 = first, 2 = second, 3 = third)	int64
survived	Survival (0 = died, 1 = survived)	int64
name	Passenger name	object
sex	Sex	object
age	Age	float64
sibsp	Number of siblings/spouses aboard	int64
parch	Number of parents/children aboard	int64
ticket	Ticket number	object
fare	Fare	float64
cabin	Cabin number	object
embarked	Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)	object
boat	Lifeboat number	object
body	Body identification number	float64
home.dest	Home/destination	object
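
The `info()` output above shows that several columns are incomplete (for example `age`, `cabin`, and `body`). A minimal sketch of counting missing values per column follows; the frame `toy` and its values are made up for illustration and are not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic table (values invented for illustration).
toy = pd.DataFrame({
    "age":   [29.0, np.nan, 2.0, np.nan],
    "fare":  [211.3, 151.6, np.nan, 7.9],
    "cabin": ["B5", None, None, None],
})

# isnull() marks missing cells; sum() counts them per column.
missing_counts = toy.isnull().sum()
# mean() of the boolean mask gives the fraction of rows missing per column.
missing_ratio = toy.isnull().mean()

print(missing_counts)
print(missing_ratio)
```

Running the same two calls on `raw_data` should reproduce the gaps that `info()` reports as non-null counts.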

Examining overall survival

In [11]:

# Create a 1x2 grid of subplots (12 inches wide, 6 inches tall)
# f: the Figure object, ax: array of Axes (two subplots)
f, ax = plt.subplots(1, 2, figsize=(12, 6))

# Left subplot: pie chart
# Count the values in the survived column and draw them as a pie chart
# explode=[0, 0.1]: pull the second slice (survivors) out by 0.1 for emphasis
# autopct='%1.1f%%': label each slice with its percentage (one decimal place)
raw_data['survived'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0])

# Set the title of the left subplot
ax[0].set_title('Survived')
# Remove the y-axis label (not needed for a pie chart)
ax[0].set_ylabel('')

# Right subplot: bar chart (countplot)
# Draw the count of each value in the survived column as bars
sns.countplot(x='survived', data=raw_data, ax=ax[1])
# Set the title of the right subplot
ax[1].set_title('Count plot - Survived')

# Show the figure
plt.show()
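
The pie chart and count plot both visualize `value_counts()`. As a rough sketch, the same numbers can be read directly without plotting. The series `survived` below is a hypothetical stand-in for `raw_data['survived']`.

```python
import pandas as pd

# Hypothetical survival flags (0 = died, 1 = survived); not the real data.
survived = pd.Series([0, 0, 0, 1, 1], name="survived")

# Absolute counts, ordered from most to least frequent value.
counts = survived.value_counts()
# normalize=True returns fractions that sum to 1 instead of counts.
shares = survived.value_counts(normalize=True)

print(counts)
print((shares * 100).round(1))
```

Because `value_counts()` orders values by frequency, the second entry of `explode=[0, 0.1]` applies to whichever value is less frequent.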

Passenger age distribution

In [12]:

raw_data['age'].hist(bins=20, figsize=(18, 8), grid=False)

<Axes: >

Survival by sex

In [16]:

f, ax = plt.subplots(1, 2, figsize = (12, 6))

raw_data['survived'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0])
ax[0].set_title('Survived')
ax[0].set_ylabel('')

sns.countplot(x='survived', data=raw_data, ax=ax[1], hue='sex')
ax[1].set_title('Survived')

Text(0.5, 1.0, 'Survived')

Survival rate by economic status (passenger class)

In [21]:

raw_data.groupby('pclass')['survived'].mean()

pclass
1    0.619195
2    0.429603
3    0.255289
Name: survived, dtype: float64

The analysis so far shows that women and first-class passengers had higher survival rates. Does that mean first class carried more women?
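
One way to approach this question numerically is a cross-tabulation of class against sex. The sketch below uses a small invented frame (`toy`); on the real data the same calls would take `raw_data['pclass']` and `raw_data['sex']`.

```python
import pandas as pd

# Hypothetical passengers (values invented for illustration).
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 3, 3, 3, 3],
    "sex": ["female", "female", "male", "male", "male", "male", "male", "female"],
})

# Number of passengers for each (class, sex) combination.
counts = pd.crosstab(toy["pclass"], toy["sex"])
# normalize="index" turns each row into shares that sum to 1 within a class.
shares = pd.crosstab(toy["pclass"], toy["sex"], normalize="index")

print(counts)
print(shares.round(2))
```

Row-normalized shares make classes of different sizes comparable, which raw counts do not.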

Sex breakdown by passenger class

In [25]:

# Create a FacetGrid: splits the data into one subplot per group
# row='pclass': one row per passenger class (1st, 2nd, 3rd)
# col="sex": one column per sex (male, female)
# height=4: height of each subplot (inches)
# aspect=2: width/height ratio (aspect=2 makes each subplot twice as wide as it is tall)
grid = sns.FacetGrid(raw_data, row='pclass', col="sex", height=4, aspect=2)

# Draw a histogram in each subplot
# plt.hist: matplotlib's histogram function
# 'age': use the age column on the x-axis
# bins=10: number of histogram bins
# Result: 3 rows (pclass) x 2 columns (sex) = 6 subplots, each showing an age histogram
grid.map(plt.hist, 'age', bins=10)

# Add a legend
grid.add_legend()

<seaborn.axisgrid.FacetGrid at 0x77d34085b950>

Third class carried many men, especially men in their twenties.

Survival by class across age

In [26]:

grid = sns.FacetGrid(raw_data, col="survived", row="pclass", height=4, aspect=2)
grid.map(plt.hist, "age", alpha=0.5, bins=20)
grid.add_legend()

<seaborn.axisgrid.FacetGrid at 0x77d33cfa5580>

The histograms show that many of the third-class passengers who died were young.
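
The histograms compare counts, so confirming a difference in rates needs a separate calculation. A minimal sketch, assuming a simple split at age 30 (the threshold, the column `age_band`, and the frame `toy` are all hypothetical choices for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical passengers (values invented for illustration).
toy = pd.DataFrame({
    "pclass":   [1, 1, 3, 3, 3, 3],
    "age":      [25, 40, 22, 24, 28, 45],
    "survived": [1, 1, 0, 0, 1, 0],
})

# Label each passenger as under or over the assumed age threshold.
toy["age_band"] = np.where(toy["age"] < 30, "under_30", "30_plus")

# The mean of a 0/1 column is the survival rate within each group.
rate = toy.groupby(["pclass", "age_band"])["survived"].mean()
print(rate)
```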

Binning age into five categories

In [27]:

raw_data['age_cat'] = pd.cut(
                raw_data['age'], 
                bins=[0, 7, 15, 30, 60, 100], 
                include_lowest=True, 
                labels=['baby', 'child', 'youth', 'adult', 'old'])
raw_data.head()

	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest	age_cat
0	1	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	2	NaN	St Louis, MO	youth
1	1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	11	NaN	Montreal, PQ / Chesterville, ON	baby
2	1	0	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON	baby
3	1	0	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	NaN	135.0	Montreal, PQ / Chesterville, ON	youth
4	1	0	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON	youth
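
The bin edges deserve a closer look: `pd.cut` builds right-closed intervals by default, so an age equal to an edge falls into the lower category, and `include_lowest=True` additionally admits the value 0. The sketch below applies the same bins and labels to a few hypothetical ages to make this visible.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on or just past each bin edge, plus one missing value.
ages = pd.Series([0, 7, 7.5, 15, 30, 30.5, 60, 80, np.nan])

age_cat = pd.cut(
    ages,
    bins=[0, 7, 15, 30, 60, 100],
    include_lowest=True,
    labels=["baby", "child", "youth", "adult", "old"],
)

# Show each age next to its assigned category.
print(pd.DataFrame({"age": ages, "age_cat": age_cat}))
```

An age of exactly 30 is labeled `youth`, which matches row 3 of the table above, and a missing age stays missing rather than being assigned a category.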

Survival rate by age, sex, and class at a glance

In [35]:

# Set the overall figure size (14 inches wide, 4 inches tall)
plt.figure(figsize=(14,4))

# First subplot (first of 1 row x 3 columns)
# Bar chart of survival rate (survived) by passenger class (pclass)
# barplot computes and plots the mean of survived for each group
plt.subplot(1, 3, 1)
sns.barplot(x='pclass', y="survived", data=raw_data)

# Second subplot (second of 1 row x 3 columns)
# Bar chart of survival rate by age category (age_cat)
plt.subplot(1, 3, 2)
sns.barplot(x='age_cat', y="survived", data=raw_data)

# Third subplot (third of 1 row x 3 columns)
# Bar chart of survival rate by sex
plt.subplot(1, 3, 3)
sns.barplot(x='sex', y="survived", data=raw_data)

# Adjust the subplot layout
# top, bottom, left, right: edges of the subplot area as fractions of the figure (0.0-1.0)
# hspace: vertical spacing between subplots (between rows)
# wspace: horizontal spacing between subplots (between columns)
plt.subplots_adjust(top=1, bottom=0.1, left=0.1, right=1, hspace=0.5, wspace=0.3)

# Show the figure
plt.show()

Being younger, female, and in first class all appear to have improved the odds of survival.
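
The three bar charts look at one factor at a time. As a sketch of how two factors could be combined, a pivot table gives the survival rate for every (sex, class) pair. The frame `toy` is invented for illustration.

```python
import pandas as pd

# Hypothetical passengers (values invented for illustration).
toy = pd.DataFrame({
    "sex":      ["female", "female", "male", "male", "female", "male"],
    "pclass":   [1, 3, 1, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0],
})

# Rows are sexes, columns are classes, cells are mean survival (the survival rate).
rate_table = toy.pivot_table(
    index="sex", columns="pclass", values="survived", aggfunc="mean"
)
print(rate_table)
```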

Extracting titles from passenger names

In [37]:

# Look at the first passenger's name
name = raw_data['name'][0]
print(f"Step 1 - original name: {name}")

# Split on the comma (separates the surname from the title and given names)
name_parts = name.split(",")
print(f"Step 2 - split on comma: {name_parts}")

# Take the second part (title and given names)
name_part = name_parts[1]
print(f"Step 3 - take the second part: '{name_part}'")

# Split on the period (separates the title from the given names)
title_parts = name_part.split(".")
print(f"Step 4 - split on period: {title_parts}")

# Take the first part (the title)
title = title_parts[0]
print(f"Step 5 - take the first part: '{title}'")

# Strip leading and trailing whitespace
title_clean = title.strip()
print(f"Step 6 - final result after stripping whitespace: '{title_clean}'")

# Display the final result
title_clean

Step 1 - original name: Allen, Miss. Elisabeth Walton
Step 2 - split on comma: ['Allen', ' Miss. Elisabeth Walton']
Step 3 - take the second part: ' Miss. Elisabeth Walton'
Step 4 - split on period: [' Miss', ' Elisabeth Walton']
Step 5 - take the first part: ' Miss'
Step 6 - final result after stripping whitespace: 'Miss'

'Miss'

In [39]:

extract_title = lambda name: name.split(",")[1].split(".")[0].strip()
raw_data['title'] = raw_data['name'].map(extract_title)

titles = raw_data['title'].unique()
titles

array(['Miss', 'Master', 'Mr', 'Mrs', 'Col', 'Mme', 'Dr', 'Major', 'Capt',
       'Lady', 'Sir', 'Mlle', 'Dona', 'Jonkheer', 'the Countess', 'Don',
       'Rev', 'Ms'], dtype=object)
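
The split-based lambda works as long as every name contains both a comma and a period. An alternative sketch uses a regular expression with pandas' vectorized `str.extract`. The pattern is my own assumption about the name format, and the third name below is a hypothetical example in that format.

```python
import pandas as pd

names = pd.Series([
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "Doe, the Countess. Jane",  # hypothetical name with a multi-word title
])

# After the comma and optional spaces, capture everything up to the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

A name that does not match the pattern yields a missing value instead of raising an `IndexError`, which can make malformed rows easier to find.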

In [41]:

raw_data['title'] = raw_data['title'].replace("Mlle", "Miss")
raw_data['title'] = raw_data['title'].replace("Ms", "Miss")
raw_data['title'] = raw_data['title'].replace("Mme", "Mrs")

Rare_f = ["Lady", "Dona", "the Countess"]
Rare_m = ["Capt", "Master", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer"]

for each in Rare_f:
    raw_data['title'] = raw_data['title'].replace(each, "Rare_f")

for each in Rare_m:
    raw_data['title'] = raw_data['title'].replace(each, "Rare_m")

raw_data['title'].unique()

array(['Miss', 'Rare_m', 'Mr', 'Mrs', 'Rare_f'], dtype=object)
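
The repeated `replace` calls and loops can also be written as one dictionary lookup. This is only a sketch of an equivalent form; `title_map` is a name introduced here, and `titles` is a short hypothetical series rather than the real column.

```python
import pandas as pd

# Hypothetical sample of raw titles.
titles = pd.Series(["Miss", "Mlle", "Ms", "Mme", "Mr", "Lady", "Col", "Master"])

# Build one mapping from every raw title to its group.
title_map = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
title_map.update({t: "Rare_f" for t in ["Lady", "Dona", "the Countess"]})
title_map.update({t: "Rare_m" for t in [
    "Capt", "Master", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer",
]})

# Titles without an entry in the mapping (Miss, Mr, Mrs) are left unchanged.
grouped = titles.replace(title_map)
print(grouped.tolist())
```

Keeping the grouping in a single dictionary makes it easier to review which titles end up in which category.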

# Compute the survival rate for each title
# - Select only the title and survived columns, then group
# - groupby("title"): group by title (Miss, Mr, Mrs, Rare_f, Rare_m)
# - as_index=False: keep the grouping key (title) as a regular column instead of the index
# - mean(): compute the mean of each group
#   * survived is 0 (died) or 1 (survived), so its mean is the survival rate
#   * e.g. 0.5 means a 50% survival rate, 0.8 means 80%
raw_data[["title", "survived"]].groupby(["title"], as_index=False).mean()

	title	survived
0	Miss	0.678030
1	Mr	0.162483
2	Mrs	0.787879
3	Rare_f	1.000000
4	Rare_m	0.448276
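
A rate of 1.0 for `Rare_f` is only as reliable as the number of passengers behind it. As a sketch, reporting the group size next to the mean makes small groups visible. The frame `toy` is invented for illustration.

```python
import pandas as pd

# Hypothetical titles and outcomes (values invented for illustration).
toy = pd.DataFrame({
    "title":    ["Mr", "Mr", "Mr", "Mr", "Miss", "Miss", "Rare_f"],
    "survived": [0, 0, 0, 1, 1, 0, 1],
})

# mean = survival rate, count = number of passengers in the group.
summary = toy.groupby("title")["survived"].agg(["mean", "count"])
print(summary)
```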

Predicting survivors with machine learning

from sklearn.preprocessing import LabelEncoder

# Convert sex to a number
raw_data['sex'].unique()
labelEncoder = LabelEncoder()
labelEncoder.fit(raw_data['sex'])
raw_data['gender'] = labelEncoder.transform(raw_data['sex'])
raw_data.head()

	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest	age_cat	title	gender
0	1	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	2	NaN	St Louis, MO	youth	Miss	0
1	1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	11	NaN	Montreal, PQ / Chesterville, ON	baby	Rare_m	1
2	1	0	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON	baby	Miss	0
3	1	0	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	NaN	135.0	Montreal, PQ / Chesterville, ON	youth	Mr	1
4	1	0	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON	youth	Mrs	0
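
`LabelEncoder` assigns codes in sorted order of the labels, which is why `female` becomes 0 and `male` becomes 1 in the table above. A short sketch of inspecting that mapping on a hypothetical series (`sex`, `mapping`, and `decoded` are names introduced here):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical values standing in for raw_data['sex'].
sex = pd.Series(["female", "male", "male", "female"])

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)  # fit and transform in one step

# classes_ holds the labels in code order: position i is the label encoded as i.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
# inverse_transform converts codes back to the original labels.
decoded = encoder.inverse_transform(codes)

print(mapping)
print(list(decoded))
```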

# Remove missing values (data preprocessing)
# Keep only rows where age is not missing (passengers with a known age)
# notnull(): returns a boolean Series that is True where the value is not NaN
# Using it as an index drops the rows with missing values
raw_data = raw_data[raw_data['age'].notnull()]

# Keep only rows where fare is not missing (passengers with a known fare)
# This filters the data that already passed the previous step
# Rows missing either model feature (age, fare) are removed before training
raw_data = raw_data[raw_data['fare'].notnull()]

# Check the basic information of the filtered DataFrame
# The row count has dropped (1309 -> 1045)
# Only rows with both age and fare remain
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1045 entries, 0 to 1308
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1045 non-null   int64   
 1   survived   1045 non-null   int64   
 2   name       1045 non-null   object  
 3   sex        1045 non-null   object  
 4   age        1045 non-null   float64 
 5   sibsp      1045 non-null   int64   
 6   parch      1045 non-null   int64   
 7   ticket     1045 non-null   object  
 8   fare       1045 non-null   float64 
 9   cabin      272 non-null    object  
 10  embarked   1043 non-null   object  
 11  boat       417 non-null    object  
 12  body       119 non-null    float64 
 13  home.dest  685 non-null    object  
 14  age_cat    1045 non-null   category
 15  title      1045 non-null   object  
 16  gender     1045 non-null   int64   
dtypes: category(1), float64(3), int64(5), object(8)
memory usage: 140.0+ KB
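
The two boolean filters can likely be collapsed into one `dropna` call restricted to the relevant columns. The sketch below checks on a small invented frame (`toy`) that both approaches keep the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical rows (values invented for illustration).
toy = pd.DataFrame({
    "age":   [29.0, np.nan, 2.0, 30.0],
    "fare":  [211.3, 151.6, np.nan, 7.9],
    "cabin": ["B5", None, None, None],
})

# Two-step filtering, as in the cell above.
filtered = toy[toy["age"].notnull()]
filtered = filtered[filtered["fare"].notnull()]

# One-step equivalent: drop rows missing age or fare, ignoring other columns.
dropped = toy.dropna(subset=["age", "fare"])

print(filtered.index.tolist(), dropped.index.tolist())
```

Limiting `subset` to `age` and `fare` matters here: a plain `dropna()` would also discard every row with a missing `cabin`.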

In [48]:

# Select the features and split the data

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = raw_data[['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender']]
y = raw_data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X_train, y_train)

pred = dt.predict(X_test)
print(accuracy_score(y_test, pred))

0.7511961722488039
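
The accuracy above comes from a single 80/20 split, so it can shift with a different `random_state`. One common way to get a steadier estimate is k-fold cross-validation. The sketch below runs it on synthetic data: `X_toy`, `y_toy`, and the rule that generates the labels are all invented, so the scores say nothing about the real Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic features loosely shaped like the real ones (hypothetical data).
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),   # 1, 2, or 3
    "age": rng.uniform(1, 70, size=n),
    "gender": rng.integers(0, 2, size=n),   # 0 or 1
})
# Invented labeling rule: "survive" if gender == 0 or first class.
y_toy = ((X_toy["gender"] == 0) | (X_toy["pclass"] == 1)).astype(int)

model = DecisionTreeClassifier(max_depth=4, random_state=42)
# Five folds: each fold is held out once while the model trains on the rest.
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")

print(scores)
print(scores.mean(), scores.std())
```

On the notebook's data the call would be `cross_val_score(dt, X, y, cv=5)`, with the mean and spread of the fold scores reported together.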

Estimating the survival probability of the Titanic film's lead characters

In [50]:

# Feature order expected by the model: ['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender']
dicaprio = pd.DataFrame([[3, 18, 0, 0, 5, 1]], columns=['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender'])
winslet = pd.DataFrame([[1, 16, 1, 1, 100, 0]], columns=['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender'])

print("DiCaprio :", dt.predict_proba(dicaprio)[0,1])
print("Winslet :", dt.predict_proba(winslet)[0,1])

DiCaprio : 0.14606741573033707
Winslet : 0.984375

With a predicted survival probability of only about 15%, DiCaprio's character most likely dies.
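
For a decision tree, `predict_proba` returns the class fractions of the training samples in the leaf that the input reaches. A value such as 0.984375 equals 63/64, which is consistent with a leaf where 63 of 64 training passengers survived (the actual leaf sizes are not shown in this notebook). The sketch below demonstrates the mechanism on an invented one-feature dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training set with a single feature: gender code (0 or 1).
X_toy = np.array([[0], [0], [0], [0], [1], [1], [1], [1]])
y_toy = np.array([1, 1, 1, 0, 0, 0, 0, 1])

# A depth-1 tree makes one split, leaving one leaf per gender code.
tree = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_toy, y_toy)

# Column 1 of predict_proba is the probability of class 1 (survived).
proba_female = tree.predict_proba([[0]])[0, 1]
proba_male = tree.predict_proba([[1]])[0, 1]

# Fraction of survivors among the training samples with each gender code.
leaf_fraction_female = y_toy[X_toy[:, 0] == 0].mean()
leaf_fraction_male = y_toy[X_toy[:, 0] == 1].mean()

print(proba_female, leaf_fraction_female)
print(proba_male, leaf_fraction_male)
```

Because the probabilities are leaf fractions, a shallow tree can only output a handful of distinct values, and a leaf with few samples can produce extreme probabilities.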