PIMA Indian Diabetes Prediction

Building a PIMA Indian Diabetes Prediction Model

Loading the data

In [13]:

import pandas as pd

PIMA_url = "https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/diabetes.csv"

PIMA = pd.read_csv(PIMA_url)
PIMA.head()

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

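Column descriptions

Pregnancies: number of pregnancies
Glucose: blood glucose level (mg/dL)
BloodPressure: diastolic blood pressure (mmHg)
SkinThickness: triceps skinfold thickness (mm)
Insulin: insulin level (mu U/ml)
BMI: body mass index
DiabetesPedigreeFunction: diabetes pedigree function (a family-history score reflecting genetic risk)
Age: age (years)
Outcome: diabetes status (0: no diabetes, 1: diabetes)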

In [14]:

PIMA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Converting the data to float

In [15]:

PIMA = PIMA.astype(float)
PIMA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    float64
 1   Glucose                   768 non-null    float64
 2   BloodPressure             768 non-null    float64
 3   SkinThickness             768 non-null    float64
 4   Insulin                   768 non-null    float64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    float64
 8   Outcome                   768 non-null    float64
dtypes: float64(9)
memory usage: 54.1 KB

Checking correlations

In [16]:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 10))
sns.heatmap(PIMA.corr(), cmap="YlGnBu")
plt.show()

In the Outcome row of the heatmap, the Glucose and BMI cells are the darkest, which suggests these two features are related to diabetes.
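Reading shades off a heatmap is approximate, so it can help to list the correlations with Outcome as numbers. The sketch below is only an illustration: it uses the five rows shown by `PIMA.head()` above (three columns) rather than the full dataset, and the names `toy` and `corr_with_outcome` are made up here. On the real data the same expression would be applied to `PIMA`.

```python
import pandas as pd

# Five rows copied from the PIMA.head() output above (subset of columns).
# Far too few rows for real conclusions; this only shows the mechanics.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Correlation of every feature with Outcome, strongest (by absolute value) first
corr_with_outcome = (
    toy.corr()["Outcome"]
    .drop("Outcome")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_outcome)
```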

Checking for zero values in the data

In [17]:

(PIMA == 0).astype(int).sum()

Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64

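A value of 0 is not physiologically plausible for Glucose, BloodPressure, SkinThickness, or BMI, so zeros in these columns are treated as missing and replaced with the column mean. Note that the mean used here is computed with the zeros still included.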

In [18]:

zero_features = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI']

for feature in zero_features:
    PIMA[feature] = PIMA[feature].replace(0, PIMA[feature].mean())

(PIMA == 0).astype(int).sum()

Pregnancies                 111
Glucose                       0
BloodPressure                 0
SkinThickness                 0
Insulin                     374
BMI                           0
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64
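The loop above takes each column's mean while the zeros are still in it, which pulls the fill value down. One possible alternative, sketched here under assumptions, is to mark zeros as `NaN` first so the mean is taken over observed values only. The frame `toy` and its values are invented for illustration and are not the notebook's data; the notebook's results were produced with the original loop.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) where 0 stands for a missing measurement
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, 0.0, 89.0],
    "BMI": [33.6, 0.0, 23.3, 28.1],
})
zero_features = ["Glucose", "BMI"]

imputed = toy.copy()
for feature in zero_features:
    # Turn zeros into NaN so they do not contribute to the mean
    as_missing = imputed[feature].replace(0, np.nan)
    # Series.mean() skips NaN by default, so this is the mean of non-zero values
    imputed[feature] = as_missing.fillna(as_missing.mean())

print(imputed)
print((imputed == 0).sum())
```

When a train/test split follows, computing the fill value on the training portion only would also avoid leaking test-set information into the imputation.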

Using Logistic Regression

In [19]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = PIMA.drop('Outcome', axis=1)
y = PIMA['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

estimators = [
    ("scaler", StandardScaler()),
    ("logistic_regression", LogisticRegression(max_iter=1000)),
]


pipe = Pipeline(estimators)
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)

In [20]:

from sklearn.metrics import (
    accuracy_score,
    recall_score,
    precision_score,
    roc_auc_score,
    f1_score
)

print("Accuracy: ", accuracy_score(y_test, pred))
print("Recall: ", recall_score(y_test, pred))
print("Precision: ", precision_score(y_test, pred))
# Note: ROC AUC is computed here from hard 0/1 predictions, not predicted probabilities
print("ROC AUC: ", roc_auc_score(y_test, pred))
print("F1 Score: ", f1_score(y_test, pred))

Accuracy:  0.7662337662337663
Recall:  0.6545454545454545
Precision:  0.6792452830188679
ROC AUC:  0.7414141414141414
F1 Score:  0.6666666666666666
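The ROC AUC above is computed from hard 0/1 predictions, which collapses the ROC curve to a single threshold. ROC AUC is more commonly computed from scores such as the positive-class probability returned by `predict_proba`. The sketch below shows both variants side by side. It runs on synthetic data from `make_classification`, not on the PIMA data, and names such as `demo_pipe` and `proba_pos` are assumptions made for this example, so the printed numbers say nothing about the notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 8 features (NOT the PIMA dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same pipeline structure as in the notebook
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic_regression", LogisticRegression(max_iter=1000)),
])
demo_pipe.fit(X_tr, y_tr)

# Probability of the positive class is column 1 of predict_proba
proba_pos = demo_pipe.predict_proba(X_te)[:, 1]

auc_from_labels = roc_auc_score(y_te, demo_pipe.predict(X_te))
auc_from_proba = roc_auc_score(y_te, proba_pos)
print("ROC AUC (hard labels):  ", auc_from_labels)
print("ROC AUC (probabilities):", auc_from_proba)
```

In the notebook itself, the equivalent call would be `roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])`.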

Checking the coefficients of the multivariable equation

In [22]:

# Access the model inside the Pipeline (step name: "logistic_regression")
model = pipe.named_steps["logistic_regression"]

# Get the coefficients together with the feature names
coeff = model.coef_[0]
feature_names = X_train.columns

# Collect them in a DataFrame for readability, sorted by absolute value
coeff_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coeff
}).sort_values('Coefficient', key=abs, ascending=False)

print("Coefficient of each feature:")
print(coeff_df)
print(f"\nIntercept: {model.intercept_[0]}")

Coefficient of each feature:
                    Feature  Coefficient
1                   Glucose     1.143266
5                       BMI     0.723541
7                       Age     0.376918
4                   Insulin    -0.241530
6  DiabetesPedigreeFunction     0.219849
0               Pregnancies     0.218681
2             BloodPressure    -0.178127
3             SkinThickness     0.050739

Intercept: -0.8750942655320281
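To make the "multivariable equation" concrete: the model's log-odds is the intercept plus the dot product of the coefficients with the standardized features, and the sigmoid of that value is the predicted probability. The sketch below rebuilds the probability by hand and compares it with `predict_proba`. It uses synthetic data from `make_classification` rather than the PIMA data, and names such as `demo_pipe` and `manual_proba` are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 8 features (NOT the PIMA dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic_regression", LogisticRegression(max_iter=1000)),
])
demo_pipe.fit(X_demo, y_demo)

scaler = demo_pipe.named_steps["scaler"]
model = demo_pipe.named_steps["logistic_regression"]

# Step 1: standardize the raw features exactly as the pipeline does
z = scaler.transform(X_demo)

# Step 2: linear part of the equation, intercept + sum(coef_i * z_i)
log_odds = model.intercept_[0] + z @ model.coef_[0]

# Step 3: the sigmoid maps log-odds to a probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-log_odds))

matches = np.allclose(manual_proba, demo_pipe.predict_proba(X_demo)[:, 1])
print("Manual equation matches predict_proba:", matches)
```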

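As expected, Glucose and BMI have the largest coefficients. Because the pipeline standardizes the features before fitting, the coefficient magnitudes can be compared with each other directly.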
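One way to read these coefficients is as odds ratios: exponentiating a coefficient gives the factor by which the odds of diabetes change when that feature rises by one standard deviation (the features were standardized), holding the others fixed. The sketch below applies `np.exp` to the coefficient values copied from the printed output above. The column name `OddsRatio` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the printed output above (fit on standardized features)
coeff_df = pd.DataFrame({
    "Feature": [
        "Glucose", "BMI", "Age", "Insulin",
        "DiabetesPedigreeFunction", "Pregnancies",
        "BloodPressure", "SkinThickness",
    ],
    "Coefficient": [
        1.143266, 0.723541, 0.376918, -0.241530,
        0.219849, 0.218681, -0.178127, 0.050739,
    ],
})

# exp(coefficient) = multiplicative change in odds per +1 standard deviation
coeff_df["OddsRatio"] = np.exp(coeff_df["Coefficient"])
print(coeff_df)
```

An odds ratio above 1 means the feature pushes the prediction toward diabetes, and a value below 1 pushes it away.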

# Visualize coefficients as bar graph
model = pipe.named_steps["logistic_regression"]
coeff = model.coef_[0]
feature_names = X_train.columns

# Sort ascending so the largest coefficient is drawn at the top of the horizontal bar chart
sorted_indices = sorted(range(len(coeff)), key=lambda i: coeff[i])
sorted_features = [feature_names[i] for i in sorted_indices]
sorted_coeff = [coeff[i] for i in sorted_indices]

# Assign colors based on coefficient value (positive: blue, negative: red)
colors = ['blue' if c > 0 else 'red' for c in sorted_coeff]

# Horizontal bar graph (y-axis: features, x-axis: coefficient values)
plt.figure(figsize=(10, 6))
plt.barh(sorted_features, sorted_coeff, color=colors)
plt.xlabel('Coefficient', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Logistic Regression Coefficients', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()