addinedu

PIMA Indian Diabetes Prediction

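Train a Logistic Regression model on the PIMA Indian diabetes dataset. Covers data preprocessing, correlation analysis, and feature-importance analysis by visualizing the model coefficients.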

Machine Learning
Python
In [13]:
import pandas as pd

PIMA_url = "https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/diabetes.csv"

PIMA = pd.read_csv(PIMA_url)
PIMA.head()
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  

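Column descriptions

  • Pregnancies: number of pregnancies
  • Glucose: blood glucose level (mg/dL)
  • BloodPressure: diastolic blood pressure (mmHg)
  • SkinThickness: triceps skinfold thickness (mm)
  • Insulin: insulin level (mu U/ml)
  • BMI: body mass index
  • DiabetesPedigreeFunction: diabetes pedigree function (a score reflecting family history, i.e. genetic risk)
  • Age: age in years
  • Outcome: diabetes status (0: no diabetes, 1: diabetes)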
In [14]:
PIMA.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Convert all columns to float

In [15]:
PIMA = PIMA.astype(float)
PIMA.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    float64
 1   Glucose                   768 non-null    float64
 2   BloodPressure             768 non-null    float64
 3   SkinThickness             768 non-null    float64
 4   Insulin                   768 non-null    float64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    float64
 8   Outcome                   768 non-null    float64
dtypes: float64(9)
memory usage: 54.1 KB

Check the correlations

In [16]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 10))
sns.heatmap(PIMA.corr(), cmap="YlGnBu")
plt.show()
[Figure: heatmap of the pairwise correlations between the PIMA columns]

In the Outcome row, Glucose and BMI have the darkest cells, so they appear to be the features most strongly associated with diabetes.
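Heatmap colors can be hard to compare by eye, so it may help to rank the correlations numerically. The following is a minimal sketch, assuming a DataFrame with an `Outcome` column like `PIMA` above. The helper name `rank_by_target_corr` and the small `demo` frame are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def rank_by_target_corr(df: pd.DataFrame, target: str = "Outcome") -> pd.Series:
    """Return the Pearson correlation of every other column with `target`, strongest first."""
    corr = df.corr()[target].drop(target)
    # Sort by absolute value so strong negative correlations are not buried
    return corr.sort_values(key=abs, ascending=False)

# Illustrative toy data (NOT the PIMA dataset)
demo = pd.DataFrame({
    "Glucose": [85.0, 89.0, 137.0, 148.0, 183.0, 90.0],
    "Age":     [31.0, 21.0, 33.0, 50.0, 32.0, 45.0],
    "Outcome": [0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
})
ranked = rank_by_target_corr(demo)
print(ranked)
```

In the notebook, the same helper could be called as `rank_by_target_corr(PIMA)` to see the values behind the heatmap colors.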

Count the zero values in each column

In [17]:
(PIMA == 0).astype(int).sum()
Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64

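Replace the zeros in Glucose, BloodPressure, SkinThickness, and BMI with the column mean, since a value of zero is not physiologically plausible for these measurements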

In [18]:
zero_features = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI']

for feature in zero_features:
    PIMA[feature] = PIMA[feature].replace(0, PIMA[feature].mean())

(PIMA == 0).astype(int).sum()
Pregnancies                 111
Glucose                       0
BloodPressure                 0
SkinThickness                 0
Insulin                     374
BMI                           0
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64
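Note that the loop above computes each mean with the zero entries still included, which pulls the fill value down. One possible alternative, sketched here under the assumption that every zero in these columns stands for a missing measurement, is to mark zeros as `NaN` first and fill with the mean of the remaining values. The helper name `impute_zeros_with_nonzero_mean` and the toy frame are hypothetical; this is not the method used in the notebook, and applying it would change the metrics reported later.

```python
import numpy as np
import pandas as pd

def impute_zeros_with_nonzero_mean(df, columns):
    """Return a copy of `df` where zeros in `columns` are replaced by the mean of the non-zero values."""
    out = df.copy()
    for col in columns:
        as_missing = out[col].replace(0, np.nan)         # treat 0 as missing
        out[col] = as_missing.fillna(as_missing.mean())  # mean() skips NaN by default
    return out

# Illustrative toy data (NOT the PIMA dataset)
demo = pd.DataFrame({"Glucose": [100.0, 0.0, 140.0], "BMI": [30.0, 20.0, 0.0]})
filled = impute_zeros_with_nonzero_mean(demo, ["Glucose", "BMI"])
print(filled)
```

In a stricter setup the fill value would be computed on the training split only, so that no information from the test rows leaks into preprocessing.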

Train a Logistic Regression model

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = PIMA.drop('Outcome', axis=1)
y = PIMA['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

estimators = [
    ("scaler", StandardScaler()),
    ("logistic_regression", LogisticRegression(max_iter=1000)),
]


pipe = Pipeline(estimators)
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
In [20]:
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    precision_score,
    roc_auc_score,
    f1_score
)

print("Accuracy: ", accuracy_score(y_test, pred))
print("Recall: ", recall_score(y_test, pred))
print("Precision: ", precision_score(y_test, pred))
print("ROC AUC: ", roc_auc_score(y_test, pred))
print("F1 Score: ", f1_score(y_test, pred))
Accuracy:  0.7662337662337663
Recall:  0.6545454545454545
Precision:  0.6792452830188679
ROC AUC:  0.7414141414141414
F1 Score:  0.6666666666666666
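The `roc_auc_score` call above receives hard 0/1 predictions, so the reported value reflects a single decision threshold. ROC AUC is normally computed from continuous scores such as predicted probabilities. The sketch below shows the difference on synthetic stand-in data from `make_classification` (an assumption made so the example runs without downloading the CSV); the `demo_` and `Xd_`/`yd_` names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the PIMA dataset): 8 features, binary target
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same pipeline structure as in the notebook
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic_regression", LogisticRegression(max_iter=1000)),
])
demo_pipe.fit(Xd_train, yd_train)

# Probability of the positive class (column 1 of predict_proba)
proba = demo_pipe.predict_proba(Xd_test)[:, 1]

auc_from_labels = roc_auc_score(yd_test, demo_pipe.predict(Xd_test))
auc_from_proba = roc_auc_score(yd_test, proba)
print("ROC AUC (hard labels):  ", auc_from_labels)
print("ROC AUC (probabilities):", auc_from_proba)
```

In the notebook itself, the equivalent call would be `roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])`.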

Inspect the coefficients of the fitted multivariate equation

In [22]:
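# Access the model inside the Pipeline (step name: "logistic_regression")
model = pipe.named_steps["logistic_regression"]

# Pair each coefficient with its feature name
coeff = model.coef_[0]
feature_names = X_train.columns

# Collect them in a DataFrame, sorted by absolute value, for easier reading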
coeff_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coeff
}).sort_values('Coefficient', key=abs, ascending=False)

print("Coefficient of each feature:")
print(coeff_df)
print(f"\nIntercept: {model.intercept_[0]}")
Coefficient of each feature:
                    Feature  Coefficient
1                   Glucose     1.143266
5                       BMI     0.723541
7                       Age     0.376918
4                   Insulin    -0.241530
6  DiabetesPedigreeFunction     0.219849
0               Pregnancies     0.218681
2             BloodPressure    -0.178127
3             SkinThickness     0.050739

Intercept: -0.8750942655320281

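As expected, Glucose and BMI have the largest coefficients. Because the features were standardized inside the pipeline, the coefficient magnitudes can be compared directly.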
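Since the pipeline standardizes every feature, each coefficient is the change in log-odds for a one-standard-deviation increase in that feature. As a rough sketch of how to read the numbers, the snippet below converts the coefficients printed above into odds ratios and turns the intercept into the predicted probability for a patient whose features all sit at the training mean. The values are copied from the notebook output; the `sigmoid` helper and the variable names are illustrative.

```python
import math

# Coefficients copied from the notebook output (features were standardized)
coefficients = {
    "Glucose": 1.143266,
    "BMI": 0.723541,
    "Age": 0.376918,
    "Insulin": -0.241530,
    "DiabetesPedigreeFunction": 0.219849,
    "Pregnancies": 0.218681,
    "BloodPressure": -0.178127,
    "SkinThickness": 0.050739,
}
intercept = -0.8750942655320281

def sigmoid(z: float) -> float:
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# exp(coefficient) = multiplicative change in the odds per +1 standard deviation
odds_ratios = {name: math.exp(c) for name, c in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name:<26} odds ratio per +1 SD: {ratio:.3f}")

# With every standardized feature at 0, the log-odds equal the intercept
baseline_probability = sigmoid(intercept)
print(f"Predicted probability at the training mean: {baseline_probability:.3f}")
```

An odds ratio above 1 means the feature raises the predicted odds of diabetes, and a value below 1 (Insulin and BloodPressure here) means it lowers them.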

# Visualize coefficients as bar graph
model = pipe.named_steps["logistic_regression"]
coeff = model.coef_[0]
feature_names = X_train.columns

# Sort ascending by coefficient; barh draws the first item at the bottom, so the largest ends up on top
sorted_indices = sorted(range(len(coeff)), key=lambda i: coeff[i])
sorted_features = [feature_names[i] for i in sorted_indices]
sorted_coeff = [coeff[i] for i in sorted_indices]

# Assign colors based on coefficient value (positive: blue, negative: red)
colors = ['blue' if c > 0 else 'red' for c in sorted_coeff]

# Horizontal bar graph (y-axis: features, x-axis: coefficient values)
plt.figure(figsize=(10, 6))
plt.barh(sorted_features, sorted_coeff, color=colors)
plt.xlabel('Coefficient', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Logistic Regression Coefficients', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
[Figure: horizontal bar chart of the logistic regression coefficients, positive in blue and negative in red]