In [13]:
import pandas as pd
PIMA_url = "https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/diabetes.csv"
PIMA = pd.read_csv(PIMA_url)
PIMA.head()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
Column descriptions
- Pregnancies: number of pregnancies
- Glucose: blood glucose level (mg/dL)
- BloodPressure: blood pressure (mmHg)
- SkinThickness: skin thickness (mm)
- Insulin: insulin level (mu U/ml)
- BMI: body mass index
- DiabetesPedigreeFunction: diabetes pedigree function (family history, a genetic factor)
- Age: age
- Outcome: diabetes status (0: normal, 1: diabetes)
In [14]:
PIMA.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Convert the data to float
In [15]:
PIMA = PIMA.astype(float)
PIMA.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    float64
 1   Glucose                   768 non-null    float64
 2   BloodPressure             768 non-null    float64
 3   SkinThickness             768 non-null    float64
 4   Insulin                   768 non-null    float64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    float64
 8   Outcome                   768 non-null    float64
dtypes: float64(9)
memory usage: 54.1 KB
Check the correlations
In [16]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 10))
sns.heatmap(PIMA.corr(), cmap="YlGnBu")
plt.show()
In the Outcome row, the Glucose and BMI cells are the darkest, which suggests that these two features are related to diabetes.
Count the zero values in each column
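Shades on a heatmap are hard to compare precisely, so a numeric ranking can back up the visual impression. The following is a minimal sketch, assuming a DataFrame that contains an `Outcome` column. The helper name `rank_outcome_correlations` and the small demo frame are made up for illustration and are not the PIMA data.

```python
import pandas as pd


def rank_outcome_correlations(df: pd.DataFrame, target: str = "Outcome") -> pd.Series:
    """Return each feature's Pearson correlation with `target`, strongest first."""
    corr = df.corr()[target].drop(target)
    # Sort by absolute value so strong negative correlations are not buried.
    return corr.sort_values(key=abs, ascending=False)


# Hypothetical toy data (NOT the PIMA dataset), only to show the call.
demo = pd.DataFrame({
    "Glucose": [85.0, 89.0, 137.0, 148.0, 183.0, 99.0],
    "Age": [31.0, 21.0, 33.0, 50.0, 32.0, 45.0],
    "Outcome": [0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
})
ranking = rank_outcome_correlations(demo)
print(ranking)
```

In the notebook itself the call would be `rank_outcome_correlations(PIMA)`. Passing `annot=True` to `sns.heatmap` is another way to see the numbers directly on the plot.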
In [17]:
(PIMA == 0).astype(int).sum()
Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64
Replace the zeros in Glucose, BloodPressure, SkinThickness, and BMI with the column mean
In [18]:
zero_features = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI']
for feature in zero_features:
    PIMA[feature] = PIMA[feature].replace(0, PIMA[feature].mean())
(PIMA == 0).astype(int).sum()
Pregnancies                 111
Glucose                       0
BloodPressure                 0
SkinThickness                 0
Insulin                     374
BMI                           0
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64
Using Logistic Regression
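One caveat with the replacement above: `PIMA[feature].mean()` is computed while the zeros are still in the column, so the fill value is pulled toward zero. A possible alternative, sketched under the assumption that a zero in these columns really means "missing", is to mark the zeros as NaN first so the mean is taken over the observed values only. The helper name `impute_zeros_with_mean` and the demo values are illustrative.

```python
import numpy as np
import pandas as pd


def impute_zeros_with_mean(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Treat zeros in `columns` as missing and fill them with the mean of the non-zero values."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in columns:
        observed = out[col].replace(0, np.nan)       # zeros become NaN
        out[col] = observed.fillna(observed.mean())  # mean() skips NaN by default
    return out


# Hypothetical toy data (NOT the PIMA dataset).
demo = pd.DataFrame({"Glucose": [100.0, 0.0, 140.0], "BMI": [30.0, 20.0, 0.0]})
filled = impute_zeros_with_mean(demo, ["Glucose", "BMI"])
print(filled)
```

In the notebook this would be called as `impute_zeros_with_mean(PIMA, zero_features)`. Computing the fill values from the training split only would additionally keep test-set information out of the preprocessing.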
In [19]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
X = PIMA.drop('Outcome', axis=1)
y = PIMA['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
estimators = [
    ("scaler", StandardScaler()),
    ("logistic_regression", LogisticRegression(max_iter=1000)),
]
pipe = Pipeline(estimators)
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
In [20]:
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    precision_score,
    roc_auc_score,
    f1_score,
)
print("Accuracy: ", accuracy_score(y_test, pred))
print("Recall: ", recall_score(y_test, pred))
print("Precision: ", precision_score(y_test, pred))
print("ROC AUC: ", roc_auc_score(y_test, pred))
print("F1 Score: ", f1_score(y_test, pred))
Accuracy:  0.7662337662337663
Recall:  0.6545454545454545
Precision:  0.6792452830188679
ROC AUC:  0.7414141414141414
F1 Score:  0.6666666666666666
Inspect each coefficient of the multivariable equation
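The `roc_auc_score` call above is given hard 0/1 predictions, which reduces the ROC curve to a single threshold. The metric is more commonly computed from predicted probabilities or scores. Here is a sketch of that variant, using synthetic data from `make_classification` as a stand-in (an assumption, so the numbers will not match the PIMA results):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the PIMA features (8 columns, binary target).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic_regression", LogisticRegression(max_iter=1000)),
])
demo_pipe.fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of the positive class.
proba = demo_pipe.predict_proba(X_te)[:, 1]
auc_from_labels = roc_auc_score(y_te, demo_pipe.predict(X_te))
auc_from_proba = roc_auc_score(y_te, proba)
print("ROC AUC (hard labels):  ", auc_from_labels)
print("ROC AUC (probabilities):", auc_from_proba)
```

With the notebook's own objects, the equivalent call would be `roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])`.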
In [22]:
# Access the model inside the Pipeline (step name: "logistic_regression")
model = pipe.named_steps["logistic_regression"]
# Pair each coefficient with its feature name
coeff = model.coef_[0]
feature_names = X_train.columns
# Collect them in a DataFrame, sorted by absolute value, for easier reading
coeff_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coeff
}).sort_values('Coefficient', key=abs, ascending=False)
print("Coefficient of each feature:")
print(coeff_df)
print(f"\nIntercept: {model.intercept_[0]}")
Coefficient of each feature:
Feature Coefficient
1 Glucose 1.143266
5 BMI 0.723541
7 Age 0.376918
4 Insulin -0.241530
6 DiabetesPedigreeFunction 0.219849
0 Pregnancies 0.218681
2 BloodPressure -0.178127
3 SkinThickness 0.050739

Intercept: -0.8750942655320281
As expected, Glucose and BMI have the largest coefficients.
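Because the pipeline standardizes the features before fitting, each coefficient is the change in log-odds per one standard deviation of that feature, which is what makes the magnitudes comparable. Exponentiating a coefficient turns it into an odds ratio. A small sketch using the coefficient values printed above (copied by hand, so treat the results as approximate):

```python
import numpy as np
import pandas as pd

# Coefficients copied from the output above (fitted on standardized features).
coefficients = pd.Series({
    "Glucose": 1.143266,
    "BMI": 0.723541,
    "Age": 0.376918,
    "Insulin": -0.241530,
    "DiabetesPedigreeFunction": 0.219849,
    "Pregnancies": 0.218681,
    "BloodPressure": -0.178127,
    "SkinThickness": 0.050739,
})

# exp(coef) is the multiplicative change in the odds of Outcome == 1
# for a one-standard-deviation increase in the feature, others held fixed.
odds_ratios = np.exp(coefficients)
print(odds_ratios.round(3))
```

An odds ratio above 1 raises the predicted odds of diabetes and a value below 1 lowers them. For Glucose the ratio comes out at roughly 3.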
# Visualize coefficients as bar graph
model = pipe.named_steps["logistic_regression"]
coeff = model.coef_[0]
feature_names = X_train.columns
# Sort ascending: barh draws from the bottom up, so the largest coefficient ends up at the top
sorted_indices = sorted(range(len(coeff)), key=lambda i: coeff[i])
sorted_features = [feature_names[i] for i in sorted_indices]
sorted_coeff = [coeff[i] for i in sorted_indices]
# Assign colors based on coefficient value (positive: blue, negative: red)
colors = ['blue' if c > 0 else 'red' for c in sorted_coeff]
# Horizontal bar graph (y-axis: features, x-axis: coefficient values)
plt.figure(figsize=(10, 6))
plt.barh(sorted_features, sorted_coeff, color=colors)
plt.xlabel('Coefficient', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Logistic Regression Coefficients', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()