Wine Classifier

Loading the data

import pandas as pd
import matplotlib.pyplot as plt

red_url = "https://raw.githubusercontent.com/PinkWink/ML_tutorial/refs/heads/master/dataset/winequality-red.csv"
white_url = "https://raw.githubusercontent.com/PinkWink/ML_tutorial/refs/heads/master/dataset/winequality-white.csv"

red_wine = pd.read_csv(red_url, sep=";")
white_wine = pd.read_csv(white_url, sep=";")

In [64]:

red_wine.head()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8	6
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5

In [65]:

white_wine.head()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.0	0.27	0.36	20.7	0.045	45.0	170.0	1.0010	3.00	0.45	8.8	6
1	6.3	0.30	0.34	1.6	0.049	14.0	132.0	0.9940	3.30	0.49	9.5	6
2	8.1	0.28	0.40	6.9	0.050	30.0	97.0	0.9951	3.26	0.44	10.1	6
3	7.2	0.23	0.32	8.5	0.058	47.0	186.0	0.9956	3.19	0.40	9.9	6
4	7.2	0.23	0.32	8.5	0.058	47.0	186.0	0.9956	3.19	0.40	9.9	6

The red wine and white wine data have the same structure: both share the same 12 columns.

In [66]:

white_wine.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')
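
Rather than comparing the two `head()` outputs by eye, the claim that both frames share one structure can be checked in code. The sketch below uses two tiny stand-in frames (`red_demo` and `white_demo` are hypothetical names, not the notebook's data); with the real data one would pass `red_wine` and `white_wine` instead.

```python
import pandas as pd

# Tiny stand-in frames; the real red_wine / white_wine frames have 12 columns each.
red_demo = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
white_demo = pd.DataFrame({"alcohol": [8.8, 9.5], "quality": [6, 6]})

# Same column names in the same order?
same_columns = list(red_demo.columns) == list(white_demo.columns)
# Same dtype for every column?
same_dtypes = bool((red_demo.dtypes == white_demo.dtypes).all())

print(same_columns, same_dtypes)
```

If both checks pass, `pd.concat` can stack the frames without introducing NaN-filled columns.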

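Column descriptions

The columns of the wine dataset mean the following:

fixed acidity: the non-volatile acids in the wine, a main indicator of its acidity. Affects sourness and freshness
volatile acidity: acids that evaporate readily, mostly acetic acid. Too much gives the wine a vinegar-like smell
citric acid: an acid that adds freshness and sourness to the wine
residual sugar: the sugar left after fermentation, which indicates how sweet the wine is
chlorides: the amount of salt in the wine
free sulfur dioxide: the free form of sulfur dioxide, used as an antimicrobial and antioxidant
total sulfur dioxide: the sum of free and bound sulfur dioxide
density: the density of the wine, determined mainly by its sugar and alcohol content
pH: a measure of the wine's acidity (0-14 scale, lower means more acidic)
sulphates: an additive used as an antimicrobial that helps preserve the wine
alcohol: the alcohol content of the wine (%)
quality: the quality score of the wine (nominally a 0-10 scale)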

Combining the two datasets into one

In [67]:

red_wine['color'] = 1.
white_wine['color'] = 0.

wine = pd.concat([red_wine, white_wine])
wine.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6497 entries, 0 to 4897
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         6497 non-null   float64
 1   volatile acidity      6497 non-null   float64
 2   citric acid           6497 non-null   float64
 3   residual sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free sulfur dioxide   6497 non-null   float64
 6   total sulfur dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  color                 6497 non-null   float64
dtypes: float64(12), int64(1)
memory usage: 710.6 KB
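
The `info()` output reports `Index: 6497 entries, 0 to 4897`: `pd.concat` keeps each frame's original row labels, so the labels of the red rows appear a second time among the white rows. That is harmless for the steps in this notebook, but label-based lookups with `.loc` would return two rows. A small sketch with stand-in frames (`red_demo` and `white_demo` are hypothetical) shows the difference that `ignore_index=True` makes:

```python
import pandas as pd

red_demo = pd.DataFrame({"alcohol": [9.4, 9.8, 9.8], "color": 1.0})
white_demo = pd.DataFrame({"alcohol": [8.8, 9.5], "color": 0.0})

stacked = pd.concat([red_demo, white_demo])                        # labels: 0, 1, 2, 0, 1
renumbered = pd.concat([red_demo, white_demo], ignore_index=True)  # labels: 0, 1, 2, 3, 4

print(stacked.index.is_unique, renumbered.index.is_unique)
print(len(stacked.loc[0]))  # label 0 matches one red row and one white row
```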

Checking the wine quality grades

In [68]:

wine["quality"].unique()

array([5, 6, 7, 4, 8, 3, 9])

Plotting a histogram of wine quality

In [69]:

import plotly.express as px

fig = px.histogram(wine, x="quality")
fig.show()

Let's plot the quality histogram separately for red and white wines

In [70]:

fig = px.histogram(wine, x="quality", color="color")
fig.show()

Building a red wine / white wine classifier

In [71]:

from sklearn.model_selection import train_test_split
import numpy as np

# Split the features and the target
# X: the model's inputs (every column except color)
# y: the output the model must predict (color: red wine = 1, white wine = 0)
X = wine.drop(['color'], axis = 1) # axis = 1 drops columns, axis = 0 drops rows
y = wine['color']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

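# Check the class distribution of the training data
# return_counts=True: also return the count of each class (0 = white wine, 1 = red wine)
# Shows how imbalanced the classes are, which can affect model training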
np.unique(y_train, return_counts = True)

(array([0., 1.]), array([3918, 1279]))
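
The counts above show a clear imbalance, which matters when reading accuracy later: a model that always answers "white" is already right most of the time. The arithmetic below is a small sketch that uses only the two counts printed above.

```python
# Class counts copied from the np.unique output above (0 = white, 1 = red).
n_white, n_red = 3918, 1279
n_total = n_white + n_red

red_share = n_red / n_total
# Accuracy of a trivial model that always predicts the majority class (white).
majority_baseline = max(n_white, n_red) / n_total

print(f"training samples: {n_total}")
print(f"red share: {red_share:.3f}, majority baseline: {majority_baseline:.3f}")
```

Any classifier trained on this split should be judged against that baseline of roughly 75% rather than against 50%.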

Check that the training and test sets, split with stratification on wine color, have similarly shaped quality distributions

In [72]:



import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Histogram(x=X_train['quality'], name="Train"))
fig.add_trace(go.Histogram(x=X_test['quality'], name="Test"))
fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.75)
fig.show()

Training a decision tree

In [73]:

from sklearn.tree import DecisionTreeClassifier

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
wine_tree.fit(X_train, y_train);

In [74]:

from sklearn.metrics import accuracy_score

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

print("Train accuracy: ", accuracy_score(y_train, y_pred_tr))
print("Test accuracy: ", accuracy_score(y_test, y_pred_test))

Train accuracy:  0.9545891860688859
Test accuracy:  0.9584615384615385

Preprocessing: do MinMaxScaler and StandardScaler affect a decision tree?

Draw box plots for a few of the wine features

In [75]:

fig = go.Figure()
fig.add_trace(go.Box(y=X['fixed acidity'], name='fixed acidity'))
fig.add_trace(go.Box(y=X['chlorides'], name='chlorides'))
fig.add_trace(go.Box(y=X['quality'], name='quality'))
fig.show()

Applying the scalers

In [76]:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

MMS = MinMaxScaler()
SS = StandardScaler()

MMS.fit(X)
SS.fit(X)

X_ss = SS.transform(X)
X_mms = MMS.transform(X)

X_ss_pd = pd.DataFrame(X_ss, columns=X.columns)
X_mms_pd = pd.DataFrame(X_mms, columns=X.columns)

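In fact, this kind of preprocessing makes no difference to a decision tree: each split compares a single feature against a threshold, so rescaling a feature only moves the threshold. Scalers matter mainly for models that optimize a cost function, for example by gradient descent, or that rely on distances between samples.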
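
The cell above fits both scalers on all of `X`, before any train/test split. For a decision tree that is harmless, but for models where scaling does matter the usual practice is to fit the scaler on the training split only, so that no test-set statistics leak into training. A minimal sketch on synthetic data (`X_demo` and `y_demo` are stand-ins, not the wine data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 2))  # synthetic stand-in for X
y_demo = (X_demo[:, 0] > 10.0).astype(float)             # synthetic binary target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler = StandardScaler().fit(X_tr)  # statistics come from the training split only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)      # the test split reuses the training statistics

print(X_tr_s.mean(axis=0).round(6), X_te_s.mean(axis=0).round(3))
```

The training columns come out with mean 0 by construction, while the test columns are only close to 0, which is the expected behaviour.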

Inspecting MinMaxScaler

In [77]:

fig = go.Figure()
fig.add_trace(go.Box(y=X_mms_pd['fixed acidity'], name='fixed acidity'))
fig.add_trace(go.Box(y=X_mms_pd['chlorides'], name='chlorides'))
fig.add_trace(go.Box(y=X_mms_pd['quality'], name='quality'))
fig.show()

The minimum and maximum of each feature are now 0 and 1.

Inspecting StandardScaler

In [78]:

fig = go.Figure()
fig.add_trace(go.Box(y=X_ss_pd['fixed acidity'], name='fixed acidity'))
fig.add_trace(go.Box(y=X_ss_pd['chlorides'], name='chlorides'))
fig.add_trace(go.Box(y=X_ss_pd['quality'], name='quality'))
fig.show()

Each feature now has mean 0 and standard deviation 1.

Training with MinMaxScaler applied
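
Both transformations are simple per-column formulas: min-max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / std with the population standard deviation. The sketch below applies them by hand to the five `fixed acidity` values from the `red_wine.head()` output and compares the result with the scikit-learn scalers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Five 'fixed acidity' values taken from the red_wine.head() output above.
x = np.array([[7.4], [7.8], [7.8], [11.2], [7.4]])

minmax_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
zscore_manual = (x - x.mean(axis=0)) / x.std(axis=0)  # population std (ddof=0)

# The hand-written formulas agree with the library implementations.
assert np.allclose(minmax_manual, MinMaxScaler().fit_transform(x))
assert np.allclose(zscore_manual, StandardScaler().fit_transform(x))

print(minmax_manual.ravel().round(3))
print(zscore_manual.ravel().round(3))
```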

In [79]:

X_train, X_test, y_train, y_test = train_test_split(X_mms_pd, y, test_size=0.2, random_state=42, stratify=y)
wine_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
wine_tree.fit(X_train, y_train)

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

print("Train accuracy: ", accuracy_score(y_train, y_pred_tr))
print("Test accuracy: ", accuracy_score(y_test, y_pred_test))

Train accuracy:  0.9545891860688859
Test accuracy:  0.9584615384615385

The accuracies are identical to the unscaled run, which confirms that this scaling has no effect on the decision tree.

Training with StandardScaler applied

In [80]:

X_train, X_test, y_train, y_test = train_test_split(X_ss_pd, y, test_size=0.2, random_state=42, stratify=y)
wine_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
wine_tree.fit(X_train, y_train)

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

print("Train accuracy: ", accuracy_score(y_train, y_pred_tr))
print("Test accuracy: ", accuracy_score(y_test, y_pred_test))

Train accuracy:  0.9545891860688859
Test accuracy:  0.9584615384615385
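
The three runs give the same accuracy because min-max scaling and standardization preserve the order of the values within each feature, and a tree split depends only on that order. The sketch below repeats the comparison on synthetic data from `make_classification` (an assumption made for illustration, not the wine data) and compares the predictions directly instead of the accuracy figures.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)

def fit_predict(features):
    """Fit a depth-2 tree on the given features and predict the same rows."""
    model = DecisionTreeClassifier(max_depth=2, random_state=42)
    return model.fit(features, y_demo).predict(features)

pred_raw = fit_predict(X_demo)
pred_mms = fit_predict(MinMaxScaler().fit_transform(X_demo))
pred_ss = fit_predict(StandardScaler().fit_transform(X_demo))

# Share of rows on which the scaled runs agree with the unscaled run.
print((pred_raw == pred_mms).mean(), (pred_raw == pred_ss).mean())
```

One would expect full agreement here; small differences could only arise from floating-point ties between candidate splits.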

How does the decision tree tell white wine from red wine?

In [95]:

import matplotlib.pyplot as plt
from sklearn import tree

fig = plt.figure(figsize=(15, 8))
_ = tree.plot_tree(wine_tree,
        feature_names=list(X_train.columns),
        class_names=['white', 'red'],
        rounded=True,
        filled=True)
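
When the figure is hard to read, the same split rules can be printed as text with `sklearn.tree.export_text`. This is a sketch on synthetic data; `demo_tree` and the feature names `f0` to `f3` are placeholders, and with the notebook's objects one would pass `wine_tree` and `list(X_train.columns)`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are placeholders, not wine columns.
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# Split nodes print as "<feature> <= <threshold>", leaves as "class: <label>".
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```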

In [82]:

dict(zip(X_train.columns, wine_tree.feature_importances_))

{'fixed acidity': np.float64(0.0),
 'volatile acidity': np.float64(0.0),
 'citric acid': np.float64(0.0),
 'residual sugar': np.float64(0.0),
 'chlorides': np.float64(0.2431047789775635),
 'free sulfur dioxide': np.float64(0.0),
 'total sulfur dioxide': np.float64(0.7568952210224364),
 'density': np.float64(0.0),
 'pH': np.float64(0.0),
 'sulphates': np.float64(0.0),
 'alcohol': np.float64(0.0),
 'quality': np.float64(0.0)}
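
Only `total sulfur dioxide` and `chlorides` have non-zero importance, so it is worth looking at how those two columns differ by color. As a rough illustration, the sketch below averages them over the ten rows shown in the two `head()` outputs earlier; ten rows are far too few for conclusions, and on the full data one would run the same `groupby` on `wine`.

```python
import pandas as pd

# Ten rows copied from the red_wine.head() and white_wine.head() outputs above
# (color: 1.0 = red, 0.0 = white).
sample = pd.DataFrame({
    "total sulfur dioxide": [34.0, 67.0, 54.0, 60.0, 34.0,
                             170.0, 132.0, 97.0, 186.0, 186.0],
    "chlorides": [0.076, 0.098, 0.092, 0.075, 0.076,
                  0.045, 0.049, 0.050, 0.058, 0.058],
    "color": [1.0] * 5 + [0.0] * 5,
})

# Mean of each feature per color.
means = sample.groupby("color")[["total sulfur dioxide", "chlorides"]].mean()
print(means)
```

In these rows the white wines have the higher total sulfur dioxide and the red wines the higher chlorides, which fits the two features the tree relies on.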

Interpreting the decision tree: how red and white wines are told apart

Let's use the trained tree to see which features separate red wine from white wine. If the cells are run in the order shown, the tree plotted above is the one fitted on the StandardScaler output, so the thresholds in the figure are in standardized units rather than the original ones.

1. Most important feature: total sulfur dioxide

feature_importances_ assigns about 0.757 to total sulfur dioxide

→ The sulfur dioxide level is the main signal the tree uses. The head() outputs already hint at why: the red rows range from 34 to 67, the white rows from 97 to 186.

2. Second feature: chlorides

feature_importances_ assigns the remaining 0.243 or so to chlorides

→ The head() rows differ here too: 0.075 to 0.098 for red versus 0.045 to 0.058 for white.

3. Features the tree does not use

Every other feature, including alcohol, volatile acidity and quality, has an importance of 0.0. With max_depth=2 the tree has at most three split nodes, and none of them uses these columns.

Conclusion: what separates red wine from white wine

total sulfur dioxide is the most important feature (importance about 0.757)
chlorides is the second (importance about 0.243)
These two features alone give about 95.5% training accuracy and 95.8% test accuracy
An importance of 0.0 does not mean a feature carries no information, only that this shallow tree did not need it

Building a classifier that finds good-tasting wine

Binarizing the quality column

In [85]:

wine['taste'] = [1. if grade > 5 else 0. for grade in wine['quality']]
wine.head()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality	color	taste
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5	1.0	0.0
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5	1.0	0.0
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5	1.0	0.0
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8	6	1.0	1.0
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5	1.0	0.0
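
The list comprehension above labels every wine with a grade above 5 as 1.0 and the rest as 0.0. The same labels can be produced with a vectorized comparison, which is the more idiomatic pandas form. A small sketch on a few sample grades (the `quality` Series here is a stand-in, not the full column):

```python
import pandas as pd

quality = pd.Series([5, 5, 5, 6, 5, 3, 9], name="quality")  # sample grades

# Loop form, as in the cell above.
taste_loop = pd.Series([1.0 if grade > 5 else 0.0 for grade in quality])
# Vectorized form: boolean comparison, then cast to float.
taste_vectorized = (quality > 5).astype(float)

print(taste_vectorized.tolist())
```

With the notebook's frame this would read `wine['taste'] = (wine['quality'] > 5).astype(float)`.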

In [88]:

X = wine.drop(['taste', 'quality'], axis = 1)
y = wine['taste']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
wine_tree.fit(X_train, y_train)

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

print("Train accuracy: ", accuracy_score(y_train, y_pred_tr))
print("Test accuracy: ", accuracy_score(y_test, y_pred_test))

Train accuracy:  0.7367712141620165
Test accuracy:  0.74
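
The cell above drops `quality` together with `taste` when building `X`. That matters because `taste` is computed directly from `quality`: if the column stayed in, a single split at the grade boundary would reproduce the label exactly and the model would learn nothing about the wine itself. A minimal sketch on synthetic data (`demo` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.uniform(8.0, 14.0, size=300),  # synthetic values
    "quality": rng.integers(3, 10, size=300),     # synthetic grades 3..9
})
demo["taste"] = (demo["quality"] > 5).astype(float)

def train_accuracy(columns):
    """Training accuracy of a depth-2 tree that sees only the given columns."""
    model = DecisionTreeClassifier(max_depth=2, random_state=42)
    model.fit(demo[columns], demo["taste"])
    return model.score(demo[columns], demo["taste"])

acc_leaky = train_accuracy(["alcohol", "quality"])  # the target is derivable from quality
acc_clean = train_accuracy(["alcohol"])
print(acc_leaky, acc_clean)
```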

Which wines can be called "good"?

In [93]:

fig = plt.figure(figsize=(15, 8))
_ = tree.plot_tree(wine_tree,
        feature_names=list(X_train.columns),
        class_names=['Soso', 'Good'],
        rounded=True,
        filled=True)

Interpreting the decision tree: characteristics of a good wine (Good)

Let's use the trained tree to see which features characterize a good-tasting wine.

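1. Most important feature: alcohol content (alcohol)

Root node: alcohol <= 10.25

All training data (root): 5,197 samples, 3,290 of them Good (about 63%)
alcohol ≤ 10.25%: 2,559 samples, 1,164 of them Good (about 45%)
alcohol > 10.25%: 2,638 samples, 2,126 of them Good (about 81%)

→ The higher the alcohol content, the more likely the wine is Good.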

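2. Low alcohol (≤ 10.25%)

Second split: volatile acidity <= 0.252

volatile acidity ≤ 0.252: Soso (1,118) > Good (579)
volatile acidity > 0.252: Good (585) > Soso (277)

→ As listed, among low-alcohol wines the group with higher volatile acidity is more often Good. High volatile acidity is usually considered a fault, so it is worth checking in the plotted tree which leaf sits on which side of this split.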

3. High alcohol (> 10.25%)

Second split: alcohol <= 11.525

alcohol 10.25 to 11.525%: Good (1,173) > Soso (433), about 73% Good
alcohol > 11.525%: Good (953) >> Soso (79), about 92% Good

→ Above 11.525% alcohol, roughly nine in ten training wines are Good.
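
The percentages quoted in this analysis follow directly from the leaf counts. The sketch below recomputes the Good share of each leaf from the (Soso, Good) counts listed above and checks that the four leaves add up to the 5,197 training samples.

```python
# (Soso, Good) training counts per leaf, copied from the analysis above.
leaves = {
    "alcohol <= 10.25, volatile acidity <= 0.252": (1118, 579),
    "alcohol <= 10.25, volatile acidity > 0.252": (277, 585),
    "10.25 < alcohol <= 11.525": (433, 1173),
    "alcohol > 11.525": (79, 953),
}

# Share of Good wines in each leaf.
good_share = {name: good / (soso + good) for name, (soso, good) in leaves.items()}
for name, share in good_share.items():
    print(f"{name}: Good share = {share:.3f}")

total = sum(soso + good for soso, good in leaves.values())
print("total training samples:", total)
```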

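Conclusion: characteristics of a good wine (Good)

Higher alcohol content is better
- above 11.525%: about 92% Good
- 10.25 to 11.525%: about 73% Good
Low alcohol (≤ 10.25%)
- per the leaf counts above, higher volatile acidity goes with Good
- lower volatile acidity goes with Soso
Feature importance ranking
- 1st: alcohol (the root split and one second-level split)
- 2nd: volatile acidity (matters only when alcohol is low)

In short:

A wine with more than 11.525% alcohol is very likely to be Good
For low-alcohol wines, volatile acidity is the feature to look at
This depth-2 tree reaches only about 74% accuracy, so these rules are rough tendencies rather than firm criteria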