Lab Walkthrough: Machine Learning Classifier

Unit 8: Machine Learning Foundations (Capstone) — Lab Walkthrough

Use this walkthrough only after you have attempted the lab yourself.

Academic integrity requires that your submitted notebook represents your own original work. Use this walkthrough to understand why each step works and to check your approach — not to copy code. Understanding the reasoning behind each decision is far more valuable than having working output.

Overview

This walkthrough provides a complete solution to the Week 8 Machine Learning Classifier Lab. We will train and evaluate Decision Tree and k-NN classifiers on the wine quality dataset, compare performance, diagnose overfitting, and make a justified algorithm recommendation.

Part 1: Data Loading and Exploration

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv('wine_quality.csv')

print("First 5 rows:")
print(df.head())
print("\nDataset info:")
df.info()
print("\nMissing values:", df.isnull().sum().sum())

What this does: Imports all necessary libraries and loads the dataset. df.info() confirms the data types, and its non-null counts reveal any missing values (this dataset has none). Always check for missing values before proceeding — real datasets often require imputation.
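If the data had contained gaps, median imputation is one common remedy. The snippet below is an illustrative sketch on a tiny made-up frame (not the lab data, which is complete), using scikit-learn's SimpleImputer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny illustrative frame with gaps (not the lab data, which has none)
df = pd.DataFrame({'alcohol': [9.4, np.nan, 10.5],
                   'pH':      [3.3, 3.4, np.nan]})

# Learn each column's median on fit, then fill every NaN with it
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

fit_transform learns the per-column medians (here 9.95 for alcohol, 3.35 for pH) and fills each NaN with them; in a real pipeline you would fit the imputer on the training split only.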

# Create binary target
df['good_quality'] = (df['quality'] >= 6).astype(int)

print("Class distribution:")
print(df['good_quality'].value_counts())
print("\nAs percentages:")
print(df['good_quality'].value_counts(normalize=True).mul(100).round(1))

Expected output:

Class distribution:
0    614   (61.4% -- normal quality)
1    386   (38.6% -- good quality)

Key observation: The classes are slightly imbalanced (61.4% vs 38.6%). This is manageable, but it means that a model predicting "always normal quality" would achieve 61.4% accuracy. Accuracy alone is therefore not a sufficient metric for this problem — we also need precision, recall, and F1 to detect whether the model is learning or just predicting the majority class.
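That majority-class baseline can be made explicit with scikit-learn's DummyClassifier. The sketch below uses synthetic labels with the same 61.4% / 38.6% imbalance rather than the lab data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with the lab's 61.4% / 38.6% class balance
y = np.array([0] * 614 + [1] * 386)
X = np.zeros((len(y), 1))  # features are irrelevant to a majority-class baseline

# "Always predict the most frequent class"
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
y_pred = baseline.predict(X)

print(f"Baseline accuracy: {accuracy_score(y, y_pred):.3f}")  # 0.614
print(f"Baseline F1:       {f1_score(y, y_pred):.3f}")        # 0.000 -- never predicts class 1
```

The baseline hits 61.4% accuracy but an F1 of 0.0 because it never predicts the positive class, which is exactly why accuracy alone is not enough here.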

Part 2: Train-Test Split

X = df.drop(['quality', 'good_quality'], axis=1)
y = df['good_quality']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y        # preserve class proportions in both splits
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set:     {X_test.shape[0]} samples")
print(f"\nTrain class distribution:\n{y_train.value_counts()}")
print(f"\nTest class distribution:\n{y_test.value_counts()}")

Expected output:

Training set: 800 samples
Test set:     200 samples

Train class distribution:
0    491
1    309

Test class distribution:
0    123
1     77

The stratify=y parameter is essential here. Without it, random chance could produce a test set that has 80% normal quality wines, making the evaluation unrepresentative. Stratified splitting ensures both sets reflect the original 61.4% / 38.6% class balance.
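The effect is easy to demonstrate on synthetic labels with the same imbalance (an illustrative sketch; random_state=0 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the lab's 61.4% / 38.6% imbalance
y = np.array([0] * 614 + [1] * 386)
X = np.arange(len(y)).reshape(-1, 1)

# Compare the test-set class balance with and without stratification
for strat in (None, y):
    _, _, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                        random_state=0, stratify=strat)
    label = "stratified" if strat is not None else "unstratified"
    print(f"{label:>12}: test-set share of class 1 = {y_te.mean():.3f}")
```

The stratified split pins the test-set positive share at essentially 0.386, while the unstratified one drifts with the random seed.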

Part 3: Decision Tree Classifier

# Train default (unlimited depth) decision tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)

# Compute metrics
print(f"Accuracy:  {accuracy_score(y_test, y_pred_dt):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_dt):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred_dt):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_dt):.4f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_dt)
print(f"\nConfusion Matrix:\n{cm}")
print(f"TN={cm[0,0]}  FP={cm[0,1]}")
print(f"FN={cm[1,0]}  TP={cm[1,1]}")

Typical output:

Accuracy:  0.7150
Precision: 0.6389
Recall:    0.5974
F1 Score:  0.6174

Confusion Matrix:
[[97 26]
 [31 46]]
TN=97  FP=26
FN=31  TP=46

Interpreting these results:

  • The model correctly classified 71.5% of wines.

  • Precision 63.9%: When it predicts "good quality," it is right just under two-thirds of the time.

  • Recall 59.7%: It catches just under 60% of genuinely good wines — missing 40%.

  • F1 61.7%: The harmonic mean reflects the imperfect balance between precision and recall.

The unlimited-depth tree can become very complex and may be overfitting. The depth experiments below will reveal whether constraining depth improves generalization.
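All four metrics can be recomputed by hand from the confusion-matrix cells. The sketch below uses illustrative counts, not the lab's actual results:

```python
# Illustrative confusion-matrix cells (not the lab's actual counts)
tn, fp, fn, tp = 110, 20, 30, 40

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of wines predicted "good", the share that really were
recall    = tp / (tp + fn)   # of truly good wines, the share the model caught
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f}  precision={precision:.4f}  "
      f"recall={recall:.4f}  f1={f1:.4f}")
```

Checking metrics against the raw cells like this is a quick way to catch reporting mistakes.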

Depth Experiments

depths = [3, 5, 10, None]
print(f"{'Depth':<10} {'Accuracy':<12} {'F1 Score':<10}")
print("-" * 35)

for depth in depths:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    label = str(depth) if depth is not None else 'None'
    print(f"{label:<10} {acc:.4f}       {f1:.4f}")

Typical output:

Depth      Accuracy     F1 Score
-----------------------------------
3          0.7100       0.6370
5          0.7200       0.6444
10         0.7200       0.6444
None       0.7150       0.6174

Key findings:

  • Depth 3 is too shallow — the tree cannot capture enough patterns (F1 63.7%).

  • Depths 5 and 10 achieve the same performance, meaning additional complexity beyond depth 5 adds nothing.

  • Unlimited depth is slightly worse than depth 5 — a clear sign of overfitting. The unlimited tree memorizes training noise that does not generalize.

  • Best depth: 5.
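One caveat worth knowing: picking the depth by repeatedly scoring on the test set slowly overfits to that test set. A more robust habit is cross-validation on the training data. The sketch below uses a synthetic stand-in from make_classification rather than the wine features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features (roughly 60/40 class balance)
X, y = make_classification(n_samples=800, n_features=11, weights=[0.6],
                           random_state=42)

# Score each candidate depth with 5-fold cross-validation on training data only
for depth in [3, 5, 10, None]:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(dt, X, y, cv=5, scoring='f1')
    print(f"depth={depth}: mean CV F1 = {scores.mean():.4f}")
```

The depth with the best mean cross-validated F1 is then evaluated once on the held-out test set.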

Part 4: K-Nearest Neighbors Classifier

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_test)

print(f"Accuracy:  {accuracy_score(y_test, y_pred_knn):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_knn):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred_knn):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_knn):.4f}")

Typical output:

Accuracy:  0.7350
Precision: 0.6892
Recall:    0.6623
F1 Score:  0.6755

k Value Experiments

k_values = [1, 3, 5, 10, 20]
print(f"{'k':<6} {'Accuracy':<12} {'F1 Score':<10}")
print("-" * 30)

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print(f"{k:<6} {acc:.4f}       {f1:.4f}")

Typical output:

k      Accuracy     F1 Score
------------------------------
1      0.6950       0.6588
3      0.7250       0.6623
5      0.7350       0.6755
10     0.7400       0.6887
20     0.7300       0.6797

Key findings:

  • k=1 is the worst — it is extremely sensitive to noise and outliers.

  • Performance improves as k increases from 1 to 10.

  • k=10 achieves the best F1 (68.9%) — it averages over enough neighbors to smooth out noise.

  • k=20 begins to oversmooth and loses local patterns.

  • Best k: 10.
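A caveat the lab glosses over: k-NN is distance-based, so features with large numeric ranges dominate the distance calculation. Standardizing features inside a Pipeline often helps k-NN. This is a sketch on synthetic data, not the lab's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features; inflate one column's scale,
# mimicking raw chemistry columns (e.g. sulfur dioxide vs. density)
X, y = make_classification(n_samples=1000, n_features=11, random_state=42)
X[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)

raw = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier(n_neighbors=10)).fit(X_tr, y_tr)

acc_raw = raw.score(X_te, y_te)
acc_scaled = scaled.score(X_te, y_te)
print(f"unscaled accuracy: {acc_raw:.4f}")
print(f"scaled accuracy:   {acc_scaled:.4f}")
```

Putting the scaler inside the Pipeline matters: it is fitted on the training split only, so no test-set statistics leak into training.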

Part 5: Model Comparison and Overfitting Analysis

# Retrain with best hyperparameters
best_dt = DecisionTreeClassifier(max_depth=5, random_state=42)
best_dt.fit(X_train, y_train)
y_pred_best_dt = best_dt.predict(X_test)

best_knn = KNeighborsClassifier(n_neighbors=10)
best_knn.fit(X_train, y_train)
y_pred_best_knn = best_knn.predict(X_test)

# Training accuracy for overfitting analysis
train_acc_dt  = accuracy_score(y_train, best_dt.predict(X_train))
train_acc_knn = accuracy_score(y_train, best_knn.predict(X_train))
test_acc_dt   = accuracy_score(y_test, y_pred_best_dt)
test_acc_knn  = accuracy_score(y_test, y_pred_best_knn)

print("Model Comparison (Test Set)")
print("=" * 60)
for name, y_pred in [("Decision Tree (depth=5)", y_pred_best_dt),
                      ("k-NN (k=10)",            y_pred_best_knn)]:
    print(f"\n{name}")
    print(f"  Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
    print(f"  Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"  Recall:    {recall_score(y_test, y_pred):.4f}")
    print(f"  F1 Score:  {f1_score(y_test, y_pred):.4f}")

print("\nOverfitting Analysis")
print("=" * 60)
print(f"Decision Tree -- Train: {train_acc_dt:.4f}  Test: {test_acc_dt:.4f}  Gap: {train_acc_dt-test_acc_dt:.4f}")
print(f"k-NN          -- Train: {train_acc_knn:.4f}  Test: {test_acc_knn:.4f}  Gap: {train_acc_knn-test_acc_knn:.4f}")

Typical output:

Model Comparison (Test Set)
============================================================

Decision Tree (depth=5)
  Accuracy:  0.7200
  Precision: 0.6667
  Recall:    0.6234
  F1 Score:  0.6444

k-NN (k=10)
  Accuracy:  0.7400
  Precision: 0.7027
  Recall:    0.6753
  F1 Score:  0.6887

Overfitting Analysis
============================================================
Decision Tree -- Train: 0.7363  Test: 0.7200  Gap: 0.0163
k-NN          -- Train: 0.7725  Test: 0.7400  Gap: 0.0325

Analysis:

  • k-NN wins on all four metrics.

  • Neither model shows severe overfitting. The Decision Tree gap is only 1.6%; the k-NN gap is 3.3%. Both are well within acceptable bounds.

  • Limiting complexity (max_depth=5, k=10) successfully prevented overfitting in both cases.

Part 6: Sample Reflection Answers

Q1: Which model performed better, and why?

k-NN with k=10 outperformed the Decision Tree on every metric (74.0% vs 72.0% accuracy; 68.9% vs 64.4% F1). This makes intuitive sense for wine quality: wines with similar chemical compositions tend to be similar in quality. k-NN naturally exploits this by looking directly at similar training examples. The Decision Tree makes hard axis-aligned cuts that may miss nuanced chemical relationships.

Q2: Did you observe overfitting?

No severe overfitting. The Decision Tree had only a 1.6% training-test gap; k-NN had a 3.3% gap. Both gaps are small. This is because we constrained complexity: max_depth=5 prevented the decision tree from growing arbitrarily deep, and k=10 smoothed out k-NN’s sensitivity to individual training points.

Q3: Which hyperparameter settings worked best?

max_depth=5 for the Decision Tree; k=10 for k-NN. Both represent the "just right" balance: complex enough to learn real patterns, simple enough to avoid memorizing noise. Going simpler (depth=3, k=20) produced underfitting; going more complex (unlimited depth, k=1) produced overfitting.

Q4: Recommendation for the wine quality problem?

I would recommend the Decision Tree despite its slightly lower accuracy, for this application context. Winemakers and quality managers need to understand why a wine was rated good or poor in order to act on that information. A decision tree gives them explicit, interpretable rules: "IF alcohol > 10.5 AND volatile_acidity < 0.4 THEN good quality." This is actionable. A k-NN prediction — "the 10 most similar wines in our database said good quality" — is much harder for a domain expert to verify or act on. For a 2% accuracy tradeoff, interpretability is worth it in this context.
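The kind of rule readout described above can actually be printed with scikit-learn's export_text. The sketch below trains a tiny tree on synthetic data with made-up feature names, so the thresholds are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in; the feature names are made up for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
feature_names = ['alcohol', 'volatile_acidity', 'sulphates', 'pH']

dt = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Print the tree as human-readable IF/THEN rules
rules = export_text(dt, feature_names=feature_names)
print(rules)
```

On the real lab model (best_dt with depth 5), the same call yields the explicit decision rules a winemaker could audit.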

Key Learnings from This Lab

You experienced the complete professional ML workflow: load, explore, split, train, tune, evaluate, compare.

Five key takeaways:

  1. Accuracy alone misleads — on imbalanced data, always check precision, recall, and F1.

  2. Overfitting is real and diagnosable — compare training vs. test accuracy; a large gap is a red flag.

  3. Hyperparameter tuning matters — tuning k alone moved accuracy by 4.5 percentage points in this lab (69.5% at k=1 vs. 74.0% at k=10).

  4. No single best algorithm — k-NN won on accuracy; Decision Tree wins on interpretability. "Best" depends on the problem constraints, not just the numbers.

  5. The gap is more informative than training accuracy — a small gap between training and test is a better sign of generalization than a high training score.


Lab code uses scikit-learn, BSD License.

Wine Quality dataset from the UCI Machine Learning Repository, CC BY 4.0.

This work is licensed under CC BY-SA 4.0.