Lab: Machine Learning Classifier

Unit 8: Machine Learning Foundations (Capstone) — Lab

Prerequisites: Before starting this lab, make sure you are comfortable with:

  • The supervised learning workflow (Section 8.2)

  • Accuracy, precision, recall, and F1 score (Section 8.3)

  • Decision trees and information gain (Section 8.2)

  • K-Nearest Neighbors and the effect of k (Section 8.2)

You will also need Python 3.x with pandas, numpy, scikit-learn, and matplotlib. These are pre-installed in Google Colab and Anaconda environments.

Lab Overview

In this lab you will work through the complete machine learning workflow on a real dataset. You will train two classification algorithms — a Decision Tree and k-Nearest Neighbors — evaluate their performance using multiple metrics, tune their hyperparameters, and draw reasoned conclusions about which algorithm is better suited for this problem.

This is the same workflow that data scientists and machine learning engineers use professionally. The goal is not just to achieve high accuracy, but to understand why your model performs the way it does.

Estimated time: 3–4 hours (reading instructions, coding, testing, documenting)

Learning Objectives

By completing this lab, you will be able to:

  1. Load and explore a real-world tabular dataset using pandas

  2. Create a binary classification target from a multi-class variable

  3. Perform a stratified train/test split

  4. Train a Decision Tree classifier and tune its max_depth hyperparameter

  5. Train a k-NN classifier and tune its n_neighbors hyperparameter

  6. Evaluate both classifiers using accuracy, precision, recall, F1 score, and a confusion matrix

  7. Detect overfitting by comparing training vs. test accuracy

  8. Compare two algorithms and justify a recommendation based on problem requirements

What You Will Need

Dataset

Dataset: wine_quality.csv

Download wine_quality.csv from the Week 8 folder in Brightspace.

This dataset contains 1,000 red wine samples with 11 chemical property features and a quality rating (scale 3–8). Your task: predict whether a wine is good quality (rating ≥ 6) or normal quality (rating < 6).

Source: UCI Machine Learning Repository, Wine Quality Dataset (CC BY 4.0).

Starter Notebook

Notebook: Week8_Lab_Starter.ipynb

Download Week8_Lab_Starter.ipynb from the Week 8 folder in Brightspace.

The notebook provides the complete code structure with # TODO comments marking exactly where you need to add code. Work through the cells in order.

Direct download: assets/notebooks/Week8_Lab_Starter.ipynb

About the Dataset

The wine quality dataset contains 1,000 red wines with 11 chemical features and a quality rating.

Feature descriptions:

  • fixed_acidity: Non-volatile acids (g/dm³)

  • volatile_acidity: Acetic acid (g/dm³); high levels produce vinegar flavor

  • citric_acid: Citric acid content (g/dm³)

  • residual_sugar: Sugar remaining after fermentation (g/dm³)

  • chlorides: Salt content (g/dm³)

  • free_sulfur_dioxide: Free SO₂ (mg/dm³); prevents microbial growth

  • total_sulfur_dioxide: Total SO₂ (mg/dm³)

  • density: Wine density (g/cm³)

  • pH: Acidity on the pH scale

  • sulphates: Potassium sulphate (g/dm³); preservative

  • alcohol: Alcohol percentage

  • quality: Target; rating from 3 (poor) to 8 (excellent)

Lab Tasks

Part 1: Data Loading and Exploration (15 points)

  1. Import pandas, numpy, sklearn, and matplotlib

  2. Load wine_quality.csv into a DataFrame

  3. Display the first five rows with df.head()

  4. Display data types and missing value counts with df.info()

  5. Create the binary target column: good_quality = 1 if quality >= 6, else good_quality = 0

  6. Display the class distribution with value_counts() and note whether the classes are balanced
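The steps above can be sketched as follows. The small DataFrame here is a stand-in so the cell runs anywhere; in the lab you would load wine_quality.csv instead, as shown in the comment.

```python
import pandas as pd

# In the lab: df = pd.read_csv("wine_quality.csv")
# Tiny stand-in DataFrame so this sketch runs without the file:
df = pd.DataFrame({
    "alcohol": [9.4, 11.2, 10.1, 12.8],
    "quality": [5, 6, 5, 7],
})

print(df.head())   # first rows
df.info()          # dtypes and missing-value counts

# Binary target: 1 for good quality (rating >= 6), 0 otherwise
df["good_quality"] = (df["quality"] >= 6).astype(int)
print(df["good_quality"].value_counts())  # check whether classes are balanced
```

The boolean comparison produces True/False, and `.astype(int)` converts that to the 1/0 labels the classifiers expect.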

Part 2: Train-Test Split (10 points)

  1. Separate features X (all columns except quality and good_quality) from target y (good_quality)

  2. Split into 80% training and 20% testing using train_test_split

  3. Use random_state=42 for reproducibility

  4. Use stratify=y to preserve the class distribution in both splits

  5. Print the shape and class distribution of the training and test sets to verify
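A minimal sketch of the split, using randomly generated stand-in arrays in place of the wine features and target (in the lab, X and y come from your DataFrame):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 11))     # stand-in: 100 samples, 11 features
y = np.array([0] * 60 + [1] * 40)  # stand-in target with a 60/40 class split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)                # (80, 11) (20, 11)
print(np.bincount(y_train), np.bincount(y_test))  # class ratios preserved in both splits
```

Because of `stratify=y`, the 60/40 class ratio appears in both the training set (48/32) and the test set (12/8), which is exactly what step 4 asks you to verify.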

Part 3: Decision Tree Classifier (25 points)

  1. Create a DecisionTreeClassifier(random_state=42) and train it on the training data

  2. Generate predictions on the test set

  3. Calculate and print accuracy, precision, recall, and F1 score

  4. Display the confusion matrix and label each cell (TP, FP, FN, TN)

  5. Experiment with max_depth values of 3, 5, 10, and None (unlimited)

  6. For each depth, record accuracy and F1 score

  7. Identify the depth that gives the best F1 score on the test set
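The depth experiment can be structured as a simple loop. This sketch uses `make_classification` stand-in data so it is self-contained; in the lab, reuse the split from Part 2 instead:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Stand-in data; in the lab use X_train, X_test, y_train, y_test from Part 2.
X, y = make_classification(n_samples=300, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

for depth in [3, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    pred = tree.predict(X_test)
    print(f"depth={depth}: acc={accuracy_score(y_test, pred):.3f}  "
          f"f1={f1_score(y_test, pred):.3f}")

# scikit-learn's confusion matrix puts true labels in rows, predictions in columns:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, pred))
```

Note the cell layout: with the default label order, the top-left cell is true negatives and the bottom-right is true positives, which is what step 4 asks you to label.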

Part 4: K-Nearest Neighbors Classifier (25 points)

  1. Create a KNeighborsClassifier(n_neighbors=5) and train it on the training data

  2. Generate predictions on the test set

  3. Calculate and print accuracy, precision, recall, and F1 score

  4. Display the confusion matrix

  5. Experiment with k values of 1, 3, 5, 10, and 20

  6. For each k, record accuracy and F1 score

  7. Identify the k value that gives the best F1 score on the test set
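The k sweep mirrors the depth sweep from Part 3. Again, `make_classification` stands in for the wine data so the sketch runs on its own; in the lab, reuse your Part 2 split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

# Stand-in data; in the lab use the split from Part 2.
X, y = make_classification(n_samples=300, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

for k in [1, 3, 5, 10, 20]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    pred = knn.predict(X_test)
    print(f"k={k}: acc={accuracy_score(y_test, pred):.3f}  "
          f"f1={f1_score(y_test, pred):.3f}")
```

Expect small k to track the training data closely (low bias, high variance) and large k to smooth predictions out; the best k on the test set usually sits between those extremes.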

Part 5: Model Comparison and Overfitting Analysis (15 points)

  1. Retrain both models using the best hyperparameters you found in Parts 3 and 4

  2. Create a summary table comparing accuracy, precision, recall, and F1 for both models

  3. Compute training accuracy for each model (predict on the training set)

  4. Calculate the training-test gap for each model

  5. State whether either model is overfitting based on the gap (a gap above ~5% is a concern)

  6. Summarize the tradeoffs between the two algorithms for this specific problem
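One way to compute the training-test gap from steps 3 and 4 is a small helper that works for any fitted classifier. The function name and demo data are illustrative, not part of the lab's starter code:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def train_test_gap(model, X_train, y_train, X_test, y_test):
    """Return (train_acc, test_acc, gap). A large positive gap suggests overfitting."""
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc, train_acc - test_acc

# Demo on stand-in data: an unlimited-depth tree memorizes the training set,
# so its training accuracy is 1.0 and any shortfall shows up as a gap.
X, y = make_classification(n_samples=300, n_features=11, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
print(train_test_gap(tree, X_tr, y_tr, X_te, y_te))
```

Run the same helper on both tuned models and compare the gaps against the ~5% guideline from step 5.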

Part 6: Reflection Questions (10 points)

Answer each question in a Markdown cell in your notebook:

  1. Which model performed better overall, and why? Base your answer on the metrics from Part 5.

  2. Did you observe overfitting in either model? Cite specific numbers (training accuracy, test accuracy, gap) as evidence.

  3. Which hyperparameter settings worked best? State the best max_depth for the Decision Tree and the best k for k-NN, and explain why those values make sense.

  4. For this wine quality problem, would you recommend Decision Trees or k-NN? Consider both accuracy and interpretability in your recommendation. Is it important for a winemaker to understand why a wine was rated good?

Hints and Tips

Getting Started

  • Run cells one at a time and check the output before moving on.

  • Use df.head() and df.describe() to understand the data before transforming it.

  • The # TODO comments in the starter notebook guide you step by step.

Common Issues

  • File not found: Make sure wine_quality.csv is in the same folder as your notebook.

  • Shape mismatch: Verify that X and y have the same number of rows before splitting.

  • Accuracy surprisingly high (> 95%): Check for data leakage — you may have accidentally included the quality column in your features.

  • Accuracy very low (< 50%): This dataset is genuinely challenging; 70–75% accuracy is a reasonable result.

Grading Rubric

Part 1: Data Loading and Exploration (15 points)
  Data loaded correctly; df.info() output shown; binary target created; class distribution displayed

Part 2: Train-Test Split (10 points)
  80/20 split with random_state=42 and stratify=y; both set sizes confirmed

Part 3: Decision Tree (25 points)
  Tree trained; all four metrics computed; confusion matrix labeled; depth experiments (3, 5, 10, None) conducted; best depth identified

Part 4: k-NN Classifier (25 points)
  k-NN trained; all four metrics computed; confusion matrix labeled; k experiments (1, 3, 5, 10, 20) conducted; best k identified

Part 5: Model Comparison (15 points)
  Clear comparison table; training vs. test gap computed for both models; overfitting analysis stated with evidence

Part 6: Reflection Questions (10 points)
  Four thoughtful answers demonstrating conceptual understanding (not just numbers)

Total: 100 points

Deliverables

Submit to Brightspace:

  1. Completed Jupyter Notebook

    • Filename: Week8_Lab_YourLastName.ipynb

    • All code cells executed with visible output

    • All # TODO items completed

    • Reflection answers written in Markdown cells

  2. Optional Brief Report (1–2 paragraphs in the notebook or as a separate document)

    • Which model worked better and why

    • What you would do differently with more time

Academic Integrity

You may discuss concepts and approaches with classmates, but every line of code and every analysis must be your own work. Using code from online sources without attribution or submitting another student’s work constitutes academic dishonesty under CPCC policy.

Optional Extensions

Finished early? Try these additional investigations:

Feature Importance: Decision trees compute a feature_importances_ attribute. Identify which wine chemical properties most strongly predict quality and create a bar chart visualization.
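A sketch of the feature-importance idea, on stand-in data with made-up feature names (in the lab, use your fitted tree and the real column names from X):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and hypothetical feature names for illustration.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
cols = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)

# feature_importances_ sums to 1.0 across all features.
importances = pd.Series(tree.feature_importances_, index=cols)
print(importances.sort_values(ascending=False))
# importances.sort_values().plot.barh()  # bar chart via matplotlib
```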

Feature Scaling for k-NN: K-NN is sensitive to feature scale. Apply StandardScaler to normalize all features and compare k-NN performance before and after scaling.
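For the scaling extension, a pipeline is a convenient way to fit the scaler on training data only and avoid test-set leakage. Stand-in data again; in the lab, use your Part 2 split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; in the lab use the split from Part 2.
X, y = make_classification(n_samples=300, n_features=11, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

plain = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
scaled = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

print("unscaled accuracy:", plain.score(X_te, y_te))
print("scaled accuracy:  ", scaled.score(X_te, y_te))
```

On the wine data, where features range from tiny densities to double-digit SO₂ values, scaling usually changes which neighbors are "nearest" and can shift k-NN's accuracy noticeably.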

Additional Algorithms: Try RandomForestClassifier or LogisticRegression from scikit-learn and compare their performance to Decision Tree and k-NN.



Lab code uses scikit-learn, BSD License.

Wine Quality dataset from the UCI Machine Learning Repository, CC BY 4.0.

Code examples adapted from aima-python, MIT License.

This work is licensed under CC BY-SA 4.0.