Lab: Machine Learning Classifier
Unit 8: Machine Learning Foundations (Capstone) — Lab
Prerequisites: Before starting this lab, make sure you are comfortable with:
- The supervised learning workflow (Section 8.2)
- Accuracy, precision, recall, and F1 score (Section 8.3)
- Decision trees and information gain (Section 8.2)
- k-Nearest Neighbors and the effect of k (Section 8.2)
You will also need Python 3.x with pandas, numpy, scikit-learn, and matplotlib.
These are pre-installed in Google Colab and Anaconda environments.
Lab Overview
In this lab you will work through the complete machine learning workflow on a real dataset. You will train two classification algorithms — a Decision Tree and k-Nearest Neighbors — evaluate their performance using multiple metrics, tune their hyperparameters, and draw reasoned conclusions about which algorithm is better suited for this problem.
This is the same workflow that data scientists and machine learning engineers use professionally. The goal is not just to achieve high accuracy, but to understand why your model performs the way it does.
Estimated time: 3-4 hours (reading instructions, coding, testing, documenting)
Learning Objectives
By completing this lab, you will be able to:
- Load and explore a real-world tabular dataset using pandas
- Create a binary classification target from a multi-class variable
- Perform a stratified train/test split
- Train a Decision Tree classifier and tune its `max_depth` hyperparameter
- Train a k-NN classifier and tune its `n_neighbors` hyperparameter
- Evaluate both classifiers using accuracy, precision, recall, F1 score, and a confusion matrix
- Detect overfitting by comparing training vs. test accuracy
- Compare two algorithms and justify a recommendation based on problem requirements
What You Will Need
Dataset
Dataset: wine_quality.csv
Download wine_quality.csv from the Week 8 folder in Brightspace.
This dataset contains 1,000 red wine samples with 11 chemical property features and a quality rating (scale 3-8). Your task: predict whether a wine is good quality (rating ≥ 6) or normal quality (rating < 6).
Source: UCI Machine Learning Repository, Wine Quality Dataset (CC BY 4.0).
Starter Notebook
Notebook: Week8_Lab_Starter.ipynb
Download Week8_Lab_Starter.ipynb from the Week 8 folder in Brightspace.
The notebook provides the complete code structure with # TODO comments marking exactly where you need to add code.
Work through the cells in order.
Direct download: assets/notebooks/Week8_Lab_Starter.ipynb
About the Dataset
The wine quality dataset contains 1,000 red wines with 11 chemical features and a quality rating.
| Feature | Description |
|---|---|
| fixed acidity | Non-volatile acids (g/dm³) |
| volatile acidity | Acetic acid (g/dm³); high levels produce vinegar flavor |
| citric acid | Citric acid content (g/dm³) |
| residual sugar | Sugar remaining after fermentation (g/dm³) |
| chlorides | Salt content (g/dm³) |
| free sulfur dioxide | Free SO₂ (mg/dm³); prevents microbial growth |
| total sulfur dioxide | Total SO₂ (mg/dm³) |
| density | Wine density (g/cm³) |
| pH | Acidity on the pH scale |
| sulphates | Potassium sulphate (g/dm³); preservative |
| alcohol | Alcohol percentage |
| quality | Target: Rating from 3 (poor) to 8 (excellent) |
Lab Tasks
Part 1: Data Loading and Exploration (15 points)
- Import `pandas`, `numpy`, `sklearn`, and `matplotlib`
- Load `wine_quality.csv` into a DataFrame
- Display the first five rows with `df.head()`
- Display data types and missing value counts with `df.info()`
- Create the binary target column: `good_quality = 1` if `quality >= 6`, else `good_quality = 0`
- Display the class distribution with `value_counts()` and note whether the classes are balanced
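The Part 1 steps can be sketched as follows. This is a minimal outline, not the graded solution; the small hand-built DataFrame stands in for `wine_quality.csv` so the snippet runs on its own (in the lab you would use the commented `pd.read_csv` line instead, and the column set would include all 11 features):

```python
import pandas as pd

# In the lab, load the real file:
# df = pd.read_csv("wine_quality.csv")
# Tiny stand-in DataFrame so this sketch is self-contained:
df = pd.DataFrame({
    "alcohol": [9.4, 12.8, 10.5, 11.2],
    "pH": [3.51, 3.20, 3.30, 3.26],
    "quality": [5, 7, 6, 4],
})

print(df.head())   # first rows
df.info()          # dtypes and non-null counts

# Binary target: 1 if quality >= 6, else 0
df["good_quality"] = (df["quality"] >= 6).astype(int)

# Class distribution -- note whether the classes are balanced
print(df["good_quality"].value_counts())
```

The boolean-to-int cast is one idiomatic way to build the target; an `np.where` or `apply` version is equally acceptable.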
Part 2: Train-Test Split (10 points)
- Separate features `X` (all columns except `quality` and `good_quality`) from target `y` (`good_quality`)
- Split into 80% training and 20% testing using `train_test_split`
- Use `random_state=42` for reproducibility
- Use `stratify=y` to preserve the class distribution in both splits
- Print the shape and class distribution of the training and test sets to verify
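A sketch of the stratified split, using randomly generated stand-in data in place of the wine DataFrame (the `X`/`y` separation and the `train_test_split` arguments are the part that carries over to your notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data; in the lab, df comes from wine_quality.csv
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["alcohol", "pH", "sulphates"])
df["quality"] = rng.integers(3, 9, size=100)
df["good_quality"] = (df["quality"] >= 6).astype(int)

X = df.drop(columns=["quality", "good_quality"])  # features only
y = df["good_quality"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Verify shapes and that stratification preserved the class balance
print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```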
Part 3: Decision Tree Classifier (25 points)
- Create a `DecisionTreeClassifier(random_state=42)` and train it on the training data
- Generate predictions on the test set
- Calculate and print accuracy, precision, recall, and F1 score
- Display the confusion matrix and label each cell (TP, FP, FN, TN)
- Experiment with `max_depth` values of 3, 5, 10, and `None` (unlimited)
- For each depth, record accuracy and F1 score
- Identify the depth that gives the best F1 score on the test set
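One way to structure the depth experiment is a loop that records both metrics per depth. This sketch uses synthetic data from `make_classification` as a stand-in for the wine features; only the loop structure and the metric calls are meant to transfer:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features and binary target
X, y = make_classification(n_samples=400, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

results = {}
for depth in [3, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    pred = tree.predict(X_test)
    results[depth] = (accuracy_score(y_test, pred), f1_score(y_test, pred))
    print(f"max_depth={depth}: acc={results[depth][0]:.3f}, f1={results[depth][1]:.3f}")

# scikit-learn's confusion matrix layout (rows = actual, cols = predicted):
# [[TN, FP],
#  [FN, TP]]
tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
print(confusion_matrix(y_test, tree.predict(X_test)))
```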
Part 4: K-Nearest Neighbors Classifier (25 points)
- Create a `KNeighborsClassifier(n_neighbors=5)` and train it on the training data
- Generate predictions on the test set
- Calculate and print accuracy, precision, recall, and F1 score
- Display the confusion matrix
- Experiment with k values of 1, 3, 5, 10, and 20
- For each k, record accuracy and F1 score
- Identify the k value that gives the best F1 score on the test set
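The k sweep mirrors the depth sweep from Part 3. Again, `make_classification` stands in for the wine data; in your notebook, reuse the `X_train`/`X_test` split from Part 2:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=400, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

knn_results = {}
for k in [1, 3, 5, 10, 20]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    knn_results[k] = (accuracy_score(y_test, pred), f1_score(y_test, pred))
    print(f"k={k}: acc={knn_results[k][0]:.3f}, f1={knn_results[k][1]:.3f}")

# Best k by test-set F1
best_k = max(knn_results, key=lambda k: knn_results[k][1])
print("best k:", best_k)
```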
Part 5: Model Comparison and Overfitting Analysis (15 points)
- Retrain both models using the best hyperparameters you found in Parts 3 and 4
- Create a summary table comparing accuracy, precision, recall, and F1 for both models
- Compute training accuracy for each model (predict on the training set)
- Calculate the training-test gap for each model
- State whether either model is overfitting based on the gap (a gap above ~5% is a concern)
- Summarize the tradeoffs between the two algorithms for this specific problem
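A sketch of the comparison table and overfitting check. The hyperparameter values here (`max_depth=5`, `n_neighbors=5`) are placeholders; substitute the best values you found in Parts 3 and 4, and use your real split instead of the synthetic stand-in:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Placeholder "best" hyperparameters -- substitute your own
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, test_pred)
    rows.append({
        "model": name,
        "train_acc": train_acc,
        "test_acc": test_acc,
        "gap": train_acc - test_acc,  # a gap above ~0.05 suggests overfitting
        "precision": precision_score(y_test, test_pred),
        "recall": recall_score(y_test, test_pred),
        "f1": f1_score(y_test, test_pred),
    })

print(pd.DataFrame(rows))
```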
Part 6: Reflection Questions (10 points)
Answer each question in a Markdown cell in your notebook:
- Which model performed better overall, and why? Base your answer on the metrics from Part 5.
- Did you observe overfitting in either model? Cite specific numbers (training accuracy, test accuracy, gap) as evidence.
- Which hyperparameter settings worked best? State the best `max_depth` for the Decision Tree and the best `k` for k-NN, and explain why those values make sense.
- For this wine quality problem, would you recommend Decision Trees or k-NN? Consider both accuracy and interpretability in your recommendation. Is it important for a winemaker to understand why a wine was rated good?
Hints and Tips
Getting Started
- Run cells one at a time and check the output before moving on.
- Use `df.head()` and `df.describe()` to understand the data before transforming it.
- The `# TODO` comments in the starter notebook guide you step by step.
Common Issues
- File not found: Make sure `wine_quality.csv` is in the same folder as your notebook.
- Shape mismatch: Verify that `X` and `y` have the same number of rows before splitting.
- Accuracy surprisingly high (> 95%): Check for data leakage; you may have accidentally included the `quality` column in your features.
- Accuracy very low (< 50%): This dataset is genuinely challenging; 70-75% accuracy is a reasonable result.
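A quick sanity check for the leakage issue above: after building `X`, confirm that neither `quality` nor `good_quality` survived into the feature matrix. A minimal sketch with a tiny stand-in DataFrame:

```python
import pandas as pd

# Stand-in for the loaded wine DataFrame
df = pd.DataFrame({"alcohol": [9.4, 12.8], "quality": [5, 7]})
df["good_quality"] = (df["quality"] >= 6).astype(int)

X = df.drop(columns=["quality", "good_quality"])

# Leakage check: this intersection should be empty
leaked = {"quality", "good_quality"} & set(X.columns)
print("leaked columns:", leaked)
```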
Grading Rubric
| Component | Points | Criteria |
|---|---|---|
| Part 1: Data Loading and Exploration | 15 | Data loaded correctly; binary target created; class distribution examined and balance noted |
| Part 2: Train-Test Split | 10 | 80/20 split with `stratify=y` and `random_state=42`; shapes and class distributions verified |
| Part 3: Decision Tree | 25 | Tree trained; all four metrics computed; confusion matrix labeled; depth experiments (3, 5, 10, None) conducted; best depth identified |
| Part 4: k-NN Classifier | 25 | k-NN trained; all four metrics computed; confusion matrix labeled; k experiments (1, 3, 5, 10, 20) conducted; best k identified |
| Part 5: Model Comparison | 15 | Clear comparison table; training vs. test gap computed for both models; overfitting analysis stated with evidence |
| Part 6: Reflection Questions | 10 | Four thoughtful answers demonstrating conceptual understanding (not just numbers) |
| Total | 100 | |
Deliverables
Submit to Brightspace:
- Completed Jupyter Notebook
  - Filename: `Week8_Lab_YourLastName.ipynb`
  - All code cells executed with visible output
  - All `# TODO` items completed
  - Reflection answers written in Markdown cells
- Optional Brief Report (1-2 paragraphs in the notebook or as a separate document)
  - Which model worked better and why
  - What you would do differently with more time
Academic Integrity
You may discuss concepts and approaches with classmates, but every line of code and every analysis must be your own work. Using code from online sources without attribution or submitting another student’s work constitutes academic dishonesty under CPCC policy.
Optional Extensions
Finished early? Try these additional investigations:
Feature Importance:
Decision trees expose a `feature_importances_` attribute. Identify which wine chemical properties most strongly predict quality and create a bar chart visualization.
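A minimal sketch of the importance bar chart, again on synthetic stand-in data with placeholder feature names (in your notebook, use the fitted tree from Part 3 and `X.columns` as labels):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

feature_names = [f"feature_{i}" for i in range(11)]  # stand-in names
X, y = make_classification(n_samples=400, n_features=11, random_state=42)

tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)
importances = tree.feature_importances_  # one value per feature, sums to 1

# Horizontal bar chart, sorted so the most important feature is on top
order = np.argsort(importances)
plt.barh(np.array(feature_names)[order], importances[order])
plt.xlabel("Importance")
plt.tight_layout()
plt.savefig("feature_importance.png")
```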
Feature Scaling for k-NN:
k-NN is sensitive to feature scale. Apply `StandardScaler` to normalize all features and compare k-NN performance before and after scaling.
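One clean way to do this comparison is a `Pipeline`, which fits the scaler on the training data only and so avoids leakage. A sketch on synthetic data, where one feature's scale is deliberately exaggerated to make the effect visible:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=11, random_state=42)
X[:, 0] *= 1000  # exaggerate one feature's scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scaled = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_train, y_train)

f1_raw = f1_score(y_test, raw.predict(X_test))
f1_scaled = f1_score(y_test, scaled.predict(X_test))
print("unscaled F1:", f1_raw)
print("scaled   F1:", f1_scaled)
```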
Additional Algorithms:
Try `RandomForestClassifier` or `LogisticRegression` from scikit-learn and compare their performance to Decision Tree and k-NN.
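Both fit the same `fit`/`predict` interface, so they slot straight into the comparison loop from Part 5. A sketch on the same synthetic stand-in data (`max_iter=1000` is a precaution for logistic regression convergence, not a requirement of the lab):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

extra_models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in extra_models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = (accuracy_score(y_test, pred), f1_score(y_test, pred))
    print(name, scores[name])
```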
Want to explore further?
- TensorFlow Playground (Apache 2.0): visualize classification boundaries interactively
- Google Teachable Machine (Apache 2.0): train image and audio classifiers with no code
- scikit-learn User Guide (BSD): comprehensive documentation for every algorithm you used in this lab
Lab code uses scikit-learn, BSD License.
Wine Quality dataset from the UCI Machine Learning Repository, CC BY 4.0.
Code examples adapted from aima-python, MIT License.
This work is licensed under CC BY-SA 4.0.