Lab 7: Spam Classifier with Naive Bayes
Unit 7: Probability and Uncertainty in AI — Lab
In this lab you will build a naive Bayes spam classifier from scratch. Rather than using a pre-built machine learning library, you will compute the prior probabilities and word likelihoods directly from data — exactly as the algorithm works in a real email system. By the end, you will have a working classifier that achieves over 95% accuracy on real SMS messages, and a clear understanding of why it works.
Learning Objectives
- Load and explore a real-world text dataset (UCI SMS Spam Collection).
- Preprocess text data: tokenize, normalize, and build a vocabulary.
- Implement the training step: compute prior probabilities and word likelihoods with Laplace smoothing.
- Implement the classification step: compute log-probability scores and predict class labels.
- Evaluate the classifier: compute accuracy, identify false positives and false negatives.
- Reflect on the naive independence assumption and its practical implications.
Prerequisite Concepts
Before starting this lab, make sure you can answer:
- What is Bayes' theorem, and what do prior/posterior mean? (Section 7.4)
- What is the naive independence assumption? (Section 7.5)
- How does Laplace smoothing prevent zero probabilities? (Section 7.5)
- How do we use log probabilities to avoid numerical underflow? (Section 7.5)
If any of these feel shaky, review Sections 7.4 and 7.5 before opening the notebook.
Dataset
This lab uses the UCI SMS Spam Collection dataset: 5,574 SMS messages labeled as either "spam" or "ham" (legitimate). The dataset is provided in the course assets folder.
Dataset: SMS Spam Collection
The SMSSpamCollection file is a tab-separated file.
Each line has two fields:
<label><TAB><message text>
Example lines:
```
ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005.
ham	Ok lar... Joking wif u oni...
```
Download: The file is already available in the course datasets folder and is referenced in the notebook. No separate download is required.
License: The UCI SMS Spam Collection is shared under CC BY 4.0. Tiago A. Almeida and José María Gómez Hidalgo, "Contributions to the Study of SMS Spam Filtering: New Collection and Results."
Setup
This lab runs in Google Colab (recommended) or any local Jupyter environment with Python 3.
Getting Started
- Open Google Colab: colab.research.google.com
- Upload the starter notebook `Spam_Classifier_Lab_Student.ipynb` from the course assets.
- Run all cells in order using Runtime → Run all, or Shift+Enter cell by cell.
- The first cell will upload and display the dataset. If you are working locally, update the `DATA_PATH` variable to point to your copy of `SMSSpamCollection`.
Notebook Download
Starter Notebook
Download Spam_Classifier_Lab_Student.ipynb from the course assets.
The notebook contains:
- Skeleton code with `# TODO` markers for each task
- Inline comments explaining each step
- Built-in test cells that check your implementations
- Visualization cells for exploring results
Open in Google Colab or local Jupyter.
Lab Tasks
The notebook is organized into five tasks. Each task builds on the previous one.
Task 1: Load and Explore the Data
Objective: Load the SMS Spam Collection, split into training and test sets, and compute basic statistics.
You will:
- Read the tab-separated data file into a list of `(label, message)` tuples.
- Split 80/20 into training and test sets using a fixed random seed.
- Count the number of spam and ham messages in the training set.
- Print several example messages from each class.
Expected output: Training set has ~4,459 messages; test set has ~1,115 messages.
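The steps above can be sketched as follows. This is a minimal outline, not the starter notebook's exact code: the file path, the `test_fraction` default, and the seed value `42` are assumptions, so match whatever the notebook's test cell specifies.

```python
import random

def load_data(path):
    """Read the tab-separated SMSSpamCollection file into (label, message) tuples."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the FIRST tab only; the message itself may contain tabs
            label, _, message = line.partition("\t")
            pairs.append((label, message))
    return pairs

def train_test_split(pairs, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed for reproducibility, then split 80/20."""
    shuffled = pairs[:]                       # copy so the original list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]
```

With the full dataset, `train_test_split(load_data(DATA_PATH))` should yield the ~4,459/~1,115 split described above.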
Task 2: Text Preprocessing
Objective: Convert raw message text into a list of word tokens.
You will implement a tokenize(text) function that:
- Converts the text to lowercase.
- Splits on whitespace and punctuation.
- Removes tokens that are purely numeric or shorter than 2 characters.
- Returns a list of word strings.

Note: You do not need to remove stop words ("the", "a", "is") for this lab. Naive Bayes handles them naturally: they appear roughly equally in spam and ham, so they have near-zero discriminative power.
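One way to satisfy all four requirements is a single regex split. This is a sketch, not the only acceptable answer; the notebook's test cell defines the exact behavior expected.

```python
import re

def tokenize(text):
    """Lowercase, split on runs of non-alphanumeric characters,
    then drop tokens that are purely numeric or shorter than 2 characters."""
    tokens = re.split(r"\W+", text.lower())
    return [t for t in tokens if len(t) >= 2 and not t.isdigit()]
```

Note that obfuscated spellings like "fr33" survive the numeric filter because they are not purely digits, which matters for the reflection questions later.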
Task 3: Train the Classifier
Objective: Compute prior probabilities and word likelihood tables from the training set.
You will implement a train(training_data) function that:
- Counts spam and ham messages to compute priors:

  P(Spam) = count_spam / total_messages
  P(Ham) = count_ham / total_messages

- Counts word occurrences in each class (word frequency dictionaries).
- Builds the vocabulary: the set of all unique words across all training messages.
- Computes word likelihoods with Laplace smoothing:

  P(word | Spam) = (count(word in spam) + 1) / (total_words_in_spam + vocabulary_size)
  P(word | Ham) = (count(word in ham) + 1) / (total_words_in_ham + vocabulary_size)
The function should return a dictionary containing the priors, word likelihood tables, and vocabulary size.
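A compact sketch of this training step is shown below. The dictionary keys (`"prior"`, `"likelihood"`, `"vocab_size"`) and the stand-in whitespace tokenizer are assumptions for illustration; use your Task 2 tokenizer and whatever structure the notebook's test cells expect.

```python
from collections import Counter

def tokenize(text):
    """Stand-in tokenizer; swap in your Task 2 implementation."""
    return text.lower().split()

def train(training_data):
    """Compute priors and Laplace-smoothed word likelihoods
    from a list of (label, message) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    msg_counts = Counter(label for label, _ in training_data)
    for label, message in training_data:
        word_counts[label].update(tokenize(message))

    # Vocabulary = all unique words seen in either class
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    vocab_size = len(vocabulary)
    totals = {c: sum(word_counts[c].values()) for c in word_counts}

    return {
        "prior": {c: msg_counts[c] / len(training_data) for c in word_counts},
        # Add-one (Laplace) smoothing: no vocabulary word ever gets probability 0
        "likelihood": {
            c: {w: (word_counts[c][w] + 1) / (totals[c] + vocab_size)
                for w in vocabulary}
            for c in word_counts
        },
        "vocab_size": vocab_size,
    }
```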
Task 4: Classify Messages
Objective: Use the trained model to classify new messages.
You will implement a classify(message, model) function that:
- Tokenizes the message.
- Computes the log-probability score for each class:

  score(Spam) = log(P(Spam)) + sum of log(P(word | Spam)) for each word in the message
  score(Ham) = log(P(Ham)) + sum of log(P(word | Ham)) for each word in the message

- Returns "spam" if score(Spam) > score(Ham), otherwise "ham".
Note: Use log probabilities (`math.log`) rather than multiplying raw probabilities, which underflow to zero for long messages, and skip any word that does not appear in the model's vocabulary.
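Putting the scoring rule into code might look like the sketch below. It assumes the model dictionary layout from the Task 3 example (`"prior"` and `"likelihood"` keys) and again uses a stand-in tokenizer; both are assumptions, not the notebook's required interface.

```python
import math

def tokenize(text):
    """Stand-in tokenizer; swap in your Task 2 implementation."""
    return text.lower().split()

def classify(message, model):
    """Return 'spam' or 'ham' by comparing per-class log-probability scores."""
    tokens = tokenize(message)
    scores = {}
    for c in ("spam", "ham"):
        score = math.log(model["prior"][c])
        for w in tokens:
            likelihood = model["likelihood"][c].get(w)
            if likelihood is not None:        # skip words never seen in training
                score += math.log(likelihood)
        scores[c] = score
    return "spam" if scores["spam"] > scores["ham"] else "ham"
```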
Task 5: Evaluate and Reflect
Objective: Evaluate accuracy on the test set and examine errors.
You will:
- Run `classify()` on every message in the test set.
- Compute overall accuracy: (correct predictions) / (total test messages).
- Print a selection of false positives (ham classified as spam) and false negatives (spam classified as ham).
- Answer the reflection questions in the final notebook cell.
Target performance: A correct implementation should achieve at least 95% accuracy on the test set.
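The evaluation loop can be sketched as below. The function name `evaluate` and the three-value return are illustrative choices, not a required interface.

```python
def evaluate(test_data, model, classify):
    """Return overall accuracy plus the two error lists."""
    correct = 0
    false_positives = []   # ham wrongly flagged as spam
    false_negatives = []   # spam that slipped through as ham
    for label, message in test_data:
        prediction = classify(message, model)
        if prediction == label:
            correct += 1
        elif prediction == "spam":
            false_positives.append(message)
        else:
            false_negatives.append(message)
    return correct / len(test_data), false_positives, false_negatives
```

Printing a few entries from each error list is the raw material for the reflection questions below.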
Reflection Questions
Answer these questions in the final cell of your notebook (text cell, not code):
- False positives vs. false negatives: Which type of error is more costly for a spam filter? Why might a service like Gmail accept more false negatives (letting spam through) rather than false positives (blocking real messages)?
- Independence assumption: Find one false negative in your results. Is there a reason a real human would recognize this message as spam that the naive Bayes model might miss due to the independence assumption?
- Vocabulary coverage: What happens when the classifier sees a word it has never seen in training? How does Laplace smoothing handle this, and is it an ideal solution?
- Improvement ideas: Name two specific changes you could make to the preprocessing or training steps that might improve accuracy. (Do not implement them — just reason about what effect they would have.)
Grading Criteria
| Task | Points | Criteria |
|---|---|---|
| Task 1: Data loading and exploration | 15 | Data loads correctly; train/test split with correct seed; statistics printed |
| Task 2: Tokenizer implementation | 15 | Lowercase; splits on whitespace/punctuation; removes short/numeric tokens; passes test cell |
| Task 3: Training (priors + likelihoods + smoothing) | 30 | Correct prior computation; correct word counts; Laplace smoothing correctly applied; vocabulary size correct |
| Task 4: Classification (log scores) | 25 | Log-probability scores computed correctly; correct argmax decision; passes at least 90% of provided test cases |
| Task 5: Evaluation and reflection | 15 | Accuracy ≥ 95% on test set; reflection questions answered thoughtfully (2-4 sentences each) |
| Total | 100 | |
Submission Instructions
- Download your completed notebook from Colab: File → Download → Download .ipynb.
- Rename the file: `Spam_Classifier_Lab_YourLastName.ipynb`.
- Submit via Brightspace: Assignments → Lab 7: Spam Classifier.
- Deadline: End of the unit week (see course schedule).
Debugging Tips
If your accuracy is below 80%:
- Check that Laplace smoothing adds 1 to each word count and adds the vocabulary size to the denominator.
- Verify you are summing log probabilities, not multiplying raw probabilities (raw probabilities underflow to 0 for long messages).
- Make sure unknown words are skipped, not given probability 0.
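The underflow problem in the second tip is easy to demonstrate. Assuming a 100-word message where every word has a likelihood around 1e-4:

```python
import math

p = 1e-4                        # a typical smoothed word likelihood

raw_score = p ** 100            # multiplying: underflows to exactly 0.0
log_score = 100 * math.log(p)   # summing logs: a finite, comparable score

print(raw_score)                # 0.0
print(log_score)                # roughly -921.03
```

Once both class scores collapse to 0.0, the comparison `score(Spam) > score(Ham)` is meaningless; the log scores remain distinct and comparable.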
If accuracy is above 95% but some spam slips through:

- Look for messages that use code words to avoid trigger words ("fr33", "m0ney").
- These expose the limit of word-level naive Bayes and motivate more advanced NLP preprocessing.
Going Further
If you want to extend your classifier beyond this lab:
- Scikit-learn implementation: `sklearn.naive_bayes.MultinomialNB` implements the same algorithm with optimized math.
- BiteSizeBayes notebooks: Allen Downey's MIT-licensed notebooks at allendowney.github.io/BiteSizeBayes/ explore probability and Bayes' theorem with Python from first principles.
- Feature engineering: Try adding features for message length, presence of phone numbers, or capitalization ratio.
Lab code approach adapted from BiteSizeBayes by Allen Downey, MIT License, and aima-python by the AIMA textbook contributors, MIT License.
UCI SMS Spam Collection dataset: Tiago A. Almeida and José María Gómez Hidalgo, licensed under CC BY 4.0.
This work is licensed under CC BY-SA 4.0.