Lab 7: Spam Classifier with Naive Bayes

Unit 7: Probability and Uncertainty in AI — Lab

In this lab you will build a naive Bayes spam classifier from scratch. Rather than using a pre-built machine learning library, you will compute the prior probabilities and word likelihoods directly from data — exactly as the algorithm works in a real email system. By the end, you will have a working classifier that achieves over 95% accuracy on real SMS messages, and a clear understanding of why it works.

Learning Objectives

  1. Load and explore a real-world text dataset (UCI SMS Spam Collection).

  2. Preprocess text data: tokenize, normalize, and build a vocabulary.

  3. Implement the training step: compute prior probabilities and word likelihoods with Laplace smoothing.

  4. Implement the classification step: compute log-probability scores and predict class labels.

  5. Evaluate the classifier: compute accuracy, identify false positives and false negatives.

  6. Reflect on the naive independence assumption and its practical implications.

Prerequisite Concepts

Before starting this lab, make sure you can answer:

  • What is Bayes' theorem and what do prior/posterior mean? (Section 7.4)

  • What is the naive independence assumption? (Section 7.5)

  • How does Laplace smoothing prevent zero probabilities? (Section 7.5)

  • How do we use log probabilities to avoid numerical underflow? (Section 7.5)

If any of these feel shaky, review Sections 7.4 and 7.5 before opening the notebook.

Dataset

This lab uses the UCI SMS Spam Collection dataset: 5,574 SMS messages labeled as either "spam" or "ham" (legitimate). The dataset is provided in the course assets folder.

Dataset: SMS Spam Collection

The SMSSpamCollection file is a tab-separated file. Each line has two fields:

<label><TAB><message text>

Example lines:

ham     Go until jurong point, crazy.. Available only in bugis n great world la e buffet...
spam    Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005.
ham     Ok lar... Joking wif u oni...

Download: The file is already available in the course datasets folder and is referenced in the notebook. No separate download is required.

License: The UCI SMS Spam Collection is shared under CC BY 4.0. Tiago A. Almeida and José María Gómez Hidalgo, "Contributions to the Study of SMS Spam Filtering: New Collection and Results."

Setup

This lab runs in Google Colab (recommended) or any local Jupyter environment with Python 3.

Getting Started

  1. Open Google Colab: colab.research.google.com

  2. Upload the starter notebook Spam_Classifier_Lab_Student.ipynb from the course assets.

  3. Run all cells in order using Runtime → Run all or Shift+Enter cell by cell.

  4. The first cell will upload and display the dataset. If you are working locally, update the DATA_PATH variable to point to your copy of SMSSpamCollection.

Notebook Download

Starter Notebook

Download Spam_Classifier_Lab_Student.ipynb from the course assets. The notebook contains:

  • Skeleton code with # TODO markers for each task

  • Inline comments explaining each step

  • Built-in test cells that check your implementations

  • Visualization cells for exploring results

Open in Google Colab or local Jupyter.

Lab Tasks

The notebook is organized into five tasks. Each task builds on the previous one.

Task 1: Load and Explore the Data

Objective: Load the SMS Spam Collection, split into training and test sets, and compute basic statistics.

You will:

  • Read the tab-separated data file into a list of (label, message) tuples.

  • Split 80/20 into training set and test set using a fixed random seed.

  • Count the number of spam and ham messages in the training set.

  • Print several example messages from each class.

Expected output: Training set has ~4,459 messages; test set has ~1,115 messages.

Task 2: Text Preprocessing

Objective: Convert raw message text into a list of word tokens.

You will implement a tokenize(text) function that:

  • Converts the text to lowercase.

  • Splits on whitespace and punctuation.

  • Removes tokens that are purely numeric or shorter than 2 characters.

  • Returns a list of word strings.

You do not need to remove stop words ("the", "a", "is") for this lab. Naive Bayes handles them naturally — they appear equally in spam and ham and have near-zero discriminative power.
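One possible tokenize implementation satisfying these rules (the exact regex is an illustrative choice; any whitespace-and-punctuation split that meets the criteria is fine):

```python
import re

def tokenize(text):
    """Lowercase, split on runs of non-alphanumeric characters,
    then drop tokens that are purely numeric or shorter than 2 characters."""
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if len(t) >= 2 and not t.isdigit()]

print(tokenize("Free entry in 2 a wkly comp!"))  # digits and 1-char tokens dropped
```

Note that splitting on non-alphanumeric runs keeps mixed tokens like "fr33" intact, which turns out to matter for the error analysis in Task 5.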

Task 3: Train the Classifier

Objective: Compute prior probabilities and word likelihood tables from the training set.

You will implement a train(training_data) function that:

  1. Counts spam and ham messages to compute priors:

    P(Spam) = count_spam / total_messages
    P(Ham)  = count_ham  / total_messages
  2. Counts word occurrences in each class (word frequency dictionaries).

  3. Builds the vocabulary: the set of all unique words across all training messages.

  4. Computes word likelihoods with Laplace smoothing:

    P(word | Spam) = (count(word in spam) + 1) / (total_words_in_spam + vocabulary_size)
    P(word | Ham)  = (count(word in ham)  + 1) / (total_words_in_ham  + vocabulary_size)

The function should return a dictionary containing the priors, word likelihood tables, and vocabulary size.
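The four training steps above might be sketched as follows. The model dictionary's key names are illustrative (the notebook skeleton may use different ones), and a minimal tokenizer is inlined to keep the sketch self-contained:

```python
import re
from collections import Counter

def tokenize(text):
    # Minimal tokenizer matching the Task 2 rules.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower())
            if len(t) >= 2 and not t.isdigit()]

def train(training_data):
    """training_data: list of (label, message). Returns a model dictionary."""
    spam_words, ham_words = Counter(), Counter()
    n_spam = n_ham = 0
    for label, message in training_data:
        words = tokenize(message)
        if label == "spam":
            n_spam += 1
            spam_words.update(words)
        else:
            n_ham += 1
            ham_words.update(words)

    vocab = set(spam_words) | set(ham_words)
    V = len(vocab)
    total_spam = sum(spam_words.values())
    total_ham = sum(ham_words.values())
    total = n_spam + n_ham

    # Laplace smoothing: add 1 to every count, V to every denominator.
    p_word_spam = {w: (spam_words[w] + 1) / (total_spam + V) for w in vocab}
    p_word_ham = {w: (ham_words[w] + 1) / (total_ham + V) for w in vocab}

    return {
        "p_spam": n_spam / total,
        "p_ham": n_ham / total,
        "p_word_spam": p_word_spam,
        "p_word_ham": p_word_ham,
        "vocab_size": V,
    }
```

A Counter never raises KeyError for unseen words (it returns 0), which is exactly what the smoothing formula needs.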

Task 4: Classify Messages

Objective: Use the trained model to classify new messages.

You will implement a classify(message, model) function that:

  1. Tokenizes the message.

  2. Computes the log-probability score for each class:

    score(Spam) = log(P(Spam)) + sum of log(P(word | Spam)) for each word in message
    score(Ham)  = log(P(Ham))  + sum of log(P(word | Ham))  for each word in message
  3. Returns "spam" if score(Spam) > score(Ham), otherwise "ham".

Use math.log() or numpy.log() for log probabilities. Only include words from your vocabulary; skip unknown words entirely. Under Laplace smoothing, a word never seen in training would receive nearly the same small probability in both classes, so skipping it barely changes the decision.
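A sketch of the scoring logic, assuming a model dictionary with the priors and likelihood tables from Task 3 (the key names here are illustrative, and the tokenizer is inlined for self-containment):

```python
import math
import re

def tokenize(text):
    # Minimal tokenizer matching the Task 2 rules.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower())
            if len(t) >= 2 and not t.isdigit()]

def classify(message, model):
    """Return 'spam' or 'ham' by comparing log-probability scores."""
    score_spam = math.log(model["p_spam"])
    score_ham = math.log(model["p_ham"])
    for word in tokenize(message):
        if word in model["p_word_spam"]:  # skip unknown words
            score_spam += math.log(model["p_word_spam"][word])
            score_ham += math.log(model["p_word_ham"][word])
    return "spam" if score_spam > score_ham else "ham"
```

Note the tie-breaking choice: a message of entirely unknown words falls back to the priors, and with equal priors the strict `>` sends it to "ham", the safer default for a spam filter.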

Task 5: Evaluate and Reflect

Objective: Evaluate accuracy on the test set and examine errors.

You will:

  1. Run classify() on every message in the test set.

  2. Compute overall accuracy: (correct predictions) / (total test messages).

  3. Print a selection of false positives (ham classified as spam) and false negatives (spam classified as ham).

  4. Answer the reflection questions in the final notebook cell.

Target performance: A correct implementation should achieve at least 95% accuracy on the test set.
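The evaluation loop might look like this sketch (the classifier is passed in as an argument so the helper stands alone; names are illustrative):

```python
def evaluate(test_data, model, classify):
    """Return (accuracy, false_positives, false_negatives)."""
    correct = 0
    false_pos, false_neg = [], []  # ham->spam, spam->ham
    for label, message in test_data:
        pred = classify(message, model)
        if pred == label:
            correct += 1
        elif label == "ham":       # ham classified as spam
            false_pos.append(message)
        else:                      # spam classified as ham
            false_neg.append(message)
    return correct / len(test_data), false_pos, false_neg
```

Keeping the misclassified messages themselves, not just their counts, is what makes the reflection questions answerable: you need to read the errors to reason about them.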

Reflection Questions

Answer these questions in the final cell of your notebook (text cell, not code):

  1. False positives vs. false negatives: Which type of error is more costly for a spam filter? Why might a service like Gmail accept more false negatives (letting spam through) rather than false positives (blocking real messages)?

  2. Independence assumption: Find one false negative in your results. Is there a reason a real human would recognize this message as spam that the naive Bayes model might miss due to the independence assumption?

  3. Vocabulary coverage: What happens when the classifier sees a word it has never seen in training? How does Laplace smoothing handle this, and is it an ideal solution?

  4. Improvement ideas: Name two specific changes you could make to the preprocessing or training steps that might improve accuracy. (Do not implement them — just reason about what effect they would have.)

Grading Criteria

| Task | Points | Criteria |
| --- | --- | --- |
| Task 1: Data loading and exploration | 15 | Data loads correctly; train/test split with correct seed; statistics printed |
| Task 2: Tokenizer implementation | 15 | Lowercase; splits on whitespace/punctuation; removes short/numeric tokens; passes test cell |
| Task 3: Training (priors + likelihoods + smoothing) | 30 | Correct prior computation; correct word counts; Laplace smoothing correctly applied; vocabulary size correct |
| Task 4: Classification (log scores) | 25 | Log probability scores computed correctly; correct argmax decision; passes at least 90% of provided test cases |
| Task 5: Evaluation and reflection | 15 | Accuracy ≥ 95% on test set; reflection questions answered thoughtfully (2-4 sentences each) |
| Total | 100 | |

Submission Instructions

  1. Download your completed notebook from Colab: File → Download → Download .ipynb.

  2. Rename the file: Spam_Classifier_Lab_YourLastName.ipynb.

  3. Submit via Brightspace: Assignments → Lab 7: Spam Classifier.

  4. Deadline: End of the unit week (see course schedule).

Debugging Tips

If your accuracy is below 80%:

  • Check that Laplace smoothing adds 1 to counts and adds vocabulary size to the denominator.

  • Verify you are computing log probabilities, not raw probabilities (raw probabilities underflow to 0 for long messages).

  • Make sure unknown words are skipped, not given probability 0.

If accuracy is above 95% but some spam slips through:

  • Look for messages that use code words to avoid trigger words ("fr33", "m0ney").

  • These expose the limit of word-level naive Bayes and motivate more advanced NLP preprocessing.

Going Further

If you want to extend your classifier beyond this lab:

  • Sklearn implementation: sklearn.naive_bayes.MultinomialNB implements the same algorithm with optimized math.

  • BiteSizeBayes notebooks: Allen Downey’s MIT-licensed notebooks at allendowney.github.io/BiteSizeBayes/ explore probability and Bayes' theorem with Python from first principles.

  • Feature engineering: Try adding features for message length, presence of phone numbers, or capitalization ratio.
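For comparison, the scikit-learn version of the same pipeline is only a few lines. The alpha=1.0 setting in MultinomialNB corresponds to the add-one Laplace smoothing you implemented in Task 3; the toy training data below is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["win money now", "see you at lunch",
               "free entry to win", "lunch tomorrow ok"]
train_labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()            # bag-of-words counts
X = vectorizer.fit_transform(train_texts)
clf = MultinomialNB(alpha=1.0)            # alpha=1.0 is Laplace smoothing
clf.fit(X, train_labels)

preds = clf.predict(vectorizer.transform(["free money", "see you tomorrow"]))
print(list(preds))
```

A useful exercise: run MultinomialNB on the same train/test split you built by hand and check that its accuracy matches your from-scratch classifier to within a fraction of a percent (small differences come from tokenization details).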


Lab code approach adapted from BiteSizeBayes by Allen Downey, MIT License, and aima-python by the AIMA textbook contributors, MIT License.

UCI SMS Spam Collection dataset: Tiago A. Almeida and José María Gómez Hidalgo, licensed under CC BY 4.0.

This work is licensed under CC BY-SA 4.0.