Supplementary — K-Nearest Neighbors (k-NN)
You have now seen several ways a machine learning model can learn from data. Decision trees learn explicit rules by splitting features. k-Nearest Neighbors takes a completely different approach: it skips the learning phase entirely and classifies new data by asking, "Which training examples are most similar to this one?"
Learn how the k-NN algorithm classifies new data points using distance and majority voting.
The Core Idea
- **K-Nearest Neighbors (k-NN)**: A classification (and regression) algorithm that stores all training examples and classifies new data by finding the k most similar training examples, then taking a majority vote (classification) or computing the average (regression). k-NN performs no explicit training — all computation happens at prediction time.
The intuition behind k-NN: "Show me who your friends are, and I’ll tell you who you are."
If you want to predict whether a new restaurant will suit your taste, find people with similar preferences and ask what they think. If most of them like it, you probably will too. k-NN does exactly this — find the k most similar labeled examples and let them vote.
How k-NN Works
The k-NN Algorithm
1. **Store all training data.** No fitting, no model parameters — just keep every labeled example in memory.
2. **Choose k.** Decide how many neighbors will vote on each prediction (e.g., k = 3, k = 5, k = 7).
3. **Compute distances.** When a new example arrives, calculate the distance from the new point to every training example.
4. **Find the k nearest neighbors.** Sort all training examples by distance and select the k closest.
5. **Make a prediction.**
   - Classification: majority vote — the most common class label among the k neighbors wins.
   - Regression: average the output values of the k neighbors.
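The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation, and the function name `knn_classify` is my own:

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; `query` is a feature tuple.
    """
    # Step 3: distance from the query to every training example (Euclidean)
    dists = [(math.dist(x, query), label) for x, label in train]
    # Step 4: sort by distance and keep the k closest
    neighbors = sorted(dists)[:k]
    # Step 5: majority vote among the k neighbors' labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "red"), ((1, 2), "red"), ((5, 5), "blue"),
         ((6, 5), "blue"), ((6, 6), "blue")]
print(knn_classify(train, (5.5, 5.5), 3))  # prints "blue"
```

Note that step 1 ("store all training data") is just the `train` list itself: there is no fitted model, only the raw examples.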
Visual Classification Example
Training data: red-labeled and blue-labeled points in a 2D feature space. New point to classify: placed in an ambiguous region.
With k = 3 (three nearest neighbors):

- Closest 3 points: Blue, Blue, Red
- Vote: 2 Blue, 1 Red
- Prediction: Blue (majority wins)

With k = 5 (five nearest neighbors):

- Closest 5 points: Blue, Blue, Red, Blue, Red
- Vote: 3 Blue, 2 Red
- Prediction: Blue (still majority)
Both values of k agree here, but in ambiguous cases they can disagree — which is why choosing k matters.
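The two votes above can be reproduced directly with `collections.Counter`, which is a convenient way to implement majority voting:

```python
from collections import Counter

k3_neighbors = ["Blue", "Blue", "Red"]
k5_neighbors = ["Blue", "Blue", "Red", "Blue", "Red"]

for neighbors in (k3_neighbors, k5_neighbors):
    votes = Counter(neighbors)              # tally the labels
    winner, count = votes.most_common(1)[0]  # label with the most votes
    print(f"k={len(neighbors)}: {winner} wins with {count} votes")
```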
Distance Metrics
k-NN depends entirely on measuring "closeness." The choice of distance metric affects which neighbors are selected.
Common Distance Metrics
Euclidean Distance (most common — straight-line distance):
d = √[(x₁ − x₂)² + (y₁ − y₂)²]
Manhattan Distance (grid-like distance, like city blocks):
d = |x₁ − x₂| + |y₁ − y₂|
Minkowski Distance (generalization):
d = (|x₁ − x₂|ᵖ + |y₁ − y₂|ᵖ)^(1/p)
When p = 1 this is Manhattan; when p = 2 this is Euclidean.
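Since Manhattan and Euclidean are the p = 1 and p = 2 cases of Minkowski, all three can be written as one function. A quick sketch (the function name is my own):

```python
def minkowski(a, b, p):
    """Minkowski distance between points a and b (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, 1))  # Manhattan: |3| + |4| = 7
print(minkowski(a, b, 2))  # Euclidean: sqrt(9 + 16) = 5
```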
**Feature scaling matters.** If one feature ranges from 0 to 10,000 and another ranges from 0 to 1, the first feature will dominate all distance calculations. Always normalize or standardize features before applying k-NN.
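The warning above is easy to demonstrate: with raw features, the large-range feature dictates the distance; after min-max scaling, both features contribute comparably. A minimal sketch, where the helper name `min_max_scale` is my own:

```python
import math

def min_max_scale(points):
    """Rescale each feature of `points` to [0, 1] independently."""
    cols = list(zip(*points))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - a) / (b - a) for v, a, b in zip(p, lo, hi))
            for p in points]

# Feature 0 spans 0..10,000; feature 1 spans 0..1.
points = [(0.0, 0.0), (10_000.0, 1.0), (5_000.0, 0.9)]
print(math.dist(points[1], points[2]))   # raw: ~5000, driven almost entirely by feature 0

scaled = min_max_scale(points)
print(math.dist(scaled[1], scaled[2]))   # scaled: ~0.51, both features matter
```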
Choosing k: The Goldilocks Problem
The choice of k controls the tradeoff between sensitivity and stability.
k Too Small (e.g., k = 1):

- The model is extremely sensitive to noise and outliers.
- One mislabeled or unusual point can corrupt a prediction entirely.
- Decision boundaries are jagged and complex.
- The model overfits the training data.

k Just Right (e.g., k = 3, 5, or 7):

- Noise is averaged out over multiple neighbors.
- Decision boundaries are smoother.
- The model generalizes better to new data.

k Too Large (e.g., k = n, all training examples):

- The model always predicts the majority class.
- Local patterns are lost.
- The model underfits the data.
Rule of thumb: Try odd values (avoids ties in binary classification) like k = 3, 5, 7. Use cross-validation to select the best k for your specific dataset.
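Cross-validation over k can be sketched with a simple leave-one-out loop: hold out each point in turn, predict it from all the others, and count how often the prediction is right. This is illustrative only (real workflows usually use a library such as scikit-learn), and the helper names are my own:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority-vote prediction among the k nearest training points."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = sum(knn_predict(data[:i] + data[i + 1:], x, k) == y
               for i, (x, y) in enumerate(data))
    return hits / len(data)

data = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
        ((5, 5), "blue"), ((5, 6), "blue"), ((6, 5), "blue")]
for k in (1, 3, 5):
    print(f"k={k}: leave-one-out accuracy = {loo_accuracy(data, k):.2f}")
```

On this tiny two-cluster dataset, k = 1 and k = 3 classify every held-out point correctly, while k = 5 (all remaining points) gets every prediction wrong: once a point is held out, the opposite class is the majority among the rest, which is the "k too large" failure mode in miniature.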
Advantages of k-NN
k-NN’s strengths are precisely where other algorithms are weak:
- Simple and interpretable: Easy to explain — "these were the most similar cases."
- No training phase: Store the data and you are done.
- Instant adaptation: Add new labeled examples and they are immediately used for future predictions — no retraining required.
- Works for any number of classes: Majority voting handles multi-class problems naturally.
- Non-parametric: Makes no assumptions about the shape of the data distribution.
- Non-linear boundaries: Can model arbitrarily complex decision regions.
Disadvantages and Challenges
k-NN’s simplicity comes at a cost:
- Slow prediction: Must compute distance to every training point for each new example. With millions of training points this becomes expensive.
- Memory intensive: The entire training set must be stored.
- Curse of dimensionality: In high-dimensional spaces, all points tend to become equally distant from each other, making "nearest neighbors" meaningless.
- Sensitive to irrelevant features: All features are treated equally; irrelevant ones add noise to distance calculations.
- Imbalanced data: If one class dominates the training set, majority voting will favor it.
Mitigations: Feature scaling (normalize features), dimensionality reduction (PCA), efficient data structures (KD-trees, Ball trees), and feature selection.
k-NN vs. Decision Trees
| Aspect | Decision Trees | k-NN |
|---|---|---|
| Training speed | Slow (builds tree) | Instant (stores data) |
| Prediction speed | Fast (follow rules) | Slow (compute all distances) |
| Memory usage | Compact (tree structure) | Large (entire training set) |
| Interpretability | High (explicit rules) | Lower (similarity-based) |
| Adding new data | Must retrain | Just add to dataset |
| Handles non-linear data | Yes (deep trees) | Yes (naturally) |
When to prefer k-NN over decision trees:
Choose k-NN when your data changes frequently and you need the model to adapt instantly without retraining. It is well-suited for small to medium datasets with continuous features where local patterns matter.
Choose decision trees (or random forests) when prediction speed matters, when you need to explain individual decisions with rules, or when you have very large datasets.
Key Takeaways
- **k-Nearest Neighbors (k-NN)**: An instance-based learning algorithm that classifies new examples by majority vote among the k most similar training examples.
- **Distance Metric**: A function that measures how similar (close) two data points are; Euclidean distance is the most commonly used.
- **Overfitting (in k-NN)**: Occurs when k is too small; the model memorizes training data, including noise, and performs poorly on new data.
- **Underfitting (in k-NN)**: Occurs when k is too large; the model ignores local patterns and tends to predict the majority class.
- **Curse of Dimensionality**: The phenomenon where high-dimensional spaces make distance-based methods unreliable because all points become nearly equidistant.
k-NN is a natural entry point into machine learning because it requires no mathematical optimization and makes the core idea of supervised learning — learning from labeled examples — completely explicit.
As you move deeper into machine learning, you will encounter algorithms that do optimize parameters (logistic regression, neural networks). Those algorithms sacrifice k-NN’s interpretability in exchange for faster prediction and better performance on high-dimensional data.
Based on the UC Berkeley CS 188 Online Textbook by Nikhil Sharma, Josh Hug, Jacky Liang, and Henry Zhu, licensed under CC BY-SA 4.0.
This work is licensed under CC BY-SA 4.0.