Unit 7: Probability and Uncertainty in AI — Wrap-Up and Self-Assessment
You have completed one of the most conceptually important units in this course. The shift from deterministic reasoning to probabilistic reasoning is the shift that makes AI applicable to the real world. Let’s consolidate what you have learned.
Real-world AI operates in environments that are partially observable, stochastic, and noisy. Probability provides a principled, consistent mathematical framework for representing uncertainty and making rational decisions despite it. Bayes' theorem is the engine that updates beliefs when evidence arrives. Bayesian networks compactly represent probabilistic relationships among many variables. Naive Bayes applies these ideas to practical classification at scale.
Key Takeaways
Why Uncertainty Matters (Section 7.1)
- Logic works for closed, fully observable worlds; probability works for the open, uncertain real world.
- Three sources of uncertainty: laziness (too many rules to list), theoretical ignorance (science doesn’t fully know), and practical ignorance (the agent lacks access to all the data).
- Noisy sensors and stochastic actions require agents to maintain probability distributions over states rather than asserting definite facts.
Probability Fundamentals (Section 7.2)
- A sample space lists all possible outcomes; an event is a subset.
- The Kolmogorov axioms ensure probabilities are internally consistent.
- Conditional probability P(A | B) = P(A ∧ B) / P(B) is the foundation of probabilistic inference.
- Independence (P(A | B) = P(A)) and conditional independence are the key structural properties that make large models tractable.
- The law of total probability allows marginalization — computing a variable’s probability by summing out other variables.
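The two operations above can be sketched in a few lines. The joint distribution here is made up for illustration (the variable names Rain and Sprinkler are assumptions, not an example from the unit):

```python
# Hypothetical joint distribution over two binary variables.
# Keys are (rain, sprinkler) outcomes; values are probabilities summing to 1.
joint = {
    (True, True): 0.05,
    (True, False): 0.25,
    (False, True): 0.20,
    (False, False): 0.50,
}

# Marginalization (law of total probability): sum out Sprinkler.
p_rain = sum(p for (rain, _), p in joint.items() if rain)

# Conditional probability: P(Rain | Sprinkler) = P(Rain ∧ Sprinkler) / P(Sprinkler).
p_sprinkler = sum(p for (_, spr), p in joint.items() if spr)
p_rain_given_sprinkler = joint[(True, True)] / p_sprinkler

print(p_rain)                  # 0.05 + 0.25 = 0.30
print(p_rain_given_sprinkler)  # 0.05 / 0.25 = 0.2
```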
From Logic to Probability (Section 7.3)
- Logic is the special case of probability in which all degrees of belief are 0 or 1.
- Probability adds: prior knowledge, incremental belief updates, hypothesis ranking, and graceful handling of contradictory evidence.
- The same scenario analyzed with logic produces brittle, all-or-nothing conclusions; analyzed with probability, it produces calibrated, actionable estimates.
Bayesian Reasoning (Section 7.4)
- Bayes' theorem: P(H | E) = P(E | H) × P(H) / P(E).
- The components: the prior (belief before evidence), the likelihood (how well H explains E), the posterior (belief after evidence), and the marginal likelihood (a normalization constant).
- The mammogram paradox shows why low base rates dominate: even an accurate test produces many false positives when the disease is rare.
- Bayesian updating allows sequential reasoning — each new piece of evidence refines the posterior.
Probabilistic Models (Section 7.5)
- A Bayesian network is a DAG whose nodes are random variables and whose edges encode direct causal influence.
- Each node stores a conditional probability table (CPT); the full joint distribution factorizes into a product of CPTs.
- Naive Bayes assumes all features are conditionally independent given the class label.
- Despite this unrealistic assumption, naive Bayes achieves excellent accuracy on text classification because ranking, not exact probability values, is what matters for classification.
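The ideas above can be combined into a tiny naive Bayes text classifier. This is a minimal sketch: the training sentences, labels, and helper names are invented for illustration, and it also previews Laplace smoothing and log probabilities from the summary table below.

```python
import math
from collections import Counter

# Hypothetical labeled training data.
train = [
    ("spam", "win money now"),
    ("spam", "win prize money"),
    ("ham",  "meeting at noon"),
    ("ham",  "lunch meeting today"),
]

# Count class frequencies and per-class word frequencies.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for label, text in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(label, text):
    # log P(class) + sum of log P(word | class), with add-one (Laplace) smoothing.
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for word in text.split():
        count = word_counts[label][word] + 1  # Laplace smoothing
        score += math.log(count / (total + len(vocab)))
    return score

def classify(text):
    # Ranking is all that matters: pick the class with the highest log posterior.
    return max(word_counts, key=lambda lbl: log_posterior(lbl, text))

print(classify("win money"))      # spam
print(classify("meeting today"))  # ham
```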
Concept Map: Uncertainty → Machine Learning
```
         Uncertainty in the world
                     |
                     v
            Probability theory
                     |
          +----------+----------+
          |                     |
          v                     v
     Conditional          Independence
     probability            structure
          |                     |
          v                     v
   Bayes' theorem       Bayesian networks
          |                     |
          v                     v
   Belief updates     Compact joint models
          |                     |
          +----------+----------+
                     |
                     v
          Naive Bayes classifier
                     |
                     v
         Machine Learning (Unit 8)
```
Summary Table: Core Concepts
| Concept | Definition | Where Used |
|---|---|---|
| Prior P(H) | Probability of the hypothesis before evidence | Medical diagnosis, spam filter |
| Likelihood P(E\|H) | Probability of the evidence if the hypothesis is true | Bayes' theorem computation |
| Posterior P(H\|E) | Updated probability after observing evidence | Classification output |
| Conditional independence | P(A\|B,C) = P(A\|C) — B adds no info once C is known | Bayesian networks, naive Bayes |
| CPT | Conditional probability table stored at each BN node | Bayesian network structure |
| Laplace smoothing | Adding 1 to all word counts to prevent zero probabilities | Naive Bayes training |
| Log probability | log P(x) — prevents numerical underflow for long sequences | Naive Bayes classification |
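The log-probability entry can be demonstrated in a few lines. The per-word probability of 1e-5 and the 100-word message length are arbitrary illustrative values:

```python
import math

# Multiplying many small per-word probabilities underflows to exactly 0.0,
# while summing their logarithms stays in a numerically safe range.
p_word = 1e-5
n_words = 100

product = 1.0
for _ in range(n_words):
    product *= p_word
print(product)  # 0.0 — underflowed below the smallest float

log_sum = sum(math.log(p_word) for _ in range(n_words))
print(log_sum)  # ≈ -1151.3 — still perfectly usable for ranking classes
```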
Complete the final self-assessment covering the Unit 7 concepts before moving on.
Glossary: Unit 7 Key Terms
- **Uncertainty:** The condition in which an agent lacks complete information about the state of the world. Arises from incomplete observation, sensor noise, or stochastic environments.
- **Sample Space (Ω):** The set of all possible outcomes of a random experiment.
- **Event:** Any subset of the sample space; a collection of outcomes we care about.
- **Conditional Probability P(A|B):** The probability of event A given that event B has occurred. Formula: P(A|B) = P(A ∧ B) / P(B).
- **Independence:** Events A and B are independent if P(A|B) = P(A) — knowing B gives no information about A.
- **Conditional Independence:** A is conditionally independent of B given C if P(A|B,C) = P(A|C) — once C is known, B adds nothing.
- **Prior Probability P(H):** Probability of a hypothesis before observing any evidence.
- **Likelihood P(E|H):** Probability of observing evidence E if hypothesis H is true.
- **Posterior Probability P(H|E):** Probability of hypothesis H after observing evidence E. Computed via Bayes' theorem.
- **Bayes' Theorem:** P(H|E) = P(E|H) × P(H) / P(E). The fundamental equation for updating beliefs with evidence.
- **Bayesian Network:** A directed acyclic graph where nodes are random variables and edges encode conditional dependencies; each node has a CPT.
- **Conditional Probability Table (CPT):** A table stored at each node of a Bayesian network giving P(node value | parent values) for all combinations.
- **Naive Bayes Classifier:** A classifier that applies Bayes' theorem with the naive assumption that all features are conditionally independent given the class label.
- **Laplace Smoothing:** Adding a small count (usually 1) to all word counts during naive Bayes training to prevent zero-probability assignments.
- **Degree of Belief:** A numerical value in [0, 1] representing an agent’s confidence that a proposition is true.
Optional Further Reading: Decision Theory and Expected Utility
Probability is not just for classification — it is also the foundation of decision theory. When an agent must choose among actions with uncertain outcomes, it can maximize expected utility: the probability-weighted average payoff.
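As a small sketch of expected-utility maximization, consider a hypothetical umbrella decision; the probability and utility values are invented for illustration:

```python
# Hypothetical decision problem: carry an umbrella or not, given P(rain) = 0.3.
p_rain = 0.3
utility = {
    ("umbrella", "rain"): 80,    ("umbrella", "dry"): 70,
    ("no_umbrella", "rain"): 0,  ("no_umbrella", "dry"): 100,
}

def expected_utility(action):
    # Probability-weighted average payoff over the possible outcomes.
    return p_rain * utility[(action, "rain")] + (1 - p_rain) * utility[(action, "dry")]

best = max(["umbrella", "no_umbrella"], key=expected_utility)
print(expected_utility("umbrella"))     # 0.3*80 + 0.7*70 = 73
print(expected_utility("no_umbrella"))  # 0.3*0 + 0.7*100 = 70
print(best)                             # umbrella
```

Even though staying dry without an umbrella has the highest payoff, the rational agent carries the umbrella because it maximizes the probability-weighted average.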
This topic is explored in the supplementary reading Decision Theory and Expected Utility.
Also recommended: the lecture on expected utility from the video youtube.com/embed/UnX8RPB5vFM (Decision Theory overview).
Preview: Unit 8 — Machine Learning Foundations
You have spent Unit 7 learning how to specify probabilistic models: you built naive Bayes by hand, explicitly computing priors and likelihoods from data.
Unit 8 asks: what if the model is too complex to specify by hand? What if we have thousands of parameters that need to be tuned, and we want the computer to figure them out from data?
That is machine learning. Every machine learning algorithm is, at its core, a system for learning parameters of a probabilistic (or function-fitting) model from data. The probability foundation you built this week is the intellectual foundation for everything in Unit 8.
This work is licensed under CC BY-SA 4.0.