Probability Fundamentals

Unit 7: Probability and Uncertainty in AI — Section 7.2

You have an intuitive sense of probability from everyday life: a coin flip is 50/50, there is a 30% chance of rain, a certain student "probably" passed the exam. This section puts that intuition on a firm mathematical foundation. We will introduce the key concepts — sample spaces, events, conditional probability, and independence — that form the vocabulary every AI probability algorithm speaks.

To build intuition for conditional probability, watch "Conditional Probability Explained" (Khan Academy), a concrete visual explanation.

Sample Spaces and Events

Every probability calculation begins with a clear description of the possible outcomes.

Sample Space

The set of all possible outcomes of a random experiment, usually written as Ω (omega). For a coin flip: Ω = {heads, tails}. For a six-sided die: Ω = {1, 2, 3, 4, 5, 6}. For a weather forecast: Ω = {sunny, cloudy, rainy, snowy}.

Event

Any subset of the sample space — a collection of outcomes we are interested in. For a die: the event "even number" = {2, 4, 6}. For a weather forecast: the event "precipitation" = {rainy, snowy}.

The probability of an event A, written P(A), is a number between 0 and 1 that measures how likely A is to occur. Three axioms (the Kolmogorov axioms) define what counts as a valid probability:

  1. Non-negativity: P(A) ≥ 0 for every event A.

  2. Normalization: P(Ω) = 1 — something must happen.

  3. Additivity: If A and B are mutually exclusive (they cannot both occur), then P(A ∪ B) = P(A) + P(B).

Basic Probability (equally likely outcomes)

P(A) = (number of outcomes in A) / (total number of outcomes in Ω)

Example: A bag contains 3 red marbles, 2 blue marbles, and 5 green marbles. P(red) = 3 / (3 + 2 + 5) = 3/10 = 0.30
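The counting formula above can be written directly in code. A minimal sketch of the marble example, using a plain dictionary of counts:

```python
# Equally likely outcomes: P(A) = favorable outcomes / total outcomes.
counts = {"red": 3, "blue": 2, "green": 5}
total = sum(counts.values())  # 10 marbles in the bag

def prob(color):
    """Probability of drawing the given color."""
    return counts[color] / total

print(prob("red"))  # 3/10 = 0.3
```

Note that the individual probabilities sum to 1, as the normalization axiom requires.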

Random Variables

In AI, we rarely work with raw outcomes like "heads" or "rainy." Instead, we define random variables that give names to the uncertain aspects of the world.

Random Variable

A variable whose value is determined by the outcome of a random process. A discrete random variable takes on a countable number of values (e.g., Disease ∈ {flu, cold, COVID-19}). A continuous random variable can take on any value in a range (e.g., Temperature ∈ [36.0, 42.0]).

A probability distribution over a discrete random variable lists the probability of each possible value. For example:

P(Weather = sunny)  = 0.60
P(Weather = cloudy) = 0.30
P(Weather = rainy)  = 0.10
                      ----
                      1.00  (must sum to 1)

A distribution over multiple variables simultaneously is called a joint distribution; when it covers every variable in the model, it is the full joint distribution.
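In code, a discrete distribution is often just a mapping from value to probability. A small sketch of the weather table above, with a check of the Kolmogorov axioms:

```python
# The weather distribution from the table above.
weather = {"sunny": 0.60, "cloudy": 0.30, "rainy": 0.10}

# Validity checks: non-negative entries that sum to 1.
assert all(p >= 0 for p in weather.values())
assert abs(sum(weather.values()) - 1.0) < 1e-9
```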

Conditional Probability: Updating Beliefs with Evidence

The most powerful concept in probability for AI applications is conditional probability. When we observe evidence, we update our beliefs. Conditional probability formalizes this update.

Conditional Probability

P(A | B) = P(A ∧ B) / P(B)

Read as: "The probability of A given B." Requires P(B) > 0.

Intuition: knowing B is true restricts our attention to only those outcomes where B occurs. Within that restricted space, we ask what fraction also satisfies A.

Medical Diagnosis: Fever and Flu

Suppose we have data on 100 patients:

            Has Flu   No Flu   Total
Has Fever      18       12       30
No Fever        2       68       70
Total          20       80      100

P(Flu | Fever) = ?

Step 1: P(Flu ∧ Fever) = 18/100 = 0.18

Step 2: P(Fever) = 30/100 = 0.30

Step 3: P(Flu | Fever) = 0.18 / 0.30 = 0.60

Interpretation: A patient with fever has a 60% chance of flu — much higher than the baseline flu rate of 20% (20/100 = 0.20). The evidence (fever) updated our belief significantly.
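The three steps above can be sketched from the raw counts. The joint counts here are exactly those in the table; the conditional probability falls out of the definition:

```python
# The 2x2 fever/flu table as joint counts over 100 patients.
counts = {
    ("fever", "flu"): 18, ("fever", "no_flu"): 12,
    ("no_fever", "flu"): 2, ("no_fever", "no_flu"): 68,
}
n = sum(counts.values())  # 100 patients

# Step 1: joint probability P(Flu ∧ Fever)
p_flu_and_fever = counts[("fever", "flu")] / n              # 0.18
# Step 2: marginal probability P(Fever)
p_fever = (counts[("fever", "flu")]
           + counts[("fever", "no_flu")]) / n               # 0.30
# Step 3: conditional probability P(Flu | Fever)
p_flu_given_fever = p_flu_and_fever / p_fever               # 0.60
```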

The Product Rule

Rearranging the definition of conditional probability gives the product rule, which is used constantly when building probabilistic models:

Product Rule

P(A ∧ B) = P(A | B) × P(B)
         = P(B | A) × P(A)

Both expressions are equivalent; use whichever makes the numbers easier to find.
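Both factorizations can be checked numerically with the fever/flu numbers from the table (P(Fever | Flu) = 18/20 = 0.90):

```python
# Numbers from the fever/flu table.
p_flu, p_fever = 0.20, 0.30
p_flu_given_fever, p_fever_given_flu = 0.60, 0.90

# Product rule, both ways:
via_fever = p_flu_given_fever * p_fever   # P(Flu | Fever) * P(Fever)
via_flu = p_fever_given_flu * p_flu       # P(Fever | Flu) * P(Flu)
# Both equal P(Flu ∧ Fever) = 0.18.
```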

The Chain Rule

For more than two variables, the product rule extends to the chain rule:

Chain Rule

P(A ∧ B ∧ C) = P(A | B ∧ C) × P(B | C) × P(C)

In general: any joint probability can be decomposed into a product of conditional probabilities.
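The decomposition can be verified on a toy full joint distribution. The weights below are arbitrary (any positive numbers normalized to 1 would do); the identity holds for every assignment:

```python
from itertools import product

# A toy full joint distribution over three binary variables (a, b, c),
# built from arbitrary positive weights normalized to sum to 1.
weights = [3, 1, 2, 2, 1, 4, 2, 1]
total = sum(weights)
joint = {outcome: w / total
         for outcome, w in zip(product((0, 1), repeat=3), weights)}

def p(**fixed):
    """Marginal probability of a partial assignment, summing out the rest."""
    return sum(pr for (a, b, c), pr in joint.items()
               if all(dict(a=a, b=b, c=c)[k] == v for k, v in fixed.items()))

# Chain rule: P(A ∧ B ∧ C) = P(A | B ∧ C) * P(B | C) * P(C)
lhs = p(a=1, b=0, c=1)
rhs = (p(a=1, b=0, c=1) / p(b=0, c=1)) * (p(b=0, c=1) / p(c=1)) * p(c=1)
```

Notice that the conditional terms telescope, which is exactly why the chain rule holds for any ordering of the variables.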

Independence: When Evidence Doesn’t Change Beliefs

Two events are independent if knowing one tells you nothing about the other.

Independence

Events A and B are independent if and only if:

P(A | B) = P(A)

An equivalent condition: P(A ∧ B) = P(A) × P(B)

Knowing B occurred does not change the probability of A.

Independent vs. Dependent Events

Independent:

  • Two separate coin flips: P(Heads on flip 2 | Heads on flip 1) = 0.5 = P(Heads)

  • Rolling two dice: knowing one die result tells you nothing about the other

Dependent (not independent):

  • Fever and flu: P(Flu | Fever) = 0.60 ≠ 0.20 = P(Flu)

  • Carrying an umbrella and rain: people carry umbrellas because of rain; these are highly dependent

Independence is extremely valuable for AI systems because it allows us to simplify computations dramatically. If we know that two variables A and B are independent, we can store P(A) and P(B) separately and compute P(A ∧ B) = P(A) × P(B) without needing to store the full joint distribution.
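The two-dice example can be checked exactly by enumerating the 36-outcome sample space (using exact fractions to avoid rounding):

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair dice: 36 equally likely outcomes.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under equally likely outcomes, as an exact fraction."""
    return Fraction(sum(1 for o in omega if event(o)), len(omega))

a = lambda o: o[0] == 6        # first die shows 6
b = lambda o: o[1] % 2 == 0    # second die is even

# Independence: P(A ∧ B) = P(A) * P(B), i.e. 1/12 = 1/6 * 1/2.
joint_ab = prob(lambda o: a(o) and b(o))
```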

Conditional Independence

A subtler and even more useful concept is conditional independence.

Conditional Independence

Variables A and B are conditionally independent given C if:

P(A | B ∧ C) = P(A | C)

Once C is known, learning B gives no additional information about A.

Example: Fever (A) and runny nose (B) are symptoms that are conditionally independent given the disease (C). If you already know a patient has influenza, learning that they also have a runny nose does not change your belief that they have a fever.

Conditional independence is the key insight behind Bayesian networks and naive Bayes classifiers, which you will study in Sections 7.4 and 7.5.
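The flu/fever/runny-nose example can be made concrete. The numbers below are made up for illustration; the joint is built from the conditionally independent factorization, and we then verify that conditioning on the symptom B does not move P(A | C):

```python
# Joint built from the factorization
#   P(flu, fever, runny) = P(flu) * P(fever | flu) * P(runny | flu).
# All probabilities here are illustrative, not real medical data.
p_flu = {True: 0.2, False: 0.8}
p_fever_given_flu = {True: 0.9, False: 0.15}
p_runny_given_flu = {True: 0.7, False: 0.2}

joint = {}
for flu in (True, False):
    for fever in (True, False):
        for runny in (True, False):
            pf = p_fever_given_flu[flu] if fever else 1 - p_fever_given_flu[flu]
            pr = p_runny_given_flu[flu] if runny else 1 - p_runny_given_flu[flu]
            joint[(flu, fever, runny)] = p_flu[flu] * pf * pr

# Conditional independence: P(fever | runny ∧ flu) = P(fever | flu) = 0.9.
p_runny_and_flu = joint[(True, True, True)] + joint[(True, False, True)]
p_fever_given_runny_flu = joint[(True, True, True)] / p_runny_and_flu
```

This factored representation is precisely the structure a naive Bayes classifier exploits.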

Marginal Probability and the Law of Total Probability

Sometimes we need to compute the probability of a variable by "summing out" other variables we do not care about.

Law of Total Probability

For a variable B with mutually exclusive, exhaustive values b₁, b₂, …, bₙ:

P(A) = P(A | B=b₁) × P(B=b₁)
     + P(A | B=b₂) × P(B=b₂)
     + ...
     + P(A | B=bₙ) × P(B=bₙ)

This is sometimes called marginalization — computing a marginal probability by summing over all values of another variable.

Medical Test: Computing P(Positive)

Suppose:

  • P(Disease) = 0.01

  • P(Positive | Disease) = 0.95

  • P(Positive | No Disease) = 0.05

By the law of total probability:

P(Positive) = P(Positive | Disease) × P(Disease)
            + P(Positive | No Disease) × P(No Disease)
            = 0.95 × 0.01 + 0.05 × 0.99
            = 0.0095 + 0.0495
            = 0.059

About 5.9% of people test positive in this population. We will use this result when applying Bayes' theorem in Section 7.4.
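The marginalization above is a two-term sum, which is short enough to write out directly:

```python
# Given quantities from the medical test example.
p_disease = 0.01
p_pos_given_disease = 0.95       # true positive rate
p_pos_given_no_disease = 0.05    # false positive rate

# Law of total probability: sum over both values of Disease.
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_no_disease * (1 - p_disease))
print(p_positive)  # ≈ 0.059
```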

Putting It Together: A Probability Calculation Checklist

How to Work Through a Probability Problem

  1. Identify the sample space and define the random variables.

  2. Write down all given probabilities (priors, likelihoods, false positive rates).

  3. Determine what you need to find: joint, marginal, or conditional?

  4. Apply the appropriate formula:

    • Joint: product rule

    • Marginal: law of total probability (sum over other variables)

    • Conditional: definition P(A|B) = P(A ∧ B) / P(B)

  5. Check: does your answer make intuitive sense?

Return to the medical test example above. If a patient tests positive, what is P(Disease | Positive)?

You now have P(Positive) = 0.059, P(Positive | Disease) = 0.95, and P(Disease) = 0.01. Can you calculate the answer using the conditional probability definition? (Hint: you will need the product rule first.)

We will work through exactly this calculation — and explain why the answer surprises most people — in Section 7.4 when we introduce Bayes' theorem.



Probability content adapted from OpenStax Introductory Statistics, Chapter 3, licensed under CC BY 4.0.

Based on the UC Berkeley CS 188 Online Textbook by Nikhil Sharma, Josh Hug, Jacky Liang, and Henry Zhu, licensed under CC BY-SA 4.0.

This work is licensed under CC BY-SA 4.0.