| First roll | Second roll |
|---|---|
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 2 | 4 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
| 3 | 4 |
| 4 | 1 |
| 4 | 2 |
| 4 | 3 |
| 4 | 4 |
2 The Language of Probability
A phenomenon is random if there are multiple possibilities and there is uncertainty about which possibility is realized. This chapter introduces the fundamental terminology and objects of random phenomena, including
- Possible outcomes (possibilities) of the random phenomenon
- Related events that could occur
- Random variables which measure numeric quantities based on outcomes
- Probability measures which assign degrees of likelihood or plausibility to events in a logically coherent way and reflect assumptions about the random phenomenon
- Distributions of random variables which describe their pattern of variability, and can be summarized by percentiles, expected values, standard deviations (and variances), and correlations (and covariances).
- Conditioning, which involves revising probabilities and distributions to reflect additional information
Probability models put all of the above together. A probability model of a random phenomenon consists of a sample space of possible outcomes, associated events and random variables, and a probability measure which specifies probabilities of events and determines distributions of random variables according to the assumptions of the model and available information.
Throughout this chapter we will illustrate ideas using the following examples.
Total or best? Roll a four-sided die twice and consider the sum and the larger of the two rolls (or the common value in the case of a tie). Not very exciting? Maybe, but it is a familiar, simple, and concrete example. Also, a “toy” example can provide insight into more interesting problems, such as the following. In many sports, a competitor’s final ranking is based on the results of multiple attempts. Competitors in Olympic bobsled, for example, make four separate timed runs on the same course and their ranking is based on their total time. Competitors in Olympic shot put make six throws, but their ranking is based on their best throw. In sports with multiple attempts, how do the rankings compare if they are based on the total (or average) over all attempts (as in bobsled) or on the best attempt (as in shot put)?
Matching problem. A group of people all put their names in a hat for a Secret Santa gift exchange. The names are shuffled and everyone draws a name from the hat. We might be interested in questions like: What is the probability that someone selects their own name? How many people are expected to draw their own name? How do the answers to these questions depend on the number of people in the group? This is a version of a well-known probability problem called the “matching problem”. The general setup involves \(n\) distinct “objects” labeled \(1, \ldots, n\) which are placed in \(n\) distinct “boxes” labeled \(1, \ldots, n\), with exactly one object placed in each box; for how many objects does the label on the object match the label on the box it is placed in?
Meeting problem. Several people plan to meet for lunch, but their arrival times are uncertain. We might be interested in whether they arrive within 15 minutes of one another, who arrives first and at what time, or how long the first person to arrive needs to wait for the others.
Collector problem. Each box of a brand of cereal contains a single prize from a collection. We might be interested in how many boxes we need to buy to complete the collection, or how many boxes we need to buy to complete five collections (say one collection for each of five kids), or which prize we get in the most boxes.
Arrivals over time. Customers enter a deli and take a number to mark their place in line. When the deli opens the counter starts at 0; the first customer to arrive takes number 1, the second 2, etc. We record the counter over time, continuously, as it changes as customers arrive. We might be interested in the number of customers that arrive in some window of time, the time between customer arrivals, or the amount of time it takes for some number of customers to arrive. (And this is just the arrivals; we might also be interested in questions which involve the departures, such as: how much time a customer spends in the deli or how many customers are in the deli at a certain time.)
Full disclosure: many of the examples in this chapter involve rather dry tasks like discussing mathematical notation or listing elements of sets. Also, some of the things we do in these examples are rarely done in practice. So why bother? Many common mistakes in solving probability problems arise from misunderstanding these foundational objects. We hope that concrete—though sometimes uninteresting—examples foster understanding of fundamental concepts.
This chapter introduces what the fundamental objects of probability are, but not yet how to solve probability problems. Don’t worry; we’ll solve many interesting problems in the remaining chapters. Think of this chapter as introducing the “language” or “grammar” of probability. When first learning to write, we learn the basic elements of sentences: subjects, predicates, clauses, modifiers, etc. Understanding these fundamental building blocks is essential to learning how to write well, even if we don’t explicitly identify the subject, the verb, etc., in every sentence we write. Likewise, understanding the language of probability is crucial to learning how to solve probability problems, even if the language is sometimes unspoken.
2.1 Outcomes
Probability models can be applied to any situation in which there are multiple potential outcomes and there is uncertainty about which outcome is realized. Due to the wide variety of types of random phenomena, an outcome can be virtually anything:
- the result of a coin flip
- the results of a sequence of coin flips
- a shuffle of a deck of cards
- the weather conditions tomorrow in your city
- the path of a particular Atlantic hurricane
- the daily closing price of a certain stock over the next 30 days
- a noisy electrical signal
- the result of a diagnostic medical test
- a sample of car insurance policies
- the customers arriving at a store
- the result of an election
- the next World Series champion
- a play in a basketball game
And on and on. In particular, an outcome does not have to be a number.
The first step in defining a probability model for a random phenomenon is to identify the possible outcomes.
Definition 2.1 The sample space is the collection of all possible outcomes of a random phenomenon.
Mathematically, the sample space is a set containing all possible outcomes, while any individual outcome is an element in the sample space. The sample space is typically denoted \(\Omega\), the uppercase Greek letter “Omega”. An outcome is typically denoted \(\omega\), the lowercase Greek letter “omega”; \(\omega\) denotes a generic outcome much like the symbol \(u\) in \(\sqrt{u}\) denotes a generic input to the square root function. We write \(\omega \in \Omega\) (read \(\in\) as “in” or “an element of”) to represent that \(\omega\) is a possible outcome of sample space \(\Omega\).
The simplest random phenomena have just two distinct outcomes, in which case the sample space is just a set with two elements, e.g., \(\Omega=\{\text{no}, \text{yes}\}\), \(\Omega=\{\text{off}, \text{on}\}\), \(\Omega=\{0, 1\}\), \(\Omega=\{-1, 1\}\). For example, the sample space for a single coin flip could be \(\Omega = \{H, T\}\). If the coin lands on heads, we observe the outcome \(\omega = H\); if tails we observe \(\omega=T\).
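In code, a two-element sample space and one observed outcome can be sketched as follows (a Python illustration; the set and labels mirror the coin flip example above, and the fair random choice is just for demonstration):

```python
import random

# Sample space for a single coin flip
Omega = {"H", "T"}

# Observe one outcome (simulated here, purely for illustration)
omega = random.choice(sorted(Omega))

print(omega in Omega)  # True: any observed outcome is an element of Omega
```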
In simple examples we can describe the sample space by listing all possible outcomes. However, constructing a list of all possible outcomes is rarely done in practice. We do so here only to provide some concrete examples of sample spaces. While a random phenomenon always has a corresponding sample space, in most situations the sample space of outcomes is at best only vaguely specified and cannot be feasibly enumerated.
A random phenomenon is modeled by a single sample space. In Example 2.1 there was a single sample space whose outcomes represented the result of the pair of rolls; in particular, there was not a separate sample space for each of the individual rolls. Whenever possible, a sample space outcome should be defined to provide the maximum amount of information about the outcome of the random phenomenon.
Here’s another concrete example where we can list all the outcomes in the sample space. However, keep in mind that enumerating the sample space is rarely done in practice.
| Spot 1 | Spot 2 | Spot 3 | Spot 4 |
|---|---|---|---|
| 1 | 2 | 3 | 4 |
| 1 | 2 | 4 | 3 |
| 1 | 3 | 2 | 4 |
| 1 | 3 | 4 | 2 |
| 1 | 4 | 2 | 3 |
| 1 | 4 | 3 | 2 |
| 2 | 1 | 3 | 4 |
| 2 | 1 | 4 | 3 |
| 2 | 3 | 1 | 4 |
| 2 | 3 | 4 | 1 |
| 2 | 4 | 1 | 3 |
| 2 | 4 | 3 | 1 |
| 3 | 1 | 2 | 4 |
| 3 | 1 | 4 | 2 |
| 3 | 2 | 1 | 4 |
| 3 | 2 | 4 | 1 |
| 3 | 4 | 1 | 2 |
| 3 | 4 | 2 | 1 |
| 4 | 1 | 2 | 3 |
| 4 | 1 | 3 | 2 |
| 4 | 2 | 1 | 3 |
| 4 | 2 | 3 | 1 |
| 4 | 3 | 1 | 2 |
| 4 | 3 | 2 | 1 |
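The 24 arrangements in the table above can be enumerated programmatically; a sketch in Python, assuming objects labeled 1 through 4:

```python
from itertools import permutations

# All arrangements of objects 1-4 into spots 1-4, matching the table above
arrangements = list(permutations([1, 2, 3, 4]))

print(len(arrangements))  # 24
print(arrangements[0])    # (1, 2, 3, 4): object i placed in spot i
```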
In the two previous examples, the sample space was discrete, in the sense that the outcomes could be enumerated in a list (though it could be a very long list). But in many cases, it is not possible to enumerate outcomes in a list, even in principle.
For example, consider the circular spinner (like one from a kids’ game) in Figure 2.1. Imagine a needle anchored at the center of the circle which is spun and eventually lands pointing at a number on the outside of the circle. The values in the picture are rounded to two decimal places, but consider an idealized model where the spinner is infinitely precise and the needle infinitely fine, so that any real number between 0 and 1 is a possible outcome. The sample space corresponding to a single spin of this spinner is the interval \([0, 1]\). There are uncountably many numbers in \([0, 1]\), so it would not be possible to enumerate them in a list. The interval \([0, 1]\) is an example of a continuous sample space.
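A quick way to get a feel for a continuous sample space is to simulate spins; a sketch in Python (note that `random.random()` returns values in \([0, 1)\), never exactly 1, a minor difference from the idealized interval):

```python
import random

# Simulate five spins of the idealized spinner
spins = [random.random() for _ in range(5)]

# Every simulated spin lands somewhere in [0, 1)
print(all(0 <= u < 1 for u in spins))  # True
```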
In the previous example, outcomes were measured on a continuous scale; any real number between 0 and 60 was a possible arrival time. In practice we might round the arrival time to the nearest minute or second, but in principle and with infinite precision any real number in the continuous interval \([0, 60]\) is possible.
Furthermore, even in situations where outcomes are inherently discrete, it is often more convenient to model them as continuous. For example, if an outcome represents the annual salary in dollars of a randomly selected U.S. household, it would be more convenient to model the sample space as the continuous interval \([0, \infty)\) rather than a discrete set like \(\{0, 1, 2, \ldots\}\) or \(\{0, 0.01, 0.02, \ldots\}\). Continuous models are often more tractable mathematically than discrete models.
In the previous examples, the sample space could be defined rather explicitly, either by direct enumeration or using set notation (like a Cartesian product). However, explicitly defining a sample space in a compact way is often not possible, as in the following example.
Any random phenomenon has a corresponding sample space, but in some situations explicitly defining an outcome is not feasible. For example, suppose the random phenomenon is tomorrow’s weather. In order to describe an outcome, we need to specify (among other things): temperature, atmospheric pressure, wind, humidity, precipitation, and cloudiness, and how it all evolves over the course of tomorrow, possibly in multiple locations. Representing all of this information in a compact way to define even just one outcome is virtually impossible; explicitly defining a sample space of all possible outcomes is hopeless. Regardless, the sample space is still there in the background whether we specify it or not.
Even though the sample space often is at best vaguely defined (“tomorrow’s weather”) and plays a background role, it is important to first consider what is possible before determining how probable events are. The sample space essentially defines the denominator in probability calculations. In particular, considering the sample space can help distinguish between “the particular and the general” (as discussed in Section 1.6).
2.1.1 Counting outcomes
When there are finitely many possibilities, we can ask: how many possible outcomes are there? In Example 2.1 and Example 2.2 we counted outcomes by enumerating them in a list. Of course, listing all the outcomes is infeasible unless the sample space is very small. Now we’ll see a simple principle that can be applied to count outcomes.
All of the counting rules we will see are based on multiplication, as in Example 2.6.
Lemma 2.1 (Multiplication principle for counting) Suppose that stage 1 of a process can be completed in any one of \(n_1\) ways. Further, suppose that for each way of completing stage 1, stage 2 can be completed in any one of \(n_2\) ways. Then the two-stage process can be completed in any one of \(n_1\times n_2\) ways. This rule extends naturally to an \(\ell\)-stage process, which can be completed in any one of \(n_1\times n_2\times n_3\times\cdots\times n_\ell\) ways.
In the multiplication principle it is not important whether there is a “first” or “second” stage. What is important is that there are distinct stages, each with its own number of “choices”. In Example 2.6, there was a bowl/cone stage, an ice cream flavor stage, and a sprinkle stage; it didn’t matter if the flavor was chosen first or second or third.
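The stages can be illustrated in code with hypothetical counts (the actual numbers of choices in Example 2.6 may differ); a sketch in Python:

```python
from itertools import product

# Hypothetical stage choices for an ice cream order
containers = ["bowl", "cone"]                     # stage 1: 2 ways
flavors = ["vanilla", "chocolate", "strawberry"]  # stage 2: 3 ways
sprinkles = ["with", "without"]                   # stage 3: 2 ways

# The multiplication principle predicts 2 * 3 * 2 = 12 possible orders
orders = list(product(containers, flavors, sprinkles))
print(len(orders))  # 12
```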
We can use the multiplication principle to verify the total number of possible outcomes for a few of our previous examples. In Example 2.1 an outcome is a pair (first roll, second roll). There are 4 possibilities for the first roll and 4 for the second, so \(4\times4 = 16\) possible pairs. In Example 2.2 an outcome is an arrangement of the 4 outcomes in the 4 spots. There are 4 possibilities for the object placed in spot 1. After placing that object, there are 3 possibilities for spot 2, then 2 possibilities for spot 3, with one object left for spot 4. So there are \(4\times3\times2\times1 = 24\) possible arrangements.
The multiplication principle provides the foundation for some other counting rules we will see later.
2.1.2 Exercises
Exercise 2.1 Consider the outcome of a sequence of 4 flips of a coin.
- Without enumerating the sample space, determine the number of outcomes.
- Enumerate the sample space and confirm the number of outcomes.
- We might be interested in the number of flips that land on heads. Explain why it is still advantageous to define the sample space as in the previous part, rather than as \(\Omega=\{0, 1, 2, 3, 4\}\).
Exercise 2.2 The latest series of collectible Lego Minifigures contains 3 different Minifigure prizes (labeled 1, 2, 3). Each package contains a single unknown prize. Suppose we only buy 3 packages and we consider as our sample space outcome the results of just these 3 packages (prize in package 1, prize in package 2, prize in package 3). For example, 323 (or (3, 2, 3)) represents prize 3 in the first package, prize 2 in the second package, prize 3 in the third package.
- Without enumerating the sample space, determine the number of outcomes.
- Enumerate the sample space and confirm the number of outcomes.
2.2 Events
An event is something that might happen or might be true. For example, if we’re interested in the weather conditions in our city tomorrow, events include
- it rains
- it does not rain
- the high temperature is 75°F (rounded to the nearest °F)
- the high temperature is above 75°F
- it rains and the high temperature is above 75°F
- it does not rain or the high temperature is not above 75°F
There are many possible outcomes for tomorrow’s weather, but each of the above will be true only for certain outcomes.
Definition 2.2 An event is a subset of the sample space. An event represents a collection of outcomes that satisfy some criteria.
The sample space is the collection of all possible outcomes; an event represents only those outcomes which satisfy some criteria. Events are typically denoted with capital letters near the start of the alphabet, with or without subscripts (e.g. \(A\), \(B\), \(C\), \(A_1\), \(A_2\)).
Mathematically, events are sets, so events can be composed from others using basic set operations like unions (\(A\cup B\)), intersections (\(A \cap B\)), and complements (\(A^c\)).
- Complements. Read \(A^c\) as “not \(A\)”, the outcomes that do not satisfy \(A\)
- Intersections. Read \(A\cap B\) as “\(A\) and \(B\)”, the outcomes that satisfy both \(A\) and \(B\)
- Unions. Read \(A \cup B\) as “\(A\) or \(B\)”, the outcomes that satisfy \(A\) or \(B\). Unions (\(\cup\), “or”) are always inclusive: \(A\cup B\) occurs if \(A\) occurs but \(B\) does not, \(B\) occurs but \(A\) does not, or both \(A\) and \(B\) occur. Note that the complement of a union is the intersection of the complements, and vice versa: \((A \cup B)^c = A^c \cap B^c\) and \((A \cap B)^c = A^c \cup B^c\).
In the weather example above we can write
- \(A\): it rains
- \(B=A^c\): it does not rain
- \(C\): the high temperature is 75°F (rounded to the nearest °F)
- \(D\): the high temperature is above 75°F
- \(E = A \cap D\): it rains and the high temperature is above 75°F
- \(F = A^c \cup D^c = (A\cap D)^c = B\cap D^c = E^c\): it does not rain or the high temperature is not above 75°F
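Because events are sets, relationships like these can be checked directly with set operations. A sketch in Python on the two-roll dice sample space (faces 1 through 4), with two illustrative events standing in for \(A\) and \(D\):

```python
from itertools import product

Omega = set(product(range(1, 5), repeat=2))  # all (first roll, second roll) pairs
A = {w for w in Omega if w[0] + w[1] == 4}   # illustrative event: sum is 4
D = {w for w in Omega if max(w) == 3}        # illustrative event: larger roll is 3

def complement(E):
    # Outcomes in the sample space that are not in E
    return Omega - E

# De Morgan: the complement of a union is the intersection of complements
print(complement(A | D) == complement(A) & complement(D))  # True
print(complement(A & D) == complement(A) | complement(D))  # True
```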
| Team | Conference | Championships | Wins | PPG | FG3 | FG3A | FG2 | FG2A | FT | FTA |
|---|---|---|---|---|---|---|---|---|---|---|
| Detroit Pistons | Eastern | 3 | 17 | 110.3 | 11.4 | 32.4 | 28.2 | 54.6 | 19.8 | 25.7 |
| Houston Rockets | Western | 2 | 22 | 110.7 | 10.4 | 31.9 | 30.2 | 56.9 | 19.1 | 25.3 |
| San Antonio Spurs | Western | 5 | 22 | 113.0 | 11.1 | 32.2 | 32.0 | 60.4 | 15.8 | 21.2 |
| Charlotte Hornets | Eastern | 0 | 27 | 111.0 | 10.7 | 32.5 | 30.5 | 57.9 | 17.6 | 23.6 |
| Portland Trail Blazers | Western | 1 | 33 | 113.4 | 12.9 | 35.3 | 27.6 | 50.1 | 19.6 | 24.6 |
| Orlando Magic | Eastern | 0 | 34 | 111.4 | 10.8 | 31.1 | 29.8 | 55.2 | 19.6 | 25.0 |
| Indiana Pacers | Eastern | 0 | 35 | 116.3 | 13.6 | 37.0 | 28.4 | 52.6 | 18.7 | 23.7 |
| Washington Wizards | Eastern | 1 | 35 | 113.2 | 11.3 | 31.7 | 30.9 | 55.2 | 17.6 | 22.4 |
| Utah Jazz | Western | 0 | 37 | 117.1 | 13.3 | 37.8 | 29.2 | 52.0 | 18.7 | 23.8 |
| Dallas Mavericks | Western | 1 | 38 | 114.2 | 15.2 | 41.0 | 24.8 | 43.3 | 19.0 | 25.1 |
| Chicago Bulls | Eastern | 6 | 40 | 113.1 | 10.4 | 28.9 | 32.1 | 57.9 | 17.6 | 21.8 |
| Oklahoma City Thunder | Western | 1 | 40 | 117.5 | 12.1 | 34.1 | 31.0 | 58.5 | 19.2 | 23.7 |
| Toronto Raptors | Eastern | 1 | 41 | 112.9 | 10.7 | 32.0 | 31.1 | 59.3 | 18.4 | 23.4 |
| New Orleans Pelicans | Western | 0 | 42 | 114.4 | 11.0 | 30.1 | 31.1 | 57.5 | 19.3 | 24.4 |
In Example 2.8 notice that we only said the winner was determined “at random”; we didn’t mention how. “At random” only implies that the winning team will be selected in a manner that involves uncertainty. “At random” does not necessarily imply that the 14 teams are equally likely. In fact, the 2023 NBA Draft Lottery was weighted to give teams with fewer wins the previous season a greater probability of winning the top pick. We’ll return to this idea later. For now, we’re just defining some events that are possible; later we will consider how probable they are.
If the outcomes of a sample space are represented by rows in a table, then events are subsets of rows which satisfy some criteria.
| First roll | Second roll | Sum is 4? |
|---|---|---|
| 1 | 1 | no |
| 1 | 2 | no |
| 1 | 3 | yes |
| 1 | 4 | no |
| 2 | 1 | no |
| 2 | 2 | yes |
| 2 | 3 | no |
| 2 | 4 | no |
| 3 | 1 | yes |
| 3 | 2 | no |
| 3 | 3 | no |
| 3 | 4 | no |
| 4 | 1 | no |
| 4 | 2 | no |
| 4 | 3 | no |
| 4 | 4 | no |
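The “Sum is 4?” column amounts to filtering the sample space; a sketch in Python:

```python
from itertools import product

Omega = list(product(range(1, 5), repeat=2))  # the 16 (first, second) pairs
A = [w for w in Omega if sum(w) == 4]         # event: the sum of the rolls is 4

print(A)  # [(1, 3), (2, 2), (3, 1)], the three "yes" rows above
```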
We reiterate (again!) that there is a single sample space, upon which all events are defined. In the above example, events that involved only the first or second roll such as \(D\) and \(E\) were still defined in terms of pairs of rolls. An outcome in a sample space should be defined to record as much information as possible so that the occurrence or non-occurrence of all events of interest can be determined.
Some events consist of a single outcome, or no outcomes at all (the “empty set” denoted \(\{\}\) or \(\emptyset\)).
Definition 2.3 Events \(A_1, A_2, A_3, \ldots\) are disjoint (a.k.a. mutually exclusive) if none of the events have any outcomes in common; that is, if \(A_i \cap A_j = \emptyset\) for all \(i\neq j\).
Roughly, disjoint events do not “overlap”. In Example 2.9, events \(B\) and \(C\) are disjoint since \(B \cap C = \emptyset\); there are no outcomes for which both the sum of the dice is at most 3 and the larger roll is a 3.
| Spot 1 | Spot 2 | Spot 3 | Spot 4 | Object 3 in spot 3? |
|---|---|---|---|---|
| 1 | 2 | 3 | 4 | yes |
| 1 | 2 | 4 | 3 | no |
| 1 | 3 | 2 | 4 | no |
| 1 | 3 | 4 | 2 | no |
| 1 | 4 | 2 | 3 | no |
| 1 | 4 | 3 | 2 | yes |
| 2 | 1 | 3 | 4 | yes |
| 2 | 1 | 4 | 3 | no |
| 2 | 3 | 1 | 4 | no |
| 2 | 3 | 4 | 1 | no |
| 2 | 4 | 1 | 3 | no |
| 2 | 4 | 3 | 1 | yes |
| 3 | 1 | 2 | 4 | no |
| 3 | 1 | 4 | 2 | no |
| 3 | 2 | 1 | 4 | no |
| 3 | 2 | 4 | 1 | no |
| 3 | 4 | 1 | 2 | no |
| 3 | 4 | 2 | 1 | no |
| 4 | 1 | 2 | 3 | no |
| 4 | 1 | 3 | 2 | yes |
| 4 | 2 | 1 | 3 | no |
| 4 | 2 | 3 | 1 | yes |
| 4 | 3 | 1 | 2 | no |
| 4 | 3 | 2 | 1 | no |
We can use the multiplication principle to count the number of outcomes that satisfy event \(A_3\) in Table 2.5. If object 3 is in spot 3, there are 3 objects that can go in spot 1, then 2 that can go in spot 2, leaving 1 for spot 4; for a total of \(3\times2\times1\times1=6\) of the 24 outcomes which satisfy event \(A_3\).
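This count can be verified by filtering the 24 arrangements; a sketch in Python:

```python
from itertools import permutations

arrangements = list(permutations([1, 2, 3, 4]))
A3 = [a for a in arrangements if a[2] == 3]  # object 3 placed in spot 3

print(len(A3))  # 6, agreeing with 3 * 2 * 1 * 1
```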
When more than just a few events are of interest, subscripts are commonly used to identify different events. In the previous example, we might also be interested in \(A_1\), the event that object 1 is placed in spot 1; \(A_2\), the event that object 2 is placed in spot 2; and so on.
Remember that intervals of real numbers such as \((a,b), [a,b], (a,b]\) are also sets, and so can also be events. For example, if an outcome is the result of a single spin of the spinner in Figure 2.1, events include
- \([0, 0.5]\), the result is between 0 and 0.5 (the needle lands in the right half of the spinner)
- \([0.75, 1]\), the result is between 0.75 and 1 (the needle lands in the northwest quarter of the spinner)
- \([0.595, 0.605)\), the result rounded to two decimal places is 0.60
- \(\{0.6\}\), the result is 0.6 exactly (the needle points exactly at 0.60000000\(\ldots\))
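On a continuous sample space, events like these are naturally represented as membership tests on intervals; a sketch in Python (the function names are ours, purely for illustration):

```python
def in_right_half(u):
    """Event [0, 0.5]: the needle lands in the right half of the spinner."""
    return 0 <= u <= 0.5

def rounds_to_060(u):
    """Event [0.595, 0.605): the result rounded to two places is 0.60."""
    return 0.595 <= u < 0.605

print(in_right_half(0.3))    # True
print(rounds_to_060(0.6))    # True
print(rounds_to_060(0.605))  # False: 0.605 is outside the half-open interval
```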
It is often helpful to conceptualize and visualize events (sets) with pictures, especially when dealing with continuous sample spaces.
In Example 2.11 the sample space consists of (Regina, Cady) pairs of arrival times so any event must be expressed as a collection of pairs. Even though the criteria for event \(D\) involves only Regina’s arrival time, the event is not simply [0, 24]; we need to consider all (Regina, Cady) pairs for which the Regina component is in the interval [0, 24].
In many situations it is not possible to explicitly define a sample space in a compact way, and so outcomes and events are often only vaguely defined. Nevertheless, there is always a sample space in the background representing possible outcomes, and collections of these outcomes represent events of interest.
2.2.1 The collection of events of interest
An outcome is a possible realization of a random phenomenon. The sample space is the set of all possible outcomes. An event is a subset of the sample space consisting of outcomes that satisfy some criteria. There are many events of interest for any random phenomenon. The collection of all events of interest is often denoted \(\mathcal{F}\).
An event \(A\) is a set. The collection \(\mathcal{F}\) of events of interest is a collection of sets. For the purposes of this text, \(\mathcal{F}\) can be considered to be the set of all subsets of \(\Omega\).
As an example, consider a single roll of a four-sided die.
| Event | Description | Occurs upon observing outcome \(\omega=3\)? |
|---|---|---|
| \(\emptyset\) | Roll nothing (not possible) | No |
| \(\{1\}\) | Roll a 1 | No |
| \(\{2\}\) | Roll a 2 | No |
| \(\{3\}\) | Roll a 3 | Yes |
| \(\{4\}\) | Roll a 4 | No |
| \(\{1, 2\}\) | Roll a 1 or a 2 | No |
| \(\{1, 3\}\) | Roll a 1 or a 3 | Yes |
| \(\{1, 4\}\) | Roll a 1 or a 4 | No |
| \(\{2, 3\}\) | Roll a 2 or a 3 | Yes |
| \(\{2, 4\}\) | Roll a 2 or a 4 | No |
| \(\{3, 4\}\) | Roll a 3 or a 4 | Yes |
| \(\{1, 2, 3\}\) | Roll a 1, 2, or 3 (a.k.a. do not roll a 4) | Yes |
| \(\{1, 2, 4\}\) | Roll a 1, 2, or 4 (a.k.a. do not roll a 3) | No |
| \(\{1, 3, 4\}\) | Roll a 1, 3, or 4 (a.k.a. do not roll a 2) | Yes |
| \(\{2, 3, 4\}\) | Roll a 2, 3, or 4 (a.k.a. do not roll a 1) | Yes |
| \(\{1, 2, 3, 4\}\) | Roll something | Yes |
A random phenomenon corresponds to a single sample space, but there are many events of interest. Listing the collection of all possible events as in the previous table is rarely done in practice, but we do so here to provide a concrete example of \(\mathcal{F}\).
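The table above, the collection of all \(2^4 = 16\) subsets of \(\{1, 2, 3, 4\}\), can be generated programmatically; a sketch in Python:

```python
from itertools import chain, combinations

Omega = [1, 2, 3, 4]

# F: every subset of Omega, from the empty set up to Omega itself
F = list(chain.from_iterable(
    combinations(Omega, r) for r in range(len(Omega) + 1)))

print(len(F))  # 16 events
# Events that occur upon observing the outcome omega = 3
print(sum(1 for E in F if 3 in E))  # 8, matching the "Yes" rows above
```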
2.2.2 Exercises
Exercise 2.3 The latest series of collectible Lego Minifigures contains 3 different Minifigure prizes (labeled 1, 2, 3). Each package contains a single unknown prize. Suppose we only buy 3 packages and we consider as our sample space outcome the results of just these 3 packages (prize in package 1, prize in package 2, prize in package 3). For example, 323 (or (3, 2, 3)) represents prize 3 in the first package, prize 2 in the second package, prize 3 in the third package.
- Let \(A_1\) be the event that prize 1 is obtained—that is, at least one of the packages contains prize 1—and define \(A_2, A_3\) similarly for prize 2, 3.
- Let \(B_1\) be the event that only prize 1 is obtained—that is, all three packages contain prize 1—and define \(B_2, B_3\) similarly for prize 2, 3.
Identify the following events as sets and interpret them in words
- \(A_1\) (hint: define \(A_1^c\) first)
- \(B_1\)
- \(A_1 \cap A_2 \cap A_3\)
- \(A_1 \cup A_2 \cup A_3\)
- \(B_1 \cap B_2 \cap B_3\)
- \(B_1 \cup B_2 \cup B_3\)
Exercise 2.4 Katniss throws a dart at a circular dartboard with radius 1 foot. (Assume that Katniss’s dart never misses the dartboard.)
Draw a picture to represent each of these events.
- \(A\), Katniss’s dart lands within 1 inch of the center of the dartboard.
- \(B\), Katniss’s dart lands more than 1 inch but less than 2 inches away from the center of the dartboard.
- \(E\), Katniss’s dart lands within 1 inch of the outside edge of the dartboard.
2.3 Random variables
Statisticians use the terms observational unit and variable. Observational units are the people, places, things, etc., for which data is observed. Variables are the measurements made on the observational units. For example, the observational units in a study could be college students, while variables could be age, college GPA, major GPA, number of credits completed, number of Statistics courses taken, etc.
In probability, an outcome of a random phenomenon plays a role analogous to an observational unit in statistics. The sample space of outcomes is often only vaguely defined. In many situations we are less interested in detailing the outcomes themselves and more interested in whether or not certain events occur, or with measurements that we can make for the outcomes. For example, if the random phenomenon corresponds to randomly selecting a single student at a college, an outcome would be the selected student, but we are more interested in quantities like the student’s GPA or number of credits completed. If we randomly select a sample of students, we are less interested in who the students are, and more interested in questions which involve variables, such as: what is the relationship between college GPA and major GPA? In probability, random variables play a role analogous to variables in statistics.
Definition 2.4 A random variable assigns a number measuring some quantity of interest to each outcome of a random phenomenon. That is, a random variable is a function that takes an outcome in the sample space as input and returns a number as output.
If we’re interested in the weather conditions in our city tomorrow, random variables include
- high temperature (°F)
- amount of precipitation (cm)
- humidity (%)
- maximum wind speed (mph)
Each of these quantities will take a value that depends on tomorrow’s weather conditions. Since there are a range of possibilities for tomorrow’s weather conditions, there is a range of values that each of these random variables can take.
Random variables are typically denoted by capital letters near the end of the alphabet, with or without subscripts: e.g. \(X\), \(Y\), \(Z\), or \(X_1\), \(X_2\), \(X_3\), etc.
A random variable is “variable” in the sense that it can take different values—that is, it can vary—and the value it takes is uncertain—that is, “random”.
In statistics, data is often stored in a spreadsheet or data table with rows corresponding to observational units and columns to variables. Likewise, in probability it helps to visualize a table with rows corresponding to outcomes and columns to random variables. Each outcome is associated with a value of the random variable. Since the outcome is uncertain, the value the random variable takes is also uncertain.
| Outcome (First roll, second roll) | X (sum) | Y (max) |
|---|---|---|
| (1, 1) | 2 | 1 |
| (1, 2) | 3 | 2 |
| (1, 3) | 4 | 3 |
| (1, 4) | 5 | 4 |
| (2, 1) | 3 | 2 |
| (2, 2) | 4 | 2 |
| (2, 3) | 5 | 3 |
| (2, 4) | 6 | 4 |
| (3, 1) | 4 | 3 |
| (3, 2) | 5 | 3 |
| (3, 3) | 6 | 3 |
| (3, 4) | 7 | 4 |
| (4, 1) | 5 | 4 |
| (4, 2) | 6 | 4 |
| (4, 3) | 7 | 4 |
| (4, 4) | 8 | 4 |
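The table’s two columns can be built by applying \(X\) and \(Y\), as functions, to each outcome; a sketch in Python:

```python
from itertools import product

# The sample space of (first roll, second roll) pairs
Omega = list(product(range(1, 5), repeat=2))

# Random variables represented as mappings from outcomes to numbers
X = {w: w[0] + w[1] for w in Omega}  # X: the sum of the two rolls
Y = {w: max(w) for w in Omega}       # Y: the larger of the two rolls

print(X[(2, 3)], Y[(2, 3)])  # 5 3, matching the (2, 3) row of the table
```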
Mathematically, a random variable \(X\) is a function that takes an outcome \(\omega\) in the sample space \(\Omega\) as input and returns a number \(X(\omega)\) as output; we write \(X:\Omega\mapsto \mathbb{R}\). The random variable itself is typically denoted with a capital letter (\(X\)); possible values of that random variable are denoted with lower case letters (\(x\)). Think of the capital letter \(X\) as a label standing in for a formula like “the sum of two rolls of a four-sided die” and \(x\) as a dummy variable standing in for a particular value like 3.
In Example 2.17, the pair \((X, Y)\) is a random vector. The output of each of \(X\) and \(Y\) is a number; the output of \((X, Y)\) is an ordered pair of numbers. A random vector is simply a vector of random variables.
One of the main reasons for modeling a sample space as the set of possible outcomes rather than the set of all possible values of some random variable is that we often want to define many random variables on the same sample space, and study relationships between them. As a statistics analogy, you would not be able to study the relationship between college GPA and major GPA unless you measured both variables for the same set of students.
2.3.1 Types of random variables
There are two main types of random variables.
- Discrete random variables take at most countably many possible values (e.g., \(0, 1, 2, \ldots\)). They are often counting variables (e.g., the number of coin flips that land on heads).
- Continuous random variables can take any real value in some interval (e.g., \([0, 1]\), \([0,\infty)\), \((-\infty, \infty)\)). That is, continuous random variables can take uncountably many different values. Continuous random variables are often measurement variables (e.g., height, weight, income).
In some problems, there are many random variables of interest, as in the following example.
2.3.2 A random variable is a function
Recall that for a mathematical function \(g\), given an input \(u\), the function returns a real number \(g(u)\). For example, if \(g\) is the square root function, \(g(u) = \sqrt{u}\), then \(g(9) = 3\) and \(g(10) = 3.162278...\). If the input comes from some set \(S\) (i.e. \(u\in S\)), we write \(g:S\mapsto \mathbb{R}\).
A random variable \(X\) is a function which maps each outcome \(\omega\) in the sample space \(\Omega\) to a real number \(X(\omega)\); \(X:\Omega\mapsto\mathbb{R}\). For a single outcome \(\omega\), the value \(x = X(\omega)\) is a single number; notice that \(x\) represents the output of the function \(X\) rather than the input. However, it is important to remember that the random variable \(X\) itself is a function, and not a single number.
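Here is a minimal Python sketch of this idea using the dice example: the sample space is a list of outcomes, and the random variable \(X\) is literally a function that maps an outcome to a number.

```python
from itertools import product

# Sample space for two rolls of a four-sided die: 16 ordered pairs
Omega = list(product([1, 2, 3, 4], repeat=2))

# The random variable X ("sum of the two rolls") is a function
# mapping an outcome omega to a number X(omega)
def X(omega):
    return omega[0] + omega[1]

x = X((1, 2))  # evaluating X at the outcome (1, 2) gives the value x = 3
```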
You are probably familiar with functions expressed as simple closed form formulas of their inputs: \(g(u)=5u\), \(g(u)=u^2\), \(g(u)=\log u\), etc. While any random variable is some function, the function is rarely specified as an explicit mathematical formula of its input \(\omega\). Often, outcomes are not even numbers (e.g., sequences of coin flips), or only vaguely specified if at all (e.g., tomorrow’s weather conditions). In Example 2.17 we defined \(X\) through the words “sum of two rolls of a fair four-sided die” instead of as a formula12.
It is more appropriate to think of a random variable as a function in the sense of a scale at a grocery store which maps a fruit to its weight, \(X: \text{fruit}\mapsto\text{weight}\). Put an apple on the scale and the scale returns a number, \(X(\text{apple})\), the weight of the apple. Likewise, \(X(\text{orange})\), \(X(\text{banana})\). The random variable \(X\) is the scale itself. This simplistic analogy assumes a sample space outcome is a single fruit. Of course, it’s even more complicated in reality since an outcome can be considered a set of fruits, so that we have for example \(X(\{\text{2 apples}, \text{3 oranges}\})\), and all fruits do not weigh the same, so that \(X(\text{this apple})\) is not the same as \(X(\text{that apple})\). But the idea is that a function is like a scale, with an input (fruits) and an output (weight). The input does not have to be a number, but the output does.
Suppose I’m going to randomly select some fruits, put them in a brown grocery bag, and place it on the scale. It wouldn’t be feasible to enumerate all the combinations of fruits I could put in the bag, but even so you know that any possible combination has some weight which could be measured by the scale. There is still a function (scale) that maps an input (fruits in the bag) to a numerical output (weight), even if that function is not explicitly specified with a mathematical formula. Now suppose I’ve selected some fruits and put the bag on the scale. Even if you can’t see what fruits are inside the bag, you can still read the weight off the scale. But even if you only observe the weight, you know there was still a background random process of putting fruits in a bag which resulted in a particular outcome having the observed weight.
The “weighing fruits in a bag” scenario in the previous paragraph illustrates how probability usually works:
- We typically don’t explicitly specify outcomes or the sample space, but we know that different outcomes can result in different values of random variables. That is, we know there is some function which maps outcomes of the random phenomenon to values of the random variable, even if we don’t have an explicit formula for the inputs to the function (sample space outcomes) or the function itself.
- We might not observe outcomes in full detail (e.g., tomorrow’s weather conditions), but we often can still observe values of random variables (e.g., tomorrow’s high temperature).
2.3.3 Transformations of random variables
We are often interested in random variables that are derived from others. For example, if the random variable \(X\) represents the radius (cm) of a randomly selected circle, then \(Y = \pi X^2\) is a random variable representing the circle’s area (\(\text{cm}^2\)). If the random variables \(W\) and \(T\) represent the weight (kg) and height (m), respectively, of a randomly selected person, then \(S = W / T^2\) is a random variable representing the person’s body mass index (\(\text{kg}/\text{m}^2\)).
A function of a random variable is also a random variable. That is, if \(X\) is a random variable and \(g\) is a function, then \(Y=g(X)\) is also a random variable13. For example, if \(u\) is a radius of a circle, the function \(g(u) = \pi u^2\) outputs its area; if \(X\) is a random variable representing the radius of a randomly selected circle then \(Y = g(X)=\pi X^2\) is a random variable representing the circle’s area.
Sums and products, etc., of random variables defined on the same sample space are random variables. That is, if random variables \(X\) and \(Y\) are defined on the same sample space then \(X+Y\), \(X-Y\), \(XY\), and \(X/Y\) are also random variables. Similarly, it is possible to make comparisons such as \(X\ge Y\) and apply other transformations to random variables defined on the same sample space.
| Team | $W$ | $X_3$ | $Y_3$ | $X_2$ | $Y_2$ | $X_1$ | $Y_1$ | $82-W$ | $W/82$ | $X_1/Y_1$ | $\frac{X_2+X_3}{Y_2+Y_3}$ | $\frac{Y_3}{Y_2+Y_3}$ | $3X_3 + 2X_2 + X_1$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Detroit Pistons | 17 | 11.4 | 32.4 | 28.2 | 54.6 | 19.8 | 25.7 | 65 | 0.207 | 0.770 | 0.455 | 0.372 | 110.4 |
| Houston Rockets | 22 | 10.4 | 31.9 | 30.2 | 56.9 | 19.1 | 25.3 | 60 | 0.268 | 0.755 | 0.457 | 0.359 | 110.7 |
| San Antonio Spurs | 22 | 11.1 | 32.2 | 32.0 | 60.4 | 15.8 | 21.2 | 60 | 0.268 | 0.745 | 0.465 | 0.348 | 113.1 |
| Charlotte Hornets | 27 | 10.7 | 32.5 | 30.5 | 57.9 | 17.6 | 23.6 | 55 | 0.329 | 0.746 | 0.456 | 0.360 | 110.7 |
| Portland Trail Blazers | 33 | 12.9 | 35.3 | 27.6 | 50.1 | 19.6 | 24.6 | 49 | 0.402 | 0.797 | 0.474 | 0.413 | 113.5 |
| Orlando Magic | 34 | 10.8 | 31.1 | 29.8 | 55.2 | 19.6 | 25.0 | 48 | 0.415 | 0.784 | 0.470 | 0.360 | 111.6 |
| Indiana Pacers | 35 | 13.6 | 37.0 | 28.4 | 52.6 | 18.7 | 23.7 | 47 | 0.427 | 0.789 | 0.469 | 0.413 | 116.3 |
| Washington Wizards | 35 | 11.3 | 31.7 | 30.9 | 55.2 | 17.6 | 22.4 | 47 | 0.427 | 0.786 | 0.486 | 0.365 | 113.3 |
| Utah Jazz | 37 | 13.3 | 37.8 | 29.2 | 52.0 | 18.7 | 23.8 | 45 | 0.451 | 0.786 | 0.473 | 0.421 | 117.0 |
| Dallas Mavericks | 38 | 15.2 | 41.0 | 24.8 | 43.3 | 19.0 | 25.1 | 44 | 0.463 | 0.757 | 0.474 | 0.486 | 114.2 |
| Chicago Bulls | 40 | 10.4 | 28.9 | 32.1 | 57.9 | 17.6 | 21.8 | 42 | 0.488 | 0.807 | 0.490 | 0.333 | 113.0 |
| Oklahoma City Thunder | 40 | 12.1 | 34.1 | 31.0 | 58.5 | 19.2 | 23.7 | 42 | 0.488 | 0.810 | 0.465 | 0.368 | 117.5 |
| Toronto Raptors | 41 | 10.7 | 32.0 | 31.1 | 59.3 | 18.4 | 23.4 | 41 | 0.500 | 0.786 | 0.458 | 0.350 | 112.7 |
| New Orleans Pelicans | 42 | 11.0 | 30.1 | 31.1 | 57.5 | 19.3 | 24.4 | 40 | 0.512 | 0.791 | 0.481 | 0.344 | 114.5 |
Remember that we can visualize outcomes as rows in a spreadsheet with random variables as columns. Random variables defined on the same sample space can be put in a single spreadsheet. Each row corresponds to an outcome, and reading across any row there is a value in the column corresponding to each random variable. Random variables derived from transformations of other random variables append columns to the spreadsheet. New random variables can be defined by going row-by-row, outcome-by-outcome, and applying a transformation within each row to the values of other random variables.
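This row-by-row picture can be sketched directly in Python; a minimal illustration using the dice example, with one dictionary per row (the particular column names are our own):

```python
from itertools import product

# One row per outcome, one column per random variable
Omega = list(product([1, 2, 3, 4], repeat=2))

def X(omega):                 # sum of the two rolls
    return omega[0] + omega[1]

def Y(omega):                 # larger of the two rolls
    return max(omega)

rows = [{"outcome": w, "X": X(w), "Y": Y(w)} for w in Omega]

# Derived random variables append columns: values are computed
# row by row from the values in the other columns
for row in rows:
    row["X+Y"] = row["X"] + row["Y"]      # sum of random variables
    row["X^2"] = row["X"] ** 2            # function of a random variable
    row["X>=Y"] = row["X"] >= row["Y"]    # comparison
```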
Using capital letters like \(X\) or \(Y\) to denote random variables is standard practice. To help develop comfort with this mathematical notation, we will often label columns in tables with their random variable symbols (as we did in Table 2.8). Later, when writing code we will often denote random variables with symbols like X or Y. However, keep in mind that a mathematical symbol like \(X\) or \(Y\) only represents a particular random variable within a given context. While you should develop comfort with the notation, you can—and probably should—use more informative labels like “wins” or wins rather than \(W\).
2.3.4 Indicator random variables
Random variables that only take two possible values, 0 and 1, have a special name.
Definition 2.5 An indicator (a.k.a. Bernoulli, a.k.a. Boolean) random variable can take only the values 0 or 1. If \(A\) is an event then the corresponding indicator random variable \(\textrm{I}_A\) is defined as \[ \textrm{I}_A(\omega) = \begin{cases} 1, & \omega \in A,\\ 0, & \omega \notin A \end{cases} \] That is, \(\textrm{I}_A\) equals 1 if event \(A\) occurs, and \(\textrm{I}_A\) equals 0 if event \(A\) does not occur.
Indicators provide the bridge between events (sets) and random variables (functions). Any event either occurs or not; a realization of any event is either true (\(\omega \in A\)) or false (\(\omega \notin A\)). An indicator random variable just translates “true” or “false” into numbers, 1 for “true” and 0 for “false”.
| Outcome | $X$ | $I_1$ | $I_2$ | $I_3$ | $I_4$ |
|---|---|---|---|---|---|
| 1234 | 4 | 1 | 1 | 1 | 1 |
| 1243 | 2 | 1 | 1 | 0 | 0 |
| 1324 | 2 | 1 | 0 | 0 | 1 |
| 1342 | 1 | 1 | 0 | 0 | 0 |
| 1423 | 1 | 1 | 0 | 0 | 0 |
| 1432 | 2 | 1 | 0 | 1 | 0 |
| 2134 | 2 | 0 | 0 | 1 | 1 |
| 2143 | 0 | 0 | 0 | 0 | 0 |
| 2314 | 1 | 0 | 0 | 0 | 1 |
| 2341 | 0 | 0 | 0 | 0 | 0 |
| 2413 | 0 | 0 | 0 | 0 | 0 |
| 2431 | 1 | 0 | 0 | 1 | 0 |
| 3124 | 1 | 0 | 0 | 0 | 1 |
| 3142 | 0 | 0 | 0 | 0 | 0 |
| 3214 | 2 | 0 | 1 | 0 | 1 |
| 3241 | 1 | 0 | 1 | 0 | 0 |
| 3412 | 0 | 0 | 0 | 0 | 0 |
| 3421 | 0 | 0 | 0 | 0 | 0 |
| 4123 | 0 | 0 | 0 | 0 | 0 |
| 4132 | 1 | 0 | 0 | 1 | 0 |
| 4213 | 1 | 0 | 1 | 0 | 0 |
| 4231 | 2 | 0 | 1 | 1 | 0 |
| 4312 | 0 | 0 | 0 | 0 | 0 |
| 4321 | 0 | 0 | 0 | 0 | 0 |
Even though they seem simple, indicator random variables are very useful. In the matching problem, it is not feasible to enumerate the outcomes and count when there is a large number \(n\) of items and spots. Using indicators allows you to count incrementally—is just this item in the correct spot?— rather than all at once. Representing a count as a sum of indicator random variables is a very common and useful strategy, especially in problems that involve “find the expected number of…”
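A minimal Python sketch of this strategy for the matching problem with \(n = 4\) items, reproducing the count \(X\) from the table above:

```python
from itertools import permutations

n = 4

def indicator(outcome, j):
    # I_j = 1 if the item in spot j is item j, else 0
    return 1 if outcome[j] == j + 1 else 0

def X(outcome):
    # total number of matches = sum of the spot-by-spot indicators
    return sum(indicator(outcome, j) for j in range(n))

# Evaluate X for all 24 possible orderings
counts = [X(outcome) for outcome in permutations(range(1, n + 1))]
```

The same two functions work unchanged for large \(n\), where enumerating and counting all matches at once would be infeasible by hand.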
Here is a little story that illustrates the idea of incremental counting with indicators. Imagine a dad and his young child are reading a picture book. They come to a page that has twenty pictures of fruits, of which seven are bananas. The following conversation ensues.
- Dad: Can you count all the bananas? Let’s see! How many bananas have we counted so far?
- Kid: We haven’t started counting yet!
- Dad: Right, so how many bananas have we counted so far?
- Kid: Zero.
- Dad: That’s right! We’ve counted zero bananas so far. (Dad points to a banana.) Is that a banana?
- Kid: Yes!
- Dad: So how many more bananas did we just count?
- Kid: One more.
- Dad: So how many bananas have we counted so far?
- Kid: One.
- Dad: Great job! We’ve counted one banana so far. (Dad points to a different banana.) Is that a banana?
- Kid: Yes!
- Dad: So how many more bananas did we just count?
- Kid: We counted one more banana.
- Dad: So how many bananas have we counted so far?
- Kid: Two.
- Dad: Great job! We’ve counted two bananas so far. (Dad points to a different banana.) Is that a banana?
- Kid: Yes!
- Dad: So how many more bananas did we just count?
- Kid: We counted one more banana.
- Dad: So how many bananas have we counted so far?
- Kid: Three.
- Dad: Great job! We’ve counted three bananas so far. (Dad points to an orange14.) Is that a banana?
- Kid: No, that’s an orange!
- Dad: So how many more bananas did we just count?
- Kid: Zero. It was not a banana!
- Dad: So how many bananas have we counted so far?
- Kid: Still three.
- Dad: Great job! We’ve counted three bananas so far. (Continues in this manner until Dad points to the twentieth and last fruit on the page, a banana.) Almost done. We’ve counted six bananas so far. Is that a banana?
- Kid: Yes!
- Dad: So how many more bananas did we just count?
- Kid: We counted one more banana.
- Dad: So how many bananas have we counted so far?
- Kid: Seven.
- Dad: We looked at each fruit on the page. How many were bananas?
- Kid: Seven.
- Dad: Great job! Now you know how indicator random variables can be used to count.
In the story, the kid counted the bananas by examining each object, determining whether or not it was a banana, and then incrementing the banana counter by 1 for each object that was a banana (and by 0 for the objects that were not bananas). The kid essentially created an indicator (of “banana”) variable for each object on the page (\(I_{B_1}=1\), \(I_{B_2}=1\), \(I_{B_3}=1\), \(I_{B_4}=0\ldots\), \(I_{B_{20}}=1\)) and then summed these indicators to obtain the total count of bananas. This strategy gives a way of breaking down a complicated counting problem into smaller pieces and counting incrementally.
Example 2.23 illustrates that for two events \(A\) and \(B\) \[\begin{align*} \textrm{I}_{A^c} & = 1 - \textrm{I}_A & & \\ \textrm{I}_{A \cap B} & = \textrm{I}_A \textrm{I}_B & & =\min(\textrm{I}_A, \textrm{I}_B)\\ \textrm{I}_{A \cup B} & = \textrm{I}_A + \textrm{I}_B - \textrm{I}_{A \cap B} & & = \max(\textrm{I}_A, \textrm{I}_B) \end{align*}\]
In particular, the indicator of an intersection is the product of the indicators of each event. The \(\min, \max\), and product formulas work for more than two events, but the addition formula is more complicated15.
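These identities can be checked numerically outcome by outcome. A sketch using the two-roll dice sample space, with \(A\) and \(B\) chosen as example events (our choice, for illustration only):

```python
from itertools import product

Omega = set(product([1, 2, 3, 4], repeat=2))
A = {w for w in Omega if w[0] + w[1] == 4}   # example event: sum of rolls is 4
B = {w for w in Omega if max(w) == 3}        # example event: larger roll is 3
Ac = Omega - A                               # complement of A

def I(event, w):
    # indicator of an event, evaluated at outcome w
    return 1 if w in event else 0

# Verify each identity at every outcome
for w in Omega:
    assert I(Ac, w) == 1 - I(A, w)
    assert I(A & B, w) == I(A, w) * I(B, w) == min(I(A, w), I(B, w))
    assert I(A | B, w) == I(A, w) + I(B, w) - I(A & B, w) == max(I(A, w), I(B, w))
```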
2.3.5 Events involving random variables
Many events of interest involve random variables. The event “tomorrow’s high temperature is above 75°F” involves the random variable “tomorrow’s high temperature”. Each possible outcome of tomorrow’s weather conditions will correspond to a value of high temperature, but only some of these outcomes will result in values of high temperature above 75 °F.
The expressions \(X=x\) or \(\{X=x\}\) are shorthand for the event that the random variable \(X\) takes the value \(x\). Remember that any event is a collection of outcomes that satisfy some criteria, a subset of the sample space. So objects like \(\{X=x\}\) are sets representing the outcomes for which the value of the random variable \(X\) is equal to the number \(x\). Remember to think of the capital letter \(X\) as a label standing in for a formula like “the sum of two rolls of a four-sided die” and \(x\) as a dummy variable standing in for a particular value like 3.
| Outcome (First roll, second roll) | X (sum) | Y (max) |
|---|---|---|
| (1, 1) | 2 | 1 |
| (1, 2) | 3 | 2 |
| (1, 3) | 4 | 3 |
| (1, 4) | 5 | 4 |
| (2, 1) | 3 | 2 |
| (2, 2) | 4 | 2 |
| (2, 3) | 5 | 3 |
| (2, 4) | 6 | 4 |
| (3, 1) | 4 | 3 |
| (3, 2) | 5 | 3 |
| (3, 3) | 6 | 3 |
| (3, 4) | 7 | 4 |
| (4, 1) | 5 | 4 |
| (4, 2) | 6 | 4 |
| (4, 3) | 7 | 4 |
| (4, 4) | 8 | 4 |
When dealing with probabilities, it is common to write \(X=3\) instead of16 \(\{X=3\}\), and \(X = 4, Y = 3\) instead of \(\{X = 4\}\cap \{Y = 3\}\); read the comma in \(X = 4, Y = 3\) as “and”. But keep in mind that an expression like “\(X=3\)” really represents an event \(\{X=3\}\), a subset of outcomes of the sample space.
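For the dice example in the table above, event-sets like \(\{X=4\}\) can be enumerated directly; a minimal sketch:

```python
from itertools import product

Omega = list(product([1, 2, 3, 4], repeat=2))

def X(w):            # sum of the two rolls
    return w[0] + w[1]

def Y(w):            # larger of the two rolls
    return max(w)

# {X = 4}: the subset of outcomes whose sum is 4
X_equals_4 = {w for w in Omega if X(w) == 4}

# {X = 5, Y = 4} means {X = 5} intersect {Y = 4}
X5_and_Y4 = {w for w in Omega if X(w) == 5} & {w for w in Omega if Y(w) == 4}
```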
2.3.6 Outcomes, events, and random variables
Outcomes, events, and random variables are some of the main objects of probability. While they are related, these are distinct objects. Thinking in terms of a spreadsheet, an outcome is a row, an event is a subset of rows, and a random variable is a column. Mathematically, an outcome is a point, an event is a set, and a random variable is a function which outputs a number. As such, different operations are valid depending on what you’re dealing with. Don’t confuse operations like \(\cap\) that operate on sets (events, “and”) with operations like \(+\) that operate on numbers and functions (random variables, “plus” meaning addition).
2.3.7 Exercises
Exercise 2.5 Consider the outcome of a sequence of 4 flips of a coin. One random variable is \(X\), the number of heads flipped.
- Explain why \(X\) is a random variable.
- Evaluate each of the following: \(X(HHHH), X(HTHT), X(TTHH)\).
- Identify the possible values of \(X\). Why not let the sample space just consist of this set of possible values?
- What does \(4-X\) represent?
- What does \(X/4\) represent?
Exercise 2.6 The latest series of collectible Lego Minifigures contains 3 different Minifigure prizes (labeled 1, 2, 3). Each package contains a single unknown prize. Suppose we only buy 3 packages and we consider as our sample space outcome the results of just these 3 packages (prize in package 1, prize in package 2, prize in package 3). For example, 323 (or (3, 2, 3)) represents prize 3 in the first package, prize 2 in the second package, prize 3 in the third package. Let \(X\) be the number of distinct prizes obtained in these 3 packages. Let \(Y\) be the number of these 3 packages that contain prize 1.
The sample space consists of 27 outcomes, listed in the table below.
|  | 111 | 112 | 113 | 121 | 122 | 123 | 131 | 132 | 133 |
|---|---|---|---|---|---|---|---|---|---|
| \(X\) | | | | | | | | | |
| \(Y\) | | | | | | | | | |

|  | 211 | 212 | 213 | 221 | 222 | 223 | 231 | 232 | 233 |
|---|---|---|---|---|---|---|---|---|---|
| \(X\) | | | | | | | | | |
| \(Y\) | | | | | | | | | |

|  | 311 | 312 | 313 | 321 | 322 | 323 | 331 | 332 | 333 |
|---|---|---|---|---|---|---|---|---|---|
| \(X\) | | | | | | | | | |
| \(Y\) | | | | | | | | | |
- Use the table above and evaluate \(X\) and \(Y\) for each of the outcomes.
- Identify the possible values of \(X\).
- Identify the possible values of \(Y\).
- Identify the possible \((X, Y)\) pairs.
- Identify and interpret \(\{X = 1\}\).
- Identify and interpret \(\{X = 2\}\).
- Identify and interpret \(\{X = 3\}\).
- Identify and interpret \(\{Y = 0\}\).
- Identify and interpret \(\{Y = 1\}\).
- Identify and interpret \(\{Y = 2\}\).
- Identify and interpret \(\{Y = 3\}\).
- Identify and interpret \(\{X = 2, Y = 1\}\).
- Identify and interpret \(\{X = Y\}\).
- Let \(I_1\) be the indicator random variable that prize 1 is obtained (in at least one of the three packages). Identify and interpret \(\{I_1 = 0\}\).
- Let \(I_2\) be the indicator random variable that prize 2 is obtained (in at least one of the three packages), and similarly \(I_3\) for prize 3. What is the relationship between \(X\) and \(I_1, I_2, I_3\)?
- How can you write \(Y\) in terms of indicator random variables?
Exercise 2.7 Katniss throws a dart at a circular dartboard with radius 1 foot. (Assume that Katniss’s dart never misses the dartboard.) Let \(X\) be the distance (inches) from the location of the dart to the center of the dartboard.
- Identify (with a picture) and interpret \(\{X \le 1\}\)
- Identify (with a picture) and interpret \(\{1 < X < 2\}\)
- Identify (with a picture) and interpret \(\{X > 11\}\)
- Identify (with a picture) and interpret \(\{X = 0\}\)
- Identify (with a picture) and interpret \(\{X = 1\}\)
2.4 Probability spaces
In this chapter we have defined outcomes, events, and random variables, the main mathematical objects associated with a random phenomenon. But we haven’t actually computed any probabilities yet! So far we have only been concerned with what is possible. You might have noticed that the examples often did not include any assumptions like “the die is fair”, “each object is equally likely to be put in any spot”, or “Regina is more likely to arrive late and Cady is more likely to arrive early”. Now we will start to incorporate assumptions about the random phenomenon to determine how probable various events are.
2.4.1 Probability measures
As we saw in Section 1.3, there are some basic logical consistency requirements that probabilities must satisfy, which are formalized in three “axioms”.
Definition 2.6 A probability measure, typically denoted \(\textrm{P}\), assigns probabilities to events to quantify their relative likelihoods, plausibilities, or degrees of uncertainty according to the assumptions of the model of the random phenomenon. The probability of event17 \(A\) is denoted \(\textrm{P}(A)\).
Any valid probability measure must satisfy the following axioms.
- For any event \(A\), \(0 \le \textrm{P}(A) \le 1\).
- If \(\Omega\) represents the sample space then \(\textrm{P}(\Omega) = 1\).
- Countable additivity. If \(A_1, A_2, A_3, \ldots\) are disjoint events (recall Definition 2.3), then \[ \textrm{P}(A_1 \cup A_2 \cup A_3 \cup \cdots) = \textrm{P}(A_1) + \textrm{P}(A_2) +\textrm{P}(A_3) + \cdots \]
An event \(A\) is something that can happen or can be true; \(\textrm{P}(A)\) quantifies how likely it is that \(A\) will happen or how plausible it is that \(A\) is true. Probabilities are always defined for events (sets) but remember that many events are defined in terms of random variables. For example, if \(X\) is tomorrow’s high temperature (degrees F) we might be interested in \(\textrm{P}(\{X>80\})\), the probability of the event that tomorrow’s high temperature is above 80 degrees F. If \(Y\) is the amount of rainfall tomorrow (inches) we might be interested in \(\textrm{P}(\{X > 80\}\cap \{Y < 2\})\), the probability of the event that tomorrow’s high temperature is above 80 degrees F and the amount of rainfall is less than 2 inches. To simplify notation, it is common to write \(\textrm{P}(X>80)\) instead of \(\textrm{P}(\{X>80\})\), or \(\textrm{P}(X > 80, Y < 2)\) instead of \(\textrm{P}(\{X > 80\}\cap \{Y < 2\})\). Read the comma in \(\textrm{P}(X > 80, Y < 2)\) as “and”. But keep in mind that an expression like “\(X>80\)” really represents an event \(\{X>80\}\).
The three axioms require that probabilities of different events must fit together in a logically coherent way.
The requirement \(0\le \textrm{P}(A)\le 1\) makes sense in light of the relative frequency interpretation: an event \(A\) can not occur on more than 100% of repetitions or less than 0% of repetitions of the random phenomenon.
The requirement that \(\textrm{P}(\Omega)=1\) just ensures that the sample space accounts for all of the possible outcomes. Basically, \(\textrm{P}(\Omega)=1\) says that on any repetition of the random phenomenon, “something has to happen”. Roughly, \(\textrm{P}(\Omega)=1\) implies that all outcomes taken together need to account for 100% of the probability. If \(\textrm{P}(\Omega)\) were less than 1, then the sample space hasn’t accounted for all of the possible outcomes.
Event \(A_1 \cup A_2 \cup \cdots\) is the event that \(A_1\) occurs OR \(A_2\) occurs OR… In other words, \(A_1 \cup A_2 \cup \cdots\) is the event that at least one of the \(A_i\)’s occurs. Countable additivity says that as long as events share no outcomes in common, then the probability that at least one of the events occurs is equal to the sum of the probabilities of the individual events. In Example 1.7, the events \(B\)=“the Braves win the 2023 World Series” and \(A\)=“the Rays win the 2023 World Series” are disjoint, \(A\cap B = \emptyset\); in a single World Series, both teams cannot win. If \(\textrm{P}(B) = 0.19\) and \(\textrm{P}(A) = 0.16\), then the probability of \(A\cup B\), the event that either the Rays or the Braves win, must be \(\textrm{P}(A\cup B)=0.35\).
Countable additivity can be understood through a diagram with areas representing probabilities, as in the figure below which represents two events (yellow / and blue \). On the left, there is no “overlap” between areas so the total area is the sum of the two pieces; this depicts countable additivity for two disjoint events. On the right, there is overlap between the two areas, so simply adding the two areas “double counts” the intersection (green \(\times\)) and does not result in the correct total area. Countable additivity applies to any countable number18 of events, as long as there is no “overlap”.
The three axioms of a probability measure are simply minimal logical consistency requirements that must be satisfied by any probability model to ensure that probabilities fit together in a coherent way. There are also many physical aspects of the random phenomenon or assumptions (e.g. “fairness”, independence, conditional relationships) that must be considered when determining a reasonable probability measure for a particular situation. Sometimes \(\textrm{P}(A)\) is defined explicitly for an event \(A\) via a formula. But it is much more common for a probability measure to be defined only implicitly through modeling assumptions; probabilities of events then follow from the axioms and related properties.
2.4.2 Some probability measures for a four-sided die
Consider a single roll of a four-sided die. The sample space consists of four possible outcomes, \(\Omega = \{1, 2, 3, 4\}\). Events concern what might happen on a single roll. For example, if \(A\) is the event that we roll an odd number then \(A = \{1, 3\}\); “roll an odd number” occurs if we roll a 1, so 1 is in \(A\), and it also occurs if we roll a 3, so 3 is in \(A\). Table 2.6 lists the collection of all events.
| Event | Description | Probability of event assuming a fair die |
|---|---|---|
| \(\emptyset\) | Roll nothing (not possible) | 0 |
| \(\{1\}\) | Roll a 1 | 1/4 |
| \(\{2\}\) | Roll a 2 | 1/4 |
| \(\{3\}\) | Roll a 3 | 1/4 |
| \(\{4\}\) | Roll a 4 | 1/4 |
| \(\{1, 2\}\) | Roll a 1 or a 2 | 2/4 |
| \(\{1, 3\}\) | Roll a 1 or a 3 | 2/4 |
| \(\{1, 4\}\) | Roll a 1 or a 4 | 2/4 |
| \(\{2, 3\}\) | Roll a 2 or a 3 | 2/4 |
| \(\{2, 4\}\) | Roll a 2 or a 4 | 2/4 |
| \(\{3, 4\}\) | Roll a 3 or a 4 | 2/4 |
| \(\{1, 2, 3\}\) | Roll a 1, 2, or 3 (a.k.a. do not roll a 4) | 3/4 |
| \(\{1, 2, 4\}\) | Roll a 1, 2, or 4 (a.k.a. do not roll a 3) | 3/4 |
| \(\{1, 3, 4\}\) | Roll a 1, 3, or 4 (a.k.a. do not roll a 2) | 3/4 |
| \(\{2, 3, 4\}\) | Roll a 2, 3, or 4 (a.k.a. do not roll a 1) | 3/4 |
| \(\{1, 2, 3, 4\}\) | Roll something | 1 |
When outcomes are equally likely, we find the probability of an event by counting the number of outcomes that satisfy the event.
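A minimal sketch of this counting rule as a Python set function for the fair four-sided die:

```python
Omega = {1, 2, 3, 4}

def P(A):
    # equally likely outcomes:
    # (number of outcomes in A) / (number of outcomes in Omega)
    return len(A) / len(Omega)

p_odd = P({1, 3})    # probability of rolling an odd number
```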
The probability measure \(\textrm{P}\) in Example 2.28 satisfies all the axioms and so it is a valid probability measure. However, assuming that the outcomes are equally likely is a much stricter condition than the basic logical consistency requirements of the axioms. There are many other possible probability measures, as in the following example.
| Event | Description | Probability of event assuming a particular weighted die |
|---|---|---|
| \(\emptyset\) | Roll nothing (not possible) | 0 |
| \(\{1\}\) | Roll a 1 | 0.1 |
| \(\{2\}\) | Roll a 2 | 0.2 |
| \(\{3\}\) | Roll a 3 | 0.3 |
| \(\{4\}\) | Roll a 4 | 0.4 |
| \(\{1, 2\}\) | Roll a 1 or a 2 | 0.3 |
| \(\{1, 3\}\) | Roll a 1 or a 3 | 0.4 |
| \(\{1, 4\}\) | Roll a 1 or a 4 | 0.5 |
| \(\{2, 3\}\) | Roll a 2 or a 3 | 0.5 |
| \(\{2, 4\}\) | Roll a 2 or a 4 | 0.6 |
| \(\{3, 4\}\) | Roll a 3 or a 4 | 0.7 |
| \(\{1, 2, 3\}\) | Roll a 1, 2, or 3 (a.k.a. do not roll a 4) | 0.6 |
| \(\{1, 2, 4\}\) | Roll a 1, 2, or 4 (a.k.a. do not roll a 3) | 0.7 |
| \(\{1, 3, 4\}\) | Roll a 1, 3, or 4 (a.k.a. do not roll a 2) | 0.8 |
| \(\{2, 3, 4\}\) | Roll a 2, 3, or 4 (a.k.a. do not roll a 1) | 0.9 |
| \(\{1, 2, 3, 4\}\) | Roll something | 1 |
The symbol \(\textrm{P}\) is more than just shorthand for the word “probability”. \(\textrm{P}\) denotes the underlying probability measure, which represents all the assumptions about the random phenomenon. Changing assumptions results in a change of the probability measure and a different probability model. We often consider several probability measures for the same sample space and collection of events; these several measures represent different sets of assumptions or available information and different probability models.
The probability measure \(\textrm{P}\) in Example 2.28 corresponds to the assumption of a fair die (equally likely outcomes). With this measure \(\textrm{P}(A) = 2/4=0.5\) for \(A = \{1, 3\}\). But under the probability measure \(\textrm{Q}\) corresponding to the weighted die in Example 2.29, \(\textrm{Q}(A) = 0.4\). The outcomes and events are the same in both scenarios, because both scenarios involve a four-sided die. What is different is the probability measure that assigns probabilities to the events. One scenario assumes the die is fair while the other assumes the die has a particular weighting, resulting in two different probability measures.
Both probability measures \(\textrm{P}\) and \(\textrm{Q}\) can be written as explicit set functions: for an event \(A\)
\[\begin{align*} \textrm{P}(A) & = \frac{\text{number of outcomes that satisfy $A$}}{4}, & & {\text{a fair four-sided die}} \\ \textrm{Q}(A) & = \frac{\text{sum of elements in $A$}}{10}, & & {\text{a specific weighted four-sided die}} \end{align*}\]
We provide the above descriptions to illustrate that a probability measure operates on sets. However, in many situations there does not exist a simple closed form expression for the set function defining the probability measure which maps events to probabilities.
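Both set functions are easy to write in code; a sketch comparing the two measures on the same event (the event \(A = \{1, 3\}\) is from the discussion above):

```python
import math

Omega = {1, 2, 3, 4}

def P(A):            # fair die: count the outcomes in A
    return len(A) / len(Omega)

def Q(A):            # the weighted die of Example 2.29: sum the elements of A
    return sum(A) / 10

A = {1, 3}           # roll an odd number
# same event, different measures: P(A) = 0.5 but Q(A) = 0.4

# spot-check additivity for the disjoint events {1} and {3}
assert math.isclose(Q({1} | {3}), Q({1}) + Q({3}))
```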
| Event | Description | Probability of event assuming a particular weighted die |
|---|---|---|
| \(\emptyset\) | Roll nothing (not possible) | 0 |
| \(\{1\}\) | Roll a 1 | 4/15 |
| \(\{2\}\) | Roll a 2 | 6/15 |
| \(\{3\}\) | Roll a 3 | 3/15 |
| \(\{4\}\) | Roll a 4 | 2/15 |
| \(\{1, 2\}\) | Roll a 1 or a 2 | 10/15 |
| \(\{1, 3\}\) | Roll a 1 or a 3 | 7/15 |
| \(\{1, 4\}\) | Roll a 1 or a 4 | 6/15 |
| \(\{2, 3\}\) | Roll a 2 or a 3 | 9/15 |
| \(\{2, 4\}\) | Roll a 2 or a 4 | 8/15 |
| \(\{3, 4\}\) | Roll a 3 or a 4 | 5/15 |
| \(\{1, 2, 3\}\) | Roll a 1, 2, or 3 (a.k.a. do not roll a 4) | 13/15 |
| \(\{1, 2, 4\}\) | Roll a 1, 2, or 4 (a.k.a. do not roll a 3) | 12/15 |
| \(\{1, 3, 4\}\) | Roll a 1, 3, or 4 (a.k.a. do not roll a 2) | 9/15 |
| \(\{2, 3, 4\}\) | Roll a 2, 3, or 4 (a.k.a. do not roll a 1) | 11/15 |
| \(\{1, 2, 3, 4\}\) | Roll something | 1 |
The die rolling example is not the most exciting or practical scenario. But the example does illustrate the idea of several probability measures, each corresponding to a different set of assumptions about the random phenomenon. If it’s difficult to imagine how to physically weight a die in these particular ways, consider the spinners (like from a kids’ game) in Figure 2.9.
It is usually reasonable to assume that dice are fair, but most real world situations are not as simple as rolling dice. Just because a situation has 16 possible outcomes doesn’t mean the outcomes have to be equally likely. For example, there might be 12 contestants on your favorite reality competition show, but that doesn’t mean that all of the 12 contestants are equally likely to win the season.
2.4.3 Some probability measures in the meeting problem
Recall the meeting problem. The general problem involves multiple people, but we’ll first consider the arrival time of just a single person, who we’ll call Han22.
Suppose that Han’s arrival time will definitely be between noon and 1:00, so that the sample space—with time measured in minutes after noon, including fractions of a minute—is \(\Omega = [0, 60]\).
Example 2.28 illustrated that for a finite sample space with equally likely outcomes, computing the probability of an event reduces to counting the number of outcomes that satisfy the event and dividing by the total number of possible outcomes. The continuous analog of equally likely outcomes is a uniform probability measure. When the sample space is uncountable, size is measured continuously (length, area, volume) rather than discretely (counting).
\[ \textrm{P}(A) = \frac{\text{size of } A}{\text{size of } \Omega} \qquad \text{if $\textrm{P}$ is a uniform probability measure} \]
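For interval events in Han’s arrival problem, this uniform measure reduces to comparing lengths; a minimal sketch (the particular intervals are our own examples):

```python
# Uniform probability measure on Omega = [0, 60] (minutes after noon):
# the probability of an interval event is its length divided by 60
def P_interval(a, b):
    a, b = max(a, 0.0), min(b, 60.0)     # clip to the sample space
    return max(b - a, 0.0) / 60

p_first_quarter = P_interval(0, 15)   # Han arrives between 12:00 and 12:15
p_exact_noon = P_interval(0, 0)       # a single exact time has probability 0
p_everything = P_interval(0, 60)      # Omega itself has probability 1
```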
The uniform probability measure in Example 2.32 is just one probability measure for Han’s arrival, reflecting an assumption that Han is “equally likely” to arrive at any time between noon and 1:00. Now we’ll model Han’s arrival time with a non-uniform probability measure which reflects that he is more likely to arrive near certain times than others.
The probability measure in Example 2.33 is a non-uniform measure. Han is much more likely to arrive between 12:45 and 1:00 than between 12:00 and 12:15, even though both these intervals have the same length.
The last part in Example 2.34 might seem counterintuitive at first. There was nothing special about 12:00; pick any precise time in the continuous interval from noon to 1:00, and the probability that Han arrives at that exact time, with infinite precision, is 0. This idea can be understood as a limit. The probability that Han arrives within one minute of the specified time is small, within one second of the specified time is even smaller, within one millisecond of the specified time is even smaller still; with infinite precision these time increments can get smaller and smaller indefinitely. Of course, infinite precision is not practical, but assuming the possible arrival times are represented by a continuous interval provides a reasonable mathematical model. Even though any particular time has probability 0 of being the precise arrival time, intervals of time still have positive probability of containing the arrival time. When we ask a question like “what is the probability that Han arrives at noon”, “at noon” really means “within 1 minute of noon” or “within 1 second of noon” or within whatever degree of precision is good enough for our practical purposes, and such intervals have non-zero probability.
Continuous sample spaces introduce some complications that we didn’t encounter when dealing with discrete sample spaces. For a continuous sample space, the probability of any particular outcome24 is 0. However, Example 2.35 illustrates that in some sense certain outcomes can be more likely than others; Han is more likely to arrive close to 1:00 than close to noon. For continuous sample spaces it makes more sense to consider “close to” probabilities rather than “equals to” probabilities. We will investigate related ideas in much more detail as we go.
Now we’ll return to the two-person (Regina, Cady) meeting problem from Example 2.3, with sample space depicted in Figure 2.2. We will use pictures to represent a few probability measures corresponding to different assumptions about the arrival times. In the pictures below, lighter colors represent regions of outcomes that are more likely; darker colors, less likely.
Figure 2.12 corresponds to a uniform probability measure under which all outcomes are “equally likely”. This probability measure would be appropriate if we assume that Regina and Cady each arrive at a time uniformly at random between noon and 1, independently of each other.
Example 2.37 and Example 2.34 illustrate similar ideas. Regardless of the precise time in the continuous interval \([0, 60]\) at which Regina arrives, the probability that Cady arrives at that exact time, with infinite precision, is 0. In practice, if we’re interested in “the probability that Regina and Cady arrive at the same time”, we really mean “close enough to the same time”, where “close enough” could be within one minute or one second or whatever degree of precision is good enough for practical purposes.
Most random phenomena do not involve equally likely outcomes or uniform probability measures. Even when the underlying outcomes are equally likely, the values of related random variables are usually not. Therefore, most interesting probability problems involve “non-uniform” probability measures.
Figure 2.14 corresponds to one non-uniform probability measure for the two-person meeting problem; certain outcomes are more likely than others. (Lighter colors represent regions of outcomes that are more likely; darker colors, less likely.) Such a probability measure would be appropriate if we assume that Regina and Cady each are more likely to arrive around 12:30 than noon or 1:00, independently of each other. Switching from the uniform probability measure represented by Figure 2.12 to the non-uniform one represented by Figure 2.14 would change the probability of the events in Example 2.36 and Example 2.37. (We’ll see how to compute probabilities for non-uniform measures later.)
Figure 2.15 corresponds to another “non-uniform” probability measure. Such a probability measure would be appropriate if we assume that Regina and Cady each are more likely to arrive around 12:30 than noon or 1:00, but they coordinate their arrivals so they are more likely to arrive around the same time.
There are many other probability measures for the meeting problem, representing different sets of assumptions. Each probability measure assigns a probability to events like “Cady arrives first”, “both arrive before 12:20”, and “the first person to arrive has to wait less than 15 minutes for the second to arrive”, and these probabilities can differ between models.
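Under the uniform measure of Figure 2.12, probabilities of events like these can be estimated by simulation. The following Python sketch is our own illustration (the seed and variable names are assumptions, not the text's code); it simulates independent uniform arrival times for Regina and Cady, in minutes after noon:

```python
import random

random.seed(2025)  # illustrative seed for reproducibility
n = 100_000
cady_first = both_before_20 = wait_less_15 = 0
for _ in range(n):
    r = random.uniform(0, 60)  # Regina's arrival, minutes after noon
    c = random.uniform(0, 60)  # Cady's arrival, minutes after noon
    cady_first += (c < r)
    both_before_20 += (r < 20 and c < 20)
    wait_less_15 += (abs(r - c) < 15)

print(cady_first / n)      # theoretical value 1/2, by symmetry
print(both_before_20 / n)  # theoretical value (20/60)**2 = 1/9
print(wait_less_15 / n)    # theoretical value 1 - (45/60)**2 = 7/16
```

The third event is the “first person to arrive waits less than 15 minutes” event; its theoretical value comes from the size-of-region formula applied to the band \(|r - c| < 15\) inside the 60-by-60 square.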
2.4.4 Some properties of probability measures
Many other properties follow from the axioms, some of which we state below. Don’t let notation or names like the “complement rule” confuse you. We have already successfully used all of the properties below intuitively when working with two-way tables. All that is new in this section is mathematical formalism. Yes, getting comfortable with proper notation is part of learning the language of probability. But don’t let formality get in the way of your intuition. Continue to use the ideas from Chapter 1, including tools like two-way tables.
The main “meat” of the axioms is countable additivity. Thus, the key to many proofs of probability properties is to express relevant events in terms of unions of disjoint events. (Proofs are included in the footnotes.)
Lemma 2.2 (Complement rule) For any event25 \(A\), \(\textrm{P}(A^c) = 1 - \textrm{P}(A)\).
The complement rule follows from the fact that an event either happens or it doesn’t. We’ll see that it is sometimes more convenient to compute directly the probability that an event does not happen and then use the complement rule.
Lemma 2.3 (Subset rule) If \(A \subseteq B\) then26 \(\textrm{P}(A) \le \textrm{P}(B)\).
The subset rule says that if every outcome that satisfies event \(A\) also satisfies event \(B\) then the probability of event \(B\) must be at least as large as the probability of event \(A\). We saw an application of the subset rule in Example 1.9.
Lemma 2.4 (Addition rule for two events) If \(A\) and \(B\) are any two events then27
\[\begin{align*} \textrm{P}(A\cup B) = \textrm{P}(A) + \textrm{P}(B) - \textrm{P}(A \cap B) \end{align*}\]
The addition rule for more than two events is complicated28 (unless the events are disjoint). For example, the addition rule for three events is \[\begin{align*} \textrm{P}(A\cup B\cup C) & = \textrm{P}(A) + \textrm{P}(B) + \textrm{P}(C)\\ & \qquad - \textrm{P}(A\cap B) - \textrm{P}(A \cap C) - \textrm{P}(B \cap C)\\ & \qquad + \textrm{P}(A \cap B \cap C). \end{align*}\]
Many problems involve finding the “probability of at least one…”. On the surface such problems involve unions (at least one of the events \(A_1, A_2, \ldots\) occurs if event \(A_1\) occurs OR event \(A_2\) occurs OR …). Since the general addition rule for multiple events is complicated, unless the events are disjoint it is usually more convenient to use the complement rule and compute the “probability of at least one…” as one minus the “probability of none…”. The “probability of none…” involves intersections (none of the events \(A_1, A_2, \ldots\) occur if event \(A_1\) does not occur AND event \(A_2\) does not occur AND …). We will see more about probabilities of intersections later.
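As a small concrete illustration of this complement strategy (our own sketch, using four flips of a fair coin): the probability of at least one head is one minus the probability of no heads, and naively adding the four individual probabilities would overshoot.

```python
from itertools import product

# Enumerate the 16 equally likely outcomes of 4 flips of a fair coin.
outcomes = list(product("HT", repeat=4))

p_none = sum(1 for o in outcomes if "H" not in o) / len(outcomes)  # only TTTT
p_at_least_one = 1 - p_none   # complement rule

print(p_at_least_one)  # 0.9375 = 15/16
# Compare: adding P(head on flip i) over the four flips gives 4 * (1/2) = 2,
# which is not even a valid probability -- the events are not disjoint.
```
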
Lemma 2.5 (Law of total probability) If \(C_1, C_2, C_3\ldots\) are disjoint events with \(C_1\cup C_2 \cup C_3\cup \cdots =\Omega\), then29
\[\begin{align*} \textrm{P}(A) & = \textrm{P}(A \cap C_1) + \textrm{P}(A \cap C_2) + \textrm{P}(A \cap C_3) + \cdots \end{align*}\]
Since \(C\) and \(C^c\) are disjoint with \(C \cup C^c = \Omega\), a special case is
\[\begin{align*} \textrm{P}(A) & = \textrm{P}(A \cap C) + \textrm{P}(A \cap C^c) \end{align*}\]
In the law of total probability the events \(C_1, C_2, C_3, \ldots\), which represent “cases”, form a partition of the sample space; each outcome in the sample space satisfies exactly one of the cases \(C_i\). The law of total probability says that we can compute the “overall” probability \(\textrm{P}(A)\) by breaking \(A\) down into pieces and then summing the case-by-case probabilities \(\textrm{P}(A\cap C_i)\). We use the law of total probability intuitively when we sum across rows and columns in two-way tables. (Later we will see a different and more useful expression of the law of total probability, involving conditional probabilities.)
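The law of total probability is easy to check by enumeration in the running dice example. The sketch below (illustrative Python, not the text's code) computes the probability that the sum of two fair four-sided rolls is 5, partitioning by the value of the first roll:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 5), repeat=2))  # 16 equally likely outcomes

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: sum(o) == 5                                # event of interest
cases = [lambda o, i=i: o[0] == i for i in range(1, 5)]  # partition: first roll = i

# Law of total probability: P(A) = sum over cases of P(A and C_i)
total = sum(prob(lambda o, C=C: A(o) and C(o)) for C in cases)
assert total == prob(A)
print(total)  # 1/4
```
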
The following example is one we basically covered before in Example 1.15; the ideas are the same, but now we use mathematical notation and properties.
The following example involves randomly selecting a U.S. household. Note that while “randomly select” is commonly used terminology, it is not the best wording. Remember that “random” simply means uncertain, so technically “randomly select” just means selecting in a way that the outcome is uncertain. Suppose I want to “randomly select” one of two households, A or B. I could put 10 tickets in a hat, with 9 labeled A and 1 labeled B, and then draw a ticket; this is random selection because the outcome of the draw is uncertain. However, what is often meant by “randomly select” is selecting in a way that each outcome is equally likely. To give households A and B the same chance of being selected, I would put a single ticket for each in the hat. Randomly selecting in a way that each outcome is equally likely could be described more precisely as “selecting uniformly at random”. (We will discuss equally likely outcomes in more detail later.)
Probabilities involving multiple events, such as \(\textrm{P}(A \cap B)\) or \(\textrm{P}(X>80, Y<2)\), are often called joint probabilities. Note that the axioms do not specify any direct requirements on probabilities of intersections. In particular, it is not necessarily true that \(\textrm{P}(A\cap B)\) equals \(\textrm{P}(A)\textrm{P}(B)\). It is true that probabilities of intersections can be obtained by multiplying, but the product generally involves at least one conditional probability that reflects any association between the events involved. In general, joint probabilities (\(\textrm{P}(A \cap B)\)) cannot be computed from the individual probabilities (\(\textrm{P}(A)\), \(\textrm{P}(B)\)) alone. We will explore this topic in more depth later.
2.4.5 Probability models
A probability model (or probability space) puts all the objects we have seen so far in this chapter together in a model for the random phenomenon. Think of a probability model31 as the collection of all outcomes, events, and random variables associated with a random phenomenon along with the probabilities of all events of interest (and distributions of random variables) under the assumptions of the model.
There will be many probability measures that satisfy the logical consistency requirements of the probability axioms. Which one is most appropriate depends on the assumptions about the random phenomenon. We will study a variety of commonly used probability models throughout the book.
Perhaps the concept of multiple potential probability measures is easier to understand in a subjective probability situation. For example, each model that is used to forecast the 2024-2025 NFL season corresponds to a probability measure which assigns probabilities to events like “the Eagles win the 2025 Superbowl”. Different sets of assumptions and models can assign different probabilities for the same events. As another example, the weather forecaster on one local news station might report that the probability of rain tomorrow is 0.6, while an online source might report it as 0.5. Each weather forecasting model corresponds to a different probability measure which encodes a set of assumptions about the random phenomenon.
Before moving on, we want to reiterate: Most random phenomena do not involve equally likely outcomes or uniform probability measures. Even when the underlying outcomes are equally likely, the values of related random variables are usually not. Equally likely outcomes and uniform probability measures are the simplest probability measures, and therefore are the ones we typically encounter first. But don’t let that fool you; most interesting probability problems involve non-equally likely outcomes or non-uniform probability measures.
It’s easy to get confused between things like events, random variables, and probabilities, and the symbols that represent them. But a strong understanding of these fundamental concepts will help you solve probability problems. Examples like the following do more than encourage proper use of notation. Explaining to Donny why he is wrong will help you better understand the objects that symbols represent, how they are different from one another, and how they connect to real-world contexts.
2.4.6 Exercises
Exercise 2.8 Consider the matching problem with \(n=4\): objects labeled 1, 2, 3, 4 are placed at random in spots labeled 1, 2, 3, 4, with spot 1 the correct spot for object 1, etc. Recall the sample space from Table 2.2. Let the random variable \(X\) count the number of objects that are put back in the correct spot; recall Table 2.9. Let \(\textrm{P}\) denote the probability measure corresponding to the assumption that the objects are equally likely to be placed in any spot, so that the 24 possible placements are equally likely.
- Compute and interpret \(\textrm{P}(X=0)\).
- Compute and interpret \(\textrm{P}(X \ge 1)\).
- Let \(C_1\) be the event that object 1 is put correctly in spot 1. Find \(\textrm{P}(C_1)\).
- Let \(C_2\) be the event that object 2 is put correctly in spot 2. Find \(\textrm{P}(C_2)\).
- Define \(C_3\) and \(C_4\) similarly. Represent the event \(\{X \ge 1\}\) in terms of \(C_1, C_2, C_3, C_4\).
- Find and interpret \(\textrm{P}(C_1\cap C_2 \cap C_3 \cap C_4)\).
- Donny Don’t says: “\(\textrm{P}(C_1 \cup C_2 \cup C_3 \cup C_4)\) is equal to \(\textrm{P}(C_1)+\textrm{P}(C_2)+\textrm{P}(C_3)+\textrm{P}(C_4)\).” Explain to Donny his mistake.
- Donny Don’t says: “ok, the events are not disjoint so then by the general addition rule \(\textrm{P}(C_1 \cup C_2 \cup C_3 \cup C_4)\) is equal to \(\textrm{P}(C_1)+\textrm{P}(C_2)+\textrm{P}(C_3)+\textrm{P}(C_4)-\textrm{P}(C_1\cap C_2 \cap C_3 \cap C_4)\).” Explain to Donny his mistake.
Exercise 2.9 Consider the outcome of a sequence of 4 flips of a coin. Assume that the coin is fair so that all 16 possible outcomes are equally likely, and let \(\textrm{P}\) be the corresponding probability measure. Let \(X\) be the number of heads flipped and let \(Y=4-X\).
- Compute \(\textrm{P}(X=1)\).
- Compute \(\textrm{P}(X = x)\) for each \(x = 0, 1, 2, 3, 4\).
- Compute \(\textrm{P}(Y=1)\).
- Compute \(\textrm{P}(Y = y)\) for each \(y = 0, 1, 2, 3, 4\).
- Compute \(\textrm{P}(X = Y)\).
Exercise 2.10 The latest series of collectible Lego Minifigures contains 3 different Minifigure prizes (labeled 1, 2, 3). Each package contains a single unknown prize. Suppose we only buy 3 packages and we consider as our sample space outcome the results of just these 3 packages (prize in package 1, prize in package 2, prize in package 3). For example, 323 (or (3, 2, 3)) represents prize 3 in the first package, prize 2 in the second package, prize 3 in the third package. Suppose that each package is equally likely to contain any of the 3 prizes, regardless of the contents of other packages, so that there are 27 equally likely outcomes, and let \(\textrm{P}\) be the corresponding probability measure.
- Let \(A_1\) be the event that prize 1 is obtained—that is, at least one of the packages contains prize 1—and define \(A_2, A_3\) similarly for prize 2, 3.
- Let \(B_1\) be the event that only prize 1 is obtained—that is, all three packages contain prize 1—and define \(B_2, B_3\) similarly for prize 2, 3.
- Compute \(\textrm{P}(A_1)\)
- Compute \(\textrm{P}(B_1)\)
- Interpret the values from parts 1 and 2 as long run relative frequencies.
- Interpret the values from parts 1 and 2 as relative likelihoods.
- Compute \(\textrm{P}(A_1 \cap A_2 \cap A_3)\)
- Compute \(\textrm{P}(A_1 \cup A_2 \cup A_3)\)
- Compute \(\textrm{P}(B_1 \cap B_2 \cap B_3)\)
- Compute \(\textrm{P}(B_1 \cup B_2 \cup B_3)\)
Exercise 2.11 The latest series of collectible Lego Minifigures contains 3 different Minifigure prizes (labeled 1, 2, 3). Each package contains a single unknown prize. Suppose we only buy 3 packages and we consider as our sample space outcome the results of just these 3 packages (prize in package 1, prize in package 2, prize in package 3). For example, 323 (or (3, 2, 3)) represents prize 3 in the first package, prize 2 in the second package, prize 3 in the third package. Suppose that each package is equally likely to contain any of the 3 prizes, regardless of the contents of other packages, so that there are 27 equally likely outcomes, and let \(\textrm{P}\) be the corresponding probability measure.
Let \(X\) be the number of distinct prizes obtained in these 3 packages. Let \(Y\) be the number of these 3 packages that contain prize 1.
The sample space consists of 27 outcomes, listed in the table below.
| | 111 | 112 | 113 | 121 | 122 | 123 | 131 | 132 | 133 |
|---|---|---|---|---|---|---|---|---|---|
| \(X\) | | | | | | | | | |
| \(Y\) | | | | | | | | | |

| | 211 | 212 | 213 | 221 | 222 | 223 | 231 | 232 | 233 |
|---|---|---|---|---|---|---|---|---|---|
| \(X\) | | | | | | | | | |
| \(Y\) | | | | | | | | | |

| | 311 | 312 | 313 | 321 | 322 | 323 | 331 | 332 | 333 |
|---|---|---|---|---|---|---|---|---|---|
| \(X\) | | | | | | | | | |
| \(Y\) | | | | | | | | | |
- Compute \(\textrm{P}(X = 1)\).
- Compute \(\textrm{P}(X = 2)\).
- Compute \(\textrm{P}(X = 3)\).
- Interpret the values in parts 1 through 3 as long run relative frequencies.
- Interpret the values in parts 1 through 3 as relative likelihoods.
- Compute \(\textrm{P}(Y = y)\) for each possible value \(y\).
- Compute \(\textrm{P}(X = 2, Y = 1)\).
- Compute \(\textrm{P}(X = Y)\).
Exercise 2.12 Katniss throws a dart at a circular dartboard with radius 1 foot. Suppose that the dart lands uniformly at random anywhere on the dartboard, and let \(\textrm{P}\) be the corresponding probability measure.
- Compute \(\textrm{P}(A)\), where \(A\) is the event that Katniss’s dart lands within 1 inch of the center of the dartboard.
- Compute \(\textrm{P}(B)\), where \(B\) is the event that Katniss’s dart lands more than 1 inch but less than 2 inches away from the center of the dartboard.
- Compute \(\textrm{P}(E)\), where \(E\) is the event that Katniss’s dart lands within 1 inch of the outside edge of the dartboard.
- Interpret the previous probabilities as long run relative frequencies.
- Interpret the previous probabilities as relative likelihoods.
Exercise 2.13 Katniss throws a dart at a circular dartboard with radius 1 foot. Suppose that the dart lands uniformly at random anywhere on the dartboard, and let \(\textrm{P}\) be the corresponding probability measure.
Let \(X\) be the distance (inches) from the location of the dart to the center of the dartboard.
- Compute \(\textrm{P}(X \le 1)\)
- Compute \(\textrm{P}(1 < X < 2)\)
- Compute \(\textrm{P}(X > 11)\)
Exercise 2.14 Katniss throws a dart at a circular dartboard with radius 1 foot. Suppose that the dart lands uniformly at random anywhere on the dartboard, and let \(\textrm{P}\) be the corresponding probability measure.
Let \(X\) be the distance (inches) from the location of the dart to the center of the dartboard.
- Compute \(\textrm{P}(X \le 0.1)\)
- Compute \(\textrm{P}(X \le 0.01)\)
- Compute \(\textrm{P}(X = 0)\)
- Compute \(\textrm{P}(X \ge 11.9)\)
- Compute \(\textrm{P}(X \ge 11.99)\)
- Compute \(\textrm{P}(X = 12)\)
- Which is more likely: the dart lands exactly in the center or the dart lands exactly on the edge? Discuss.
- Which is more likely: the dart lands close to the center or the dart lands close to the edge? Discuss.
2.5 Distributions of random variables (a brief introduction)
Even when outcomes of a random phenomenon are equally likely, values of related random variables are usually not. The probability distribution of a random variable describes the possible values that the random variable can take and their relative likelihoods or plausibilities. We will see several ways of summarizing and describing distributions throughout the book; this section only provides a brief introduction.
| x | P(X=x) |
|---|---|
| 2 | 0.0625 |
| 3 | 0.1250 |
| 4 | 0.1875 |
| 5 | 0.2500 |
| 6 | 0.1875 |
| 7 | 0.1250 |
| 8 | 0.0625 |
| y | P(Y=y) |
|---|---|
| 1 | 0.0625 |
| 2 | 0.1875 |
| 3 | 0.3125 |
| 4 | 0.4375 |
| (x, y) | P(X = x, Y = y) |
|---|---|
| (2, 1) | 0.0625 |
| (3, 2) | 0.1250 |
| (4, 2) | 0.0625 |
| (4, 3) | 0.1250 |
| (5, 3) | 0.1250 |
| (5, 4) | 0.1250 |
| (6, 3) | 0.0625 |
| (6, 4) | 0.1250 |
| (7, 4) | 0.1250 |
| (8, 4) | 0.0625 |
| \(x\) \ \(y\) | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 2 | 1/16 | 0 | 0 | 0 |
| 3 | 0 | 2/16 | 0 | 0 |
| 4 | 0 | 1/16 | 2/16 | 0 |
| 5 | 0 | 0 | 2/16 | 2/16 |
| 6 | 0 | 0 | 1/16 | 2/16 |
| 7 | 0 | 0 | 0 | 2/16 |
| 8 | 0 | 0 | 0 | 1/16 |
The above tables and plots represent the joint and marginal distributions of the random variables \(X\) and \(Y\) in Example 2.41 according to the probability measure \(\textrm{P}\), which reflects the assumption that the die is fair and the rolls are independent.
Table 2.17, Table 2.16, Figure 2.18 and Figure 2.19 represent the joint distribution of the sum and larger of two rolls of a fair four-sided die. The joint distribution of two random variables summarizes the possible pairs of values and their relative likelihoods or plausibilities.
In the context of multiple random variables, the distribution of any one of the random variables is called a marginal distribution. Table 2.14 and Figure 2.16 represent the marginal distribution of the sum of two rolls of a fair four-sided die. Table 2.15 and Figure 2.17 represent the marginal distribution of the larger of two rolls of a fair four-sided die.
Distributions of random variables depend on the underlying probability measure. Changing the probability measure can change distributions.
In Example 2.41, we first specified the probability space of 16 equally likely outcomes then derived the distribution. However, in many problems we often assume or identify distributions directly, without any mention of the underlying sample space or probability measure. Recall the brown bag analogy in Section 2.3.2. The probability space corresponds to the random selection of fruits to put in the bag. The random variable is weight. The distribution of weight can be obtained by randomly selecting fruits to put in the bag, weighing the bag, and then repeating this process many times to observe many weights. For example, maybe 10% of bags have weights less than 5 pounds, 75% of bags have weights less than 20 pounds, etc. We can observe the distribution of weights even if we don’t observe the actual fruits in the bag or fully specify the random phenomenon and its sample space.
Example 2.41 involved two discrete random variables. We will introduce distributions of continuous random variables later.
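The joint and marginal distributions above can be reproduced by brute-force enumeration of the 16 equally likely outcomes. A Python sketch (our own illustration, not the text's code):

```python
from itertools import product
from fractions import Fraction
from collections import Counter

rolls = list(product(range(1, 5), repeat=2))      # 16 equally likely outcomes
joint = Counter((sum(r), max(r)) for r in rolls)  # X = sum, Y = larger roll
n = len(rolls)
print({xy: f"{c}/{n}" for xy, c in sorted(joint.items())})

# Marginals: sum the joint distribution over the other variable
px, py = Counter(), Counter()
for (x, y), c in joint.items():
    px[x] += Fraction(c, n)
    py[y] += Fraction(c, n)
print(dict(sorted(px.items())))  # e.g. P(X=5) = 1/4
print(dict(sorted(py.items())))  # e.g. P(Y=4) = 7/16
```

The printed values match the joint table and the marginal tables for \(X\) and \(Y\) above.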
2.5.1 Marginal distributions do not determine the joint distribution
In Example 2.41, we can obtain the marginal distributions from the joint distribution by summing rows and columns: think of adding a total column (for \(X\)) and a total row (for \(Y\)) in the “margins” of the table. It is always possible to obtain marginal distributions of the random variables in a collection from their joint distribution. However, in general the marginal distributions alone are not enough to determine the joint distribution.
| \(x\) \ \(y\) | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 2 | 1/256 | 3/256 | 5/256 | 7/256 |
| 3 | 2/256 | 6/256 | 10/256 | 14/256 |
| 4 | 3/256 | 9/256 | 15/256 | 21/256 |
| 5 | 4/256 | 12/256 | 20/256 | 28/256 |
| 6 | 3/256 | 9/256 | 15/256 | 21/256 |
| 7 | 2/256 | 6/256 | 10/256 | 14/256 |
| 8 | 1/256 | 3/256 | 5/256 | 7/256 |
Table 2.17 and Table 2.18 provide an illustration of two different joint distributions with the same marginal distributions. When representing the joint distribution of two discrete random variables in a table, just because you know the row and column totals doesn’t mean you know all the values of the interior cells.
A joint distribution represents all of the probabilistic behavior of a collection of random variables. It is always possible to obtain marginal distributions of the random variables in a collection from their joint distribution.
However, in general you cannot determine the joint distribution based on the marginal distributions alone. Marginal distributions only reflect how each random variable behaves in isolation. The joint distribution goes further and fully represents relationships between the random variables. Just because you know how each random variable behaves individually, you don’t necessarily know how they behave in relationship with each other.
The exception to this warning is when random variables are independent, which we’ll discuss later. But you shouldn’t simply assume random variables are independent without sufficient justification.
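To see this concretely, the sketch below (illustrative Python of our own) hardcodes the two joint distribution tables above and confirms that they have identical row and column totals even though the interior cells differ:

```python
from fractions import Fraction

xs, ys = range(2, 9), range(1, 5)

# First joint distribution: (sum, larger roll) of two fair four-sided rolls,
# nonzero cells only (counts out of 16)
t1 = {(2, 1): 1, (3, 2): 2, (4, 2): 1, (4, 3): 2, (5, 3): 2,
      (5, 4): 2, (6, 3): 1, (6, 4): 2, (7, 4): 2, (8, 4): 1}
joint1 = {(x, y): Fraction(t1.get((x, y), 0), 16) for x in xs for y in ys}

# Second joint distribution, read off the other table (counts out of 256)
t2 = {2: (1, 3, 5, 7), 3: (2, 6, 10, 14), 4: (3, 9, 15, 21), 5: (4, 12, 20, 28),
      6: (3, 9, 15, 21), 7: (2, 6, 10, 14), 8: (1, 3, 5, 7)}
joint2 = {(x, y): Fraction(t2[x][y - 1], 256) for x in xs for y in ys}

def marginals(joint):
    px = {x: sum(joint[x, y] for y in ys) for x in xs}
    py = {y: sum(joint[x, y] for x in xs) for y in ys}
    return px, py

assert marginals(joint1) == marginals(joint2)  # identical margins...
assert joint1 != joint2                        # ...different joint distributions
```

For instance, the cell \((x, y) = (2, 2)\) has probability 0 in the first table but 3/256 in the second, yet every row and column total agrees.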
2.5.2 Interpretations of distributions
Distributions can be thought of as collections of probabilities of events involving random variables. As for probabilities, we can interpret probability distributions of random variables as:
- long run relative frequency distributions: what pattern of values would emerge if we repeated the random process many times and observed many values of the random variables?
- subjective probability distributions: which potential values of these uncertain quantities are relatively more plausible than others?
The long run relative frequency interpretation is natural for Example 2.41. We can roll a pair of fair four-sided dice and measure the sum of the rolls and the larger of the rolls. If we repeat this process many times, we would expect about 6.25% of repetitions to result in a sum of 2, 12.5% of repetitions to result in a sum of 3, 6.25% of repetitions to result in a larger roll of 1, 31.25% of repetitions to result in a larger roll of 3, 6.25% of repetitions to result in both a sum of 2 and a larger roll of 1, 12.5% of repetitions to result in both a sum of 3 and a larger roll of 2, etc. If we summarize the results of many repetitions—which we will do in the next chapter—we would expect the patterns to look like those in the tables and plots in this section.
In other situations the subjective distribution interpretation is more natural. For example, the total number of points scored in the next Superbowl will be one and only one number, but since we don’t know what that number is we can treat it as a random variable. Treating the number of points as a random variable allows us to quantify our uncertainty about it through probability statements like “there is a 0.6 probability that at most 45 points will be scored in the next Superbowl”. A subjective probability distribution for the number of points describes which possible values are relatively more plausible than others.
As with probabilities, the mathematics of distributions work the same way regardless of which interpretation is used, so we will use the two interpretations interchangeably.
2.5.3 Expected value
The distribution of a random variable specifies its possible values and the probability of any event that involves the random variable. It is also useful to summarize some key features of a distribution. Recall that in Section 1.7 we introduced the idea of a “probability-weighted average value”. We also saw how this value can be interpreted as a “long run average value”.
In Example 2.44, 5 is the expected value of \(X\), denoted \(\textrm{E}(X)\). Likewise, \(\textrm{E}(Y) = 3.125\). As we discussed in Section 1.7 the term “expected value” is somewhat of a misnomer. The expected value of \(X\) is not necessarily the value of \(X\) we expect to see when the random phenomenon is observed, but rather the value of \(X\) we would expect to see on average in the long run over many observations of the random phenomenon.
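These expected values follow directly from the marginal distributions tabulated earlier in this section. A quick probability-weighted-average check (illustrative Python):

```python
# Marginal distributions of X (sum) and Y (larger) of two fair four-sided rolls
px = {2: 1/16, 3: 2/16, 4: 3/16, 5: 4/16, 6: 3/16, 7: 2/16, 8: 1/16}
py = {1: 1/16, 2: 3/16, 3: 5/16, 4: 7/16}

# Expected value = probability-weighted average of the possible values
EX = sum(x * p for x, p in px.items())
EY = sum(y * p for y, p in py.items())
print(EX, EY)  # 5.0 3.125
```

(The sums are exact here because every probability is a multiple of 1/16, which is exactly representable in binary floating point.)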
The distribution of a random variable and hence its expected value depend on the probability measure. If the probability measure changes (e.g., from representing a fair die to a weighted die) then distributions and expected values of random variables can change.
Example 2.44 involved two discrete random variables. We will introduce expected values of continuous random variables later.
Expected value is just one feature of a distribution. We are also interested in other features, such as percentiles or the overall degree of variability. Usually there are multiple random variables of interest and we are interested in summarizing relationships between them. We will explore distributions of random variables and related concepts such as expected value, variance, and correlation in much more detail in the remaining chapters.
2.5.4 Exercises
Exercise 2.15 Consider the matching problem with \(n=4\): objects labeled 1, 2, 3, 4 are placed at random in spots labeled 1, 2, 3, 4, with spot 1 the correct spot for object 1, etc. Recall the sample space from Table 2.2. Let the random variable \(X\) count the number of objects that are put back in the correct spot; recall Table 2.9. Let \(\textrm{P}\) denote the probability measure corresponding to the assumption that the objects are equally likely to be placed in any spot, so that the 24 possible placements are equally likely.
- Find the distribution of \(X\) by creating an appropriate table and plot.
- Find the probability-weighted average value of \(X\).
- Is the value from part 2 the most likely value of \(X\)? Explain.
- Is the value from part 2 the value that we would “expect” to see for \(X\) in a single repetition of the phenomenon? Explain.
- Explain in what sense the value from part 2 is “expected”.
Exercise 2.16 Continuing Exercise 2.6.
The latest series of collectible Lego Minifigures contains 3 different Minifigure prizes (labeled 1, 2, 3). Each package contains a single unknown prize. Suppose we only buy 3 packages and we consider as our sample space outcome the results of just these 3 packages (prize in package 1, prize in package 2, prize in package 3). For example, 323 (or (3, 2, 3)) represents prize 3 in the first package, prize 2 in the second package, prize 3 in the third package. Let \(X\) be the number of distinct prizes obtained in these 3 packages. Let \(Y\) be the number of these 3 packages that contain prize 1. Suppose that each package is equally likely to contain any of the 3 prizes, regardless of the contents of other packages. There are 27 possible, equally likely outcomes
- Construct a two-way table representing the joint distribution of \(X\) and \(Y\).
- Sketch a plot representing the joint distribution of \(X\) and \(Y\).
- Identify the marginal distribution of \(X\), and sketch a plot of it.
- Identify the marginal distribution of \(Y\), and sketch a plot of it.
- Compute and interpret \(\text{E}(X)\).
- Compute and interpret \(\text{E}(Y)\).
2.6 Conditioning
All probabilities are conditional on some information. Conditioning concerns how probabilities of events or distributions of random variables are influenced by information about the occurrence of events or the values of random variables. We discussed some ideas related to conditioning in Section 1.5. This section—really a chapter within a chapter—explores conditioning in more detail, introducing some of the notation and math.
2.6.1 Conditional probability
A probability quantifies the likelihood or degree of uncertainty of an event. A conditional probability revises this value to reflect any newly available information about the outcome of the underlying random phenomenon.
Definition 2.7 The conditional probability of event \(A\) given event \(B\), denoted \(\textrm{P}(A|B)\), is defined as (provided35 \(\textrm{P}(B)>0\)):
\[ \textrm{P}(A|B) = \frac{\textrm{P}(A\cap B)}{\textrm{P}(B)} \]
The conditional probability \(\textrm{P}(A|B)\) represents the likelihood, plausibility, or degree of uncertainty of event \(A\) reflecting information that event \(B\) has occurred. The event to the left of the vertical bar, \(A\) in \(\textrm{P}(A|B)\), is the event we are evaluating the probability of. The unconditional probability \(\textrm{P}(A)\) is often called the prior probability (a.k.a., base rate) of \(A\) (prior to observing \(B\)). The event to the right of the vertical bar, \(B\) in \(\textrm{P}(A|B)\), is the event being conditioned on—how does the probability of \(A\) change given that event \(B\) occurs? The conditional probability \(\textrm{P}(A|B)\) is the posterior probability of \(A\) after observing \(B\). Read the vertical bar \(|\) in \(\textrm{P}(A | B)\) as “given”.
In Example 2.45, \(\textrm{P}(C|A) = 0.65\) is the conditional probability that an adult uses Snapchat given that they are age 18-29, and \(\textrm{P}(A|C) = 0.5417\) is the conditional probability that an adult is age 18-29 given that they use Snapchat.
All of the ideas from Section 1.5 still apply. We’ll remind you of a few, using our new notation. Remember that, in general, knowing whether or not event \(B\) occurs influences the probability of event \(A\); that is, \[ \text{In general, } \textrm{P}(A|B) \neq \textrm{P}(A) \] Also remember that order is essential in conditioning; that is, \[ \text{In general, } \textrm{P}(A|B) \neq \textrm{P}(B|A) \] Lastly, remember to always ask “probability of what?” Thinking of a conditional probability as a fraction, the event being conditioned on identifies the total/baseline group which corresponds to the denominator.
2.6.2 Joint, conditional, and marginal probabilities
When dealing with multiple events, probabilities can be joint, conditional, or marginal. In the context of two events \(A\) and \(B\):
- Joint: unconditional probability involving both events, \(\textrm{P}(A \cap B)\).
- Conditional: conditional probability of one event given the other, \(\textrm{P}(A | B)\), \(\textrm{P}(B | A)\).
- Marginal: unconditional probability of a single event \(\textrm{P}(A)\), \(\textrm{P}(B)\).
The relationship \(\textrm{P}(A|B) = \textrm{P}(A\cap B)/\textrm{P}(B)\) can be stated generically as \[ \text{conditional} = \frac{\text{joint}}{\text{marginal}} \] We will see several versions of this general relationship in the remaining chapters.
In Example 2.45, we were provided the marginal probabilities (\(\textrm{P}(A) = 0.20\), \(\textrm{P}(C) = 0.24\)) and a joint probability (\(\textrm{P}(A \cap C) = 0.13\)) and we computed conditional probabilities (\(\textrm{P}(C|A) = 0.65\), \(\textrm{P}(A|C) = 0.5417\)). In many problems some conditional probabilities are provided or can be determined directly.
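These computations can be sketched in a few lines of Python, using the numbers from Example 2.45:

```python
# Probabilities from Example 2.45 (A = age 18-29, C = uses Snapchat)
P_A = 0.20         # marginal P(A)
P_C = 0.24         # marginal P(C)
P_A_and_C = 0.13   # joint P(A and C)

# conditional = joint / marginal
P_C_given_A = P_A_and_C / P_A
P_A_given_C = P_A_and_C / P_C

print(round(P_C_given_A, 4))  # 0.65
print(round(P_A_given_C, 4))  # 0.5417
```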
A mosaic plot provides a nice visual of joint, marginal, and one-way conditional probabilities. The mosaic plot in Figure 2.20 (a) represents conditioning on age group. The vertical bars represent the conditional probabilities of using/not using Snapchat for each group. The widths of the vertical bars are scaled in proportion to the marginal probabilities for the age groups; the bar for 30-49 is a little wider than the others. The area of each rectangle represents a joint probability; the rectangle for “age 18-29 and uses Snapchat” represents 13% of the total area. The single vertical bar on the right displays the marginal probabilities of using/not using Snapchat.
Figure 2.20 (b) represents conditioning on Snapchat use. Now the widths of the vertical bars represent the probabilities of using/not using Snapchat, the heights within the bars represent conditional probabilities of each age group given Snapchat use, and the single bar to the right represents the marginal probabilities of age group.
2.6.3 Multiplication rule
In Example 2.46 we were given marginal probabilities of age groups and conditional probabilities of Snapchat use given age groups, and we computed joint probabilities. For example:
- 20% of adults are age 18-29
- 65% of adults age 18-29 use Snapchat
- So 13% of adults are age 18-29 and use Snapchat, \(0.13 = 0.20\times 0.65\).
In fraction terms,
\[ \scriptsize{ \frac{\text{adults age 18-29 who use Snapchat}}{\text{adults}} = \left(\frac{\text{adults age 18-29}}{\text{adults}}\right)\left(\frac{\text{adults age 18-29 who use Snapchat}}{\text{adults age 18-29}}\right) } \]
This calculation is an application of the following multiplication rule, which we have already applied intuitively in several examples.
Lemma 2.6 (Multiplication rule) The probability that two events \(A\) and \(B\) both occur is
\[ \begin{aligned} \textrm{P}(A \cap B) & = \textrm{P}(A|B)\textrm{P}(B)\\ & = \textrm{P}(B|A)\textrm{P}(A) \end{aligned} \]
The multiplication rule is just a rearrangement of the definition of the conditional probability of one event given another. The multiplication rule says that you should think “multiply” when you see “and”. However, be careful about what you are multiplying: to find a joint probability you need an unconditional probability and an appropriate conditional probability. You can condition either on \(A\) or on \(B\), provided you have the corresponding marginal probability; often, conditioning one way is easier than the other based on the available information. Be careful: the multiplication rule does not say that \(\textrm{P}(A\cap B)\) is equal to \(\textrm{P}(A)\textrm{P}(B)\); that equality holds only when \(A\) and \(B\) are independent.
Generically, the multiplication rule says \[ \text{joint} = \text{conditional}\times\text{marginal} \] We will see several versions of this general relationship in the remaining chapters.
The multiplication rule is useful in situations where conditional probabilities are easier to obtain directly than joint probabilities.
The multiplication rule extends naturally to more than two events (though the notation gets messy). For three events, we have
\[ \textrm{P}(A_1 \cap A_2 \cap A_3) = \textrm{P}(A_1)\textrm{P}(A_2|A_1)\textrm{P}(A_3|A_1\cap A_2) \]
And in general, \[ \textrm{P}(A_1\cap A_2 \cap A_3 \cap A_4 \cap \cdots) = \textrm{P}(A_1)\textrm{P}(A_2|A_1)\textrm{P}(A_3|A_1\cap A_2)\textrm{P}(A_4|A_1\cap A_2 \cap A_3)\cdots \]
The multiplication rule is useful for computing probabilities of events that can be broken down into component “stages” where conditional probabilities at each stage are readily available. At each stage, condition on the information about all previous stages.
That only 23 people are needed to have a better than 50% chance of a birthday match is surprising to many people, because 23 doesn’t seem like a lot of people. But when determining if there is a birthday match, we need to consider every pair of people in the group. In a group of 23 people, there are \(23(22)/2 = 253\) different pairs of people, and each one of these pairs has a chance of sharing a birthday.
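The birthday calculation itself is a direct application of the chain form of the multiplication rule: condition, person by person, on all previous birthdays being distinct. A minimal sketch, assuming 365 equally likely birthdays and ignoring leap years:

```python
def p_birthday_match(n, days=365):
    """P(at least two of n people share a birthday)."""
    p_all_distinct = 1.0
    for k in range(n):
        # Given the first k birthdays are all distinct, the (k+1)st person
        # avoids them with probability (days - k) / days.
        p_all_distinct *= (days - k) / days
    return 1 - p_all_distinct

print(round(p_birthday_match(22), 4))  # just under 0.5
print(round(p_birthday_match(23), 4))  # just over 0.5
```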
2.6.4 Law of total probability
The law of total probability says that a marginal probability can be thought of as a weighted average of “case-by-case” conditional probabilities, where the weights are determined by the likelihood or plausibility of each case.
The previous example illustrates another version of the law of total probability.
Lemma 2.7 (Law of total probability) If \(C_1, C_2, C_3,\ldots\) are disjoint events with \(C_1\cup C_2 \cup C_3\cup \cdots =\Omega\), then
\[\begin{align*} \textrm{P}(A) & = \textrm{P}(A |C_1)\textrm{P}(C_1) + \textrm{P}(A | C_2)\textrm{P}(C_2) + \textrm{P}(A | C_3)\textrm{P}(C_3) + \cdots \end{align*}\]
The events \(C_1, C_2, C_3, \ldots\), which represent the “cases”, form a partition of the sample space; each outcome \(\omega\in\Omega\) lies in exactly one of the \(C_i\). The law of total probability says that we can interpret the unconditional probability \(\textrm{P}(A)\) as a probability-weighted average of the case-by-case conditional probabilities \(\textrm{P}(A|C_i)\) where the weights \(\textrm{P}(C_i)\) represent the probability of encountering each case.
For an illustration of the law of total probability, consider the mosaic plots in Figure 2.20. In Figure 2.20 (a), the heights of the orange bars for each age group correspond to the conditional probabilities of using Snapchat given age group (0.65, 0.24, 0.12, 0.02). The widths of these bars are scaled in proportion to the marginal probabilities of the age groups; the bar for age 30-49 is 1.65 (0.33/0.20) times as wide as the bar for age 18-29. The height of the orange part of the single vertical bar on the right represents the marginal probability of using Snapchat (0.2436), which is the weighted average of the heights of the other orange bars (the conditional probabilities of using Snapchat given the age groups), with the weights given by the widths of those bars (the marginal probabilities of the age groups).
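The weighted average can be sketched in Python. The 0.20 and 0.33 marginals appear in the text; the weights 0.25 and 0.22 for the two older groups are assumptions here, chosen so that the four weights sum to 1 and reproduce the stated marginal 0.2436:

```python
# Law of total probability: P(Snapchat) as a weighted average of the
# case-by-case conditional probabilities, weighted by age-group marginals.
# Weights 0.25 and 0.22 are assumed (not stated in the text).
weights      = [0.20, 0.33, 0.25, 0.22]   # P(C_i): 18-29, 30-49, 50-64, 65+
conditionals = [0.65, 0.24, 0.12, 0.02]   # P(Snapchat | C_i)

p_snapchat = sum(w * c for w, c in zip(weights, conditionals))
print(round(p_snapchat, 4))  # 0.2436
```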
The influence of the weighting is even more apparent in the mosaic plot in Figure 2.20 (b). Since the marginal probability of not using Snapchat is greater than the marginal probability of using Snapchat, the marginal probabilities of the age groups are closer to the conditional probabilities of the age groups given that the adult does not use Snapchat than to those given that the adult uses Snapchat.
Conditioning and using the law of total probability is an effective strategy for solving many problems, even when the problem doesn’t seem to involve conditioning. For example, when a problem involves iterations or steps it is often useful to condition on the result of the first step.
The game in Example 2.50 could potentially last any number of rounds (1, 2, 3, …). However, the law of total probability allowed us to take advantage of the iterative nature of the game, and consider only one round rather than enumerating all the possibilities of what might happen over many potential rounds.
2.6.5 Conditioning is “slicing and renormalizing”
The process of conditioning can be thought of as “slicing and renormalizing”.
- Extract the “slice” corresponding to the event being conditioned on (and discard the rest). For example, a slice might correspond to a particular row or column of a two-way table, or a section of a plot.
- “Renormalize” the values in the slice so that corresponding probabilities add up to 1.
Slicing determines shape; renormalizing determines scale. Slicing determines relative probabilities; renormalizing just makes sure they add up to 1.
Consider the mosaic plot in Figure 2.20 (b), where the areas of rectangles represent joint probabilities. The areas of the rectangles in the “uses Snapchat” column represent the joint probabilities we used in part 1 of Example 2.46. Imagine taking the rectangles in this column and unstacking them to make a bar plot with heights determined by the joint probabilities of being in each age group and using Snapchat, as in Figure 2.22 (a). The “slice” determines the shape of the bar plot: the bar for age 18-29 is 1.64 times as high as the bar for age 30-49, 4.33 times as high as the bar for age 50-64, and 29.55 times as high as the bar for age 65+.
Summing the joint probabilities in Figure 2.22 (a) over the age groups yields 0.2436, the marginal probability that an adult uses Snapchat. Given that the adult uses Snapchat, we want the conditional probabilities of the age groups to sum to 1. Thus we “renormalize” the joint probabilities on the vertical axis (by dividing each by 0.2436) so that they sum to 1, obtaining the conditional probabilities of each age group given that the adult uses Snapchat, displayed in Figure 2.22 (b). Renormalizing only changes the absolute scale of the plot; compare the values on the vertical axes in Figure 2.22, which correspond to joint probabilities on the left and conditional probabilities on the right. Both plots have the same relative shape: the bar for age 18-29 is 1.64 times as high as the bar for age 30-49, 4.33 times as high as the bar for age 50-64, and 29.55 times as high as the bar for age 65+.
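A sketch of slicing and renormalizing in Python; the joint values below are reconstructed from the ratios and totals quoted above, so treat them as approximate:

```python
# "Slicing and renormalizing": slice out the joint probabilities for the
# "uses Snapchat" column, then divide by their sum so they total 1.
joint_slice = [0.1300, 0.0792, 0.0300, 0.0044]  # P(age group and Snapchat)

p_snapchat = sum(joint_slice)                    # marginal, ~0.2436
conditional = [p / p_snapchat for p in joint_slice]

print(round(p_snapchat, 4))           # 0.2436
print([round(p, 3) for p in conditional])
# Renormalizing preserves the relative shape of the slice:
print(round(joint_slice[0] / joint_slice[1], 2),
      round(conditional[0] / conditional[1], 2))  # both 1.64
```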
We will see that “slicing and renormalizing” is a helpful way to conceptualize conditioning, especially when dealing with conditional distributions of random variables.
2.6.6 Bayes’ rule
Bayes’ rule describes how to update uncertainty in light of new information, evidence, or data. We’ll introduce it in the context of two-way tables.
| hypothesis | HS | College | Bachelors | Postgrad | Total |
|---|---|---|---|---|---|
| iterative | 21910 | 19320 | 16030 | 12740 | 70000 |
| unchanging | 5404 | 4396 | 2758 | 1442 | 14000 |
| not sure | 9168 | 4352 | 1552 | 928 | 16000 |
| Total | 36482 | 28068 | 20340 | 15110 | 100000 |
Lemma 2.8 (Bayes’ rule for events) Bayes’ rule for events specifies how a prior probability \(P(H)\) of event \(H\) is updated in response to the evidence \(E\) to obtain the posterior probability \(P(H|E)\). \[ P(H|E) = \frac{P(E|H)P(H)}{P(E)} \]
- Event \(H\) represents a particular hypothesis (or model or case)
- Event \(E\) represents observed evidence (or data or information)
- \(P(H)\) is the unconditional or prior probability of \(H\) (prior to observing evidence \(E\))
- \(P(H|E)\) is the conditional or posterior probability of \(H\) after observing evidence \(E\).
- \(P(E|H)\) is the likelihood of evidence \(E\) given hypothesis (or model or case) \(H\)
Bayes’ rule is often used when there are multiple hypotheses or cases. Suppose \(H_1,H_2, \ldots\) is a series of distinct hypotheses which together account for all possibilities, and \(E\) is any event (evidence). Then Bayes’ rule implies that the posterior probability of any particular hypothesis \(H_j\) satisfies \[ \textrm{P}(H_j |E) = \frac{\textrm{P}(E|H_j)\textrm{P}(H_j)}{\textrm{P}(E)} \]
The marginal probability of the evidence, \(\textrm{P}(E)\), in the denominator can be calculated using the law of total probability \[ \textrm{P}(E) = \textrm{P}(E|H_1) \textrm{P}(H_1) + \textrm{P}(E|H_2) \textrm{P}(H_2) + \textrm{P}(E|H_3) \textrm{P}(H_3) + \cdots \] Since \(\textrm{P}(E)\) is the sum of the terms \(\textrm{P}(E|H_j)\textrm{P}(H_j)\) over all the hypotheses, Bayes’ rule implies that \(\textrm{P}(H_j |E)\) is proportional to \(\textrm{P}(E|H_j)\textrm{P}(H_j)\) \[\begin{align*} \textrm{P}(H_j |E) & = \frac{\textrm{P}(E|H_j)\textrm{P}(H_j)}{\textrm{P}(E)}\\ \textrm{P}(H_j |E) & \propto \textrm{P}(E|H_j)\textrm{P}(H_j) \end{align*}\]
In short, Bayes’ rule says that a posterior probability of a hypothesis is proportional to the product of the prior probability of the hypothesis and the likelihood of the evidence if the hypothesis were true.
\[ \textbf{posterior} \propto \textbf{prior} \times \textbf{likelihood} \]
Bayes’ rule calculations are often organized in a Bayes table like Table 2.20, which illustrates “posterior is proportional to likelihood times prior”. The table has one row for each hypothesis and columns for
- prior probability: column sum is 1
- likelihood of the evidence given each hypothesis
- likelihood depends on the evidence; if the evidence changes, the likelihood column changes
- the sum of the likelihood column is a meaningless number and can be any value
- product of prior and likelihood: column sum is the marginal probability of the evidence
- posterior probability: column sum is 1
| hypothesis | prior | likelihood | product | posterior |
|---|---|---|---|---|
| iterative | 0.70 | 0.182 | 0.1274 | 0.8432 |
| unchanging | 0.14 | 0.103 | 0.0144 | 0.0954 |
| not sure | 0.16 | 0.058 | 0.0093 | 0.0614 |
| Total | 1.00 | 0.343 | 0.1511 | 1.0000 |
The likelihood column in a Bayes table depends on the evidence. In Table 2.20 the evidence is that the American has a postgraduate degree; the likelihood column contains the probability of the same event, \(E\) = “the American has a postgraduate degree”, under each of the distinct hypotheses:
- \(\textrm{P}(E |H_1) = 0.182\), given the American agrees with the “iterative” statement
- \(\textrm{P}(E |H_2) = 0.103\), given the American agrees with the “unchanging” statement
- \(\textrm{P}(E |H_3) = 0.058\), given the American is “not sure”
Since each of these probabilities is computed under a different case, these values do not need to add up to anything in particular. The sum of the likelihoods is meaningless.
The “product” column contains the product of the values in the prior and likelihood columns. In Table 2.20 the product of prior and likelihood for “iterative” (0.1274) is 8.835 times as large as the product of prior and likelihood for “unchanging” (0.0144); the ratios here are computed from the unrounded products. Therefore, Bayes’ rule implies that the conditional probability that an American with a postgraduate degree agrees with “iterative” should be 8.835 times as large as the conditional probability that an American with a postgraduate degree agrees with “unchanging”. Similarly, the conditional probability that an American with a postgraduate degree agrees with “iterative” should be 13.73 times as large as the conditional probability that an American with a postgraduate degree is “not sure”, and the conditional probability that an American with a postgraduate degree agrees with “unchanging” should be 1.55 times as large as the conditional probability that an American with a postgraduate degree is “not sure”. The last column just translates these relative relationships into probabilities that sum to 1.
The sum of the “product” column is \(\textrm{P}(E)\), the marginal probability of the evidence or “average likelihood”. The sum of the product column represents the result of the law of total probability calculation. However, for the purposes of determining the posterior probabilities, it isn’t really important what \(\textrm{P}(E)\) is. Rather, it is the ratios of the values in the “product” column that determine the posterior probabilities. \(\textrm{P}(E)\) is whatever it needs to be to ensure that the posterior probabilities sum to 1 while maintaining the proper ratios.
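The whole table boils down to a few lines of arithmetic. A sketch reproducing Table 2.20, with the prior and likelihood values as given in the table:

```python
# A Bayes table: posterior = prior * likelihood, renormalized by the
# column sum of the products (= the marginal probability of the evidence).
hypotheses = ["iterative", "unchanging", "not sure"]
prior      = [0.70, 0.14, 0.16]
likelihood = [0.182, 0.103, 0.058]   # P(postgrad degree | hypothesis)

product = [p * l for p, l in zip(prior, likelihood)]
p_evidence = sum(product)                  # marginal P(E), ~0.1511
posterior = [pr / p_evidence for pr in product]

for h, po in zip(hypotheses, posterior):
    print(h, round(po, 4))
# iterative 0.8432, unchanging 0.0954, not sure 0.0614
```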
Bayes’ rule is just another application of conditioning as “slicing and renormalizing”.
- Extract the “slice” corresponding to the event being conditioned on (and discard the rest). For example, a slice might correspond to a particular row or column of a two-way table.
- “Renormalize” the values in the slice so that corresponding probabilities add up to 1.
In Bayes’ rule, the product of prior and likelihood determines the shape of the slice. Slicing determines relative probabilities; renormalizing just makes sure they “add up” to 1 while maintaining the proper ratios.
| hypothesis | prior | likelihood | product | posterior |
|---|---|---|---|---|
| iterative | 0.70 | 0.276 | 0.1932 | 0.6883 |
| unchanging | 0.14 | 0.314 | 0.0440 | 0.1566 |
| not sure | 0.16 | 0.272 | 0.0435 | 0.1551 |
| Total | 1.00 | 0.862 | 0.2807 | 1.0000 |
Like the scientific method, applying Bayes’ rule is often an iterative process.
| Green | prior | likelihood | product | posterior |
|---|---|---|---|---|
| 0 | 0.1667 | 0.0 | 0.0000 | 0.0000 |
| 1 | 0.1667 | 0.2 | 0.0333 | 0.0667 |
| 2 | 0.1667 | 0.4 | 0.0667 | 0.1333 |
| 3 | 0.1667 | 0.6 | 0.1000 | 0.2000 |
| 4 | 0.1667 | 0.8 | 0.1333 | 0.2667 |
| 5 | 0.1667 | 1.0 | 0.1667 | 0.3333 |
| sum | 1.0000 | NA | 0.5000 | 1.0000 |
| Green | prior | likelihood | product | posterior |
|---|---|---|---|---|
| 0 | 0.0000 | 1.00 | 0.0000 | 0.0 |
| 1 | 0.0667 | 1.00 | 0.0667 | 0.2 |
| 2 | 0.1333 | 0.75 | 0.1000 | 0.3 |
| 3 | 0.2000 | 0.50 | 0.1000 | 0.3 |
| 4 | 0.2667 | 0.25 | 0.0667 | 0.2 |
| 5 | 0.3333 | 0.00 | 0.0000 | 0.0 |
| sum | 1.0000 | NA | 0.3333 | 1.0 |
| Green | prior | likelihood | product | posterior |
|---|---|---|---|---|
| 0 | 0.1667 | 0.0 | 0.0000 | 0.0 |
| 1 | 0.1667 | 0.2 | 0.0333 | 0.2 |
| 2 | 0.1667 | 0.3 | 0.0500 | 0.3 |
| 3 | 0.1667 | 0.3 | 0.0500 | 0.3 |
| 4 | 0.1667 | 0.2 | 0.0333 | 0.2 |
| 5 | 0.1667 | 0.0 | 0.0000 | 0.0 |
| sum | 1.0000 | NA | 0.1667 | 1.0 |
| Green | prior | likelihood | product | posterior |
|---|---|---|---|---|
| 0 | 0.1667 | 0.0 | 0.0000 | 0.0 |
| 1 | 0.1667 | 0.4 | 0.0667 | 0.2 |
| 2 | 0.1667 | 0.6 | 0.1000 | 0.3 |
| 3 | 0.1667 | 0.6 | 0.1000 | 0.3 |
| 4 | 0.1667 | 0.4 | 0.0667 | 0.2 |
| 5 | 0.1667 | 0.0 | 0.0000 | 0.0 |
| sum | 1.0000 | NA | 0.3333 | 1.0 |
Like the scientific method, Bayesian analysis is often an iterative process. Posterior probabilities are updated after observing some information or data, and can then serve as prior probabilities before observing new data. That is, posterior probabilities can be sequentially updated as new data become available, with the posterior probabilities from the previous stage serving as the prior probabilities for the next stage. The final posterior probabilities depend only on the cumulative data: it doesn’t matter whether we update the posterior sequentially after each new piece of data or only once after all the data are available, and the final posterior probabilities are not affected by the order in which the data are observed.
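This can be checked numerically with the prior and likelihood columns from the first two Green tables above: updating on the first observation and then the second gives the same posterior as a single update with the product of the two likelihoods, in either order. A sketch:

```python
def update(prior, likelihood):
    """One Bayes-table step: posterior proportional to prior * likelihood."""
    product = [p * l for p, l in zip(prior, likelihood)]
    total = sum(product)  # marginal probability of the evidence
    return [pr / total for pr in product]

prior = [1 / 6] * 6                        # uniform prior over 0..5 greens
lik1 = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]      # likelihood column, first table
lik2 = [1.0, 1.0, 0.75, 0.5, 0.25, 0.0]    # likelihood column, second table

sequential = update(update(prior, lik1), lik2)
batch = update(prior, [a * b for a, b in zip(lik1, lik2)])
reordered = update(update(prior, lik2), lik1)

print([round(p, 3) for p in sequential])   # [0.0, 0.2, 0.3, 0.3, 0.2, 0.0]
```

Up to floating-point round-off, `sequential`, `batch`, and `reordered` are identical.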
| Green | prior | likelihood | product | posterior |
|---|---|---|---|---|
| 0 | 0.1667 | 0.0 | 0.0000 | 0.00 |
| 1 | 0.1667 | 0.0 | 0.0000 | 0.00 |
| 2 | 0.1667 | 0.1 | 0.0167 | 0.05 |
| 3 | 0.1667 | 0.3 | 0.0500 | 0.15 |
| 4 | 0.1667 | 0.6 | 0.1000 | 0.30 |
| 5 | 0.1667 | 1.0 | 0.1667 | 0.50 |
| sum | 1.0000 | NA | 0.3333 | 1.00 |
The probabilities we computed in Solution 2.58 are examples of “predictive probabilities”; the value in part 1 is a prior predictive probability and the values in parts 2 and 3 are posterior predictive probabilities. Recall that prior or posterior probabilities assess the uncertainty of the different hypotheses or cases either before (prior) or after (posterior) observing some data. On the other hand, prior or posterior predictive probabilities assess the probability of potential data (while also accounting for the uncertainty of the hypotheses or cases) either before (prior predictive) or after (posterior predictive) observing some data.
2.6.7 Conditional probabilities are probabilities
Conditioning on an event \(E\) can be viewed as a change in the probability measure on \(\Omega\), from \(\textrm{P}(\cdot)\) to \(\textrm{P}(\cdot|E)\). That is, the original probability measure \(\textrm{P}(\cdot)\) assigns probability \(\textrm{P}(A)\), a number, to event \(A\), while the conditional probability measure \(\textrm{P}(\cdot |E)\) assigns probability \(\textrm{P}(A|E)\), a possibly different number, to event \(A\). Switching to \(\textrm{P}(\cdot |E)\) resembles the following.
- Outcomes in \(E^c\) are assigned probability 0 under \(\textrm{P}(\cdot|E)\). If \(A\) consists only of outcomes not in \(E\), i.e., if \(A\subseteq E^c\), then \(\textrm{P}(A\cap E)=0\) so \(\textrm{P}(A|E)=0\).
- The probabilities of outcomes in \(E\) are rescaled so that they comprise 100% of the probability conditional on \(E\), i.e., so that \(\textrm{P}(E|E)=1\). This is the effect of dividing by \(\textrm{P}(E)\). For example, if \(A, B\subseteq E\) and \(\textrm{P}(A)=2\textrm{P}(B)\), then also \(\textrm{P}(A|E)=2\textrm{P}(B|E)\). That is, if event \(A\) is twice as likely as event \(B\) according to \(\textrm{P}(\cdot)\), the same is true according to \(\textrm{P}(\cdot|E)\), since conditioning on \(E\) does not zero out the probability of any outcome in \(A\) or \(B\).
Conditional probabilities are probabilities. Given an event \(E\), the function \(\textrm{P}(\cdot|E)\) defines a valid probability measure. Analogous versions of probability rules hold for conditional probabilities, just condition on event \(E\) everywhere.
- \(0 \le \textrm{P}(A|E) \le 1\) for any event \(A\).
- \(\textrm{P}(\Omega|E)=1\). Moreover, \(\textrm{P}(E|E) = 1\).
- If events \(A_1, A_2, \ldots\) are disjoint (i.e. \(A_i \cap A_j = \emptyset, i\neq j\)) then \[ \textrm{P}(A_1 \cup A_2 \cup \cdots |E) = \textrm{P}(A_1|E) + \textrm{P}(A_2|E) + \cdots \]
- \(\textrm{P}(A^c|E) = 1-\textrm{P}(A|E)\). (Be careful! Do not confuse \(\textrm{P}(A^c|E)\) with \(\textrm{P}(A|E^c)\).)
- For any partition \(C_1, C_2, C_3, \ldots\) of the sample space: \(\textrm{P}(A|E) = \textrm{P}(A |C_1\cap E)\textrm{P}(C_1| E) + \textrm{P}(A | C_2\cap E)\textrm{P}(C_2|E) + \textrm{P}(A | C_3\cap E)\textrm{P}(C_3|E) + \cdots\)
All probabilities are conditional on some information. The probability measure \(\textrm{P}\) assigns probabilities that reflect all assumptions and information about the random phenomenon. When new information becomes available we revise our probabilities. The probability measure \(\textrm{P}(\cdot |E)\) assigns probabilities that reflect all assumptions and information about the random phenomenon, including the information that event \(E\) occurs. Our revised probabilities must still satisfy the logical consistency conditions required by the probability axioms, so \(\textrm{P}(\cdot |E)\) must be a valid probability measure.
Like probabilities, conditional probabilities can be interpreted as long run relative frequencies or subjective probabilities. Imagine repeating the random phenomenon a large number of times. The unconditional probability \(\textrm{P}(A)\) can be interpreted as the proportion of repetitions where event \(A\) occurs. The conditional probability \(\textrm{P}(A|E)\) can be interpreted as the proportion of repetitions on which event \(E\) occurs where event \(A\) occurs. From the subjective viewpoint, \(\textrm{P}(A)\) represents the relative plausibility of event \(A\), while \(\textrm{P}(A|E)\) represents the relative plausibility of event \(A\) given that event \(E\) occurs.
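This long-run interpretation can be illustrated by simulation. A sketch using the two rolls of a four-sided die from earlier in the chapter: estimate P(sum = 6 | larger roll = 4) as the proportion of repetitions with larger roll 4 on which the sum is 6 (the exact value, from the joint distribution, is 2/7 ≈ 0.286):

```python
import random

random.seed(1)  # for reproducibility of the sketch
n_rep = 100_000
count_E = 0        # repetitions where the larger roll is 4 (event E)
count_A_and_E = 0  # ... on which the sum is also 6 (event A and E)

for _ in range(n_rep):
    u, v = random.randint(1, 4), random.randint(1, 4)
    if max(u, v) == 4:
        count_E += 1
        if u + v == 6:
            count_A_and_E += 1

# conditional relative frequency approximates P(A | E) = 2/7
print(round(count_A_and_E / count_E, 3))
```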
2.6.8 Conditional distributions (a brief introduction)
The probability distribution of a random variable describes the possible values that the random variable can take and the relative likelihoods or plausibilities of these values. A conditional distribution revises this description to reflect newly available information.
| \(x\) \ \(y\) | 1 | 2 | 3 | 4 | Total |
|---|---|---|---|---|---|
| 2 | 1/16 | 0 | 0 | 0 | 1/16 |
| 3 | 0 | 2/16 | 0 | 0 | 2/16 |
| 4 | 0 | 1/16 | 2/16 | 0 | 3/16 |
| 5 | 0 | 0 | 2/16 | 2/16 | 4/16 |
| 6 | 0 | 0 | 1/16 | 2/16 | 3/16 |
| 7 | 0 | 0 | 0 | 2/16 | 2/16 |
| 8 | 0 | 0 | 0 | 1/16 | 1/16 |
| Total | 1/16 | 3/16 | 5/16 | 7/16 | 1 |
The conditional distribution of \(Y\) given \(X=x\) is the distribution of \(Y\) values over only those outcomes for which \(X=x\). It is a distribution on values of \(Y\) only; treat \(x\) as a fixed constant when conditioning on the event \(\{X=x\}\).
Conditional distributions can be obtained from a joint distribution by slicing and renormalizing. The conditional distribution of \(Y\) given \(X=x\), where \(x\) represents a particular number, can be thought of as:
- the slice of the joint distribution corresponding to \(X=x\), a distribution on values of \(Y\) alone with \(X=x\) fixed
- renormalized so that the slice accounts for 100% of the probability over the possible values of \(Y\)
The shape of the conditional distribution of \(Y\) given \(X=x\) is determined by the shape of the slice of the joint distribution over values of \(Y\) for the fixed \(x\).
For example, consider the joint distribution of \(X\) and \(Y\) in Example 2.59, depicted in Figure 2.18. To find the conditional distribution of \(X\) given \(Y=4\), extract the slice corresponding to \(y=4\) in Figure 2.18, and then renormalize to obtain the plot in the bottom right of Figure 2.24.
For each fixed \(x\), the conditional distribution of \(Y\) given \(X=x\) is a different distribution on values of the random variable \(Y\). There is not one “conditional distribution of \(Y\) given \(X\)”, but rather a family of conditional distributions of \(Y\) given different values of \(X\). In Example 2.59, Figure 2.25 depicts the conditional distribution of \(Y\) given \(X=x\) for each value \(x\) of \(X\), and Figure 2.24 depicts the conditional distribution of \(X\) given \(Y=y\) for each value \(y\) of \(Y\). Notice how each conditional distribution corresponds to a renormalized slice of the joint distribution depicted in Figure 2.18. We can also think of conditioning as slicing (and renormalizing) the joint distribution depicted in the tile plot in Figure 2.19, just remember that color represents probability in the tile plot.
We can also depict families of conditional distributions in mosaic plots; see Figure 2.26. A mosaic plot represents a family of conditional distributions where color represents the possible values of one variable and area represents probability.
Each conditional distribution is a distribution, so we can summarize its characteristics, such as expected value. The value in part 1 of Example 2.60 is the conditional expected value of \(X\) given \(Y=4\), denoted \(\textrm{E}(X|Y=4)\). The conditional expected value of \(Y\) given \(X=x\) represents the long run average of values of \(Y\) over only \((X, Y)\) pairs with \(X=x\). Since each value of \(x\) typically corresponds to a different conditional distribution of \(Y\) given \(X=x\), the conditional expected value will typically be a function of \(x\).
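As a sketch of such a computation, the conditional expected value \(\textrm{E}(X|Y=4)\) can be read off the joint distribution table above by slicing at \(y=4\), renormalizing, and averaging; this assumes, as in the table, that \(X\) (the sum) indexes the rows and \(Y\) (the larger roll) the columns. Exact fractions are used to avoid round-off:

```python
from fractions import Fraction as F

# Joint distribution from the table: joint[(x, y)] = P(X = x, Y = y),
# where X is the sum and Y is the larger of the two four-sided-die rolls.
joint = {
    (2, 1): F(1, 16),
    (3, 2): F(2, 16),
    (4, 2): F(1, 16), (4, 3): F(2, 16),
    (5, 3): F(2, 16), (5, 4): F(2, 16),
    (6, 3): F(1, 16), (6, 4): F(2, 16),
    (7, 4): F(2, 16),
    (8, 4): F(1, 16),
}

# Slice at Y = 4, renormalize, then average: E(X | Y = 4).
y = 4
slice_ = {x: p for (x, yy), p in joint.items() if yy == y}
p_y = sum(slice_.values())                      # P(Y = 4)
cond = {x: p / p_y for x, p in slice_.items()}  # conditional distribution
e_x_given_y4 = sum(x * p for x, p in cond.items())

print(p_y, e_x_given_y4)  # 7/16 44/7
```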
We will explore conditioning in more detail throughout the book, including conditional distributions when continuous random variables are involved. We will also see how to use conditioning as a problem-solving tool.
2.6.9 Exercises
Exercise 2.17 Each question on a multiple choice test has four options. You know with certainty the correct answers to 70% of the questions. For 20% of the questions, you can eliminate two of the incorrect choices with certainty, but you guess at random among the remaining two options. For the remaining 10% of questions, you have no idea and guess one of the four options at random.
Randomly select a question from this test. What is the probability that you answer the question correctly?
- Construct an appropriate two-way table and use it to find the probability of interest.
- For any given question on the exam, your probability of answering it correctly is either 1, 0.5, or 0.25, depending on whether you know it, can eliminate two choices, or are just guessing. How does your probability of correctly answering a randomly selected question relate to these three values? Which value (1, 0.5, or 0.25) is the overall probability closest to, and why?
Exercise 2.18 Imagine a light that flashes every few seconds. The light randomly flashes green with probability 0.75 and red with probability 0.25, independently from flash to flash.
- Write down a sequence of G’s (for green) and R’s (for red) to predict the colors for the next 40 flashes of this light. Before you read on, please take a minute to think about how you would generate such a sequence yourself.
- Most people produce a sequence that has 30 G’s and 10 R’s, or close to those proportions, because they are trying to generate a sequence for which each outcome has a 75% chance for G and a 25% chance for R. That is, they use a strategy in which they predict G with probability 0.75, and R with probability 0.25. How well does this strategy do? Compute the probability of correctly predicting any single item in the sequence using this strategy.
- Describe a better strategy. (Hint: can you find a strategy for which the probability of correctly predicting any single flash is 0.75?)
Exercise 2.19 The ELISA test for HIV was widely used in the mid-1990s for screening blood donations. As with most medical diagnostic tests, the ELISA test is not perfect. If a person actually carries the HIV virus, experts estimate that this test gives a positive result 97.7% of the time. (This number is called the sensitivity of the test.) If a person does not carry the HIV virus, ELISA gives a negative (correct) result 92.6% of the time (the specificity of the test). Estimates at the time were that 0.5% of the American public carried the HIV virus.
Suppose that a randomly selected American tests positive; we are interested in the conditional probability that the person actually carries the virus.
- Before proceeding, make a guess for the probability in question.
- Denote the probabilities provided in the setup using proper notation.
- Construct an appropriate two-way table and use it to compute the probability of interest.
- Construct a Bayes table and use it to compute the probability of interest.
- Explain why this probability is small, compared to the sensitivity and specificity.
- By what factor has the probability of carrying HIV increased, given a positive test result, as compared to before the test?
- Now suppose that 5% of individuals in a high-risk group carry the HIV virus. Consider a randomly selected person from this group who takes the test. Given that the test is positive, how many times more likely is it for the person not to have HIV than to have it? Answer without first computing a two-way or Bayes table.
- Using the result from the previous part, compute the conditional probability that a person in this risk group who tests positive has HIV.
- Is the posterior probability influenced by the prior probability? Discuss.
Exercise 2.20 Consider three tennis players A, B, and C. One of these players is better than the other two, who are equally good/bad. When the best player plays either of the others, she has a 2/3 probability of winning the match. When the other two players play each other, each has a 1/2 probability of winning the match. But you do not know which player is the best. Based on watching the players warm up, you start with subjective probabilities of 0.5 that A is the best, 0.35 that B is the best, and 0.15 that C is the best. (Note: the fact that these are subjective probabilities doesn’t change at all how you would solve the problems.) A and B will play the first match.
- Suppose that A beats B in the first match. Compute your posterior probability that each of A, B, C is best given that A beats B in the first match.
- Compare the posterior probabilities from the previous part to the prior probabilities. Explain how your probabilities changed, and why that makes sense.
- Suppose instead that B beats A in the first match. Compute your posterior probability that each of A, B, C is best given that B beats A in the first match.
- Compare the posterior probabilities from the previous part to the prior probabilities. Explain how your probabilities changed, and why that makes sense.
- Now suppose again that A beats B in the first match, and also that A beats C in the second match.
- Compute your posterior probability that each of A, B, C is best given the results of the first two matches. (Hint: use as the prior your posterior probabilities from the previous part.) Explain how your probabilities changed, and why that makes sense.
Exercise 2.21 Continuing Exercise 2.20. Suppose A will play B in the first match.
- Before any matches, if you had to choose the one player you think is best, who would you choose? What is your subjective probability that your choice is correct? (This should be a short answer, not requiring any calculations. The main reason to think about this is to compare to the last part.)
- Compute your subjective probability that A will beat B in the first match.
- If A beats B in the first match, you will update your subjective probabilities so they are: 0.6349 that A is the best, 0.2222 that B is the best, and 0.1429 that C is the best. (See Exercise 2.20.) Suppose that A beats B in the first match. If you had to choose the one player you think is best based on your updated subjective probabilities, who would you choose? What is your subjective probability that your choice is correct given that A beats B in the first match?
- If B beats A in the first match, you will update your subjective probabilities so they are: 0.3509 that A is the best, 0.4912 that B is the best, and 0.1579 that C is the best. (See Exercise 2.20.) Suppose that B beats A in the first match. If you had to choose the one player you think is best based on your updated subjective probabilities, who would you choose? What is your subjective probability that your choice is correct given that B beats A in the first match?
- After the first match you make your choice of who you think is the best player. Compute your subjective probability that your choice is correct. (Hint: this should be a single number, but you need to consider the two cases.) Compare to the first part; what is the “value” of observing the winner of the first match?
Exercise 2.22 Continuing Exercise 2.16.
The latest series of collectible Lego Minifigures contains 3 different Minifigure prizes (labeled 1, 2, 3). Each package contains a single unknown prize. Suppose we only buy 3 packages and we consider as our sample space outcomes the results of just these 3 packages (prize in package 1, prize in package 2, prize in package 3). For example, 323 (or (3, 2, 3)) represents prize 3 in the first package, prize 2 in the second package, prize 3 in the third package. Let \(X\) be the number of distinct prizes obtained in these 3 packages. Let \(Y\) be the number of these 3 packages that contain prize 1. Suppose that each package is equally likely to contain any of the 3 prizes, regardless of the contents of other packages. There are 27 possible, equally likely outcomes.
- Find the conditional distribution of \(Y\) given \(X=x\) for each possible value of \(x\) of \(X\).
- Compute and interpret \(\text{E}(Y|X=x)\) for each possible value of \(x\) of \(X\).
- Find the conditional distribution of \(X\) given \(Y=y\) for each possible value of \(y\) of \(Y\).
- Compute and interpret \(\text{E}(X|Y=y)\) for each possible value of \(y\) of \(Y\).
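After working the parts by hand, you can verify your tables by brute-force enumeration. The sketch below (the function names are mine) lists all 27 equally likely outcomes and tabulates the conditional distribution of \(Y\) given \(X = x\) with exact fractions; the same approach works for \(X\) given \(Y = y\).

```python
from itertools import product
from fractions import Fraction
from collections import defaultdict

# Enumerate all 27 equally likely outcomes (prize in each of 3 packages).
outcomes = list(product([1, 2, 3], repeat=3))

# X = number of distinct prizes, Y = number of packages containing prize 1.
def X(w): return len(set(w))
def Y(w): return w.count(1)

# Tally the joint distribution of (X, Y) as counts out of 27.
joint = defaultdict(int)
for w in outcomes:
    joint[(X(w), Y(w))] += 1

# Conditional distribution of Y given X = x, as exact fractions.
def cond_Y_given_X(x):
    total = sum(n for (xx, _), n in joint.items() if xx == x)
    return {y: Fraction(n, total) for (xx, y), n in joint.items() if xx == x}

for x in [1, 2, 3]:
    print(x, dict(sorted(cond_Y_given_X(x).items())))
```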
2.7 Independence
We revise probabilities of events and distributions of random variables when new information becomes available. In this section we investigate situations where conditioning on information does not change probabilities or distributions.
2.7.1 Independence of two events
In general, the conditional probability of event \(A\) given some other event \(B\) is usually different from the unconditional probability of \(A\); that is, in general \(\textrm{P}(A | B) \neq \textrm{P}(A)\). Knowledge of the occurrence of event \(B\) typically influences the probability of event \(A\), and vice versa. If so, we say that events \(A\) and \(B\) are dependent.
However, in some situations knowledge of the occurrence of one event does not influence the probability of another. For example, if a coin is flipped twice then knowing that the first flip landed on heads does not change the probability that the second flip lands on heads. In such situations we say the events are independent.
Definition 2.8 For a probability space with probability measure \(\textrm{P}\), two events \(A\) and \(B\) are53 independent if \(\textrm{P}(A \cap B) = \textrm{P}(A)\textrm{P}(B)\).
In general, the multiplication rule says \[\begin{align*} \textrm{P}(A \cap B) & = \textrm{P}(A|B)\textrm{P}(B)\\ \text{Joint} & = \text{Conditional}\times\text{Marginal} \end{align*}\] For independent events, the multiplication rule simplifies \[\begin{align*} \text{If $A$ and $B$ are independent then } && \textrm{P}(A \cap B) & = \textrm{P}(A)\textrm{P}(B)\\ \text{If independent then } && \text{Joint} & = \text{Product of Marginals} \end{align*}\]
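As a quick numerical check of the simplified rule (not part of the text's development), we can enumerate the sample space of two fair die rolls and verify that the joint probability of two events based on separate rolls equals the product of the marginals.

```python
from itertools import product
from fractions import Fraction

# Sample space: two rolls of a fair six-sided die, 36 equally likely pairs.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event under equally likely outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] % 2 == 0   # first roll is even
B = lambda w: w[1] >= 5       # second roll is 5 or 6

print(prob(lambda w: A(w) and B(w)))  # joint: P(A ∩ B)
print(prob(A) * prob(B))              # product of marginals: P(A)P(B)
```

Both lines print 1/6, as the events depend on different rolls.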
2.7.2 Interpreting independence
Intuitively, events \(A\) and \(B\) are independent if knowing whether or not one occurs does not change the probability of the other. The following lemma54 formalizes this idea.
Lemma 2.9 (Equivalent conditions for independence of two events) For a probability space with probability measure \(\textrm{P}\) the following statements are equivalent55 for events \(A\) and \(B\) (with \(0<\textrm{P}(A)<1\) and \(0<\textrm{P}(B)<1\)).
\[\begin{align*} \text{$A$ and $B$} & \text{ are independent} & &\\ \textrm{P}(A \cap B) & = \textrm{P}(A)\textrm{P}(B) & & \\ \textrm{P}(A|B) & = \textrm{P}(A) & & \\ \textrm{P}(A|B) & = \textrm{P}(A|B^c) & & \\ \textrm{P}(B|A) & = \textrm{P}(B) & & \\ \textrm{P}(B|A) & = \textrm{P}(B|A^c) & &\\ \textrm{P}(A^c \cap B) & = \textrm{P}(A^c)\textrm{P}(B) & & \text{that is, $A^c$ and $B$ are independent}\\ \textrm{P}(A \cap B^c) & = \textrm{P}(A)\textrm{P}(B^c) & & \text{that is, $A$ and $B^c$ are independent}\\ \textrm{P}(A^c \cap B^c) & = \textrm{P}(A^c)\textrm{P}(B^c) & & \text{that is, $A^c$ and $B^c$ are independent} \end{align*}\]
Independence concerns whether or not the occurrence of one event affects the probability of the other. Conditioning involves slicing and renormalizing; independence concerns whether the renormalized slice matches the original picture. Given two events it is not always obvious whether or not they are independent. When there is any doubt, be sure to check directly if one of the equivalent conditions for independence is true (that is, directly compute the left side and the right side and see if they are equal).
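To illustrate the direct check, here is a pair of events on two fair die rolls of my own choosing, where intuition alone might be unsure; computing both sides shows the events are dependent.

```python
from itertools import product
from fractions import Fraction

# Sample space: two rolls of a fair six-sided die.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event under equally likely outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] % 2 == 0          # first roll is even
B = lambda w: w[0] + w[1] >= 10      # sum of the rolls is at least 10

lhs = prob(lambda w: A(w) and B(w))  # P(A ∩ B)
rhs = prob(A) * prob(B)              # P(A)P(B)
print(lhs, rhs, lhs == rhs)          # 1/9 1/12 False
```

Since \(1/9 \neq 1/12\), the events are dependent: knowing the sum is large makes it more likely that the first roll was a high (hence possibly even) number.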
2.7.3 Independence is an assumption
Independence is often a reasonable assumption based on the physical properties of the random phenomenon. But remember that it is an assumption, which might or might not match reality.
Remember, independence is a statement about probabilities, not outcomes themselves. Given two events it is not always obvious whether or not they are independent.
Independence is determined by the probability measure. Events that are independent under one probability measure might not be independent under another. The probability measure represents all the assumptions about the random phenomenon. We often incorporate independence assumptions when specifying the probability measure. However, whether or not independence is a valid assumption depends on the underlying random phenomenon.
Be sure to make a distinction between assumption and observation. For example, flip a coin some number of times. It might be reasonable to assume the coin is fair and flips are independent. In this case, the probability that the next flip lands on heads is 1/2 regardless of what you observed on the previous flips. However, if you flip a coin twenty times and it lands on heads each time, this might cast doubt on your assumption that the coin is fair.
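Under the fair-coin, independent-flips assumption, the probability of twenty heads in a row follows from the multiplication rule, which is exactly why observing it should cast doubt on the assumption.

```python
from fractions import Fraction

# Under the assumption of a fair coin and independent flips, the
# multiplication rule gives P(20 heads in a row) = (1/2)^20.
p = Fraction(1, 2) ** 20
print(p, float(p))  # 1/1048576, about one in a million
```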
2.7.4 Independence of multiple events
Events \(A_1, A_2, A_3, \ldots\) are independent if:
- any pair of events \(A_i, A_j, (i \neq j)\) satisfies \(\textrm{P}(A_i\cap A_j)=\textrm{P}(A_i)\textrm{P}(A_j)\),
- and any triple of events \(A_i, A_j, A_k\) (distinct \(i,j,k\)) satisfies \(\textrm{P}(A_i\cap A_j\cap A_k)=\textrm{P}(A_i)\textrm{P}(A_j)\textrm{P}(A_k)\),
- and any quadruple of events \(A_i, A_j, A_k, A_\ell\) (distinct \(i,j,k,\ell\)) satisfies \(\textrm{P}(A_i\cap A_j\cap A_k \cap A_\ell)=\textrm{P}(A_i)\textrm{P}(A_j)\textrm{P}(A_k)\textrm{P}(A_\ell)\),
- and so on.
Intuitively, a collection of events is independent if knowing whether or not any combination of the events in the collection occur does not change the probability of any other event in the collection.
In particular, three events \(A\), \(B\), \(C\) are independent if and only if all of the following are true \[ \scriptsize{ \textrm{P}(A\cap B) = \textrm{P}(A)\textrm{P}(B), \quad \textrm{P}(A\cap C) = \textrm{P}(A)\textrm{P}(C),\quad \textrm{P}(B\cap C) = \textrm{P}(B)\textrm{P}(C),\quad \textrm{P}(A\cap B\cap C) = \textrm{P}(A)\textrm{P}(B)\textrm{P}(C) } \]
Equivalently, it can be shown that three events \(A\), \(B\), \(C\) are independent if and only if all of the following56 are true.
\[\begin{align*} & \textrm{P}(A| B) = \textrm{P}(A), \quad \textrm{P}(A| C) = \textrm{P}(A), \quad \textrm{P}(B|A) = \textrm{P}(B), \quad \textrm{P}(B| C) = \textrm{P}(B), \quad \textrm{P}(C|A) = \textrm{P}(C),\\ & \textrm{P}(C|B) = \textrm{P}(C), \quad \textrm{P}(A| B\cap C) = \textrm{P}(A), \quad \textrm{P}(B|A\cap C) = \textrm{P}(B), \quad \textrm{P}(C|A\cap B) = \textrm{P}(C) \end{align*}\]
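The pairwise conditions alone are not enough. The classic example below (two fair coin flips; my illustration, not the text's) has three events that satisfy every pairwise condition but fail the triple-intersection condition, so they are not independent.

```python
from itertools import product
from fractions import Fraction

# Sample space: two independent flips of a fair coin.
omega = list(product("HT", repeat=2))

def prob(event):
    """Exact probability of an event under equally likely outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == "H"       # first flip lands heads
B = lambda w: w[1] == "H"       # second flip lands heads
C = lambda w: w[0] == w[1]      # the two flips match

# All three pairwise product conditions hold...
pairwise = all(prob(lambda w, e=e, f=f: e(w) and f(w)) == prob(e) * prob(f)
               for e, f in [(A, B), (A, C), (B, C)])
# ...but the triple condition fails: P(A ∩ B ∩ C) = 1/4, not (1/2)^3 = 1/8.
triple = (prob(lambda w: A(w) and B(w) and C(w))
          == prob(A) * prob(B) * prob(C))
print(pairwise, triple)  # True False
```

Intuitively, knowing any one of \(A\), \(B\), \(C\) tells you nothing about any other one, but knowing any two of them determines the third.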
2.7.5 Independence of random variables
We have focused on independence of events, but random variables can also be independent.
Two random variables are independent if any event involving one of the random variables is independent of any event involving the other. Roughly, two random variables are independent if knowing the value of one does not change the distribution of the other.
In terms of distributions, two random variables \(X\) and \(Y\) are independent if and only if:
- Their joint distribution is the product of their marginal distributions.
- The conditional distribution of \(X\) given the value of \(Y\) is equal to the marginal distribution of \(X\).
- The conditional distribution of \(Y\) given the value of \(X\) is equal to the marginal distribution of \(Y\).
Figure 2.28 displays mosaic plots of the distributions of the two independent discrete random variables of Example 2.68. Notice that the conditional distribution of \(X\) is the same for each value of \(Y\), and vice versa.
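We don't reproduce Example 2.68 here, but the chapter's running dice example provides a quick contrast: checking whether the joint distribution equals the product of the marginals shows that the sum and the max of two four-sided die rolls are dependent (the sketch below is mine).

```python
from itertools import product
from fractions import Fraction
from collections import Counter

# Two rolls of a fair four-sided die: 16 equally likely pairs.
omega = list(product(range(1, 5), repeat=2))
N = len(omega)

# X = the sum of the two rolls, Y = the larger of the two rolls.
pXY = Counter((a + b, max(a, b)) for a, b in omega)  # joint counts
pX = Counter(a + b for a, b in omega)                # marginal counts of X
pY = Counter(max(a, b) for a, b in omega)            # marginal counts of Y

# Independent iff joint pmf = product of marginal pmfs for every (x, y).
indep = all(Fraction(pXY[(x, y)], N) == Fraction(pX[x], N) * Fraction(pY[y], N)
            for x in pX for y in pY)
print(indep)  # False: e.g., sum = 2 forces max = 1
```

By contrast, running the same check with \(X\) = first roll and \(Y\) = second roll returns `True`.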
2.7.6 Using independence
Remember the general multiplication rule involves successive conditional probabilities \[ \textrm{P}(A_1\cap A_2 \cap A_3 \cap A_4 \cap \cdots) = \textrm{P}(A_1)\textrm{P}(A_2|A_1)\textrm{P}(A_3|A_1\cap A_2)\textrm{P}(A_4|A_1\cap A_2 \cap A_3)\cdots \] In problems with complicated relationships, determining joint and conditional probabilities can be difficult.
But when events are independent, the multiplication rule simplifies greatly. \[ \textrm{P}(A_1 \cap A_2 \cap A_3 \cap \cdots) = \textrm{P}(A_1)\textrm{P}(A_2)\textrm{P}(A_3)\cdots \quad \text{if $A_1, A_2, A_3, \ldots$ are independent} \]
When a problem involves independence, you will want to take advantage of it. Work with “and” events whenever possible in order to use the multiplication rule. In particular, the complement rule is often useful in problems that ask for “the probability of at least one…,” which on the surface involves unions (OR). It is usually more convenient to compute this as one minus “the probability of none…”; the latter probability involves intersections (AND), to which the multiplication rule applies. Don’t forget to actually use the complement rule to get back to the original probability of interest!
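For example (a standard illustration, not from the text), the probability of at least one six in four independent rolls of a fair die:

```python
from fractions import Fraction

# P(at least one six in four independent rolls of a fair die)
# = 1 - P(no sixes) = 1 - (5/6)^4, by the complement rule and independence.
p_no_six = Fraction(5, 6) ** 4
p_at_least_one = 1 - p_no_six
print(p_at_least_one, float(p_at_least_one))  # 671/1296, about 0.518
```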
2.7.7 Exercises
Exercise 2.24 Maya is a basketball player who makes 40% of her three-point field goal attempts. Suppose that at the end of every practice session, she attempts three-pointers until she makes one and then stops. Let \(X\) be the total number of shots she attempts in a practice session. Assume shot attempts are independent, each with probability 0.4 of being successful.
- What are the possible values that \(X\) can take? Is \(X\) discrete or continuous?
- Compute and interpret \(\text{P}(X=1)\).
- Compute and interpret \(\text{P}(X=2)\).
- Compute and interpret \(\text{P}(X=3)\).
- Compute \(\text{P}(X>3)\) without summing the values from the previous parts. Hint: what needs to be true about the first 3 attempts for \(X > 3\)?
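One way to check your answers without giving them away is simulation. This sketch (the seed and session count are arbitrary choices of mine) estimates the probabilities empirically under the stated assumptions, for comparison with your exact values.

```python
import random

random.seed(2024)

def shots_until_make(p=0.4):
    """Simulate one practice session: attempt shots until the first make."""
    attempts = 1
    while random.random() >= p:  # each attempt succeeds with probability p
        attempts += 1
    return attempts

sims = [shots_until_make() for _ in range(100_000)]
# Empirical estimates to compare against your exact answers:
for k in [1, 2, 3]:
    print(k, sum(x == k for x in sims) / len(sims))
print(">3", sum(x > 3 for x in sims) / len(sims))
```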
Exercise 2.25 A very large petri dish starts with a single microorganism. After one minute, the microorganism either splits into two with probability \(s\), or dies. All subsequent microorganisms behave in the same way — splitting into two or dying after each minute — independently of each other.
- If \(s=3/4\), what is the probability that the population eventually goes extinct? (Hint: condition on the first step.)
- Compute the probability that the population eventually goes extinct as a function of \(s\). For what values of \(s\) is the extinction probability 1?
Exercise 2.27 Consider a “best-of-5” series of games between two teams: games are played until one of the teams has won 3 games (requiring at most 5 games total). Suppose one team, team A, is better than the other, having a 0.55 probability of winning any particular game. Assume the results of the games are independent (and ignore home advantage, etc.). Let \(X\) represent the number of games played in the series. Hint: It’s helpful to first construct a two-way table of probabilities with the number of games played and which team wins, and then use it to answer the following questions. It will also help to list some outcomes, like AABA (team A wins games 1, 2, and 4, and B wins game 3).
- Compute the probability that team A wins the series in 3 games.
- Compute the probability that the series ends in 3 games.
- Compute the probability that team A wins the series.
- Are the events “team A wins the series” and “the series ends in 3 games” independent? Explain by comparing relevant probabilities.
- Find the distribution of \(X\), the number of games played in the series.
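A brute-force enumeration can serve as a check on your two-way table (the structure here is my sketch, not the text's): list all \(2^5\) sequences of game winners, truncate each series when a team reaches 3 wins, and accumulate probabilities by (series winner, series length). Summing full-sequence probabilities works because the extensions of each truncated prefix sum to the prefix's probability.

```python
from itertools import product
from collections import defaultdict

p = 0.55  # P(team A wins any particular game); games are independent

dist = defaultdict(float)  # keyed by (series winner, number of games)
for seq in product("AB", repeat=5):
    # Probability of this full 5-game sequence under independence.
    pr = 1.0
    for g in seq:
        pr *= p if g == "A" else 1 - p
    # Truncate the series at the first team to reach 3 wins.
    a = b = 0
    for n, g in enumerate(seq, start=1):
        a += g == "A"
        b += g == "B"
        if a == 3 or b == 3:
            break
    dist[("A" if a == 3 else "B", n)] += pr

print({k: round(v, 6) for k, v in sorted(dist.items())})
```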
Exercise 2.26 Continuing Exercise 2.20. Now we’ll consider multiple matches. Assume that the results of matches are conditionally independent given the best player.
- Suppose that A beats B in the first match, and also that A beats C in the second match. Construct a Bayes table to compute your posterior probability that each of A, B, C is best given the results of the first two matches. Use as the prior your posterior probabilities from part 1 of Exercise 2.20. Explain how your probabilities changed, and why that makes sense.
- Now suppose that after A beats B in the first match and A beats C in the second match, then B beats C in the third match. Construct a Bayes table to compute your posterior probability that each of A, B, C is best given the results of the first three matches. Use as the prior your posterior probabilities from the previous part. Explain how your probabilities changed, and why that makes sense.
- In the previous parts we updated posterior probabilities after each match. What if we waited until the results of all three matches? Construct a Bayes table to find your posterior probability that each of A, B, C is best given the results of the first three matches (A beats B, A beats C, B beats C). Use your original prior probabilities from Exercise 2.20 (0.5 for A, 0.35 for B, 0.15 for C). The likelihood should now reflect the results of the three matches.
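The sequential updates above can be organized with a small helper function (my sketch, using exact fractions): the posterior after each match becomes the prior for the next. The first update reproduces the posteriors 0.6349, 0.2222, 0.1429 quoted in Exercise 2.21.

```python
from fractions import Fraction

def bayes_table(prior, likelihood):
    """Posterior is proportional to prior times likelihood, renormalized."""
    products = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(products.values())
    return {h: pr / total for h, pr in products.items()}

# Prior probabilities from Exercise 2.20.
prior = {"A": Fraction(1, 2), "B": Fraction(7, 20), "C": Fraction(3, 20)}

# Likelihood of "A beats B" given who is the best player.
lik_A_beats_B = {"A": Fraction(2, 3), "B": Fraction(1, 3), "C": Fraction(1, 2)}
post1 = bayes_table(prior, lik_A_beats_B)

# Update again after "A beats C", using post1 as the new prior.
lik_A_beats_C = {"A": Fraction(2, 3), "B": Fraction(1, 2), "C": Fraction(1, 3)}
post2 = bayes_table(post1, lik_A_beats_C)

print({h: float(pr) for h, pr in post1.items()})
print({h: float(pr) for h, pr in post2.items()})
```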
2.8 Chapter exercises
Why four-sided? Simply to make the number of possibilities a little more manageable. Rolling a four-sided die twice yields 16 possible pairs, while rolling a six-sided die yields 36 possible pairs.↩︎
There is no one set of universally agreed on notation, but \(\Omega\) is commonly used. It is also common practice to use uppercase and lowercase letters to denote different objects, like \(\Omega\) versus \(\omega\).↩︎
We could have written the sample space as the Cartesian product \(\Omega = \{1, 2, 3, 4\} \times\{1, 2, 3, 4\}\), where the first \(\{1, 2, 3, 4\}\) set in the product represents the result of the first roll (and similarly for the second). But this Cartesian product still represents a single set of ordered pairs, and it is that single set which is the sample space corresponding to outcomes of the pair of rolls.↩︎
Why have we started with [0, 1] and not some other continuous interval? Because probabilities take values in \([0, 1]\). We will see why this is useful in more detail later.↩︎
Mathematically we can write the sample space as \([0,60]\times [0,60]=[0,60]^2\), the Cartesian product \(\{(x, y): x \in [0, 60], y \in [0, 60]\}\), the set of ordered pairs whose components take values in \([0, 60]\).↩︎
We could also try \([0, m]\) where \(m\) is some large dollar amount providing an upper bound on the maximum possible salary. But we would need to be sure that \(m\) is large enough so that all possible outcomes are in the sample space \([0, m]\). Without knowing this bound in advance, it is convenient to just choose the unbounded interval \([0, \infty)\). There is really no harm in making the sample space bigger than it needs to be, but you can run into problems if you make it too small.↩︎
Mathematically this sample space can be written as \(\Omega=\{1, 2, 3\}^\infty\).↩︎
Technically, \(\mathcal{F}\) is a \(\sigma\)-field of subsets of \(\Omega\): \(\mathcal{F}\) contains \(\Omega\) and is closed under countably many elementary set operations (complements, unions, intersections). This requirement ensures that if \(A\) and \(B\) are “events of interest”, then so are \(A\cup B\), \(A\cap B\), and \(A^c\). While this level of technical detail is not needed, we prefer to introduce the idea of a “collection of events” now since a probability measure is a function whose input is an event (set) rather than an outcome (point).↩︎
A \(d\)-dimensional random vector \(V\) maps sample space outcomes to \(d\)-dimensional vectors, \(V:\Omega \mapsto \mathbb{R}^d\). The output of a random vector is a vector (or tuple) of numbers.↩︎
Throughout, we use \(g\) to denote a generic function, and reserve \(f\) to represent a probability density function (which we will encounter later). Likewise, we represent a generic function argument (or “dummy variable”) with \(u\), since \(x\) is often used to represent possible values of a random variable \(X\). In the context of a random variable, \(x\) typically represents the output of the function \(X\) rather than the input (which is a sample space outcome \(\omega\).)↩︎
In Example 2.17 sample space outcomes are pairs of rolls. If we denote a generic outcome as \(\omega = (\omega_1, \omega_2)\) then \(X(\omega) = X((\omega_1, \omega_2)) = \omega_1 + \omega_2\). Similarly, \(Y(\omega) = Y((\omega_1, \omega_2)) = \max(\omega_1, \omega_2)\). But we don’t need this level of technical detail; defining \(X\) and \(Y\) in words is sufficient.↩︎
\(Y(\omega) = g(X(\omega))\) so \(Y\) maps \(\Omega\) to \(\mathbb{R}\) via the composition of the functions \(g\) and \(X\); that is, \(Y=g\circ X\) where \((g\circ X):\Omega\mapsto \mathbb{R}\)↩︎
Orange you glad I didn’t say banana?↩︎
See the inclusion-exclusion principle↩︎
And \(\{X = 3\}\) itself is short for \(\{\omega\in\Omega:X(\omega) = 3\}\).↩︎
A probability measure is a set function; its input is a set and its output is a number.↩︎
It’s the number of events that must be countable. The events themselves can be uncountable sets like intervals.↩︎
That the probability of each outcome must be 1/4 when there are four equally likely outcomes follows from the axioms, by writing \(\{1, 2, 3, 4\} = \{1\}\cup\{2\}\cup \{3\}\cup \{4\}\), a union of disjoint sets, and applying countable additivity and \(\textrm{P}(\Omega)=1\). But we don’t need this level of technical detail; our intuition tells us the probability of each of four equally likely outcomes is 1/4.↩︎
Probabilities are always defined for events (sets). When we say loosely “the probability of an outcome \(\omega\)” we really mean the probability of the event \(\{\omega\}\) consisting of the single outcome \(\omega\). In this example \(\textrm{P}(\{1\})=\textrm{P}(\{2\})=\textrm{P}(\{3\})=\textrm{P}(\{4\})=1/4\).↩︎
\(\Omega = \{1, 2, 3\} \cup \{4\}\), a union of disjoint events, so \(1 = \textrm{Q}(\Omega) = \textrm{Q}(\{1, 2, 3\}) + \textrm{Q}(\{4\})\).↩︎
Because he’s solo.↩︎
It doesn’t really matter if we round or truncate to the nearest minute, but we’re truncating so we don’t treat 0 differently than the other values (technically only times in the first 30 seconds, not minute, round to 0).↩︎
This is one reason why probabilities are defined directly for events and not outcomes.↩︎
Proof. Since \(\Omega = A \cup A^c\) and \(A\) and \(A^c\) are disjoint the axioms imply that \(1=\textrm{P}(\Omega) = \textrm{P}(A \cup A^c) = \textrm{P}(A) + \textrm{P}(A^c)\).↩︎
Proof. If \(A \subseteq B\) then \(B = A \cup (B \cap A^c)\). Since \(A\) and \((B \cap A^c)\) are disjoint, \(\textrm{P}(B) = \textrm{P}(A) + \textrm{P}(B \cap A^c) \ge \textrm{P}(A)\).↩︎
The proof is easiest to see by considering a picture like the one in Figure 2.8.↩︎
\(A = A\cap \Omega = A\cap(C_1 \cup C_2 \cup \cdots) = (A\cap C_1)\cup(A\cap C_2)\cup \cdots\). The \(A\cap C_i\)’s are disjoint since the \(C_i\)’s are, and the result follows from countable additivity.↩︎
In this example it is logically possible for \(\textrm{P}(C \cap D)\) to be 0, but that’s not always true. For example, if \(\textrm{P}(A) = 0.9\) and \(\textrm{P}(B) = 0.8\), then \(\textrm{P}(A \cap B)\) must be at least 0.7 so that \(\textrm{P}(A \cup B)\le 1\).↩︎
A probability space is usually defined as a triple \((\Omega, \mathcal{F}, \textrm{P})\), where \(\Omega\) is the sample space, \(\mathcal{F}\) is a \(\sigma\)-field of subsets of \(\Omega\) representing the collection of events of interest, and \(\textrm{P}\) is a probability measure. Given that many events of interest involve random variables, we also include random variables in the model.↩︎
The values in this problem are based on an April 2021 report by the Pew Research Center.↩︎
Based on data from the U.S. Census Bureau↩︎
We generally encourage you to use two-way tables of whole number counts, but we’re using probabilities here to motivate the definition of conditional probability.↩︎
We have seen that “equals to” events involving continuous random variables have probability 0. We will discuss some issues related to conditioning on the value of a continuous random variable later.↩︎
The value only differs from the 0.24 in Example 2.46 due to rounding.↩︎
The value only differs from the 0.5417 in Example 2.46 due to rounding.↩︎
In computing these probabilities we have unconsciously applied “Bayes rule”, which we will discuss in more detail later.↩︎
You should really check out this birthday problem demo from The Pudding.↩︎
Which isn’t quite true. However, a non-uniform distribution of birthdays only increases the probability that at least two people have the same birthday. To see that, think of an extreme case like if everyone were born in September.↩︎
Sometimes students mistake this for \((1/365)^2\), but \((1/365)^2\) would be the probability that person 1 and person 2 both have a particular birthday, like the probability that both are born on January 1. There are \(365^2\) possible (person 1, person 2) birthday pairs, of which 365 — (Jan 1, Jan 1), (Jan 2, Jan 2), etc — result in the same birthday, so the probability of sharing a birthday is \(365/365^2 = 1/365\).↩︎
Proof: start with Lemma 2.5 and use the multiplication rule to write \(\textrm{P}(A \cap C_1)=\textrm{P}(A|C_1)\textrm{P}(C_1)\), etc.↩︎
They should be exactly the same; any differences are due to rounding.↩︎
This section only covers Bayes’ rule for events. We’ll see Bayes’ rule for distributions of random variables later. But the ideas are analogous.↩︎
We’re using “hypothesis” in the sense of a general scientific hypothesis, not necessarily a statistical null or alternative hypothesis.↩︎
The symbol \(\propto\) means “is proportional to”.↩︎
Wouldn’t it also be a mistake not to consider other animals, like cows? Yes, but that’s also a mistake about prior probabilities. If you forget to include an animal like a cow then you’re assigning it a prior probability of 0, so its posterior probability will automatically be 0 regardless of the likelihood.↩︎
You still might be thinking: what about cows? Or dogs? Or moose? Or horses? Cows would have a high prior probability, and they are often very large, hairy, and black. So it depends on how likely it is for a cow to be running. Depending on the prior probabilities and likelihoods, a cow (or dog or moose or horse) might end up with an even higher posterior probability than a bear. In any case, the point is that a gorilla should have a posterior probability of basically 0. “It’s a gorilla” was not a great initial proclamation, but maybe “it’s probably just a cow (or dog/moose/horse)” would have been a fine conclusion.↩︎
Conditioning on event \(E\) can also be viewed as a restriction of the sample space from \(\Omega\) to \(E\). However, we prefer to keep the sample space as \(\Omega\) and only view conditioning as a change in probability measure. In this way, we can consider conditioning on various events as representing different probability measures all defined for the same collection of events corresponding to the same sample space.↩︎
Remember: probabilities are assigned to events, so we are speaking loosely when we say probabilities of outcomes.↩︎
Thanks to Allan Rossman for this example.↩︎
Please replace A, B, and C with your favorite names. Possible choices: Ahsoka, Boba, Cassian. Ant-Man, Black Panther, Captain America. Arthur Ashe, Bjorn Borg, Chris Evert.↩︎
Technically, we should say “\(\textrm{P}\)-independent”; see Section 2.7.3↩︎
The proof follows from the definitions of independence and conditional probability and properties of a probability measure. For example, \(\textrm{P}(A) = \textrm{P}(A\cap B) + \textrm{P}(A \cap B^c)\) so \(\textrm{P}(A \cap B^c) = \textrm{P}(A) - \textrm{P}(A \cap B)\). If \(A\) and \(B\) are independent then \(\textrm{P}(A \cap B^c) = \textrm{P}(A) - \textrm{P}(A)\textrm{P}(B) = \textrm{P}(A)(1-\textrm{P}(B)) = \textrm{P}(A)\textrm{P}(B^c)\), so \(A\) and \(B^c\) are independent.↩︎
That is, if one statement is true then they all are true; if one statement is false, then they all are false.↩︎
Some of these conditions are redundant. For example, \(\textrm{P}(A|B)=\textrm{P}(A)\) if and only if \(\textrm{P}(B|A)=\textrm{P}(B)\) so technically only one of those conditions needs to be verified.↩︎