1.2.3. Compute Bayes rule

The Bayes classification rule is derived from Bayes’ Theorem. Below is the probabilistic description of the data generating process for Example 1. The first line defines the distribution of Y, while the following two lines describe the conditional distribution of X given Y. Multiplying them provides the joint distribution of X and Y.

$$\begin{aligned}
Y &\sim \mathrm{Bern}(p),\\
X \mid Y=0 &\sim N(\mu_0, \sigma^2 I_2),\\
X \mid Y=1 &\sim N(\mu_1, \sigma^2 I_2).
\end{aligned}$$
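
To make this generating process concrete, here is a minimal simulation sketch in Python/NumPy; the specific values of $p$, $\mu_0$, $\mu_1$, and $\sigma$ below are illustrative placeholders, not the ones used in Example 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (placeholders, not the actual Example 1 values)
p, sigma = 0.5, 1.0
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 2.0])

n = 500
# Step 1: draw Y ~ Bern(p)
y = rng.binomial(1, p, size=n)
# Step 2: draw X | Y = k ~ N(mu_k, sigma^2 I_2)
centers = np.where(y[:, None] == 1, mu1, mu0)
X = centers + sigma * rng.standard_normal((n, 2))
```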

For predictions, we care about Y given X, the conditional probability. Since Y takes only two possible values, knowing the probability of Y = 1 given X is enough for us to form our best prediction for Y.

Before applying Bayes’ theorem to compute this conditional probability, I want to clarify some terms.

We often discuss two types of random variables in class:

  • Discrete Random Variables: Like Y in our example, we describe its distribution using a probability mass function (PMF).

  • Continuous Random Variables: Like X, we describe its distribution using a probability density function (PDF). For example, the PDF of a normal distribution is bell-shaped.

The joint distribution of X and Y here presents a challenge because it’s neither purely discrete nor continuous. For discrete variables, we can discuss probabilities of specific values (e.g., Y=1). For continuous variables, however, the probability of a specific value is always zero. In the derivation below, I’ll treat X as discrete, utilizing its density function as the corresponding PMF; this can be justified by discretizing a continuous variable.
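
To make the discretization argument explicit: for a small bin width $\delta$, the probability that $X$ lands in a tiny square of side $\delta$ around $x$ within class 1 is approximately

$$P\big(Y=1,\ X \approx x\big) \;\approx\; P(Y=1)\, f(x \mid Y=1)\, \delta^2,$$

and since the same factor $\delta^2$ appears in every term of the ratio below, it cancels, so we can safely plug densities in wherever a PMF would appear.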

The conditional probability is calculated as the joint divided by the marginal.

$$\begin{aligned}
P(Y=1 \mid X=x) &= \frac{P(Y=1, X=x)}{P(X=x)}\\[4pt]
&= \frac{P(Y=1, X=x)}{P(Y=1, X=x) + P(Y=0, X=x)}\\[4pt]
&= \frac{P(Y=1)\,P(X=x \mid Y=1)}{P(Y=1)\,P(X=x \mid Y=1) + P(Y=0)\,P(X=x \mid Y=0)}\\[4pt]
&= \frac{p\left(\tfrac{1}{\sqrt{2\pi\sigma^2}}\right)^2 \exp\!\left\{-\tfrac{\|x-\mu_1\|^2}{2\sigma^2}\right\}}{p\left(\tfrac{1}{\sqrt{2\pi\sigma^2}}\right)^2 \exp\!\left\{-\tfrac{\|x-\mu_1\|^2}{2\sigma^2}\right\} + (1-p)\left(\tfrac{1}{\sqrt{2\pi\sigma^2}}\right)^2 \exp\!\left\{-\tfrac{\|x-\mu_0\|^2}{2\sigma^2}\right\}}\\[4pt]
&= \left[1 + \exp\!\left\{\frac{1}{2\sigma^2}\left(\|x-\mu_1\|^2 - \|x-\mu_0\|^2\right) - \log\frac{p}{1-p}\right\}\right]^{-1}
\end{aligned}$$
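
A minimal numerical sketch of the last expression (illustrative Python, continuing the placeholder parameters from the simulation above):

```python
import numpy as np

def posterior_prob_class1(x, mu0, mu1, sigma, p):
    """P(Y = 1 | X = x) for a single 2-D point x, following the last line above."""
    # Exponent: (||x - mu1||^2 - ||x - mu0||^2) / (2 sigma^2) - log(p / (1 - p))
    a = (np.sum((x - mu1) ** 2) - np.sum((x - mu0) ** 2)) / (2 * sigma ** 2)
    a -= np.log(p / (1 - p))
    return 1.0 / (1.0 + np.exp(a))
```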

Simplifying yields the closed-form expression in the last line above, which lets us establish the optimal decision rule (i.e., the Bayes rule) for classification:

$$P(Y=1 \mid X=x) > 0.5 \;\Longleftrightarrow\; \frac{1}{2\sigma^2}\left(\|x-\mu_1\|^2 - \|x-\mu_0\|^2\right) < \log\frac{p}{1-p}$$

If $p = 0.5$, then $\log\frac{p}{1-p} = 0$, and we predict $Y=1$ whenever $\|x-\mu_1\|^2 < \|x-\mu_0\|^2$. That is, we predict $x$ to be class 1 if it is closer to the center of class 1 and class 0 if it is closer to the center of class 0.
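
As a sketch, the corresponding classifier just compares the two squared distances against the log prior odds; with $p = 0.5$ it reduces to a nearest-center rule (illustrative Python, same placeholder notation as above):

```python
import numpy as np

def bayes_rule(x, mu0, mu1, sigma, p):
    """Predict 1 iff P(Y = 1 | X = x) > 0.5, i.e. the inequality above holds."""
    lhs = (np.sum((x - mu1) ** 2) - np.sum((x - mu0) ** 2)) / (2 * sigma ** 2)
    return int(lhs < np.log(p / (1 - p)))
```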

The decision rule looks like a quadratic function of x, but it can be simplified to a linear function of x (i.e., the optimal decision boundary for Example 1 is linear):

$$\begin{aligned}
\|x-\mu_1\|^2 - \|x-\mu_0\|^2 &= \|x\|^2 - 2x^t\mu_1 + \|\mu_1\|^2 - \left(\|x\|^2 - 2x^t\mu_0 + \|\mu_0\|^2\right)\\
&= \|\mu_1\|^2 - \|\mu_0\|^2 - 2x^t(\mu_1 - \mu_0)
\end{aligned}$$
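
Plugging this identity back into the inequality above, the Bayes rule can be written explicitly as a linear rule in x:

$$\text{predict } Y=1 \quad\text{iff}\quad x^t(\mu_1 - \mu_0) \;>\; \frac{\|\mu_1\|^2 - \|\mu_0\|^2}{2} - \sigma^2 \log\frac{p}{1-p},$$

so the decision boundary is a hyperplane (a line in two dimensions).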

You will be asked to implement the Bayes rule for Example 2, where X given Y = 1 follows a mixture of 10 normal distributions. The derivation remains roughly the same, except that you need to replace the PDF of X given Y with an average of 10 normal densities.
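
As a rough sketch of that replacement (illustrative Python; the component centers passed in and the equal 1/10 weights are assumptions about how Example 2 is set up, not taken from these notes, and if only one class is a mixture you would keep the single-normal density from before for the other class):

```python
import numpy as np

def mixture_density(x, centers, sigma):
    """Average of K normal densities N(m_k, sigma^2 I_2), evaluated at a 2-D point x."""
    sq_dist = np.sum((centers - x) ** 2, axis=1)   # ||x - m_k||^2 for each component k
    comps = np.exp(-sq_dist / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    return comps.mean()                            # equal 1/K weights (assumed)

def posterior_prob_class1_mixture(x, centers0, centers1, sigma, p):
    """Same Bayes computation, with each class-conditional PDF replaced by a mixture."""
    f1 = mixture_density(x, centers1, sigma)       # f(x | Y = 1)
    f0 = mixture_density(x, centers0, sigma)       # f(x | Y = 0)
    return p * f1 / (p * f1 + (1 - p) * f0)
```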