Consider a simple example: \(y = 1\) if \(x_1 + x_2 > 1\), and \(y = 0\) otherwise.
x = matrix(runif(20), 10, 2)
y = ifelse(x[,1] + x[,2] > 1, 1, 0)
plot(x[,1], x[,2], type="n", xlab="", ylab="")
text(x[,1], x[,2], y)
myfit = glm(y ~ x, family=binomial)
summary(myfit)
When the data are perfectly separable, maximum likelihood estimation for the logistic model does not converge: the likelihood can always be increased by scaling the coefficients up, so components of the MLE of \(\beta\) tend towards positive or negative infinity.
However, despite this issue, the separating line remains well-defined, allowing us to use the fitted model for predictions. This is because the separating condition, such as \(3 x_1 + x_2 > 2,\) remains consistent even if we scale the coefficients. For instance, \(300 x_1 + 100 x_2 > 200\) expresses the same line.
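We can see this in the toy example above: even though the coefficient estimates blow up, the decision rule they define is stable, and rescaling the coefficient vector does not change the predicted classes. A minimal sketch reusing the simulated x, y, and myfit from above (the threshold 0 is on the log-odds scale):

# Fitted probabilities are pushed toward 0 and 1, but the implied
# classifier is unchanged if we rescale the coefficients.
eta = cbind(1, x) %*% coef(myfit)       # linear predictor (log-odds)
table(y, as.numeric(eta > 0))           # perfect separation on the training data
table(y, as.numeric((100 * eta) > 0))   # rescaling changes nothing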
Next we discuss fitting a logistic regression model in R and interpreting the coefficients and other outcomes. For illustration, we use the Coronary Risk‐Factor Study data, which consists of 462 males between the ages of 15 and 64 from a heart-disease high-risk region of the Western Cape, South Africa.
The response variable, chd, is binary (1 for disease presence and 0 for absence), and there are nine covariates, mostly numerical, except for “family history,” which is binary.
= "https://web.stanford.edu/~hastie/ElemStatLearn//datasets/SAheart.data"
url = read.table(url, sep=",",head=T, row.names=1) heart
head(heart)
The syntax for fitting a logistic regression model is similar to that of a linear model (lm). You use glm, which can handle various generalized linear models, including Poisson regression; in this lecture, however, we focus on logistic regression.
heartfull = glm(chd ~ ., data=heart, family=binomial)
After fitting the logistic regression model, you can run a summary. The first column of the coefficient matrix contains the estimated coefficients: the intercept and the coefficient associated with each feature. Their interpretation parallels that of an ordinary linear model; for instance, increasing sbp by one unit increases the response by 0.0065.
However, in logistic regression, the response is the log-odds, \(\log (p / (1 - p))\), so the interpretation of coefficients involves exponentiation: increasing \(X_1\) by one unit multiplies the odds by \(\exp(\beta_1)\), since
\[ \log \frac{p}{1-p} = \beta_0 + (X_1 + 1) \beta_1 + \cdots + X_p \beta_p = (\beta_0 + X_1 \beta_1 + \cdots + X_p \beta_p) + \beta_1 \]
If \(\beta_1\) is positive, \(\exp( \beta_1) > 1,\) indicating an increase in odds.
If \(\beta_1\) is negative, \(\exp(\beta_1) < 1,\) indicating a decrease in odds.
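As a quick check on this interpretation, we can exponentiate the fitted coefficients of heartfull to read them directly as odds ratios; this is just a convenience computation, not part of the default output.

# exp(beta_j): multiplicative change in the odds of chd per one-unit
# increase in the j-th predictor
round(exp(coef(heartfull)), 3)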
summary(heartfull)
##
## Call:
## glm(formula = chd ~ ., family = binomial, data = heart)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7781 -0.8213 -0.4387 0.8889 2.5435
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.1507209 1.3082600 -4.701 2.58e-06 ***
## sbp 0.0065040 0.0057304 1.135 0.256374
## tobacco 0.0793764 0.0266028 2.984 0.002847 **
## ldl 0.1739239 0.0596617 2.915 0.003555 **
## adiposity 0.0185866 0.0292894 0.635 0.525700
## famhistPresent 0.9253704 0.2278940 4.061 4.90e-05 ***
## typea 0.0395950 0.0123202 3.214 0.001310 **
## obesity -0.0629099 0.0442477 -1.422 0.155095
## alcohol 0.0001217 0.0044832 0.027 0.978350
## age 0.0452253 0.0121298 3.728 0.000193 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 472.14 on 452 degrees of freedom
## AIC: 492.14
##
## Number of Fisher Scoring iterations: 5
The remaining columns of the summary table provide information similar to that in linear regression: standard errors, z-values, and p-values. Logistic regression is fit by iteratively reweighted least squares; from the final weighted least-squares fit, standard errors for the coefficients are computed, which give the z-values, and two-sided p-values based on the normal distribution determine the significance of each coefficient.
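As an illustration, the z-values and two-sided p-values in the table can be reproduced by hand from the estimates and standard errors; a short sketch:

# Reconstruct z-values and two-sided p-values from the coefficient matrix
est  = summary(heartfull)$coefficients[, "Estimate"]
se   = summary(heartfull)$coefficients[, "Std. Error"]
z    = est / se
pval = 2 * pnorm(-abs(z))   # two-sided p-values from the normal distribution
cbind(z, pval)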
Looking at the heart disease data, most coefficients are positive, meaning an increase in those features leads to an increased likelihood of heart disease. Interestingly, the coefficient for obesity is negative, implying that an increase in body fat (obesity) reduces the chances of heart disease. This might seem counter-intuitive, but it is explained by the correlations between features.
round(cor(data.matrix(heart)), dig=2)
## sbp tobacco ldl adiposity famhist typea obesity alcohol age
## sbp 1.00 0.21 0.16 0.36 0.09 -0.06 0.24 0.14 0.39
## tobacco 0.21 1.00 0.16 0.29 0.09 -0.01 0.12 0.20 0.45
## ldl 0.16 0.16 1.00 0.44 0.16 0.04 0.33 -0.03 0.31
## adiposity 0.36 0.29 0.44 1.00 0.18 -0.04 0.72 0.10 0.63
## famhist 0.09 0.09 0.16 0.18 1.00 0.04 0.12 0.08 0.24
## typea -0.06 -0.01 0.04 -0.04 0.04 1.00 0.07 0.04 -0.10
## obesity 0.24 0.12 0.33 0.72 0.12 0.07 1.00 0.05 0.29
## alcohol 0.14 0.20 -0.03 0.10 0.08 0.04 0.05 1.00 0.10
## age 0.39 0.45 0.31 0.63 0.24 -0.10 0.29 0.10 1.00
## chd 0.19 0.30 0.26 0.25 0.27 0.10 0.10 0.06 0.37
## chd
## sbp 0.19
## tobacco 0.30
## ldl 0.26
## adiposity 0.25
## famhist 0.27
## typea 0.10
## obesity 0.10
## alcohol 0.06
## age 0.37
## chd 1.00
In logistic regression, there is no residual sum of squares or estimate of \(\sigma\). Instead, we focus on metrics like the “null deviance” and the “residual deviance” (reported at the end of the summary). These metrics help assess how well the model fits the data and can be used for model comparison and evaluation.
Deviance measures model goodness of fit through likelihood comparisons. When all \(x_i\) are unique, the deviance is \((-2)\) times the log-likelihood: \[ \text{Deviance} = -2 \cdot \Big( \sum_{i: y_i = 1} \log \hat{p}_i + \sum_{i: y_i = 0} \log (1-\hat{p}_i) \Big). \] Here \(\hat{p}_i\) denotes the estimated value of \(P(Y = 1| X = x_i)\).
Next, let’s clarify the meanings of “null deviance” and “residual deviance”:
Null Deviance: This is the deviance associated with a logistic regression model that includes only the intercept term (i.e., it doesn’t consider any predictor variables). In this case, the estimated probability \(P(Y = 1| X = x_i)\) is determined solely by the sample mean of the observed \(y_i\) values, irrespective of the feature vector \(x_i.\)
Residual Deviance: This deviance corresponds to the current logistic regression model under consideration. For example, for the logistic regression model we just fit, heartfull, the estimated probabilities \(P(Y = 1| X = x_i)\) can be obtained from heartfull$fitted.values.
Below, we illustrate how to compute the null deviance and the residual deviance. The computed values should match those returned at the end of summary(heartfull).
phat = heartfull$fitted.values
-2 * (sum(log(phat[heart$chd==1])) + sum(log(1-phat[heart$chd==0])))
## [1] 472.14
table(heart$chd)
##
## 0 1
## 302 160
p0hat = rep(mean(heart$chd), length(heart$chd))  # 160/(160+302)
-2 * (sum(log(p0hat[heart$chd==1])) + sum(log(1-p0hat[heart$chd==0])))
## [1] 596.1084
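Equivalently, the null deviance is the residual deviance of an intercept-only model; the following sketch (with the illustrative name heartnull) confirms this and shows that the full fit stores the same quantity.

# Intercept-only model: its deviance is the null deviance (596.11)
heartnull = glm(chd ~ 1, data=heart, family=binomial)
heartnull$deviance
heartfull$null.deviance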
Next, we compute the estimated probabilities of being 1 for the training samples and classify each observation by thresholding at 0.5.
phat = heartfull$fitted.values
mypred = (phat > 0.5)
table(heart$chd, mypred)
## mypred
## FALSE TRUE
## 0 256 46
## 1 77 83
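From this confusion table, we can also compute the training misclassification rate; a brief sketch:

# Training error: proportion of observations misclassified at the 0.5 cutoff
mean(as.numeric(phat > 0.5) != heart$chd)   # (46 + 77) / 462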
Consider three males with the same values on all predictors except age, which we set to the minimum, median, and maximum observed age. What is the estimated chance of getting heart disease for each of them?
testsamples = heart[c(1, 1, 1), ]
testsamples
## sbp tobacco ldl adiposity famhist typea obesity alcohol age chd
## 1 160 12 5.73 23.11 Present 49 25.3 97.2 52 1
## 1.1 160 12 5.73 23.11 Present 49 25.3 97.2 52 1
## 1.2 160 12 5.73 23.11 Present 49 25.3 97.2 52 1
testsamples$age = c(min(heart$age), median(heart$age), max(heart$age))
testsamples
## sbp tobacco ldl adiposity famhist typea obesity alcohol age chd
## 1 160 12 5.73 23.11 Present 49 25.3 97.2 15 1
## 1.1 160 12 5.73 23.11 Present 49 25.3 97.2 45 1
## 1.2 160 12 5.73 23.11 Present 49 25.3 97.2 64 1
predict(heartfull, newdata=testsamples)
## 1 1.1 1.2
## -0.7673285 0.5894320 1.4487137
predict(heartfull, newdata=testsamples, type="response")
## 1 1.1 1.2
## 0.3170573 0.6432348 0.8098004
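The first predict() call returns the linear predictor (the log-odds); applying the inverse-logit transform by hand reproduces the probabilities from the second call. A small sketch:

# Convert log-odds to probabilities: p = 1 / (1 + exp(-eta))
eta = predict(heartfull, newdata=testsamples)   # log-odds scale
1 / (1 + exp(-eta))                             # matches type="response"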
After fitting a logistic regression model, we often encounter a situation similar to what we’ve experienced with linear models: there may be numerous features, but only a subset of them is truly relevant. Therefore, we need to perform variable selection.
We can employ criteria such as AIC and BIC because the logistic regression model has a likelihood. However, unlike in linear regression, we cannot use the efficient leaps-and-bounds algorithm to search over all possible sub-models, since the logistic likelihood does not have the residual-sum-of-squares structure that leaps-and-bounds exploits. Instead, we rely on greedy algorithms such as stepwise, backward, or forward selection.
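For binary (ungrouped) data the deviance equals minus twice the log-likelihood, so AIC is simply the deviance plus twice the number of coefficients; a quick sanity check on the full model:

# AIC = -2 * log-likelihood + 2 * (number of parameters)
heartfull$deviance + 2 * length(coef(heartfull))
AIC(heartfull)   # should agree: 492.14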
Here, I’ll demonstrate how to perform this using a stepwise approach: we start with the full model and then consider various actions, including removing an existing covariate or taking no action; we then calculate the corresponding AIC for each scenario, rank them from smallest to largest, and select the action with the smallest AIC. For instance, we may decide to remove the alcohol variable in the first step.
In the next stage, we repeat this process, either removing existing covariates, taking no action, or even adding back previously removed ones. We rank these actions by their AIC values and continue removing features until we reach a point where taking no action results in the smallest AIC. At this point, we stop and consider our selected model, which typically includes a subset of the original features.
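Before running the full search, the first elimination step can be inspected directly with drop1(), which computes the AIC of each single-covariate deletion, the same comparison step() makes at every stage; a minimal sketch:

# AIC for dropping each covariate from the full model (plus <none>)
drop1(heartfull)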
heartstepA = step(heartfull, scope=list(upper=~., lower=~1))
## Start: AIC=492.14
## chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity +
## alcohol + age
##
## Df Deviance AIC
## - alcohol 1 472.14 490.14
## - adiposity 1 472.55 490.55
## - sbp 1 473.44 491.44
## <none> 472.14 492.14
## - obesity 1 474.23 492.23
## - ldl 1 481.07 499.07
## - tobacco 1 481.67 499.67
## - typea 1 483.05 501.05
## - age 1 486.53 504.53
## - famhist 1 488.89 506.89
##
## Step: AIC=490.14
## chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity +
## age
##
## Df Deviance AIC
## - adiposity 1 472.55 488.55
## - sbp 1 473.47 489.47
## <none> 472.14 490.14
## - obesity 1 474.24 490.24
## + alcohol 1 472.14 492.14
## - ldl 1 481.15 497.15
## - tobacco 1 482.06 498.06
## - typea 1 483.06 499.06
## - age 1 486.64 502.64
## - famhist 1 488.99 504.99
##
## Step: AIC=488.55
## chd ~ sbp + tobacco + ldl + famhist + typea + obesity + age
##
## Df Deviance AIC
## - sbp 1 473.98 487.98
## <none> 472.55 488.55
## - obesity 1 474.65 488.65
## + adiposity 1 472.14 490.14
## + alcohol 1 472.55 490.55
## - tobacco 1 482.54 496.54
## - ldl 1 482.95 496.95
## - typea 1 483.19 497.19
## - famhist 1 489.38 503.38
## - age 1 495.48 509.48
##
## Step: AIC=487.98
## chd ~ tobacco + ldl + famhist + typea + obesity + age
##
## Df Deviance AIC
## - obesity 1 475.69 487.69
## <none> 473.98 487.98
## + sbp 1 472.55 488.55
## + adiposity 1 473.47 489.47
## + alcohol 1 473.94 489.94
## - tobacco 1 484.18 496.18
## - typea 1 484.30 496.30
## - ldl 1 484.53 496.53
## - famhist 1 490.58 502.58
## - age 1 502.11 514.11
##
## Step: AIC=487.69
## chd ~ tobacco + ldl + famhist + typea + age
##
## Df Deviance AIC
## <none> 475.69 487.69
## + obesity 1 473.98 487.98
## + sbp 1 474.65 488.65
## + adiposity 1 475.44 489.44
## + alcohol 1 475.65 489.65
## - ldl 1 484.71 494.71
## - typea 1 485.44 495.44
## - tobacco 1 486.03 496.03
## - famhist 1 492.09 502.09
## - age 1 502.38 512.38
summary(heartstepA)
##
## Call:
## glm(formula = chd ~ tobacco + ldl + famhist + typea + age, family = binomial,
## data = heart)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9165 -0.8054 -0.4430 0.9329 2.6139
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.44644 0.92087 -7.000 2.55e-12 ***
## tobacco 0.08038 0.02588 3.106 0.00190 **
## ldl 0.16199 0.05497 2.947 0.00321 **
## famhistPresent 0.90818 0.22576 4.023 5.75e-05 ***
## typea 0.03712 0.01217 3.051 0.00228 **
## age 0.05046 0.01021 4.944 7.65e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 475.69 on 456 degrees of freedom
## AIC: 487.69
##
## Number of Fisher Scoring iterations: 5
You can apply a similar process using BIC. Just note that for BIC, you should set the parameter k to log(n) instead of using the default value. You can also set trace = 0 to suppress the intermediate output.
n = dim(heart)[1]
heartstepB = step(heartfull, scope=list(upper=~., lower=~1), trace=0, k=log(n))
summary(heartstepB)
It’s worth mentioning that here the AIC and BIC searches select the same model, one with five covariates.
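A quick way to verify this is to compare the terms retained by the two searches; a small sketch:

# Covariates retained by the AIC-based and BIC-based searches
names(coef(heartstepA))
names(coef(heartstepB))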
We can also use Lasso or Ridge, or even a combination of both. The algorithm for logistic regression with Lasso/Ridge is quite similar to that of linear models. As I mentioned earlier, logistic regression is essentially a weighted least squares problem. Therefore, we can extend the existing algorithms for linear models with Lasso to logistic regression with Lasso/Ridge penalty.
In this case, the objective function we aim to minimize is the negative log-likelihood plus the L1 penalty on the coefficient vector beta. To achieve this, we can use packages like glmnet. When using Lasso, you provide the data matrix, the response vector Y, and specify the family as binomial with alpha = 1, which means a pure Lasso penalty without any Ridge component.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
p = dim(heart)[2]
X = data.matrix(heart[, -p])
Y = heart[, p]
heart.l1 = glmnet(X, Y, family="binomial", alpha=1)
You can create a path plot to visualize the estimated coefficients at different values of lambda.
plot(heart.l1,label=TRUE)
Additionally, cross-validation can help you choose the best lambda value for your model.
heart.cv = cv.glmnet(X, Y, family="binomial", alpha=1)
plot(heart.cv)
heart.cv$lambda.min
heart.cv$lambda.1se
predict(heart.cv, s=heart.cv$lambda.1se, type="coefficients")
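You can also obtain fitted probabilities from the cross-validated object at a chosen lambda; a short sketch on the training matrix X (phat.l1 is just an illustrative name):

# Predicted probabilities on the training data at the 1-SE lambda
phat.l1 = predict(heart.cv, newx=X, s="lambda.1se", type="response")
head(phat.l1)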
The Challenger O-Ring Data is a historical dataset related to the Space Shuttle Challenger disaster, which occurred on January 28, 1986. The disaster resulted in the loss of seven crew members and the destruction of the spacecraft. It was a pivotal moment in the history of space exploration and had significant implications for engineering, safety, and decision-making processes.
On the night preceding the launch, a critical decision had to be made regarding the safety of proceeding with the launch.
One of the critical factors under consideration was the ambient temperature on the day of the launch. It was known that the O-rings were less flexible at lower temperatures, potentially affecting their ability to seal the joints properly.
The dataset included records of previous shuttle flights that documented both the temperature at the time of launch and the number of O-ring failures observed.
= "http://archive.ics.uci.edu/ml/machine-learning-databases/space-shuttle/o-ring-erosion-or-blowby.data"
url = read.table(url)
orings =orings[, c(2:3)]
oringsnames(orings)=c("damage", "temp")
order(orings$temp),] orings[
## damage temp
## 14 2 53
## 9 1 57
## 23 1 58
## 10 1 63
## 1 0 66
## 5 0 67
## 13 0 67
## 15 0 67
## 4 0 68
## 3 0 69
## 2 1 70
## 8 0 70
## 11 1 70
## 17 0 70
## 6 0 72
## 7 0 73
## 16 0 75
## 21 2 75
## 19 0 76
## 22 0 76
## 12 0 78
## 20 0 79
## 18 0 81
It’s worth noting that no previous liftoff had occurred at a temperature below 53 degrees Fahrenheit, so assessing the risk of a launch at 31 degrees Fahrenheit requires significant extrapolation from the available data. Even so, the fitted curve plotted below clearly indicates that the risk of O-ring failure is notably high at 31 degrees Fahrenheit.
logitmod = glm(cbind(damage, 6-damage) ~ temp, family=binomial, data=orings)
summary(logitmod)
The primary task is to predict the likelihood of an O-ring failure when the launch temperature falls below freezing, specifically at 31 degrees Fahrenheit. This prediction serves as a crucial component in evaluating the safety of future launches under similar conditions.
predict(logitmod, data.frame(temp=31), type="response")
## 1
## 0.8177744
min.temp = 31; max.temp = max(orings$temp)
plot(orings$temp, orings$damage/6,
     xlim=c(min.temp, max.temp), ylim=c(0,1),
     xlab="Temp", ylab="Chance of Damage")
newtemp = seq(min.temp, max.temp, length=100)
phat = predict(logitmod, data.frame(temp=newtemp), type="response")
lines(newtemp, phat, col="red")
Despite the concerns raised by some engineers, it was ultimately decided to proceed with the launch. Tragically, just 73 seconds into the flight, the O-rings failed due to the cold temperatures, leading to the explosion of the shuttle.