The Convergence Issue

Consider a simple example: \(y = 1\) if \(x_1 + x_2 > 1\), and \(y=0\) otherwise.

x = matrix(runif(20), 10, 2)                      # 10 points in the unit square
y = ifelse(x[,1] + x[,2] > 1, 1, 0)               # label by which side of x1 + x2 = 1
plot(x[,1], x[,2], type="n", xlab="", ylab="")
text(x[,1], x[,2], y)                             # display each point as its 0/1 label

myfit = glm(y~x, family=binomial)
summary(myfit)

When the data are perfectly separable, the likelihood for the logistic model has no maximizer: the fitting algorithm does not converge, and the maximum likelihood estimates (MLEs) of the coefficients \(\beta\) diverge towards positive or negative infinity.

However, despite this issue, the separating line remains well-defined, so we can still use the fitted model for prediction. The separating condition, such as \(3 x_1 + x_2 > 2\), is unchanged if we rescale the coefficients; for instance, \(300 x_1 + 100 x_2 > 200\) describes the same line.
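As a quick check (a sketch using myfit and y from above), the fitted probabilities are pushed towards 0 and 1, yet thresholding them at 0.5 still reproduces the labels:

round(myfit$fitted.values, 3)                             # probabilities near 0 or 1
table(y, pred = ifelse(myfit$fitted.values > 0.5, 1, 0))  # predictions agree with y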

Heart Data

Next we discuss fitting a logistic regression model in R and interpreting the coefficients and other outcomes. For illustration, we use the Coronary Risk‐Factor Study data, which consists of 462 males between the ages of 15 and 64 from a heart-disease high-risk region of the Western Cape, South Africa.

The response variable, chd, is binary (1 for disease presence and 0 for absence), and there are nine covariates, mostly numerical, except for “family history,” which is binary.

  • sbp - systolic blood pressure
  • tobacco - cumulative tobacco (kg)
  • ldl - low density lipoprotein cholesterol
  • adiposity
  • famhist - family history of heart disease (Present, Absent)
  • typea - type-A behavior
  • obesity
  • alcohol - current alcohol consumption
  • age - age at onset
url = "https://web.stanford.edu/~hastie/ElemStatLearn//datasets/SAheart.data"
heart = read.table(url, sep=",", header=TRUE, row.names=1)
head(heart)

Fit a logistic model

Model fitting

The syntax for fitting a logistic regression model is similar to that for a linear model (lm). We use glm, which can handle various generalized linear models, including Poisson regression; in this lecture, however, we focus on logistic regression.

heartfull = glm(chd~., data=heart, family=binomial)

Coefficient Interpretation

After fitting the logistic regression model, you can run a summary.

The first column of the coefficient matrix contains the estimated coefficients, including the intercept and the coefficient associated with each feature. Their interpretation is similar to that in a regular linear model: for instance, if you increase sbp by one unit, the fitted response increases by 0.0065.

However, in logistic regression, the response being modeled is the log-odds, \(\log (p / (1 - p))\), so interpreting the coefficients involves exponentiation: increasing \(X_1\) by one unit multiplies the odds by \(\exp(\beta_1)\):

\[ \log \frac{p}{1-p} = \beta_0 + (X_1 + 1) \beta_1 + \cdots + X_p \beta_p = (\beta_0 + X_1 \beta_1 + \cdots + X_p \beta_p) + \beta_1 \]

  • If \(\beta_1\) is positive, \(\exp( \beta_1) > 1,\) indicating an increase in odds.

  • If \(\beta_1\) is negative, \(\exp( \beta_1) < 1\), indicating a decrease in odds.

summary(heartfull)
## 
## Call:
## glm(formula = chd ~ ., family = binomial, data = heart)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7781  -0.8213  -0.4387   0.8889   2.5435  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.1507209  1.3082600  -4.701 2.58e-06 ***
## sbp             0.0065040  0.0057304   1.135 0.256374    
## tobacco         0.0793764  0.0266028   2.984 0.002847 ** 
## ldl             0.1739239  0.0596617   2.915 0.003555 ** 
## adiposity       0.0185866  0.0292894   0.635 0.525700    
## famhistPresent  0.9253704  0.2278940   4.061 4.90e-05 ***
## typea           0.0395950  0.0123202   3.214 0.001310 ** 
## obesity        -0.0629099  0.0442477  -1.422 0.155095    
## alcohol         0.0001217  0.0044832   0.027 0.978350    
## age             0.0452253  0.0121298   3.728 0.000193 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 596.11  on 461  degrees of freedom
## Residual deviance: 472.14  on 452  degrees of freedom
## AIC: 492.14
## 
## Number of Fisher Scoring iterations: 5
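To read the estimated coefficients on the odds scale, we can simply exponentiate them; a quick sketch:

exp(coef(heartfull))        # multiplicative change in odds per one-unit increase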

The remaining columns in the summary table provide information similar to that in linear regression: standard errors, z-values, and p-values. Logistic regression is fit by iteratively reweighted least squares; from the final weighted least-squares fit, standard errors for the coefficients are obtained, which give the z-values, and two-sided p-values are then computed from the normal distribution to assess the significance of each coefficient.
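For instance, the reported p-values can be reproduced from the z-values; a minimal sketch using the coefficient matrix returned by summary:

z = summary(heartfull)$coefficients[, "z value"]
2 * pnorm(-abs(z))          # two-sided p-values, matching the Pr(>|z|) column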

Correlation among variables affects coefficients

Looking at the heart disease data, most coefficients are positive, meaning an increase in those features is associated with a higher chance of heart disease. Interestingly, the coefficient for obesity is negative, which would seem to imply that higher obesity reduces the chance of heart disease. This counter-intuitive sign is due to correlation among the predictors: for example, obesity is strongly correlated with adiposity (0.72 in the matrix below).

round(cor(data.matrix(heart)), dig=2)
##             sbp tobacco   ldl adiposity famhist typea obesity alcohol   age
## sbp        1.00    0.21  0.16      0.36    0.09 -0.06    0.24    0.14  0.39
## tobacco    0.21    1.00  0.16      0.29    0.09 -0.01    0.12    0.20  0.45
## ldl        0.16    0.16  1.00      0.44    0.16  0.04    0.33   -0.03  0.31
## adiposity  0.36    0.29  0.44      1.00    0.18 -0.04    0.72    0.10  0.63
## famhist    0.09    0.09  0.16      0.18    1.00  0.04    0.12    0.08  0.24
## typea     -0.06   -0.01  0.04     -0.04    0.04  1.00    0.07    0.04 -0.10
## obesity    0.24    0.12  0.33      0.72    0.12  0.07    1.00    0.05  0.29
## alcohol    0.14    0.20 -0.03      0.10    0.08  0.04    0.05    1.00  0.10
## age        0.39    0.45  0.31      0.63    0.24 -0.10    0.29    0.10  1.00
## chd        0.19    0.30  0.26      0.25    0.27  0.10    0.10    0.06  0.37
##            chd
## sbp       0.19
## tobacco   0.30
## ldl       0.26
## adiposity 0.25
## famhist   0.27
## typea     0.10
## obesity   0.10
## alcohol   0.06
## age       0.37
## chd       1.00

Compute deviances

In logistic regression, we do not have a residual sum of squares or an estimate of \(\sigma\). Instead, we focus on the “null deviance” and “residual deviance” reported at the end of the summary. These metrics assess how well the model fits the data and can be used for model comparison and evaluation.

Deviance measures model goodness of fit, based on likelihood comparisons. When all \(x_i\) are unique, the deviance is defined as \((-2)\) times the log-likelihood: \[ \text{Deviance} = -2 \cdot \Big (\sum_{i: y_i = 1} \log \hat{p}_i + \sum_{i: y_i = 0} \log (1-\hat{p}_i) \Big). \] Here \(\hat{p}_i\) denotes the estimated value of \(P(Y = 1| X = x_i)\).

Next, let’s clarify the meanings of “null deviance” and “residual deviance”:

  • Null Deviance: This is the deviance associated with a logistic regression model that includes only the intercept term (i.e., it doesn’t consider any predictor variables). In this case, the estimated probability \(P(Y = 1| X = x_i)\) is determined solely by the sample mean of the observed \(y_i\) values, irrespective of the feature vector \(x_i.\)

  • Residual Deviance: This deviance corresponds to the current logistic regression model under consideration. For example, for the logistic regression model we just fit, heartfull, the estimated probabilities \(P(Y = 1| X = x_i)\) can be obtained from heartfull$fitted.values.

Below, we illustrate how to compute the null deviance and residual deviance. The computed values should match those returned at the end of summary(heartfull).

phat=heartfull$fitted.values;
-2*(sum(log(phat[heart$chd==1])) + 
  sum(log(1-phat[heart$chd==0])))
## [1] 472.14
table(heart$chd)
## 
##   0   1 
## 302 160
p0hat = rep(mean(heart$chd), length(heart$chd))  # 160/(160+302)
-2*(sum(log(p0hat[heart$chd==1])) + 
  sum(log(1-p0hat[heart$chd==0])))
## [1] 596.1084
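As an aside, the drop in deviance from the null model to the fitted model gives an overall likelihood-ratio test of the covariates; a minimal sketch, referring the difference to a chi-square distribution with the corresponding difference in degrees of freedom:

pchisq(heartfull$null.deviance - heartfull$deviance,
       df = heartfull$df.null - heartfull$df.residual, lower.tail=FALSE)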

Prediction

Estimate the probability of being 1 for each training sample, then classify with a 0.5 threshold and tabulate the predictions against the observed responses.

phat=heartfull$fitted.values;
mypred = (phat>0.5)
table(heart$chd, mypred)
##    mypred
##     FALSE TRUE
##   0   256   46
##   1    77   83
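From this table, the training misclassification rate is (46 + 77)/462; equivalently, as a quick sketch:

mean((phat > 0.5) != heart$chd)     # training error rate, equals (46 + 77)/462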

Consider the following three males with the same value on all other predictors except age: min, max, and median. What’s the estimated chance of getting heart disease for each of them?

testsamples = heart[c(1, 1, 1),]
testsamples
##     sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1   160      12 5.73     23.11 Present    49    25.3    97.2  52   1
## 1.1 160      12 5.73     23.11 Present    49    25.3    97.2  52   1
## 1.2 160      12 5.73     23.11 Present    49    25.3    97.2  52   1
testsamples$age = c(min(heart$age), median(heart$age), max(heart$age))
testsamples
##     sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1   160      12 5.73     23.11 Present    49    25.3    97.2  15   1
## 1.1 160      12 5.73     23.11 Present    49    25.3    97.2  45   1
## 1.2 160      12 5.73     23.11 Present    49    25.3    97.2  64   1
  • predict log-odds
  • predict probabilities
predict(heartfull, newdata=testsamples)
##          1        1.1        1.2 
## -0.7673285  0.5894320  1.4487137
predict(heartfull, newdata=testsamples, type="response")
##         1       1.1       1.2 
## 0.3170573 0.6432348 0.8098004
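The two outputs are linked by the logistic (inverse-logit) transformation; a quick check:

eta = predict(heartfull, newdata=testsamples)   # predicted log-odds
1/(1 + exp(-eta))                               # matches the type="response" output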

Variable Selection

After fitting a logistic regression model, we often encounter a situation similar to what we’ve experienced with linear models: there may be numerous features, but only a subset of them is truly relevant. Therefore, we need to perform variable selection.

AIC/BIC

We can employ criteria such as AIC and BIC because the logistic regression model has a likelihood. However, unlike linear regression, we cannot use the efficient leaps-and-bounds algorithm to search over all possible sub-models, since the objective is now a likelihood rather than a residual sum of squares. Instead, we rely on greedy algorithms such as stepwise, backward, or forward selection.

Here, I’ll demonstrate the stepwise approach: we start with the full model and consider a set of candidate actions, namely removing one of the current covariates or taking no action. We compute the AIC for each candidate, rank them from smallest to largest, and take the action with the smallest AIC. In the first step below, for instance, this means removing the alcohol variable.

In the next stage, we repeat this process, either removing existing covariates, taking no action, or even adding back previously removed ones. We rank these actions by their AIC values and continue removing features until we reach a point where taking no action results in the smallest AIC. At this point, we stop and consider our selected model, which typically includes a subset of the original features.

heartstepA = step(heartfull, scope=list(upper=~., lower=~1))
## Start:  AIC=492.14
## chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity + 
##     alcohol + age
## 
##             Df Deviance    AIC
## - alcohol    1   472.14 490.14
## - adiposity  1   472.55 490.55
## - sbp        1   473.44 491.44
## <none>           472.14 492.14
## - obesity    1   474.23 492.23
## - ldl        1   481.07 499.07
## - tobacco    1   481.67 499.67
## - typea      1   483.05 501.05
## - age        1   486.53 504.53
## - famhist    1   488.89 506.89
## 
## Step:  AIC=490.14
## chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity + 
##     age
## 
##             Df Deviance    AIC
## - adiposity  1   472.55 488.55
## - sbp        1   473.47 489.47
## <none>           472.14 490.14
## - obesity    1   474.24 490.24
## + alcohol    1   472.14 492.14
## - ldl        1   481.15 497.15
## - tobacco    1   482.06 498.06
## - typea      1   483.06 499.06
## - age        1   486.64 502.64
## - famhist    1   488.99 504.99
## 
## Step:  AIC=488.55
## chd ~ sbp + tobacco + ldl + famhist + typea + obesity + age
## 
##             Df Deviance    AIC
## - sbp        1   473.98 487.98
## <none>           472.55 488.55
## - obesity    1   474.65 488.65
## + adiposity  1   472.14 490.14
## + alcohol    1   472.55 490.55
## - tobacco    1   482.54 496.54
## - ldl        1   482.95 496.95
## - typea      1   483.19 497.19
## - famhist    1   489.38 503.38
## - age        1   495.48 509.48
## 
## Step:  AIC=487.98
## chd ~ tobacco + ldl + famhist + typea + obesity + age
## 
##             Df Deviance    AIC
## - obesity    1   475.69 487.69
## <none>           473.98 487.98
## + sbp        1   472.55 488.55
## + adiposity  1   473.47 489.47
## + alcohol    1   473.94 489.94
## - tobacco    1   484.18 496.18
## - typea      1   484.30 496.30
## - ldl        1   484.53 496.53
## - famhist    1   490.58 502.58
## - age        1   502.11 514.11
## 
## Step:  AIC=487.69
## chd ~ tobacco + ldl + famhist + typea + age
## 
##             Df Deviance    AIC
## <none>           475.69 487.69
## + obesity    1   473.98 487.98
## + sbp        1   474.65 488.65
## + adiposity  1   475.44 489.44
## + alcohol    1   475.65 489.65
## - ldl        1   484.71 494.71
## - typea      1   485.44 495.44
## - tobacco    1   486.03 496.03
## - famhist    1   492.09 502.09
## - age        1   502.38 512.38
summary(heartstepA)
## 
## Call:
## glm(formula = chd ~ tobacco + ldl + famhist + typea + age, family = binomial, 
##     data = heart)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9165  -0.8054  -0.4430   0.9329   2.6139  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.44644    0.92087  -7.000 2.55e-12 ***
## tobacco         0.08038    0.02588   3.106  0.00190 ** 
## ldl             0.16199    0.05497   2.947  0.00321 ** 
## famhistPresent  0.90818    0.22576   4.023 5.75e-05 ***
## typea           0.03712    0.01217   3.051  0.00228 ** 
## age             0.05046    0.01021   4.944 7.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 596.11  on 461  degrees of freedom
## Residual deviance: 475.69  on 456  degrees of freedom
## AIC: 487.69
## 
## Number of Fisher Scoring iterations: 5

You can apply a similar process using BIC; just set the parameter k to log(n) instead of the default value of 2. You can also set trace = 0 to suppress the intermediate output.

n=dim(heart)[1]
heartstepB = step(heartfull, scope=list(upper=~., lower=~1), trace = 0, k=log(n))
summary(heartstepB)

It is worth mentioning that, for these data, the AIC and BIC searches select the same model, one with five covariates.
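One way to verify this (a quick sketch) is to compare the terms retained by the two searches:

names(coef(heartstepA))     # covariates kept by the AIC search
names(coef(heartstepB))     # covariates kept by the BIC search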

Lasso/Ridge

We can also use Lasso or Ridge, or a combination of both. The algorithm for logistic regression with a Lasso/Ridge penalty is quite similar to that for linear models: as mentioned earlier, logistic regression is fit by iteratively reweighted least squares, so the existing algorithms for penalized linear models extend naturally to logistic regression.

In this case, the objective function we minimize is the negative log-likelihood plus the L1 penalty on the coefficient vector \(\beta\). We can use the glmnet package: provide the data matrix X and the response vector Y, and specify family = "binomial" with alpha = 1, which gives the pure Lasso penalty (alpha = 0 would give Ridge).

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
p = dim(heart)[2]
X=data.matrix(heart[,-p]);
Y=heart[,p];
heart.l1=glmnet(X,Y,family="binomial",alpha=1)

You can create a path plot to visualize the estimated coefficients at different values of lambda.

plot(heart.l1,label=TRUE)

Additionally, cross-validation can help you choose the best lambda value for your model.

heart.cv = cv.glmnet(X, Y, family="binomial",alpha=1)

plot(heart.cv)
heart.cv$lambda.min
heart.cv$lambda.1se

predict(heart.cv, s=heart.cv$lambda.1se, type="coefficients")
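To list just the variables retained at this lambda, we can inspect the nonzero entries; a small sketch (coef on the cv object is equivalent to the predict call above):

cf = as.matrix(coef(heart.cv, s="lambda.1se"))
rownames(cf)[cf != 0]       # selected variables (plus the intercept)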

Challenger O-Ring Data

The Challenger O-Ring Data is a historical dataset related to the Space Shuttle Challenger disaster, which occurred on January 28, 1986. The disaster resulted in the loss of seven crew members and the destruction of the spacecraft. It was a pivotal moment in the history of space exploration and had significant implications for engineering, safety, and decision-making processes.

On the night preceding the launch, a critical decision had to be made regarding the safety of proceeding with the launch.

One of the critical factors under consideration was the ambient temperature on the day of the launch. It was known that the O-rings were less flexible at lower temperatures, potentially affecting their ability to seal the joints properly.

The dataset included records of previous shuttle flights that documented both the temperature at the time of launch and the number of O-ring failures observed.

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/space-shuttle/o-ring-erosion-or-blowby.data"
orings = read.table(url)
orings=orings[, c(2:3)]
names(orings)=c("damage", "temp")
orings[order(orings$temp),]
##    damage temp
## 14      2   53
## 9       1   57
## 23      1   58
## 10      1   63
## 1       0   66
## 5       0   67
## 13      0   67
## 15      0   67
## 4       0   68
## 3       0   69
## 2       1   70
## 8       0   70
## 11      1   70
## 17      0   70
## 6       0   72
## 7       0   73
## 16      0   75
## 21      2   75
## 19      0   76
## 22      0   76
## 12      0   78
## 20      0   79
## 18      0   81

It is worth noting that no previous launch had taken place at a temperature below 53 degrees Fahrenheit, so assessing the risk at 31 degrees Fahrenheit requires substantial extrapolation from the available data. Nevertheless, the fitted model, plotted below, indicates that the chance of O-ring failure at 31 degrees Fahrenheit is very high.

logitmod=glm(cbind(damage, 6-damage) ~ temp, family=binomial, data=orings)
summary(logitmod)

The primary task is to predict the likelihood of an O-ring failure when the launch temperature falls below freezing, specifically at 31 degrees Fahrenheit. This prediction serves as a crucial component in evaluating the safety of future launches under similar conditions.

predict(logitmod,  data.frame(temp=31), type="response")
##         1 
## 0.8177744
min.temp=31; max.temp=max(orings$temp)
plot(orings$temp, orings$damage/6, 
     xlim=c(min.temp, max.temp), ylim=c(0,1), 
     xlab="Temp", ylab="Chance of Damage")
newtemp = seq(min.temp, max.temp, length=100)
phat = predict(logitmod, data.frame(temp=newtemp), type="response")
lines(newtemp, phat, col="red")
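Since 31 degrees lies far below any observed launch temperature, it is also worth displaying pointwise uncertainty around this curve; a sketch using standard errors on the link scale (1.96 is the usual approximate 95% normal multiplier):

pr = predict(logitmod, data.frame(temp=newtemp), type="link", se.fit=TRUE)
lines(newtemp, plogis(pr$fit + 1.96*pr$se.fit), col="red", lty=2)   # upper band
lines(newtemp, plogis(pr$fit - 1.96*pr$se.fit), col="red", lty=2)   # lower band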

Despite the concerns raised by some engineers, it was ultimately decided to proceed with the launch. Tragically, just 73 seconds into the flight, the O-rings failed due to the cold temperatures, leading to the explosion of the shuttle.