(PSL) Logistic Regression¶

In [6]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

import statsmodels.api as sm
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

1. The Convergence Issue¶

Consider a simple example: $y = 1$ if $x_1 + x_2 > 1$, and $y = 0$ otherwise.

In [1]:
np.random.seed(0)
n = 10
x = np.random.uniform(0,1,(n,2))
y = (x.sum(1) > 1).astype(int)
In [2]:
mpl.rcParams['figure.dpi'] = 250
mpl.rcParams['figure.figsize'] = (8, 5)
sns.set()
sns.scatterplot(x = x[:,0], y = x[:,1], hue = y);
In [4]:
x = np.hstack([x, np.ones((n,1))])
myfit = sm.GLM(y, x, family = sm.families.Binomial()).fit();
/Users/feng_macpro/anaconda3/lib/python3.11/site-packages/statsmodels/genmod/generalized_linear_model.py:1257: PerfectSeparationWarning: Perfect separation or prediction detected, parameter may not be identified
  warnings.warn(msg, category=PerfectSeparationWarning)
In [5]:
print(myfit.summary());
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                   10
Model:                            GLM   Df Residuals:                        7
Model Family:                Binomial   Df Model:                            2
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -1.8898e-09
Date:                Mon, 23 Oct 2023   Deviance:                   3.7795e-09
Time:                        16:39:08   Pearson chi2:                 1.89e-09
No. Iterations:                    22   Pseudo R-squ. (CS):             0.6324
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1           108.2192   7.63e+04      0.001      0.999   -1.49e+05     1.5e+05
x2             9.0769   7.89e+04      0.000      1.000   -1.55e+05    1.55e+05
const        -15.6047   3.03e+04     -0.001      1.000   -5.93e+04    5.93e+04
==============================================================================

When the data are perfectly separable, the maximum likelihood estimate (MLE) for the logistic model does not exist: the iterative fitting algorithm fails to converge, and the estimated coefficients drift toward positive or negative infinity (hence the warnings and the huge standard errors above).

However, despite this issue, the separating line remains well-defined, allowing us to use the fitted model for predictions. This is because the separating condition, such as $3 x_1 + x_2 > 2,$ remains consistent even if we scale the coefficients. For instance, $300 x_1 + 100 x_2 > 200$ expresses the same line.
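
As a quick check (a minimal sketch using the myfit object fitted above), we can threshold the fitted probabilities at 0.5 and verify that, despite the diverging coefficients, the fitted model still classifies the training data perfectly:

In [ ]:
# The fitted probabilities are pushed to essentially 0 or 1 on the correct
# side of the separating line, so 0.5-threshold predictions reproduce y.
yhat = (myfit.fittedvalues > 0.5).astype(int)
print((yhat == y).all())    # expected: True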

2. Heart Data¶

Next we discuss fitting a logistic regression model and interpreting the coefficients and other outcomes. For illustration, we use the Coronary Risk‐Factor Study data, which consists of 462 males between the ages of 15 and 64 from a heart-disease high-risk region of the Western Cape, South Africa.

The response variable, chd, is binary (1 for disease presence and 0 for absence), and there are nine covariates, mostly numerical, except for "family history," which is binary.

  • sbp: systolic blood pressure
  • tobacco: cumulative tobacco (kg)
  • ldl: low density lipoprotein cholesterol
  • adiposity
  • famhist: family history of heart disease (Present, Absent)
  • typea: type‐A behavior
  • obesity
  • alcohol: current alcohol consumption
  • age: age at onset
In [13]:
url = "https://web.stanford.edu/~hastie/ElemStatLearn//datasets/SAheart.data"
df = pd.read_csv(url, index_col=0)
df;
In [9]:
X = df.iloc[:,0:9]
y = df['chd']
X = pd.get_dummies(X, drop_first=True).astype(float)
X['(intercept)'] = 1

3. Fit a logistic model¶

3.1 Model fitting¶

We use the statsmodels package [Link], which can handle various models, including Poisson regression. However, in this lecture, we are focusing on logistic regression.

The grammar for fitting the logistic regression model is similar to that of a linear model, except that we must pass in the argument family=sm.families.Binomial() in order to tell statsmodels to run a logistic regression rather than some other type of generalized linear model.

In [11]:
heartfull = sm.GLM(y, X, family = sm.families.Binomial()).fit()
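
As a side note (a sketch, not part of the original workflow), statsmodels also provides a dedicated sm.Logit model, which should produce essentially the same estimates as the GLM/Binomial fit above:

In [ ]:
# Equivalent fit via sm.Logit; disp=0 silences the optimizer output.
logit_alt = sm.Logit(y, X).fit(disp=0)
print(np.allclose(logit_alt.params, heartfull.params, atol=1e-4))   # expected: True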

3.2 Coefficient Interpretation¶

After fitting the logistic regression model, you can run a summary.

The first column of the coefficient matrix contains the estimated coefficients: the intercept and the coefficient associated with each feature. Their interpretation is similar to that in a regular linear model; for instance, increasing sbp by one unit increases the model's response by 0.0065.

However, in logistic regression the response being modeled is the log-odds, $\log (p / (1 - p))$, so interpreting a coefficient involves exponentiation: increasing $X_1$ by one unit multiplies the odds by $\exp(\beta_1)$.

$$ \log \frac{p}{1-p} = \beta_0 + (X_1 + 1) \beta_1 + \cdots + X_p \beta_p = (\beta_0 + X_1 \beta_1 + \cdots + X_p \beta_p) + \beta_1 $$
  • If $\beta_1$ is positive, $\exp( \beta_1) > 1,$ indicating an increase in odds.

  • If $\beta_1$ is negative, $\exp( \beta_1) < 1,$ indicating a decrease in odds.

In [12]:
print(heartfull.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                    chd   No. Observations:                  462
Model:                            GLM   Df Residuals:                      452
Model Family:                Binomial   Df Model:                            9
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -236.07
Date:                Mon, 23 Oct 2023   Deviance:                       472.14
Time:                        17:37:45   Pearson chi2:                     452.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2353
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
sbp                 0.0065      0.006      1.135      0.256      -0.005       0.018
tobacco             0.0794      0.027      2.984      0.003       0.027       0.132
ldl                 0.1739      0.060      2.915      0.004       0.057       0.291
adiposity           0.0186      0.029      0.635      0.526      -0.039       0.076
typea               0.0396      0.012      3.214      0.001       0.015       0.064
obesity            -0.0629      0.044     -1.422      0.155      -0.150       0.024
alcohol             0.0001      0.004      0.027      0.978      -0.009       0.009
age                 0.0452      0.012      3.728      0.000       0.021       0.069
famhist_Present     0.9254      0.228      4.061      0.000       0.479       1.372
(intercept)        -6.1507      1.308     -4.701      0.000      -8.715      -3.587
===================================================================================

The remaining columns in the summary table provide information similar to that in linear regression: standard errors, z-values, and p-values. Logistic regression is fit by iteratively reweighted least squares; standard errors for the coefficients are obtained from the final weighted fit, these give the z-values, and two-sided p-values are computed from the normal distribution to assess the significance of each coefficient.
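
To make these interpretations concrete, here is a small sketch (using only the fitted heartfull object) that computes the odds ratios and reproduces the reported p-values:

In [ ]:
# Odds ratios: exp(beta). For example, famhist_Present multiplies the odds
# of chd by about exp(0.9254) ≈ 2.52.
print(np.exp(heartfull.params).round(3))

# Two-sided p-values from z = coef / std err via the normal distribution;
# these should match the P>|z| column in the summary.
from scipy.stats import norm
z = heartfull.params / heartfull.bse
print((2 * norm.sf(np.abs(z))).round(3))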

3.3 Correlation among variables affects coefficients¶

Looking at the heart disease data, most coefficients are positive, meaning an increase in those features leads to an increased likelihood of heart disease. Interestingly, the coefficient for obesity is negative, implying that an increase in obesity reduces the chance of heart disease. This might seem counter-intuitive, but it is an artifact of correlation between the features: as the matrix below shows, obesity is highly correlated with adiposity ($r = 0.72$) and also correlated with ldl and age.

In [8]:
pd.get_dummies(df, drop_first=True).corr().round(2)
Out[8]:
sbp tobacco ldl adiposity typea obesity alcohol age chd famhist_Present
sbp 1.00 0.21 0.16 0.36 -0.06 0.24 0.14 0.39 0.19 0.09
tobacco 0.21 1.00 0.16 0.29 -0.01 0.12 0.20 0.45 0.30 0.09
ldl 0.16 0.16 1.00 0.44 0.04 0.33 -0.03 0.31 0.26 0.16
adiposity 0.36 0.29 0.44 1.00 -0.04 0.72 0.10 0.63 0.25 0.18
typea -0.06 -0.01 0.04 -0.04 1.00 0.07 0.04 -0.10 0.10 0.04
obesity 0.24 0.12 0.33 0.72 0.07 1.00 0.05 0.29 0.10 0.12
alcohol 0.14 0.20 -0.03 0.10 0.04 0.05 1.00 0.10 0.06 0.08
age 0.39 0.45 0.31 0.63 -0.10 0.29 0.10 1.00 0.37 0.24
chd 0.19 0.30 0.26 0.25 0.10 0.10 0.06 0.37 1.00 0.27
famhist_Present 0.09 0.09 0.16 0.18 0.04 0.12 0.08 0.24 0.27 1.00

3.4 Compute deviances¶

In logistic regression, we don't have a residual sum of squares or an estimate of $\sigma$. Instead, we focus on metrics like the "deviance" (reported at the end of the summary). These metrics assess how well the model fits the data and can be used for model comparison and evaluation.

Deviance measures model goodness of fit based on likelihood comparisons. When all $x_i$ are unique, the deviance is defined as $(-2)$ times the log-likelihood: $$ \text{Deviance} = -2 \cdot \Big(\sum_{i: y_i = 1} \log \hat{p}_i + \sum_{i: y_i = 0} \log (1-\hat{p}_i) \Big). $$ Here $\hat{p}_i$ denotes the estimated probability of $P(Y = 1 \mid X = x_i)$.

Below, we illustrate how to compute the deviance. The computed values should match those returned at the end of heartfull.summary().

In [14]:
def log_likelihood(phat, y):
    # Bernoulli log-likelihood given fitted probabilities phat = P(Y=1 | x_i)
    return np.log(phat[y == 1]).sum() + np.log(1 - phat[y == 0]).sum()
In [15]:
phat = heartfull.fittedvalues   # fitted probabilities on the training data
print("Residual Deviance: " + str(-2*log_likelihood(phat, y)))
Residual Deviance: 472.1400323724979

3.5 Prediction¶

First, look at the fitted probabilities of being 1 on the training samples; thresholding them at 0.5 gives the confusion matrix below.

In [ ]:
sns.set_theme(style="white", palette=None)
ConfusionMatrixDisplay(confusion_matrix(y, heartfull.fittedvalues > 0.5)).plot()
plt.show()
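
For reference, the training misclassification rate implied by the confusion matrix above (a small sketch; 0.5 is an arbitrary threshold):

In [ ]:
# Training error at threshold 0.5.
yhat = (heartfull.fittedvalues > 0.5).astype(int)
print("training error:", round((yhat != y).mean(), 3))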

Consider three males with the same values on all other predictors except age, which is set to the minimum, median, and maximum observed age. What is the estimated chance of heart disease for each of them?

In [20]:
testsamples = X.iloc[[0,0,0],:].copy()
testsamples;
testsamples.loc[:,'age'] = [X['age'].min(), X['age'].median(), X['age'].max()]
testsamples;

Predicted Log-Odds¶

In [26]:
heartfull.predict(testsamples, which = 'linear')
Out[26]:
row.names
1   -0.767328
1    0.589432
1    1.448714
dtype: float64

Predicted Probabilities¶

In [27]:
heartfull.predict(testsamples, which = 'mean')
Out[27]:
row.names
1    0.317057
1    0.643235
1    0.809800
dtype: float64
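
As a sanity check (a sketch using the same testsamples), the 'mean' predictions are simply the logistic transform of the 'linear' (log-odds) predictions:

In [ ]:
# The sigmoid of the predicted log-odds reproduces the predicted probabilities.
eta = heartfull.predict(testsamples, which = 'linear')
print(1 / (1 + np.exp(-eta)))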

4. Variable Selection¶

After fitting a logistic regression model, we often encounter a situation similar to what we've experienced with linear models: there may be numerous features, but only a subset of them is truly relevant. Therefore, we need to perform variable selection.

4.1 Stepwise Model Selection with AIC and BIC¶

We can employ criteria such as AIC and BIC because the logistic regression model has a likelihood. We can perform stepwise model selection using the stepAIC function implemented in Python_W3_VarSel_SubsetSelection. Since we're performing logistic regression, we need to modify the computeAIC function.

In [32]:
def computeAIC(X, Y, k=2):
    # AIC/BIC = -2 * log-likelihood + k * (number of parameters)
    model = sm.GLM(Y, X, family = sm.families.Binomial()).fit()
    return -2*log_likelihood(model.fittedvalues, Y) + k*X.shape[1]
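
As a quick sanity check (a sketch), the value computed for the full model should match the AIC that statsmodels itself reports for heartfull:

In [ ]:
# -2 * log-likelihood + 2 * (number of parameters) ≈ 492.14 for the full model
print(computeAIC(X, y), heartfull.aic)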

How do we perform this stepwise? We start with the full model and, at each step, consider a set of candidate actions: removing one of the current covariates or taking no action. We compute the corresponding AIC for each candidate, rank them from smallest to largest, and take the action associated with the smallest AIC. For instance, we may decide to remove the alcohol variable in the first step.

In the next stage, we repeat this process, either removing existing covariates, taking no action, or even adding back previously removed ones. We rank these actions by their AIC values and continue removing features until we reach a point where taking no action results in the smallest AIC. At this point, we stop and consider our selected model, which typically includes a subset of the original features.

In [30]:
def stepAIC(X, Y, features=None, AIC=True):
    # Greedy stepwise selection: at each step try dropping one current feature
    # or adding back a removed one; stop when no move lowers the criterion.
    if features is None:
        features = list(X.columns)
    AIC_list, action_list, feature_list = [], [], []
    best_AIC = np.inf
    best_action = ' '

    n = len(Y)
    k = 2 if AIC else np.log(n)          # penalty: 2 for AIC, log(n) for BIC
    current_AIC = computeAIC(X[list(features)], Y, k)

    while current_AIC < best_AIC:
        AIC_list.append(current_AIC)
        feature_list.append(list(features))
        action_list.append(best_action)
        best_AIC = current_AIC

        tmp_AIC_list, tmp_action_list, tmp_feature_list = [], [], []

        # candidates: drop one of the current features
        for p in features:
            tmp_features = [f for f in features if f != p]
            tmp_AIC_list.append(computeAIC(X[tmp_features], Y, k))
            tmp_feature_list.append(tmp_features)
            tmp_action_list.append('- ' + p)

        # candidates: add back one of the previously removed features
        remaining_features = [p for p in X.columns if p not in features]
        for p in remaining_features:
            tmp_features = list(features) + [p]
            tmp_AIC_list.append(computeAIC(X[tmp_features], Y, k))
            tmp_feature_list.append(tmp_features)
            tmp_action_list.append('+ ' + p)

        # move to the best candidate; the loop exits when it is no improvement
        best_model = np.array(tmp_AIC_list).argmin()
        current_AIC = tmp_AIC_list[best_model]
        features = tmp_feature_list[best_model]
        best_action = tmp_action_list[best_model]

    return pd.DataFrame({'AIC': AIC_list,
                         'action': action_list,
                         'features': feature_list})

AIC¶

In [33]:
myout = stepAIC(X, y)
pd.options.display.max_colwidth = 100
myout
Out[33]:
AIC action features
0 492.140032 [sbp, tobacco, ldl, adiposity, typea, obesity, alcohol, age, famhist_Present, (intercept)]
1 490.140769 - alcohol [sbp, tobacco, ldl, adiposity, typea, obesity, age, famhist_Present, (intercept)]
2 488.548965 - adiposity [sbp, tobacco, ldl, typea, obesity, age, famhist_Present, (intercept)]
3 487.979894 - sbp [tobacco, ldl, typea, obesity, age, famhist_Present, (intercept)]
4 487.685578 - obesity [tobacco, ldl, typea, age, famhist_Present, (intercept)]

BIC¶

You can apply the same process using BIC; just set the parameter AIC = False, so that k is log(n) instead of 2.

In [35]:
myout = stepAIC(X, y, AIC = False)
myout
Out[35]:
AIC action features
0 533.495681 [sbp, tobacco, ldl, adiposity, typea, obesity, alcohol, age, famhist_Present, (intercept)]
1 527.360853 - alcohol [sbp, tobacco, ldl, adiposity, typea, obesity, age, famhist_Present, (intercept)]
2 521.633484 - adiposity [sbp, tobacco, ldl, typea, obesity, age, famhist_Present, (intercept)]
3 516.928848 - sbp [tobacco, ldl, typea, obesity, age, famhist_Present, (intercept)]
4 512.498967 - obesity [tobacco, ldl, typea, age, famhist_Present, (intercept)]

It's worth noting that the AIC and BIC procedures select the same model here: one with five covariates (plus the intercept).

In [17]:
features = myout['features'][len(myout) - 1]
heartAIC = sm.GLM(y, X[features], family=sm.families.Binomial()).fit()
print(heartAIC.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                    chd   No. Observations:                  462
Model:                            GLM   Df Residuals:                      456
Model Family:                Binomial   Df Model:                            5
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -237.84
Date:                Thu, 03 Aug 2023   Deviance:                       475.69
Time:                        16:52:23   Pearson chi2:                     458.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2295
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
tobacco             0.0804      0.026      3.106      0.002       0.030       0.131
ldl                 0.1620      0.055      2.947      0.003       0.054       0.270
typea               0.0371      0.012      3.051      0.002       0.013       0.061
age                 0.0505      0.010      4.944      0.000       0.030       0.070
famhist_Present     0.9082      0.226      4.023      0.000       0.466       1.351
(intercept)        -6.4464      0.921     -7.000      0.000      -8.251      -4.642
===================================================================================

4.2 Model selection with Lasso/Ridge¶

We can also use Lasso or Ridge, or a combination of both (elastic net). As mentioned earlier, logistic regression is fit by iteratively reweighted least squares, so the existing Lasso/Ridge algorithms for linear models extend naturally to logistic regression with a Lasso/Ridge penalty.

Use glmnet¶

There are different versions of glmnet. For example,

  • glmnet_python https://github.com/bbalasub1/glmnet_python/blob/master/test/example_binomial.py

  • LogitNet from python-glmnet https://github.com/civisanalytics/python-glmnet

Use sklearn¶

Use sklearn.linear_model.LogisticRegression [Link]
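
Below is a minimal sketch of an L1-penalized (lasso) logistic regression with sklearn, using the X and y prepared earlier. The regularization strength C (the inverse of the penalty parameter) is an arbitrary choice here and would normally be selected by cross-validation; in practice you would also standardize the features first.

In [ ]:
from sklearn.linear_model import LogisticRegression

# sklearn fits its own intercept, so drop the manually added intercept column
Xs = X.drop(columns=['(intercept)'])
lasso_logit = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
lasso_logit.fit(Xs, y)
print(pd.Series(lasso_logit.coef_[0], index=Xs.columns).round(3))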

5. Challenger O-Ring Data¶

The Challenger O-Ring Data is a historical dataset related to the Space Shuttle Challenger disaster, which occurred on January 28, 1986. The disaster resulted in the loss of seven crew members and the destruction of the spacecraft. It was a pivotal moment in the history of space exploration and had significant implications for engineering, safety, and decision-making processes.

On the night preceding the launch, a critical decision had to be made regarding the safety of proceeding with the launch.

One of the critical factors under consideration was the ambient temperature on the day of the launch. It was known that the O-rings were less flexible at lower temperatures, potentially affecting their ability to seal the joints properly.

The dataset included records of previous shuttle flights that documented both the temperature at the time of launch and the number of O-ring failures observed.

In [24]:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/space-shuttle/o-ring-erosion-or-blowby.data"
orings = pd.read_csv(url, delim_whitespace=True, header = None)
orings
Out[24]:
0 1 2 3 4
0 6 0 66 50 1
1 6 1 70 50 2
2 6 0 69 50 3
3 6 0 68 50 4
4 6 0 67 50 5
5 6 0 72 50 6
6 6 0 73 100 7
7 6 0 70 100 8
8 6 1 57 200 9
9 6 1 63 200 10
10 6 1 70 200 11
11 6 0 78 200 12
12 6 0 67 200 13
13 6 2 53 200 14
14 6 0 67 200 15
15 6 0 75 200 16
16 6 0 70 200 17
17 6 0 81 200 18
18 6 0 76 200 19
19 6 0 79 200 20
20 6 2 75 200 21
21 6 0 76 200 22
22 6 1 58 200 23

It's worth noting that no previous liftoff had occurred at temperatures below 53 degrees Fahrenheit, so assessing the risk at a launch temperature of 31 degrees Fahrenheit requires substantial extrapolation from the available data. Nevertheless, the fitted curve in the plot below indicates that the risk of O-ring failure at 31 degrees Fahrenheit is high.

In [25]:
orings = orings.drop(axis = 1, columns = [0, 3, 4])
orings.columns = ['damage', 'temp']
orings = orings.sort_values(['temp'])
orings['(intercept)'] = 1
binomial_response = pd.concat([orings['damage'], 6 - orings['damage']], axis = 1)
binomial_response.columns = ['N_damaged', 'N_undamaged']
logitmod = sm.GLM(binomial_response, orings[['temp', '(intercept)']], family = sm.families.Binomial()).fit()
print(logitmod.summary())
                      Generalized Linear Model Regression Results                       
========================================================================================
Dep. Variable:     ['N_damaged', 'N_undamaged']   No. Observations:                   23
Model:                                      GLM   Df Residuals:                       21
Model Family:                          Binomial   Df Model:                            1
Link Function:                            Logit   Scale:                          1.0000
Method:                                    IRLS   Log-Likelihood:                -15.823
Date:                          Mon, 23 Oct 2023   Deviance:                       18.086
Time:                                  18:52:33   Pearson chi2:                     30.0
No. Iterations:                               6   Pseudo R-squ. (CS):             0.2344
Covariance Type:                      nonrobust                                         
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
temp           -0.1156      0.047     -2.458      0.014      -0.208      -0.023
(intercept)     5.0850      3.052      1.666      0.096      -0.898      11.068
===============================================================================

The primary task is to predict the likelihood of an O-ring failure when the launch temperature falls below freezing, specifically at 31 degrees Fahrenheit. This prediction serves as a crucial component in evaluating the safety of future launches under similar conditions.

In [30]:
logitmod.predict([31, 1])[0]
Out[30]:
0.8177744062821898
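
The same number can be computed by hand from the fitted coefficients (a sketch): form the log-odds at 31 degrees, then apply the inverse-logit transform.

In [ ]:
# log-odds at 31°F, then the logistic (inverse-logit) transform
eta = logitmod.params['(intercept)'] + logitmod.params['temp'] * 31
print(1 / (1 + np.exp(-eta)))    # ≈ 0.818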
In [31]:
plt.scatter(orings['temp'], orings['damage']/6, s = 10);
plt.xlim([31, orings['temp'].max() + 2]);
plt.ylim([-0.1,1]);
temps = np.linspace(31, orings['temp'].max()+1, 100).reshape(-1,1)
phat = logitmod.predict(np.hstack([temps, np.ones((len(temps),1))]))
plt.plot(temps, phat, linewidth = 1);
plt.xlabel("Temp");
plt.ylabel("Chance of Damage");

Despite the concerns raised by some engineers, it was ultimately decided to proceed with the launch. Tragically, just 73 seconds into the flight, the O-rings failed due to the cold temperatures, leading to the explosion of the shuttle.
