import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
Consider a simple example: $y = 1$ if $x_1 + x_2 > 1$, and $y = 0$ otherwise, where $x_1$ and $x_2$ are drawn uniformly from $[0, 1]$.
np.random.seed(0)
n = 10
x = np.random.uniform(0,1,(n,2))
y = (x.sum(1) > 1).astype(int)
mpl.rcParams['figure.dpi'] = 250
mpl.rcParams['figure.figsize'] = (8, 5)
sns.set()
sns.scatterplot(x = x[:,0], y = x[:,1], hue = y);
x = np.hstack([x, np.ones((n,1))])
myfit = sm.GLM(y, x, family = sm.families.Binomial()).fit();
/Users/feng_macpro/anaconda3/lib/python3.11/site-packages/statsmodels/genmod/generalized_linear_model.py:1257: PerfectSeparationWarning: Perfect separation or prediction detected, parameter may not be identified warnings.warn(msg, category=PerfectSeparationWarning)
print(myfit.summary());
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                   10
Model:                            GLM   Df Residuals:                        7
Model Family:                Binomial   Df Model:                            2
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -1.8898e-09
Date:                Mon, 23 Oct 2023   Deviance:                   3.7795e-09
Time:                        16:39:08   Pearson chi2:                 1.89e-09
No. Iterations:                    22   Pseudo R-squ. (CS):             0.6324
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1           108.2192   7.63e+04      0.001      0.999   -1.49e+05     1.5e+05
x2             9.0769   7.89e+04      0.000      1.000   -1.55e+05    1.55e+05
const        -15.6047   3.03e+04     -0.001      1.000   -5.93e+04    5.93e+04
==============================================================================
When the data are perfectly separable, the maximum likelihood estimate (MLE) for the logistic model does not exist: the likelihood can always be increased by scaling the coefficients up, so the fitting algorithm does not converge and the estimated $\beta$ drifts toward positive or negative infinity (hence the warnings and the huge standard errors above).
However, despite this issue, the separating line remains well-defined, allowing us to use the fitted model for predictions. This is because the separating condition, such as $3 x_1 + x_2 > 2,$ remains consistent even if we scale the coefficients. For instance, $300 x_1 + 100 x_2 > 200$ expresses the same line.
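As a quick sanity check (a small sketch using the myfit object above), the fitted probabilities are pushed to essentially 0 or 1, yet thresholding them at 0.5 still reproduces the labels:
# Fitted probabilities collapse toward 0/1 under perfect separation,
# but the implied classification matches the true labels.
print(np.round(myfit.fittedvalues, 3))
print(((myfit.fittedvalues > 0.5).astype(int) == y).all())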
Next we discuss fitting a logistic regression model and interpreting the coefficients and other outcomes. For illustration, we use the Coronary Risk‐Factor Study data, which consists of 462 males between the ages of 15 and 64 from a heart-disease high-risk region of the Western Cape, South Africa.
The response variable, `chd`, is binary (1 for disease presence and 0 for absence), and there are nine covariates, mostly numerical, except for "family history," which is binary.
url = "https://web.stanford.edu/~hastie/ElemStatLearn//datasets/SAheart.data"
df = pd.read_csv(url, index_col=0)
df;
X = df.iloc[:,0:9]
y = df['chd']
X = pd.get_dummies(X, drop_first=True).astype(float)
X['(intercept)'] = 1
We use the `statsmodels` package [Link], which can handle various generalized linear models, including Poisson regression; in this lecture we focus on logistic regression. The syntax for fitting a logistic regression model is similar to that of a linear model, except that we must pass the argument `family=sm.families.Binomial()` to tell `statsmodels` to run a logistic regression rather than some other type of generalized linear model.
heartfull = sm.GLM(y, X, family = sm.families.Binomial()).fit()
After fitting the logistic regression model, you can run a summary.
The first column of the coefficient matrix contains the estimated coefficients, including the intercept and the coefficient associated with each feature. Their interpretation is similar to that in the regular linear model: for instance, increasing `sbp` by one unit increases the response by 0.0065.
However, in logistic regression the "response" being modeled is the log-odds, $\log (p / (1 - p))$, so interpreting a coefficient involves exponentiation: increasing $X_1$ by one unit multiplies the odds by $\exp(\beta_1)$, since
$$ \log \frac{p}{1-p} = \beta_0 + (X_1 + 1) \beta_1 + \cdots + X_p \beta_p = (\beta_0 + X_1 \beta_1 + \cdots + X_p \beta_p) + \beta_1. $$
If $\beta_1$ is positive, $\exp(\beta_1) > 1$, indicating an increase in the odds; if $\beta_1$ is negative, $\exp(\beta_1) < 1$, indicating a decrease in the odds.
print(heartfull.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    chd   No. Observations:                  462
Model:                            GLM   Df Residuals:                      452
Model Family:                Binomial   Df Model:                            9
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -236.07
Date:                Mon, 23 Oct 2023   Deviance:                       472.14
Time:                        17:37:45   Pearson chi2:                     452.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2353
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
sbp                 0.0065      0.006      1.135      0.256      -0.005       0.018
tobacco             0.0794      0.027      2.984      0.003       0.027       0.132
ldl                 0.1739      0.060      2.915      0.004       0.057       0.291
adiposity           0.0186      0.029      0.635      0.526      -0.039       0.076
typea               0.0396      0.012      3.214      0.001       0.015       0.064
obesity            -0.0629      0.044     -1.422      0.155      -0.150       0.024
alcohol             0.0001      0.004      0.027      0.978      -0.009       0.009
age                 0.0452      0.012      3.728      0.000       0.021       0.069
famhist_Present     0.9254      0.228      4.061      0.000       0.479       1.372
(intercept)        -6.1507      1.308     -4.701      0.000      -8.715      -3.587
===================================================================================
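For example, exponentiating the estimated coefficients above gives the multiplicative change in the odds for a one-unit increase in each feature (a quick sketch using the fitted heartfull object):
# Odds multipliers exp(beta_j) for each coefficient in the summary above
np.exp(heartfull.params).round(3)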
The remaining columns of the summary table provide information similar to that in linear regression: standard errors, z-values, and p-values. Logistic regression is fit by iteratively reweighted least squares; from the final weighted least-squares fit, standard errors for the coefficients are calculated, leading to z-values, and two-sided p-values based on the normal distribution are used to assess the significance of each coefficient.
Looking at the heart disease data, most coefficients are positive, meaning an increase in those features leads to an increased likelihood of heart disease. Interestingly, the coefficient for `obesity` is negative, implying that an increase in body fat (obesity) reduces the chances of heart disease. This might seem counter-intuitive, but it's due to correlations between features.
pd.get_dummies(df, drop_first=True).corr().round(2)
|                 | sbp   | tobacco | ldl   | adiposity | typea | obesity | alcohol | age   | chd   | famhist_Present |
|-----------------|-------|---------|-------|-----------|-------|---------|---------|-------|-------|-----------------|
| sbp             | 1.00  | 0.21    | 0.16  | 0.36      | -0.06 | 0.24    | 0.14    | 0.39  | 0.19  | 0.09            |
| tobacco         | 0.21  | 1.00    | 0.16  | 0.29      | -0.01 | 0.12    | 0.20    | 0.45  | 0.30  | 0.09            |
| ldl             | 0.16  | 0.16    | 1.00  | 0.44      | 0.04  | 0.33    | -0.03   | 0.31  | 0.26  | 0.16            |
| adiposity       | 0.36  | 0.29    | 0.44  | 1.00      | -0.04 | 0.72    | 0.10    | 0.63  | 0.25  | 0.18            |
| typea           | -0.06 | -0.01   | 0.04  | -0.04     | 1.00  | 0.07    | 0.04    | -0.10 | 0.10  | 0.04            |
| obesity         | 0.24  | 0.12    | 0.33  | 0.72      | 0.07  | 1.00    | 0.05    | 0.29  | 0.10  | 0.12            |
| alcohol         | 0.14  | 0.20    | -0.03 | 0.10      | 0.04  | 0.05    | 1.00    | 0.10  | 0.06  | 0.08            |
| age             | 0.39  | 0.45    | 0.31  | 0.63      | -0.10 | 0.29    | 0.10    | 1.00  | 0.37  | 0.24            |
| chd             | 0.19  | 0.30    | 0.26  | 0.25      | 0.10  | 0.10    | 0.06    | 0.37  | 1.00  | 0.27            |
| famhist_Present | 0.09  | 0.09    | 0.16  | 0.18      | 0.04  | 0.12    | 0.08    | 0.24  | 0.27  | 1.00            |
In logistic regression, we don't have a residual sum of squares or an estimate of $\sigma$. Instead, we focus on metrics like the deviance (reported in the summary above). These metrics help assess how well the model fits the data and can be used for model comparison and evaluation.
Deviance measures the goodness of fit of the model, based on likelihood comparisons. When all the $x_i$ are unique, the deviance equals $(-2)$ times the log-likelihood of the fitted model:
$$ \text{Deviance} = -2 \cdot \Big( \sum_{i: y_i = 1} \log \hat{p}_i + \sum_{i: y_i = 0} \log (1-\hat{p}_i) \Big). $$
Here $\hat{p}_i$ denotes the estimated probability of $P(Y = 1 \mid X = x_i)$.
Below, we illustrate how to compute the deviance. The computed value should match the deviance reported by `heartfull.summary()`.
def log_likelihood(phat, y):
    # Returns the NEGATIVE log-likelihood of the Bernoulli model with fitted probabilities phat
    return -(np.log(phat[y == 1]).sum() + np.log(1 - phat[y == 0]).sum())
phat = heartfull.fittedvalues
print("Residual Deviance: " + str(2*log_likelihood(phat, y)))
Residual Deviance: 472.1400323724979
Next we look at the estimated probabilities of being 1 on the training samples. Thresholding them at 0.5 gives predicted labels, which we compare with the observed labels in a confusion matrix.
sns.set_theme(style="white", palette=None)
ConfusionMatrixDisplay(confusion_matrix(y, heartfull.fittedvalues > 0.5)).plot()
plt.show()
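From the same thresholded predictions, we can also report the training misclassification rate (a quick sketch, not part of the original output):
# Training error: fraction of samples whose 0.5-thresholded prediction disagrees with the label
yhat = (heartfull.fittedvalues > 0.5).astype(int)
print("Training misclassification rate:", np.mean(yhat != y))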
Consider three males with the same values on all other predictors but with age set to the minimum, median, and maximum observed values. What is the estimated chance of getting heart disease for each of them?
testsamples = X.iloc[[0,0,0],:].copy()
testsamples;
testsamples.loc[:,'age'] = [X['age'].min(), X['age'].median(), X['age'].max()]
testsamples;
heartfull.predict(testsamples, which = 'linear')
row.names
1   -0.767328
1    0.589432
1    1.448714
dtype: float64
heartfull.predict(testsamples, which = 'mean')
row.names
1    0.317057
1    0.643235
1    0.809800
dtype: float64
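The 'linear' predictions are on the log-odds scale, while the 'mean' predictions are probabilities; applying the inverse logit to the former reproduces the latter (a quick check using the objects above):
# Inverse logit of the linear predictor recovers the predicted probabilities
eta = heartfull.predict(testsamples, which = 'linear')
print(1 / (1 + np.exp(-eta)))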
After fitting a logistic regression model, we often encounter a situation similar to what we've experienced with linear models: there may be numerous features, but only a subset of them is truly relevant. Therefore, we need to perform variable selection.
We can employ criteria such as AIC and BIC because the logistic regression model has a likelihood. We can perform stepwise model selection using the `stepAIC` function implemented in Python_W3_VarSel_SubsetSelection. Since we're performing logistic regression, we need to modify the `computeAIC` function.
def computeAIC(X, Y, k=2):
    # AIC = 2 * (negative log-likelihood) + k * (number of parameters);
    # k = 2 gives AIC, k = log(n) gives BIC
    model = sm.GLM(Y, X, family = sm.families.Binomial()).fit()
    return 2*log_likelihood(model.fittedvalues, Y) + k*X.shape[1]
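As a quick sanity check (an extra step, not in the original notes), the value for the full model should agree with the AIC that statsmodels reports for heartfull:
# Both should equal deviance + 2 * (number of parameters) = 472.14 + 2 * 10
print(computeAIC(X, y), heartfull.aic)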
How do we perform this stepwise? We start with the full model and consider a set of possible actions: removing one of the current covariates, or taking no action. We calculate the corresponding AIC for each scenario, rank them from smallest to largest, and select the action with the smallest AIC. For instance, we may decide to remove the `alcohol` variable in the first step.
In the next stage, we repeat this process, either removing one of the remaining covariates, taking no action, or adding back a previously removed one. We again rank these actions by their AIC values and continue until taking no action yields the smallest AIC. At that point we stop; the selected model typically includes a subset of the original features.
def stepAIC(X, Y, features = X.columns, AIC = True):
    # Greedy stepwise selection: repeatedly add or remove one feature to minimize AIC (k=2) or BIC (k=log(n)).
    AIC_list, action_list, feature_list = [], [], []
    best_AIC = np.inf
    best_action = ' '
    n = len(Y)
    k = 2 if AIC else np.log(n)
    features = list(features)
    AIC = computeAIC(X[features], Y, k)
    while AIC < best_AIC:
        # Record the current model, then propose all one-step modifications.
        AIC_list.append(AIC)
        feature_list.append(list(features))
        action_list.append(best_action)
        best_AIC = AIC
        tmp_AIC_list, tmp_action_list, tmp_feature_list = [], [], []
        # Candidate deletions: drop each feature currently in the model.
        for p in features:
            tmp_features = [q for q in features if q != p]
            tmp_AIC = computeAIC(X[tmp_features], Y, k)
            tmp_AIC_list.append(tmp_AIC)
            tmp_feature_list.append(tmp_features)
            tmp_action_list.append('- ' + p)
        # Candidate additions: add back each feature not currently in the model.
        remaining_features = [p for p in X.columns if p not in features]
        for p in remaining_features:
            tmp_features = list(features) + [p]
            tmp_AIC = computeAIC(X[tmp_features], Y, k)
            tmp_AIC_list.append(tmp_AIC)
            tmp_feature_list.append(tmp_features)
            tmp_action_list.append('+ ' + p)
        # Take the single best action; the loop stops when no action improves the criterion.
        best_model = np.array(tmp_AIC_list).argmin()
        AIC = tmp_AIC_list[best_model]
        features = tmp_feature_list[best_model]
        best_action = tmp_action_list[best_model]
    return pd.DataFrame({'AIC': AIC_list,
                         'action': action_list,
                         'features': feature_list})
myout = stepAIC(X, y)
pd.options.display.max_colwidth = 100
myout
|   | AIC        | action      | features                                                                                   |
|---|------------|-------------|--------------------------------------------------------------------------------------------|
| 0 | 492.140032 |             | [sbp, tobacco, ldl, adiposity, typea, obesity, alcohol, age, famhist_Present, (intercept)]  |
| 1 | 490.140769 | - alcohol   | [sbp, tobacco, ldl, adiposity, typea, obesity, age, famhist_Present, (intercept)]           |
| 2 | 488.548965 | - adiposity | [sbp, tobacco, ldl, typea, obesity, age, famhist_Present, (intercept)]                      |
| 3 | 487.979894 | - sbp       | [tobacco, ldl, typea, obesity, age, famhist_Present, (intercept)]                           |
| 4 | 487.685578 | - obesity   | [tobacco, ldl, typea, age, famhist_Present, (intercept)]                                    |
You can apply a similar process using BIC. Just note that for BIC you should set the parameter `AIC = False`, so that `k` is set to `log(n)` instead of `2`.
myout = stepAIC(X, y, AIC = False)
myout
|   | AIC        | action      | features                                                                                   |
|---|------------|-------------|--------------------------------------------------------------------------------------------|
| 0 | 533.495681 |             | [sbp, tobacco, ldl, adiposity, typea, obesity, alcohol, age, famhist_Present, (intercept)]  |
| 1 | 527.360853 | - alcohol   | [sbp, tobacco, ldl, adiposity, typea, obesity, age, famhist_Present, (intercept)]           |
| 2 | 521.633484 | - adiposity | [sbp, tobacco, ldl, typea, obesity, age, famhist_Present, (intercept)]                      |
| 3 | 516.928848 | - sbp       | [tobacco, ldl, typea, obesity, age, famhist_Present, (intercept)]                           |
| 4 | 512.498967 | - obesity   | [tobacco, ldl, typea, age, famhist_Present, (intercept)]                                    |
It's worth mentioning that here the AIC and BIC procedures select the same model, one with five covariates plus the intercept.
features = myout['features'][len(myout) - 1]
heartAIC = sm.GLM(y, X[features], family=sm.families.Binomial()).fit()
print(heartAIC.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    chd   No. Observations:                  462
Model:                            GLM   Df Residuals:                      456
Model Family:                Binomial   Df Model:                            5
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -237.84
Date:                Thu, 03 Aug 2023   Deviance:                       475.69
Time:                        16:52:23   Pearson chi2:                     458.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2295
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
tobacco             0.0804      0.026      3.106      0.002       0.030       0.131
ldl                 0.1620      0.055      2.947      0.003       0.054       0.270
typea               0.0371      0.012      3.051      0.002       0.013       0.061
age                 0.0505      0.010      4.944      0.000       0.030       0.070
famhist_Present     0.9082      0.226      4.023      0.000       0.466       1.351
(intercept)        -6.4464      0.921     -7.000      0.000      -8.251      -4.642
===================================================================================
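As mentioned earlier, the deviance can be used for model comparison. As a quick check (an extra step, not in the original output), dropping four covariates increases the deviance only slightly:
# Deviance of the full model vs. the AIC/BIC-selected model
print("Full model deviance:    ", heartfull.deviance)
print("Selected model deviance:", heartAIC.deviance)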
We can also use Lasso or Ridge, or even a combination of both. As I mentioned earlier, logistic regression is essentially a weighted least squares problem. Therefore, we can extend the existing algorithms for linear models with Lasso to logistic regression with Lasso/Ridge penalty.
There are different Python ports of glmnet. For example:

  * `glmnet_python`: https://github.com/bbalasub1/glmnet_python/blob/master/test/example_binomial.py
  * `LogitNet` from `python-glmnet`: https://github.com/civisanalytics/python-glmnet
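For a quick illustration (a minimal sketch, not using the glmnet ports above), scikit-learn's LogisticRegression also supports an L1 penalty; the parameter C is the inverse penalty strength and would normally be tuned by cross-validation rather than fixed as below:
from sklearn.linear_model import LogisticRegression

# L1-penalized logistic regression on the heart data; sklearn adds its own intercept,
# so we drop the manually added intercept column. C = 0.1 is an arbitrary choice here.
lasso_logit = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
lasso_logit.fit(X.drop(columns='(intercept)'), y)
print(pd.Series(lasso_logit.coef_[0], index=X.drop(columns='(intercept)').columns))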
The Challenger O-Ring Data is a historical dataset related to the Space Shuttle Challenger disaster, which occurred on January 28, 1986. The disaster resulted in the loss of seven crew members and the destruction of the spacecraft. It was a pivotal moment in the history of space exploration and had significant implications for engineering, safety, and decision-making processes.
On the night preceding the launch, a critical decision had to be made regarding the safety of proceeding with the launch.
One of the critical factors under consideration was the ambient temperature on the day of the launch. It was known that the O-rings were less flexible at lower temperatures, potentially affecting their ability to seal the joints properly.
The dataset included records of previous shuttle flights that documented both the temperature at the time of launch and the number of O-ring failures observed.
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/space-shuttle/o-ring-erosion-or-blowby.data"
orings = pd.read_csv(url, delim_whitespace=True, header = None)
orings
|    | 0 | 1 | 2  | 3   | 4  |
|----|---|---|----|-----|----|
| 0  | 6 | 0 | 66 | 50  | 1  |
| 1  | 6 | 1 | 70 | 50  | 2  |
| 2  | 6 | 0 | 69 | 50  | 3  |
| 3  | 6 | 0 | 68 | 50  | 4  |
| 4  | 6 | 0 | 67 | 50  | 5  |
| 5  | 6 | 0 | 72 | 50  | 6  |
| 6  | 6 | 0 | 73 | 100 | 7  |
| 7  | 6 | 0 | 70 | 100 | 8  |
| 8  | 6 | 1 | 57 | 200 | 9  |
| 9  | 6 | 1 | 63 | 200 | 10 |
| 10 | 6 | 1 | 70 | 200 | 11 |
| 11 | 6 | 0 | 78 | 200 | 12 |
| 12 | 6 | 0 | 67 | 200 | 13 |
| 13 | 6 | 2 | 53 | 200 | 14 |
| 14 | 6 | 0 | 67 | 200 | 15 |
| 15 | 6 | 0 | 75 | 200 | 16 |
| 16 | 6 | 0 | 70 | 200 | 17 |
| 17 | 6 | 0 | 81 | 200 | 18 |
| 18 | 6 | 0 | 76 | 200 | 19 |
| 19 | 6 | 0 | 79 | 200 | 20 |
| 20 | 6 | 2 | 75 | 200 | 21 |
| 21 | 6 | 0 | 76 | 200 | 22 |
| 22 | 6 | 1 | 58 | 200 | 23 |
It's worth noting that no previous liftoff had occurred at temperatures below 53 degrees Fahrenheit, so assessing the risk at a launch temperature as low as 31 degrees Fahrenheit requires significant extrapolation from the available data. Nevertheless, the fitted curve in the plot below shows that the estimated risk of O-ring failure at 31 degrees Fahrenheit is notably high.
orings = orings.drop(axis = 1, columns = [0, 3, 4])
orings.columns = ['damage', 'temp']
orings = orings.sort_values(['temp'])
orings['(intercept)'] = 1
binomial_response = pd.concat([orings['damage'], 6 - orings['damage']], axis = 1)
binomial_response.columns = ['N_damaged', 'N_undamaged']
logitmod = sm.GLM(binomial_response, orings[['temp', '(intercept)']], family = sm.families.Binomial()).fit()
print(logitmod.summary())
                     Generalized Linear Model Regression Results
========================================================================================
Dep. Variable:     ['N_damaged', 'N_undamaged']   No. Observations:                   23
Model:                                      GLM   Df Residuals:                       21
Model Family:                          Binomial   Df Model:                            1
Link Function:                            Logit   Scale:                          1.0000
Method:                                    IRLS   Log-Likelihood:                -15.823
Date:                          Mon, 23 Oct 2023   Deviance:                       18.086
Time:                                  18:52:33   Pearson chi2:                     30.0
No. Iterations:                               6   Pseudo R-squ. (CS):             0.2344
Covariance Type:                      nonrobust
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
temp           -0.1156      0.047     -2.458      0.014      -0.208      -0.023
(intercept)     5.0850      3.052      1.666      0.096      -0.898      11.068
===============================================================================
The primary task is to predict the likelihood of an O-ring failure when the launch temperature falls below freezing, specifically at 31 degrees Fahrenheit. This prediction serves as a crucial component in evaluating the safety of future launches under similar conditions.
logitmod.predict([31, 1])[0]
0.8177744062821898
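Equivalently, we can compute this probability by hand from the fitted coefficients via the inverse logit (a small verification sketch):
# p(31) = 1 / (1 + exp(-(intercept + temp_coef * 31)))
eta31 = logitmod.params['(intercept)'] + logitmod.params['temp'] * 31
print(1 / (1 + np.exp(-eta31)))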
plt.scatter(orings['temp'], orings['damage']/6, s = 10);
plt.xlim([31, orings['temp'].max() + 2]);
plt.ylim([-0.1,1]);
temps = np.linspace(31, orings['temp'].max()+1, 100).reshape(-1,1)
phat = logitmod.predict(np.hstack([temps, np.ones((len(temps),1))]))
plt.plot(temps, phat, linewidth = 1);
plt.xlabel("Temp");
plt.ylabel("Chance of Damage");
Despite the concerns raised by some engineers, it was ultimately decided to proceed with the launch. Tragically, just 73 seconds into the flight, the O-rings failed due to the cold temperatures, leading to the explosion of the shuttle.