PSL: Linear Regression (Fall 2023)

Load Data

The Prostate data can be found at https://hastie.su.domains/ElemStatLearn/data.html. The data come from a study examining the correlation between the level of prostate-specific antigen (PSA) and a number of clinical measures in men who were about to receive a radical prostatectomy.

  • lcavol: log cancer volume
  • lweight: log prostate weight
  • age: in years
  • lbph: log of the amount of benign prostatic hyperplasia
  • svi: seminal vesicle invasion
  • lcp: log of capsular penetration
  • gleason: Gleason score
  • pgg45: percent of Gleason score 4 or 5
  • lpsa: log of the prostate-specific antigen level (the response)

The first 8 columns are predictors; column 9 is the outcome/response. We do not use column 10, which indicates which 67 observations were used as the “training set” and which 30 as the test set, as described on page 48 in the ESL book.

myData = read.table(file = "https://hastie.su.domains/ElemStatLearn/datasets/prostate.data", header = TRUE)
dim(myData)
names(myData)

Remove column 10 and run a quick summary of each column of the data. For convenience, we rename the response column with a generic name Y.

myData = myData[, -10]
names(myData)[9] = 'Y'
summary(myData)

Produce a pair-wise scatter plot. Caution: a big figure.

pairs(myData, pch='.')

Fit a Linear Model

Fit a linear regression model using all the predictors and print the summary of the least squares (LS) results.

lmfit <-  lm(Y ~ ., data = myData)
summary(lmfit)

Call:
lm(formula = Y ~ ., data = myData)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76644 -0.35510 -0.00328  0.38087  1.55770 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.181561   1.320568   0.137  0.89096    
lcavol       0.564341   0.087833   6.425 6.55e-09 ***
lweight      0.622020   0.200897   3.096  0.00263 ** 
age         -0.021248   0.011084  -1.917  0.05848 .  
lbph         0.096713   0.057913   1.670  0.09848 .  
svi          0.761673   0.241176   3.158  0.00218 ** 
lcp         -0.106051   0.089868  -1.180  0.24115    
gleason      0.049228   0.155341   0.317  0.75207    
pgg45        0.004458   0.004365   1.021  0.31000    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6995 on 88 degrees of freedom
Multiple R-squared:  0.6634,    Adjusted R-squared:  0.6328 
F-statistic: 21.68 on 8 and 88 DF,  p-value: < 2.2e-16

Check what is returned by lm and learn how to retrieve those results.

names(lmfit)  # What are returned by "lm"?
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
 [6] "assign"        "qr"            "df.residual"   "xlevels"       "call"         
[11] "terms"         "model"        
names(summary(lmfit)) # What are returned by the summary of lmfit
 [1] "call"          "terms"         "residuals"     "coefficients"  "aliased"      
 [6] "sigma"         "df"            "r.squared"     "adj.r.squared" "fstatistic"   
[11] "cov.unscaled" 
# Regression coefficients including the intercept
lmfit$coef    
lmfit$residuals[1]
length(lmfit$residuals)
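
As an aside (not part of the original analysis), the extractor functions coef, residuals, and fitted are the more idiomatic way to pull out these pieces; a minimal sketch on simulated data:

```r
## Extractor functions are equivalent to `$` access (sketch on simulated data)
set.seed(1)
demo <- data.frame(x = rnorm(20))
demo$y <- 1 + 2 * demo$x + rnorm(20)
demo_fit <- lm(y ~ x, data = demo)

coef(demo_fit)              # same as demo_fit$coefficients
head(residuals(demo_fit))   # same as head(demo_fit$residuals)
head(fitted(demo_fit))      # same as head(demo_fit$fitted.values)
```

Unlike `$` access, these extractors also work uniformly across other model classes such as glm.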

Check the calculation for Residual standard error and Multiple R-squared in the summary output: \[ \hat{\sigma} = \sqrt{\frac{\sum_{i=1}^n r_i^2}{n - p - 1}}, \quad p = \text{ # of non-intercept predictors} \] \[ R^2 = 1 - \frac{\sum_{i=1}^n r_i^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = 1 - \frac{\sum_{i=1}^n r_i^2}{\text{SampleVar}(y_i)\times (n-1)} = 1 - \frac{\text{SampleVar}(r_i)}{\text{SampleVar}(y_i)} \]

n = dim(myData)[1]
p = dim(myData)[2] - 1
# Residual standard error
c(sqrt(sum(lmfit$residuals^2)/(n- p - 1)), summary(lmfit)$sigma)  
# R-squared
c(1 - sum(lmfit$residuals^2)/(var(myData$Y)*(n-1)), summary(lmfit)$r.squared) 
1 - var(lmfit$residuals)/var(myData$Y)
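
The summary output also reports Adjusted R-squared, which is not derived above; assuming the standard formula R²_adj = 1 − (1 − R²)(n − 1)/(n − p − 1), it can be checked the same way (here on simulated data so the chunk is self-contained):

```r
## Check Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
set.seed(42)
n0 <- 50; p0 <- 3
X0 <- matrix(rnorm(n0 * p0), n0, p0)
y0 <- X0 %*% c(1, -1, 0.5) + rnorm(n0)
s0 <- summary(lm(y0 ~ X0))
c(1 - (1 - s0$r.squared) * (n0 - 1) / (n0 - p0 - 1), s0$adj.r.squared)
```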

Prediction

Predicting responses for new observations can be accomplished through two methods:

  • Method 1: Organize the new observations into a matrix (or vector), then compute newX %*% beta + beta0, where beta0 is the LS intercept and beta is the vector of LS coefficients for the predictors.

  • Method 2: Alternatively, structure the new observations as a data frame with identical column names as those in myData. Subsequently, utilize the predict function.

To illustrate these methods, we consider two new patients: one is 50 years old, the other 70. All other attributes are set to the median values of the 97 patients in the existing dataset.

Method 1

apply(myData, 2, median)
    lcavol    lweight        age       lbph        svi        lcp    gleason 
 1.4469190  3.6230070 65.0000000  0.3001046  0.0000000 -0.7985077  7.0000000 
     pgg45          Y 
15.0000000  2.5915164 
newx1 = c(1.45, 3.62, 50, 0.30, 0, -0.80, 7, 15) #1st patient: age = 50
newx2 = newx1
newx2[3] = 70 #2nd patient: age = 70
newx = rbind(newx1, newx2)
newx %*% lmfit$coef[-1] + lmfit$coef[1]
          [,1]
newx1 2.714471
newx2 2.289507

Method 2

newdf = data.frame(newx)
names(newdf) = names(myData[, -dim(myData)[2]]) # assign col names after dropping the last col

An advantage of the predict function is that once columns have names, their order does not matter. So, let’s shuffle the columns of newdf and then predict. The result should be the same as what is returned by Method 1.

newdf
newdf = newdf[, sample(1:dim(newdf)[2])]
newdf
predict(lmfit, newdata = newdf)
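
Beyond point predictions, predict for lm objects can also return interval estimates through its interval argument (a sketch on simulated data, independent of the prostate fit):

```r
## predict() can also return confidence / prediction intervals
set.seed(7)
sim <- data.frame(x = rnorm(30))
sim$y <- 2 + 3 * sim$x + rnorm(30)
sim_fit <- lm(y ~ x, data = sim)
sim_new <- data.frame(x = c(-1, 0, 1))
predict(sim_fit, newdata = sim_new, interval = "confidence")  # for the mean response
predict(sim_fit, newdata = sim_new, interval = "prediction")  # for a single new observation
```

Prediction intervals are wider than confidence intervals because they also account for the noise in an individual response.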

Rank Deficiency

In cases where the design matrix, including the intercept, does not have full rank, the coefficient vector returned by R will contain some NA elements. An NA estimate for a column signifies that that column can be written as a linear combination of preceding columns, rendering it redundant.

NA values do not indicate errors. They simply indicate that in the LS fitting process, R discards columns with NA coefficients. However, this doesn’t hinder the utility of the fitted model for prediction purposes. The predictions derived from this model will align with those obtained from fitting a linear regression model excluding the columns with NA coefficients.

## Add a fake column named "junk", which is a linear combination of the first two predictors
myData$junk <-  myData[, 1] + myData[, 2]
tmp.lm <-  lm(Y ~ ., myData)
summary(tmp.lm)

The fitted values from the original lmfit and the new fit tmp.lm are the same.

tmp.lm$fitted[1:3]
lmfit$fitted[1:3]

## drop the "junk" column
myData = myData[, !names(myData) == 'junk' ]
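
To pinpoint exactly which linear combination produced an NA, base R's alias function can be used. The sketch below (on simulated data, so it does not depend on the junk column above) also re-confirms that predictions agree with the reduced model:

```r
## alias() reveals the exact linear dependency behind an NA coefficient
set.seed(3)
dd <- data.frame(x1 = rnorm(25), x2 = rnorm(25))
dd$x3 <- dd$x1 + dd$x2                 # exact linear combination
dd$y  <- 1 + dd$x1 - dd$x2 + rnorm(25)
fit_a <- lm(y ~ x1 + x2 + x3, data = dd)
coef(fit_a)                            # coefficient for x3 is NA
alias(fit_a)$Complete                  # shows x3 = x1 + x2
fit_b <- lm(y ~ x1 + x2, data = dd)    # model without the redundant column
max(abs(fitted(fit_a) - fitted(fit_b)))
```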

Training vs Test Errors

In linear regression models, as we introduce more variables, training error (measured by RSS or MSE) consistently decreases. However, this reduction in training error does not guarantee a corresponding decrease in test error, which measures prediction accuracy on independent test data. To demonstrate this, we randomly split our dataset into 60% training and 40% test portions, progressively adding predictors. This approach highlights how additional predictors can lower training error while test error may not follow the same trend.

n <- dim(myData)[1] # sample size
p <-  dim(myData)[2] - 1 # number of non-intercept predictors

ntrain <-  round(n*0.6)
train.id <-  sample(1:n, ntrain)
train.MSE <-  rep(0, p)
test.MSE <-  rep(0, p)

for(i in 1:p){
  myfit <-  lm(Y ~ ., myData[train.id, c(1:i, (p+1))])
  train.Y <-  myData[train.id, (p+1)]
  train.Y.pred <-  myfit$fitted
  train.MSE[i] <-  mean((train.Y - train.Y.pred)^2)
  
  test.Y <-  myData[-train.id, (p+1)]
  test.Y.pred <-  predict(myfit, newdata = myData[-train.id, ])
  test.MSE[i] <-  mean((test.Y - test.Y.pred)^2)
}

## type="n": don't plot; just set the plotting region
plot(c(1, p), range(train.MSE, test.MSE), type = "n", xlab = "# of variables", ylab = "MSE")
points(train.MSE, col = "blue", pch = 1)
lines(train.MSE, col = "blue")
points(test.MSE, col = "red", pch = 2)
lines(test.MSE, col = "red")

Feel free to run the provided code multiple times. You’ll observe the following patterns:

  • In most instances, the blue line (training error) lies below the red line (test error): the training error is generally smaller than the test error. Due to the randomness of the split, however, there can be runs where the red line falls below the blue line.

  • In every run, the blue line is monotonically decreasing: training error can only go down as more predictors are added. The red line, in contrast, need not decrease. Looking at the differences between adjacent terms, those for the blue line are always negative, while those for the red line may include positive values.

diff(train.MSE) ## always negative
diff(test.MSE)  ## not always negative

Interpret LS Coefficients

How do we interpret LS coefficients? An LS coefficient expresses how a change in one predictor affects the response while all other predictors are held constant. For instance, the coefficient for age quantifies the average change in the response for each one-year increase in age, holding all other predictors constant.

It’s noteworthy that results from Simple Linear Regression (SLR), which involves only one non-intercept predictor, might differ from results in Multiple Linear Regression (MLR), which incorporates multiple predictors. This divergence can arise due to correlations among predictors.

For instance, suppose age is positively correlated with some predictors that have a positive effect on the response. In a joint model (MLR), the coefficient for age can turn negative, seemingly contrary to its positive effect in an isolated model (SLR). The negative coefficient compensates for the positive contribution carried by the correlated predictors, so that the overall model accurately represents the relationships between the predictors and the response.
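
The sign flip described above is easy to reproduce on simulated data (a hypothetical setup: x2 is strongly positively correlated with x1, and the true coefficient of x2 is negative):

```r
## SLR vs MLR sign flip under correlated predictors
set.seed(2)
x1 <- rnorm(200)
x2 <- x1 + 0.5 * rnorm(200)            # x2 positively correlated with x1
y  <- 3 * x1 - x2 + 0.5 * rnorm(200)   # true effect of x2 is negative
coef(lm(y ~ x2))["x2"]                 # SLR: positive (absorbs x1's effect)
coef(lm(y ~ x1 + x2))["x2"]            # MLR: negative, near the true -1
```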

summary(lm(Y~ age, myData))

Call:
lm(formula = Y ~ age, data = myData)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.90738 -0.71234  0.06967  0.66187  2.99584 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.79906    1.00793   0.793   0.4299  
age          0.02629    0.01568   1.677   0.0968 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.144 on 95 degrees of freedom
Multiple R-squared:  0.02876,   Adjusted R-squared:  0.01854 
F-statistic: 2.813 on 1 and 95 DF,  p-value: 0.09677
round(cor(myData), dig=2)
        lcavol lweight  age  lbph   svi   lcp gleason pgg45    Y
lcavol    1.00    0.28 0.22  0.03  0.54  0.68    0.43  0.43 0.73
lweight   0.28    1.00 0.35  0.44  0.16  0.16    0.06  0.11 0.43
age       0.22    0.35 1.00  0.35  0.12  0.13    0.27  0.28 0.17
lbph      0.03    0.44 0.35  1.00 -0.09 -0.01    0.08  0.08 0.18
svi       0.54    0.16 0.12 -0.09  1.00  0.67    0.32  0.46 0.57
lcp       0.68    0.16 0.13 -0.01  0.67  1.00    0.51  0.63 0.55
gleason   0.43    0.06 0.27  0.08  0.32  0.51    1.00  0.75 0.37
pgg45     0.43    0.11 0.28  0.08  0.46  0.63    0.75  1.00 0.42
Y         0.73    0.43 0.17  0.18  0.57  0.55    0.37  0.42 1.00

Partial Regression Coef

Check how to retrieve the LS coefficient for age using Algorithm 3.1 in ESL (regression by successive orthogonalization).

y.star <-  lm(Y ~ ., data = subset(myData, select = -age))$res
age.star <-  lm(age ~ ., data = subset(myData, select = -Y))$res
tmpfit <-  lm(y.star ~ age.star)

The LS coefficient for age (from lmfit) is the same as the one from tmpfit. The residuals from the two LS models are also the same.

tmpfit$coef
sum((lmfit$res - tmpfit$res)^2)

F-test

Test a single predictor (in this case, F-test = t-test).

lmfit0 <- lm(Y ~ ., data = subset(myData, select = -age))
anova(lmfit0, lmfit)

Test multiple predictors.

lmfit0 <- lm(Y ~ ., data = myData[, -c(1:3)])
anova(lmfit0, lmfit)
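
The F statistic that anova reports can be reproduced by hand from the two residual sums of squares, F = [(RSS0 − RSS1)/q] / [RSS1/(n − p − 1)], where q is the number of predictors being tested; a self-contained sketch on simulated data:

```r
## Reproduce the anova() F statistic from the two RSS values
set.seed(5)
sim2 <- data.frame(x1 = rnorm(60), x2 = rnorm(60), x3 = rnorm(60))
sim2$y <- 1 + sim2$x1 + 0.5 * sim2$x2 + rnorm(60)
red_fit  <- lm(y ~ x1, data = sim2)            # reduced model
full_fit <- lm(y ~ x1 + x2 + x3, data = sim2)  # full model
rss0 <- sum(resid(red_fit)^2)
rss1 <- sum(resid(full_fit)^2)
q <- 2                                         # predictors tested: x2 and x3
Fstat <- ((rss0 - rss1) / q) / (rss1 / full_fit$df.residual)
c(Fstat, anova(red_fit, full_fit)$F[2])        # should agree
```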

Collinearity

Check the Car Seat Position data from the faraway package.

Car drivers like to adjust the seat position for their own comfort. Car designers would find it helpful to know how different drivers will position the seat depending on their size and age. Researchers at the HuMoSim laboratory (http://humosim.org/) at the University of Michigan collected data on 38 drivers.

  • Age: in years
  • Weight: in lbs
  • HtShoes: height with shoes in cm
  • Ht: height without shoes in cm
  • Seated: seated height in cm
  • Arm: lower arm length in cm
  • Thigh: thigh length in cm
  • Leg: lower leg length in cm
  • hipcenter: horizontal distance of the midpoint of the hips from a fixed location in the car in mm

The researchers were interested in determining if a relationship exists between hipcenter and the other variables. Because of the high correlations among the predictors, we see a high R-squared and a significant overall F-test, yet no individual variable is significant.

library(faraway)
data(seatpos)
pairs(seatpos, pch = ".")
summary(lm(hipcenter ~ . , data=seatpos))
## check pairwise correlation
round(cor(seatpos), dig=2)
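
A standard numeric diagnostic here is the variance inflation factor, VIF_j = 1/(1 − R²_j), where R²_j is the R-squared from regressing predictor j on all the others (the faraway package provides a vif function; below is a hand-rolled sketch on simulated data so it runs without the package):

```r
## Hand-rolled variance inflation factors: VIF_j = 1 / (1 - R^2_j)
set.seed(9)
z1 <- rnorm(100)
z2 <- z1 + 0.1 * rnorm(100)   # nearly collinear with z1
z3 <- rnorm(100)              # roughly independent of the others
Z  <- data.frame(z1, z2, z3)
vif <- sapply(names(Z), function(j) {
  r2 <- summary(lm(Z[[j]] ~ ., data = Z[, names(Z) != j, drop = FALSE]))$r.squared
  1 / (1 - r2)
})
round(vif, 1)   # z1 and z2 are inflated; z3 stays near 1
```

As a rule of thumb, VIF values above roughly 5 to 10 are often taken to signal problematic collinearity.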

If we remove some (almost) redundant variables, the LS results make much more sense.

summary(lm(hipcenter ~ Age + Weight + Ht + Seated, data=seatpos))
summary(lm(hipcenter ~ Ht, data=seatpos))