Linear Models and Regularization: Under what circumstances do we prefer to sacrifice unbiasedness?#

2025.03.18, 2025.04.01

Lecture outline#

Common components of all regression models#

  1. Architecture: The connection between input and output variables

  2. Loss function: The target quantity to be optimized. Aka cost function or objective function

  3. Solver: The method to find a solution

Simple linear regression (SLR)#

Architecture: \(y_i = \hat{y}_i + \epsilon_i = a_0 + a_1 x_i + \epsilon_i\)

Assumptions (Are these necessary??)

  • Only \(\epsilon_i\) is a random variable, with \(\epsilon_i \sim N(0, \sigma_i^2)\)

  • All \(\epsilon_i\) are independent and identically distributed (i.i.d.); \(\text{Cov}(\epsilon_i, \epsilon_j) = 0\) when \(i \neq j\).

  • \(\sigma_i^2\) is the same for all \(i\); \(\sigma_i^2 = \sigma^2\)

Loss function: Ordinary least squares (OLS); \( J = SSE = \sum_{i=1}^{N} \epsilon_i^2 \)

  • What are the advantages of using this loss function?

Solver: Analytic

  • Normal equations
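
A minimal sketch of the analytic SLR fit via the normal equations, using made-up synthetic data (not part of the course materials):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=50)   # true a0 = 2.0, a1 = 0.5

# Normal equations for y_i = a0 + a1 * x_i + eps_i:
#   a1 = Cov(x, y) / Var(x),  a0 = mean(y) - a1 * mean(x)
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()
print(a0, a1)
```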

More about SLR#

Gauss–Markov Theorem and BLUE (best linear unbiased estimator)

The sum of squares: How much variability can be explained/modeled by a regression model?

  • \(SST = SSE + SSR\)

What is the difference between \(R^2\) and \(r\)?
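
A quick numerical check, on synthetic data, that \(SST = SSE + SSR\) and that, for SLR, \(R^2 = r^2\) (an illustrative sketch only):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.3 * x + rng.normal(0, 0.5, size=100)

# SLR fit via the normal equations
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()
y_hat = a0 + a1 * x

sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
sse = np.sum((y - y_hat) ** 2)           # error (residual) sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)    # regression (explained) sum of squares

r2 = ssr / sst                           # coefficient of determination
r = np.corrcoef(x, y)[0, 1]              # Pearson correlation
print(np.isclose(sst, sse + ssr), np.isclose(r2, r ** 2))
```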

Confidence interval vs. Prediction interval

When there is a serial correlation…

  • Durbin–Watson statistic
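
A sketch of the Durbin–Watson statistic computed from an illustrative AR(1) residual series (values near 2 suggest little lag-1 serial correlation):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative AR(1) residuals with positive serial correlation
e = np.zeros(200)
for t in range(1, 200):
    e[t] = 0.7 * e[t - 1] + rng.normal(0, 1.0)

# Durbin-Watson statistic: squared first differences over the residual sum of squares
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(dw)   # well below 2 here, consistent with positive serial correlation
```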

Multiple linear regression (MLR)#

What does “linear” mean in a linear regression model?

Components#

Architecture: \(\textbf{y} = \textbf{X}\textbf{a} + \boldsymbol{\epsilon}\) (we’ll use this form in class). Another equivalent expression is also commonly seen in environmental science studies:

  • \(\textbf{d} = \textbf{G}\textbf{m}\) (Many geophysicists use this)

Loss function: OLS. \( J = SSE = \boldsymbol{\epsilon}^\text{T} \boldsymbol{\epsilon}\)

Solver: Analytic.

  • What are the normal equations?

  • Solution: \(\hat{\textbf{a}}=(\textbf{X}^\text{T}\textbf{X})^{-1}\textbf{X}^\text{T}\textbf{y}\)
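
A minimal sketch of the analytic MLR solution on synthetic data (in practice, `np.linalg.solve` or `np.linalg.lstsq` is preferred over forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n),                 # intercept column
                     rng.normal(size=n),
                     rng.normal(size=n)])
a_true = np.array([1.0, 2.0, -0.5])
y = X @ a_true + rng.normal(0, 0.3, size=n)

# Normal equations: (X^T X) a_hat = X^T y
a_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(a_hat)
```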

More about MLR#

Circular and categorical data: transform them so that we can use a linear model
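
One common recipe, sketched below with made-up variables: encode a circular predictor (e.g., wind direction) as a sine/cosine pair, and a categorical predictor as one-hot (dummy) columns:

```python
import numpy as np
import pandas as pd

# Hypothetical raw predictors
df = pd.DataFrame({
    "wind_dir_deg": [10.0, 95.0, 180.0, 350.0],        # circular, 0-360 degrees
    "land_use": ["urban", "forest", "urban", "crop"],   # categorical
})

# Circular variable -> sine/cosine pair (continuity across 360/0 is preserved)
theta = np.deg2rad(df["wind_dir_deg"])
df["wind_sin"] = np.sin(theta)
df["wind_cos"] = np.cos(theta)

# Categorical variable -> one-hot (dummy) columns
df = pd.get_dummies(df, columns=["land_use"])
print(df)
```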

Analysis of Variance (ANOVA)

  • Sum of squares explained by each predictor

  • F-test
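
A sketch of the overall F-test computed directly from the sums of squares (synthetic data; `scipy.stats.f` supplies the reference distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 100, 2                                     # p predictors plus an intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(0, 1.0, size=n)

a_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ a_hat

ssr = np.sum((y_hat - y.mean()) ** 2)             # explained by the regression
sse = np.sum((y - y_hat) ** 2)                    # residual

# Overall F-test: H0 says all slope coefficients are zero
F = (ssr / p) / (sse / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)
print(F, p_value)
```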

Stepwise regression#

When you have lots of potential predictors…

  • The more, the better?

  • The fewer, the better?

  • What if you have no choice but to keep all the predictors?

How do we do stepwise regression?
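
A sketch of forward stepwise selection using AIC as the entry criterion (the candidate pool and stopping rule below are illustrative; many variants exist):

```python
import numpy as np

def aic(y, y_hat, k):
    """Gaussian AIC up to a constant: n*log(SSE/n) + 2k, with k fitted parameters."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)
    return n * np.log(sse / n) + 2 * k

def fit(X, y):
    a = np.linalg.lstsq(X, y, rcond=None)[0]
    return X @ a

rng = np.random.default_rng(5)
n = 150
candidates = {f"x{j}": rng.normal(size=n) for j in range(5)}
y = 2.0 * candidates["x0"] - 1.0 * candidates["x3"] + rng.normal(0, 1.0, size=n)

selected = []                        # start from the intercept-only model
X = np.ones((n, 1))
best_aic = aic(y, fit(X, y), 1)

improved = True
while improved:
    improved = False
    for name, col in candidates.items():
        if name in selected:
            continue
        X_try = np.column_stack([X, col])
        score = aic(y, fit(X_try, y), X_try.shape[1])
        if score < best_aic:         # track the best single addition this round
            best_aic, best_name, best_X = score, name, X_try
            improved = True
    if improved:                     # stop when no candidate improves the AIC
        selected.append(best_name)
        X = best_X

print(selected)   # typically picks x0 and x3 here
```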

Criticism: Isn’t this cherry-picking?

Rank-deficient (or ill-conditioned) problems and regularized least-squares (i.e., a family of shrinkage methods)

Regularized least-squares models#

Regularization tries to redesign the loss function by imposing an additional constraint.

What are its benefits?

  • To keep all the predictors in the regression model

  • To improve the solution by pushing the input matrix toward full rank or better conditioning

Example: The one-hot encoding case from the worksheet
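
A sketch of that one-hot case: with an intercept plus a dummy column for every category, the dummy columns sum to the intercept column, so \(\textbf{X}\) is rank-deficient and \(\textbf{X}^\text{T}\textbf{X}\) is singular (illustrative categories):

```python
import numpy as np

# Three observations per category, three categories, full one-hot encoding
cat = np.repeat([0, 1, 2], 3)
onehot = np.eye(3)[cat]                    # 9 x 3 dummy matrix
X = np.column_stack([np.ones(9), onehot])  # intercept + all three dummies

print(np.linalg.matrix_rank(X))            # 3, but X has 4 columns -> rank-deficient
print(np.linalg.matrix_rank(X.T @ X))      # also 3; (X^T X) is singular
```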

Ridge regression#

a.k.a. L-2 regularization or Tikhonov regularization

Architecture: same as MLR

Loss function: Regularized OLS. \( J = \boldsymbol{\epsilon}^\text{T} \boldsymbol{\epsilon} + \lambda\textbf{a}^\text{T}\textbf{a}\)

  • This is the explicit expression using the Lagrange multiplier \(\lambda\).

  • It is equivalent to a standard OLS (\(\boldsymbol{\epsilon}^\text{T} \boldsymbol{\epsilon}\)) under a constraint \(\textbf{a}^\text{T}\textbf{a} \leq b\). See Appendix B of the textbook for more details.

Solver: Analytic

  • Solution: \(\hat{\textbf{a}}=(\textbf{X}^\text{T}\textbf{X}+\lambda \textbf{I})^{-1}\textbf{X}^\text{T}\textbf{y}\)
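
A minimal sketch of the ridge solution next to OLS for two nearly collinear predictors (the value of \(\lambda\) is arbitrary; note this form penalizes the intercept too, matching the formula above):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.01, size=n)       # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(0, 0.5, size=n)

lam = 1.0                                   # arbitrary illustrative value
m = X.shape[1]
a_ols   = np.linalg.solve(X.T @ X, X.T @ y)
a_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
print(a_ols)     # often large, offsetting coefficients on the collinear pair
print(a_ridge)   # shrunken, more stable coefficients
```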

How to select the hyperparameter \(\lambda\)?

  • Validation (This is the time you start to see the so-called validation data!)

  • Cross-validation (We’ll continue to talk about this during the NN topic)
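
A sketch of choosing \(\lambda\) with a simple hold-out validation split (k-fold cross-validation follows the same idea; the \(\lambda\) grid below is illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(7)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])
y = X @ rng.normal(size=6) + rng.normal(0, 1.0, size=n)

# Hold out 25% of the rows as validation data
idx = rng.permutation(n)
train, val = idx[:150], idx[150:]

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]     # illustrative grid
val_mse = []
for lam in lambdas:
    a = ridge_fit(X[train], y[train], lam)
    val_mse.append(np.mean((y[val] - X[val] @ a) ** 2))

best = lambdas[int(np.argmin(val_mse))]
print(best, val_mse)
```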

Lasso#

a.k.a. L-1 regularization or “least absolute shrinkage and selection operator”

Architecture: same as MLR

Loss function: Regularized OLS. \( J = \boldsymbol{\epsilon}^\text{T} \boldsymbol{\epsilon} + \lambda \sum_{j=1}^{m}|a_j|\)

  • What is its implicit form?

Solver: Typically solved numerically since the loss function is not differentiable (the L-1 penalty has no derivative at zero). We’ll talk about some possible methods in the next topic.

Lasso also works as a predictor selector – how so?

  • Finding the constraint region of Ridge and LASSO

  • What is the shape of the regions?

  • Lasso finds a solution with greater sparsity.
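
A sketch of the selection behavior using scikit-learn’s `Lasso` and `Ridge` estimators (the data and penalty strength are illustrative; scikit-learn calls the penalty `alpha`):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(8)
n, m = 200, 10
X = rng.normal(size=(n, m))
# Only the first two predictors actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print(np.round(lasso.coef_, 2))   # many coefficients exactly zero -> predictor selection
print(np.round(ridge.coef_, 2))   # all coefficients shrunk but nonzero
```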

Generalized Least Squares#

a.k.a. weighted least squares (WLS) when \(\textbf{C}\) is diagonal. Exploring more ways to redesign the loss function!

Architecture: same as MLR

Loss function: WLS; \( J = \boldsymbol{\epsilon}^\text{T} \textbf{C}^{-1}\boldsymbol{\epsilon}\) where \(\textbf{C}\) is the covariance matrix of the residuals.

Solver: Analytic (a sketch follows below).

What does this model typically imply about the assumptions on the data? Why the Mahalanobis distance?
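
A sketch of the analytic GLS solution \(\hat{\textbf{a}} = (\textbf{X}^\text{T}\textbf{C}^{-1}\textbf{X})^{-1}\textbf{X}^\text{T}\textbf{C}^{-1}\textbf{y}\), here with a diagonal \(\textbf{C}\) (the WLS case) built from made-up per-observation variances:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Heteroscedastic errors: each observation has its own known variance
sigma2 = rng.uniform(0.1, 4.0, size=n)
y = X @ np.array([1.0, 2.0]) + rng.normal(0, np.sqrt(sigma2))

C_inv = np.diag(1.0 / sigma2)                # inverse of the (diagonal) covariance matrix

a_gls = np.linalg.solve(X.T @ C_inv @ X, X.T @ C_inv @ y)
a_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(a_gls, a_ols)                          # both unbiased; GLS has lower variance
```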

Final thoughts#

Under what circumstances do we prefer to sacrifice unbiasedness?

  • \(\textbf{X}^\text{T}\textbf{X}\) is not invertible (i.e., \(\textbf{X}\) is a rank-deficient matrix)

  • \(\textbf{X}^\text{T}\textbf{X}\) is invertible, but \(\textbf{X}\) has columns that are highly correlated with each other (i.e., \(\textbf{X}\) is an ill-conditioned matrix; see the condition-number sketch after this list)

  • Are there other reasons? (1)

  • Are there other reasons? (2)
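
A sketch comparing the condition number of \(\textbf{X}^\text{T}\textbf{X}\) with and without the ridge term, for a design with two highly correlated columns (the correlation level and \(\lambda\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.001, size=n)            # nearly a copy of x1
X = np.column_stack([np.ones(n), x1, x2])

lam = 1.0
print(np.linalg.cond(X.T @ X))                    # huge: ill-conditioned
print(np.linalg.cond(X.T @ X + lam * np.eye(3)))  # much smaller after regularization
```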

Group discussion & Demos#

  1. Introduction to Jupyter Notebook / JupyterLab

  2. Load an Exercise data set using Jupyter Notebook (hosted by Google Colaboratory, Callysto Hub, or your local machine)
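
A minimal sketch of loading a CSV data set with pandas (the file name below is a placeholder, not the actual course file):

```python
import pandas as pd

# Placeholder file name; replace with the path or URL of the exercise data set
df = pd.read_csv("exercise_data.csv")
df.head()
```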