Lecture 18: Linear Regression - Foundations#

Multivariate data can be summarised statistically, through measures of location, dispersion, shape, and association, and visually, through scatter plots, line plots, bar charts, histograms, and box plots, among others. In this lecture, we explore regression as a tool to model relationships among the variables in multivariate data.

Endogenous Variable#

The word endogenous comes from the Greek words endon (meaning “within”) and genes (meaning “produced” or “generated”). Hence, an endogenous variable - also known as the dependent or response variable - is one whose value is determined within the system being studied. In the context of regression, it is the variable we aim to explain or predict based on its relationship with exogenous variables.

Exogenous Variables#

The word exogenous originates from the Greek exo (meaning “outside”) and genes (meaning “produced” or “generated”). Hence, an exogenous variable - also known as an independent or predictor variable - is a variable whose value is determined outside the system being modeled and is not influenced by other variables within that model. In a regression context, exogenous variables are assumed to be known or given, and their influence on the endogenous variable is estimated through the regression process.

Examples#

  • Travel Time Prediction on an Arterial Corridor

    Endogenous Variable: travel time

    Exogenous Variables: traffic volume, signal timing parameters, time of day, etc.

  • Mode Choice Prediction for Commuter Trips

    Endogenous Variable: mode choice

    Exogenous Variables: individual-specific parameters (socio-demographic and socio-economic variables), choice-specific parameters (time, cost, safety, and reliability, among others)

  • Freight Volume Prediction in a Logistics Network

    Endogenous Variable: freight volume

    Exogenous Variables: supply parameters (industrial activity in the form of number of entities, floor space, labour force, etc.), network variables (extent of multimodal connectivity through road, rail, air, and port), and demand parameters (accessibility to markets, level of urbanization, household income levels, etc.)

General Model#

Linear regression is a fundamental statistical tool used to model the relationship between a quantitative endogenous variable and one or more exogenous variables. Consequently, the general linear regression model can be represented in matrix form as, \(\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\),

where,

  • \(\mathbf{Y}\) is the vector of observed dependent (endogenous) variable values

  • \(\mathbf{X}\) is the matrix of observed values of independent (exogenous) variables. Note that it includes a column of ones for the intercept.

  • \(\boldsymbol{\beta}\) is the vector of regression coefficients

  • \(\boldsymbol{\epsilon}\) is the vector of random errors (disturbances), capturing unobserved influences

\[\begin{split} \mathbf{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_i \\ \vdots \\ Y_n \end{bmatrix}_{n \times 1} \mathbf{X} = \begin{bmatrix} 1 & X_{11} & \cdots & X_{1j} & \cdots & X_{1m} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 1 & X_{i1} & \cdots & X_{ij} & \cdots & X_{im} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & \cdots & X_{nj} & \cdots & X_{nm} \end{bmatrix}_{n \times (m+1)} \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_m \end{bmatrix}_{(m+1) \times 1} \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_i \\ \vdots \\ \epsilon_n \end{bmatrix}_{n \times 1} \end{split}\]

Consequently, the expanded form for a single observation is,

\[ Y_i = \beta_0 + \beta_1X_{i1} + \dots + \beta_jX_{ij} + \dots + \beta_mX_{im} + \epsilon_i \]

where,

  • \(Y_i\) is the actual observed outcome for the \(i^{th}\) observation

  • \(X_{ij}\) is the value of the \(j^{th}\) independent variable for the \(i^{th}\) observation

  • \(\beta_j\) is the regression coefficient associated with the \(j^{th}\) variable

  • \(\epsilon_i\) captures the unobserved factors affecting \(Y_i\)
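To make the matrix form concrete, the following is a minimal sketch in Python (using NumPy) of how \(\mathbf{Y}\) and \(\mathbf{X}\) might be assembled; the data here are synthetic, and the coefficient values are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 100, 2                                # n observations, m exogenous variables

# Hypothetical exogenous data (e.g., two predictors measured on n observations)
X_raw = rng.uniform(0, 10, size=(n, m))

# Design matrix: a column of ones is prepended for the intercept beta_0
X = np.column_stack([np.ones(n), X_raw])     # shape: (n, m + 1)

# Simulate Y = X beta + epsilon with assumed "true" coefficients
beta_true = np.array([2.0, 1.5, -0.5])       # [beta_0, beta_1, beta_2]
epsilon = rng.normal(0.0, 1.0, size=n)       # random disturbances
Y = X @ beta_true + epsilon                  # shape: (n,)
```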

The regression coefficients are estimated by minimizing the sum of squared errors (SSE) as follows,

\[\begin{split} \begin{aligned} \boldsymbol{\hat{\beta}} & = \underset{\boldsymbol{\beta}}{\text{argmin}} \ \text{SSE} \\ & = \underset{\boldsymbol{\beta}}{\text{argmin}} \ \boldsymbol{\epsilon}^\text{T} \boldsymbol{\epsilon} \\ & = \underset{\boldsymbol{\beta}}{\text{argmin}} \ (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^\text{T} (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) \\ & = \underset{\boldsymbol{\beta}}{\text{argmin}} \ (\mathbf{Y}^\text{T} \mathbf{Y} - 2\boldsymbol{\beta}^\text{T} \mathbf{X}^\text{T} \mathbf{Y} + \boldsymbol{\beta}^\text{T} \mathbf{X}^\text{T} \mathbf{X} \boldsymbol{\beta}) \end{aligned} \end{split}\]

Setting the gradient of the SSE with respect to \(\boldsymbol{\beta}\) to zero yields the normal equations, \(\mathbf{X}^\text{T} \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}^\text{T} \mathbf{Y}\), and hence, provided \(\mathbf{X}^\text{T} \mathbf{X}\) is invertible,

\[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^\text{T} \mathbf{X})^{-1}\mathbf{X}^\text{T} \mathbf{Y} \]
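Continuing the sketch above (reusing X and Y), the estimator can be computed directly from the closed form; solving the normal equations with np.linalg.solve is numerically preferable to explicitly inverting \(\mathbf{X}^\text{T} \mathbf{X}\).

```python
# Closed-form OLS: beta_hat = (X^T X)^{-1} X^T Y
# (explicit inversion, shown only to mirror the formula)
beta_hat_inv = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# Numerically preferable: solve the normal equations X^T X beta = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

print(beta_hat)  # should lie close to beta_true from the previous sketch
```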

The resulting coefficients are referred to as the Ordinary Least Squares (OLS) estimators, and consequently, the prediction is given by,

\[ \mathbf{\hat{Y}} = \mathbf{X}\boldsymbol{\hat{\beta}} \]

Equivalently,

\[ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1X_{i1} + \dots + \hat{\beta}_jX_{ij} + \dots + \hat{\beta}_mX_{im} \]

These predicted values represent the best linear approximation of the outcome based on the observed explanatory variables, under the classical regression assumptions.
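Continuing the same sketch, the fitted values and residuals follow directly from \(\boldsymbol{\hat{\beta}}\):

```python
# Fitted values: Y_hat = X beta_hat
Y_hat = X @ beta_hat

# Residuals estimate the unobserved disturbances epsilon
residuals = Y - Y_hat

# Sum of squared errors achieved by the OLS fit
sse = residuals @ residuals
print(f"SSE = {sse:.2f}")
```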

Assumptions#

For linear regression to yield unbiased, consistent, and efficient estimators, the following assumptions must hold true:

Note

Unbiased: An estimator is said to be unbiased if its expected value matches the true parameter it aims to estimate.

Consistent: An estimator is said to be consistent if it converges to the true parameter as the sample size increases.

Efficient: Among all unbiased estimators, an estimator is efficient if it has the lowest possible variance.

  • The relationship between the dependent variable and the independent variables must be linear in the parameters. If the assumption of linearity does not hold true, then the estimates are no longer unbiased.

  • The observations must be independently sampled, leading to uncorrelated errors across observations. If the assumption of independence does not hold true, then the estimates are no longer efficient.

  • The error terms must have constant variance for all observations, \(\text{Var}(\epsilon_i | \mathbf{X}) = \sigma^2\). If the assumption of homoskedasticity does not hold true, then the estimates are no longer efficient.

  • The independent variables must not form perfect linear combinations of each other. If the assumption of no perfect multicollinearity does not hold true, then \(\mathbf{X}^\text{T} \mathbf{X}\) is not invertible and the coefficients cannot be uniquely estimated.

  • The error term has an expected value of zero given the explanatory variables, \(\text{E}(\epsilon_i | \mathbf{X}) = 0\). If the assumption of zero conditional mean of errors does not hold true, then the estimates are no longer unbiased.

If all these assumptions hold true, then the OLS estimators are referred to as the Best Linear Unbiased Estimators (BLUE), a result known as the Gauss-Markov theorem.
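As a rough illustration, the sketch below (continuing from the code above) runs a few informal numerical checks on these assumptions; in practice, formal diagnostics such as the Breusch-Pagan test for heteroskedasticity or the Durbin-Watson test for error correlation would be used.

```python
# Zero conditional mean: the residuals should average out to (near) zero
print("mean residual:", residuals.mean())

# Homoskedasticity (informal): residual spread should be similar across
# low and high fitted values
low = Y_hat < np.median(Y_hat)
print("residual std (low vs high fit):",
      residuals[low].std(), residuals[~low].std())

# Perfect multicollinearity: a very large condition number of X signals
# (near-)linear dependence among the columns of the design matrix
print("condition number of X:", np.linalg.cond(X))
```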