
Lecture 21: Logistic Regression - Foundations#

In the previous lectures, we used linear regression to model the effects of exogenous variables on a quantitative endogenous variable. In this lecture, we introduce logistic regression to model the effects of exogenous variables on a qualitative/categorical endogenous variable.

Examples#

  • Commute Mode Choice

    Endogenous Variable: mode of transport chosen for commute (e.g., car, bus, train, bike)

    Exogenous Variables: socio-demographic and socio-economic variables, travel time and cost, among others

  • Route Selection

    Endogenous Variable: chosen route among alternatives

    Exogenous Variables: travel distance, time, cost, convenience, etc.

  • Vehicle Purchase Decision

    Endogenous Variable: vehicle purchased among alternatives

    Exogenous Variables: socio-demographic and socio-economic variables, vehicle price, comfort, quality, mileage, and more

General Model#

Logistic regression is a fundamental statistical tool used to model the relationship between a qualitative/categorical endogenous variable and one or more exogenous variables. It is particularly useful for modeling discrete choice behaviour, where individuals select an option from a set of mutually exclusive alternatives, for instance in mode choice, route selection, or vehicle purchase decisions.

Mathematically, the underlying behaviour is derived from Random Utility theory, which states that each individual \(i\) derives a (latent) utility \(U_{ij}\) from alternative \(j\), composed of a deterministic component \(\mathbf{X}_{ij}\boldsymbol{\beta}_{ij}\) and a random component \(\epsilon_{ij}\).

\[ U_{ij} = \mathbf{X}_{ij}\boldsymbol{\beta}_{ij} + \epsilon_{ij}\]

Note that \(\mathbf{X}_{ij}\) can be further decomposed into individual-specific variables - \(\mathbf{Z}_i\), alternative-specific variables - \(\mathbf{Z}_j\), and individual-alternative specific variables - \(\mathbf{Z}_{ij}\), as follows,

\[ U_{ij} = \mathbf{Z}_i\boldsymbol{\beta}_i + \mathbf{Z}_j\boldsymbol{\beta}_j + \mathbf{Z}_{ij}\boldsymbol{\beta}_{ij} + \epsilon_{ij}\]

Consequently, each individual chooses the alternative that maximizes their utility. If the \(\epsilon_{ij}\) are independently and identically distributed, following a Gumbel distribution, the probability that individual \(i\) chooses alternative \(j\) is given by,

\[ P(Y_{i} = j \ | \ \mathbf{X}_{ij}) = \frac{\text{exp}(\mathbf{X}_{ij}\boldsymbol{\beta}_{ij})}{\sum_{k}\text{exp}(\mathbf{X}_{ik}\boldsymbol{\beta}_{ik})} \]
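These choice probabilities are a softmax over the deterministic utilities. A minimal sketch in Python with NumPy (the utility values below are made-up numbers purely for illustration):

```python
import numpy as np

def choice_probabilities(V):
    """Multinomial logit choice probabilities from a vector of deterministic utilities V."""
    # Subtracting the maximum utility avoids overflow; the probabilities are unchanged
    # because the shift cancels between numerator and denominator.
    e = np.exp(V - np.max(V))
    return e / e.sum()

# Hypothetical deterministic utilities X_ij @ beta for three alternatives
V = np.array([1.2, 0.4, -0.3])
P = choice_probabilities(V)  # probabilities sum to 1; higher utility -> higher probability
```

Note that adding a constant to every utility leaves the probabilities unchanged, which is why only utility differences are identified in these models.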

Subsequently, the coefficients are estimated by maximizing the likelihood \((L)\) of observing the actual choice behaviour.

\[\begin{split} \begin{aligned} \hat{\boldsymbol{\beta}} & = \text{argmax} \ L(\boldsymbol{\beta}) \\ \hat{\boldsymbol{\beta}} & = \text{argmax} \ \prod_i \prod_k P(Y_{i} = k \ | \ \mathbf{X}_{ik})^{1(Y_{i} = k)} \\ \hat{\boldsymbol{\beta}} & = \text{argmax} \ \sum_i \sum_k \text{log}(P(Y_{i} = k \ | \ \mathbf{X}_{ik})) \times 1(Y_{i} = k) \\ \hat{\boldsymbol{\beta}} & = \text{argmax} \ l(\boldsymbol{\beta}) \end{aligned} \end{split}\]
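The log-likelihood above can be written directly in code. A sketch under assumed shapes (a 3-dimensional array of exogenous variables and an integer vector of observed choices; the data here are synthetic, for illustration only):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Multinomial logit log-likelihood.

    X: (n, J, p) array of exogenous variables (n individuals, J alternatives, p variables)
    y: (n,) array of observed choices coded 0..J-1
    """
    V = X @ beta                                          # (n, J) deterministic utilities
    V = V - V.max(axis=1, keepdims=True)                  # numerical stability
    logP = V - np.log(np.exp(V).sum(axis=1, keepdims=True))
    # The indicator 1(Y_i = k) selects each individual's chosen alternative
    return logP[np.arange(len(y)), y].sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3, 2))                           # 50 individuals, 3 alternatives
y = rng.integers(0, 3, size=50)
ll = log_likelihood(np.array([0.5, -0.2]), X, y)          # a finite, negative number
```

With \(\boldsymbol{\beta} = \mathbf{0}\) every alternative is equally likely, so the log-likelihood reduces to \(n \log(1/J)\), a useful sanity check.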

Note

\(1(x = a)\) is an indicator function which returns \(1\) if \(x = a\) else returns \(0\).

Hereafter, developing the first-order conditions for the log-likelihood function \((l)\) renders a system of equations that can be solved to determine the Maximum Likelihood Estimators (MLE).
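In practice these first-order conditions have no closed-form solution, so the log-likelihood is maximized numerically. A sketch using `scipy.optimize.minimize` on the negative log-likelihood, with choices simulated from a hypothetical "true" coefficient vector so the estimates can be checked against it:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
n, J, p = 500, 3, 2
X = rng.normal(size=(n, J, p))                            # synthetic exogenous variables
beta_true = np.array([1.0, -0.5])                         # hypothetical true coefficients

# Simulate observed choices from the logit choice probabilities
V = X @ beta_true
P = np.exp(V) / np.exp(V).sum(axis=1, keepdims=True)
y = np.array([rng.choice(J, p=P[i]) for i in range(n)])

def neg_log_likelihood(beta):
    V = X @ beta
    V = V - V.max(axis=1, keepdims=True)                  # numerical stability
    logP = V - np.log(np.exp(V).sum(axis=1, keepdims=True))
    return -logP[np.arange(n), y].sum()

# Minimizing -l(beta) is equivalent to maximizing l(beta)
res = minimize(neg_log_likelihood, x0=np.zeros(p), method="BFGS")
beta_hat = res.x                                          # maximum likelihood estimates
```

With a reasonably large sample, the estimates land close to the coefficients used to generate the data, illustrating the consistency of the MLE discussed below.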

Assumptions#

  • The relationship between utility and independent variables must inherently be linear in parameters. If the assumption of linearity does not hold true then the estimates are no longer unbiased.

  • The observations must be independently sampled leading to uncorrelated errors across observations. If the assumption of independence does not hold true then the estimates are no longer efficient.

  • The relative odds of choosing between any two alternatives must be unaffected by the presence or characteristics of other alternatives. If the assumption of independence of irrelevant alternatives does not hold true then the estimates are no longer unbiased.

  • The independent variables must not form perfect linear combinations of each other. If the assumption of no multicollinearity does not hold true then the model is no longer interpretable.

If all these assumptions hold true, then the maximum likelihood estimators (MLEs) for the logistic regression model are consistent, asymptotically normal, and asymptotically efficient.