Lecture 22: Logistic Regression - Diganostics#
Ex-Ante Diagnostics#
The example below employs the Indian Time Use Survey to model the choice of shopping channel (no-shopping, in-store, online, both) as logistic regression, accounting for individual specific variables (socio-demographics and socio-economic parameters).
# Load necessary libraries
library(tidyr)
library(dplyr)
library(mlogit)
library(ggplot2)
options(repr.plot.width = 12, repr.plot.height = 8)
# 2024 ITUS Individual Data (model)
url <- "https://raw.githubusercontent.com/anmpahwa/CE5540/refs/heads/main/resources/ITUS_IND_BD.csv"
data <- read.csv(url) # Loading Data
str(data)
'data.frame': 454192 obs. of 55 variables:
$ Unique_ID : chr "2024-30010-1-241-17-13-11-2-2420-4-1-1" "2024-30010-1-241-17-13-11-2-2420-4-10-1" "2024-30010-1-241-17-13-11-2-2420-4-10-2" "2024-30010-1-241-17-13-11-2-2420-4-11-1" ...
$ Unique_HH_ID : chr "2024-30010-1-241-17-13-11-2-2420-4-1" "2024-30010-1-241-17-13-11-2-2420-4-10" "2024-30010-1-241-17-13-11-2-2420-4-10" "2024-30010-1-241-17-13-11-2-2420-4-11" ...
$ q1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ q2 : int 1 1 1 1 1 1 1 1 1 1 ...
$ q3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ q4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ weekday : int 1 1 1 0 0 0 0 0 1 1 ...
$ weekend : int 0 0 0 1 1 1 1 1 0 0 ...
$ rural : int 1 1 1 1 1 1 1 1 1 1 ...
$ urban : int 0 0 0 0 0 0 0 0 0 0 ...
$ north : int 0 0 0 0 0 0 0 0 0 0 ...
$ west : int 1 1 1 1 1 1 1 1 1 1 ...
$ central : int 0 0 0 0 0 0 0 0 0 0 ...
$ east : int 0 0 0 0 0 0 0 0 0 0 ...
$ north.east : int 0 0 0 0 0 0 0 0 0 0 ...
$ south : int 0 0 0 0 0 0 0 0 0 0 ...
$ tierI : int 0 0 0 0 0 0 0 0 0 0 ...
$ tierII : int 1 1 1 1 1 1 1 1 1 1 ...
$ tierIII : int 0 0 0 0 0 0 0 0 0 0 ...
$ male : int 1 1 0 1 0 1 0 0 1 0 ...
$ female : int 0 0 1 0 1 0 1 1 0 1 ...
$ transgender : int 0 0 0 0 0 0 0 0 0 0 ...
$ gen_alpha : int 0 0 0 0 0 0 0 0 0 0 ...
$ gen_z : int 0 0 1 0 0 1 1 1 0 0 ...
$ millenials : int 0 1 0 0 1 0 0 0 0 1 ...
$ gen_x : int 1 0 0 1 0 0 0 0 1 0 ...
$ baby_boomers : int 0 0 0 0 0 0 0 0 0 0 ...
$ silent : int 0 0 0 0 0 0 0 0 0 0 ...
$ not_married : int 1 0 0 0 0 1 1 1 0 0 ...
$ married : int 0 1 1 1 1 0 0 0 1 1 ...
$ not_literate : int 0 0 0 0 1 0 0 0 1 0 ...
$ primary : int 0 0 1 0 0 0 1 0 0 1 ...
$ secondary : int 0 1 0 1 0 1 0 1 0 0 ...
$ graduate_._above : int 1 0 0 0 0 0 0 0 0 0 ...
$ nilf : int 0 0 1 0 1 0 1 0 0 0 ...
$ unemployed : int 0 0 0 0 0 0 0 0 0 0 ...
$ employed : int 1 1 0 1 0 1 0 1 1 1 ...
$ family_structure : num 0 0.333 0.333 0.4 0.4 0.4 0.4 0.4 0.429 0.429 ...
$ household_size : int 1 3 3 5 5 5 5 5 7 7 ...
$ disadvantaged : int 1 1 1 1 1 1 1 1 1 1 ...
$ not_disadvantaged: int 0 0 0 0 0 0 0 0 0 0 ...
$ no_land : int 0 0 0 0 0 0 0 0 0 0 ...
$ land_possessed : int 1 1 1 1 1 1 1 1 1 1 ...
$ low_income : int 1 1 1 0 0 0 0 0 0 0 ...
$ medium_income : int 0 0 0 1 1 1 1 1 1 1 ...
$ high_income : int 0 0 0 0 0 0 0 0 0 0 ...
$ no_dwelling : int 0 0 0 0 0 0 0 0 0 0 ...
$ kutcha : int 0 0 0 0 0 0 0 0 0 0 ...
$ pucca : int 1 1 1 1 1 1 1 1 1 1 ...
$ none : int 1 1 1 1 1 1 1 1 1 1 ...
$ instore : int 0 0 0 0 0 0 0 0 0 0 ...
$ online : int 0 0 0 0 0 0 0 0 0 0 ...
$ both : int 0 0 0 0 0 0 0 0 0 0 ...
$ choice : chr "none" "none" "none" "none" ...
$ weight : int 208857 208857 208857 208857 208857 208857 208857 208857 208857 208857 ...
# Counts
counts <- data %>%
summarise(
None = sum(none, na.rm = TRUE),
InStore = sum(instore, na.rm = TRUE),
Online = sum(online, na.rm = TRUE),
Both = sum(both, na.rm = TRUE)
) %>%
pivot_longer(cols = everything(), names_to = "channel", values_to = "count") %>%
mutate(share = (count / sum(count)))
print(counts)
# A tibble: 4 × 3
channel count share
<chr> <int> <dbl>
1 None 409897 0.902
2 InStore 39728 0.0875
3 Online 4252 0.00936
4 Both 315 0.000694
# Temporal: Quarter Counts
quarter_counts <- data %>%
summarise(
None_Q1 = sum(none * q1, na.rm = TRUE),
InStore_Q1 = sum(instore * q1, na.rm = TRUE),
Online_Q1 = sum(online * q1, na.rm = TRUE),
Both_Q1 = sum(both * q1, na.rm = TRUE),
None_Q2 = sum(none * q2, na.rm = TRUE),
InStore_Q2 = sum(instore * q2, na.rm = TRUE),
Online_Q2 = sum(online * q2, na.rm = TRUE),
Both_Q2 = sum(both * q2, na.rm = TRUE),
None_Q3 = sum(none * q3, na.rm = TRUE),
InStore_Q3 = sum(instore * q3, na.rm = TRUE),
Online_Q3 = sum(online * q3, na.rm = TRUE),
Both_Q3 = sum(both * q3, na.rm = TRUE),
None_Q4 = sum(none * q4, na.rm = TRUE),
InStore_Q4 = sum(instore * q4, na.rm = TRUE),
Online_Q4 = sum(online * q4, na.rm = TRUE),
Both_Q4 = sum(both * q4, na.rm = TRUE)
) %>%
pivot_longer(cols = everything(), names_to = "channel_quarter", values_to = "count") %>%
separate(channel_quarter, into = c("channel", "quarter"), sep = "_") %>%
group_by(quarter) %>%
mutate(share = (count / sum(count))) %>%
ungroup() %>%
mutate(channel = factor(channel, levels = c("None", "InStore", "Online", "Both")))
# Tabulate
table <- quarter_counts %>%
ungroup() %>%
select(channel, quarter, share) %>%
pivot_wider(names_from = quarter, values_from = share)
print(table)
# A tibble: 4 × 5
channel Q1 Q2 Q3 Q4
<fct> <dbl> <dbl> <dbl> <dbl>
1 None 0.910 0.902 0.899 0.898
2 InStore 0.0761 0.0857 0.0933 0.0956
3 Online 0.0131 0.0110 0.00699 0.00601
4 Both 0.000713 0.000794 0.000704 0.000557
# Temporal: Day Counts
day_counts <- data %>%
summarise(
None_WD = sum(none * weekday, na.rm = TRUE),
InStore_WD = sum(instore * weekday, na.rm = TRUE),
Online_WD = sum(online * weekday, na.rm = TRUE),
Both_WD = sum(both * weekday, na.rm = TRUE),
None_WE = sum(none * weekend, na.rm = TRUE),
InStore_WE = sum(instore * weekend, na.rm = TRUE),
Online_WE = sum(online * weekend, na.rm = TRUE),
Both_WE = sum(both * weekend, na.rm = TRUE)
) %>%
pivot_longer(cols = everything(), names_to = "channel_day", values_to = "count") %>%
separate(channel_day, into = c("channel", "day"), sep = "_") %>%
group_by(day) %>%
mutate(share = (count / sum(count))) %>%
ungroup() %>%
mutate(channel = factor(channel, levels = c("None", "InStore", "Online", "Both")))
# Tabulate
table <- day_counts %>%
ungroup() %>%
select(channel, day, share) %>%
pivot_wider(names_from = day, values_from = share)
print(table)
# A tibble: 4 × 3
channel WD WE
<fct> <dbl> <dbl>
1 None 0.907 0.892
2 InStore 0.0838 0.0973
3 Online 0.00900 0.0103
4 Both 0.000636 0.000847
# Spatial: Sector Counts
sector_counts <- data %>%
summarise(
None_R = sum(none * rural, na.rm = TRUE),
InStore_R = sum(instore * rural, na.rm = TRUE),
Online_R = sum(online * rural, na.rm = TRUE),
Both_R = sum(both * rural, na.rm = TRUE),
None_U = sum(none * urban, na.rm = TRUE),
InStore_U = sum(instore * urban, na.rm = TRUE),
Online_U = sum(online * urban, na.rm = TRUE),
Both_U = sum(both * urban, na.rm = TRUE)
) %>%
pivot_longer(cols = everything(), names_to = "channel_sector", values_to = "count") %>%
separate(channel_sector, into = c("channel", "sector"), sep = "_") %>%
group_by(sector) %>%
mutate(share = (count / sum(count))) %>%
ungroup() %>%
mutate(channel = factor(channel, levels = c("None", "InStore", "Online", "Both")))
# Tabulate
table <- sector_counts %>%
ungroup() %>%
select(channel, sector, share) %>%
pivot_wider(names_from = sector, values_from = share)
print(table)
# A tibble: 4 × 3
channel R U
<fct> <dbl> <dbl>
1 None 0.916 0.880
2 InStore 0.0766 0.106
3 Online 0.00706 0.0133
4 Both 0.000473 0.00107
# Spatial: Tier Counts
tier_counts <- data %>%
summarise(
None_T1 = sum(none * tierI, na.rm = TRUE),
InStore_T1 = sum(instore * tierI, na.rm = TRUE),
Online_T1 = sum(online * tierI, na.rm = TRUE),
Both_T1 = sum(both * tierI, na.rm = TRUE),
None_T2 = sum(none * tierII, na.rm = TRUE),
InStore_T2 = sum(instore * tierII, na.rm = TRUE),
Online_T2 = sum(online * tierII, na.rm = TRUE),
Both_T2 = sum(both * tierII, na.rm = TRUE),
None_T3 = sum(none * tierIII, na.rm = TRUE),
InStore_T3 = sum(instore * tierIII, na.rm = TRUE),
Online_T3 = sum(online * tierIII, na.rm = TRUE),
Both_T3 = sum(both * tierIII, na.rm = TRUE)
) %>%
pivot_longer(cols = everything(), names_to = "channel_tier", values_to = "count") %>%
separate(channel_tier, into = c("channel", "tier"), sep = "_") %>%
group_by(tier) %>%
mutate(share = (count / sum(count))) %>%
ungroup() %>%
mutate(channel = factor(channel, levels = c("None", "InStore", "Online", "Both")))
# Tabulate
table <- tier_counts %>%
ungroup() %>%
select(channel, tier, share) %>%
pivot_wider(names_from = tier, values_from = share)
print(table)
# A tibble: 4 × 4
channel T1 T2 T3
<fct> <dbl> <dbl> <dbl>
1 None 0.904 0.911 0.893
2 InStore 0.0874 0.0802 0.0948
3 Online 0.00825 0.00811 0.0117
4 Both 0.000776 0.000471 0.000833
Model Diagnostics#
Equally Likely Model: A baseline model that assumes all alternatives have the same probability of being chosen.
\[ P_{ij} = 1/J \quad \ \ \forall \ \ i, j \]where \(I\) = number of individuals, and \(J\) = number of alternatives.
Market Share Model: A baseline model that assigns choice probabilities equal to the observed market shares of each alternative.
Unweighted Data -
\[ P_{ij} = \frac{I_j}{I} \]where \(I_j\) = number of times alternative \(j\) is chosen.
Weighted Data -
\[ P_{ij} = \frac{\sum_{i=1}^{I} w_i y_{ij}}{\sum_{i=1}^{I} w_i} \quad \ \ \forall \ \ i, j \]where \(w_i\) is the weight for individual \(i\), and \(y_{ij} = 1\) if individual \(i\) chose alternative \(j\), else \(0\).
Estimated Model: The multinomial logit (MNL) model estimated using explanatory variables to predict choice probabilities.
\[ P_{ij} = \frac{\exp(V_{ij})}{\sum_{k \in J} \exp(V_{ik})} \]where \(V_{ij}\) is the systematic utility of alternative \(j\) for individual \(i\).
Consequently, to assess predictive capability of the model, we evaluate log-Likelihood as follows
To better compare the different models, we evaluate McFadden \(\text{R}^2\) as follows,
Estimated Model vs. Equally Likely Model:
\[ R^2_{EL} = 1 - \frac{LL_{EST}}{LL_{EL}} \]Estimated Model vs. Market Share Model:
\[ R^2_{MS} = 1 - \frac{LL_{EST}}{LL_{MS}} \]
Equivalently, we have adjusted McFadden \(\text{R}^2\), given by,
Estimated Model vs. Equally Likely Model:
\[ \bar{R}^2_{EL} = 1 - \frac{LL_{EST} - K}{LL_{EL}} \]Estimated Model vs. Market Share Model:
\[ \bar{R}^2_{MS} = 1 - \frac{LL_{EST} - K}{LL_{MS}} \]where \(K\) is the number of exogenous variables in the estimated model.
# Prepare dataset
long_data <- data %>%
select(
Unique_ID,
q1, q2, q3, q4,
weekday, weekend,
urban, rural,
north, west, south, east, central, north.east,
tierI, tierII, tierIII,
male, female, transgender,
gen_alpha, gen_z, millenials, gen_x, baby_boomers, silent,
not_married, married,
not_literate, primary, secondary, graduate_._above,
nilf, unemployed, employed,
family_structure,
household_size,
disadvantaged, not_disadvantaged,
no_land, land_possessed,
low_income, medium_income, high_income,
no_dwelling, kutcha, pucca,
none, instore, online, both,
weight
) %>%
pivot_longer(
cols = c(none, instore, online, both),
names_to = "alt", values_to = "bin"
) %>%
mutate(bin = bin == 1)
model_data <- mlogit.data(
long_data,
choice = "bin",
shape = "long",
chid.var = "Unique_ID",
alt.var = "alt"
)
# Esitmated Model
model <- mlogit(
formula = bin ~ 1 | q1 + q2 + q3 +
weekend +
urban +
north + west + south + east + north.east +
tierI + tierII +
female +
gen_alpha + gen_z + millenials + gen_x + baby_boomers +
married +
primary + secondary + graduate_._above +
unemployed + employed +
family_structure +
household_size +
disadvantaged +
medium_income + high_income,
data = model_data,
ref = "none",
weights = model_data$weight
)
summary(model)
Call:
mlogit(formula = bin ~ 1 | q1 + q2 + q3 + weekend + urban + north +
west + south + east + north.east + tierI + tierII + female +
gen_alpha + gen_z + millenials + gen_x + baby_boomers + married +
primary + secondary + graduate_._above + unemployed + employed +
family_structure + household_size + disadvantaged + medium_income +
high_income, data = model_data, weights = model_data$weight,
reflevel = "none", method = "nr")
Frequencies of alternatives:choice
none both instore online
0.90247516 0.00069354 0.08746962 0.00936168
nr method
50 iterations, 0h:11m:30s
g'(-H)^-1g = 5.42E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both -7.4168560 0.6157904 -12.0444 < 2.2e-16 ***
(Intercept):instore -3.1359745 0.0830616 -37.7548 < 2.2e-16 ***
(Intercept):online -6.1039104 0.2871559 -21.2564 < 2.2e-16 ***
q1:both 0.2791877 0.1672719 1.6691 0.0951044 .
q1:instore -0.2244205 0.0153914 -14.5809 < 2.2e-16 ***
q1:online 0.7799642 0.0466726 16.7114 < 2.2e-16 ***
q2:both 0.3613935 0.1651965 2.1877 0.0286945 *
q2:instore -0.1043859 0.0150334 -6.9436 3.822e-12 ***
q2:online 0.6014995 0.0481472 12.4929 < 2.2e-16 ***
q3:both 0.2287075 0.1701913 1.3438 0.1790046
q3:instore -0.0093966 0.0147826 -0.6357 0.5250009
q3:online 0.1030622 0.0533721 1.9310 0.0534814 .
weekend:both 0.3305004 0.1188460 2.7809 0.0054206 **
weekend:instore 0.1425404 0.0117174 12.1648 < 2.2e-16 ***
weekend:online 0.1847087 0.0334999 5.5137 3.513e-08 ***
urban:both 0.4648192 0.1278526 3.6356 0.0002773 ***
urban:instore 0.2448504 0.0119824 20.4342 < 2.2e-16 ***
urban:online 0.5042721 0.0344862 14.6224 < 2.2e-16 ***
north:both 0.2783481 0.2349655 1.1846 0.2361623
north:instore 0.2319742 0.0236956 9.7897 < 2.2e-16 ***
north:online 0.0781464 0.0617442 1.2656 0.2056398
west:both -0.4456713 0.2265514 -1.9672 0.0491605 *
west:instore -0.2370984 0.0196738 -12.0515 < 2.2e-16 ***
west:online -0.3406240 0.0497638 -6.8448 7.658e-12 ***
south:both 0.4617657 0.1914588 2.4118 0.0158728 *
south:instore 0.2891557 0.0184433 15.6781 < 2.2e-16 ***
south:online -0.7997034 0.0600485 -13.3176 < 2.2e-16 ***
east:both 0.5726507 0.1927423 2.9711 0.0029676 **
east:instore 0.5940843 0.0172637 34.4124 < 2.2e-16 ***
east:online 0.3210035 0.0445664 7.2028 5.898e-13 ***
north.east:both -0.1527257 0.2532609 -0.6030 0.5464841
north.east:instore 0.4412024 0.0225813 19.5384 < 2.2e-16 ***
north.east:online -0.2579856 0.0695310 -3.7104 0.0002070 ***
tierI:both 0.2776032 0.1588892 1.7472 0.0806113 .
tierI:instore -0.1058035 0.0159324 -6.6408 3.120e-11 ***
tierI:online -0.2106664 0.0462791 -4.5521 5.312e-06 ***
tierII:both -0.2935167 0.1533859 -1.9136 0.0556734 .
tierII:instore -0.1629028 0.0139336 -11.6914 < 2.2e-16 ***
tierII:online -0.2211167 0.0388778 -5.6875 1.289e-08 ***
female:both -1.1797157 0.1594996 -7.3964 1.399e-13 ***
female:instore -0.9256491 0.0143421 -64.5408 < 2.2e-16 ***
female:online -0.6389009 0.0408502 -15.6401 < 2.2e-16 ***
gen_alpha:both -2.5534630 0.9962016 -2.5632 0.0103713 *
gen_alpha:instore -1.1663812 0.0967893 -12.0507 < 2.2e-16 ***
gen_alpha:online -0.7042964 0.3221013 -2.1866 0.0287741 *
gen_z:both -0.7211407 0.5605096 -1.2866 0.1982406
gen_z:instore 0.5742194 0.0798757 7.1889 6.530e-13 ***
gen_z:online 1.0996030 0.2791819 3.9387 8.194e-05 ***
millenials:both 0.5069131 0.5533189 0.9161 0.3595977
millenials:instore 1.0107544 0.0799294 12.6456 < 2.2e-16 ***
millenials:online 1.5288775 0.2795385 5.4693 4.518e-08 ***
gen_x:both 0.0421728 0.5573107 0.0757 0.9396801
gen_x:instore 1.0405279 0.0797451 13.0482 < 2.2e-16 ***
gen_x:online 1.5374040 0.2792200 5.5061 3.669e-08 ***
baby_boomers:both 0.4560862 0.5455593 0.8360 0.4031563
baby_boomers:instore 1.0074645 0.0797289 12.6361 < 2.2e-16 ***
baby_boomers:online 1.2986491 0.2797404 4.6423 3.445e-06 ***
married:both 0.1933199 0.1612442 1.1989 0.2305567
married:instore 0.4609271 0.0158744 29.0359 < 2.2e-16 ***
married:online 0.3297134 0.0450564 7.3178 2.520e-13 ***
primary:both 0.2572980 0.2216710 1.1607 0.2457556
primary:instore 0.1796498 0.0179330 10.0178 < 2.2e-16 ***
primary:online 0.0250463 0.0529127 0.4734 0.6359624
secondary:both 0.4976686 0.2285602 2.1774 0.0294502 *
secondary:instore 0.2363406 0.0192631 12.2691 < 2.2e-16 ***
secondary:online 0.2464443 0.0553515 4.4524 8.493e-06 ***
graduate_._above:both 0.9545753 0.2389967 3.9941 6.494e-05 ***
graduate_._above:instore 0.2992761 0.0220033 13.6014 < 2.2e-16 ***
graduate_._above:online 0.3433125 0.0620183 5.5357 3.101e-08 ***
unemployed:both 0.7963925 0.3113131 2.5582 0.0105224 *
unemployed:instore 0.7554284 0.0355874 21.2274 < 2.2e-16 ***
unemployed:online 0.8261047 0.0858447 9.6232 < 2.2e-16 ***
employed:both -0.2305284 0.1641075 -1.4047 0.1600985
employed:instore 0.0447391 0.0151695 2.9493 0.0031851 **
employed:online -0.1993489 0.0434480 -4.5882 4.470e-06 ***
family_structure:both -0.2367034 0.3224017 -0.7342 0.4628343
family_structure:instore -0.0291962 0.0292920 -0.9967 0.3188974
family_structure:online -0.1143438 0.0841183 -1.3593 0.1740449
household_size:both -0.1667825 0.0409682 -4.0710 4.681e-05 ***
household_size:instore -0.1019963 0.0036437 -27.9922 < 2.2e-16 ***
household_size:online -0.0644479 0.0097986 -6.5772 4.793e-11 ***
disadvantaged:both 0.0445169 0.1303666 0.3415 0.7327462
disadvantaged:instore 0.0520723 0.0127374 4.0881 4.348e-05 ***
disadvantaged:online 0.0193504 0.0356309 0.5431 0.5870757
medium_income:both 0.1514223 0.1566796 0.9664 0.3338214
medium_income:instore 0.0426988 0.0139013 3.0716 0.0021293 **
medium_income:online 0.1323845 0.0408460 3.2411 0.0011909 **
high_income:both 0.4051618 0.1651532 2.4532 0.0141573 *
high_income:instore 0.1283853 0.0156172 8.2208 2.220e-16 ***
high_income:online 0.1757212 0.0456216 3.8517 0.0001173 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -146750
McFadden R^2: 0.088592
Likelihood ratio test : chisq = 28528 (p.value = < 2.22e-16)
# Model Statistics
I <- nrow(data)
J <- length(unique(data$choice))
K <- length(coef(model))
W <- long_data$weight
Y <- long_data$bin
Z <- data %>%
summarise(
none = sum(none * weight, na.rm = TRUE) / sum(weight) * nrow(data),
instore = sum(instore * weight, na.rm = TRUE) / sum(weight) * nrow(data),
online = sum(online * weight, na.rm = TRUE) / sum(weight) * nrow(data),
both = sum(both * weight, na.rm = TRUE) / sum(weight) * nrow(data)
) %>%
pivot_longer(cols = everything(), names_to = "channel", values_to = "n") %>%
mutate(p = (n / sum(n)))
P <- Z$p[model_data$alt]
LL_EL <- sum(Y * log(1/J))
LL_MS <- sum(Y * log(P))
LL_MNL <- as.numeric(logLik(model))
R2_EL <- 1 - (LL_MNL / LL_EL)
R2_MS <- 1 - (LL_MNL / LL_MS)
AR2_EL <- 1 - ((LL_MNL - K) / LL_EL)
AR2_MS <- 1 - ((LL_MNL - K) / LL_MS)
cat("\n--- Log-likelihoods ---\n")
cat(sprintf("EL : %0.3f\n", LL_EL))
cat(sprintf("MS : %0.3f\n", LL_MS))
cat(sprintf("MNL : %0.3f\n", LL_MNL))
cat("\n--- McFadden R^2 ---\n")
cat(sprintf("R2 vs EL : %0.4f\n", R2_EL))
cat(sprintf("R2 vs MS : %0.4f\n", R2_MS))
cat("\n--- Adjusted McFadden R^2 ---\n")
cat(sprintf("Adj R2 vs EL : %0.4f (K = %d)\n", AR2_EL, K))
cat(sprintf("Adj R2 vs MS : %0.4f (K = %d)\n", AR2_MS, K))
--- Log-likelihoods ---
EL : -629643.808
MS : -161013.489
MNL : -146745.668
--- McFadden R^2 ---
R2 vs EL : 0.7669
R2 vs MS : 0.0886
--- Adjusted McFadden R^2 ---
Adj R2 vs EL : 0.7668 (K = 90)
Adj R2 vs MS : 0.0881 (K = 90)
Ex-Post Diagnostics#
After fitting a logistic regression model, it is important to verify that key assumptions hold. These include linearity between predictors and the log-odds of the outcome, independence of observations, absence of multicollinearity among predictors, and correct model specification, meaning no relevant variables are omitted and no irrelevant ones are included. In multinomial logit models, the Independence of Irrelevant Alternatives (IIA) assumption must also be satisfied. Violations can compromise the reliability of inference, affecting confidence intervals, hypothesis tests, and predictions.
Residual diagnostics help identify such issues. Commonly used residuals in logistic regression are deviance residuals, which measure model fit on the deviance scale, and Pearson residuals, which represent standardized differences between observed and predicted responses. Plots of these residuals against fitted values should ideally show no systematic pattern; trends or clustering may indicate nonlinearity in the logit, omitted variables, or general model misspecification. Large residuals or high-leverage points can signal outliers or influential observations with disproportionate impact on estimates.
For multinomial models, the Hausman–McFadden test is often used to evaluate the IIA assumption. This involves estimating the model with the full set of alternatives and again with one or more alternatives removed. If IIA holds, the restricted model’s coefficients should be consistent with those from the full model; a statistically significant difference suggests that removing an alternative changes the relative odds among the remaining choices, indicating an IIA violation. Such tests are often repeated for different subsets to check robustness, though computational issues like singular covariance matrices can arise and may be addressed with generalized inverses. When IIA does not hold, models such as nested logit or mixed logit provide more flexible and realistic structures.
When diagnostics reveal assumption violations, various remedies are available. Nonlinearity in the logit can be addressed through transformations such as splines or polynomials, or by adding interaction terms. Multicollinearity may be mitigated by removing redundant predictors or using dimensionality reduction techniques. Outliers and influential points can be handled using robust estimation or selective adjustments. For IIA violations, adopting alternative model structures like nested or mixed logit is advisable. If overall model fit remains unsatisfactory, adding relevant variables, specifying meaningful interactions, or trying a different link function such as probit may yield improved results.