Lecture 22: Logistic Regression - Diganostics

Lecture 22: Logistic Regression - Diganostics#

Ex-Ante Diagnostics#

The example below employs the Indian Time Use Survey to model the choice of shopping channel (no-shopping, in-store, online, both) as logistic regression, accounting for individual specific variables (socio-demographics and socio-economic parameters).

# Load necessary libraries
library(tidyr)
library(dplyr)
library(mlogit)
library(ggplot2)
options(repr.plot.width = 12, repr.plot.height = 8)

# 2024 ITUS Individual Data (model)
url  <- "https://raw.githubusercontent.com/anmpahwa/CE5540/refs/heads/main/resources/ITUS_IND_BD.csv"
data <- read.csv(url) # Loading Data
str(data)

'data.frame':	454192 obs. of  55 variables:
 $ Unique_ID        : chr  "2024-30010-1-241-17-13-11-2-2420-4-1-1" "2024-30010-1-241-17-13-11-2-2420-4-10-1" "2024-30010-1-241-17-13-11-2-2420-4-10-2" "2024-30010-1-241-17-13-11-2-2420-4-11-1" ...
 $ Unique_HH_ID     : chr  "2024-30010-1-241-17-13-11-2-2420-4-1" "2024-30010-1-241-17-13-11-2-2420-4-10" "2024-30010-1-241-17-13-11-2-2420-4-10" "2024-30010-1-241-17-13-11-2-2420-4-11" ...
 $ q1               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ q2               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ q3               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ q4               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday          : int  1 1 1 0 0 0 0 0 1 1 ...
 $ weekend          : int  0 0 0 1 1 1 1 1 0 0 ...
 $ rural            : int  1 1 1 1 1 1 1 1 1 1 ...
 $ urban            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ north            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ west             : int  1 1 1 1 1 1 1 1 1 1 ...
 $ central          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ east             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ north.east       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ south            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ tierI            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ tierII           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ tierIII          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ male             : int  1 1 0 1 0 1 0 0 1 0 ...
 $ female           : int  0 0 1 0 1 0 1 1 0 1 ...
 $ transgender      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ gen_alpha        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ gen_z            : int  0 0 1 0 0 1 1 1 0 0 ...
 $ millenials       : int  0 1 0 0 1 0 0 0 0 1 ...
 $ gen_x            : int  1 0 0 1 0 0 0 0 1 0 ...
 $ baby_boomers     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ silent           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ not_married      : int  1 0 0 0 0 1 1 1 0 0 ...
 $ married          : int  0 1 1 1 1 0 0 0 1 1 ...
 $ not_literate     : int  0 0 0 0 1 0 0 0 1 0 ...
 $ primary          : int  0 0 1 0 0 0 1 0 0 1 ...
 $ secondary        : int  0 1 0 1 0 1 0 1 0 0 ...
 $ graduate_._above : int  1 0 0 0 0 0 0 0 0 0 ...
 $ nilf             : int  0 0 1 0 1 0 1 0 0 0 ...
 $ unemployed       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ employed         : int  1 1 0 1 0 1 0 1 1 1 ...
 $ family_structure : num  0 0.333 0.333 0.4 0.4 0.4 0.4 0.4 0.429 0.429 ...
 $ household_size   : int  1 3 3 5 5 5 5 5 7 7 ...
 $ disadvantaged    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ not_disadvantaged: int  0 0 0 0 0 0 0 0 0 0 ...
 $ no_land          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ land_possessed   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ low_income       : int  1 1 1 0 0 0 0 0 0 0 ...
 $ medium_income    : int  0 0 0 1 1 1 1 1 1 1 ...
 $ high_income      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ no_dwelling      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ kutcha           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pucca            : int  1 1 1 1 1 1 1 1 1 1 ...
 $ none             : int  1 1 1 1 1 1 1 1 1 1 ...
 $ instore          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ online           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ both             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ choice           : chr  "none" "none" "none" "none" ...
 $ weight           : int  208857 208857 208857 208857 208857 208857 208857 208857 208857 208857 ...

# Counts
counts <- data %>%
  summarise(
    None = sum(none, na.rm = TRUE),
    InStore = sum(instore, na.rm = TRUE),
    Online = sum(online, na.rm = TRUE),
    Both = sum(both, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = everything(), names_to = "channel", values_to = "count") %>%
  mutate(share = (count / sum(count)))
print(counts)

# A tibble: 4 × 3
  channel  count    share
  <chr>    <int>    <dbl>
1 None    409897 0.902   
2 InStore  39728 0.0875  
3 Online    4252 0.00936 
4 Both       315 0.000694

# Temporal: Quarter Counts
quarter_counts <- data %>%
  summarise(
    None_Q1 = sum(none * q1, na.rm = TRUE),
    InStore_Q1 = sum(instore * q1, na.rm = TRUE),
    Online_Q1 = sum(online * q1, na.rm = TRUE),
    Both_Q1 = sum(both * q1, na.rm = TRUE),

    None_Q2 = sum(none * q2, na.rm = TRUE),
    InStore_Q2 = sum(instore * q2, na.rm = TRUE),
    Online_Q2 = sum(online * q2, na.rm = TRUE),
    Both_Q2 = sum(both * q2, na.rm = TRUE),

    None_Q3 = sum(none * q3, na.rm = TRUE),
    InStore_Q3 = sum(instore * q3, na.rm = TRUE),
    Online_Q3 = sum(online * q3, na.rm = TRUE),
    Both_Q3 = sum(both * q3, na.rm = TRUE),

    None_Q4 = sum(none * q4, na.rm = TRUE),
    InStore_Q4 = sum(instore * q4, na.rm = TRUE),
    Online_Q4 = sum(online * q4, na.rm = TRUE),
    Both_Q4 = sum(both * q4, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = everything(), names_to = "channel_quarter", values_to = "count") %>%
  separate(channel_quarter, into = c("channel", "quarter"), sep = "_") %>%
  group_by(quarter) %>%
  mutate(share = (count / sum(count))) %>%
  ungroup() %>%
  mutate(channel = factor(channel, levels = c("None", "InStore", "Online", "Both")))

# Tabulate
table <- quarter_counts %>%
  ungroup() %>%
  select(channel, quarter, share) %>%
  pivot_wider(names_from = quarter, values_from = share)
print(table)

# A tibble: 4 × 5
  channel       Q1       Q2       Q3       Q4
  <fct>      <dbl>    <dbl>    <dbl>    <dbl>
1 None    0.910    0.902    0.899    0.898   
2 InStore 0.0761   0.0857   0.0933   0.0956  
3 Online  0.0131   0.0110   0.00699  0.00601 
4 Both    0.000713 0.000794 0.000704 0.000557

# Temporal: Day Counts
day_counts <- data %>%
  summarise(
    None_WD = sum(none * weekday, na.rm = TRUE),
    InStore_WD = sum(instore * weekday, na.rm = TRUE),
    Online_WD = sum(online * weekday, na.rm = TRUE),
    Both_WD = sum(both * weekday, na.rm = TRUE),

    None_WE = sum(none * weekend, na.rm = TRUE),
    InStore_WE = sum(instore * weekend, na.rm = TRUE),
    Online_WE = sum(online * weekend, na.rm = TRUE),
    Both_WE = sum(both * weekend, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = everything(), names_to = "channel_day", values_to = "count") %>%
  separate(channel_day, into = c("channel", "day"), sep = "_") %>%
  group_by(day) %>%
  mutate(share = (count / sum(count))) %>%
  ungroup() %>%
  mutate(channel = factor(channel, levels = c("None", "InStore", "Online", "Both")))

# Tabulate
table <- day_counts %>%
  ungroup() %>%
  select(channel, day, share) %>%
  pivot_wider(names_from = day, values_from = share)
print(table)

# A tibble: 4 × 3
  channel       WD       WE
  <fct>      <dbl>    <dbl>
1 None    0.907    0.892   
2 InStore 0.0838   0.0973  
3 Online  0.00900  0.0103  
4 Both    0.000636 0.000847

# Spatial: Sector Counts
sector_counts <- data %>%
  summarise(
    None_R = sum(none * rural, na.rm = TRUE),
    InStore_R = sum(instore * rural, na.rm = TRUE),
    Online_R = sum(online * rural, na.rm = TRUE),
    Both_R = sum(both * rural, na.rm = TRUE),

    None_U = sum(none * urban, na.rm = TRUE),
    InStore_U = sum(instore * urban, na.rm = TRUE),
    Online_U = sum(online * urban, na.rm = TRUE),
    Both_U = sum(both * urban, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = everything(), names_to = "channel_sector", values_to = "count") %>%
  separate(channel_sector, into = c("channel", "sector"), sep = "_") %>%
  group_by(sector) %>%
  mutate(share = (count / sum(count))) %>%
  ungroup() %>%
  mutate(channel = factor(channel, levels = c("None", "InStore", "Online", "Both")))

# Tabulate
table <- sector_counts %>%
  ungroup() %>%
  select(channel, sector, share) %>%
  pivot_wider(names_from = sector, values_from = share)
print(table)

# A tibble: 4 × 3
  channel        R       U
  <fct>      <dbl>   <dbl>
1 None    0.916    0.880  
2 InStore 0.0766   0.106  
3 Online  0.00706  0.0133 
4 Both    0.000473 0.00107

# Spatial: Tier Counts
tier_counts <- data %>%
  summarise(
    None_T1 = sum(none * tierI, na.rm = TRUE),
    InStore_T1 = sum(instore * tierI, na.rm = TRUE),
    Online_T1 = sum(online * tierI, na.rm = TRUE),
    Both_T1 = sum(both * tierI, na.rm = TRUE),

    None_T2 = sum(none * tierII, na.rm = TRUE),
    InStore_T2 = sum(instore * tierII, na.rm = TRUE),
    Online_T2 = sum(online * tierII, na.rm = TRUE),
    Both_T2 = sum(both * tierII, na.rm = TRUE),

    None_T3 = sum(none * tierIII, na.rm = TRUE),
    InStore_T3 = sum(instore * tierIII, na.rm = TRUE),
    Online_T3 = sum(online * tierIII, na.rm = TRUE),
    Both_T3 = sum(both * tierIII, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = everything(), names_to = "channel_tier", values_to = "count") %>%
  separate(channel_tier, into = c("channel", "tier"), sep = "_") %>%
  group_by(tier) %>%
  mutate(share = (count / sum(count))) %>%
  ungroup() %>%
  mutate(channel = factor(channel, levels = c("None", "InStore", "Online", "Both")))

# Tabulate
table <- tier_counts %>%
  ungroup() %>%
  select(channel, tier, share) %>%
  pivot_wider(names_from = tier, values_from = share)
print(table)

# A tibble: 4 × 4
  channel       T1       T2       T3
  <fct>      <dbl>    <dbl>    <dbl>
1 None    0.904    0.911    0.893   
2 InStore 0.0874   0.0802   0.0948  
3 Online  0.00825  0.00811  0.0117  
4 Both    0.000776 0.000471 0.000833

Model Diagnostics#

Equally Likely Model: A baseline model that assumes all alternatives have the same probability of being chosen.

\[ P_{ij} = 1/J \quad \ \ \forall \ \ i, j \]

where \(I\) = number of individuals, and \(J\) = number of alternatives.
Market Share Model: A baseline model that assigns choice probabilities equal to the observed market shares of each alternative.

Unweighted Data -

\[ P_{ij} = \frac{I_j}{I} \]

where \(I_j\) = number of times alternative \(j\) is chosen.

Weighted Data -

\[ P_{ij} = \frac{\sum_{i=1}^{I} w_i y_{ij}}{\sum_{i=1}^{I} w_i} \quad \ \ \forall \ \ i, j \]

where \(w_i\) is the weight for individual \(i\), and \(y_{ij} = 1\) if individual \(i\) chose alternative \(j\), else \(0\).
Estimated Model: The multinomial logit (MNL) model estimated using explanatory variables to predict choice probabilities.

\[ P_{ij} = \frac{\exp(V_{ij})}{\sum_{k \in J} \exp(V_{ik})} \]

where \(V_{ij}\) is the systematic utility of alternative \(j\) for individual \(i\).

Consequently, to assess predictive capability of the model, we evaluate log-Likelihood as follows

\[ LL = \sum_{i=1}^{I} \sum_{j=1}^{J} y_{ij} \ln\left(P_{ij}\right) \]

To better compare the different models, we evaluate McFadden \(\text{R}^2\) as follows,

Estimated Model vs. Equally Likely Model:

\[ R^2_{EL} = 1 - \frac{LL_{EST}}{LL_{EL}} \]
Estimated Model vs. Market Share Model:

\[ R^2_{MS} = 1 - \frac{LL_{EST}}{LL_{MS}} \]

Equivalently, we have adjusted McFadden \(\text{R}^2\), given by,

Estimated Model vs. Equally Likely Model:

\[ \bar{R}^2_{EL} = 1 - \frac{LL_{EST} - K}{LL_{EL}} \]
Estimated Model vs. Market Share Model:

\[ \bar{R}^2_{MS} = 1 - \frac{LL_{EST} - K}{LL_{MS}} \]

where \(K\) is the number of exogenous variables in the estimated model.

# Prepare dataset
long_data <- data %>%
  select(
    Unique_ID,
    q1, q2, q3, q4,
    weekday, weekend,
    urban, rural, 
    north, west, south, east, central, north.east,
    tierI, tierII, tierIII,
    male, female, transgender,
    gen_alpha, gen_z, millenials, gen_x, baby_boomers, silent,
    not_married, married,
    not_literate, primary, secondary, graduate_._above,
    nilf, unemployed, employed,
    family_structure, 
    household_size,
    disadvantaged, not_disadvantaged,
    no_land, land_possessed,
    low_income, medium_income, high_income,
    no_dwelling, kutcha, pucca,
    none, instore, online, both,
    weight
  ) %>%
  pivot_longer(
    cols = c(none, instore, online, both),
    names_to = "alt", values_to = "bin"
  ) %>%
  mutate(bin = bin == 1)

model_data <- mlogit.data(
  long_data,
  choice = "bin",
  shape = "long",
  chid.var = "Unique_ID",
  alt.var = "alt"
)

# Esitmated Model
model <- mlogit(
    formula = bin ~ 1 | q1 + q2 + q3 +
                        weekend +
                        urban +
                        north + west + south + east + north.east +
                        tierI + tierII +
                        female + 
                        gen_alpha + gen_z + millenials + gen_x + baby_boomers +
                        married +
                        primary + secondary + graduate_._above +
                        unemployed + employed +
                        family_structure + 
                        household_size +
                        disadvantaged +
                        medium_income + high_income,
    data = model_data,
    ref = "none",
    weights = model_data$weight
    )
summary(model)

Call:
mlogit(formula = bin ~ 1 | q1 + q2 + q3 + weekend + urban + north + 
    west + south + east + north.east + tierI + tierII + female + 
    gen_alpha + gen_z + millenials + gen_x + baby_boomers + married + 
    primary + secondary + graduate_._above + unemployed + employed + 
    family_structure + household_size + disadvantaged + medium_income + 
    high_income, data = model_data, weights = model_data$weight, 
    reflevel = "none", method = "nr")

Frequencies of alternatives:choice
      none       both    instore     online 
0.90247516 0.00069354 0.08746962 0.00936168 

nr method
50 iterations, 0h:11m:30s 
g'(-H)^-1g = 5.42E-07 
gradient close to zero 

Coefficients :
                           Estimate Std. Error  z-value  Pr(>|z|)    
(Intercept):both         -7.4168560  0.6157904 -12.0444 < 2.2e-16 ***
(Intercept):instore      -3.1359745  0.0830616 -37.7548 < 2.2e-16 ***
(Intercept):online       -6.1039104  0.2871559 -21.2564 < 2.2e-16 ***
q1:both                   0.2791877  0.1672719   1.6691 0.0951044 .  
q1:instore               -0.2244205  0.0153914 -14.5809 < 2.2e-16 ***
q1:online                 0.7799642  0.0466726  16.7114 < 2.2e-16 ***
q2:both                   0.3613935  0.1651965   2.1877 0.0286945 *  
q2:instore               -0.1043859  0.0150334  -6.9436 3.822e-12 ***
q2:online                 0.6014995  0.0481472  12.4929 < 2.2e-16 ***
q3:both                   0.2287075  0.1701913   1.3438 0.1790046    
q3:instore               -0.0093966  0.0147826  -0.6357 0.5250009    
q3:online                 0.1030622  0.0533721   1.9310 0.0534814 .  
weekend:both              0.3305004  0.1188460   2.7809 0.0054206 ** 
weekend:instore           0.1425404  0.0117174  12.1648 < 2.2e-16 ***
weekend:online            0.1847087  0.0334999   5.5137 3.513e-08 ***
urban:both                0.4648192  0.1278526   3.6356 0.0002773 ***
urban:instore             0.2448504  0.0119824  20.4342 < 2.2e-16 ***
urban:online              0.5042721  0.0344862  14.6224 < 2.2e-16 ***
north:both                0.2783481  0.2349655   1.1846 0.2361623    
north:instore             0.2319742  0.0236956   9.7897 < 2.2e-16 ***
north:online              0.0781464  0.0617442   1.2656 0.2056398    
west:both                -0.4456713  0.2265514  -1.9672 0.0491605 *  
west:instore             -0.2370984  0.0196738 -12.0515 < 2.2e-16 ***
west:online              -0.3406240  0.0497638  -6.8448 7.658e-12 ***
south:both                0.4617657  0.1914588   2.4118 0.0158728 *  
south:instore             0.2891557  0.0184433  15.6781 < 2.2e-16 ***
south:online             -0.7997034  0.0600485 -13.3176 < 2.2e-16 ***
east:both                 0.5726507  0.1927423   2.9711 0.0029676 ** 
east:instore              0.5940843  0.0172637  34.4124 < 2.2e-16 ***
east:online               0.3210035  0.0445664   7.2028 5.898e-13 ***
north.east:both          -0.1527257  0.2532609  -0.6030 0.5464841    
north.east:instore        0.4412024  0.0225813  19.5384 < 2.2e-16 ***
north.east:online        -0.2579856  0.0695310  -3.7104 0.0002070 ***
tierI:both                0.2776032  0.1588892   1.7472 0.0806113 .  
tierI:instore            -0.1058035  0.0159324  -6.6408 3.120e-11 ***
tierI:online             -0.2106664  0.0462791  -4.5521 5.312e-06 ***
tierII:both              -0.2935167  0.1533859  -1.9136 0.0556734 .  
tierII:instore           -0.1629028  0.0139336 -11.6914 < 2.2e-16 ***
tierII:online            -0.2211167  0.0388778  -5.6875 1.289e-08 ***
female:both              -1.1797157  0.1594996  -7.3964 1.399e-13 ***
female:instore           -0.9256491  0.0143421 -64.5408 < 2.2e-16 ***
female:online            -0.6389009  0.0408502 -15.6401 < 2.2e-16 ***
gen_alpha:both           -2.5534630  0.9962016  -2.5632 0.0103713 *  
gen_alpha:instore        -1.1663812  0.0967893 -12.0507 < 2.2e-16 ***
gen_alpha:online         -0.7042964  0.3221013  -2.1866 0.0287741 *  
gen_z:both               -0.7211407  0.5605096  -1.2866 0.1982406    
gen_z:instore             0.5742194  0.0798757   7.1889 6.530e-13 ***
gen_z:online              1.0996030  0.2791819   3.9387 8.194e-05 ***
millenials:both           0.5069131  0.5533189   0.9161 0.3595977    
millenials:instore        1.0107544  0.0799294  12.6456 < 2.2e-16 ***
millenials:online         1.5288775  0.2795385   5.4693 4.518e-08 ***
gen_x:both                0.0421728  0.5573107   0.0757 0.9396801    
gen_x:instore             1.0405279  0.0797451  13.0482 < 2.2e-16 ***
gen_x:online              1.5374040  0.2792200   5.5061 3.669e-08 ***
baby_boomers:both         0.4560862  0.5455593   0.8360 0.4031563    
baby_boomers:instore      1.0074645  0.0797289  12.6361 < 2.2e-16 ***
baby_boomers:online       1.2986491  0.2797404   4.6423 3.445e-06 ***
married:both              0.1933199  0.1612442   1.1989 0.2305567    
married:instore           0.4609271  0.0158744  29.0359 < 2.2e-16 ***
married:online            0.3297134  0.0450564   7.3178 2.520e-13 ***
primary:both              0.2572980  0.2216710   1.1607 0.2457556    
primary:instore           0.1796498  0.0179330  10.0178 < 2.2e-16 ***
primary:online            0.0250463  0.0529127   0.4734 0.6359624    
secondary:both            0.4976686  0.2285602   2.1774 0.0294502 *  
secondary:instore         0.2363406  0.0192631  12.2691 < 2.2e-16 ***
secondary:online          0.2464443  0.0553515   4.4524 8.493e-06 ***
graduate_._above:both     0.9545753  0.2389967   3.9941 6.494e-05 ***
graduate_._above:instore  0.2992761  0.0220033  13.6014 < 2.2e-16 ***
graduate_._above:online   0.3433125  0.0620183   5.5357 3.101e-08 ***
unemployed:both           0.7963925  0.3113131   2.5582 0.0105224 *  
unemployed:instore        0.7554284  0.0355874  21.2274 < 2.2e-16 ***
unemployed:online         0.8261047  0.0858447   9.6232 < 2.2e-16 ***
employed:both            -0.2305284  0.1641075  -1.4047 0.1600985    
employed:instore          0.0447391  0.0151695   2.9493 0.0031851 ** 
employed:online          -0.1993489  0.0434480  -4.5882 4.470e-06 ***
family_structure:both    -0.2367034  0.3224017  -0.7342 0.4628343    
family_structure:instore -0.0291962  0.0292920  -0.9967 0.3188974    
family_structure:online  -0.1143438  0.0841183  -1.3593 0.1740449    
household_size:both      -0.1667825  0.0409682  -4.0710 4.681e-05 ***
household_size:instore   -0.1019963  0.0036437 -27.9922 < 2.2e-16 ***
household_size:online    -0.0644479  0.0097986  -6.5772 4.793e-11 ***
disadvantaged:both        0.0445169  0.1303666   0.3415 0.7327462    
disadvantaged:instore     0.0520723  0.0127374   4.0881 4.348e-05 ***
disadvantaged:online      0.0193504  0.0356309   0.5431 0.5870757    
medium_income:both        0.1514223  0.1566796   0.9664 0.3338214    
medium_income:instore     0.0426988  0.0139013   3.0716 0.0021293 ** 
medium_income:online      0.1323845  0.0408460   3.2411 0.0011909 ** 
high_income:both          0.4051618  0.1651532   2.4532 0.0141573 *  
high_income:instore       0.1283853  0.0156172   8.2208 2.220e-16 ***
high_income:online        0.1757212  0.0456216   3.8517 0.0001173 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -146750
McFadden R^2:  0.088592 
Likelihood ratio test : chisq = 28528 (p.value = < 2.22e-16)

# Model Statistics
I      <- nrow(data)
J      <- length(unique(data$choice))
K      <- length(coef(model))
W      <- long_data$weight
Y      <- long_data$bin
Z      <- data %>%
            summarise(
              none = sum(none * weight, na.rm = TRUE) / sum(weight) * nrow(data),
              instore = sum(instore * weight, na.rm = TRUE) / sum(weight) * nrow(data),
              online = sum(online * weight, na.rm = TRUE) / sum(weight) * nrow(data),
              both = sum(both * weight, na.rm = TRUE) / sum(weight) * nrow(data)
            ) %>%
            pivot_longer(cols = everything(), names_to = "channel", values_to = "n") %>%
            mutate(p = (n / sum(n)))
P      <- Z$p[model_data$alt] 
LL_EL  <- sum(Y * log(1/J))
LL_MS  <- sum(Y * log(P))
LL_MNL <- as.numeric(logLik(model))
R2_EL  <- 1 - (LL_MNL / LL_EL)
R2_MS  <- 1 - (LL_MNL / LL_MS)
AR2_EL <- 1 - ((LL_MNL - K) / LL_EL)
AR2_MS <- 1 - ((LL_MNL - K) / LL_MS)

cat("\n--- Log-likelihoods ---\n")
cat(sprintf("EL  : %0.3f\n", LL_EL))
cat(sprintf("MS  : %0.3f\n", LL_MS))
cat(sprintf("MNL : %0.3f\n", LL_MNL))
cat("\n--- McFadden R^2 ---\n")
cat(sprintf("R2 vs EL : %0.4f\n", R2_EL))
cat(sprintf("R2 vs MS : %0.4f\n", R2_MS))
cat("\n--- Adjusted McFadden R^2 ---\n")
cat(sprintf("Adj R2 vs EL : %0.4f (K = %d)\n", AR2_EL, K))
cat(sprintf("Adj R2 vs MS : %0.4f (K = %d)\n", AR2_MS, K))

--- Log-likelihoods ---
EL  : -629643.808
MS  : -161013.489
MNL : -146745.668

--- McFadden R^2 ---
R2 vs EL : 0.7669
R2 vs MS : 0.0886

--- Adjusted McFadden R^2 ---
Adj R2 vs EL : 0.7668 (K = 90)
Adj R2 vs MS : 0.0881 (K = 90)

Ex-Post Diagnostics#

After fitting a logistic regression model, it is important to verify that key assumptions hold. These include linearity between predictors and the log-odds of the outcome, independence of observations, absence of multicollinearity among predictors, and correct model specification, meaning no relevant variables are omitted and no irrelevant ones are included. In multinomial logit models, the Independence of Irrelevant Alternatives (IIA) assumption must also be satisfied. Violations can compromise the reliability of inference, affecting confidence intervals, hypothesis tests, and predictions.

Residual diagnostics help identify such issues. Commonly used residuals in logistic regression are deviance residuals, which measure model fit on the deviance scale, and Pearson residuals, which represent standardized differences between observed and predicted responses. Plots of these residuals against fitted values should ideally show no systematic pattern; trends or clustering may indicate nonlinearity in the logit, omitted variables, or general model misspecification. Large residuals or high-leverage points can signal outliers or influential observations with disproportionate impact on estimates.

For multinomial models, the Hausman–McFadden test is often used to evaluate the IIA assumption. This involves estimating the model with the full set of alternatives and again with one or more alternatives removed. If IIA holds, the restricted model’s coefficients should be consistent with those from the full model; a statistically significant difference suggests that removing an alternative changes the relative odds among the remaining choices, indicating an IIA violation. Such tests are often repeated for different subsets to check robustness, though computational issues like singular covariance matrices can arise and may be addressed with generalized inverses. When IIA does not hold, models such as nested logit or mixed logit provide more flexible and realistic structures.

When diagnostics reveal assumption violations, various remedies are available. Nonlinearity in the logit can be addressed through transformations such as splines or polynomials, or by adding interaction terms. Multicollinearity may be mitigated by removing redundant predictors or using dimensionality reduction techniques. Outliers and influential points can be handled using robust estimation or selective adjustments. For IIA violations, adopting alternative model structures like nested or mixed logit is advisable. If overall model fit remains unsatisfactory, adding relevant variables, specifying meaningful interactions, or trying a different link function such as probit may yield improved results.

Lecture 22: Logistic Regression - Diganostics

Contents

Lecture 22: Logistic Regression - Diganostics#

Ex-Ante Diagnostics#

Model Diagnostics#

Ex-Post Diagnostics#