Lecture 13: Quiz 1 Discussion



  1. Data Classification (5)

Consider the following R dataset describing flights departing from New York City. Its variables are grouped as Date (year, month, and day), Time (actual and scheduled departure/arrival time), Delay (departure/arrival delay), Flight Details (carrier name, flight number, tail number, and origin/destination airport), and Journey Details (air time, distance traveled, and total flight time in hours and minutes). Classify each variable group (Date, Time, Delay, Flight Details, and Journey Details) as one of the following: Discrete Quantitative, Continuous Quantitative, Qualitative, or Categorical.

# Load Packages
library(dplyr)
library(nycflights13)

# Load Dataset
data <- flights
str(data)
tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
 $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
 $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
 $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
 $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
 $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
 $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
 $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
 $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
 $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
 $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
 $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
 $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
  • Date: discrete quantitative variable

  • Time: discrete quantitative variable

  • Delay: discrete quantitative variable

  • Flight Details: categorical variable

  • Journey Details: discrete quantitative variable
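One way to sanity-check a classification like this is to inspect how each column is stored. The sketch below uses a hypothetical three-row data frame (the rows are made up; only the column names are borrowed from the `str()` output above). Storage type is only a hint: a column stored as an integer, such as a flight number, can still be a label rather than a quantity.

```r
# Toy data frame (hypothetical rows) mimicking a few columns of flights
toy <- data.frame(
  month     = c(1L, 1L, 2L),
  dep_delay = c(2, -1, 4),
  carrier   = c("UA", "AA", "B6"),
  flight    = c(1545L, 1141L, 725L),
  stringsAsFactors = FALSE
)

# Storage type of each column
types <- sapply(toy, class)
print(types)

# Columns that are stored as numbers
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
print(numeric_cols)

# A numeric storage type does not imply a quantitative variable:
# flight numbers are labels, so they can be converted to a factor
toy$flight <- factor(toy$flight)
print(nlevels(toy$flight))
```

Converting identifier-like columns to factors makes the categorical reading explicit and prevents accidental arithmetic on them.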


  2. Data Summary (5)

Using the flights dataset filtered to Delta Airlines flights departing John F. Kennedy Airport on Christmas Eve (24 December), summarise the measures of location (mean, median, and mode) and dispersion (inter-quartile range and standard deviation) for both departure and arrival delay.
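For reference, the solution code works from a probability mass table, that is, each distinct delay value \(x\) paired with its relative frequency \(f(x)\). Under that reading, the mean and standard deviation it computes correspond to

\[\begin{split} \begin{aligned} & \mu = \sum_{x} x \, f(x) \\ & \sigma = \sqrt{\sum_{x} f(x) \, (x - \mu)^2} \end{aligned} \end{split}\]

The symbols \(\mu\), \(\sigma\), and \(f(x)\) are notation introduced here for illustration. Note that this \(\sigma\) divides by the number of observations rather than by one less than it.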

# Filter the dataset to Delta Airlines flights departing John F. Kennedy Airport on Christmas Eve (24 December)
data <- flights %>% filter(origin=="JFK", carrier=="DL", month==12, day==24)

# Probability Mass Table
## Departure Delay
v <- sort(unique(data$dep_delay))
f <- numeric(length(v))
for (r in 1:nrow(data)) {
  z <- data$dep_delay[r]
  i <- which(v == z)
  f[i] <- f[i] + 1
}
df_dep <- data.frame(x=v, f=f/sum(f))
## Arrival Delay
v <- sort(unique(data$arr_delay))
f <- numeric(length(v))
for (r in 1:nrow(data)) {
  z <- data$arr_delay[r]
  i <- which(v == z)
  f[i] <- f[i] + 1
}
df_arr <- data.frame(x=v, f=f/sum(f))

# Mean
## Departure Delay
df <- df_dep
z  <- sum(df$f * df$x)
message("Mean Departure Delay: ", round(z, digits=3))
## Arrival Delay
df <- df_arr
z  <- sum(df$f * df$x)
message("Mean Arrival Delay: ", round(z, digits=3))

# Median
## Departure Delay
df <- df_dep
z  <- NA
F  <- cumsum(df$f) 
for (i in 2:nrow(df)) {
    if (F[i-1] < 0.5 & F[i] > 0.5) {
        z <- df$x[i]
        break
    } else if (F[i] == 0.5) {
        z <- (df$x[i] + df$x[i+1]) / 2
        break
    }
}
message("Median Departure Delay: ", round(z, digits=3))
## Arrival Delay
df <- df_arr
z  <- NA
F  <- cumsum(df$f)
for (i in 2:nrow(df)) {
    if (F[i-1] < 0.5 & F[i] > 0.5) {
        z <- df$x[i]
        break
    } else if (F[i] == 0.5) {
        z <- (df$x[i] + df$x[i+1]) / 2
        break
    }
}
message("Median Arrival Delay: ", round(z, digits=3))

# Mode
## Departure Delay
df <- df_dep
z  <- df$x[which.max(df$f)]
message("Mode Departure Delay: ", round(z, digits=3))
## Arrival Delay
df <- df_arr
z  <- df$x[which.max(df$f)]
message("Mode Arrival Delay: ", round(z, digits=3))

# Inter-Quartile Range
## Departure Delay
df <- df_dep
F  <- cumsum(df$f) 
q1 <- NA
for (i in 2:nrow(df)) {
    if (F[i-1] < 0.25 & F[i] > 0.25) {
        q1 <- df$x[i]
        break
    } else if (F[i] == 0.25) {
        q1 <- (df$x[i] + df$x[i+1]) / 2
        break
    }
}
q3 <- NA
for (i in 2:nrow(df)) {
    if (F[i-1] < 0.75 & F[i] > 0.75) {
        q3 <- df$x[i]
        break
    } else if (F[i] == 0.75) {
        q3 <- (df$x[i] + df$x[i+1]) / 2
        break
    }
}
z = q3 - q1
message("IQR Departure Delay: ", z)
## Arrival Delay
df <- df_arr
F  <- cumsum(df$f)
q1 <- NA
for (i in 2:nrow(df)) {
    if (F[i-1] < 0.25 & F[i] > 0.25) {
        q1 <- df$x[i]
        break
    } else if (F[i] == 0.25) {
        q1 <- (df$x[i] + df$x[i+1]) / 2
        break
    }
}
q3 <- NA
for (i in 2:nrow(df)) {
    if (F[i-1] < 0.75 & F[i] > 0.75) {
        q3 <- df$x[i]
        break
    } else if (F[i] == 0.75) {
        q3 <- (df$x[i] + df$x[i+1]) / 2
        break
    }
}
z = q3 - q1
message("IQR Arrival Delay: ", z)

# Standard Deviation
## Departure Delay
df <- df_dep
z  <- sqrt(sum(df$f * (df$x - sum(df$f * df$x))^2))
message("Standard Deviation Departure Delay: ", round(z, digits=3))
## Arrival Delay
df <- df_arr
z  <- sqrt(sum(df$f * (df$x - sum(df$f * df$x))^2))
message("Standard Deviation Arrival Delay: ", round(z, digits=3))
Mean Departure Delay: 4.04

Mean Arrival Delay: -5.16

Median Departure Delay: -1

Median Arrival Delay: -9

Mode Departure Delay: -2

Mode Arrival Delay: -5

IQR Departure Delay: 11

IQR Arrival Delay: 27

Standard Deviation Departure Delay: 13.092

Standard Deviation Arrival Delay: 19.996
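The loop-based summaries above can be cross-checked against base R. The following is a minimal sketch on a hypothetical vector of delays (not values from the flights data) that builds the probability mass table with `table()` and compares the results with the built-in functions. Two caveats: `sd()` divides by n - 1, whereas the mass-table formula divides by n, and `median()` and `IQR()` use R's default quantile convention, which may differ slightly from the cumulative-frequency rule used above.

```r
# Hypothetical delays in minutes (not taken from the flights data)
x <- c(-2, -2, -1, 0, 3, 5, 12, NA)
x <- x[!is.na(x)]                     # drop missing values first

# Probability mass table without an explicit loop
tab  <- table(x)
vals <- as.numeric(names(tab))        # distinct values, ascending
p    <- as.numeric(tab) / length(x)   # relative frequencies

mean_pmf <- sum(p * vals)
sd_pmf   <- sqrt(sum(p * (vals - mean_pmf)^2))   # divides by n
mode_pmf <- vals[which.max(p)]

# Built-in counterparts
n <- length(x)
message("Mean: ", round(mean_pmf, 3), " vs mean(): ", round(mean(x), 3))
message("SD (n): ", round(sd_pmf, 3), " vs sd() (n - 1): ", round(sd(x), 3))
message("Mode: ", mode_pmf)
message("Median: ", median(x), ", IQR: ", IQR(x))
```

Multiplying `sd(x)` by `sqrt((n - 1) / n)` recovers the divide-by-n value, which is a quick way to confirm that the two computations agree.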

  3. Probability Analysis (5)

Prove that \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\).

The union \(A \cup B\) can be written as three pieces: outcomes in \(A\) only, outcomes in \(B\) only, and outcomes in both,

\[\begin{split} \begin{aligned} & A \cup B = (A \cap B^c) \cup (A^c \cap B) \cup (A \cap B) \\ & P(A \cup B) = P((A \cap B^c) \cup (A^c \cap B) \cup (A \cap B)) \\ \end{aligned} \end{split}\]

Since the three sets are pairwise disjoint, Axiom 3 (additivity) gives

\[\begin{split} \begin{aligned} & P(A \cup B) = P(A \cap B^c) + P(A^c \cap B) + P(A \cap B) \\ \end{aligned} \end{split}\]

Further, \(A\) and \(B\) each split into two disjoint pieces, so the same axiom gives

\[\begin{split} \begin{aligned} & P(A) = P(A \cap B^c) + P(A \cap B) \\ & P(B) = P(A^c \cap B) + P(A \cap B) \end{aligned} \end{split}\]

Hence, substituting \(P(A \cap B^c) = P(A) - P(A \cap B)\) and \(P(A^c \cap B) = P(B) - P(A \cap B)\),

\[P(A \cup B) = (P(A) - P(A \cap B)) + (P(B) - P(A \cap B)) + P(A \cap B)\]

which simplifies to

\[P(A \cup B) = P(A) + P(B) - P(A \cap B)\]
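The identity can also be checked numerically on a small example. The sketch below assumes a fair six-sided die with two arbitrarily chosen events; the events and the helper `prob` are illustrative and not part of the quiz.

```r
# Sample space of a fair die; A and B are arbitrary example events
omega <- 1:6
A <- c(2, 4, 6)   # even outcome
B <- c(4, 5, 6)   # outcome greater than 3

# Probability of an event when all outcomes are equally likely
prob <- function(event) length(event) / length(omega)

lhs <- prob(union(A, B))
rhs <- prob(A) + prob(B) - prob(intersect(A, B))
message("P(A u B) = ", round(lhs, 3))
message("P(A) + P(B) - P(A n B) = ", round(rhs, 3))
```

Here \(A \cap B = \{4, 6\}\) is counted once in \(P(A)\) and once in \(P(B)\), which is exactly the double counting that the subtraction removes.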

  4. Data Sampling (5)

a. For the following randomly sampled data from the flights dataset, compute the bias and standard error of the estimator of mean arrival delay, given that the population mean arrival delay (the parameter value) is 6.895 minutes. (3)

b. Using the archery analogy discussed in class, draw a representative target board and comment on the accuracy and precision of the estimator. (2)

# Randomly sampled data from flights dataset
P <- flights$arr_delay
m <- 30
n <- 1000
z <- round(mean(P, na.rm=TRUE), digits=3)
Z <- vector("numeric", m)
for (i in 1:m) {
  set.seed(i)
  I <- order(runif(length(P)))[1:n]
  S <- P[I]
  Z[i] <- round(mean(S, na.rm=TRUE), digits=3)
}
data.frame(parameter=z, estimator=Z, error=round(Z-z, digits=3))
message("Bias: ", round(mean(Z-z, na.rm=TRUE), digits=3))
message("Standard Error: ", round(sqrt(mean((Z - mean(Z))^2)), digits=3))
A data.frame: 30 × 3
parameter  estimator   error
    <dbl>      <dbl>   <dbl>
    6.895      6.298  -0.597
    6.895      6.358  -0.537
    6.895      5.903  -0.992
    6.895      8.699   1.804
    6.895      5.126  -1.769
    6.895      4.782  -2.113
    6.895      7.001   0.106
    6.895      4.727  -2.168
    6.895      8.028   1.133
    6.895      6.912   0.017
    6.895      6.271  -0.624
    6.895      9.190   2.295
    6.895      6.405  -0.490
    6.895      4.689  -2.206
    6.895      5.129  -1.766
    6.895      5.763  -1.132
    6.895      7.533   0.638
    6.895      6.298  -0.597
    6.895      7.129   0.234
    6.895      7.781   0.886
    6.895      7.725   0.830
    6.895      5.174  -1.721
    6.895      6.946   0.051
    6.895      9.530   2.635
    6.895      5.490  -1.405
    6.895      7.908   1.013
    6.895      4.631  -2.264
    6.895      7.881   0.986
    6.895      6.877  -0.018
    6.895      7.304   0.409
Bias: -0.245

Standard Error: 1.333

The target board should represent low accuracy and low precision.
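In the archery picture, bias is the distance between the centre of the cluster of arrows and the bullseye (accuracy), while the standard error is the spread of the arrows around their own centre (precision). A hedged sketch of how the two combine, using the 30 estimates printed above: the mean squared error about the parameter equals the squared bias plus the squared standard error, provided the standard error divides by the number of estimates as the solution code does. The variable name `mse` is introduced here for illustration.

```r
z <- 6.895   # parameter value given in the question

# The 30 estimates from the output table above
Z <- c(6.298, 6.358, 5.903, 8.699, 5.126, 4.782, 7.001, 4.727, 8.028, 6.912,
       6.271, 9.190, 6.405, 4.689, 5.129, 5.763, 7.533, 6.298, 7.129, 7.781,
       7.725, 5.174, 6.946, 9.530, 5.490, 7.908, 4.631, 7.881, 6.877, 7.304)

bias <- mean(Z - z)                    # accuracy: offset from the bullseye
se   <- sqrt(mean((Z - mean(Z))^2))    # precision: spread around own centre
mse  <- mean((Z - z)^2)                # overall miss distance, squared

message("Bias: ", round(bias, 3))
message("Standard Error: ", round(se, 3))
message("MSE: ", round(mse, 3), " vs bias^2 + se^2: ", round(bias^2 + se^2, 3))
```

Comparing the magnitude of the bias with the standard error indicates which of the two contributes more to the overall error.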


  5. Hypothesis Testing (5)

Test the following hypotheses for Delta Airlines flights from John F. Kennedy Airport on Christmas Eve at the 5% significance level:

a. departure delay is greater than 4 minutes

b. arrival delay is less than -5 minutes

Given: the population means of departure and arrival delay are 4.04 and -5.16 minutes, respectively, and the population standard deviations of departure and arrival delay are 13.092 and 19.996 minutes, respectively.
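As a worked form of the test statistic, assume (as the solution code does) that the given means play the role of the observed mean \(\bar{x}\), that the hypothesised values are \(\mu_0 = 4\) and \(\mu_0 = -5\), and that the filtered data contain \(n = 50\) flights. The subscripts below are labels introduced here for illustration.

\[\begin{split} \begin{aligned} & z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \\ & z_{\text{dep}} = \frac{4.04 - 4}{13.092 / \sqrt{50}} \approx 0.022 \\ & z_{\text{arr}} = \frac{-5.16 - (-5)}{19.996 / \sqrt{50}} \approx -0.057 \end{aligned} \end{split}\]

Each statistic is then compared with the one-sided critical value in the direction of the alternative hypothesis.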

# Filter the dataset to Delta Airlines flights departing John F. Kennedy Airport on Christmas Eve (24 December)
data <- flights %>% filter(origin=="JFK", carrier=="DL", month==12, day==24)
nrow(data)
50
# departure delay is greater than 4 minutes
message("Null Hypothesis: Departure delay is less than or equal to 4 minutes")
message("Alternative Hypothesis: Departure delay is greater than 4 minutes")
z = round((4.04 - 4) / (13.092 / sqrt(50)), digits=3)
v = qnorm(0.95) 
message("z-statistic: ", round(z, digits=3))
message("Critical value: ", round(v, digits=3))
# Right-tailed test: reject only when z exceeds the upper critical value
message("Decision: ", ifelse(z > v, "Reject Null Hypothesis", "Do not reject Null Hypothesis"))

# arrival delay is less than -5 minutes
message("Null Hypothesis: Arrival delay is greater than or equal to -5 minutes")
message("Alternative Hypothesis: Arrival delay is less than -5 minutes")
z = round((-5.16 - -5) / (19.996 / sqrt(50)), digits=3)
v = qnorm(0.05)
message("z-statistic: ", round(z, digits=3))
message("Critical value: ", round(v, digits=3))
# Left-tailed test: reject only when z falls below the lower critical value
message("Decision: ", ifelse(z < v, "Reject Null Hypothesis", "Do not reject Null Hypothesis"))
Null Hypothesis: Departure delay is less than or equal to 4 minutes

Alternative Hypothesis: Departure delay is greater than 4 minutes

z-statistic: 0.022

Critical value: 1.645

Decision: Do not reject Null Hypothesis

Null Hypothesis: Arrival delay is greater than or equal to -5 minutes

Alternative Hypothesis: Arrival delay is less than -5 minutes

z-statistic: -0.057

Critical value: -1.645

Decision: Do not reject Null Hypothesis
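The same decisions can be reached through p-values instead of critical values. The sketch below wraps the calculation in a hypothetical helper, `one_sided_z` (not part of the original solution), and assumes the 5% significance level implied by `qnorm(0.95)` above.

```r
# Hypothetical helper: one-sided z-test with known population sd
one_sided_z <- function(xbar, mu0, sigma, n, alternative = c("greater", "less")) {
  alternative <- match.arg(alternative)
  z <- (xbar - mu0) / (sigma / sqrt(n))
  # Tail probability in the direction of the alternative hypothesis
  p <- if (alternative == "greater") pnorm(z, lower.tail = FALSE) else pnorm(z)
  list(z = z, p_value = p)
}

alpha <- 0.05   # assumed significance level

# a. departure delay is greater than 4 minutes
dep <- one_sided_z(4.04, 4, 13.092, 50, "greater")
# b. arrival delay is less than -5 minutes
arr <- one_sided_z(-5.16, -5, 19.996, 50, "less")

for (res in list(dep, arr)) {
  message("z-statistic: ", round(res$z, 3), ", p-value: ", round(res$p_value, 3))
  message("Decision: ", ifelse(res$p_value < alpha,
                               "Reject Null Hypothesis",
                               "Do not reject Null Hypothesis"))
}
```

Rejecting when the p-value is below `alpha` is equivalent to the critical-value rule, and it keeps the direction of the test explicit in the `alternative` argument.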