Lecture 6: Probability Analysis in R#
Note
In this lecture, we will review Axioms, Conditional Probability, and Bayes’ Rule in R using Indian Time Use Data.
# 2024 ITUS Individual Data (original)
url <- "https://raw.githubusercontent.com/anmpahwa/CE5540/refs/heads/main/resources/ITUS_IND_OG.csv"
data <- read.csv(url)
In the 2024 Indian Time Use Survey dataset, variables such as education_level
and employment_status
are coded numerically. The following mappings define the categorical meaning of each code. These will be used throughout the lecture to define events and compute empirical probabilities.
Education Level:
1: not literate
2: literate: below primary
3: primary
4: upper primary/middle
5: secondary
6: higher secondary
7: diploma /certificate course (up to secondary)
8: diploma/certificate course (higher secondary)
9: diploma/certificate course(graduation & above)
10: graduate
11: post graduate and above
Employment Status:
11: working as own account worker in household enterprise
12: working as employer in household enterprise
21: working as helper in household enterprise (unpaid family worker)
31: working as regular salaried/wage employee
41: working as casual wage labour in public works
51: working as casual wage labour in other types of work
81: did not work but was seeking and/or available for work
91: attending educational institution
92: attending domestic duties only
93: attending domestic duties and is also engaged in unpaid household enterprise for household use
94: rentiers, pensioners, remittance recipients, etc.
95: not able to work due to disability
97: others
## Education Level
v <- sort(unique(data$education_level))
f <- numeric(length(v))
for (r in 1:nrow(data)) {
z <- data$education_level[r]
i <- which(v == z)
f[i] <- f[i] + 1
}
df_edu <- data.frame(edu=v, f=f, p=round(f/sum(f), 5))
## Employment Level
v <- sort(unique(data$employment_status))
f <- numeric(length(v))
for (r in 1:nrow(data)) {
z <- data$employment_status[r]
i <- which(v == z)
f[i] <- f[i] + 1
}
df_emp <- data.frame(emp=v, f=f, p=round(f/sum(f), 5))
print(df_edu)
cat("\n")
print(df_emp)
edu f p
1 1 135438 0.25376
2 2 71011 0.13305
3 3 69215 0.12968
4 4 73360 0.13745
5 5 64045 0.12000
6 6 56295 0.10548
7 7 1755 0.00329
8 8 3132 0.00587
9 9 1847 0.00346
10 10 46394 0.08693
11 11 11227 0.02104
emp f p
1 10 1 0.00000
2 11 77216 0.14468
3 12 4716 0.00884
4 21 17119 0.03207
5 31 51547 0.09658
6 41 4756 0.00891
7 51 45210 0.08471
8 81 7724 0.01447
9 91 101349 0.18989
10 92 90968 0.17044
11 93 27100 0.05078
12 94 20272 0.03798
13 95 4602 0.00862
14 97 1612 0.00302
15 99999 79527 0.14901
Axioms of Probability#
Let \(S\) be a sample space (set of all possible outcomes), and let \(P\) be a probability function mapping subsets of \(S\) to real numbers, then the axioms are:
Axiom #1 - Non-Negativity: \(P(E) \geq 0\) for any event \(E\).
all(df_edu$p >= 0)
Axiom #2 - Normalization: \(\sum_{E \in S} P(E) = 1\); the probability of the sample space is 1.
round(sum(df_edu$p), 3) == 1
Axiom #3 - Additivity: If \(A\) and \(B\) are disjoint events (i.e., \(A \cap B = \emptyset\)), then \(P(A \cup B) = P(A) + P(B)\)
Let \(A\) refer to individuals with primary education level \((\text{edu} = 3)\), and \(B\) refer to individuals with secondary eduction level \((\text{edu} = 5)\), then \(A \cup B\) refer to individuals with primary or secondary education level. Since, \(A\) and \(B\) are disjoint events (i.e., \(A \cap B = \emptyset\)), then,
p_A = df_edu$p[3]
p_B = df_edu$p[5]
p_AUB = (df_edu$f[3] + df_edu$f[5]) / sum(df_edu$f)
message("Probability that an individual has a primary education level (A): ", round(p_A, 3))
message("Probability that an individual has a secondary education level (B): ", round(p_B, 3))
message("Probability that an individual has a primary or secondary eduction level (A U B): ", round(p_AUB, 3))
message("Since A and B are disjoint events, P(A U B) = P(A) + P(B) holds ", round(p_AUB, 3) == round(p_A + p_B, 3))
Probability that an individual has a primary education level (A): 0.13
Probability that an individual has a secondary education level (B): 0.12
Probability that an individual has a primary or secondary eduction level (A U B): 0.25
Since A and B are disjoint events, P(A U B) = P(A) + P(B) holds TRUE
Laws of Probability#
These axioms render the following laws of probability,
Complement Rule: \(P(A^c) = 1 - P(A)\) follows from Axiom #2 and Axiom #3
Let \(A\) refer to individuals that are not literate \((\text{edu} = 1)\), then \(B = A^c\) refers to individuals that have some level of literacy \((\text{edu} \in [2,11])\), then,
p_A = df_edu$p[1]
p_B = sum(df_edu$f[2:11]) / sum(df_edu$f)
message("Probability that an individual is not literate (A): ", round(p_A, 3))
message("Probability that an individual has some level of literacy (B): ", round(p_B, 3))
message("Since B is complement of A, P(B) = 1 - P(A) holds ", round(p_B, 3) == round(1 - p_A, 3))
Probability that an individual is not literate (A): 0.254
Probability that an individual has some level of literacy (B): 0.746
Since B is complement of A, P(B) = 1 - P(A) holds TRUE
Monotonicity: If \(A \subset B\), then \(P(A) \leq P(B)\) follows from Axiom #3
Let \(B\) refer to individuals that have some level of literacy \((\text{edu} \in [2,11])\), and \(A\) refer to individuals with graduate level of education or above \((\text{edu} \in [10,11])\), then,
p_B = sum(df_edu$p[2:11])
p_A = sum(df_edu$p[10:11])
message("Probability that an individual has some level of literacy (A): ", round(p_A, 3))
message("Probability that an individual has graduate level of education or above (B): ", round(p_B, 3))
message("Since A is a subset of B, P(A) <= P(B) holds ", p_A <= p_B )
Probability that an individual has some level of literacy (A): 0.108
Probability that an individual has graduate level of education or above (B): 0.746
Since A is a subset of B, P(A) <= P(B) holds TRUE
Inclusion-Exclusion: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\) follows from Axiom #3
Let \(A\) refer to individuals working as regular salaried/wage employee \((\text{emp} = 31)\), and \(B\) refer to individuals with graduate level of education \((\text{edu} = 10)\), then,
p_A = df_emp$p[which(df_emp$emp == 31)]
p_B = df_edu$p[which(df_edu$edu == 10)]
f <- 0
for (r in 1:nrow(data)) {
emp <- data$employment_status[r]
edu <- data$education_level[r]
if (emp == 31 & edu == 10){
f <- f + 1
}
}
p_AXB <- f / nrow(data)
f <- 0
for (r in 1:nrow(data)) {
emp <- data$employment_status[r]
edu <- data$education_level[r]
if (emp == 31 || edu == 10){
f <- f + 1
}
}
p_AUB <- f / nrow(data)
message("Probability that an individual is working as regular salaried/wage employee (A): ", round(p_A, 3))
message("Probability that an individual has graduate level of education (B): ", round(p_B, 3))
message("Probability that an individual is working as regular salaried/wage employee and has graduate level of education (A X B): ", round(p_AXB, 3))
message("Probability that an individual is working as regular salaried/wage employee or has graduate level of education (A U B): ", round(p_AUB, 3))
message("P(A U B) = P(A) + P(B) - P(A X B) holds ", round(p_AUB, 3) == round(p_A + p_B - p_AXB, 3))
Probability that an individual is working as regular salaried/wage employee (A): 0.097
Probability that an individual has graduate level of education (B): 0.087
Probability that an individual is working as regular salaried/wage employee and has graduate level of education (A X B): 0.029
Probability that an individual is working as regular salaried/wage employee or has graduate level of education (A U B): 0.155
P(A U B) = P(A) + P(B) - P(A X B) holds TRUE
Conditional Probability#
Conditional probability quantifies the likelihood of an event occurring given that another event has already occurred. The conditional probability of event A given event B is defined as:
Let \(A\) refer to individuals working as regular salaried/wage employee \((\text{emp} = 31)\), and \(B\) refer to individuals with graduate level of education \((\text{edu} = 10)\), then,
p_A = df_emp$p[which(df_emp$emp == 31)]
p_B = df_edu$p[which(df_edu$edu == 10)]
f <- 0
for (r in 1:nrow(data)) {
edu <- data$education_level[r]
emp <- data$employment_status[r]
if (edu == 10 & emp == 31){
f <- f + 1
}
}
p_AXB <- f / nrow(data)
p_AB <- f / df_edu$f[which(df_edu$edu == 10)]
message("Probability that an individual is working as regular salaried/wage employee (A): ", round(p_A, 3))
message("Probability that an individual has graduate level of education (B): ", round(p_B, 3))
message("Probability that an individual is working as regular salaried/wage employee and has graduate level of education (A X B): ", round(p_AXB, 3))
message("Probability that an individual is working as regular salaried/wage employee given the individual has graduate level of education (AB): ", round(p_AB, 3))
message("P(AB) = P(A X B) / P(B) holds ", round(p_AB, 3) == round(p_AXB / p_B, 3))
Probability that an individual is working as regular salaried/wage employee (A): 0.097
Probability that an individual has graduate level of education (B): 0.087
Probability that an individual is working as regular salaried/wage employee and has graduate level of education (A X B): 0.029
Probability that an individual is working as regular salaried/wage employee given the individual has graduate level of education (AB): 0.333
P(AB) = P(A X B) / P(B) holds TRUE
Bayes’ Rule#
Expanding over condtional probability, Bayes’ rule renders
Let \(A\) refer to individuals working as regular salaried/wage employee \((\text{emp} = 31)\), and \(B\) refer to individuals with graduate level of education \((\text{edu} = 10)\), then,
p_A = df_emp$p[which(df_emp$emp == 31)]
p_B = df_edu$p[which(df_edu$edu == 10)]
f <- 0
for (r in 1:nrow(data)) {
edu <- data$education_level[r]
emp <- data$employment_status[r]
if (edu == 10 & emp == 31){
f <- f + 1
}
}
p_AB <- f / df_edu$f[which(df_edu$edu == 10)]
p_BA <- f / df_emp$f[which(df_emp$emp == 31)]
message("Probability that an individual is working as regular salaried/wage employee (A): ", round(p_A, 3))
message("Probability that an individual has graduate level of education (B): ", round(p_B, 3))
message("Probability that an individual is working as regular salaried/wage employee given the individual has graduate level of education (AB): ", round(p_AB, 3))
message("Probability that an individual has graduate level of education given the individual is working as regular salaried/wage employee (BA): ", round(p_BA, 3))
message("P(AB) = P(BA) x P(A) / P(B) holds ", round(p_AB, 3) == round(p_BA * p_A / p_B, 3))
Probability that an individual is working as regular salaried/wage employee (A): 0.097
Probability that an individual has graduate level of education (B): 0.087
Probability that an individual is working as regular salaried/wage employee given the individual has graduate level of education (AB): 0.333
Probability that an individual has graduate level of education given the individual is working as regular salaried/wage employee (BA): 0.3
P(AB) = P(BA) x P(A) / P(B) holds TRUE
Tip
Beyond education_level
and employment_status
, test the axioms, Axioms, Conditional Probability, and Bayes’ Rule for categorical variables such as gender
and marital_status
gender:
1: male
2: female
3: transgender marital_status:
1: never married
2: currently married
3: widowed
4: divorced/separated