Lecture 6: Probability Analysis in R

Lecture 6: Probability Analysis in R#

Note

In this lecture, we will review Axioms, Conditional Probability, and Bayes’ Rule in R using Indian Time Use Data.

# 2024 ITUS Individual Data (original)
url  <- "https://raw.githubusercontent.com/anmpahwa/CE5540/refs/heads/main/resources/ITUS_IND_OG.csv"
data <- read.csv(url)

In the 2024 Indian Time Use Survey dataset, variables such as education_level and employment_status are coded numerically. The following mappings define the categorical meaning of each code. These will be used throughout the lecture to define events and compute empirical probabilities.

Education Level:

1: not literate
2: literate: below primary
3: primary
4: upper primary/middle
5: secondary
6: higher secondary
7: diploma /certificate course (up to secondary)
8: diploma/certificate course (higher secondary)
9: diploma/certificate course(graduation & above)
10: graduate
11: post graduate and above

Employment Status:

11: working as own account worker in household enterprise
12: working as employer in household enterprise
21: working as helper in household enterprise (unpaid family worker)
31: working as regular salaried/wage employee
41: working as casual wage labour in public works
51: working as casual wage labour in other types of work
81: did not work but was seeking and/or available for work
91: attending educational institution
92: attending domestic duties only
93: attending domestic duties and is also engaged in unpaid household enterprise for household use
94: rentiers, pensioners, remittance recipients, etc.
95: not able to work due to disability
97: others

## Education Level
v  <- sort(unique(data$education_level))
f  <- numeric(length(v))
for (r in 1:nrow(data)) {
  z <- data$education_level[r]
  i <- which(v == z)
  f[i] <- f[i] + 1
}
df_edu <- data.frame(edu=v, f=f, p=round(f/sum(f), 5))

## Employment Level
v  <- sort(unique(data$employment_status))
f  <- numeric(length(v))
for (r in 1:nrow(data)) {
  z <- data$employment_status[r]
  i <- which(v == z)
  f[i] <- f[i] + 1
}
df_emp <- data.frame(emp=v, f=f, p=round(f/sum(f), 5))

print(df_edu)
cat("\n")
print(df_emp)

   edu      f       p
  1 135438 0.25376
  2  71011 0.13305
  3  69215 0.12968
  4  73360 0.13745
  5  64045 0.12000
  6  56295 0.10548
  7   1755 0.00329
  8   3132 0.00587
  9   1847 0.00346
10  46394 0.08693
11  11227 0.02104

     emp      f       p
   10      1 0.00000
   11  77216 0.14468
   12   4716 0.00884
   21  17119 0.03207
   31  51547 0.09658
   41   4756 0.00891
   51  45210 0.08471
   81   7724 0.01447
   91 101349 0.18989
  92  90968 0.17044
  93  27100 0.05078
  94  20272 0.03798
  95   4602 0.00862
  97   1612 0.00302
99999  79527 0.14901

Axioms of Probability#

Let \(S\) be a sample space (set of all possible outcomes), and let \(P\) be a probability function mapping subsets of \(S\) to real numbers, then the axioms are:

Axiom #1 - Non-Negativity: \(P(E) \geq 0\) for any event \(E\).

all(df_edu$p >= 0)

TRUE

Axiom #2 - Normalization: \(\sum_{E \in S} P(E) = 1\); the probability of the sample space is 1.

round(sum(df_edu$p), 3) == 1

TRUE

Axiom #3 - Additivity: If \(A\) and \(B\) are disjoint events (i.e., \(A \cap B = \emptyset\)), then \(P(A \cup B) = P(A) + P(B)\)

Let \(A\) refer to individuals with primary education level \((\text{edu} = 3)\), and \(B\) refer to individuals with secondary eduction level \((\text{edu} = 5)\), then \(A \cup B\) refer to individuals with primary or secondary education level. Since, \(A\) and \(B\) are disjoint events (i.e., \(A \cap B = \emptyset\)), then,

p_A   = df_edu$p[3] 
p_B   = df_edu$p[5]
p_AUB = (df_edu$f[3] + df_edu$f[5]) / sum(df_edu$f)

message("Probability that an individual has a primary education level (A): ", round(p_A, 3))
message("Probability that an individual has a secondary education level (B): ", round(p_B, 3))
message("Probability that an individual has a primary or secondary eduction level (A U B): ", round(p_AUB, 3))
message("Since A and B are disjoint events, P(A U B) = P(A) + P(B) holds ", round(p_AUB, 3) == round(p_A + p_B, 3))

Probability that an individual has a primary education level (A): 0.13

Probability that an individual has a secondary education level (B): 0.12

Probability that an individual has a primary or secondary eduction level (A U B): 0.25

Since A and B are disjoint events, P(A U B) = P(A) + P(B) holds TRUE

Laws of Probability#

These axioms render the following laws of probability,

Complement Rule: \(P(A^c) = 1 - P(A)\) follows from Axiom #2 and Axiom #3

Let \(A\) refer to individuals that are not literate \((\text{edu} = 1)\), then \(B = A^c\) refers to individuals that have some level of literacy \((\text{edu} \in [2,11])\), then,

p_A = df_edu$p[1]
p_B = sum(df_edu$f[2:11]) / sum(df_edu$f)

message("Probability that an individual is not literate (A): ", round(p_A, 3))
message("Probability that an individual has some level of literacy (B): ", round(p_B, 3))
message("Since B is complement of A, P(B) = 1 - P(A) holds ", round(p_B, 3) == round(1 - p_A, 3))

Probability that an individual is not literate (A): 0.254

Probability that an individual has some level of literacy (B): 0.746

Since B is complement of A, P(B) = 1 - P(A) holds TRUE

Monotonicity: If \(A \subset B\), then \(P(A) \leq P(B)\) follows from Axiom #3

Let \(B\) refer to individuals that have some level of literacy \((\text{edu} \in [2,11])\), and \(A\) refer to individuals with graduate level of education or above \((\text{edu} \in [10,11])\), then,

p_B = sum(df_edu$p[2:11])
p_A = sum(df_edu$p[10:11]) 

message("Probability that an individual has some level of literacy (A): ", round(p_A, 3))
message("Probability that an individual has graduate level of education or above (B): ", round(p_B, 3))
message("Since A is a subset of B, P(A) <= P(B) holds ", p_A <= p_B )

Probability that an individual has some level of literacy (A): 0.108

Probability that an individual has graduate level of education or above (B): 0.746

Since A is a subset of B, P(A) <= P(B) holds TRUE

Inclusion-Exclusion: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\) follows from Axiom #3

Let \(A\) refer to individuals working as regular salaried/wage employee \((\text{emp} = 31)\), and \(B\) refer to individuals with graduate level of education \((\text{edu} = 10)\), then,

p_A = df_emp$p[which(df_emp$emp == 31)]
p_B = df_edu$p[which(df_edu$edu == 10)]

f <- 0
for (r in 1:nrow(data)) {
  emp <- data$employment_status[r]
  edu <- data$education_level[r]
  if (emp == 31 & edu == 10){
    f <- f + 1
  }
}
p_AXB <- f / nrow(data)

f <- 0
for (r in 1:nrow(data)) {
  emp <- data$employment_status[r]
  edu <- data$education_level[r]
  if (emp == 31 || edu == 10){
    f <- f + 1
  }
}
p_AUB <- f / nrow(data)

message("Probability that an individual is working as regular salaried/wage employee (A): ", round(p_A, 3))
message("Probability that an individual has graduate level of education (B): ", round(p_B, 3))
message("Probability that an individual is working as regular salaried/wage employee and has graduate level of education (A X B): ", round(p_AXB, 3))
message("Probability that an individual is working as regular salaried/wage employee or has graduate level of education (A U B): ", round(p_AUB, 3))
message("P(A U B) = P(A) + P(B) - P(A X B) holds ", round(p_AUB, 3) == round(p_A + p_B - p_AXB, 3))

Probability that an individual is working as regular salaried/wage employee (A): 0.097

Probability that an individual has graduate level of education (B): 0.087

Probability that an individual is working as regular salaried/wage employee and has graduate level of education (A X B): 0.029

Probability that an individual is working as regular salaried/wage employee or has graduate level of education (A U B): 0.155

P(A U B) = P(A) + P(B) - P(A X B) holds TRUE

Conditional Probability#

Conditional probability quantifies the likelihood of an event occurring given that another event has already occurred. The conditional probability of event A given event B is defined as:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{provided} \ \ P(B) > 0 \]

Let \(A\) refer to individuals working as regular salaried/wage employee \((\text{emp} = 31)\), and \(B\) refer to individuals with graduate level of education \((\text{edu} = 10)\), then,

p_A = df_emp$p[which(df_emp$emp == 31)]
p_B = df_edu$p[which(df_edu$edu == 10)]

f <- 0
for (r in 1:nrow(data)) {
  edu <- data$education_level[r]
  emp <- data$employment_status[r]
  if (edu == 10 & emp == 31){
    f <- f + 1
  }
}
p_AXB <- f / nrow(data)
p_AB  <- f / df_edu$f[which(df_edu$edu == 10)]

message("Probability that an individual is working as regular salaried/wage employee (A): ", round(p_A, 3))
message("Probability that an individual has graduate level of education (B): ", round(p_B, 3))
message("Probability that an individual is working as regular salaried/wage employee and has graduate level of education (A X B): ", round(p_AXB, 3))
message("Probability that an individual is working as regular salaried/wage employee given the individual has graduate level of education (AB): ", round(p_AB, 3))
message("P(AB) = P(A X B) / P(B) holds ", round(p_AB, 3) == round(p_AXB / p_B, 3))

Probability that an individual is working as regular salaried/wage employee (A): 0.097

Probability that an individual has graduate level of education (B): 0.087

Probability that an individual is working as regular salaried/wage employee and has graduate level of education (A X B): 0.029

Probability that an individual is working as regular salaried/wage employee given the individual has graduate level of education (AB): 0.333

P(AB) = P(A X B) / P(B) holds TRUE

Bayes’ Rule#

Expanding over condtional probability, Bayes’ rule renders

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \quad \text{provided } P(B) > 0 \]

Let \(A\) refer to individuals working as regular salaried/wage employee \((\text{emp} = 31)\), and \(B\) refer to individuals with graduate level of education \((\text{edu} = 10)\), then,

p_A = df_emp$p[which(df_emp$emp == 31)]
p_B = df_edu$p[which(df_edu$edu == 10)]

f <- 0
for (r in 1:nrow(data)) {
  edu <- data$education_level[r]
  emp <- data$employment_status[r]
  if (edu == 10 & emp == 31){
    f <- f + 1
  }
}
p_AB  <- f / df_edu$f[which(df_edu$edu == 10)]
p_BA  <- f / df_emp$f[which(df_emp$emp == 31)]

message("Probability that an individual is working as regular salaried/wage employee (A): ", round(p_A, 3))
message("Probability that an individual has graduate level of education (B): ", round(p_B, 3))
message("Probability that an individual is working as regular salaried/wage employee given the individual has graduate level of education (AB): ", round(p_AB, 3))
message("Probability that an individual has graduate level of education given the individual is working as regular salaried/wage employee (BA): ", round(p_BA, 3))
message("P(AB) = P(BA) x P(A) / P(B) holds ", round(p_AB, 3) == round(p_BA * p_A / p_B, 3))

Probability that an individual is working as regular salaried/wage employee (A): 0.097

Probability that an individual has graduate level of education (B): 0.087

Probability that an individual is working as regular salaried/wage employee given the individual has graduate level of education (AB): 0.333

Probability that an individual has graduate level of education given the individual is working as regular salaried/wage employee (BA): 0.3

P(AB) = P(BA) x P(A) / P(B) holds TRUE

Tip

Beyond education_level and employment_status, test the axioms, Axioms, Conditional Probability, and Bayes’ Rule for categorical variables such as gender and marital_status gender:

1: male
2: female
3: transgender marital_status:
1: never married
2: currently married
3: widowed
4: divorced/separated

Lecture 6: Probability Analysis in R

Contents

Lecture 6: Probability Analysis in R#

Axioms of Probability#

Laws of Probability#

Conditional Probability#

Bayes’ Rule#