Lecture 6: Probability Analysis in R

Lecture 6: Probability Analysis in R#

Note

In this lecture, we will review Axioms, Conditional Probability, and Bayes’ Rule in R using Indian Time Use Data.

# 2024 ITUS Individual Data (original)
url  <- "https://raw.githubusercontent.com/anmpahwa/CE5540/refs/heads/main/resources/ITUS_IND_OG.csv"
data <- read.csv(url)

In the 2024 Indian Time Use Survey dataset, variables such as education_level and employment_status are coded numerically. The following mappings define the categorical meaning of each code. These will be used throughout the lecture to define events and compute empirical probabilities.

Education Level:

  • 1: not literate

  • 2: literate: below primary

  • 3: primary

  • 4: upper primary/middle

  • 5: secondary

  • 6: higher secondary

  • 7: diploma /certificate course (up to secondary)

  • 8: diploma/certificate course (higher secondary)

  • 9: diploma/certificate course(graduation & above)

  • 10: graduate

  • 11: post graduate and above

Employment Status:

  • 11: working as own account worker in household enterprise

  • 12: working as employer in household enterprise

  • 21: working as helper in household enterprise (unpaid family worker)

  • 31: working as regular salaried/wage employee

  • 41: working as casual wage labour in public works

  • 51: working as casual wage labour in other types of work

  • 81: did not work but was seeking and/or available for work

  • 91: attending educational institution

  • 92: attending domestic duties only

  • 93: attending domestic duties and is also engaged in unpaid household enterprise for household use

  • 94: rentiers, pensioners, remittance recipients, etc.

  • 95: not able to work due to disability

  • 97: others

## Education Level
v  <- sort(unique(data$education_level))
f  <- numeric(length(v))
for (r in 1:nrow(data)) {
  z <- data$education_level[r]
  i <- which(v == z)
  f[i] <- f[i] + 1
}
df_edu <- data.frame(edu=v, f=f, p=round(f/sum(f), 5))

## Employment Level
v  <- sort(unique(data$employment_status))
f  <- numeric(length(v))
for (r in 1:nrow(data)) {
  z <- data$employment_status[r]
  i <- which(v == z)
  f[i] <- f[i] + 1
}
df_emp <- data.frame(emp=v, f=f, p=round(f/sum(f), 5))

print(df_edu)
cat("\n")
print(df_emp)
   edu      f       p
1    1 135438 0.25376
2    2  71011 0.13305
3    3  69215 0.12968
4    4  73360 0.13745
5    5  64045 0.12000
6    6  56295 0.10548
7    7   1755 0.00329
8    8   3132 0.00587
9    9   1847 0.00346
10  10  46394 0.08693
11  11  11227 0.02104

     emp      f       p
1     10      1 0.00000
2     11  77216 0.14468
3     12   4716 0.00884
4     21  17119 0.03207
5     31  51547 0.09658
6     41   4756 0.00891
7     51  45210 0.08471
8     81   7724 0.01447
9     91 101349 0.18989
10    92  90968 0.17044
11    93  27100 0.05078
12    94  20272 0.03798
13    95   4602 0.00862
14    97   1612 0.00302
15 99999  79527 0.14901

Axioms of Probability#

Let \(S\) be a sample space (set of all possible outcomes), and let \(P\) be a probability function mapping subsets of \(S\) to real numbers, then the axioms are:

  • Axiom #1 - Non-Negativity: \(P(E) \geq 0\) for any event \(E\).

all(df_edu$p >= 0)
TRUE
  • Axiom #2 - Normalization: \(\sum_{E \in S} P(E) = 1\); the probability of the sample space is 1.

round(sum(df_edu$p), 3) == 1
TRUE
  • Axiom #3 - Additivity: If \(A\) and \(B\) are disjoint events (i.e., \(A \cap B = \emptyset\)), then \(P(A \cup B) = P(A) + P(B)\)

Let \(A\) refer to individuals with primary education level \((\text{edu} = 3)\), and \(B\) refer to individuals with secondary eduction level \((\text{edu} = 5)\), then \(A \cup B\) refer to individuals with primary or secondary education level. Since, \(A\) and \(B\) are disjoint events (i.e., \(A \cap B = \emptyset\)), then,

p_A   = df_edu$p[3] 
p_B   = df_edu$p[5]
p_AUB = (df_edu$f[3] + df_edu$f[5]) / sum(df_edu$f)

message("Probability that an individual has a primary education level (A): ", round(p_A, 3))
message("Probability that an individual has a secondary education level (B): ", round(p_B, 3))
message("Probability that an individual has a primary or secondary eduction level (A U B): ", round(p_AUB, 3))
message("Since A and B are disjoint events, P(A U B) = P(A) + P(B) holds ", round(p_AUB, 3) == round(p_A + p_B, 3))
Probability that an individual has a primary education level (A): 0.13

Probability that an individual has a secondary education level (B): 0.12

Probability that an individual has a primary or secondary eduction level (A U B): 0.25

Since A and B are disjoint events, P(A U B) = P(A) + P(B) holds TRUE

Laws of Probability#

These axioms render the following laws of probability,

  • Complement Rule: \(P(A^c) = 1 - P(A)\) follows from Axiom #2 and Axiom #3

Let \(A\) refer to individuals that are not literate \((\text{edu} = 1)\), then \(B = A^c\) refers to individuals that have some level of literacy \((\text{edu} \in [2,11])\), then,

p_A = df_edu$p[1]
p_B = sum(df_edu$f[2:11]) / sum(df_edu$f)

message("Probability that an individual is not literate (A): ", round(p_A, 3))
message("Probability that an individual has some level of literacy (B): ", round(p_B, 3))
message("Since B is complement of A, P(B) = 1 - P(A) holds ", round(p_B, 3) == round(1 - p_A, 3))
Probability that an individual is not literate (A): 0.254

Probability that an individual has some level of literacy (B): 0.746

Since B is complement of A, P(B) = 1 - P(A) holds TRUE
  • Monotonicity: If \(A \subset B\), then \(P(A) \leq P(B)\) follows from Axiom #3

Let \(B\) refer to individuals that have some level of literacy \((\text{edu} \in [2,11])\), and \(A\) refer to individuals with graduate level of education or above \((\text{edu} \in [10,11])\), then,

p_B = sum(df_edu$p[2:11])
p_A = sum(df_edu$p[10:11]) 

message("Probability that an individual has some level of literacy (A): ", round(p_A, 3))
message("Probability that an individual has graduate level of education or above (B): ", round(p_B, 3))
message("Since A is a subset of B, P(A) <= P(B) holds ", p_A <= p_B )
Probability that an individual has some level of literacy (A): 0.108

Probability that an individual has graduate level of education or above (B): 0.746

Since A is a subset of B, P(A) <= P(B) holds TRUE
  • Inclusion-Exclusion: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\) follows from Axiom #3

Let \(A\) refer to individuals working as regular salaried/wage employee \((\text{emp} = 31)\), and \(B\) refer to individuals with graduate level of education \((\text{edu} = 10)\), then,

p_A = df_emp$p[which(df_emp$emp == 31)]
p_B = df_edu$p[which(df_edu$edu == 10)]

f <- 0
for (r in 1:nrow(data)) {
  emp <- data$employment_status[r]
  edu <- data$education_level[r]
  if (emp == 31 & edu == 10){
    f <- f + 1
  }
}
p_AXB <- f / nrow(data)

f <- 0
for (r in 1:nrow(data)) {
  emp <- data$employment_status[r]
  edu <- data$education_level[r]
  if (emp == 31 || edu == 10){
    f <- f + 1
  }
}
p_AUB <- f / nrow(data)

message("Probability that an individual is working as regular salaried/wage employee (A): ", round(p_A, 3))
message("Probability that an individual has graduate level of education (B): ", round(p_B, 3))
message("Probability that an individual is working as regular salaried/wage employee and has graduate level of education (A X B): ", round(p_AXB, 3))
message("Probability that an individual is working as regular salaried/wage employee or has graduate level of education (A U B): ", round(p_AUB, 3))
message("P(A U B) = P(A) + P(B) - P(A X B) holds ", round(p_AUB, 3) == round(p_A + p_B - p_AXB, 3))
Probability that an individual is working as regular salaried/wage employee (A): 0.097

Probability that an individual has graduate level of education (B): 0.087

Probability that an individual is working as regular salaried/wage employee and has graduate level of education (A X B): 0.029

Probability that an individual is working as regular salaried/wage employee or has graduate level of education (A U B): 0.155

P(A U B) = P(A) + P(B) - P(A X B) holds TRUE

Conditional Probability#

Conditional probability quantifies the likelihood of an event occurring given that another event has already occurred. The conditional probability of event A given event B is defined as:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{provided} \ \ P(B) > 0 \]

Let \(A\) refer to individuals working as regular salaried/wage employee \((\text{emp} = 31)\), and \(B\) refer to individuals with graduate level of education \((\text{edu} = 10)\), then,

p_A = df_emp$p[which(df_emp$emp == 31)]
p_B = df_edu$p[which(df_edu$edu == 10)]

f <- 0
for (r in 1:nrow(data)) {
  edu <- data$education_level[r]
  emp <- data$employment_status[r]
  if (edu == 10 & emp == 31){
    f <- f + 1
  }
}
p_AXB <- f / nrow(data)
p_AB  <- f / df_edu$f[which(df_edu$edu == 10)]

message("Probability that an individual is working as regular salaried/wage employee (A): ", round(p_A, 3))
message("Probability that an individual has graduate level of education (B): ", round(p_B, 3))
message("Probability that an individual is working as regular salaried/wage employee and has graduate level of education (A X B): ", round(p_AXB, 3))
message("Probability that an individual is working as regular salaried/wage employee given the individual has graduate level of education (AB): ", round(p_AB, 3))
message("P(AB) = P(A X B) / P(B) holds ", round(p_AB, 3) == round(p_AXB / p_B, 3))
Probability that an individual is working as regular salaried/wage employee (A): 0.097

Probability that an individual has graduate level of education (B): 0.087

Probability that an individual is working as regular salaried/wage employee and has graduate level of education (A X B): 0.029

Probability that an individual is working as regular salaried/wage employee given the individual has graduate level of education (AB): 0.333

P(AB) = P(A X B) / P(B) holds TRUE

Bayes’ Rule#

Expanding over condtional probability, Bayes’ rule renders

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \quad \text{provided } P(B) > 0 \]

Let \(A\) refer to individuals working as regular salaried/wage employee \((\text{emp} = 31)\), and \(B\) refer to individuals with graduate level of education \((\text{edu} = 10)\), then,

p_A = df_emp$p[which(df_emp$emp == 31)]
p_B = df_edu$p[which(df_edu$edu == 10)]

f <- 0
for (r in 1:nrow(data)) {
  edu <- data$education_level[r]
  emp <- data$employment_status[r]
  if (edu == 10 & emp == 31){
    f <- f + 1
  }
}
p_AB  <- f / df_edu$f[which(df_edu$edu == 10)]
p_BA  <- f / df_emp$f[which(df_emp$emp == 31)]

message("Probability that an individual is working as regular salaried/wage employee (A): ", round(p_A, 3))
message("Probability that an individual has graduate level of education (B): ", round(p_B, 3))
message("Probability that an individual is working as regular salaried/wage employee given the individual has graduate level of education (AB): ", round(p_AB, 3))
message("Probability that an individual has graduate level of education given the individual is working as regular salaried/wage employee (BA): ", round(p_BA, 3))
message("P(AB) = P(BA) x P(A) / P(B) holds ", round(p_AB, 3) == round(p_BA * p_A / p_B, 3))
Probability that an individual is working as regular salaried/wage employee (A): 0.097

Probability that an individual has graduate level of education (B): 0.087

Probability that an individual is working as regular salaried/wage employee given the individual has graduate level of education (AB): 0.333

Probability that an individual has graduate level of education given the individual is working as regular salaried/wage employee (BA): 0.3

P(AB) = P(BA) x P(A) / P(B) holds TRUE

Tip

Beyond education_level and employment_status, test the axioms, Axioms, Conditional Probability, and Bayes’ Rule for categorical variables such as gender and marital_status gender:

  • 1: male

  • 2: female

  • 3: transgender marital_status:

  • 1: never married

  • 2: currently married

  • 3: widowed

  • 4: divorced/separated