Lecture 3: Introduction to R

Note

R is an open-source programming language and software environment specifically designed for statistical computing, data analysis, and data visualization. Originally developed by statisticians Ross Ihaka and Robert Gentleman in the early 1990s, R has since grown into a powerful tool used by scientists, researchers, and analysts across a range of disciplines. Its strength lies in its extensive ecosystem of packages, active community support, and seamless integration with advanced statistical methods, making it ideal for tasks such as hypothesis testing, regression modeling, data mining, and creating high-quality visualizations.

Installing R and RStudio on Windows

Follow these steps:

Step 1: Install R

Visit https://cran.r-project.org
Click on “Download R for Windows” > “base” > Download the installer
Run the .exe file and follow installation instructions

Step 2: Install RStudio

Visit https://posit.co/download/rstudio-desktop/
Download the RStudio Desktop version for Windows
Run the installer and follow instructions

Once installed, open RStudio to begin writing R code!

Hello World!

# Hello World in R
print("Hello World!")

[1] "Hello World!"

Data Types in R

R supports the following basic data types:

Character
Numeric
Integer
Logical
Complex

Here are some examples:

# Character
x <- "CE5540"
message("Type of x is: ", typeof(x))

# Numeric
y <- 3.14
message("Type of y is: ", typeof(y))

# Integer
v <- 42L
message("Type of v is: ", typeof(v))

# Logical
f <- TRUE
message("Type of f is: ", typeof(f))

# Complex
z <- 2 + 3i
message("Type of z is: ", typeof(z))

Type of x is: character

Type of y is: double

Type of v is: integer

Type of f is: logical

Type of z is: complex
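As a small companion sketch (the variable names are illustrative and not part of the lecture code), `class()`, the `is.*()` predicates, and the `as.*()` coercion functions can be used alongside `typeof()` to inspect and convert values:

```r
# Illustrative values; all names below are hypothetical
n <- 3.14            # stored as a double
k <- as.integer(n)   # coercion drops the fractional part, giving 3L
s <- as.character(k) # "3"

typeof(n)          # "double"
class(n)           # "numeric"
typeof(k)          # "integer"
is.numeric(k)      # TRUE: integers also count as numeric
typeof(s)          # "character"
typeof(1:2 + 0.5)  # "double": mixing integer and double promotes to double
```

Note that `typeof()` reports the storage type, which is why a numeric value such as 3.14 shows up as "double".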

Data Structures in R

R supports the following data structures:

Vectors
Matrices
Lists
Data Frames

Here are some examples:

# Vectors
v1 <- c("Apple", "Banana", "Mango")
v2 <- c(9, 1, 5, 4, 6, 7, 0, 3, 8)
v3 <- c(1:5)
message("# Vectors")
print(v1)
print(v2)
print(v3)
message("Accessing a value in a vector: v1[1] = ", v1[1], ". Notice that R follows 1-based indexing!")

# Matrices
m1 <- matrix(c(9, 1, 5, 4, 6, 7, 0, 3, 8), nrow = 3, byrow = TRUE)
m2 <- matrix(c(9, 1, 5, 4, 6, 7, 0, 3, 8), nrow = 3, byrow = FALSE)
message("\n# Matrices")
print(m1)
print(m2)
message("Accessing a value in a matrix: m1[1, 3] = ", m1[1,3])

# Lists
l <- list(name="John", age=25L, scores=c(90, 85, 88))
message("\n# List")
print(l)
message("Accessing a value in a list: l$name = ", l$name)

# Data Frames
df <- data.frame(Name=c("Alice", "Bob"), Age=c(23L, 25L))
message("\n# Data Frames")
print(df)
message("Accessing a value in a data frame: df$Age[1] = ", df$Age[1])

# Vectors

[1] "Apple"  "Banana" "Mango"
[1] 9 1 5 4 6 7 0 3 8
[1] 1 2 3 4 5

Accessing a value in a vector: v1[1] = Apple. Notice that R follows 1-based indexing!

# Matrices

     [,1] [,2] [,3]
[1,]    9    1    5
[2,]    4    6    7
[3,]    0    3    8
     [,1] [,2] [,3]
[1,]    9    4    0
[2,]    1    6    3
[3,]    5    7    8

Accessing a value in a matrix: m1[1, 3] = 5

# List

$name
[1] "John"

$age
[1] 25

$scores
[1] 90 85 88

Accessing a value in a list: l$name = John

# Data Frames

   Name Age
1 Alice  23
2   Bob  25

Accessing a value in a data frame: df$Age[1] = 23
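Beyond picking out single values, each structure supports a few more indexing forms. The following is a minimal sketch with made-up objects (`vec`, `mat`, `lst`, and `people` are hypothetical names chosen to avoid clashing with the variables above):

```r
# Hypothetical objects for illustration only
vec    <- c(9, 1, 5, 4, 6)
mat    <- matrix(1:6, nrow = 2)   # filled column by column
lst    <- list(name = "John", scores = c(90, 85, 88))
people <- data.frame(Name = c("Alice", "Bob"), Age = c(23L, 25L))

vec[2:3]                  # elements 2 and 3
vec[-1]                   # everything except the first element
mat[2, ]                  # the whole second row, as a vector
lst[["scores"]][2]        # second element of the scores vector
lst["scores"]             # single brackets keep the list wrapper
people[people$Age > 24, ] # rows where Age exceeds 24
people[["Name"]]          # same column as people$Name
```

The distinction between `[` (returns a sub-structure of the same kind) and `[[` or `$` (returns the element itself) is the one that most often trips up newcomers.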

Control Flow

Here is how to write conditional (if / else if / else) statements in R:

x <- 10

if (x > 0) {
  message("x is a positive number")
} else if (x < 0) {
  message("x is a negative number")
} else {
  message("x is zero!")
}

x is a positive number
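The `if` statement above tests a single condition. When the same test has to be applied to every element of a vector, the vectorised `ifelse()` function is one option. A minimal sketch, with a hypothetical input vector `xs`:

```r
# Hypothetical input vector
xs <- c(-2, 0, 7)

# ifelse() evaluates the test elementwise and returns a vector of results
signs <- ifelse(xs > 0, "positive",
                ifelse(xs < 0, "negative", "zero"))
print(signs)
```

Passing a condition of length greater than one to a plain `if` is an error in recent versions of R, which is why the vectorised form is used here.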

Writing Loops in R

R supports both for and while loops.

# For loop
message("# For Loop")
for (i in 1:5) {
  message("Iteration:", i)
}

# While loop
message("\n\n# While Loop")
i <- 1
while (i <= 5) {
  message("Count:", i)
  i <- i + 1
}

# For Loop

Iteration:1

Iteration:2

Iteration:3

Iteration:4

Iteration:5

# While Loop

Count:1

Count:2

Count:3

Count:4

Count:5
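When looping over the positions of an existing vector, `seq_along()` is generally safer than `1:length(x)`, because it behaves sensibly for empty vectors. The sketch below uses hypothetical vectors `items` and `empty`, and also shows `sapply()` as a loop-free alternative:

```r
# Hypothetical vectors for illustration
items <- c("a", "b", "c")
for (i in seq_along(items)) {
  message("Item ", i, ": ", items[i])
}

empty <- character(0)
length(1:length(empty))   # 2, because 1:0 counts down to c(1, 0)
length(seq_along(empty))  # 0, so a loop over it never runs

# Applying a function to each element without an explicit loop
squares <- sapply(1:5, function(i) i^2)
print(squares)
```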

Writing Functions in R

Functions are blocks of code that can be reused. Here’s how to define and call one.

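# Factorial Function (Iterative Form)
factorial_iterative <- function(n) {
  result <- 1
  # seq_len(n) is empty when n = 0, so the loop is skipped and 0! = 1
  # (2:n would count downward for n < 2 and give wrong results)
  for (i in seq_len(n)) {
    result <- result * i
  }
  return(result)
}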

# Example usage
factorial_iterative(5)

120

# Factorial Function (Recursive Form)
factorial_recursive <- function(n) {
  if (n == 0 || n == 1) {
    return(1)
  } else {
    return(n * factorial_recursive(n - 1))
  }
}

# Example usage
factorial_recursive(5)

120
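Functions can also take default argument values and validate their input. The following is a sketch only; `factorial_checked` is a hypothetical helper that is not part of the lecture code, and base R already provides `factorial()` for everyday use:

```r
# Hypothetical helper: factorial with a default argument and input checks
factorial_checked <- function(n = 1) {
  # Reject anything that is not a single non-negative whole number
  if (!is.numeric(n) || length(n) != 1 || n < 0 || n != floor(n)) {
    stop("n must be a single non-negative whole number")
  }
  prod(seq_len(n))  # prod() of an empty vector is 1, so 0! = 1
}

factorial_checked(5)  # 120
factorial_checked(0)  # 1
factorial_checked()   # uses the default n = 1, giving 1
factorial(5)          # base R equivalent: 120
```

Calling `stop()` early gives a clear error message instead of letting a negative or fractional input run through the computation.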

Summarising Data in R

In this segment of the lecture, we compute the measures of location, dispersion, and shape discussed in the previous lecture, using the 2024 ITUS sample individual dataset.

# 2024 ITUS Individual Data (original)
url  <- "https://raw.githubusercontent.com/anmpahwa/CE5540/refs/heads/main/resources/ITUS_IND_OG.csv"
data <- read.csv(url) # Loading Data
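# typeof() reports "list" because a data frame is stored as a list of columns; class(data) would report "data.frame"
message("2024 ITUS sample individual data is retrieved as ", typeof(data), " as follows: ")
str(data)             # Data Structure

2024 ITUS sample individual data is retrieved as list as follows: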

'data.frame':	533719 obs. of  23 variables:
 $ survey_year      : int  2024 2024 2024 2024 2024 2024 2024 2024 2024 2024 ...
 $ fsu_serial_no    : int  30010 30010 30010 30010 30010 30010 30010 30010 30010 30010 ...
 $ sector           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ nss_region       : int  241 241 241 241 241 241 241 241 241 241 ...
 $ district         : int  17 17 17 17 17 17 17 17 17 17 ...
 $ stratum          : int  13 13 13 13 13 13 13 13 13 13 ...
 $ sub_stratum      : int  11 11 11 11 11 11 11 11 11 11 ...
 $ sub_round        : int  2 2 2 2 2 2 2 2 2 2 ...
 $ fod_sub_region   : int  2420 2420 2420 2420 2420 2420 2420 2420 2420 2420 ...
 $ nsc              : int  4 4 4 4 4 4 4 4 4 4 ...
 $ household_id     : int  1 2 2 2 2 3 3 3 3 3 ...
 $ individual_id    : int  1 1 2 3 4 1 2 3 4 5 ...
 $ response_code    : int  1 1 1 1 99999 1 1 1 1 1 ...
 $ day_of_week      : int  2 3 3 3 99999 7 7 7 7 7 ...
 $ type_of_day      : int  1 1 1 1 99999 1 1 1 1 1 ...
 $ relation_to_head : int  1 1 2 4 6 1 2 5 5 5 ...
 $ gender           : int  1 1 2 2 1 1 2 1 1 2 ...
 $ age              : int  45 54 52 29 3 48 44 23 13 18 ...
 $ marital_status   : int  1 2 2 2 1 2 2 1 1 1 ...
 $ education_level  : int  10 6 4 10 1 5 2 5 4 4 ...
 $ employment_status: int  10 94 92 31 99999 51 92 11 91 92 ...
 $ industry         : int  85 99999 99999 86 99999 1 99999 1 99999 99999 ...
 $ weight           : num  208857 208857 208857 208857 208857 ...
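Several columns in the structure above contain the value 99999. If that value is a placeholder rather than a real code (this is an assumption; the survey documentation should be checked before relying on it), it can be recoded to `NA` so that summary functions do not treat it as data. A minimal sketch on a made-up data frame `toy`:

```r
# Toy data frame standing in for a few ITUS columns (values are made up)
toy <- data.frame(
  age         = c(45L, 3L, 29L),
  day_of_week = c(2L, 99999L, 3L),
  industry    = c(85L, 99999L, 86L)
)

# ASSUMPTION: 99999 marks a missing or not-applicable entry
toy[toy == 99999L] <- NA

# Count the missing values per column
colSums(is.na(toy))
```

Once recoded, functions such as `mean()` accept `na.rm = TRUE` to skip the missing entries.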

library(dplyr)

# Filtering down to a smaller dataset
# Inside filter(), columns are referred to by their bare names
data <- data %>% filter(nss_region == 241, employment_status == 81)
View(data)

A data.frame: 78 × 23
survey_year	fsu_serial_no	sector	nss_region	district	stratum	sub_stratum	sub_round	fod_sub_region	nsc	⋯	day_of_week	type_of_day	relation_to_head	gender	age	marital_status	education_level	employment_status	industry	weight
<int>	<int>	<int>	<int>	<int>	<int>	<int>	<int>	<int>	<int>	⋯	<int>	<int>	<int>	<int>	<int>	<int>	<int>	<int>	<int>	<dbl>
2024	30010	1	241	17	13	11	2	2420	4	⋯	1	1	5	1	22	1	10	81	99999	208857
2024	30013	1	241	26	13	10	1	2423	4	⋯	4	1	5	1	18	1	5	81	99999	207000
2024	30013	1	241	26	13	10	1	2423	4	⋯	7	1	6	1	18	1	5	81	99999	207000
2024	30013	1	241	26	13	10	1	2423	4	⋯	7	1	5	1	19	1	4	81	99999	207000
2024	30014	1	241	29	13	11	4	2420	4	⋯	7	1	5	1	35	1	2	81	99999	196571
2024	30019	1	241	24	13	11	3	2424	4	⋯	2	1	6	1	18	1	4	81	99999	233429
2024	30040	1	241	25	13	3	1	2423	4	⋯	6	1	6	1	28	1	1	81	99999	214629
2024	30040	1	241	25	13	3	1	2423	4	⋯	4	1	5	1	20	1	5	81	99999	214629
2024	30047	1	241	21	13	4	4	2420	4	⋯	6	1	1	2	78	3	1	81	99999	269500
2024	30048	1	241	25	13	3	2	2423	4	⋯	1	1	5	1	22	1	4	81	99999	177739
2024	30070	1	241	20	13	8	3	2420	4	⋯	6	1	5	1	18	1	5	81	99999	225129
2024	30070	1	241	20	13	8	3	2420	4	⋯	3	1	6	1	18	1	5	81	99999	225129
2024	30090	1	241	21	13	6	2	2420	4	⋯	2	1	5	1	25	1	5	81	99999	267650
2024	30092	1	241	21	13	5	2	2420	4	⋯	3	1	5	1	22	1	5	81	99999	152221
2024	30820	1	241	23	13	13	3	2424	4	⋯	6	1	5	1	19	1	6	81	99999	211693
2024	30820	1	241	23	13	13	3	2424	4	⋯	7	1	5	1	28	1	10	81	99999	211693
2024	30821	1	241	18	13	14	2	2420	4	⋯	2	1	6	1	18	1	4	81	99999	173107
2024	30823	1	241	25	13	12	1	2423	4	⋯	4	1	3	1	35	2	5	81	99999	261914
2024	30823	1	241	25	13	12	1	2423	4	⋯	1	1	3	1	52	2	5	81	99999	261914
2024	30823	1	241	25	13	12	1	2423	4	⋯	2	1	5	1	35	1	4	81	99999	261914
2024	30823	1	241	25	13	12	1	2423	4	⋯	2	1	3	2	38	4	5	81	99999	261914
2024	30829	1	241	26	13	14	1	2423	4	⋯	5	1	5	1	20	1	4	81	99999	236268
2024	30834	1	241	18	13	16	3	2420	4	⋯	5	1	1	1	55	2	5	81	99999	193541
2024	30837	1	241	23	13	17	3	2424	4	⋯	7	1	5	1	23	1	11	81	99999	235750
2024	30838	1	241	23	13	16	2	2424	4	⋯	4	1	5	1	40	1	5	81	99999	283938
2024	30839	1	241	23	13	16	1	2424	4	⋯	5	1	5	1	28	1	6	81	99999	304798
2024	30839	1	241	23	13	16	1	2424	4	⋯	7	1	5	1	32	1	3	81	99999	304798
2024	30862	1	241	25	13	17	2	2423	4	⋯	4	1	3	1	33	2	4	81	99999	276000
2024	30862	1	241	25	13	17	2	2423	4	⋯	7	2	6	1	17	1	5	81	99999	276000
2024	30863	1	241	25	13	17	4	2423	4	⋯	1	1	5	1	21	1	4	81	99999	267950
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋱	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
2024	66483	2	241	21	1	3	1	2420	4	⋯	5	1	6	1	17	1	5	81	99999	322071
2024	66483	2	241	21	1	3	1	2420	4	⋯	4	1	5	2	8	1	2	81	99999	322071
2024	66487	2	241	23	2	1	4	2424	4	⋯	4	1	8	1	48	1	8	81	99999	194448
2024	66487	2	241	23	2	1	4	2424	4	⋯	4	1	6	1	16	1	2	81	99999	194448
2024	66487	2	241	23	2	1	4	2424	4	⋯	5	1	3	1	44	2	6	81	99999	194448
2024	66487	2	241	23	2	1	4	2424	4	⋯	1	1	1	1	54	2	10	81	99999	194448
2024	66490	2	241	19	4	2	4	2420	4	⋯	5	1	3	1	35	2	11	81	99999	185784
2024	66491	2	241	19	4	3	3	2420	4	⋯	1	1	6	1	24	1	4	81	99999	156975
2024	66491	2	241	19	4	3	3	2420	4	⋯	5	1	5	1	29	1	11	81	99999	156975
2024	66493	2	241	19	4	1	3	2420	4	⋯	2	1	6	1	29	2	5	81	99999	208214
2024	66498	2	241	19	4	2	2	2420	4	⋯	2	1	5	1	29	1	5	81	99999	211621
2024	66920	2	241	25	5	6	4	2423	4	⋯	6	1	5	1	25	1	10	81	99999	1005964
2024	66920	2	241	25	5	6	4	2423	4	⋯	6	1	5	1	19	1	4	81	99999	1005964
2024	66920	2	241	25	5	6	4	2423	4	⋯	6	1	5	2	23	1	10	81	99999	1005964
2024	66927	2	241	25	5	7	4	2423	4	⋯	4	1	5	1	23	1	7	81	99999	201875
2024	66927	2	241	25	5	7	4	2423	4	⋯	4	1	5	1	22	1	7	81	99999	201875
2024	66928	2	241	25	5	6	3	2423	4	⋯	5	1	1	1	42	2	5	81	99999	282161
2024	66929	2	241	25	5	6	1	2423	4	⋯	1	1	6	1	18	1	5	81	99999	274800
2024	66930	2	241	25	5	8	3	2423	4	⋯	6	1	5	1	23	1	4	81	99999	403196
2024	66930	2	241	25	5	8	3	2423	4	⋯	6	1	5	1	22	1	10	81	99999	403196
2024	66930	2	241	25	5	8	3	2423	4	⋯	6	1	5	1	20	1	5	81	99999	403196
2024	66933	2	241	25	5	7	3	2423	4	⋯	1	1	1	1	30	1	2	81	99999	166813
2024	66933	2	241	25	5	7	3	2423	4	⋯	7	2	5	1	15	1	4	81	99999	166813
2024	66933	2	241	25	5	7	3	2423	4	⋯	3	1	5	1	21	1	3	81	99999	166813
2024	66934	2	241	25	5	9	2	2423	4	⋯	4	1	5	1	30	1	10	81	99999	179852
2024	66934	2	241	25	5	9	2	2423	4	⋯	3	1	1	1	52	2	4	81	99999	179852
2024	66934	2	241	25	5	9	2	2423	4	⋯	3	1	5	1	21	1	10	81	99999	179852
2024	66935	2	241	25	5	8	4	2423	4	⋯	2	1	5	2	25	1	10	81	99999	322557
2024	66935	2	241	25	5	8	4	2423	4	⋯	5	1	5	1	19	1	4	81	99999	322557
2024	66939	2	241	25	5	9	4	2423	4	⋯	4	1	5	1	24	1	10	81	99999	253698
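The same kind of row filtering can be done without dplyr. The sketch below uses a made-up data frame `toy` (not the ITUS data) to show two base R equivalents, logical indexing and `subset()`:

```r
# Toy data frame with the two filter columns (values are made up)
toy <- data.frame(
  nss_region        = c(241L, 241L, 242L),
  employment_status = c(81L, 31L, 81L),
  age               = c(22L, 45L, 30L)
)

# Logical indexing: keep rows where both conditions hold
kept1 <- toy[toy$nss_region == 241 & toy$employment_status == 81, ]

# subset(): same conditions, with bare column names
kept2 <- subset(toy, nss_region == 241 & employment_status == 81)

nrow(kept1)  # 1
kept2$age    # 22
```

One difference worth knowing: when a condition evaluates to `NA`, logical indexing returns a row of `NA`s, whereas `subset()` and `dplyr::filter()` drop that row.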

# Creating frequency table
v  <- sort(unique(data$age))
f  <- numeric(length(v))
for (r in 1:nrow(data)) {
  z <- data$age[r]
  i <- which(v == z)
  f[i] <- f[i] + 1
}
df <- data.frame(x=v, f=f)
View(df)

A data.frame: 30 × 2
x	f
<int>	<dbl>
8	1
15	2
16	2
17	3
18	8
19	5
20	3
21	3
22	8
23	6
24	2
25	4
26	3
28	4
29	3
30	2
32	1
33	1
35	5
36	1
38	1
40	1
41	1
42	1
44	1
48	1
52	2
54	1
55	1
78	1
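The loop above builds the frequency table by hand. For comparison, base R's `table()` produces the same counts in one call. A minimal sketch on a made-up vector `ages` (the name `freq` is also hypothetical):

```r
# Made-up ages for illustration
ages <- c(18, 22, 18, 25, 22, 18)

counts <- table(ages)  # counts per distinct value, sorted by value

# Convert to the same x / f layout used in the lecture
freq <- data.frame(
  x = as.numeric(names(counts)),
  f = as.vector(counts)
)
print(freq)
```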

# Measures of Location
df <- data.frame(x=v, f=f/sum(f))  # convert counts to relative frequencies (they sum to 1)

## Mean
### Manually Computed Mean
v1 <- sum(df$x * df$f)
### Auto Computed Mean
v2 <- mean(data$age)
message("Manually Computed Mean = ", round(v1, digits=3), " and Auto Computed Mean = ", round(v2, digits=3))

## Median
### Manually Computed Median
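qtl <- function (x, f, p) {
  cf <- cumsum(f)   # cumulative relative frequency (avoids masking F, the alias for FALSE)
  n  <- length(x)
  for (i in seq_len(n)) {
    # Cumulative frequency equals p (up to rounding error): average this value and the next
    if (isTRUE(all.equal(cf[i], p)) && i < n) {
      return((x[i] + x[i+1]) / 2)
    }
    # Otherwise return the first value whose cumulative frequency exceeds p
    if (cf[i] > p) {
      return(x[i])
    }
  }
  return(x[n])      # p is at or beyond the last cumulative frequency
}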
v1 <- qtl(df$x, df$f, 0.5)
### Auto Computed Median
v2 <- median(data$age)
message("Manually Computed Median = ", round(v1, digits=3), " and Auto Computed Median = ", round(v2, digits=3))

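## Mode
### which.max() returns only the first maximum; ages 18 and 22 both occur 8 times, so 18 is reported
v <- df$x[which.max(df$f)]
message("Mode = ", v)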

Manually Computed Mean = 26.949 and Auto Computed Mean = 26.949

Manually Computed Median = 23 and Auto Computed Median = 23

Mode = 18
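Two small variations on the calculations above, sketched on a made-up frequency table (the vectors `x` and `f` here are hypothetical): `weighted.mean()` computes the frequency-weighted mean directly, and comparing against `max(f)` returns every mode when there is a tie.

```r
# Made-up values and their frequencies; 18 and 22 are tied
x <- c(18, 22, 25)
f <- c(3, 3, 1)

# Frequency-weighted mean: same as sum(x * f) / sum(f)
m <- weighted.mean(x, f)

# All modes versus only the first one
all_modes  <- x[f == max(f)]    # 18 22
first_mode <- x[which.max(f)]   # 18
```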

# Measures of Dispersion

## Range
v <- max(df$x) - min(df$x)
message("Range = ", v)

## Inter-Quartile Range
### Manually Computed Inter-Quartile Range
v1 <- qtl(df$x, df$f, 0.75) - qtl(df$x, df$f, 0.25)
### Auto Computed Inter-Quartile Range
v2 <- IQR(data$age)
message("Manually Computed Inter-Quartile Range = ", round(v1, digits=3), " and Auto Computed Inter-Quartile Range = ", round(v2, digits=3))

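## Standard Deviation
### Manually Computed Standard Deviation (population form: divides by n)
v1 <- sqrt(sum(df$f * (df$x - sum(df$x * df$f))^2))
### Auto Computed Standard Deviation (sd() uses the sample form: divides by n - 1)
v2 <- sd(data$age)
### The two values differ by a factor of sqrt(n / (n - 1)), hence the small gap in the output
message("Manually Computed Standard Deviation = ", round(v1, digits=3), " and Auto Computed Standard Deviation = ", round(v2, digits=3))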

Range = 70

Manually Computed Inter-Quartile Range = 11 and Auto Computed Inter-Quartile Range = 11

Manually Computed Standard Deviation = 11.35 and Auto Computed Standard Deviation = 11.423
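The manual and built-in standard deviations above differ slightly (11.35 versus 11.423) because the manual formula divides by n while `sd()` divides by n - 1. The sketch below checks that relationship on a made-up vector `ages`:

```r
# Made-up ages for illustration
ages <- c(18, 22, 18, 25, 22, 18)
n    <- length(ages)

sd_pop    <- sqrt(mean((ages - mean(ages))^2))  # population form (divide by n)
sd_sample <- sd(ages)                           # sample form (divide by n - 1)

# The sample SD equals the population SD scaled by sqrt(n / (n - 1))
all.equal(sd_sample, sd_pop * sqrt(n / (n - 1)))  # TRUE
```

With the 78 observations in the filtered dataset, the scaling factor is sqrt(78 / 77), which accounts for the gap between the two reported values.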

# Measures of Shape
library(moments)

## Skewness
### Manually Computed Skewness
v1 <- (sum(df$f * (df$x - sum(df$x * df$f))^3)) / (sqrt(sum(df$f * (df$x - sum(df$x * df$f))^2)))^3
### Auto Computed Skewness
v2 <- skewness(data$age)
message("Manually Computed Skewness = ", round(v1, digits=3), " | Auto Computed Skewness = ", round(v2, digits=3))

## Kurtosis
### Manually Computed Kurtosis (fourth central moment over squared variance; not excess kurtosis)
v1 <- sum(df$f * (df$x - sum(df$x * df$f))^4) / (sqrt(sum(df$f * (df$x - sum(df$x * df$f))^2)))^4
### Auto Computed Kurtosis
v2 <- kurtosis(data$age)
message("Manually Computed Kurtosis = ", round(v1, digits=3), " | Auto Computed Kurtosis = ", round(v2, digits=3))

Manually Computed Skewness = 1.816 | Auto Computed Skewness = 1.816

Manually Computed Kurtosis = 7.337 | Auto Computed Kurtosis = 7.337
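The same shape measures can be computed in base R from raw observations, without the moments package, by working with central moments. A minimal sketch, assuming a made-up vector `ages` and a hypothetical helper `central_moment`:

```r
# Hypothetical helper: k-th central moment (population form, divides by n)
central_moment <- function(x, k) mean((x - mean(x))^k)

# Made-up ages with one large value, giving a long right tail
ages <- c(18, 22, 18, 25, 22, 18, 40)

skew <- central_moment(ages, 3) / central_moment(ages, 2)^(3 / 2)
kurt <- central_moment(ages, 4) / central_moment(ages, 2)^2

# Excess kurtosis subtracts 3, the kurtosis of a normal distribution
excess_kurt <- kurt - 3

message("Skewness = ", round(skew, 3), " | Kurtosis = ", round(kurt, 3))
```

A positive skewness, as in the lecture's result of 1.816, indicates a longer right tail; the kurtosis of 7.337 reported above is on the non-excess scale, where a normal distribution scores 3.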