Lecture 3: Introduction to R

Note

R is an open-source programming language and software environment specifically designed for statistical computing, data analysis, and data visualization. Originally developed by statisticians Ross Ihaka and Robert Gentleman in the early 1990s, R has since grown into a powerful tool used by scientists, researchers, and analysts across a range of disciplines. Its strength lies in its extensive ecosystem of packages, active community support, and seamless integration with advanced statistical methods, making it ideal for tasks such as hypothesis testing, regression modeling, data mining, and creating high-quality visualizations.


Installing R and RStudio on Windows

Follow these steps:

Step 1: Install R

  1. Visit https://cran.r-project.org

  2. Click on “Download R for Windows” > “base” > Download the installer

  3. Run the .exe file and follow installation instructions

Step 2: Install RStudio

  1. Visit https://posit.co/download/rstudio-desktop/

  2. Download the RStudio Desktop version for Windows

  3. Run the installer and follow instructions

Once installed, open RStudio to begin writing R code!
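The later sections of this lecture load the dplyr and moments packages, which do not ship with base R. Here is a minimal sketch for checking whether they are available, assuming a standard CRAN setup. The vector name `required` is only illustrative.

```r
# Packages used later in this lecture (not part of base R)
required <- c("dplyr", "moments")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
print(is_installed)

# To install whatever is missing from CRAN, run:
# install.packages(required[!is_installed])
```

The install line is left commented out so that nothing is downloaded unless you choose to run it.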

Hello World!

# Hello World in R
print("Hello World!")
[1] "Hello World!"

Data Types in R

R supports the following basic data types:

  • Character

  • Numeric

  • Integer

  • Logical

  • Complex

Here are some examples:

# Character
x <- "CE5540"
message("Type of x is: ", typeof(x))

# Numeric (stored as a double by default)
y <- 3.14
message("Type of y is: ", typeof(y))

# Integer
v <- 42L
message("Type of v is: ", typeof(v))

# Logical
f <- TRUE
message("Type of f is: ", typeof(f))

# Complex
z <- 2 + 3i
message("Type of z is: ", typeof(z))
Type of x is: character

Type of y is: double

Type of v is: integer

Type of f is: logical

Type of z is: complex
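Note that the numeric value reports its type as double. The short sketch below shows how the L suffix changes the storage type and how the as.*() functions convert between types. The variable names are only illustrative.

```r
# A number written without the L suffix is stored as a double
a <- 42
b <- 42L
print(typeof(a))   # "double"
print(typeof(b))   # "integer"

# is.numeric() is TRUE for both doubles and integers
print(is.numeric(a))
print(is.numeric(b))

# Explicit coercion with the as.*() family
n   <- as.integer("7")      # character -> integer
s   <- as.character(3.14)   # double -> character
one <- as.numeric(TRUE)     # logical -> double (TRUE becomes 1)
print(n)
print(s)
print(one)
```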

Data Structures in R

R supports the following data structures:

  • Vectors

  • Matrices

  • Lists

  • Data Frames

Here are some examples:

# Vectors
v1 <- c("Apple", "Banana", "Mango")
v2 <- c(9, 1, 5, 4, 6, 7, 0, 3, 8)
v3 <- c(1:5)
message("# Vectors")
print(v1)
print(v2)
print(v3)
message("Accessing a value in a vector: v1[1] = ", v1[1], ". Notice that R follows 1-based indexing!")

# Matrices
m1 <- matrix(c(9, 1, 5, 4, 6, 7, 0, 3, 8), nrow = 3, byrow = TRUE)
m2 <- matrix(c(9, 1, 5, 4, 6, 7, 0, 3, 8), nrow = 3, byrow = FALSE)
message("\n# Matrices")
print(m1)
print(m2)
message("Accessing a value in a matrix: m1[1, 3] = ", m1[1, 3])

# Lists
l <- list(name="John", age=25L, scores=c(90, 85, 88))
message("\n# List")
print(l)
message("Accessing a value in a list: l$name = ", l$name)

# Data Frames
df <- data.frame(Name=c("Alice", "Bob"), Age=c(23L, 25L))
message("\n# Data Frames")
print(df)
message("Accessing a value in a data frame: df$Age[1] = ", df$Age[1])
# Vectors
[1] "Apple"  "Banana" "Mango" 
[1] 9 1 5 4 6 7 0 3 8
[1] 1 2 3 4 5
Accessing a value in a vector: v1[1] = Apple. Notice that R follows 1-based indexing!


# Matrices
     [,1] [,2] [,3]
[1,]    9    1    5
[2,]    4    6    7
[3,]    0    3    8
     [,1] [,2] [,3]
[1,]    9    4    0
[2,]    1    6    3
[3,]    5    7    8
Accessing a value in a matrix: m1[1, 3] = 5


# List
$name
[1] "John"

$age
[1] 25

$scores
[1] 90 85 88
Accessing a value in a list: l$name = John


# Data Frames
   Name Age
1 Alice  23
2   Bob  25
Accessing a value in a data frame: df$Age[1] = 23
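Beyond picking out a single value, each structure supports a few other common indexing forms. The following is a small sketch with illustrative values, not an exhaustive list.

```r
v <- c(9, 1, 5, 4)
print(v[2:3])     # elements 2 and 3
print(v[-1])      # everything except the first element
print(v[v > 4])   # logical indexing: elements greater than 4

m <- matrix(1:6, nrow = 2)   # filled column by column
print(m[2, ])     # entire second row
print(m[, 3])     # entire third column

l <- list(name = "John", scores = c(90, 85, 88))
print(l[["scores"]][2])   # [[ ]] extracts the element itself
print(class(l["name"]))   # [ ] returns a list containing the element

df <- data.frame(Name = c("Alice", "Bob"), Age = c(23L, 25L))
print(df[df$Age > 24, "Name"])   # Name of every row where Age > 24
```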

Control Flow

Here is how you would write control flow statements in R:

x <- 10

if (x > 0) {
  message("x is a positive number")
} else if (x < 0) {
  message("x is a negative number")
} else {
  message("x is zero!")
}
x is a positive number
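The if statement tests a single TRUE/FALSE value. To apply the same test to every element of a vector, one option is the vectorised ifelse() function, sketched here with illustrative values.

```r
x <- c(-3, 0, 7)

# ifelse() evaluates the condition element by element
sign_label <- ifelse(x > 0, "positive",
                     ifelse(x < 0, "negative", "zero"))
print(sign_label)
```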

Writing Loops in R

R supports both for and while loops.

# For loop
message("# For Loop")
for (i in 1:5) {
  message("Iteration:", i)
}

# While loop
message("\n\n# While Loop")
i <- 1
while (i <= 5) {
  message("Count:", i)
  i <- i + 1
}
# For Loop
Iteration:1

Iteration:2

Iteration:3

Iteration:4

Iteration:5



# While Loop

Count:1

Count:2

Count:3

Count:4

Count:5
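Loops can also skip an iteration with next or stop early with break. The sketch below uses seq_along(), which generates the index sequence 1, 2, ..., length(x) and yields no iterations for an empty vector. The values are illustrative.

```r
fruits <- c("Apple", "Banana", "Mango")

# next skips the rest of the current iteration
for (i in seq_along(fruits)) {
  if (fruits[i] == "Banana") next
  message(i, ": ", fruits[i])
}

# break leaves the loop as soon as the condition is met
total <- 0
i <- 0
while (TRUE) {
  i <- i + 1
  total <- total + i
  if (total > 10) break
}
message("Stopped at i = ", i, " with total = ", total)
```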

Writing Functions in R

Functions are blocks of code that can be reused. Here’s how to define and call one.

# Factorial Function (Iterative Form)
factorial_iterative <- function(n) {
  result <- 1
  # seq_len(n) is empty when n is 0, so 0! correctly returns 1
  for (i in seq_len(n)) {
    result <- result * i
  }
  return(result)
}

# Example usage
factorial_iterative(5)
120
# Factorial Function (Recursive Form)
factorial_recursive <- function(n) {
  if (n == 0 || n == 1) {
    return(1)
  } else {
    return(n * factorial_recursive(n - 1))
  }
}

# Example usage
factorial_recursive(5)
120
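Neither version above checks its input. A minimal sketch of a validated variant follows. The name `safe_factorial` is hypothetical, and the result is compared against base R's built-in factorial().

```r
# Hypothetical helper: factorial with input validation
safe_factorial <- function(n) {
  # Reject anything that is not a single non-negative whole number
  if (!is.numeric(n) || length(n) != 1 || n < 0 || n != floor(n)) {
    stop("n must be a single non-negative whole number")
  }
  prod(seq_len(n))   # prod() of an empty sequence is 1, so 0! = 1
}

print(safe_factorial(5))                   # 120
print(safe_factorial(0))                   # 1
print(safe_factorial(5) == factorial(5))   # TRUE
```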

Summarising Data in R

In this segment of the lecture, we compute the measures of location, dispersion, and shape discussed in the previous lecture, using the 2024 ITUS sample individual dataset.

# 2024 ITUS Individual Data (original)
url  <- "https://raw.githubusercontent.com/anmpahwa/CE5540/refs/heads/main/resources/ITUS_IND_OG.csv"
data <- read.csv(url) # Loading Data
# typeof() reports the underlying storage type; class(data) would return "data.frame"
message("2024 ITUS sample individual data is retrieved as ", typeof(data), " as follows: ")
str(data)             # Data Structure
2024 ITUS sample individual data is retrieved as list as follows: 
'data.frame':	533719 obs. of  23 variables:
 $ survey_year      : int  2024 2024 2024 2024 2024 2024 2024 2024 2024 2024 ...
 $ fsu_serial_no    : int  30010 30010 30010 30010 30010 30010 30010 30010 30010 30010 ...
 $ sector           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ nss_region       : int  241 241 241 241 241 241 241 241 241 241 ...
 $ district         : int  17 17 17 17 17 17 17 17 17 17 ...
 $ stratum          : int  13 13 13 13 13 13 13 13 13 13 ...
 $ sub_stratum      : int  11 11 11 11 11 11 11 11 11 11 ...
 $ sub_round        : int  2 2 2 2 2 2 2 2 2 2 ...
 $ fod_sub_region   : int  2420 2420 2420 2420 2420 2420 2420 2420 2420 2420 ...
 $ nsc              : int  4 4 4 4 4 4 4 4 4 4 ...
 $ household_id     : int  1 2 2 2 2 3 3 3 3 3 ...
 $ individual_id    : int  1 1 2 3 4 1 2 3 4 5 ...
 $ response_code    : int  1 1 1 1 99999 1 1 1 1 1 ...
 $ day_of_week      : int  2 3 3 3 99999 7 7 7 7 7 ...
 $ type_of_day      : int  1 1 1 1 99999 1 1 1 1 1 ...
 $ relation_to_head : int  1 1 2 4 6 1 2 5 5 5 ...
 $ gender           : int  1 1 2 2 1 1 2 1 1 2 ...
 $ age              : int  45 54 52 29 3 48 44 23 13 18 ...
 $ marital_status   : int  1 2 2 2 1 2 2 1 1 1 ...
 $ education_level  : int  10 6 4 10 1 5 2 5 4 4 ...
 $ employment_status: int  10 94 92 31 99999 51 92 11 91 92 ...
 $ industry         : int  85 99999 99999 86 99999 1 99999 1 99999 99999 ...
 $ weight           : num  208857 208857 208857 208857 208857 ...
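Several columns in the listing contain the value 99999. The sketch below uses a small stand-in data frame rather than the ITUS data. It shows basic inspection calls and, assuming 99999 is a placeholder code rather than a real measurement (an assumption, since the lecture does not say so), how such a code could be recoded to NA.

```r
# Small stand-in data frame (illustrative values, not the ITUS data)
toy <- data.frame(age = c(45L, 54L, 3L),
                  day_of_week = c(2L, 3L, 99999L))

print(typeof(toy))   # "list": the underlying storage type
print(class(toy))    # "data.frame": the class R dispatches on
print(dim(toy))      # number of rows, number of columns
print(names(toy))    # column names

# Assumption: 99999 marks a missing or not-applicable entry.
# Recoding it to NA keeps it out of numeric summaries.
toy$day_of_week[toy$day_of_week == 99999L] <- NA
print(mean(toy$day_of_week, na.rm = TRUE))   # 2.5
```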
library(dplyr)

# Filter down to a smaller dataset (NSS region 241, employment status 81)
data <- data %>% filter(nss_region == 241, employment_status == 81)
View(data)
A data.frame: 78 × 23
survey_yearfsu_serial_nosectornss_regiondistrictstratumsub_stratumsub_roundfod_sub_regionnscday_of_weektype_of_dayrelation_to_headgenderagemarital_statuseducation_levelemployment_statusindustryweight
<int><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int><dbl>
20243001012411713112242041151221108199999208857
20243001312412613101242344151181 58199999207000
20243001312412613101242347161181 58199999207000
20243001312412613101242347151191 48199999207000
20243001412412913114242047151351 28199999196571
20243001912412413113242442161181 48199999233429
20243004012412513 31242346161281 18199999214629
20243004012412513 31242344151201 58199999214629
20243004712412113 44242046112783 18199999269500
20243004812412513 32242341151221 48199999177739
20243007012412013 83242046151181 58199999225129
20243007012412013 83242043161181 58199999225129
20243009012412113 62242042151251 58199999267650
20243009212412113 52242043151221 58199999152221
20243082012412313133242446151191 68199999211693
20243082012412313133242447151281108199999211693
20243082112411813142242042161181 48199999173107
20243082312412513121242344131352 58199999261914
20243082312412513121242341131522 58199999261914
20243082312412513121242342151351 48199999261914
20243082312412513121242342132384 58199999261914
20243082912412613141242345151201 48199999236268
20243083412411813163242045111552 58199999193541
20243083712412313173242447151231118199999235750
20243083812412313162242444151401 58199999283938
20243083912412313161242445151281 68199999304798
20243083912412313161242447151321 38199999304798
20243086212412513172242344131332 48199999276000
20243086212412513172242347261171 58199999276000
20243086312412513174242341151211 48199999267950
202466483224121131242045161171 58199999 322071
202466483224121131242044152 81 28199999 322071
202466487224123214242444181481 88199999 194448
202466487224123214242444161161 28199999 194448
202466487224123214242445131442 68199999 194448
202466487224123214242441111542108199999 194448
202466490224119424242045131352118199999 185784
202466491224119433242041161241 48199999 156975
202466491224119433242045151291118199999 156975
202466493224119413242042161292 58199999 208214
202466498224119422242042151291 58199999 211621
2024669202241255642423461512511081999991005964
202466920224125564242346151191 481999991005964
2024669202241255642423461522311081999991005964
202466927224125574242344151231 78199999 201875
202466927224125574242344151221 78199999 201875
202466928224125563242345111422 58199999 282161
202466929224125561242341161181 58199999 274800
202466930224125583242346151231 48199999 403196
202466930224125583242346151221108199999 403196
202466930224125583242346151201 58199999 403196
202466933224125573242341111301 28199999 166813
202466933224125573242347251151 48199999 166813
202466933224125573242343151211 38199999 166813
202466934224125592242344151301108199999 179852
202466934224125592242343111522 48199999 179852
202466934224125592242343151211108199999 179852
202466935224125584242342152251108199999 322557
202466935224125584242345151191 48199999 322557
202466939224125594242344151241108199999 253698
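The same kind of row filter can be written without dplyr. The sketch below uses a small stand-in data frame with illustrative values (not the ITUS data) and shows two base R equivalents.

```r
# Stand-in rows (illustrative values) with the two columns used in the filter
toy <- data.frame(
  nss_region        = c(241L, 241L, 242L, 241L),
  employment_status = c(81L, 92L, 81L, 81L),
  age               = c(22L, 52L, 30L, 18L)
)

# Base R: a logical row index; both conditions must hold
kept <- toy[toy$nss_region == 241 & toy$employment_status == 81, ]
print(kept)

# subset() is a shorthand for the same selection
kept2 <- subset(toy, nss_region == 241 & employment_status == 81)
print(kept2)
```

In dplyr's filter(), conditions separated by commas are combined with a logical AND, which is why the base R version uses `&`.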
# Creating frequency table
v  <- sort(unique(data$age))
f  <- numeric(length(v))
for (r in 1:nrow(data)) {
  z <- data$age[r]
  i <- which(v == z)
  f[i] <- f[i] + 1
}
df <- data.frame(x=v, f=f)
View(df)
A data.frame: 30 × 2
x      f
<int>  <dbl>
8      1
15     2
16     2
17     3
18     8
19     5
20     3
21     3
22     8
23     6
24     2
25     4
26     3
28     4
29     3
30     2
32     1
33     1
35     5
36     1
38     1
40     1
41     1
42     1
44     1
48     1
52     2
54     1
55     1
78     1
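The loop above counts each age by hand. As a cross-check, base R's table() produces the same counts in one call. The sketch below copies the ages and counts from the frequency table above, so it runs without downloading the data. The names `ages`, `counts`, and `df2` are only illustrative.

```r
# Ages and counts copied from the frequency table above
x <- c(8, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29,
       30, 32, 33, 35, 36, 38, 40, 41, 42, 44, 48, 52, 54, 55, 78)
f <- c(1, 2, 2, 3, 8, 5, 3, 3, 8, 6, 2, 4, 3, 4, 3,
       2, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1)

ages <- rep(x, times = f)   # expand back to one value per person

# table() counts each distinct value; its names are the sorted values
counts <- table(ages)
df2 <- data.frame(x = as.integer(names(counts)), f = as.vector(counts))
print(head(df2))
print(sum(df2$f))   # 78 observations in total
```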
# Measures of Location
# Convert counts to relative frequencies so that the f column sums to 1
df <- data.frame(x=v, f=f/sum(f))

## Mean
### Manually Computed Mean
v1 <- sum(df$x * df$f)
### Auto Computed Mean
v2 <- mean(data$age)
message("Manually Computed Mean = ", round(v1, digits=3), " and Auto Computed Mean = ", round(v2, digits=3))

## Median
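### Manually Computed Median
# qtl() returns the p-th quantile of values x with relative frequencies f
qtl <- function(x, f, p) {
  F <- cumsum(f)   # cumulative relative frequency
  n <- length(x)
  for (i in seq_len(n)) {
    # p falls exactly on a step: average the two neighbouring values
    if (i < n && isTRUE(all.equal(F[i], p))) {
      return((x[i] + x[i + 1]) / 2)
    }
    # first value whose cumulative frequency exceeds p
    if (F[i] > p) {
      return(x[i])
    }
  }
  x[n]   # reached only when p is (numerically) 1
}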
v1 <- qtl(df$x, df$f, 0.5)
### Auto Computed Median
v2 <- median(data$age)
message("Manually Computed Median = ", round(v1, digits=3), " and Auto Computed Median = ", round(v2, digits=3))

## Mode
# which.max() returns the first maximum; here ages 18 and 22 tie (8 observations each)
v <- df$x[which.max(df$f)]
message("Mode = ", v)
Manually Computed Mean = 26.949 and Auto Computed Mean = 26.949

Manually Computed Median = 23 and Auto Computed Median = 23

Mode = 18
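The same measures can be obtained from built-in functions applied to the frequency table. The sketch below copies the ages and counts from the frequency table shown earlier. Note that it reports both tied modes, whereas which.max() keeps only the first.

```r
# Ages and counts copied from the frequency table shown earlier
x <- c(8, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29,
       30, 32, 33, 35, 36, 38, 40, 41, 42, 44, 48, 52, 54, 55, 78)
f <- c(1, 2, 2, 3, 8, 5, 3, 3, 8, 6, 2, 4, 3, 4, 3,
       2, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1)
ages <- rep(x, times = f)   # one value per person

print(round(weighted.mean(x, w = f), 3))       # mean: 26.949
print(median(ages))                            # median: 23
print(quantile(ages, probs = c(0.25, 0.75)))   # quartiles: 19 and 30
print(x[f == max(f)])                          # all modes: 18 and 22
```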
# Measures of Dispersion

## Range
v <- max(df$x) - min(df$x)
message("Range = ", v)

## Inter-Quartile Range
### Manually Computed Inter-Quartile Range
v1 <- qtl(df$x, df$f, 0.75) - qtl(df$x, df$f, 0.25)
### Auto Computed Inter-Quartile Range
v2 <- IQR(data$age)
message("Manually Computed Inter-Quartile Range = ", round(v1, digits=3), " and Auto Computed Inter-Quartile Range = ", round(v2, digits=3))

## Standard Deviation
### Manually Computed Standard Deviation (population form: divides by n)
v1 <- sqrt(sum(df$f * (df$x - sum(df$x * df$f))^2))
### Auto Computed Standard Deviation (sd() uses the sample form: divides by n - 1)
v2 <- sd(data$age)
# The two values therefore differ by a factor of sqrt((n - 1) / n)
message("Manually Computed Standard Deviation = ", round(v1, digits=3), " and Auto Computed Standard Deviation = ", round(v2, digits=3))
Range = 70

Manually Computed Inter-Quartile Range = 11 and Auto Computed Inter-Quartile Range = 11

Manually Computed Standard Deviation = 11.35 and Auto Computed Standard Deviation = 11.423
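The two standard deviations differ because the manual formula divides by n (population form), while sd() divides by n - 1 (sample form). With n = 78, multiplying 11.423 by sqrt(77/78) gives 11.35. The sketch below shows the same relationship on a small illustrative vector.

```r
y <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(y)

pop_sd    <- sqrt(mean((y - mean(y))^2))   # divides by n
sample_sd <- sd(y)                         # divides by n - 1

print(pop_sd)                # 2
print(round(sample_sd, 3))   # 2.138

# Rescaling the sample form recovers the population form
print(all.equal(pop_sd, sample_sd * sqrt((n - 1) / n)))   # TRUE
```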
# Measures of Shape
library(moments)

## Central moments of the frequency distribution (df$f holds relative frequencies)
mu <- sum(df$x * df$f)              # mean
m2 <- sum(df$f * (df$x - mu)^2)     # second central moment (population variance)

## Skewness
### Manually Computed Skewness
v1 <- sum(df$f * (df$x - mu)^3) / m2^(3/2)
### Auto Computed Skewness
v2 <- skewness(data$age)
message("Manually Computed Skewness = ", round(v1, digits=3), " | Auto Computed Skewness = ", round(v2, digits=3))

## Kurtosis
### Manually Computed Kurtosis
v1 <- sum(df$f * (df$x - mu)^4) / m2^2
### Auto Computed Kurtosis
v2 <- kurtosis(data$age)
message("Manually Computed Kurtosis = ", round(v1, digits=3), " | Auto Computed Kurtosis = ", round(v2, digits=3))
Manually Computed Skewness = 1.816 | Auto Computed Skewness = 1.816

Manually Computed Kurtosis = 7.337 | Auto Computed Kurtosis = 7.337
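The manual and package results agree because both are ratios of population central moments, with no n - 1 correction. Under this definition kurtosis is not reported as excess kurtosis, so a normal distribution has a value of 3. The sketch below implements the same ratios in base R on small illustrative vectors. The helper names `central_moment`, `skew`, and `kurt` are hypothetical.

```r
# k-th central moment, population form (divides by n)
central_moment <- function(y, k) mean((y - mean(y))^k)

# Moment-based skewness and kurtosis
skew <- function(y) central_moment(y, 3) / central_moment(y, 2)^(3 / 2)
kurt <- function(y) central_moment(y, 4) / central_moment(y, 2)^2

symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 2, 2, 3, 10)

print(skew(symmetric))      # 0 for symmetric data
print(skew(right_skewed))   # positive: long right tail
print(kurt(symmetric))      # 1.7
```

A positive skewness, as found for the age data above, indicates a longer right tail.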