Lecture 2: Statistics

Note

In the previous lecture, we briefly discussed summarizing data. In this lecture, we dive deeper into the subject, exploring ways to summarize quantitative data through measures of location, dispersion, and shape.

Measures of Location (MoL)#

Measures of Location aim to capture the central or typical value of the data through a single-value representation.

Mean#

The arithmetic mean represents the expected (average) value of the data.

For discrete quantitative data with \(n\) distinct values, where each value \(x_i\) occurs with frequency \(f_i\), the probability mass function is \(f(x_i) = f_i / \sum_i f_i\), and the mean is given by,

\[\mu = \sum_{i=1}^{n} f(x_i) x_i\]

For continuous quantitative data with probability density function \(f(x)\), the mean is given by,

\[\mu = \int f(x)x \ \ dx\]
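The discrete formula can be sketched in a few lines of Python. The function name `frequency_mean` and the sample data are hypothetical, chosen only for illustration; the code assumes the frequencies are non-negative and sum to a positive total.

```python
def frequency_mean(values, freqs):
    """Mean of discrete data given distinct values x_i and frequencies f_i."""
    total = sum(freqs)
    # f(x_i) = f_i / sum_i f_i is the empirical probability mass function
    return sum((f / total) * x for x, f in zip(values, freqs))


# Hypothetical data: number of children per household in a sample of 10
values = [0, 1, 2, 3]
freqs = [2, 5, 2, 1]
print(frequency_mean(values, freqs))  # approximately 1.2
```

Weighting each value by its relative frequency gives the same result as averaging the raw observations one by one.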

Median#

The median is the number that divides the (ordered) data in half. At least half the data are equal to or smaller than the median, and at least half the data are equal to or greater than the median.

For discrete quantitative data arranged in ascending order, with \(n\) distinct values such that each value \(x_i\) has an associated frequency \(f_i\), the median, \(\alpha_{p=0.5}\), is given by,

\[\alpha_{p} = x_m; \ \ x_m \ \ | \ \ \frac{\sum_{i=1}^{m-1} f_i}{\sum_{i=1}^{n} f_i} < p \ \ \text{and} \ \ \frac{\sum_{i=1}^{m} f_i}{\sum_{i=1}^{n} f_i} > p; \ \ \text{if} \ \ x_m \ \ \text{exists}\]

\[\alpha_{p} = (x_m + x_{m+1}) / 2; \ \ x_m \ \ | \ \ \frac{\sum_{i=1}^{m} f_i}{\sum_{i=1}^{n} f_i} = p; \ \ \text{otherwise}\]
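A minimal sketch of the two cases above, assuming the values are already sorted in ascending order, every frequency is positive, and \(0 < p < 1\). The name `frequency_quantile` is hypothetical. Exact fractions are used so that the "equal to \(p\)" case is not missed through floating-point rounding.

```python
from fractions import Fraction


def frequency_quantile(values, freqs, p=Fraction(1, 2)):
    """alpha_p for distinct ascending values x_i with positive frequencies f_i."""
    if not 0 < p < 1:
        raise ValueError("p must satisfy 0 < p < 1")
    total = sum(freqs)
    cumulative = 0
    for m, (x, f) in enumerate(zip(values, freqs)):
        cumulative += f
        share = Fraction(cumulative, total)
        if share > p:
            # the cumulative share jumps past p at x_m
            return x
        if share == p:
            # the cumulative share lands exactly on p: average x_m and x_{m+1}
            return (x + values[m + 1]) / 2
    raise ValueError("frequencies must be positive")


print(frequency_quantile([1, 2, 3, 4], [1, 1, 1, 1]))  # 2.5
print(frequency_quantile([1, 2, 3], [1, 3, 1]))        # 2
```

The same function returns other quantiles, for example the quartiles, by passing `p=Fraction(1, 4)` or `p=Fraction(3, 4)`.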

For continuous quantitative data grouped into \(n\) intervals of width \(w\) (arranged in ascending order), each with frequency \(f_i\), let the median \(\alpha_{0.5}\) lie in the median class \(X_m: [mw, (m+1)w]\). The median is then given by,

\[\alpha_{0.5} = mw + w(\sum_{i=1}^{n} f_i / 2 - \sum_{i=1}^{m-1} f_i) / f_m\]
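The grouped-data formula can be sketched as follows. The sketch assumes, as in the text, that class \(i\) covers \([iw, (i+1)w]\) for \(i = 1, \dots, n\), and it takes the median class to be the first class at which the cumulative frequency reaches half the total. The name `grouped_median` is hypothetical.

```python
def grouped_median(freqs, w):
    """Median for classes X_i = [i*w, (i+1)*w], i = 1..n, with frequencies f_i."""
    total = sum(freqs)
    if total <= 0:
        raise ValueError("freqs must contain at least one positive frequency")
    half = total / 2
    below = 0  # cumulative frequency of the classes before the current one
    for i, f in enumerate(freqs, start=1):
        if below + f >= half:
            # linear interpolation inside the median class
            return i * w + w * (half - below) / f
        below += f


# Hypothetical classes [10, 20], [20, 30], [30, 40] with frequencies 5, 10, 5
print(grouped_median([5, 10, 5], w=10))  # 25.0
```

The interpolation treats observations as spread evenly within the median class, which is the assumption behind the formula itself.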

Mode#

The mode of a set of data is the most common value among the data.

For discrete quantitative data, let there be \(n\) distinct discrete values such that each discrete value \(x_i\) has an associated frequency \(f_i\), then the mode is given by,

\[\beta = x_m; \ \ m = \underset{i}{\text{argmax}} \ f_i\]

For continuous quantitative data grouped into \(n\) intervals of width \(w\) (arranged in ascending order), each with frequency \(f_i\), such that the mode \(\beta\) lies in the modal class \(X_m: [mw, (m+1)w]\), the mode is given by,

\[\beta = mw + w(f_m - f_{m-1})/(2f_m - f_{m-1} - f_{m+1})\]
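Both mode formulas can be sketched as below. The function names are hypothetical. Two choices are assumptions of this sketch rather than part of the definitions: ties are broken in favour of the first maximum, and a modal class at either end of the table is treated as having a neighbour with frequency zero.

```python
def discrete_mode(values, freqs):
    """Value x_m with the largest frequency (first one in case of ties)."""
    m = max(range(len(freqs)), key=freqs.__getitem__)
    return values[m]


def grouped_mode(freqs, w):
    """Mode for classes X_i = [i*w, (i+1)*w], i = 1..n, with frequencies f_i."""
    m = max(range(len(freqs)), key=freqs.__getitem__)  # 0-based index
    f_m = freqs[m]
    f_prev = freqs[m - 1] if m > 0 else 0              # assumed 0 at the edge
    f_next = freqs[m + 1] if m + 1 < len(freqs) else 0  # assumed 0 at the edge
    denominator = 2 * f_m - f_prev - f_next
    if denominator == 0:
        raise ValueError("the modal class is not well defined")
    lower = (m + 1) * w  # classes are numbered from 1 in the text
    return lower + w * (f_m - f_prev) / denominator


print(discrete_mode([0, 1, 2, 3], [2, 5, 2, 1]))  # 1
print(grouped_mode([2, 8, 6], w=5))               # 13.75
```

Because `discrete_mode` only compares frequencies, it also works when the values are category labels rather than numbers.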

Tip

For qualitative and categorical data, the mean and median are generally not meaningful; the mode, however, can still be defined.

In general, the mean and the median need not be close together. If the data have a symmetric distribution, the mean and median are equal, but if the distribution is skewed, the difference between the mean and the median can be large. Typically, the median is smaller than the mean if the data are skewed to the right, and larger than the mean if the data are skewed to the left.
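A small numerical illustration of this point, using Python's standard `statistics` module. The income figures are made up for illustration only.

```python
import statistics

# Hypothetical household incomes (in thousands); one very large value
# produces a long right tail
incomes = [20, 22, 25, 27, 30, 32, 35, 400]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
print(mean_income)    # 73.875
print(median_income)  # 28.5

# A symmetric sample, for comparison: mean and median coincide
symmetric = [1, 2, 3, 4, 5]
print(statistics.mean(symmetric), statistics.median(symmetric))  # 3 3
```

The single extreme value pulls the mean far above the median, while the median stays close to what most households in the sample earn.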

Test Yourself

Suppose we want to know how much money a family can afford to spend on housing. In such a situation, which MoL would you employ?

Suppose we want to infer the affluence level of a nation. In such a scenario, which MoL would you employ?

Suppose we want to design tax brackets based on household incomes. In such a case, which MoL would you employ?

Measures of Dispersion (MoD)#

Measures of Dispersion aim to capture the variability in data through a single-value representation.

Range#

As the name suggests, range represents the width of the data, given by,

\[r = \text{max}(X) - \text{min}(X)\]

Inter-Quartile Range#

The inter-quartile range represents the width of the interval that contains the middle 50% of the data, given by,

\[\rho = \alpha_{0.75} - \alpha_{0.25}\]
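The range and inter-quartile range can be sketched on raw data as follows. The helper `quantile` is hypothetical; it applies the same rule as the quantile \(\alpha_p\) defined for the median (average two neighbouring observations when the cumulative share equals \(p\) exactly). Library routines such as NumPy's percentile functions use other interpolation conventions by default and may give slightly different quartiles.

```python
from fractions import Fraction
import math


def quantile(data, p):
    """alpha_p of raw data, following the rule used for the median."""
    if not 0 < p < 1:
        raise ValueError("p must satisfy 0 < p < 1")
    xs = sorted(data)
    k = Fraction(p) * len(xs)  # exact for p such as 0.25, 0.5, 0.75
    if k.denominator == 1:
        # cumulative share equals p exactly: average the two neighbours
        k = int(k)
        return (xs[k - 1] + xs[k]) / 2
    return xs[math.floor(k)]


data = [1, 2, 3, 4, 5, 6, 7, 8]
data_range = max(data) - min(data)
iqr = quantile(data, 0.75) - quantile(data, 0.25)
print(data_range)  # 7
print(iqr)         # 4.0
```

Unlike the range, the inter-quartile range ignores the lowest and highest quarters of the data, so a single extreme observation does not change it.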

Standard Deviation#

The standard deviation measures how spread out the data are around the mean.

For discrete quantitative data with \(n\) distinct values, where each value \(x_i\) occurs with frequency \(f_i\) and the probability mass function is \(f(x_i) = f_i / \sum_i f_i\), the standard deviation is given by,

\[\sigma = \sqrt{\sum_{i=1}^{n} f(x_i) (x_i - \mu)^2}\]

For continuous quantitative data with probability density function \(f(x)\), the standard deviation is given by,

\[\sigma = \sqrt{\int f(x)(x - \mu)^2 \ \ dx}\]
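A minimal sketch of the discrete formula, with the hypothetical name `frequency_std` and made-up data. Note that this is the population form (weights \(f_i / \sum_i f_i\)), which corresponds to `statistics.pstdev` in the standard library rather than the sample form `statistics.stdev`.

```python
import math
import statistics


def frequency_std(values, freqs):
    """Population standard deviation from distinct values and frequencies."""
    total = sum(freqs)
    pmf = [f / total for f in freqs]
    mu = sum(p * x for p, x in zip(pmf, values))
    variance = sum(p * (x - mu) ** 2 for p, x in zip(pmf, values))
    return math.sqrt(variance)


values = [0, 1, 2, 3]
freqs = [2, 5, 2, 1]
sigma = frequency_std(values, freqs)

# Cross-check against the raw observations expanded from the table
raw = [x for x, f in zip(values, freqs) for _ in range(f)]
print(sigma, statistics.pstdev(raw))  # both approximately 0.8718
```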

Measures of Shape (MoS)#

Measures of Shape aim to capture the shape of the distribution of data through a single-value representation.

Skewness#

Skewness measures the asymmetry of the distribution around its mean. A symmetric distribution has zero skewness. Positive skew indicates a longer tail on the right, while negative skew implies a longer tail on the left.

For discrete quantitative data with \(n\) distinct values, where each value \(x_i\) occurs with frequency \(f_i\) and the probability mass function is \(f(x_i) = f_i / \sum_i f_i\), the skewness is given by,

\[\gamma = \frac{1}{\sigma^3} \sum_{i=1}^{n} f(x_i) (x_i - \mu)^3 \]

For continuous quantitative data with probability density function \(f(x)\), the skewness is given by,

\[\gamma = \frac{1}{\sigma^3} \int f(x)(x - \mu)^3 \ \ dx\]

Kurtosis#

Kurtosis measures the heaviness of the tails of a distribution, and is often loosely described as its peakedness. A value of 3 represents the kurtosis of a normal distribution (mesokurtic), while a value greater than 3 indicates heavier tails and a sharper peak (leptokurtic), and a value less than 3 indicates lighter tails and a flatter peak (platykurtic).

For discrete quantitative data with \(n\) distinct values, where each value \(x_i\) occurs with frequency \(f_i\) and the probability mass function is \(f(x_i) = f_i / \sum_i f_i\), the kurtosis is given by,

\[\kappa = \frac{1}{\sigma^4} \sum_{i=1}^{n} f(x_i) (x_i - \mu)^4 \]

For continuous quantitative data with probability density function \(f(x)\), the kurtosis is given by,

\[\kappa = \frac{1}{\sigma^4} \int f(x)(x - \mu)^4 \ \ dx\]
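Skewness and kurtosis share the same structure for discrete data: a central moment of order \(k\) divided by \(\sigma^k\), with \(k = 3\) and \(k = 4\) respectively. The sketch below uses a hypothetical helper `standardized_moment` and made-up frequency tables, and assumes the data are not all equal (so that \(\sigma > 0\)).

```python
import math


def standardized_moment(values, freqs, k):
    """k-th central moment divided by sigma**k (k=3: skewness, k=4: kurtosis)."""
    total = sum(freqs)
    pmf = [f / total for f in freqs]
    mu = sum(p * x for p, x in zip(pmf, values))
    sigma = math.sqrt(sum(p * (x - mu) ** 2 for p, x in zip(pmf, values)))
    central = sum(p * (x - mu) ** k for p, x in zip(pmf, values))
    return central / sigma ** k


# Symmetric table: skewness is zero
skew_symmetric = standardized_moment([1, 2, 3], [1, 2, 1], 3)
kurt_symmetric = standardized_moment([1, 2, 3], [1, 2, 1], 4)

# Table with a long right tail: skewness is positive
skew_right = standardized_moment([1, 2, 3, 10], [4, 3, 2, 1], 3)

print(skew_symmetric, kurt_symmetric, skew_right)
```

The symmetric table has kurtosis of about 2, below the normal-distribution value of 3, so it would be classed as platykurtic.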

Tip

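The mean, variance (the square of the standard deviation), skewness, and kurtosis correspond to the first, second, third, and fourth moments of the data, respectively: the mean is the first raw moment, the variance is the second central moment, and skewness and kurtosis are the third and fourth standardized moments.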

Transformations#

Given a random variable \(X\) with

mean \(\mu_X\)
median \(\alpha_{X,0.5}\)
mode \(\beta_X\)
range \(r_X\)
inter-quartile range \(\rho_X\)
standard deviation \(\sigma_X\)
skewness \(\gamma_X\)
kurtosis \(\kappa_X\)

The linear transformation \(Y = aX + b\), with \(a \neq 0\), has

mean \(a\mu_X + b\)
median \(a\alpha_{X,0.5} + b\)
mode \(a\beta_X + b\)
range \(|a|r_X\)
inter-quartile range \(|a|\rho_X\)
standard deviation \(|a|\sigma_X\)
skewness \(\frac{|a|}{a}\gamma_X\)
kurtosis \(\kappa_X\)
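Several of these rules can be checked numerically on a small sample. The sketch below uses made-up data and a negative \(a\) so that the sign flip in the skewness is visible. The helpers `skewness` and `kurtosis` are hypothetical and use the population (divide by \(N\)) definitions.

```python
import math
import statistics


def skewness(data):
    mu, sigma = statistics.mean(data), statistics.pstdev(data)
    return sum((v - mu) ** 3 for v in data) / (len(data) * sigma ** 3)


def kurtosis(data):
    mu, sigma = statistics.mean(data), statistics.pstdev(data)
    return sum((v - mu) ** 4 for v in data) / (len(data) * sigma ** 4)


a, b = -2, 3
x = [1, 2, 2, 3, 10]           # hypothetical sample
y = [a * v + b for v in x]     # transformed sample
sign = math.copysign(1, a)     # |a| / a

checks = {
    "mean": math.isclose(statistics.mean(y), a * statistics.mean(x) + b),
    "median": statistics.median(y) == a * statistics.median(x) + b,
    "mode": statistics.mode(y) == a * statistics.mode(x) + b,
    "range": max(y) - min(y) == abs(a) * (max(x) - min(x)),
    "std": math.isclose(statistics.pstdev(y), abs(a) * statistics.pstdev(x)),
    "skewness": math.isclose(skewness(y), sign * skewness(x)),
    "kurtosis": math.isclose(kurtosis(y), kurtosis(x)),
}
print(checks)
```

Measures of location shift and scale with the data, measures of dispersion scale by \(|a|\) and ignore the shift \(b\), and measures of shape change at most in sign.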